One Statistician’s View of Big Data

Recently I've had several questions about using machine learning models with large data sets. Here is a talk on the subject that I gave at Yale's Big Data Symposium.

I believe that, with a few exceptions, less data is more. Once you get beyond some "large enough" number of samples, most models don't change very much, and the additional computational burden is likely to cause practical problems with model fitting.

Off the top of my head, the exceptions that I can think of are:

  • class imbalances
  • poor variability in measured predictors
  • exploring new "spaces" or customer segments

Big Data may be great as long as you are adding something of value (instead of more of what you already have). The last bullet above is a good example. I work a lot with computational chemistry, and we are constantly moving into new areas of "chemical space," making new compounds with qualities that had not been previously investigated. Models that ignore these new regions are not as good as ones that include them.

Also, new measurements or characteristics of your samples can make all the difference. Anthony Goldbloom of Kaggle has a great example from a competition for predicting the value of used cars:

The results included for instance that orange cars were generally more reliable - and that colour was a very significant predictor of the reliability of a used car.
"The intuition here is that if you are the first buyer of an orange car, orange is an unusual colour you're probably going to be someone who really cares about the car and so you looked after it better than somebody who bought a silver car," said Goldbloom.
"The data doesn't lie - the data unearthed that correlation. It was something that they had not taken into account before when purchasing vehicles."

My presentation has other examples of adding new information to increase the dimensionality of the data. The final quote sums it up:

The availability of Big Data should be a trigger to really re-evaluate what we are trying to solve and why this will help.

Recent Changes to caret

Here is a summary of some recent changes to caret.

Feature Updates:

  • train was updated to utilize recent changes in the gbm package that allow for boosting with three or more classes (via the multinomial distribution)

  • The Yeo-Johnson power transformation was added. This is very similar to the Box-Cox transformation, but it does not require the data to be greater than zero. (A short sketch of both of these updates follows this list.)
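
As a quick sketch of both updates (using iris only because it is a convenient three-class data set; the tuning details are left at their defaults):

    library(caret)

    ## boosting a three-class outcome now works via gbm's multinomial
    ## distribution
    set.seed(1)
    gbmFit <- train(Species ~ ., data = iris,
                    method = "gbm",
                    verbose = FALSE,
                    trControl = trainControl(method = "cv", number = 5))

    ## the Yeo-Johnson transformation is requested the same way as
    ## Box-Cox but tolerates zero or negative predictor values
    set.seed(1)
    ldaFit <- train(Species ~ ., data = iris,
                    method = "lda",
                    preProcess = "YeoJohnson",
                    trControl = trainControl(method = "cv", number = 5))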

New models referenced by train:

  • Maximum uncertainty linear discriminant analysis (Mlda) and factor-based linear discriminant analysis (RFlda) from the HiDimDA package were added.

  • The kknn.train model in the kknn package was added. This is basically a more intelligent K-nearest neighbors model that can use distance weighting, non-Euclidean distances (via the Minkowski distance), and a few other features.

  • The extraTrees function in the package of the same name was added. This generalizes the random forest model by adding randomness to the predictors and the split values that are evaluated at each split point. (A sketch of calling the new models through train follows this list.)
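
Calling the new models is the same as calling any other train method; a minimal sketch, assuming the method codes mirror the package names:

    library(caret)

    ## weighted K-nearest neighbors from the kknn package
    set.seed(2)
    kknnFit <- train(Species ~ ., data = iris,
                     method = "kknn",
                     trControl = trainControl(method = "cv", number = 5))

    ## extremely randomized trees from the extraTrees package
    set.seed(2)
    etFit <- train(Species ~ ., data = iris,
                   method = "extraTrees",
                   trControl = trainControl(method = "cv", number = 5))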

Numerous bugs were also fixed in the last few releases.

The new version is 5.16-04. Feel free to email me at mxkuhn@gmail.com if you have any feature requests or questions.

Projection Pursuit Classification Trees

I've been looking at this article for a new tree-based method. It uses other classification methods (e.g. LDA) to find a single variable to use in the split and builds a tree in that manner. The subtleties of the model are:

  • The model does not prune; it keeps splitting until the terminal nodes are pure
  • With more than two classes, it treats the data as a two-class system in some parts of the algorithm (but predictions are still based on the original classes); a rough sketch of the growing scheme follows this list
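
To make that concrete, here is a loose pseudocode sketch of the growing scheme as I read the paper; find_lda_split is a hypothetical stand-in for their projection pursuit step, not a function from their package:

    grow_node <- function(x, y) {
      ## no pruning: a node becomes terminal only once it is pure
      if (length(unique(y)) == 1)
        return(list(terminal = TRUE, class = y[1]))
      ## internally reduce to a two-class problem (e.g. one class versus
      ## the rest), then use a discriminant function to pick the single
      ## variable and cutoff for the split
      split <- find_lda_split(x, y)            # hypothetical helper
      left  <- x[, split$var] <= split$cutoff
      list(terminal = FALSE, split = split,
           left  = grow_node(x[left, , drop = FALSE],  y[left]),
           right = grow_node(x[!left, , drop = FALSE], y[!left]))
    }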

It is similar to oblique trees, which look for linear combinations of predictors to use in a split. The similarity between oblique trees and PPtree is the method of finding splits: in each case, a more parametric model can be used for this purpose. Some implementations of oblique trees use PLS, L2 regularization, or linear support vector machines to find the optimal combination. Here, the authors use basic discriminant functions, but with only a single predictor at a time. This connection wasn't mentioned in the paper, and no comparisons were made to these methods; the authors only compared to CART and random forests. That's disappointing because there are a lot of other tree-based models, and we have no idea how this model ranks among them (see Hand's "Classifier Technology and the Illusion of Progress").

My intuition tells me that the PPtree model is somewhat less likely to over-fit the data. While it lacks a pruning algorithm, the nature of the splitting method might make it more robust to small fluctuations in the data. One way to diagnose this would be more comprehensive cross-validation, along with assessing whether bagging helps the model (a rough sketch follows). The splitting approach should also reduce the potential problem of bias towards predictors that are more granular. One other consequence of the tree-growing phase is that it eliminates the standard method of generating class probabilities (since it splits until purity).
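
The bagging check could be as simple as a bootstrap loop with majority voting. In this sketch, fit_pptree and predict_pptree are hypothetical wrappers around the package's fitting and prediction functions, and train_data/test_data stand in for an actual data split:

    set.seed(3)
    B <- 50
    bagged <- lapply(seq_len(B), function(i) {
      idx <- sample(nrow(train_data), replace = TRUE)
      fit_pptree(train_data[idx, ])            # hypothetical wrapper
    })
    ## majority vote across the bagged trees
    votes <- sapply(bagged, predict_pptree, newdata = test_data)
    bagged_pred <- apply(votes, 1, function(z) names(which.max(table(z))))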

PPtrees might do a better job when there are a few linear predictors that drive classification accuracy. This could have been demonstrated with some sort of simulation, such as the one sketched below.
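
For example, caret's twoClassSim function can generate exactly that kind of structure; here I'm assuming a scenario with a few informative linear predictors buried in noise:

    library(caret)
    set.seed(4)
    sim_train <- twoClassSim(n = 500,  linearVars = 5, noiseVars = 20)
    sim_test  <- twoClassSim(n = 5000, linearVars = 5, noiseVars = 20)
    ## fit PPtree and a benchmark tree (e.g. CART) on sim_train, then
    ## compare their accuracy on sim_test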

A lot of tree methods have sub-algorithms for grouping categorical predictors. This model can only handle such data as a set of disconnected dummy variables (one way to build that encoding is sketched below). This isn't good or bad, since I have found a lot of variation in which type of encoding works best with different tree methods.
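
For reference, caret's dummyVars can build that encoding; the toy data frame here is made up:

    library(caret)
    df <- data.frame(color   = factor(c("red", "blue", "blue", "green")),
                     mileage = c(10, 20, 30, 40))
    dummies <- dummyVars(~ ., data = df)
    predict(dummies, newdata = df)   # a numeric matrix of dummy variables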

The bad news: the method is available in an R package, but there are big implementation issues (to me at least). The package strikes me as a tool for research only (as opposed to software that would enable PPtrees to be used in practice). For example:

  • It ignores basic R conventions (like returning factor data for predictions)
  • It also ignores object-oriented programming. For example, there is no predict method; that function is named PP.classify.
  • Speaking of PP.classify, you have to trick the code into giving predictions on unknown samples. That is a big red flag to me.
  • Little things are missing (e.g. no print or plot method). They could have used the partykit package to beautifully visualize the tree. (A sketch of the expected S3 interface follows this list.)
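
The expected interface is not much code. Here is a hypothetical sketch of the S3 methods a "pptree" class could supply (none of these exist in the package; the prediction logic is a placeholder):

    ## a predict method that returns a factor with the original levels
    predict.pptree <- function(object, newdata, ...) {
      raw <- rep(1, nrow(newdata))       # placeholder for real predictions
      factor(object$levels[raw], levels = object$levels)
    }

    ## a basic print method
    print.pptree <- function(x, ...) {
      cat("Projection pursuit classification tree with",
          length(x$levels), "classes\n")
      invisible(x)
    }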

I've ranted about these issues before, and the package violates most of my checklist. Maybe this is just part of someone's dissertation, and maybe the authors didn't know about that list. Even so, most of the items above could easily have been avoided.