We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss:
- down-sampling: sample the majority class so that its frequency is closer to that of the rarest class.
- up-sampling: resample the minority class to increase its frequency.
- hybrid approaches: some methodologies do a little of both and possibly impute synthetic data for the minority class. One such example is the SMOTE procedure. (A short code sketch of these options follows this list.)
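To make these ideas concrete, here is a minimal sketch using caret's downSample and upSample helper functions; the small made-up data set and the object names are my own, and SMOTE is only indicated in a comment since it lives in a separate package (DMwR):

library(caret)

## a made-up imbalanced data set: 900 samples of Class1 and 100 of Class2
set.seed(1)
imbal <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000),
                    Class = factor(rep(c("Class1", "Class2"), c(900, 100))))
table(imbal$Class)

## down- and up-sample to equal class frequencies
downTrain <- downSample(x = imbal[, c("x1", "x2")], y = imbal$Class)
upTrain   <- upSample(x = imbal[, c("x1", "x2")], y = imbal$Class)
table(downTrain$Class)
table(upTrain$Class)

## a hybrid/synthetic approach such as SMOTE:
## library(DMwR); smoteTrain <- SMOTE(Class ~ ., data = imbal)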
Here is an image from the book that shows the results of sampling a simulated data set:
The downside to down-sampling is that information in the majority class is thrown away, and this problem becomes more acute as the class imbalance becomes more severe.
Random forest models have the ability to use down-sampling without data loss. Recall that random forest is a tree ensemble method: a large number of bootstrap samples are taken from the training data and a separate unpruned tree is created for each one. The model has another feature that randomly samples a subset of predictors at each split to encourage diversity among the resulting trees. When predicting a new sample, every tree in the forest produces a prediction and these results are combined to generate a single prediction for that sample.
Random forests (and bagging) use bootstrap sampling. This means that if there are n training set instances, the resulting sample will select n samples with replacement. As a consequence, some training set samples will be selected more than once.
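A quick base R illustration of what a single bootstrap sample looks like (the tiny n is just for readability):

## draw a bootstrap sample of the row indices for a hypothetical n = 10
set.seed(3)
n <- 10
bootRows <- sample(1:n, size = n, replace = TRUE)
sort(bootRows)             # some rows appear more than once...
setdiff(1:n, bootRows)     # ...and some rows are left out entirely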
To incorporate down-sampling, the random forest can take a stratified random sample of size c*nmin for each tree, where c is the number of classes and nmin is the number of samples in the minority class. Since we usually grow a large number of trees (at least 1000) to create the random forest model, we get many looks at the data in the majority class. This can be very effective.
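In the randomForest package itself, this corresponds to the strata and sampsize arguments. Here is a minimal sketch with made-up data (the data set and object names are mine; the caret code later in this post passes these same arguments through train):

library(randomForest)

## made-up imbalanced data: 950 samples of Class1 and 50 of Class2
set.seed(4)
ex <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000),
                 Class = factor(rep(c("Class1", "Class2"), c(950, 50))))
exMin <- min(table(ex$Class))   # size of the minority class (50 here)

rfDown <- randomForest(Class ~ ., data = ex,
                       ntree = 500,
                       ## stratify each tree's bootstrap sample by class...
                       strata = ex$Class,
                       ## ...and draw the minority-class count from each class
                       sampsize = rep(exMin, 2))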
The R package for the book contains scripts to reproduce almost all of the analyses in the text. We mistakenly left out the code to down-sample random forests. I'll demonstrate it here with a simulated data set and then show the code for the caravan policy data used in the chapter.
Let's create simulated training and test sets using this method:
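The simulation code itself isn't reproduced above; a sketch using caret's twoClassSim function is shown below. The sample sizes and the intercept/linearVars settings are my assumptions (the test set output later in the post implies 5,000 test samples with roughly a 6:1 class imbalance), so your results may differ slightly:

library(caret)

## assumed settings: a more negative intercept skews the class frequencies and
## linearVars adds informative predictors
set.seed(1)
training <- twoClassSim(n = 10000, intercept = -20, linearVars = 20)
testing  <- twoClassSim(n = 5000,  intercept = -20, linearVars = 20)
table(training$Class)

## nmin: the number of samples in the minority class, used for sampsize below
nmin <- sum(training$Class == "Class2")
nmin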
Now we will train two random forest models: one using down-sampling and another with the standard sampling procedure. The area under the ROC curve will be used to quantify the effectiveness of each procedure for these data.
> ctrl <- trainControl(method = "cv",
+                      classProbs = TRUE,
+                      summaryFunction = twoClassSummary)
>
> set.seed(2)
> rfDownsampled <- train(Class ~ ., data = training,
+                        method = "rf",
+                        ntree = 1500,
+                        tuneLength = 5,
+                        metric = "ROC",
+                        trControl = ctrl,
+                        ## Tell randomForest to sample by strata. Here,
+                        ## that means within each class
+                        strata = training$Class,
+                        ## Now specify that the number of samples selected
+                        ## within each class should be the same
+                        sampsize = rep(nmin, 2))
>
> set.seed(2)
> rfUnbalanced <- train(Class ~ ., data = training,
+                       method = "rf",
+                       ntree = 1500,
+                       tuneLength = 5,
+                       metric = "ROC",
+                       trControl = ctrl)
Now we can compute the test set ROC curves for both procedures:
> downProbs <- predict(rfDownsampled, testing, type = "prob")[,1]
> downsampledROC <- roc(response = testing$Class,
+                       predictor = downProbs,
+                       levels = rev(levels(testing$Class)))
>
> unbalProbs <- predict(rfUnbalanced, testing, type = "prob")[,1]
> unbalROC <- roc(response = testing$Class,
+                 predictor = unbalProbs,
+                 levels = rev(levels(testing$Class)))
And finally, we can plot the curves and determine the area under each curve:
> plot(downsampledROC, col = rgb(1, 0, 0, .5), lwd = 2)

Call:
roc.default(response = testing$Class, predictor = downProbs, levels = rev(levels(testing$Class)))

Data: downProbs in 701 controls (testing$Class Class2) < 4299 cases (testing$Class Class1).
Area under the curve: 0.9503

> plot(unbalROC, col = rgb(0, 0, 1, .5), lwd = 2, add = TRUE)

Call:
roc.default(response = testing$Class, predictor = unbalProbs, levels = rev(levels(testing$Class)))

Data: unbalProbs in 701 controls (testing$Class Class2) < 4299 cases (testing$Class Class1).
Area under the curve: 0.9242

> legend(.4, .4,
+        c("Down-Sampled", "Normal"),
+        lwd = rep(2, 1),
+        col = c(rgb(1, 0, 0, .5), rgb(0, 0, 1, .5)))
This demonstrates an improvement using the alternative sampling procedure: the test set area under the ROC curve increases from 0.924 with the standard model to 0.950 with down-sampling.
One last note about this analysis: the cross-validation procedure used to tune the down-sampled random forest model is likely to give biased results. If a single down-sampled data set were fed to the cross-validation procedure, the resampled performance estimates would probably be optimistic (since the held-out data would no longer reflect the class imbalance). In the analysis shown here, the resampled area under the ROC curve was instead overly pessimistic:
> getTrainPerf(rfDownsampled)
   TrainROC TrainSens  TrainSpec method
1 0.8984348         1 0.07142857     rf

> auc(downsampledROC)
Area under the curve: 0.9503
For the caravan data in Chapter 16, this code can be used to fit the same model:
set.seed(1401)
rfDownInt <- train(CARAVAN ~ ., data = trainingInd,
                   method = "rf",
                   ntree = 1500,
                   tuneLength = 5,
                   strata = training$CARAVAN,
                   sampsize = rep(sum(training$CARAVAN == "insurance"), 2),
                   metric = "ROC",
                   trControl = ctrl)

evalResults$RFdownInt <- predict(rfDownInt, evaluationInd, type = "prob")[,1]
testResults$RFdownInt <- predict(rfDownInt, testingInd, type = "prob")[,1]
rfDownIntRoc <- roc(evalResults$CARAVAN,
                    evalResults$RFdownInt,
                    levels = rev(levels(training$CARAVAN)))
ASA Talk in New York This Thursday (11/14)
I'll be giving a talk on predictive modeling for the American Statistical Association next Thursday (the 14th):
Predictive Modeling: An Introduction and a Disquisition in Three Parts
The primary goal of predictive modeling (aka machine learning) (aka pattern recognition) is to produce the most accurate prediction for some quantity of interest. In this talk, I will give a brief introduction as well as a discussion of three topics: the friction between interpretability and accuracy, the role of Big Data, and current unmet needs.
The Basics of Encoding Categorical Data for Predictive Models
Thomas Yokota asked a very straight-forward question about encodings for categorical predictors: "Is it bad to feed it non-numerical data such as factors?" As usual, I will try to make my answer as complex as possible.
(I've heard the old wives' tale that Eskimos have 180 different words in their language for snow. I'm starting to think that statisticians have at least as many ways of saying "it depends".)
BTW, we cover this in Sections 3.6, 14.1 and 14.7 of the book.
My answer: it depends on the model. Some models can work with categorical predictors in their natural, non-numeric encodings. Trees, for example, can usually partition the predictor into distinct groups. For a predictor X with levels a through e, a split might look like
if X in {a, c, d} then class = 1
else class = 2
Rule-based models operate similarly.
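A quick sketch of this behavior using rpart (the data are made up so that levels a, c and d favor one class; the printed tree should show a split such as X = acd versus X = be):

library(rpart)

## made-up data: a five-level factor predictor and a two-class outcome
set.seed(10)
treeDat <- data.frame(X = factor(sample(letters[1:5], 500, replace = TRUE)))
treeDat$Class <- factor(ifelse(treeDat$X %in% c("a", "c", "d"),
                               sample(c("1", "2"), 500, replace = TRUE, prob = c(.9, .1)),
                               sample(c("1", "2"), 500, replace = TRUE, prob = c(.1, .9))))

## the factor is used directly; no dummy variables are required
rpart(Class ~ X, data = treeDat)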
Naive Bayes models are another method that does not need the data to be re-encoded. In the above example, the frequency distribution of the predictor is computed overall as well as within each of the classes (a good example of this is in Table 13.1 for those of you that are following along).
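For example, a small sketch with the naiveBayes function in the e1071 package (made-up data again):

library(e1071)

## made-up data: one factor predictor and a two-class outcome
set.seed(11)
nbDat <- data.frame(X = factor(sample(letters[1:5], 200, replace = TRUE)),
                    Class = factor(sample(c("Class1", "Class2"), 200, replace = TRUE)))

## the factor is used as-is: its frequencies are tabulated within each class
nbFit <- naiveBayes(Class ~ X, data = nbDat)
nbFit$tables$X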
However, these are the exceptions; most models require the predictors to be in some sort of numeric encoding. For example, linear regression requires numbers so that it can assign slopes to each of the predictors.
The most common encoding is to make simple dummy variables. If there are C distinct values of the predictor (or levels of the factor in R terminology), a set of C - 1 numeric predictors are created that identify which value each data point had.
These are called dummy variables. Why C - 1 and not C? First, if you know the values of the first C - 1 dummy variables, you know the last one too, so it is more economical to use C - 1. Second, if the model has slopes and intercepts (e.g. linear regression), the sum of all C dummy variables would exactly reproduce the intercept column (usually encoded as a column of ones), and that linear dependence is bad for the math involved.
In R, a simple demonstration for the example above is:
> pred1 <- factor(letters[1:5])
> pred1
[1] a b c d e
Levels: a b c d e
The R function model.matrix is a good way to show the encodings:
> model.matrix(~pred1)
  (Intercept) pred1b pred1c pred1d pred1e
1           1      0      0      0      0
2           1      1      0      0      0
3           1      0      1      0      0
4           1      0      0      1      0
5           1      0      0      0      1
attr(,"assign")
[1] 0 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$pred1
[1] "contr.treatment"
A column for the factor level a is not created; that level is identified by zeros in all of the other dummy variables (the first level serves as the baseline). This approach goes by the name of a "full-rank" encoding since the remaining dummy variables do not add up to the intercept column.
We discuss different encodings for predictors in a few places but fairly extensively in Section 12.1. In that example, we have a predictor that is a date. Do we encode it as the day of the year (1 to 365) and include it as a numeric predictor? We could also add in predictors for the day of the week, the month, the season and so on. There are a lot of options. This question of feature engineering is important: you want to find the encoding that captures the important patterns in the data. If there is a seasonal effect, the encoding should capture that information. Exploratory visualizations (perhaps with lattice or ggplot2) can go a long way towards figuring out good ways to represent these data.
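As a small illustration of some of those choices in base R (the dates here are arbitrary):

## a few hypothetical dates, encoded several different ways
dates <- as.Date(c("2013-01-15", "2013-06-01", "2013-11-14"))

data.frame(day_of_year = as.POSIXlt(dates)$yday + 1,   # yday is zero-based
           day_of_week = weekdays(dates),
           month       = months(dates),
           quarter     = quarters(dates))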
Some of these options result in ordered encodings, such as the day of the week. It is possible that the trends in the data are best exposed if the ordering is preserved. R does have a way for dealing with this:
> pred2 <- ordered(letters[1:5])
> pred2
[1] a b c d e
Levels: a < b < c < d < e
Simple enough, right? Maybe not. If we need a numeric encoding here, what do we do?
There are a few options. One simple way is to assign "scores" to each of the levels. We might assign a value of 1 to a and think that b should be twice that and c should be four times that and so on. It is arbitrary but there are whole branches of statistics dedicated to modeling data with (made up) scores. Trend tests are one example.
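Using the pred2 factor defined above, one completely made-up scoring might be:

## arbitrary scores: each level gets twice the value of the previous one
scores <- c(a = 1, b = 2, c = 4, d = 8, e = 16)
scores[as.character(pred2)]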
If the data are ordered, one technique is to create a set of new variables similar to dummy variables. However, their values are not 0/1 but are created to reflect the possible trends that can be estimated.
For example, if the predictor has two ordered levels, we can't fit anything more sophisticated than a straight line. However, if there are three ordered levels, we could fit a linear effect as well as a quadratic effect, and so on. There are some smart ways to do this (google "orthogonal polynomials" if you are bored).
For each ordered factor in a model, R will create a set of polynomial scores (we could use the fancy label of "a set of basis functions" here). For example:
> model.matrix(~pred2)
  (Intercept) pred2.L pred2.Q    pred2.C pred2^4
1           1 -0.6325  0.5345 -3.162e-01  0.1195
2           1 -0.3162 -0.2673  6.325e-01 -0.4781
3           1  0.0000 -0.5345 -4.096e-16  0.7171
4           1  0.3162 -0.2673 -6.325e-01 -0.4781
5           1  0.6325  0.5345  3.162e-01  0.1195
attr(,"assign")
[1] 0 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$pred2
[1] "contr.poly"
Here, "L" is for linear, "Q" is quadratic and "C" is cubic and so on. There are five levels of this factor and we can create four new encodings. Here is a plot of what those encodings look like:
The nice thing here is that, if the underlying relationship between the ordered factor and the outcome is cubic, we have a feature in the data that can capture that trend.
One other way of encoding ordered factors is to treat them as unordered. Again, depending on the model and the data, this could work just as well.
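In R, that simply means dropping the ordering before the model sees the data, for example:

## convert the ordered factor to a plain factor; model.matrix then falls back
## to the usual dummy variable (treatment contrast) encoding
pred2Plain <- factor(pred2, ordered = FALSE)
model.matrix(~pred2Plain)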