Upsampling the minority classes of an imbalanced dataset

I am using scikit-learn to classify my data; at the moment I am running a simple DecisionTreeClassifier. I have three classes with a severe imbalance problem. The classes are 0, 1 and 2, and the minority classes are 1 and 2. To give you an idea of the number of samples per class:

0 = 25,000 samples
1 = 15-20 samples, more or less
2 = 15-20 samples, more or less

So the minority classes are about 0.06% of the dataset. The approach I am following to solve the imbalance problem is upsampling the minority classes. Code:

    from sklearn.utils import resample

    data_upsampled = resample(data, replace=True, n_samples=len_major_class, random_state=1234)
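In full, the upsampling looks roughly like this (a minimal sketch; `df` and the `label` column are placeholder names, not from my actual code):

    import pandas as pd
    from sklearn.utils import resample

    # df is assumed to be a DataFrame with a 'label' column holding classes 0, 1 and 2
    majority = df[df['label'] == 0]
    len_major_class = len(majority)

    parts = [majority]
    for cls in (1, 2):
        minority = df[df['label'] == cls]
        # Sample with replacement until each minority class matches the majority size
        parts.append(resample(minority, replace=True,
                              n_samples=len_major_class, random_state=1234))

    df_upsampled = pd.concat(parts)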
  1. If I upsample the minority classes and then split my dataset into two groups, one for training and one for testing, the results are:
                 precision    recall  f1-score   support

              0       1.00      1.00      1.00     20570
              1       1.00      1.00      1.00     20533
              2       1.00      1.00      1.00     20439

    avg / total       1.00      1.00      1.00     61542

A very good result.

  2. If I upsample ONLY the training data and leave the original data for testing, the result is:
                 precision    recall  f1-score   support

              0       1.00      1.00      1.00     20570
              1       0.00      0.00      0.00        15
              2       0.00      0.00      0.00        16

    avg / total       1.00      1.00      1.00     20601

As you can see, the global accuracy is high, but the precision and recall for classes 1 and 2 are zero.

I am creating the classifier in this way:

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=20, max_features=0.4, random_state=1234, criterion='entropy')

I have also tried setting class_weight to 'balanced' (see the snippet below), but it makes no difference.
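For clarity, this is what I mean (a sketch with the same hyperparameters as above):

    from sklearn.tree import DecisionTreeClassifier

    # Same tree as before, with class weights inversely proportional to class frequencies
    clf = DecisionTreeClassifier(max_depth=20, max_features=0.4, random_state=1234,
                                 criterion='entropy', class_weight='balanced')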

Since I should only upsample the training data, why am I getting this strange result?

asked Nov 9, 2018 at 0:55

3 Answers

The behavior you observe is quite normal when you resample before splitting: you are introducing a bias into your data.

If you oversample the data and then split, the minority samples in the test set are no longer independent of the samples in the training set, because they were generated together; in your case, they are exact copies of samples in the training set. Your accuracy is 100% because the classifier is classifying samples it has already seen during training.
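A minimal sketch of the correct order, splitting first and then upsampling only the training portion (the `df`/`label` names follow the question's setup and are assumptions):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample

    # Split first, so the test set contains only original, unseen samples
    train, test = train_test_split(df, test_size=0.2,
                                   stratify=df['label'], random_state=1234)

    # Upsample the minority classes of the training set only
    majority = train[train['label'] == 0]
    parts = [majority]
    for cls in (1, 2):
        minority = train[train['label'] == cls]
        parts.append(resample(minority, replace=True,
                              n_samples=len(majority), random_state=1234))
    train_upsampled = pd.concat(parts)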

Since your problem is strongly imbalanced, I would suggest using an ensemble of classifiers to handle it (see the sketch after this list):

  1. Split your dataset into a training set and a test set. Given the size of the dataset, you can take 1-2 samples of each minority class for the test set and leave the others for training.
  2. From the training set, generate N datasets, each containing all the remaining minority-class samples and an under-sample of the majority class (I would say twice the number of minority-class samples).
  3. Train one model on each of the datasets obtained.
  4. Use the test set to obtain the predictions; the final prediction is the result of a majority vote over all the classifiers' predictions.
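A minimal sketch of this procedure (the `label` column and helper names are assumptions; classes 1 and 2 are treated together as the minority here):

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    def train_undersampled_ensemble(train, n_models=10, label='label'):
        # Step 2: build N datasets, each with all minority samples plus an
        # under-sample of the majority class (about twice the minority size)
        minority = train[train[label] != 0]
        majority = train[train[label] == 0]
        models = []
        for i in range(n_models):
            maj_sample = resample(majority, replace=False,
                                  n_samples=2 * len(minority), random_state=i)
            subset = pd.concat([minority, maj_sample])
            # Step 3: train one model per dataset
            clf = DecisionTreeClassifier(criterion='entropy', random_state=i)
            models.append(clf.fit(subset.drop(columns=label), subset[label]))
        return models

    def majority_vote(models, X):
        # Step 4: the final prediction is the majority vote of all classifiers
        # (assumes integer class labels, e.g. 0, 1, 2)
        preds = np.stack([m.predict(X) for m in models])
        return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(),
                                   axis=0, arr=preds)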

To obtain robust metrics, perform several iterations with different initial training/test splits.
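For example, reusing the helpers sketched above (`df` and `label` as before):

    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Repeat the whole procedure with different initial splits
    for seed in range(5):
        train, test = train_test_split(df, test_size=0.2,
                                       stratify=df['label'], random_state=seed)
        models = train_undersampled_ensemble(train)
        preds = majority_vote(models, test.drop(columns='label'))
        print(classification_report(test['label'], preds))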