Techniques for Handling Imbalanced Data
What is an imbalanced dataset?
We often need to build a binary classification model on a dataset in which 99% of the labels are negative and only 1% are positive, as in fraud detection or spam detection. The same problem appears in regression when we care about predictions in a sparsely populated part of the target distribution, such as the 5% quantile.
Choose evaluation metrics
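The first step in dealing with an imbalanced dataset is to choose the right evaluation metrics. Accuracy is misleading here: on a 99/1 split, a model that always predicts the majority class reaches 99% accuracy without finding a single positive example. A better practice is to use precision, recall, and F-measure, with the minority class defined as the positive label so that the metrics reflect performance on that class.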
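To make the point about metrics concrete, here is a minimal sketch in plain Python. The helper name precision_recall_f1 and the toy labels are assumptions made up for illustration; in practice a library such as scikit-learn provides equivalent metrics.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the label treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (made up): 10 positives out of 1000 examples.
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class.
always_negative = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_negative)

print(accuracy)               # high, although no positive is ever found
print(precision, recall, f1)  # all zero for the minority class
```

The accuracy looks excellent while recall on the minority class is zero, which is why the minority class is the one to treat as positive.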
Down-sampling
Assume we are working on a fraud detection task in which 99.9% of the data is non-fraud and only 0.1% is fraud. In the down-sampling strategy, we sample from the majority class until the two classes are balanced, for example a 50/50 split between non-fraud and fraud.
Note: Because down-sampling discards information from the majority class, it is usually combined with an ensemble technique. Each sub-model is trained on a different down-sampled subset of the original dataset.
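A minimal sketch of how the balanced subsets for such an ensemble could be drawn, using only the standard library. The function names downsample and balanced_subsets, the toy label list, and the choice of five sub-models are assumptions for illustration. Training the sub-models and averaging their predictions is left out.

```python
import random


def downsample(labels, rng, minority=1):
    """Return indices of all minority examples plus an equally sized
    random sample of majority examples (a 50/50 split)."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    sampled_majority = rng.sample(majority_idx, len(minority_idx))
    return minority_idx + sampled_majority


def balanced_subsets(labels, n_models, seed=0):
    """Draw one balanced subset per sub-model. Each draw sees the same
    minority examples but a different slice of the majority class."""
    rng = random.Random(seed)
    return [downsample(labels, rng) for _ in range(n_models)]


# Toy labels (made up): 0.1% fraud, 99.9% non-fraud.
labels = [1] * 10 + [0] * 9990

subsets = balanced_subsets(labels, n_models=5)
for subset in subsets:
    n_fraud = sum(labels[i] for i in subset)
    print(len(subset), n_fraud)  # subset size and number of fraud examples
```

Because every sub-model receives a different random sample of the majority class, the ensemble as a whole sees more of the majority data than any single down-sampled model would.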
Up-sampling
In up-sampling, we over-represent the minority class by replicating existing examples, generating additional synthetic examples, or both.
- SMOTE: An algorithm that constructs synthetic examples by analyzing the feature space. Each new example is generated by interpolating between a minority-class example and one of its nearest minority-class neighbors.
Data Leakage: Up-sampling should not be performed before the data is split for cross-validation. If it is, copies or near-copies of the same minority examples can land in both the training and validation folds, and this “data leakage” inflates the validation score. Up-sample only the training portion of each fold.
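The following is a simplified sketch of the interpolation idea behind SMOTE, together with the split-first ordering that avoids leakage. It is not the reference SMOTE implementation: the function smote_like, its parameters, and the toy points are assumptions for illustration, and a maintained library implementation is the better choice for real work.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points. Each one lies on the line segment
    between a random minority point and one of its k nearest minority
    neighbours. Needs at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class points (made up), two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
            (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]

# Split FIRST, then up-sample only the training part. The validation
# points never contribute to any synthetic example.
train_minority = minority[:4]
val_minority = minority[4:]

synthetic = smote_like(train_minority, n_new=8, k=2)
train_augmented = train_minority + synthetic

print(len(train_augmented), len(val_minority))
```

In a full cross-validation loop the same ordering applies inside every fold: split, up-sample the training fold, then evaluate on the untouched validation fold.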
Reframing and Cascade
Assume we are performing a regression task in which we also care about the results for the outliers. In the first step, we bucketize the output by examining its distribution. In the second step, we cascade regression models, training each one only on the data that falls into its corresponding bucket.
Why it works: A model is trained by minimizing the loss over the dataset it sees, so the majority of the data distribution dominates what the model learns. Bucketing the data and training a separate model on each bucket lets us optimize for each data group separately.
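A minimal sketch of the bucket-then-cascade idea on a one-dimensional toy problem. Everything here is an assumption for illustration: the data, the output threshold Y_THRESHOLD, the simple threshold router, and the per-bucket least-squares lines. A real cascade would use a trained classifier to pick the bucket.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy data (made up): 95 typical points and 5 rare, much larger targets.
xs = list(range(100))
ys = [2 * x if x < 95 else 10 * x - 500 for x in xs]

# Step 1: bucketize by the output. The threshold is a hypothetical value
# picked by looking at the target distribution.
Y_THRESHOLD = 300
typical = [(x, y) for x, y in zip(xs, ys) if y < Y_THRESHOLD]
tail = [(x, y) for x, y in zip(xs, ys) if y >= Y_THRESHOLD]

# Router: a stand-in for the bucket classifier. It splits on the feature
# midway between the two buckets.
split_x = (max(x for x, _ in typical) + min(x for x, _ in tail)) / 2

# Step 2: one regression model per bucket, trained only on that bucket.
models = {
    "typical": fit_line(*zip(*typical)),
    "tail": fit_line(*zip(*tail)),
}


def predict(x):
    """Route to a bucket, then apply that bucket's regression model."""
    slope, intercept = models["tail" if x > split_x else "typical"]
    return slope * x + intercept


# Baseline: a single line fitted on all the data at once.
g_slope, g_intercept = fit_line(xs, ys)

x_rare = 97
truth = 10 * x_rare - 500
print(abs(predict(x_rare) - truth))                  # cascade error
print(abs(g_slope * x_rare + g_intercept - truth))   # single-model error
```

The single model is pulled toward the 95 typical points and misses the rare region badly, while each cascaded model only has to fit its own bucket.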
Reference
[1] Machine Learning Design Patterns