In this article, we will state the appropriate criteria for applying k-fold cross-validation to an imbalanced class distribution problem, and demonstrate how to implement it in Python through a Jupyter notebook.
This article will cover the following sections:
A Quick Overview Of The K-Fold Cross Validation
Overview Of Data Source
Rules For Correctly Applying The K-Fold Cross Validation On An Imbalanced Class Distribution Model
How To Apply A K-Fold Cross Validation On An Imbalanced Classification Problem In Python
A Quick Overview Of The K-Fold Cross Validation
It is statistically unreliable to evaluate the performance of a model just once. It is better to repeat the performance evaluation multiple times to create a distribution of performance values, and then take summary statistics (say, the mean and standard deviation) of that distribution. This gives a representative estimate of the model's true performance, together with its variance. This concept is called repeated random sampling.
K-fold cross-validation (k-fold CV) uses this repeated random sampling technique to evaluate model performance by dividing the data into k equal folds (commonly 5 or 10) and evaluating the model on each fold. Specifically, it evaluates the performance of a model by carrying out the following ordered steps (a minimal code sketch follows the list):
- Shuffle the main data,
- Divide the main data into k groups without replacement,
- Take the first group as the test set and the remaining k-1 groups as the training set, then evaluate the model's performance,
- Repeat step 3 for every other group (second group, third group, … kth group), e.g. take the second group as the test set and the remaining k-1 groups as the training set, then evaluate the model's performance,
- Score the model's performance (on accuracy, roc_auc, etc.) for each cross-validation evaluation, and
- Take the mean and variance of the distribution of scores to estimate the overall performance of the model.
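To make these steps concrete, here is a minimal sketch using scikit-learn. The synthetic dataset and the plain logistic regression model are placeholders for illustration, not the survey data or pipeline used later in this article.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# placeholder data: 1,000 synthetic observations
X, y = make_classification(n_samples=1000, random_state=0)
# steps 1-2: shuffle the data and divide it into k groups without replacement
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # steps 3-4: one group is the test set, the remaining k-1 groups train the model
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # step 5: score the model on the held-out fold
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
# step 6: summarise the distribution of scores with its mean and standard deviation
print("mean =", np.mean(scores), "std =", np.std(scores))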
The figure below highlights the five cross-validation evaluations carried out on a 5-fold split to evaluate a model's performance:
In general, and as the figure above shows, each group in a k-group split serves as the test group once and as part of the training set k-1 times during cross-validation.
Common variations of cross-validation include the train/test split, leave-one-out (LOOCV), stratified k-fold, and repeated k-fold.
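For reference, these variations correspond roughly to the following scikit-learn utilities (a quick sketch; the stratified and repeated variants are the ones relevant to the rest of this article):

from sklearn.model_selection import train_test_split, LeaveOneOut, StratifiedKFold, RepeatedKFold

# train_test_split is a function that produces a single train/test split
loocv = LeaveOneOut()                                             # each observation is a test set once
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # folds preserve the class distribution
rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)     # the k-fold procedure repeated several times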
Overview Of Data Source
The data used in this article is freely sourced secondary data from the Demographic and Health Survey (DHS) for Nigeria, called NDHS 2018; it was conducted in 2018, hence the name. The DHS is a nationally representative survey conducted in developing countries. The subset used here consists of the rural respondents from the full dataset, and it emanated from the preprocessing carried out in my earlier article, Basic example of a machine learning prediction.
To request any country-level, freely sourced data from the DHS program, visit the DHS Program website, sign up, and make a request.
Rules For Correctly Applying The K-Fold Cross Validation On An Imbalanced Class Distribution Model
The rule of thumb when using k-fold cross-validation is to split the data directly into 10 or 5 folds or groups. In general, the k-fold cross-validation performance evaluation method relies on the assumption that each fold is a representative sample of the main data and reflects the class distribution of the target feature in the main data, an assumption that holds comfortably when the classes are roughly balanced (say 50:50).
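To see why this assumption matters, here is a small sketch on synthetic labels (not the survey data used later): with a plain, unstratified k-fold split, the minority-class share of each fold can drift away from its share in the full data, especially when the minority class is small.

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
y = np.array([1] * 20 + [0] * 180)   # synthetic labels with a 10% minority class
rng.shuffle(y)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for i, (_, test_idx) in enumerate(kf.split(y), start=1):
    # the minority share of each fold typically wobbles around 10% rather than matching it exactly
    print("fold", i, "minority share:", round(y[test_idx].mean() * 100, 1), "%")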
However, applying this rule to an imbalanced classification problem poses a distribution problem that can result in a biased estimate or overfitting in favor of the majority class. The correct use of k-fold cross-validation on an imbalanced class distribution problem requires:
- That each fold is stratified to capture the imbalanced class distribution of the target feature in the main data. This can be achieved using stratified k-fold cross-validation;
- That, at each cross-validation evaluation, only the training set is oversampled (using the synthetic minority oversampling technique, SMOTE, or another class-balancing technique). This can be achieved using a machine learning pipeline; setting up a pipeline helps prevent data leakage;
- That, at each cross-validation evaluation, the test data is not oversampled, i.e., it is unaffected by the oversampling, though it maintains the imbalanced class distribution of the target feature as in the main data;
- That the oversampling is never done on the main data but on the training data set, during each k-fold cross-validation evaluation.
As an illustration, let us assume we are dealing with a contraceptive-use binary classification problem with a sample size of 5,000 observations. If the percentage of those using contraceptives is 10% (500 observations) and those not using is 90% (4,500 observations), then we have an imbalanced classification problem with a 1:9 ratio.
We would apply stratified k-fold cross-validation in this instance to split the 5,000 observations into 10 folds, each with a sample size of 500.
Stratified k-fold cross-validation ensures each fold's sample is randomly selected without replacement and reflects the 1:9 imbalanced distribution of the target feature in the main data.
At each cross-validation evaluation (where one group is used as the test data and the remaining 9 groups as the training data), only the training data set is oversampled. The test data is not oversampled, but it retains the 1:9 imbalanced distribution of the target feature inherited from the main data.
The oversampling is never done on the main data, but on the training data set at each cross-validation evaluation. A pipeline is deployed to ensure that the oversampling occurs only on the training data during each evaluation; using a pipeline helps prevent data leakage.
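The sketch below, built on synthetic labels rather than the NDHS data, shows how a stratified 10-fold split keeps the 1:9 ratio in every fold; only the training portion of each split would ever be passed to an oversampler such as SMOTE.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((5000, 1))                # dummy predictors, for illustration only
y = np.array([1] * 500 + [0] * 4500)   # 10% using, 90% not using
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # each test fold holds 500 observations, of which 50 are positives (the 1:9 ratio);
    # in the full workflow, only X[train_idx], y[train_idx] would be resampled
    print("fold", i, "test size:", len(test_idx), "positives:", int(y[test_idx].sum()))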
How To Apply A K-Fold Cross Validation On An Imbalanced Classification Problem In Python
In this section, we will carry out a k-fold cross-validation evaluation on an imbalanced binary classification dataset.
First, we import the required library and load the data into Python through a Jupyter notebook:
import pandas as pd

path = r'B:Rural_Nig_data_only.csv'
Rural_data_only = pd.read_csv(path)
The sample size of the data is 24,837, and the total number of columns is 35.
Rural_data_only.shape  # returns (24837, 35)
‘v313_Using modern method’ is the target feature for this dataset. It indicates the modern contraceptive use status of respondents.
Rural_data_only.info()
‘v313_Using modern method’ has two response categories: ‘currently using a modern contraceptive’, denoted as 1, and ‘currently not using a modern contraceptive’, denoted as 0. The feature has an imbalanced class distribution, as can be seen below:
# absolute counts of each response category
Rural_data_only['v313_Using modern method'].value_counts()

# percentage share of each response category
Rural_data_only['v313_Using modern method'].value_counts(normalize=True) * 100
Let us define the dataset and select X (the predictor features) and y (the target feature):
# define the dataset as a NumPy array
array = Rural_data_only.values

# select X (the first 34 columns, the predictors) and y (the last column, the target)
X = array[:, 0:34]
y = array[:, 34]
Next, we define a pipeline so that the oversampling (using SMOTE) is applied to the training data at each cross-validation evaluation:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

steps = [('over', SMOTE()), ('model', LogisticRegression())]
pipeline = Pipeline(steps=steps)
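Note that the Pipeline imported above comes from the imbalanced-learn library (imblearn.pipeline), not from scikit-learn. SMOTE is a resampler rather than a transformer, so scikit-learn's own Pipeline cannot hold it; the imbalanced-learn pipeline applies the resampling step only when the pipeline is being fitted, which means the test fold it is scored on is left untouched, exactly as the rules above require.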
Then we use stratified k-fold cross-validation to divide our main data into 10 representative folds. Stratified k-fold cross-validation ensures that the training and test data in each fold reflect the imbalanced distribution of the target feature in the main data. Recall that oversampling is carried out only on the training data, and not on the test data, at each cross-validation evaluation.
We evaluate the logistic regression model's performance on our imbalanced dataset with the pipeline and repeated stratified k-fold cross-validation below:
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# evaluate the pipeline on two metrics, repeating a stratified 10-fold split three times
for scoring in ["accuracy", "roc_auc"]:
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
    scores = cross_val_score(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1)
    print("Model", scoring, "mean=", scores.mean(), "stddev=", scores.std())
From the above, the estimated mean accuracy of the logistic regression model is 84% (±0.7), and the estimated mean area under the receiver operating characteristic curve (ROC AUC) is 77% (±1.1), while controlling for the imbalanced character of the classification.
The Jupyter notebook used for this analysis is available here
That's it. I hope you found this useful. Please drop your comments and follow me on LinkedIn at Ayobami Akiode LinkedIn.
References
SMOTE for Imbalanced Classification with Python – Machine Learning Mastery
https://scikit-learn.org/0.15/modules/cross_validation.html#stratified-k-fold
https://www.bmc.com/blogs/create-machine-learning-pipeline/
https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/