
A basic example of a machine learning prediction: Predicting contraceptive use in rural Nigeria


In this article, we use real-world data from the Demographic and Health Survey for Nigeria to develop a machine learning model that predicts modern contraceptive use among women in rural Nigeria, using a logistic regression classifier.

This article will have the following sections:

Overview of the analysis and data source

Loading data into python

Data and features description

Selecting the rural respondents' sample from the whole dataframe

Univariate (single feature) exploratory data analysis

Feature Engineering

Bivariate (two feature) exploratory data analysis and test of association

Checking features for missing values

Dummy coding the features

Class imbalance correction and train/test data split

Feature selection for modelling

Data Modelling

Model Performance evaluation

Model features importance

k-fold Cross Validation

Saving model for future use

Overview of the analysis and data source

This article uses the latest freely available secondary data from the Demographic and Health Survey (DHS) for Nigeria, called NDHS 2018, so named because it was conducted in 2018. The DHS is a nationally representative survey conducted in developing countries. To request free country-level data from the DHS Program, visit the DHS Program website, sign up, and submit a request.

In this analysis, we will use Python libraries to implement the different sections of the study. Specifically, we will use:

a. Pandas — to read our dataset into Python and to manipulate and analyse it;

b. Matplotlib and Seaborn — to carry out all visualizations;

c. Numpy and Scipy — to carry out all scientific, mathematical, and statistical computing;

d. Sklearn — to provide the required algorithms for model building;

e. Eli5 — to identify the order of importance of features in the model; and

f. Joblib — to save the model for future use and deployment.

The overarching aim of this analysis is to develop a binary logistic regression model that predicts whether a woman of reproductive age (15–49 years) living in rural Nigeria is currently using a modern contraceptive method, based on her socio-demographic and proximate characteristics, namely age, working status, marital status, religion, region of residence, parity, education, wealth index, and sexual activeness.

A binary logistic regression model, as the name suggests, is a machine learning classification model suited to predicting a categorical target feature with only two response categories, one coded as 1 and the other as 0. Whether the model predicts 1 or 0 is determined by a set of identified, selected, and eligible predictor features. In selecting predictor features, it is imperative to be guided by domain knowledge.
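
For intuition, here is a minimal sketch (not part of the original analysis) of how a logistic model turns a weighted combination of predictor features into a probability between 0 and 1:

import numpy as np
def predict_probability(x, weights, intercept):
    """Logistic (sigmoid) transform of a linear combination of features.
    x and weights are 1-D arrays of the same length; intercept is a scalar.
    Returns P(y = 1); predicting 1 when this exceeds 0.5 is the usual rule."""
    z = intercept + np.dot(weights, x)
    return 1.0 / (1.0 + np.exp(-z))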

Loading data into python

As provided by the DHS Program, the NDHS 2018 dataset comes in several formats (.csv, .sav, .dta, .xls, etc.). In this analysis, we will load the NDHS 2018 data into pandas using the .dta (Stata) format.

To start, we first import all the Python libraries required for this analysis. Note: we are using a Jupyter notebook, from Anaconda, as our integrated development environment (IDE).
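
The import cell appears only as a figure in the original notebook; a minimal sketch covering the libraries listed above (plus imbalanced-learn for SMOTE, which is used later) might look like this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
import eli5
from eli5.sklearn import PermutationImportance
import joblib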

Next, we will import the NDHS dataset into pandas.

path=r"B:NGIR7ADTNGIR7AFL.dta"
DHS=pd.read_stata(path)
Variables_needed=['v024', 'v130', 'v502', 'v025', 'v106', 'v190', 'v201', 'v313', 'v536', 'v013', 'v714']
Total_Sample=DHS[Variables_needed]
Figure 1

The total sample is 41,821 respondents, and 11 features were imported from the NDHS dataset into pandas for our analysis. (Note: the NDHS 2018 dataset has 5,394 features/variables in total.)

The data structure of the NDHS 2018 dataset in the pandas dataframe can be seen below:

Figure 2

Data and features description

The table below shows the features/variables that will be used in this analysis and their characteristics:

List of features to be used in this analysis

Dependent or Target feature/variable

In this analysis, the feature v313 is the dependent or target feature. It has four response categories: no method, modern method, traditional method, and folkloric method. Because we will be using a logistic regression model, it will be recoded, through feature engineering, from a four-category feature to a two-category one.

Figure 3

We will group the four response categories into two: "Using Modern Contraceptive", coded as 1, and "Not Using Modern Contraceptive", coded as 0.

The predictor features

In this analysis, the predictor or independent features are v024, v106, v130, v190, v013, v201, v714, v502, and v536. The feature v025 will be used to select the rural sample and drop the urban sample from the dataframe. After using v025 for this purpose, we will delete it from the dataframe.

Selecting the rural respondents' sample from the whole dataframe

To derive the needed sample of rural respondents from the whole dataset (41,821 respondents), we will use the v025 feature. Below are the response categories and frequency distribution of v025:

Figure 4

We drop the urban respondents from the dataset so that only rural respondents' data remain in the dataframe.

Rural_data_only=Total_Sample.drop(Total_Sample[Total_Sample['v025']=='urban'].index)
Figure 5

We can also check the Rural_data_only feature information to confirm whether the v025 feature is still among the features.
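
That check, and the removal of v025 described earlier, appear only as figures here; a minimal sketch is:

Rural_data_only.info()
# v025 has served its purpose (selecting rural respondents), so drop it
Rural_data_only = Rural_data_only.drop('v025', axis=1)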

Figure 6

Univariate (single feature) exploratory data analysis (EDA)

In this section, we attempt to understand the distribution of each feature at the univariate (single-feature) level using frequency distributions, percentage distributions, and visualization.

We start with our target feature, v313. Assessing its frequency and percentage distribution can guide and suggest any required feature engineering.
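
The frequency tables and charts below are reproduced from the notebook as figures; a minimal sketch of how such a univariate summary could be produced for v313 is:

# Frequency and percentage distribution of the target feature
print(Rural_data_only['v313'].value_counts())
print(Rural_data_only['v313'].value_counts(normalize=True) * 100)
# Bar chart of the distribution
sns.countplot(x='v313', data=Rural_data_only)
plt.show()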

Figure 7
Figure 8

As shown above and as discussed earlier, our target feature has four response categories; we need to group them into just two so that the conditions for using a logistic regression model are met. We will address this in the feature engineering section.

We also observe that only about 8% of rural women use a modern method; that is, the ratio of modern contraceptive users to non-users is roughly 1:9. This indicates a class imbalance problem, which we will resolve in the class imbalance correction section.

Below are the univariate exploratory analyses of the remaining features:

Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17

Based on the univariate exploratory analysis above and on domain knowledge, we need to recode the response categories of some features. We will do this in the feature engineering section.

Feature Engineering

In this section, we recode and group the response categories of some features, as informed by the univariate exploratory analysis. This is essential before conducting the bivariate exploratory analysis.

We need to group the response categories of the following features: contraceptive use (v313), religion (v130), education (v106), wealth index (v190), parity (v201), and sexual activeness (v536). That is, for each of these features, we collapse some of their response categories.

As an illustration, for the religion feature (v130), we collapse "Other Christian" and "Catholic" into a single "Christian" category, and "Traditionalist" and "Other" into a single "Others" category. With this, the religion feature (v130) will have three response categories: Muslim, Christian, and Others.

Below is the engineering of each feature as required:

# Collapse the v130 (religion) categories into christians, muslim, and others
Independent_recode_v130 = {"v130":
    {"other christian": 'christians',
     "catholic": 'christians',
     "other": 'others',
     "traditionalist": 'others',
     "islam": 'muslim'}
}
Rural_data_only.replace(Independent_recode_v130, inplace=True)
Rural_data_only['v130'].value_counts(normalize=True) * 100
Figure 18
Figure 19
Figure 20
Figure 21
Figure 22
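
The remaining recodes appear in the figures above. As one illustration, a minimal sketch of grouping the target feature v313 into the two categories described earlier (the lowercase source labels are assumed from the survey coding) is:

# Collapse the four v313 categories into a binary target
target_recode_v313 = {"v313":
    {"no method": 'Not using modern method',
     "traditional method": 'Not using modern method',
     "folkloric method": 'Not using modern method',
     "modern method": 'Using modern method'}
}
Rural_data_only.replace(target_recode_v313, inplace=True)
Rural_data_only['v313'].value_counts(normalize=True) * 100

The parity feature (v201) is grouped in a similar way:
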
Independent_v201 = {
"v201": {0: '0', 1: '1-5', 2: '1-5', 3:'1-5',
4: '1-5', 5: '1-5', 6: '6', 7: '7+',
8: '7+', 9: '7+', 10: '7+', 11: '7+',
12: '7+', 13: '7+', 14: '7+', 15: '7+',
16: '7+', 17: '7+'}
}
Rural_data_only.replace(Independent_v201, inplace=True)
Rural_data_only['v201'].value_counts(normalize=True) * 100
Figure 23

Bivariate (two features) exploratory data analysis (EDA) and test of association

Now we conduct a bivariate exploratory analysis between each predictor feature and the target feature, using data visualization to understand how the target feature is distributed across the response categories of each independent feature.
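
The charts that follow are reproduced as figures; a minimal sketch of how such a comparison could be plotted, using marital status (v502) as an example, is:

# Percentage using/not using a modern method within each marital-status group
ct = pd.crosstab(Rural_data_only['v502'], Rural_data_only['v313'], normalize='index') * 100
ct.plot(kind='bar', stacked=True)
plt.ylabel('Percentage of respondents')
plt.show()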

Marital Status and Modern Contraceptive Use

Figure 24

From the analysis above, it appears that respondents who were married (formerly or currently) use modern contraceptive methods more than their unmarried counterparts. This suggests that marital status could be a good predictor of modern contraceptive use. The chi-square test of association between the two features indicated the same (χ2 = 33.657, p-value = 0.000).

tab1 = pd.crosstab(index=Rural_data_only.v502, columns=Rural_data_only.v313)
print('The two-way table between marital status and modern contraceptive use status is:\n', tab1)
# Chi-square test of association between marital status and contraceptive use
chi2, p, dof, expected = chi2_contingency(tab1.values)
print('Chi-square statistic is %0.3f, p_value is %0.3f' % (chi2, p))
Figure 25

Religion and Modern Contraceptive Use

Figure 26

From the analysis above, it appears that Christian respondents use modern contraceptives about three times more than Muslim respondents, which suggests that religion could be a strong predictor of modern contraceptive use. The statistical test also indicated the same (χ2 = 795.017, p-value = 0.000).

Wealth Index and Modern Contraceptive Use

Figure 27

From the analysis above, comparing rich respondents (rich and average respondents in the dataset) with poor respondents shows greater use of modern contraceptive methods among the former than the latter. The chi-square test also suggested the same (χ2 = 374.002, p-value = 0.000).

Figure 28

Working status and Modern Contraceptive Use

Figure 29

From the analysis above, it appears that respondents who are currently working use modern contraceptives significantly more than those who are not. This could mean that working status is a significant predictor of modern contraceptive use. The chi-square test of association gives χ2 = 186.850, p-value = 0.000.

Figure 30

Age and Modern Contraceptive Use

Figure 31

From the analysis above, modern contraceptive use appears to increase with age, peaking for respondents aged 25–39 years and then declining. This suggests that age may be a significant predictor of modern contraceptive use. The chi-square test of association also suggested the same (χ2 = 365.577, p-value = 0.000), as seen below.

Figure 32

Parity (Number of children ever born) and Modern Contraceptive Use

Figure 33

From the analysis above, respondents who have never given birth hardly use modern contraceptives, those who have given birth one to five times use them the most, and those who have given birth six or more times use them significantly less than the one-to-five group. This suggests that parity may be a significant predictor of modern contraceptive use, as the chi-square test of association also indicates (χ2 = 177.454, p-value = 0.000), as seen below.

Figure 34

Sexual activeness and Modern Contraceptive Use

Figure 35

From the analysis above, only respondents who have ever had sex are notable users of modern contraceptives, and those who were recently sexually active used them more than those who were not recently active. This suggests that sexual activeness is highly associated with modern contraceptive use, and the chi-square statistic confirms the association.

Figure 36

Region of residence and Modern Contraceptive Use

Figure 37
Figure 38

Checking features for missing values

Figure 39

None of the features in the Rural_data_only dataframe has missing values.
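
The check itself appears only as a figure; a minimal sketch is:

# Count missing values per feature; every count should be zero here
print(Rural_data_only.isnull().sum())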

Dummy coding the features

Supervised machine learning models require numerical input; otherwise they will not run properly. By implication, our model may not run if we feed it our current categorical features. To ensure that the model runs, we need to turn the categorical features into dummy features, i.e. turn each response category of each feature into a new column (feature) with the values 1 or 0. We now create dummy features from each of the 10 features in our dataframe.

Create_dummies = pd.get_dummies(Rural_data_only)
Create_dummies.columns.values
# Join the dummy columns onto the original dataframe column-wise
Concat_Rural_data_only = pd.concat([Rural_data_only, Create_dummies], axis=1, sort=False)
Concat_Rural_data_only.columns.values
Figure 40

Observe that the original categorical features are still present in the Concat_Rural_data_only dataframe, along with the newly created dummy features. We need to remove the original features (since they are categorical and non-numerical) so that only the numerical dummy features remain.

# Drop the original categorical columns and the redundant 'not using' dummy
Concat_Rural_data_only = Concat_Rural_data_only.drop(['v130', 'v502', 'v024', 'v106', 'v190', 'v201', 'v536', 'v013', 'v714', 'v313', 'v313_Not using modern method'], axis=1)
Concat_Rural_data_only = Concat_Rural_data_only.dropna()
Concat_Rural_data_only.dtypes
# Cast the dummy columns to integer type
Concat_Rural_data_only = Concat_Rural_data_only.astype(int)
Concat_Rural_data_only.dtypes
Concat_Rural_data_only.columns.values
Figure 41

After dummy coding, we denote our independent or predictor features as X and our target or dependent feature as y:

X=Concat_Rural_data_only.drop('v313_Using modern method',axis=1)
y=Concat_Rural_data_only['v313_Using modern method']
y=y.to_frame()
Figure 42

Class imbalance correction and train/test data split

Recall that our dependent feature has a class imbalance problem: the proportion of respondents using modern contraceptives is far smaller than the proportion not using them. We will resolve this imbalance before modelling, since classification models generally perform best when the response categories of the dependent feature are close to balanced (ideally around 50:50). We will do this with the SMOTE (Synthetic Minority Oversampling Technique) algorithm, which helps balance the two classes.

We also split our data into training and test datasets, with the training set taking 70% of the data and the test set the remaining 30%. We will train the model on the training data and evaluate the trained model on the test data for accuracy and precision.

# Oversample the minority class with SMOTE (older imbalanced-learn versions used
# ratio='auto' and fit_sample; newer versions use sampling_strategy and fit_resample)
os = SMOTE(random_state=0, sampling_strategy='auto')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
X_train.columns.values
y_train.columns.values
os_data_X, os_data_y = os.fit_resample(X_train, y_train.values.ravel())
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['y'])
print("length of oversampled data is ",len(os_data_X))
print("Number of non modern contraceptive users in oversampled data",len(os_data_y[os_data_y['y']==0]))
print("Number of modern contraceptive users in the oversampled data",len(os_data_y[os_data_y['y']==1]))
print("Proportion of non modern contraceptive users data in oversampled data is ",len(os_data_y[os_data_y['y']==0])/len(os_data_X))
print("Proportion of modern contraceptive users data in oversampled data is ",len(os_data_y[os_data_y['y']==1])/len(os_data_X))
Figure 43

From the output above, the response categories of the dependent feature are now balanced. Note also that SMOTE was applied to the training data only; this ensures that no observation in the test data is used to create synthetic observations, which could negatively affect the model's prediction accuracy and precision.

Feature selection for modelling

In this section, we check for and select the features best suited to feed our model for prediction, using the recursive feature elimination (RFE) technique. RFE repeatedly fits the model, eliminating the weakest features at each step; it flags features to keep with True (ranking 1) and features to drop with False (ranking greater than 1).

Note: the order of the features in the array of feature names matches the order of their RFE scores in the boolean and ranking lists. For example, in the code output below, the feature 'v024_north central' is first (from the top) in the list of features and takes the values 'False' and '6' in the boolean and ranking lists. Likewise, the feature 'v714_no' is second to last in the list of features, occupies the same position in the boolean and ranking lists, and takes the values 'True' and '1'.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')
# Keep the 20 highest-ranked features (n_features_to_select is keyword-only in newer sklearn)
rfe = RFE(logreg, n_features_to_select=20)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
X.columns.values
Figure 44

Based on the RFE evaluation, we exclude from the modelling any feature with a 'False' value in the boolean list and a ranking greater than one (>1) in the number list. Only features with the boolean value 'True' and the ranking '1' will be included in the model prediction.

A summary of the features with their boolean and ranking values is shown in the figure below:

Figure 45

Based on these criteria, the independent features excluded from the model prediction are dropped as follows:

X = X.drop(['v024_north central', 'v024_north east', 'v024_north west', 'v024_south east', 'v024_south south', 'v024_south west', 'v130_christians', 'v130_muslim', 'v130_others', 'v502_never in union', 'v502_currently in union/living with a man', 'v502_formerly in union/living with a man', 'v536_active in last 4 weeks', 'v536_not active in last 4 weeks'], axis=1)
X.columns.values
# Keep only the names of the retained features; X is now an index of column names
X = X.columns
Figure 46

Data Modelling

Now that we have our values of X and y, we feed the training data to our logistic regression model:

# Select the retained feature columns from the oversampled training data
X = os_data_X[X]
y = os_data_y['y']
import statsmodels.api as sm
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())

The result of the modelling is shown below:

Figure 47

From the table above, the features v106_primary, v201_0, and v201_6 have p-values ('P>|z|') greater than 0.05 and are therefore not significant predictors of modern contraceptive use. As a consequence, they will be removed from the model. At this point, the prediction accuracy of our model on the test data is 79%, as seen below:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Figure 48

Next, we remove these 3 features from X, run the model again, and then check accuracy and precision.

# Reduced list of predictor features after dropping the non-significant ones
X=['v106_no education', 'v106_secondary/higher',
'v190_Average', 'v190_poor', 'v190_rich', 'v201_1-5',
'v201_7+', 'v536_never had sex', 'v013_15-19',
'v013_20-24', 'v013_25-29', 'v013_30-34', 'v013_35-39',
'v013_40-44', 'v013_45-49', 'v714_no', 'v714_yes']
Figure 49

All the included features are significant (p-values < 0.05) in the new model. The prediction accuracy of this model is now 77%.
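
The refit itself appears only as figures here; a minimal sketch, mirroring the earlier code but with the reduced feature list, might look like this:

# Re-select the retained columns from the oversampled data and refit the classifier
X = os_data_X[X]
y = os_data_y['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))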

Figure 50

Model performance evaluation

Confusion Matrix

We can inspect the confusion matrix to see the distribution of correct and incorrect predictions.

from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
Figure 51

Elements on the main diagonal (3425, 3948) of the confusion matrix are those the model predicted correctly:

The element in the first row and first column (3425, the number of respondents predicted by the model as using a modern contraceptive method who are using one) and the element in the second row and second column (3948, the number predicted as not using a modern contraceptive method who are not using one) are predicted correctly by the model. Together they constitute the 77% of predictions the model got right.

The off-diagonal elements (1330, 908) of the confusion matrix are those the model predicted wrongly:

The element in the first row and second column (1330, the number predicted by the model as using a modern contraceptive who are not using one) and the element in the second row and first column (908, the number predicted as not using a modern contraceptive who are using one) are predicted incorrectly by the model. Together they constitute the 23% of predictions the model got wrong.
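
As a quick check, the 77% accuracy can be recovered directly from these four numbers:

# Accuracy = correct predictions / all predictions
correct = 3425 + 3948
total = 3425 + 3948 + 1330 + 908
print(correct / total)   # about 0.767, i.e. roughly 77%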

Receiver Operating Characteristics Curve

We also use the receiver operating characteristic (ROC) curve to evaluate the reliability and strength of the model's predictions. A good classifier has a ROC curve (the blue line) that hugs the top-left corner of the plot, staying well away from the red dashed diagonal.

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
Figure 52

The result of the ROC is shown below:

Figure 53

As can be seen above, the logistic regression curve (blue) sits towards the top left of the ROC plot. The precision and f1-scores also highlight the performance and strength of our model: the f1-score is 78% for the "currently using modern contraceptive" class and 75% for the "not currently using" class, with an overall average of 77%. The model's precision is 75% for predicting current modern contraceptive use and 79% for predicting non-use, with an overall average of 77%.

Model Features importance

Now that we have a list of features that are significant predictors of modern contraceptive use, it is important to evaluate each feature's level of importance, relative to the others, in predicting modern contraceptive use.

We would use the eli5 library to assess feature importance, using its permutation importance technique.

perm = PermutationImportance(logreg, random_state=0).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
Figure 54

From the table, the order of importance is from top to bottom.

Note that the first number in each row indicates how much model performance drops when that feature's values are shuffled; the number after it measures how much performance varied across reshufflings. As can be seen, shuffling being poor (v190_poor) produces both the largest drop in model performance and the largest variability.

It can be said that 'v190_poor' is the most important feature in predicting modern contraceptive use; that is, whether a respondent is classified as poor or not is the strongest determinant of her modern contraceptive use in this model.

k-fold Cross-Validation

It is not good practice to rely on a single accuracy value for a model. It is better to repeat the train-and-evaluate process several times and then take the mean and standard deviation of the repeated measurements as the model's final accuracy. This can be achieved with k-fold cross-validation.

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
for scoring in ["accuracy", "roc_auc"]:
    # 10-fold cross-validation; shuffle is needed for random_state to take effect
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
    logreg = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')
    results = model_selection.cross_val_score(logreg, X_test, y_test, cv=kfold, scoring=scoring)
    print("Model", scoring, " mean=", results.mean(), "stddev=", results.std())
Figure 55

From the k-fold cross-validation, the mean accuracy of the model is 77%, with a standard deviation of 0.9.

Saving the model for future use

We will now save the model for future use and deployment.

# In recent scikit-learn versions joblib is a standalone package
import joblib

# Save the model as a pickle file
joblib.dump(logreg, 'ModernContraceptiveUsePredictorModel.pkl')
Figure 56

The Jupyter notebook used for this analysis is available here

And that's it. I hope you found this useful. Please drop your comments and follow me on LinkedIn: Ayobami Akiode LinkedIn


