Test each predictor feature statistically at the bivariate level before including it in the model
Did you know that using boxplots, scatterplots, and other visualization tools is great but never enough to confirm that a relationship exists between two variables (the target feature (Y) and each of the predictor features (X))?
I know you may ask, why?
Well, because a statistician may argue that the visual pattern you are seeing in your sample is not what truly holds in the population your sample represents. Specifically, he or she may say — “What you are seeing actually happened by chance, so it is highly probable that when samples of similar characteristics are drawn repeatedly from the same population, you will not see the relationship indicated by this visualization again.”
What then is the best practice to overcome the hurdle created by statistics, or statisticians, in this instance?
The gold standard — “Use data visualization tools (charts, two-way tables) to visualize the possible existence of a relationship or association between the two features, then conduct a statistical test to confirm whether what the visual pattern suggests is true.”
In this article, we will state the appropriate criteria for using the different bivariate statistical tests to confirm the existence of a relationship or association between the target feature (Y) and each of the predictor features (X), and demonstrate how to implement them in Python.
This article will cover the following sections:
1. A Quick Overview of Inferential Bivariate Analysis Concepts and Framework
2. Overview of Data Source
3. Rules for Applying the Different Inferential Bivariate Statistical Tests
4. How to Carry Out Bivariate Inferential Analysis in Python
1. A Quick Overview of Inferential Bivariate Analysis Concepts and Framework
Bivariate analysis is the empirical analysis of two variables to ascertain whether a relationship or association exists between them. It has two components: the descriptive and the inferential. The descriptive component is concerned with visualizing the relationship using graphs such as scatterplots and boxplots, and numerical tabulations (cross-tabulations, etc.). The inferential component is concerned with the use of statistical methods to test for and determine the relationship. The figure below highlights the hierarchical system of the components.
In this study, we will look only at the inferential component of the bivariate framework. An extensive description of the descriptive component can be found in my related earlier article — Visualization, not enough to suggest a relationship; Confirm it with a Statistical test.
1.1. Inferential Bivariate Analysis
This is the application of statistical methods in determining if a relationship, association, or difference exists between two variables/features. There are two types, namely — Parametric and Non-parametric.
1.1.1 Parametric Bivariate Analysis: is used when the populations from which the two variables are drawn are normally distributed, or bell-shaped. According to the central limit theorem, parametric methods can still be used when the sample size is at least 30. There are also statistical methods that can be used to test for normality, e.g. the Shapiro-Wilk test, D’Agostino’s K-squared test, Anderson-Darling test, Lilliefors test, Jarque–Bera test, Kolmogorov-Smirnov test, and Chi-square test.
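Before choosing a parametric method, one of these normality checks can be run in a few lines. Below is a minimal sketch using the Shapiro-Wilk test from scipy; the two samples and the seed are simulated purely for illustration:

```python
# Sketch: checking normality with the Shapiro-Wilk test before
# choosing between parametric and non-parametric methods.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=30, scale=5, size=100)   # bell-shaped data
skewed_sample = rng.exponential(scale=5, size=100)      # right-skewed data

stat_n, p_n = shapiro(normal_sample)
stat_s, p_s = shapiro(skewed_sample)

# A large p-value gives no evidence against normality;
# a small p-value (< 0.05) suggests the data are not normal.
print('normal-looking sample p_value: %0.4f' % p_n)
print('skewed sample p_value: %0.4f' % p_s)
```

The skewed sample should return a far smaller p-value, pointing toward a non-parametric test.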
1.1.2 Non-Parametric Bivariate Analysis: is used when the distributions of the populations from which the two variables are drawn are unknown or not normally distributed. It is also commonly considered for small samples that do not meet parametric assumptions. The statistical tests that fall under the non-parametric rule can be seen in the bivariate analysis framework figure above. Just like the parametric statistical tests, the non-parametric statistical test to use at any instance depends on the type of variables/features.
2. Overview of Data Source
The data used in this article is free-sourced secondary data from the Demographic and Health Survey (DHS) for Nigeria collected in 2013, called NDHS 2013. The DHS is a nationally representative survey conducted in developing countries. In Nigeria, the survey is routinely conducted every 5 years and covers all the states of Nigeria, including the capital. The stratum of respondents' data used for this analysis is that of Ogun State.
To request any country-level free sourced data from the DHS program, kindly visit the DHS Program, sign-up, and request.
3. Rules for Applying the Different Inferential Bivariate Statistical Tests
Selecting the appropriate statistical test that would go with a bivariate visualization is determined by two things: the parametric status of the sample and the types of variables involved.
The table below summarizes, for each bivariate inferential statistic, the parametric-status and variable-type criteria that must be met.
4. How to Carry Out Bivariate Inferential Analysis in Python
In this section, we will show how to carry out bivariate analysis in Python, starting with parametric methods, then non-parametric methods.
First, we will import and load the data into Python through a Jupyter notebook.
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Though there are 4617 features (variables) and 630 rows (observations) in this dataframe, we only require 9 for our analysis. As a consequence, we will keep only those 9 in the dataframe.
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
Ogun_ndhs2013.shape
Ogun_ndhs2013.head(9)
4.1. Parametric Inferential Bivariate Analysis in python
4.1.1 Chi-square Test of Independence: This is a bivariate method that is used for testing the association between a nominal target feature (dependent variable) and a nominal predictor feature (independent variable). None of the expected cell counts of the contingency table should be 5 or less. In cases where such cells exist (expected cell count ≤ 5), Fisher’s exact test should be used instead.
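SciPy's fisher_exact covers the 2x2 fallback case mentioned above. Below is a minimal sketch on a made-up 2x2 table with small cell counts; the numbers are hypothetical and not from the NDHS data:

```python
# Sketch: when a 2x2 contingency table has small cell counts,
# Fisher's exact test is the safer choice over the chi-square test.
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows = contraceptive use (yes/no),
# columns = residence (urban/rural)
table = [[3, 9],
         [8, 4]]

odds_ratio, p = fisher_exact(table)
print('Odds ratio %0.3f, p_value %0.3f' % (odds_ratio, p))
```

Note that fisher_exact handles 2x2 tables; larger tables with sparse cells need other approaches (e.g. collapsing categories).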
In this instance of our analysis, our Nominal target feature will be — current contraceptive method (v313), while our Nominal predictor feature will be — the type of place of residence (v025)
We will import all required libraries and modules
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats import chi2
We will run frequency distributions of the predictor and target features, and recode v313 from a 4-response-category to a 3-response-category feature.
Ogun_ndhs2013['v025'].value_counts()
Ogun_ndhs2013['v313'].value_counts()
#We will recode v313 into: No method, Modern method, and Others
v313_recode = {
    "v313": {"No method": 'No method',
             "Modern method": 'Modern method',
             "Traditional method": 'Others',
             "Folkloric method": 'Others'}
}
Ogun_ndhs2013.replace(v313_recode, inplace=True)
Make the cross-tabulation (two-way table) of the predictor feature (v025) and the target feature (v313), and compute the chi-square statistic and its related p_value.
#Develop the crosstabulation needed for the chi-square statistic and p_value computation
tab0=pd.crosstab(index=Ogun_ndhs2013.v313, columns=Ogun_ndhs2013.v025)
print(tab0)
#Compute the chi-square statistic and p_value
chi2, p, dof, expected=chi2_contingency(tab0.values)
print('Chi-square statistic is %0.3f, p_value is %0.3f' %(chi2,p))
The chi-square statistic value is 0.001, p_value=0.9. Since p_value>0.05, we fail to reject the null hypothesis: there is no evidence of an association between the two features.
4.1.2 T-test of mean difference: This is a bivariate parametric method that is used to test if there is a significant difference in the mean value of the numerical target feature (dependent variable) between the two response-categories of the binary predictor feature(independent variable).
There are majorly three types of T-test, namely:
a. One sample t-test
b. Two independent sample t-test
c. Paired Sample t-test
In this instance, the numerical target feature will be — Age (v012), while the binary predictor feature will be — Type of place of residence (v025)
4.1.2.1 One-sample T-test of mean difference: This is a bivariate parametric method that is used to test if there is a significant difference between the mean value of a numerical target feature and a hypothetical value. The hypothetical value usually represents what the population or known value is. As an example, if the Nigerian National census shows that the average age of a Nigerian is 30, this may be considered the hypothetical value. You may collect a data sample from a state in Nigeria, and use a one-sample t-test to see if the average age of people in that state is significantly different from the National value(30).
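The census-age scenario above can be sketched with simulated numbers; the state sample, seed, and the national value of 30 are invented for illustration:

```python
# Sketch: one-sample t-test of a simulated state sample
# against a hypothetical national mean age of 30.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
state_ages = rng.normal(loc=33, scale=6, size=200)  # simulated state sample

t_stat, p = ttest_1samp(state_ages, 30)
print('t statistic %0.2f, p_value %0.4f' % (t_stat, p))
# p < 0.05 -> the state's mean age differs significantly from 30
```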
For our analysis, using our Ogun 2013 NDHS data, we will take the overall mean age of women of reproductive age in Ogun state as the hypothetical value. With the one-sample t-test, we will test whether the mean age of women of reproductive age in rural areas of Ogun is significantly different from the hypothetical overall mean age of women in the state.
In most contexts, the hypothetical value would already be known beforehand; in fact, it would have been documented in the literature or within the community of practice.
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import ttest_1samp
Calculate the mean age of women of reproductive age in Ogun state, and in rural areas of Ogun state only.
#Overall mean age of women in Ogun state
Ogun_state_mean_age=Ogun_ndhs2013['v012']
Ogun_state_mean_age.mean(), Ogun_state_mean_age.std()
#Mean age of women in rural areas of Ogun state
age_rural=Ogun_ndhs2013[(Ogun_ndhs2013['v012']>=15) & (Ogun_ndhs2013['v025']== "Rural")]
age_rural=age_rural['v012']
age_rural.mean(), age_rural.std()
Compute the one-sample t-test statistic and its associated p_value
stats.ttest_1samp(age_rural,31)
The one-sample t-test statistic value is 2.06, p_value=0.03. Since p_value<0.05, there is a significant difference between the mean age of rural respondents and the hypothetical value.
4.1.2.2 Two independent sample t-test of mean difference: This is a bivariate parametric method that is used to test if there is a significant difference in the mean value of the numerical target feature (dependent variable) between the two response-categories of a binary predictor feature that are independent.
Note, it is called an independent sample test because the observations of the two response-categories are expected to be mutually exclusive. For example, the response-categories of the variable — Type of place of residence — are mutually exclusive, because respondents can be either rural or urban, but not both.
In this instance, the numerical target feature will be — Age (v012), while the binary predictor feature will be — Type of place of residence (v025)
We will take the mean age of women of reproductive age in rural and urban areas of Ogun state. Using the two independent sample t-test, we will test whether the mean ages of women of reproductive age in rural areas and urban areas of Ogun are significantly different.
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import ttest_ind
Calculate the mean age of women of reproductive age in rural and urban areas of Ogun state.
#Mean age of women in urban areas of Ogun state
age_urban=Ogun_ndhs2013[(Ogun_ndhs2013['v012']>=15) & (Ogun_ndhs2013['v025']== "Urban")]
age_urban=age_urban['v012']
age_urban.mean(), age_urban.std()
#Mean age of women in rural areas of Ogun state
age_rural=Ogun_ndhs2013[(Ogun_ndhs2013['v012']>=15) & (Ogun_ndhs2013['v025']== "Rural")]
age_rural=age_rural['v012']
age_rural.mean(), age_rural.std()
Compute the independent sample t-test statistic and p_value
stats.ttest_ind(age_rural,age_urban)
The two independent sample t-test statistic value between the mean ages of rural and urban respondents is 2.82, p_value=0.004. Since p_value<0.05, there is a significant mean age difference between rural and urban respondents.
4.1.2.3 Paired-sample t-test of mean difference: This is a bivariate parametric method that is used to test if there is a significant mean difference in the same variable collected at two different times from the same set of respondents or observations.
Let’s give two scenario examples —
First scenario: You may want to test if the mean score of a set of students in Biochemistry in the first year of college is different from their mean score in Biochemistry in the second year.
Second scenario: You may want to test if the mean score of a set of children in writing is different from the mean score of the same set of children in reading.
We will use two similar features in our analysis to show how to use the paired sample t-test: the number of the respondent’s daughters not living with the respondent (v203), and the number of the respondent’s daughters living with the respondent (v205). We will test whether the mean number of a respondent’s daughters living with her is different from the mean number not living with her.
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import ttest_rel
Take out v205 and v203 from the Ogun NDHS 2013 data as series
daughter_home=Ogun_ndhs2013['v205']
daughter_away=Ogun_ndhs2013['v203']
Compute the paired sample t-test statistic and p_value
ttest_rel(daughter_home,daughter_away, nan_policy='omit')
#nan_policy='omit' means that pairs with missing values are ignored in the analysis.
#The other option is 'propagate' (the default argument): if one part of a pair is missing, the function returns "nan".
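The difference between the two nan_policy options can be seen side by side on a small made-up paired dataset (the numbers below are invented for illustration):

```python
# Sketch: how nan_policy changes ttest_rel when a pair is incomplete
import numpy as np
from scipy.stats import ttest_rel

before = np.array([2.0, 3.0, np.nan, 5.0, 4.0, 6.0])
after  = np.array([3.0, 5.0, 4.0,    6.0, 5.0, 8.0])

res_propagate = ttest_rel(before, after)                  # default 'propagate'
res_omit      = ttest_rel(before, after, nan_policy='omit')

print(res_propagate.pvalue)   # nan, because one pair is incomplete
print(res_omit.pvalue)        # computed on the 5 complete pairs
```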
The paired sample t-test statistic value is 9.84 , p_value=0.00. Since p_value<0.05; there is a significant mean difference between the mean number of respondents’ daughters that live at home and those that are living away.
4.1.3 Pearson Correlation test of relationship: This is a bivariate parametric method that is used to test if a linear relationship exists between two numerical variables, a predictor and a target feature. Correlation values range from -1 through 0 (zero) to +1. If the value of the correlation statistic is between 0 and 1, the relationship between the two variables is positive; contextually, it means that as one variable increases, the other increases, and as one decreases, the other decreases. If the value of the correlation statistic is negative, that is, between 0 and -1, the two variables have a negative relationship; contextually, it means that as one variable increases, the other decreases, and vice versa. If the value of the correlation statistic is 0 (zero), the two variables have no linear relationship; contextually, an increase or decrease in one does not correspond to an increase or decrease in the other.
Note, the correlation statistic is widely known as the correlation coefficient, and it is denoted by r.
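The positive and negative directions of r described above can be illustrated with toy data (the series below are invented):

```python
# Sketch: the sign of the Pearson coefficient on toy data
import numpy as np
from scipy.stats import pearsonr

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])

r_pos, _ = pearsonr(x, 2 * x + 1)     # perfectly positive linear: r = 1
r_neg, _ = pearsonr(x, -3 * x + 10)   # perfectly negative linear: r = -1

print('positive r = %0.1f, negative r = %0.1f' % (r_pos, r_neg))
```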
In this instance, the numerical predictor feature will be — Age (v012), while the numerical target feature will be — Number of living children(v218)
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import pearsonr
Take out v012 and v218 from the Ogun NDHS 2013 data as series
Age_X=Ogun_ndhs2013['v012']
No_Living_Chid_Y=Ogun_ndhs2013['v218']
Compute the Pearson correlation statistic and its associated p_value
stats.pearsonr(Age_X, No_Living_Chid_Y)
The Pearson correlation statistic value is 0.7, p_value=0.00. Since p_value<0.05, there is a significant strong positive relationship between the age of respondents and the number of their living children.
4.1.4 Regression test of linear relationship: This is a bivariate parametric method that is used to compute and estimate the linear regression line between two numerical features (target feature (Y); predictor feature (X)) that have been found to be correlated.
A regression line is denoted by — Y=MX+C. It is used for computing the two essential parameters, namely:
1. M, called the gradient or slope of the regression line. It is defined as the change in the value of Y for every unit increase in X (change in Y / change in X).
2. C, called the intercept of the regression line. It is defined as the value of Y when X=0.
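As a quick sketch, linregress recovers M and C when the data lie exactly on a known line; the line Y = 0.5X + 2 below is invented for illustration:

```python
# Sketch: linregress recovers M (slope) and C (intercept) from Y = MX + C
import numpy as np
from scipy.stats import linregress

X = np.arange(10, dtype=float)
Y = 0.5 * X + 2.0          # exact line: M = 0.5, C = 2.0

result = linregress(X, Y)
print('slope %0.2f, intercept %0.2f' % (result.slope, result.intercept))
```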
In this instance, the numerical predictor feature will be — Age (v012), while the numerical target feature will be — Number of living children(v218)
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import linregress
Take out v012 and v218 from the Ogun NDHS 2013 data as series
Age_X=Ogun_ndhs2013['v012']
No_Living_Chid_Y=Ogun_ndhs2013['v218']
Compute regression statistic and p_value
slope, intercept, r_value, p_value, std_err = stats.linregress(Age_X, No_Living_Chid_Y)
#r squared is the coefficient of determination; it lies between 0 and 1
print("r-squared:", r_value**2)
The regression coefficient (slope) is 0.155, p_value=0.00. Since p_value<0.05, the change in the number of living children for every unit increase in the age of respondents is significant. Also, R-squared=0.49 means that 49% of the variance observed in the target feature (number of living children) is explained by the variance of the predictor feature (age of respondents).
4.1.5 One-way Analysis of Variance (ANOVA): This is a bivariate parametric method that is used to test if there is a significant difference in the mean value of the numerical target feature (dependent variable) across the response-categories of a nominal predictor feature/variable (with more than two response-categories). The difference between the t-test and the ANOVA is that the predictor feature in a t-test has two response-categories, while the predictor feature in an ANOVA has more than two.
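A minimal sketch of a one-way ANOVA on three simulated groups with clearly different means (the groups and seed are invented for illustration):

```python
# Sketch: one-way ANOVA across three simulated groups
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
group_a = rng.normal(20, 3, 50)   # mean around 20
group_b = rng.normal(25, 3, 50)   # mean around 25
group_c = rng.normal(33, 3, 50)   # mean around 33

f_stat, p = f_oneway(group_a, group_b, group_c)
print('F statistic %0.2f, p_value %0.4f' % (f_stat, p))
# p < 0.05 -> at least two group means differ significantly
```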
In this instance, the numerical target feature will be — Age (v012), while the nominal predictor feature will be — the educational status of respondents (v106)
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import f_oneway
Calculate the age distribution of respondents by their educational status
#Age distribution of respondents that have no education
Only_NoEducation=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='No education') & (Ogun_ndhs2013['v012']>=15)]
NoEducation_educ_age=Only_NoEducation['v012']
#Age distribution of respondents that have primary education
Only_primary_education=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Primary') & (Ogun_ndhs2013['v012']>=15)]
primary_educ_age=Only_primary_education['v012']
#Age distribution of respondents that have secondary education
Only_secondary_education=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Secondary') & (Ogun_ndhs2013['v012']>=15)]
secondary_educ_age=Only_secondary_education['v012']
#Age distribution of respondents that have higher education
Only_Higher=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Higher') & (Ogun_ndhs2013['v012']>=15)]
Higher_educ_age=Only_Higher['v012']
Compute one way ANOVA statistic and p_value
f_oneway(secondary_educ_age,primary_educ_age, NoEducation_educ_age, Higher_educ_age)
The ANOVA statistic value is 16.4, p_value=0.00. Since p_value<0.05, there is a significant mean age difference between at least two of the educational-status groups.
4.2. Non-parametric Inferential Bivariate Analysis
4.2.1 Mann-Whitney U test: This is a non-parametric method that is analogous to the two independent sample t-test, except it is used when the data are ordinal or ranked. It can be used as a replacement for the independent sample t-test when there is no assumption that the (interval or ordinal) predictor feature is normally distributed, or when the distribution is unknown. The Mann-Whitney U test can still be used when the number of observations is as small as 8. Though the Mann-Whitney U test has less statistical power than the t-test, it is stronger with more skewed data, including data with high outliers. The Mann-Whitney U test assumes equal variance (homoscedasticity) of the two samples being compared; otherwise, it may not be a fit for use even if the data are ordinal.
The null hypothesis for Mann-Whitney is: The probability that a randomly drawn observation from one of the samples is greater than a randomly drawn observation from the other sample is 0.5. In simple terms, the null hypothesis proposes that the median of the two samples is equal or the same.
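A minimal sketch of the Mann-Whitney U test on two simulated skewed samples whose medians differ (the samples and seed are invented for illustration):

```python
# Sketch: Mann-Whitney U on two skewed (exponential) samples,
# where a t-test's normality assumption would be doubtful
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
sample_a = rng.exponential(scale=2.0, size=80)        # skewed sample
sample_b = rng.exponential(scale=2.0, size=80) + 2.0  # same shape, shifted up

u_stat, p = mannwhitneyu(sample_a, sample_b, alternative='two-sided')
print('U statistic %0.1f, p_value %0.4f' % (u_stat, p))
# p < 0.05 -> the two samples' distributions differ significantly
```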
In this instance of our analysis, the target feature will be — Age (v012), while the binary predictor feature will be — Type of place of residence (v025).
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import mannwhitneyu
Get the age distribution of respondents by type of place of residence from the Ogun NDHS 2013 data, as series
#Urban respondents' age data as series
age_urban=Ogun_ndhs2013[(Ogun_ndhs2013['v012']>=15) & (Ogun_ndhs2013['v025']== "Urban")]
age_urban=age_urban['v012']
age_urban.std()
#Rural respondents' age data as series
age_rural=Ogun_ndhs2013[(Ogun_ndhs2013['v012']>=15) & (Ogun_ndhs2013['v025']== "Rural")]
age_rural=age_rural['v012']
age_rural.std()
Compute the Mann-Whitney U test statistic and its associated p_value
stats.mannwhitneyu(age_rural,age_urban)
The Mann-Whitney statistic value is 42368, p_value=0.002. Since p_value<0.05, there is a significant age difference between women in urban and rural areas.
4.2.2 Spearman Rank Correlation: This is a non-parametric method that is analogous to the Pearson correlation statistic. The statistic is called the Spearman rank coefficient and measures the strength and direction of the relationship between two variables (a predictor feature (X) and a target feature (Y)). To be implemented, it requires that the two variables have ordinal properties, so they can be either ordinal or continuous (interval, ratio). Similarly ranked features will have a high Spearman rank correlation coefficient, and vice versa.
The statistical power of the Pearson correlation is stronger than that of the Spearman rank correlation; however, the Spearman rank correlation is suitable when the distributions of the two variables are not normal or are skewed.
Also, the Spearman rank correlation measures the monotonic relationship between the two variables, while the Pearson statistic measures the linear relationship. A monotonic relationship is measured on the same -1 to +1 range as the Spearman correlation coefficient. The p-value of the Spearman test is more reliable for sample sizes of 500 or more. The Spearman rank correlation is considered statistically reliable for testing the null hypothesis when the sample size is at least 8; you can still carry out a Spearman rank correlation on fewer than 8 observations, but it will have close to zero statistical power.
The null hypothesis for the Spearman rank correlation is: There is no monotonic association between the two features. A monotonic relationship is one in which the two variables consistently move in the same direction or in opposite directions, though not necessarily at a constant rate.
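The monotonic-but-not-linear case can be illustrated with toy data: a cubic curve rises monotonically, so the Spearman coefficient is exactly 1 while the Pearson coefficient stays below 1 (the series below is invented):

```python
# Sketch: Spearman captures a monotonic, non-linear relationship
# perfectly, while Pearson (which measures linearity) does not
import numpy as np
from scipy.stats import spearmanr, pearsonr

x = np.arange(1., 21.)
y = x ** 3                      # monotonic increasing, but not linear

rho, _ = spearmanr(x, y)
r, _ = pearsonr(x, y)
print('Spearman rho %0.3f, Pearson r %0.3f' % (rho, r))
```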
We will use — Age (v012) as the predictor feature, and — Number of living children (v218) as the target feature, with the assumption that their distributions are skewed or unknown, but not normal.
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import spearmanr
Pull out v012 and v218 from the Ogun NDHS 2013 data as series
Age_X=Ogun_ndhs2013['v012']
No_Living_Chid_Y=Ogun_ndhs2013['v218']
Compute the Spearman rank correlation statistic and its associated p_value
stats.spearmanr(Age_X, No_Living_Chid_Y)
The Spearman rank correlation statistic value is 0.72, p_value=0.00. Since p_value<0.05, there is a significant positive monotonic relationship between the age of women and their number of living children.
4.2.3 Kruskal-Wallis: This is a non-parametric method that is analogous to the Analysis of Variance (ANOVA), and a substitute for the Mann-Whitney test when the predictor feature (X) has more than two response-categories. The Kruskal-Wallis statistic follows a chi-square distribution when each response-category has at least 5 observations. It computes the median of the target feature (Y) for each response-category of the predictor feature, then compares them to see if they are the same. It is important to note that the Kruskal-Wallis test is also used for scale parameter comparison.
Note, the Kruskal-Wallis test assumes the variance of the response-categories is the same. In instances where this variance assumption does not hold, the Mood median test should be used.
The null hypothesis for Kruskal-Wallis is: The median value of the target feature across the response-categories of the predictor feature is the same or equal.
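A minimal sketch of the Kruskal-Wallis test on three simulated skewed groups with shifted medians (the samples and seed are invented for illustration):

```python
# Sketch: Kruskal-Wallis across three skewed (exponential) groups
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(4)
group_a = rng.exponential(2.0, 50)
group_b = rng.exponential(2.0, 50) + 1.5   # median shifted up
group_c = rng.exponential(2.0, 50) + 3.0   # median shifted further up

h_stat, p = kruskal(group_a, group_b, group_c)
print('H statistic %0.2f, p_value %0.4f' % (h_stat, p))
# p < 0.05 -> at least two groups differ significantly
```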
We will use the two variables used in computing the one-way analysis of variance (one-way ANOVA), with the assumption that their distributions are skewed or unknown, but not normal.
In this instance, the numerical target feature will be — Age (v012), while the nominal predictor feature will be — the educational status of respondents (v106)
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import kruskal
Get the age distribution of respondents by their educational status from the Ogun NDHS 2013 data, as series
#Age distribution of respondents that have no education
Only_NoEducation=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='No education') & (Ogun_ndhs2013['v012']>=15)]
NoEducation_educ_age=Only_NoEducation['v012']
#Age distribution of respondents that have primary education
Only_primary_education=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Primary') & (Ogun_ndhs2013['v012']>=15)]
primary_educ_age=Only_primary_education['v012']
#Age distribution of respondents that have secondary education
Only_secondary_education=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Secondary') & (Ogun_ndhs2013['v012']>=15)]
secondary_educ_age=Only_secondary_education['v012']
#Age distribution of respondents that have higher education
Only_Higher=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Higher') & (Ogun_ndhs2013['v012']>=15)]
Higher_educ_age=Only_Higher['v012']
Compute Kruskal Wallis statistic and p_value
stats.kruskal(secondary_educ_age,primary_educ_age, NoEducation_educ_age, Higher_educ_age)
The Kruskal-Wallis statistic value is 46.33, p_value=0.00. Since p_value<0.05, there is a significant difference in median age between at least two of the women's educational groups.
4.2.4. Mood Median test: This is a non-parametric method that is used to test whether an ordinal target feature (Y) has the same median across the two or more response-categories of a nominal/binary predictor feature (X).
Computing the mood median test requires that the overall median of the target feature is calculated. The median values of the target feature are also then calculated by the response-categories of the predictor features. The value of the median for each response-category is then compared with the overall median. A contingency table is formed based on the comparison outcome between the grand median and the median value of each response-category.
The values of the contingency table are used to compute the Mood median test statistic and its p-value. This computation method makes the Mood median test different from the Mann-Whitney and Kruskal-Wallis tests.
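The contingency table described above can be inspected directly, since scipy's median_test returns it alongside the statistic. Below is a minimal sketch on two simulated groups (samples and seed invented for illustration):

```python
# Sketch: Mood's median test, including the above/below-grand-median
# contingency table it is built from
import numpy as np
from scipy.stats import median_test

rng = np.random.default_rng(3)
group_a = rng.normal(20, 4, 60)
group_b = rng.normal(28, 4, 60)

stat, p, grand_median, table = median_test(group_a, group_b)
print('grand median %0.2f, p_value %0.4f' % (grand_median, p))
print(table)   # rows: counts above/below the grand median; columns: groups
```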
The Mood median test is stronger than both the Mann-Whitney and Kruskal-Wallis tests on data with outliers. Conversely, both the Kruskal-Wallis and Mann-Whitney tests have stronger statistical power than the Mood median test for moderate to large sample sizes. The Mood median test is also suitable in situations where homoscedasticity does not hold, since the Kruskal-Wallis test assumes homoscedasticity.
The null hypothesis for the Mood median test is: The median value of the target feature across the response-categories of the predictor feature, relative to the overall median, is the same or equal.
In this instance, we would assume age to be an ordinal feature, since, in some real-life contexts, it is considered ordinal.
In this instance, the target feature will be — Age (v012), while the predictor feature will be — the educational status of respondents (v106)
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import median_test
Get the age distribution of respondents by their educational status from the Ogun NDHS 2013 data, as series
#Age distribution of respondents that have no education
Only_NoEducation=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='No education') & (Ogun_ndhs2013['v012']>=15)]
NoEducation_educ_age=Only_NoEducation['v012']
#Age distribution of respondents that have primary education
Only_primary_education=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Primary') & (Ogun_ndhs2013['v012']>=15)]
primary_educ_age=Only_primary_education['v012']
#Age distribution of respondents that have secondary education
Only_secondary_education=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Secondary') & (Ogun_ndhs2013['v012']>=15)]
secondary_educ_age=Only_secondary_education['v012']
#Age distribution of respondents that have higher education
Only_Higher=Ogun_ndhs2013[(Ogun_ndhs2013['v106']=='Higher') & (Ogun_ndhs2013['v012']>=15)]
Higher_educ_age=Only_Higher['v012']
Compute mood median statistic and p_value
stat, p, med, tbl = median_test(secondary_educ_age,primary_educ_age, NoEducation_educ_age, Higher_educ_age)
The Mood's median statistic value is 29.72, p_value=0.00. Since p_value<0.05, there is a significant difference in median age between at least two of the women's educational groups.
4.2.5. Goodman and Kruskal's gamma: This is a non-parametric method that measures the association between two ranked variables; in other words, it assesses the rank correlation between two ranked variables.
For the Goodman and Kruskal's gamma statistic to be used, two assumptions must be met: the two variables must be ordinal, and the relationship between the two variables must be monotonic. A monotonic relationship is one in which the two variables consistently move in the same direction, or consistently in opposite directions, though not necessarily at a constant rate. It is typically confirmed visually with a graph rather than estimated statistically. A linear relationship, in contrast, is one in which the two variables move in the same or opposite direction at a constant rate.
Goodman and Kruskal's gamma can be used as a substitute for the Spearman rank correlation, the Wilcoxon two-sample test, and the sign test when the observations of the two variables have tied ranks. A tied rank occurs when observations from two distinct samples have the same rank value.
A Goodman and Kruskal's gamma statistic of -1 indicates a perfect negative association, while a value of +1 indicates a perfect positive association. A value of zero means there is no association.
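To make the ±1 interpretation concrete, here is a minimal sketch (not the function used later in this article) that computes gamma directly from concordant and discordant pairs, applied to hypothetical toy tables:

```python
import numpy as np

def gk_gamma(table):
    """Goodman and Kruskal's gamma = (P - Q) / (P + Q), where P counts
    concordant pairs and Q counts discordant pairs in the crosstab."""
    table = np.asarray(table, dtype=float)
    n_rows, n_cols = table.shape
    P = Q = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # cells strictly below and to the right are concordant with (i, j)
            P += table[i, j] * table[i + 1:, j + 1:].sum()
            # cells strictly below and to the left are discordant with (i, j)
            Q += table[i, j] * table[i + 1:, :j].sum()
    return (P - Q) / (P + Q)

print(gk_gamma([[10, 0], [0, 10]]))   # 1.0  - perfectly concordant table
print(gk_gamma([[0, 10], [10, 0]]))   # -1.0 - perfectly discordant table
```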
It is important to note that, at the time of writing, there is no package or library in python for calculating Goodman and Kruskal's gamma statistic. We would use the function developed by P.Stikker to resolve this problem. Stikker's method was compared with the output of STATA's Goodman and Kruskal method and found to be the same. Please note, P.Stikker's original function has been modified here to suit our needs.
In this instance, the ordinal predictor feature will be — categorized age (v013), while the ordinal target feature will be — wealth index of respondents (v190)
Import required libraries from python and add the function developed by P.Stikker
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import norm
Add the Goodman and Kruskal's gamma python function developed by P.Stikker
def goodmanKruskalgamma(data, row_feature, column_feature):
    myCrosstable = pd.crosstab(data[row_feature], data[column_feature])
    nRows = myCrosstable.shape[0]
    nCols = myCrosstable.shape[1]
    C = [[0 for x in range(nCols)] for y in range(nRows)]

    # top left part
    for i in range(nRows):
        for j in range(nCols):
            h = i - 1
            k = j - 1
            if h >= 0 and k >= 0:
                for p in range(h + 1):
                    for q in range(k + 1):
                        C[i][j] = C[i][j] + list(myCrosstable.iloc[p])[q]

    # bottom right part
    for i in range(nRows):
        for j in range(nCols):
            h = i + 1
            k = j + 1
            if h < nRows and k < nCols:
                for p in range(h, nRows):
                    for q in range(k, nCols):
                        C[i][j] = C[i][j] + list(myCrosstable.iloc[p])[q]

    D = [[0 for x in range(nCols)] for y in range(nRows)]

    # bottom left part
    for i in range(nRows):
        for j in range(nCols):
            h = i + 1
            k = j - 1
            if h < nRows and k >= 0:
                for p in range(h, nRows):
                    for q in range(k + 1):
                        D[i][j] = D[i][j] + list(myCrosstable.iloc[p])[q]

    # top right part
    for i in range(nRows):
        for j in range(nCols):
            h = i - 1
            k = j + 1
            if h >= 0 and k < nCols:
                for p in range(h + 1):
                    for q in range(k, nCols):
                        D[i][j] = D[i][j] + list(myCrosstable.iloc[p])[q]

    # concordant (P) and discordant (Q) pair totals
    P = 0
    Q = 0
    for i in range(nRows):
        for j in range(nCols):
            P = P + C[i][j] * list(myCrosstable.iloc[i])[j]
            Q = Q + D[i][j] * list(myCrosstable.iloc[i])[j]
    GKgamma = (P - Q) / (P + Q)
    n = myCrosstable.sum().sum()
    Z1 = GKgamma * ((P + Q) / (n * (1 - GKgamma**2)))**0.5

    # asymptotic standard errors, as computed by R and by SPSS
    forASE_R = 0
    forASE_SPSS = 0
    for i in range(nRows):
        for j in range(nCols):
            forASE_R = forASE_R + list(myCrosstable.iloc[i])[j] * (Q * C[i][j] - P * D[i][j])**2
            forASE_SPSS = forASE_SPSS + list(myCrosstable.iloc[i])[j] * (C[i][j] - D[i][j])**2
    ASE_R = 4 * (forASE_R)**0.5 / (P + Q)**2
    ASE_SPSS = 2 * (forASE_SPSS - (P - Q)**2 / n)**0.5 / (P + Q)
    Z_R = GKgamma / ASE_R
    Z_SPSS = GKgamma / ASE_SPSS

    from scipy.stats import norm
    p1 = norm.sf(Z1)
    p2_R = norm.sf(Z_R)
    p3_SPSS = norm.sf(Z_SPSS)
    print('gamma statistic is %0.3f, p_value for R is %0.3f, z for R is %0.3f, '
          'p_value for spss is %0.3f, z for spss is %0.3f'
          % (GKgamma, p2_R, Z_R, p3_SPSS, Z_SPSS))
Assess the two features to see if there is a need for pre-processing
Ogun_ndhs2013['v013'].value_counts()
Out[4]:
25-29    122
30-34    108
35-39     88
15-19     88
20-24     82
40-44     80
45-49     62
Name: v013, dtype: int64
Ogun_ndhs2013['v190'].value_counts()
Richest    237
Richer     237
Middle     114
Poorer      40
Poorest      2
Name: v190, dtype: int64
Pre-process wealth index(v190) by recoding into three response-categories (rich, middle, poor), it currently has five response-categories (richest, richer, middle, poorer, and poorest)
#We will recode into: Rich, Middle, and Poor
recode_v190 = { "v190": {"Richest": 'Rich', "Richer": 'Rich', "Middle": 'Middle', "Poorer": 'Poor', "Poorest": 'Poor'} }
Ogun_ndhs2013.replace(recode_v190, inplace=True)
#v190 was recoded into the same feature, v190
#New v190 after pre-processing
Ogun_ndhs2013['v190'].value_counts()

Rich      474
Middle    114
Poor       42
Name: v190, dtype: int64
Compute the crosstab needed for computing the Goodman and Kruskal's gamma statistic
myCrosstable = pd.crosstab(Ogun_ndhs2013['v013'], Ogun_ndhs2013['v190'])
myCrosstable
Compute the Goodman and Kruskal's gamma statistic and p_value
goodmanKruskalgamma(Ogun_ndhs2013, 'v013', 'v190')
The Goodman and Kruskal's gamma statistic value is -0.122, p_value=0.98. Since p_value>0.05, there is no significant association between the two features.
4.2.6 Chi-square goodness of fit test: This is a non-parametric method that tests whether the observed proportions or frequencies of the response-categories of a categorical variable differ from their expected values. That is, it tests whether the distribution of the observed values fits a theoretical distribution (binomial, multinomial). It is analogous to testing whether a feature's mean equals a hypothetical mean in a one-sample T-test. Note, the expected frequency of each response-category should be at least 5.
In general, the set-up of the chi-square goodness of fit test command in python is — scipy.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0), where f_obs is the observed frequency in each response-category and f_exp is the expected frequency in each response-category; by default, the categories are assumed equally likely, so each expected frequency is the mean of the observed frequencies. axis is the axis of the broadcast result of f_obs and f_exp along which to apply the test; if axis is None, all values in f_obs are treated as a single data set. The default is 0.
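As a quick illustration of scipy.stats.chisquare with its default uniform expectation (the counts below are hypothetical, not from the NDHS data):

```python
from scipy.stats import chisquare

# 200 hypothetical respondents split 120/80 across two categories;
# with f_exp omitted, each category is expected to hold 100.
stat, p = chisquare([120, 80])
print('statistic=%.3f, p-value=%.5f' % (stat, p))  # statistic=8.000, p-value=0.00468
```

Since the p-value is below 0.05, the observed split differs significantly from the uniform expectation.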
In this instance, we would use — v313 (currently using a method), as our target categorical feature, while the predictor feature will be — v025 (type of place of residence)
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
X_y=Ogun_ndhs2013[['v313','v025']]
from sklearn.feature_selection import chi2
Indicate the frequency of the two features (v313, v025)
# None of the values is less than 5
Ogun_ndhs2013['v313'].value_counts()

No method             489
Modern method         118
Traditional method     16
Folkloric method        7
Name: v313, dtype: int64
X_y['v025'].value_counts()

Urban    358
Rural    272
Name: v025, dtype: int64
Conduct feature engineering on the two features — convert their response-categories into numerical values and also merge some response-categories
#We will recode v313 into 2-response-category designated as 1 and 0
v313_recode = { "v313": {"No method": 0, "Modern method": 1, "Traditional method": 0, "Folkloric method": 0} }
X_y.replace(v313_recode, inplace=True)
#frequency distribution of v313 after feature engineering
X_y['v313'].value_counts()

0    512
1    118
Name: v313, dtype: int64
#We will recode v025 into 2-response-category designated as 1 and 2
v025_recode = { "v025": {"Urban": 1, "Rural": 2} }
X_y.replace(v025_recode, inplace=True)
#frequency distribution of v025 after feature engineering
X_y['v025'].value_counts()

1    358
2    272
Name: v025, dtype: int64
Compute the chi-square goodness of the fit test statistic and p_value
# state X and y
X = X_y.drop('v313', axis=1)
y = X_y['v313']
#compute the statistic and p_value
chi_scores = chi2(X, y)
print('statistic=%.8f, p-value=%.8f' % (chi_scores[0][0], chi_scores[1][0]))
The goodness of fit test statistic value is 0.00002121, and p_value=0.99632516. Since p_value>0.05, we fail to reject the null hypothesis: the observed frequencies are consistent with the expected frequencies, so there is no significant association between the two features.
4.2.7 McNemar test: This is a non-parametric method used to test whether two proportions of the same binary variable, drawn from the same population, are the same. It is also called a test of homogeneity or consistency because it helps to check whether the same binary variable has changed after an intervention. Basically, it answers the question — is there a change in a phenomenon after something was done concerning it? Essentially, it is used in evaluating the effect or impact of an intervention on a population. It is used in pretest-posttest research designs, case-control studies, and matched pairs.
It is analogous to a paired sample t-test but for a target binary feature measure, not a target continuous feature.
For example, you may test whether there is an increase in the proportion of those that passed an exam after a tutorial was institutionalized. Also, you may want to test whether the proportion of women using contraceptive methods increased after a family planning radio jingle was aired.
For a McNemar test, you must have a target binary feature and a predictor binary feature. Also, the sample data should be randomly selected from the population. The McNemar test requires a 2x2 contingency table to check the marginal homogeneity of the binary variable. The minimum sample size required for the McNemar test is 10.
Cochran's Q test is a broader variant of the McNemar test. It is used to determine whether there are differences in a dichotomous dependent variable between three or more related groups.
Note: the exact argument controls which distribution is used. When exact is True, a binomial distribution is used for the estimation; this requires a 2x2 (square) contingency table and is appropriate for small samples. When exact is False, the chi-square distribution is used instead; the chi-square is an approximation to the distribution of the test statistic for large sample sizes. Therefore, the test statistic is a chi-square statistic when exact is False, and a binomial statistic otherwise.
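A minimal sketch of the exact-versus-approximate distinction on an illustrative 2x2 pre/post table (the counts are hypothetical):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 pre/post table; only the discordant cells (23 and 0) drive the test
table = np.array([[489, 23],
                  [0, 118]])

exact_result = mcnemar(table, exact=True)    # binomial (exact) statistic
approx_result = mcnemar(table, exact=False)  # chi-square approximation
print(exact_result.statistic, exact_result.pvalue)
print(approx_result.statistic, approx_result.pvalue)
```

With 23 discordant pairs in one direction and 0 in the other, both versions reject the null of marginal homogeneity, but the statistics differ because they come from different distributions.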
Since, for the McNemar test, we are mainly interested in assessing the change in the same binary (dichotomous) feature/variable assessed at different times, and the NDHS 2013 is cross-sectional, we would use a single variable and modify it to achieve our analysis and illustration goal.
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
import statsmodels as sm
import statsmodels.stats as sms
from statsmodels.stats.contingency_tables import mcnemar
For this analysis, we would pre-process the v313 variable and create two variables out of it. One will serve as a pre-test, the other as a post-test
import numpy as np
from scipy import stats
from scipy.stats import norm
Create the pre-test feature from v313
#We would use variable - v313
Ogun_ndhs2013['v313'].value_counts(normalize=True) * 100
No method             77.619048
Modern method         18.730159
Traditional method     2.539683
Folkloric method       1.111111
Name: v313, dtype: float64
#create a pretest feature
Ogun_ndhs2013['pre_test_using_modern_method']=Ogun_ndhs2013['v313']
#We will recode into: Yes, No. Only Modern method is yes
pre_recode = { "pre_test_using_modern_method": {"No method": 'No', "Modern method": 'Yes', "Traditional method": 'No', "Folkloric method": 'No'} }
Ogun_ndhs2013.replace(pre_recode, inplace=True)
# check the created pretest feature again
Ogun_ndhs2013['pre_test_using_modern_method'].value_counts(normalize=True) * 100
No     81.269841
Yes    18.730159
Name: pre_test_using_modern_method, dtype: float64
Create the post-test feature from v313
Ogun_ndhs2013['post_test_using_modern_method']=Ogun_ndhs2013['v313']
Ogun_ndhs2013['post_test_using_modern_method'].value_counts(normalize=True) * 100

No method             77.619048
Modern method         18.730159
Traditional method     2.539683
Folkloric method       1.111111
Name: post_test_using_modern_method, dtype: float64
#We will recode into: Yes, No. Modern, Traditional and Folkloric methods are yes
post_recode = { "post_test_using_modern_method": {"No method": 'No', "Modern method": 'Yes', "Traditional method": 'Yes', "Folkloric method": 'Yes'} }
Ogun_ndhs2013.replace(post_recode, inplace=True)
# check the created post-test feature again
Ogun_ndhs2013['post_test_using_modern_method'].value_counts(normalize=True) * 100
No     77.619048
Yes    22.380952
Name: post_test_using_modern_method, dtype: float64
Create the contingency table of the pre-test and post-test features

tab0 = pd.crosstab(Ogun_ndhs2013['pre_test_using_modern_method'], Ogun_ndhs2013['post_test_using_modern_method'])
tab0

post_test_using_modern_method   No  Yes
pre_test_using_modern_method
No                             489   23
Yes                              0  118
Compute McNemar statistic and p_value
result=mcnemar(tab0, exact=True)
# summarize the finding print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))
The McNemar statistic value is 0.0000, and p_value=0.00000. Since p_value<0.05, there is a significant increase in the proportion of women using contraceptive methods between the pre-test and post-test features.
4.2.8 Kendall Tau correlation: This is a non-parametric method that measures the correspondence between two ordinal variables. It is an alternative to the Spearman rank correlation when the sample is small and the tied ranks are many. The features involved must be ordinal or continuous. Though not a strict requirement, a monotonic relationship between the two features makes the Kendall Tau correlation a suitable fit. Just like the Pearson and Spearman rank correlations, its values range from -1 through 0 to +1. If the correlation statistic is between 0 and +1, the relationship between the two variables is positive: as one variable increases, the other increases, and as one decreases, the other decreases. If the correlation statistic is between 0 and -1, the two variables have a negative relationship: as one variable increases, the other decreases, and vice versa. If the correlation statistic is 0, the two variables have no relationship.
There are three types of Kendall Tau correlation: Kendall Tau-a, Kendall Tau-b, and Kendall Tau-c (also known as Stuart's tau-c). Though their values are close, the way each is standardized on the -1 to +1 scale makes them differ slightly. Kendall Tau-a approximates Kendall Tau-b and Tau-c when there are no tied ranks. The Mann-Whitney U test is equivalent to Kendall's tau when the predictor feature is binary.
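The Tau-b/Tau-c distinction can be seen directly in scipy (note: the variant parameter requires scipy >= 1.7; the data below are made-up ordinal scores with several tied ranks):

```python
from scipy.stats import kendalltau

# Hypothetical ordinal scores with tied ranks
x = [1, 1, 2, 2, 3, 3, 4]
y = [1, 2, 2, 3, 3, 4, 4]

tau_b, p_b = kendalltau(x, y, variant='b')  # the default; adjusts for ties
tau_c, p_c = kendalltau(x, y, variant='c')  # Stuart's tau-c
print('tau-b=%.3f  tau-c=%.3f' % (tau_b, tau_c))
```

Both values are strongly positive here, but they are not identical because each variant normalizes the concordant-minus-discordant count differently in the presence of ties.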
We would use the two variables used in computing the Spearman rank correlation coefficient, with the assumptions that their distributions are skewed or unknown, that the sample size is smaller than in the Spearman rank example, and that the relationship is monotonic.
In this instance, the numerical predictor feature will be — Age (v012), while the numerical target feature will be — number of living children (v218)
Import required libraries and dataset
#import Ogun NDHS 2013 dataset into pandas
import pandas as pd
path=r'B:....Ogun_NDHS_2013.dta'
Ogun_ndhs2013=pd.read_stata(path)
Variables_needed=['v313','v025', 'v012', 'v013', 'v203', 'v205', 'v218', 'v106', 'v190'] # These are the only variables needed
Ogun_ndhs2013=Ogun_ndhs2013[Variables_needed]
import numpy as np
from scipy import stats
from scipy.stats import kendalltau
Take out v012 and v218 from the Ogun NDHS 2013 data as series
Age_X=Ogun_ndhs2013['v012']
No_Living_Chid_Y=Ogun_ndhs2013['v218']
Compute the Kendall Tau correlation statistic and its associated p_value
tau, p_value = kendalltau(Age_X, No_Living_Chid_Y)
tau, p_value
The Kendall Tau correlation statistic value is 0.57, and p_value=0.00000. Since p_value<0.05, there is a significant relationship between the age of respondents and the number of living children.
The Jupyter notebook used for this analysis is available here
Key take-aways:
- Carrying out bivariate visualization alone is statistically not enough to confirm that a relationship exists between two features.
- Using a bivariate statistical test addresses the possibility that the relationship shown by the visualization happened by chance.
- The type of bivariate statistical test depends on the type of the target and predictor features (categorical, numerical), the sample size, and the parametric nature (parametric, non-parametric) of the features.
- For a statistical test between a predictor and target feature, if the p_value is less than or equal to 0.05 (p_value≤0.05), the predictor feature can be included in the modeling; otherwise (p_value>0.05), the predictor feature should not be included in the modeling of the target feature.
- The central limit theorem supports using a parametric method for features that have a sample size ≥ 30.
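The selection rule in the take-aways can be sketched as a simple screening step (the p-values below are illustrative placeholders echoing the tests run in this article, not authoritative results):

```python
# feature -> bivariate-test p-value (illustrative values)
bivariate_p_values = {
    'v012': 0.000,   # age vs. number of living children (significant)
    'v013': 0.980,   # age group vs. wealth index (not significant)
    'v025': 0.996,   # residence vs. contraceptive use (not significant)
}

# keep only predictors whose bivariate test met the p <= 0.05 threshold
selected = [feature for feature, p in bivariate_p_values.items() if p <= 0.05]
print(selected)  # ['v012']
```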
Now that's it. I hope you find this useful. Please drop your comments and follow me on LinkedIn at Ayobami Akiode LinkedIn
References
- https://www.real-statistics.com
- Features – A list of our statistical guides | Laerd Statistics
- Statistics How To: Elementary Statistics for the rest of us!
- Numpy and Scipy Documentation – Numpy and Scipy documentation
- scikit-learn
- stikpet
- Wikipedia
Visualization Only! Not Enough. How to Carry Out Bivariate Statistical Test in Python was originally published in Geek Culture on Medium.