7 criteria that must be met before using a linear regression model (LRM) for prediction

In this article, we will look at the seven criteria that must be met before using Multiple Linear Regression to model a prediction. The article covers the following topics:

1. Definition of Multiple Linear Regression (MLR)

2. Difference between Multiple Linear Regression (MLR) and Simple Linear Regression (SLR)

3. Summary table and detail of the criteria that must be met before using MLR appropriately for prediction

4. Implementation and test of each criterion in Python

5. Conclusion

1. Definition of Multiple Linear Regression (MLR)

Multiple Linear Regression is a statistical model used to predict a target variable (Y, dependent variable) from two or more predictor variables (X, features, independent variables).

2. Difference between Multiple Linear Regression (MLR) and Simple Linear Regression (SLR)

MLR is the plural version of Simple Linear Regression (SLR) in terms of the number of feature variables deployed to predict the target variable. While an MLR model has at least two feature (independent, predictor) variables predicting the target variable, an SLR model has exactly one feature variable.

3. Summary table and detail of the criteria that must be met before using MLR appropriately for prediction

In this section, we will present and describe the seven criteria that must be met before using MLR for prediction. The summary table also summarizes the statistical and graphical tests appropriate for checking whether each criterion is met.

Criteria 1: The Target Variable (Dependent Variable) must be a Continuous or Discrete Variable, while each of the Predictor Variables (Independent Variables) must be Continuous (Ratio or Interval), Discrete, or Ordinal.

The target variable can only be a continuous or discrete variable, while each of the predictor variables can only be continuous, discrete, or ordinal. None should be a nominal (categorical) variable.

What are continuous or discrete variables? Continuous and discrete variables are those that are intrinsically expressed in numbers, e.g. age, height, and weight. In simple terms, a continuous variable can be considered one with a decimal extension, e.g. weight: 2.6 kg. A discrete variable can be considered one that does not have a decimal extension, e.g. age in whole years: you are either 2 years old or 3 years old, not 2.5 or 2.8 years old.

What are ordinal variables? Ordinal variables can be referred to as position-oriented variables (first position, second position, third position, etc.), where referencing is placed on ranking, e.g. the grade of a school subject or the position of an athlete in a race. Though an ordinal variable is generally considered a qualitative variable, it must be declared or converted to a numerical variable before use in a multiple regression model.

Statistical Test / Diagnostic Check for Assessing Criteria 1

Check the variable type of each variable (target and predictors). Different applications have different ways of checking this. In Python, the pandas info() method is used to check variable types (integer, float, and string). In the STATA application, the codebook command is used. Floats are numbers with a decimal extension; integers have none.

Criteria 2: Only a Linear relationship exists between each of the Predictor variables (Independent variables) and the Target variable (Dependent Variable)

Only a linear relationship should exist between each predictor (independent) variable X and the target (dependent) variable Y. In terms of mathematical representation, the relationship between the independent variables and the dependent variable can only be represented by the equation of a straight line, Y = MX + C (not by any other type of equation, such as a quadratic or polynomial), where M is the gradient of the line and C is the intercept.

What is the gradient of a straight line? The gradient of a straight line, denoted by M, is the change in Y for every unit increase in X. It can be either negative or positive. A positive gradient means that Y increases as X increases by a unit; a negative gradient means that Y decreases as X increases by a unit. A zero gradient means that Y does not change at all as X changes. For example, on the line Y = 2X + 1, every unit increase in X increases Y by 2.

What is the intercept of a straight line? The intercept of a straight line, denoted by C, is the average value of Y when X = 0.

Statistical Test for checking Criteria 2 — Correlation Matrix and corresponding P-value

The correlation matrix and the corresponding p-values can be used to check if a statistically significant linear relationship exists between each of the independent variables and the dependent variable. The correlation matrix indicates the Pearson correlation coefficient, and the corresponding p-value, between each of the predictor variables and the target variable.

According to the rule of thumb, a p-value < 0.05 indicates that a statistically significant linear relationship exists between the specific independent variable and the dependent variable; otherwise, none exists.

Graphical Diagnostic for checking Criteria 2 — A Scatter Plot between each Independent Variable and the Dependent Variable, with a regression line

A scatter plot of each independent variable against the dependent variable, mapped with a regression line, can be used to visually check if only a linear relationship exists between each independent variable and the dependent variable. For each scatter plot, the independent variable is on the X-axis, while the dependent variable is on the Y-axis.

If the regression line is visually a straight line rather than quadratic or polynomial, irrespective of the direction of its slope (downward or upward), then a linear relationship exists between that independent variable and the dependent variable.

What is a regression line? A regression line is the line of best fit through the scattered data points on an X-Y graph.

What is a scatter plot? A graph used to plot a set of points for two variables, with the independent variable on the horizontal axis and the dependent variable on the vertical axis.

Criteria 3: No Multicollinearity must exist between the independent variables.

Multicollinearity is a situation where independent variables are strongly correlated. No multicollinearity exists between a set of independent variables (Xs) if no pair of them is strongly correlated. The presence of multicollinearity inflates standard errors, which leads to wider confidence intervals. Wider confidence intervals lead to less precise estimates of the slope parameters in regression analysis.

Statistical Test for checking Criteria 3 — Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) can be used to check if multicollinearity exists between the independent variables. It indicates whether a correlation exists between an independent variable and the group of other independent variables. In summary, if an independent variable has a high VIF, it can be well predicted from at least one of the other independent variables.

The general rule of thumb for multicollinearity: if VIF <= 3, multicollinearity does not exist. If 4 <= VIF <= 9, multicollinearity might exist, and there may be a need for further investigation. If VIF >= 10, multicollinearity significantly exists. The reciprocal of VIF is called Tolerance.
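To make the rule concrete, below is a minimal sketch of how a VIF can be computed by hand, assuming a pandas DataFrame X whose columns are the independent variables (the helper name vif_manual is hypothetical; the statsmodels function used later in this article follows the same logic).

#a minimal sketch: VIF of one predictor, computed by hand
import pandas as pd
import statsmodels.api as sm

def vif_manual(X: pd.DataFrame, col: str) -> float:
    #regress the chosen predictor on all the other predictors
    others = sm.add_constant(X.drop(columns=[col]))
    r_squared = sm.OLS(X[col], others).fit().rsquared
    #VIF = 1 / (1 - R^2); its reciprocal is the Tolerance
    return 1.0 / (1.0 - r_squared)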

Graphical Diagnostic for checking Criteria 3 — Scatter plot to visualize correlation effect among independent variables

A pairwise scatter plot between each possible pair of the independent variables can be used to visually assess if multicollinearity exists between each pair.

Criteria 4: There has to be Homoscedasticity in the data

To understand the concept of Homoscedasticity, we can break the word into its smallest units of meaning (morphemes). Homoscedasticity can be broken into two units, namely “Homo” and “Scedasticity”. That is, Homoscedasticity → Homo + Scedasticity

What is Homo? It means “the same”. What about Scedasticity? It refers to the scatter, or dispersion, of the error terms.

Simply put, Homoscedasticity is a situation where the error terms are randomly distributed with the same variance (constant variance) across all values of the independent variables, rather than following a systematic pattern of growing or shrinking spread.

To understand the concept of homoscedasticity, we need to understand the concept of a residual on the X-Y plane. A residual (denoted as Yr) is the difference between each observed value of the target variable (denoted as Yo) and the estimated/predicted value of the target variable (denoted as Yp) produced by the prediction model.

Residual (Yr) = observed value of the target variable (Yo) − estimated/predicted value of the target variable (Yp): Yr = Yo − Yp

For homoscedasticity to hold, the distribution of Yr on an X-Y plane, for each observation of each independent variable X being assessed, must form a band of roughly constant width around the same horizontal line. If, instead, the spread of Yr along the Y-axis grows or shrinks as X increases (a fan or funnel shape), the variance is not constant and heteroscedasticity is present. Therefore, homoscedasticity holds in data when the spread of the residuals around the line of best fit stays the same across all observation points.

Statistical Test for checking Criteria 4 — Breusch-Pagan test

The Breusch-Pagan test can be used to check for the homoscedasticity of a dataset. The null hypothesis for the Breusch-Pagan test is that the error variances are all equal (homoscedasticity). The alternative hypothesis is that the error variances are not equal (heteroscedasticity); more specifically, that the variances increase (or decrease) as Y increases.

Graphical Diagnostic for checking Criteria 4 — The Residual by Fitted Values Scatterplot

The residuals-by-fitted-values scatter plot is a good way to check whether the data are homoscedastic (meaning the residuals have equal variance across the regression line). In a residuals-by-fitted-values scatter plot, if the spread of the residuals around the regression line is the same throughout, homoscedasticity occurs; otherwise, heteroscedasticity does.

Criteria 5: The error terms are normally distributed

To understand the concept of normally distributed error terms, we need to understand the concept of error terms. Error terms are the variation in the value of the dependent variable not explained by the independent variables. So linear regression assumes that the variability in Y, which is unexplained by X, is normally distributed.

Statistical Test for checking Criteria 5 — Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test can be used to check if the distribution of the error terms follows a normal distribution. Besides the Kolmogorov-Smirnov test, another common test that can be used to statistically check the normality of error terms is the Shapiro-Wilk test. The Shapiro-Wilk test is preferred for small samples (n less than or equal to 50); for larger samples, the Kolmogorov-Smirnov test is recommended.
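As a rough sketch of how the two tests can be called in SciPy (on synthetic residuals for illustration; in practice, pass the regression residuals):

#minimal sketch of both normality tests with SciPy
import numpy as np
from scipy.stats import shapiro, kstest

rng = np.random.default_rng(1)
resid = rng.normal(size=40)

print(shapiro(resid))        #preferred for small samples (n <= 50)
print(kstest(resid, 'norm')) #common choice for larger samples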

Graphical Diagnostic for checking Criteria 5 — The Normal Q-Q Plot

A Q-Q plot is a scatter plot formed by plotting two sets of quantiles, one on the X-axis and the other on the Y-axis, against one another. If both come from the same distribution, the plotted points (x, y) roughly form a straight line. If our data come from a normal distribution, we should see all the points sitting on a straight line, not forming an S or quadratic shape.

What is a quantile? A quantile is a value that serves as a cut-off point. It divides the set of values of a variable or distribution into continuous intervals of equal probability. For example, the 0.25 quantiles divide a set of values into 4 equal parts, and the 0.20 quantiles divide a set of values into 5 equal parts. Percentiles (1/100, dividing into 100 equal parts), deciles (1/10, dividing into 10 equal parts), quartiles (1/4, dividing into 4 equal parts), and terciles (1/3, dividing into 3 equal parts) are examples of quantiles used in research and statistics.
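A quick illustration of quantiles with NumPy (illustrative values only):

#compute the quartiles of a small sample
import numpy as np

values = np.array([2, 4, 4, 5, 7, 9, 11, 12, 13, 15])
#the 0.25, 0.50, and 0.75 quantiles split the data into four equal parts
q1, median, q3 = np.quantile(values, [0.25, 0.50, 0.75])
print(q1, median, q3)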

Criteria 6: There should not be Significant Outliers in the data

Outliers are observations of a variable that do not follow the expected trend, pattern, or behavior of the remaining observations of that variable. Outliers stand out from the rest of the observations and can be considered the rebel observations of the variable: they do not fit with the rest of the data and can be said to stand alone.

Within the regression context, outliers are observations that fall far away from the “cloud” of points. They are extremely important because they can have a strong influence on regression prediction. They are capable of negatively affecting the regression by distorting (over- or under-stating) the regression coefficient of the predictor containing the outlier points. This can ultimately reduce the accuracy of the predictive model, and it can also affect statistical significance.

Statistical Test for checking Criteria 6 — Cook’s Distance

Cook’s Distance is a measure of each observation’s influence on a linear regression. Observations with a very large influence may be outliers.

To identify observations that have a very high influence on a prediction using Cook’s distance, there is a rule of thumb: any observation with a Cook’s distance value greater than 4/n (where n is the total number of observations, the sample size) should be considered a potential outlier and therefore examined further. Also, any observation with a Cook’s distance value more than three times the mean Cook’s distance may possibly be an outlier.

Graphical Diagnostic for checking Criteria 6 — Boxplot

A boxplot captures the summary of data with a simple box and whiskers in such a way that it reveals observations that differ from the rest. A boxplot displays the 5-number summary statistics, namely the minimum (Q0, 0th percentile), lower quartile (Q1, 25th percentile), median (Q2, 50th percentile), upper quartile (Q3, 75th percentile), and maximum (Q4, 100th percentile), and indicates the outliers. One can get insights (quartiles, median, and outliers) into a dataset just by looking at its boxplot.

A boxplot outlier detector (box-and-whisker graph) is easy to interpret, even for non-technical audiences.

5-number summary of the Boxplot. Image from FlowingData

In a boxplot, an extreme value is considered an outlier if it is at least 1.5 interquartile ranges below the first quartile or at least 1.5 interquartile ranges above the third quartile.

The interquartile range (IQR) in a boxplot is the distance between the upper and lower quartiles (Q3 − Q1). The IQR is used as a measure of dispersion to indicate how spread out the values of a continuous variable are around the median. A higher IQR means higher spread; a lower IQR means lower spread.
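A minimal sketch of the 1.5 × IQR rule described above (the helper name iqr_outlier_fences is hypothetical):

#compute the boxplot outlier fences for a set of values
import numpy as np

def iqr_outlier_fences(values):
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    #points below the lower fence or above the upper fence are
    #flagged as outliers, matching the boxplot whiskers
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr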

Criteria 7: No auto-correlation of the error terms

Autocorrelation, within the context of regression, occurs when the error terms are dependent on each other, that is, when they strongly influence each other. The existence of autocorrelation can be assessed from the level of randomness of the error terms. If the error terms are so random that they do not show any pattern, they can be said to be independent and non-autocorrelated. However, if the error terms show a clear pattern, they can be said to be dependent on each other, and autocorrelated.

Statistical Test for checking Criteria 7 — Durbin-Watson Test

A Durbin-Watson test is used to determine whether autocorrelation exists among the error terms. If autocorrelation exists, the test indicates whether it is positive or negative.

The Durbin-Watson test statistic (d) always ranges between 0 and 4. If the value is near 2, it indicates evidence of non-autocorrelation. If the value is towards 0, it indicates evidence of positive autocorrelation. If the value is towards 4, it indicates evidence of negative autocorrelation.

The null hypothesis for the Durbin-Watson test statistic: the residuals from the regression are not autocorrelated (autocorrelation coefficient ρ = 0). The alternative hypothesis: the residuals from the regression are autocorrelated (autocorrelation coefficient ρ > 0). The Durbin-Watson statistic is approximately equal to 2*(1 − r), where r is the sample lag-1 autocorrelation of the residuals.

As a general rule of thumb, if the Durbin-Watson test statistic is between 1.5 and 2.5, autocorrelation is not considered a concern. Values outside of this range could indicate that autocorrelation is a problem.
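The approximation d ≈ 2*(1 − r) can be verified with a short sketch on synthetic residuals (illustrative data, not this article’s dataset):

#compare the Durbin-Watson statistic with 2*(1 - r)
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
resid = rng.normal(size=1000)

d = durbin_watson(resid)
#lag-1 sample autocorrelation of the residuals
r = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(d, 2 * (1 - r)) #the two values should be close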

Graphical Diagnostic for checking Criteria 7 — The Residual by Successive Residual Scatter plot (LagPlot of the Residual)

A lag plot of the residuals is a type of scatter plot that visually indicates whether the residuals are independent or autocorrelated. The lag plot works by plotting each residual value versus the value of the successive residual; that is, the first residual is plotted against the second, the second against the third, and so on.

As an illustration, if the total number of residuals is n, so that we have R1, R2, R3, …, Rn, the lag plot will be an X-Y plane where each residual Rn is plotted on the Y-axis against Rn-1 on the X-axis. That is, we plot R2 on the Y-axis against R1 on the X-axis, R3 on the Y-axis against R2 on the X-axis, and so on. The lag occurs on the X-axis, at lag = 1, since each residual is compared with the residual immediately before it.

So the maximum number of points that can be plotted on the lag plot is n − 1, since the last lag plot point P will be P(Rn-1, Rn). Because of the way the residuals are paired, there will be one fewer point on this plot than on most other types of residual plots.

In a lag plot, if the points are randomly scattered with no apparent structure or pattern, then no autocorrelation exists in the data; random residual distributions do not exhibit any identifiable structure in the lag plot. However, if the points cluster in a clearly patterned or structured way, such that they form a linear, quadratic, or polynomial shape, then autocorrelation exists in the residual distribution.

4. Implementation and test of each criterion in Python

In this section, we will implement the assessment of each criterion in Python.

The data used for this illustration is a freely available dataset from the Nigerian Demographic and Health Survey 2018 (NDHS 2018). The DHS is a nationally representative survey conducted in developing countries.

To request any country-level dataset from the DHS program, kindly visit the DHS Program website, sign up, and submit a request.

1. We will import and load the data into Python through a Jupyter notebook

#import data into python
import pandas as pd

path=r'B:ndhs2018_respondent_with_child.csv'
ndhs2018_respondent_with_child=pd.read_csv(path)

The dataset (dataframe) is named “ndhs2018_respondent_with_child”, as indicated above. Let’s check the number of observations and variables in the dataset using the shape attribute

#determine the number of observations and columns
ndhs2018_respondent_with_child.shape
(29992, 4)

The number of observations is 29992, while the number of variables (columns) is 4.

Note: We did some preprocessing of the data to meet our goal before importing it into the Jupyter notebook. The original dataset has thousands of variables (columns) and observations.

Next, we will check the name and data type of the 4 variables using the info() method

#determine the variable type of each variable in the dataset
ndhs2018_respondent_with_child.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29992 entries, 0 to 29991
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 v012 29992 non-null int64
1 v106 29992 non-null object
2 v190 29992 non-null object
3 Percent_children_dead 29992 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 937.4+ KB

The four variables are v012, v106, v190, and Percent_children_dead. Variables v012 and Percent_children_dead are numeric (int64 and float64 respectively), while variables v106 and v190 are object (string) variables.

Next, we will conduct an exploratory analysis of the four variables (columns) by running a percentage frequency distribution for each using the value_counts() method.

We also see the option-categories contained in each variable. For example, variable v106 has 4 option-categories namely — no education, secondary, primary, and higher.

#list of the response-categories of the educational status (v106) variable
ndhs2018_respondent_with_child['v106'].value_counts(normalize=True)*100
no education    41.527741
secondary 31.838490
primary 17.841424
higher 8.792345
Name: v106, dtype: float64
#list of the response-categories of the wealth index status (v190) variable
ndhs2018_respondent_with_child['v190'].value_counts(normalize=True)*100
poorer     21.399040
middle 21.172313
poorest 20.992265
richer 19.825287
richest 16.611096
Name: v190, dtype: float64
#percentage distribution of the percent of children dead variable (Percent_children_dead)
ndhs2018_respondent_with_child['Percent_children_dead'].value_counts(normalize=True)*100
0.00     67.361296
33.33 4.527874
50.00 3.831022
25.00 3.804348
20.00 3.214190
...
90.00 0.003334
7.14 0.003334
91.67 0.003334
61.54 0.003334
90.91 0.003334
Name: Percent_children_dead, Length: 61, dtype: float64

#percentage distribution for variable age (v012)
ndhs2018_respondent_with_child['v012'].value_counts(normalize=True)*100
30    6.948520
25 6.658442
35 6.255001
40 5.128034
28 4.154441
45 4.134436
20 3.907709
32 3.587623
27 3.584289
38 3.440918
26 3.217525
22 3.204188
36 2.817418
33 2.807415
23 2.774073
29 2.747399
48 2.717391
31 2.564017
42 2.557349
34 2.550680
37 2.534009
24 2.510670
49 2.310616
39 2.140571
43 2.060549
21 2.007202
46 1.800480
41 1.767138
47 1.613764
18 1.513737
44 1.507069
19 1.373700
17 0.816885
16 0.230061
15 0.056682
Name: v012, dtype: float64

2. We will assess the data to see if they meet the 7 criteria.

2.1. Criteria 1: The Independent variable(s) must be Continuous, Ordinal, or Discrete, while the Dependent variable(s) must only be Continuous or Discrete but not Ordinal

Criteria 1 Check — Variable Type Check

In Python, we will check the variable types of the 4 columns using the info() method

#check the variable types of the four columns(variables)
ndhs2018_respondent_with_child.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29992 entries, 0 to 29991
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 v012 29992 non-null int64
1 v106 29992 non-null object
2 v190 29992 non-null object
3 Percent_children_dead 29992 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 937.4+ KB

From the result, only the variables (columns) v012 and Percent_children_dead are numeric (int64 and float64); the other two (v106 and v190) are not (object).

Next, we will convert the two variables (v106, v190) that are not numeric to numeric variable types, so they can meet Criteria 1.

Let’s start with v106:

#code the response-categories as numbers
#note: a strictly order-preserving (ordinal) coding would place primary below secondary
Independent_v106 = {
"v106": {"no education": 0, "secondary": 1, "primary": 2, "higher":3}
}
ndhs2018_respondent_with_child.replace(Independent_v106, inplace=True)

Let’s check if the variable type of variable v106 has changed from object to integer.

#check the education variable to see if it has been expressed in numbers
ndhs2018_respondent_with_child['v106'].value_counts(normalize=True) * 100
0    41.527741
1 31.838490
2 17.841424
3 8.792345
Name: v106, dtype: float64

The variable type of v106 has changed from object to integer

Let’s convert v190 to integer:

#code the response-categories as numbers
#note: a strictly order-preserving (ordinal) coding would place poorest below poorer
Independent_v190 = {
"v190": {"poorer": 1, "poorest": 2, "middle": 3, "richer":4, "richest":5}
}
ndhs2018_respondent_with_child.replace(Independent_v190, inplace=True)

Let’s check if the variable type of v190 has changed from object to integer.

#check the wealth index variable to see if it has been expressed in numbers
ndhs2018_respondent_with_child['v190'].value_counts(normalize=True) * 100
1    21.399040
3 21.172313
2 20.992265
4 19.825287
5 16.611096
Name: v190, dtype: float64

Next, let’s check if the four variables or columns are now integers

#check all the variables to see if they are now integers
ndhs2018_respondent_with_child.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29992 entries, 0 to 29991
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 v012 29992 non-null int64
1 v106 29992 non-null int64
2 v190 29992 non-null int64
3 Percent_children_dead 29992 non-null float64
dtypes: float64(1), int64(3)
memory usage: 937.4 KB

Verdict for Criteria 1: All four variables are now numeric, so the first criterion is met.

2.2. Criteria 2: Only a Linear relationship exists between each of the Predictor variables (Independent variable) and the Target variable (Dependent Variable).

We will use both statistical and graphical diagnostic checks to see if the dataset and its variables meet Criteria 2

2.2.1 Statistical Test for checking Criteria 2 — Correlation Matrix and corresponding P-value.

Let’s use a correlation matrix and its p-value to check for Criteria 2

#conduct correlation matrix 
corr_matrix=ndhs2018_respondent_with_child.corr()
#round off correlation coefficient to 4 decimal places
round(corr_matrix,4)
v012                      v106       v190         Percent_children_dead
v012 1.0000 0.1095 0.1229 0.0771
v106 0.1095 1.0000 0.4389 -0.1516
v190 0.1229 0.4389 1.0000 -0.1841
Percent_children_dead 0.0771 -0.1516 -0.1841 1.0000

The correlation matrix shows that a linear relationship exists between each of the independent variables (v012, v106, and v190) and the dependent variable (Percent_children_dead). The relationship is stronger between Percent_children_dead and v190 (-0.1841, about -18%) and v106 (-0.1516, about -15%), and weaker for v012 (0.0771, about 8%).

The correlation matrix p-values (all < 0.05) show that all the pairwise relationships are statistically significant

#use a custom function (sketched below) to calculate p-values for the correlation matrix
from scipy.stats import pearsonr
r_pvalues(ndhs2018_respondent_with_child)
                       v012 v106 v190  Percent_children_dead
v012 0.0 0.0 0.0 0.0
v106 0.0 0.0 0.0 0.0
v190 0.0 0.0 0.0 0.0
Percent_children_dead 0.0 0.0 0.0 0.0
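
The custom function r_pvalues used above is not shown in full; here is a minimal sketch of what it might look like (a hypothetical implementation pairing scipy.stats.pearsonr with a pandas DataFrame):

#sketch of a helper that returns a matrix of correlation p-values
import pandas as pd
from scipy.stats import pearsonr

def r_pvalues(df: pd.DataFrame) -> pd.DataFrame:
    cols = df.columns
    pvals = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for a in cols:
        for b in cols:
            #p-value of the Pearson correlation between each pair of columns
            pvals.loc[a, b] = pearsonr(df[a], df[b])[1]
    return pvals.round(4)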

2.2.2. Graphical Diagnostic for checking Criteria 2 — A Scatter Plot between each Independent Variable and the Dependent Variable, with a regression line

Using scatter plots, let’s visually check if a linear relationship exists between the dependent variable (Percent_children_dead) and each of the independent variables (v012, v106, and v190).

Let’s check the relationship between Percent_children_dead and v012

# importing libraries
import seaborn as sb

# use regplot
sb.regplot(x = "v012",
y = "Percent_children_dead",
ci = None,
data = ndhs2018_respondent_with_child)

The regression line supports the linear relationship between v012 and Percent_children_dead. It shows a positive relationship: “Percent_children_dead” increases as “v012” increases.

Let’s visually check that a linear relationship exists between the dependent variable (Percent_children_dead) and the independent variable v106

# importing libraries
import seaborn as sb

# use regplot
sb.regplot(x = "v106",
y = "Percent_children_dead",
ci = None,
data = ndhs2018_respondent_with_child, color='yellow')

The regression line supports the linear relationship between v106 and Percent_children_dead indicated statistically. It shows a negative or inverse relationship: “Percent_children_dead” decreases as “v106” increases.

Let’s visually check if a linear relationship exists between the dependent variable (Percent_children_dead) and the independent variable v190

# importing libraries
import seaborn as sb

# use regplot
sb.regplot(x = "v190",
y = "Percent_children_dead",
ci = None,
data = ndhs2018_respondent_with_child, color='red')

The regression line supports the linear relationship between v190 and Percent_children_dead indicated statistically. It shows a negative or inverse relationship: “Percent_children_dead” decreases as “v190” increases.

Verdict for Criteria 2: The three independent variables each have a pairwise linear relationship with the dependent variable. v106 and v190 have an inverse or negative relationship with the dependent variable, while v012 has a direct or positive relationship with the dependent variable. The second criterion is met.

2.3. Criteria 3: No Multicollinearity must exist between the independent variables.

We will use both statistical and graphical diagnostic checks to see if the dataset and its variables meet Criteria 3

2.3.1. Statistical Test for checking Criteria 3 — Variance Inflation Factor (VIF)

Let’s use the variance inflation factor (VIF) to statistically check whether each independent variable can be predicted from the other independent variables (v012 from v106 and v190, and so on).

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# Split the columns into y and X
y = ndhs2018_respondent_with_child['Percent_children_dead']
X = ndhs2018_respondent_with_child[['v012', 'v106', 'v190']]
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns

# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
for i in range(len(X.columns))]

print(vif_data)
feature       VIF
0 v012 4.670006
1 v106 2.403949
2 v190 5.822557

Based on the rule of thumb, there does not appear to be significant multicollinearity between the independent variables, especially for v106 (2.40). v012 (4.67) and v190 (5.82) fall in the range where further investigation may be needed, but neither reaches the VIF >= 10 threshold.

2.3.2. Graphical Diagnostic for checking Criteria 3 — Scatter plot to visualize correlation effect among independent variables

Using scatterplot, let’s visually check if multicollinearity exists between the independent variables using pairwise plotting.

import seaborn as sns
# Split the columns into y and X
y = ndhs2018_respondent_with_child['Percent_children_dead']
X = ndhs2018_respondent_with_child[['v012', 'v106', 'v190']]
sns.pairplot(X);

No clear linear relationship appears between any pair of the independent variables. Therefore, there is no clear or significant multicollinearity between the independent variables.

Verdict for Criteria 3: No multicollinearity exists between the independent variables. The third criterion is met.

2.4. Criteria 4: There has to be Homoscedasticity in the data

We will use both statistical and graphical diagnostic checks to see if the dataset and its variables meet Criteria 4

2.4.1. Statistical Test for checking Criteria 4 — Breusch-Pagan test

Let’s use a Breusch-Pagan test to statistically check if there is homoscedasticity in the data. Recall: the null hypothesis for the Breusch-Pagan test is that the error variances are all equal (homoscedasticity).
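
The diagnostics below reference fit_model, but the model-fitting step is not shown. Here is a minimal sketch of the assumed statsmodels OLS fit (the formula is inferred from the variables used throughout this section):

#fit the MLR model whose residuals the diagnostics below examine
import statsmodels.formula.api as smf

fit_model = smf.ols(
    'Percent_children_dead ~ v012 + v106 + v190',
    data=ndhs2018_respondent_with_child
).fit()
residuals = fit_model.resid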

# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.compat import lzip
import statsmodels.stats.api as sms
# Conduct the Breusch-Pagan test
names = ['Lagrange multiplier statistic', 'p-value',
'f-value', 'f p-value']

# Get the test result
test_result = sms.het_breuschpagan(fit_model.resid, fit_model.model.exog)

lzip(names, test_result)
[('Lagrange multiplier statistic', 2751.761503193194),
('p-value', 0.0),
('f-value', 1009.8122027447771),
('f p-value', 0.0)]

Based on the result above, the Lagrange multiplier statistic is 2751.76 and the corresponding p-value is 0.0. Because this p-value is less than 0.05, we reject the null hypothesis that homoscedasticity occurs in the data and accept the alternative hypothesis instead. We conclude that the data exhibit heteroscedasticity, not homoscedasticity.

2.4.2. Graphical Diagnostic for checking Criteria 4 — The Residual by Fitted Values Scatterplot.

Using a scatter plot, let’s visually check whether heteroscedasticity exists in the data by plotting the residuals of the regression against the fitted values.

import matplotlib.pyplot as plt
#Prepare the residuals and fitted values of the regression.
residuals = fit_model.resid
fitted = fit_model.fittedvalues
plt.scatter(fitted, residuals)
plt.plot(fitted, [0]*len(fitted))
plt.xlabel("Fitted Values (Predicted values of y)")
plt.ylabel("Residuals (Actual y - Predicted y)")
plt.show()

Since the residuals do not lie in a band of constant width around the horizontal zero line but spread upwards, it is highly likely that the variances of the residuals are not equal or constant. Therefore, this plot suggests heteroscedasticity exists in the data.

Verdict for Criteria 4: There is no homoscedasticity in the data. Therefore, the fourth criterion is not met.

2.5. Criteria 5: The error terms are normally distributed

We will use both statistical and graphical diagnostic checks to see if the dataset and its variables meet Criteria 5

2.5.1. Statistical Test for checking Criteria 5 — Kolmogorov-Smirnov Test

We will assess the normality of the error terms using the Kolmogorov-Smirnov statistic. The null hypothesis for the Kolmogorov-Smirnov test is that the error terms are normally distributed, i.e. come from a normal distribution

#perform Kolmogorov-Smirnov test for normality
from scipy.stats import kstest

#the residuals are compared against a standard normal distribution
kstest(residuals, 'norm')
KstestResult(statistic=0.6656366230671799, pvalue=0.0)

Since the p-value = 0.0 (< 0.05), we reject the null hypothesis and accept the alternative hypothesis: the error terms are not from a normal distribution.

2.5.2. Graphical Diagnostic for checking Criteria 5 — The Normal Q-Q Plot

We will visually check if the result of the statistical test makes sense using the normal Q-Q plot.

import statsmodels.api as sm
import matplotlib.pyplot as plt

residuals = fit_model.resid
sm.qqplot(residuals)
plt.show()

Since the plotted points form more of an S shape than a straight line, the error terms are most likely not normally distributed. This supports the statistical finding that the error terms are not normally distributed.

Verdict for Criteria 5: Both the statistical test and the graphical diagnostic show that the error terms are not normally distributed. The fifth criterion is not met.

2.6. Criteria 6: There should not be Significant Outliers in the data

We will use both statistical and graphical diagnostic checks to see if there are outliers in the dataset.

2.6.1. Statistical Test for checking Criteria 6 — Cook’s Distance

Cook’s distance will be used to check for outliers. If outliers exist in the data, an inquiry must be made into their nature.

#suppress scientific notation

import numpy as np
np.set_printoptions(suppress=True)


#create an instance of influence
influence = fit_model.get_influence()

#obtain Cook's distance for each observation
cooks_distance = influence.cooks_distance

#display Cook's distances
print(cooks_distance[0])
[0.00000261 0.0000017  0.00000018 ... 0.00000174 0.00000283 0.00049528]
#summary of Cook's distance influence
summary_cooks=influence.summary_frame()
summary_cooks
#list out identified outliers
Cooks_outliers = summary_cooks[summary_cooks['cooks_d'] > (4/29992) ]
Cooks_outliers

According to Cook's distance, 1446 of the 29992 observations are outliers. There may be a need to investigate the nature of these outliers further. Questions to ask include: are these systemic, contextual, and acceptable outliers, or did they happen by mistake? For this dataset, it is quite unlikely that they are actual mistakes; the origin of the dataset (DHS) makes that improbable. Recall: according to the rule of thumb, outliers have Cook's distance values higher than 4/n, where n is the sample size.

2.6.2. Graphical Diagnostic for checking Criteria 6 — Boxplot

We will visually check if there are outliers in the dataset, as indicated by Cook’s distance

#run the boxplots
ndhs2018_respondent_with_child.boxplot(figsize = (10, 7))

As shown by the boxplot above, there are indeed outliers in the dataset, and almost all of them are found in the variable “Percent_children_dead”

Verdict for Criteria 6: Both the statistical test and the graphical diagnostic show that there are outliers in the dataset. Whether or not to remove outliers depends on the context and circumstances of the variable(s) concerned and on the goal of the analysis. The sixth criterion is not met.

2.7. Criteria 7: No auto-correlation of the error terms

We will use both statistical and graphical diagnostic checks to see if the error terms are autocorrelated.

2.7.1. Statistical Test for checking Criteria 7 — Durbin-Watson Test

We will use the Durbin-Watson test to check if the error terms are autocorrelated

#import the durbin_watson statistic
from statsmodels.stats.stattools import durbin_watson as dwtest
dwtest(residuals)
1.8366305102662541

Since the value of the Durbin-Watson statistic is close to 2, one can say that there is no autocorrelation of the error terms. The rule of thumb for the Durbin-Watson statistic states that a value near 2 indicates evidence of non-autocorrelation, a value towards 0 indicates evidence of positive autocorrelation, and a value towards 4 indicates evidence of negative autocorrelation. The Durbin-Watson test statistic (d) always ranges between 0 and 4.

2.7.2. Graphical Diagnostic for checking Criteria 7 — The Residual by Successive Residual Scatter plot (LagPlot of the Residual)

We will visually check whether the residuals are autocorrelated by drawing a lag plot of the residuals.


import pandas as pd
import matplotlib.pyplot as plt

# Draw a lag plot of the residuals
pd.plotting.lag_plot(residuals, lag=1)
plt.title("Residuals lag plot with lag=1")
plt.show(block=True)

Based on the graph above, there is no clear pattern or shape (linear, quadratic, etc.) in the scatter, suggesting that no autocorrelation exists between the residuals.

Verdict for Criteria 7: Both the statistical test and the graphical diagnostic show that autocorrelation does not exist between the residuals. Hence, the seventh criterion is met.

5. Conclusion

To use linear regression for prediction, the 7 criteria must first be met by the variables involved. If any of the 7 criteria is not met, there are ways to fix it, depending on the context and the specific criterion violated. Fixes range from removing the variables responsible for the violation, to transforming variables with functions such as logarithms, to encoding categorical variables as numerical variables.

For the criteria analysis conducted in Python, only four of the 7 criteria assessed were met. The four criteria that were met are: all variables are numeric; a linear relationship exists between each independent variable and the dependent variable; no multicollinearity exists between any pair of independent variables; and the residuals are not autocorrelated. The three that were not met are: homoscedasticity in the data, normally distributed error terms, and the absence of significant outliers.

