Visualization alone is never enough to conclude that a relationship exists between two variables. It is best to confirm it statistically.
When preparing data for modeling or analysis, a common practice is to plot the target feature (Y) against each predictor feature (X) to get a rough sense of whether a relationship or association exists between them. However, visualization alone is never enough to conclude that a relationship or association exists between two variables. The preferred standard is to combine the bivariate visualization with a statistical test.
The gold standard: “Use data visualization tools (charts or two-way tables) to visualize the possible existence of a relationship or association between the two features, then conduct a statistical test to confirm whether what the visual pattern suggests is true.”
In this article, I will take you through the following:
- The whole concept of bivariate analysis;
- Types of bivariate analysis;
- Selecting the appropriate statistic based on Parametric status and Type of variable.
1. The Concept of Bivariate Analysis
Bivariate analysis is the empirical analysis of two variables to ascertain whether a relationship or association exists between them. It uses visual methods, statistical tests, or preferably both. Bivariate analysis can be carried out as part of a multivariate analysis or as a stand-alone analysis.
As part of multivariate analysis, it is a key pre-processing step before modeling. It checks that each predictor feature has a relationship or association with the target feature before it is plugged into a predictive model (logistic regression, support vector machine, decision tree, etc.). A predictor feature that shows no relationship with the target feature (Y) during bivariate analysis is typically excluded from the predictive modeling process.
As a stand-alone analysis, you may simply want to test for an association, relationship, or difference between two variables.
The framework below highlights the conceptual representation of bivariate analysis and its components:
2. Types of Bivariate Analysis
Bivariate analysis has two components — Descriptive and Inferential.
2.1 Descriptive Bivariate analysis
This is the visual part of bivariate analysis. It can be presented as a two-way table (frequency distribution, percentage distribution) or as a graph (bar chart, scatterplot, boxplot, etc.). It suggests a relationship or association between two variables based on the visual pattern.
2.1.1 Two-way Table Bivariate Analysis is the tabular (or matrix) representation of the relationship between two variables. A two-way table, also called a contingency table, shows the observed counts for each combination of categories of the two variables. The table has two sets of headers: one along the first row and the other down the first column. One set of headers holds the response-categories of one variable, and the other set holds the response-categories of the other variable.
The table below shows the two-way table of two variables, Gender and Economic Class. The response-categories of Gender (Male, Female) are in the first column, and the response-categories of Economic Class (Lower-class, Middle-class, Upper-class) are in the first row of the table.
Types of Two-way table
Frequency and percentage distributions: when a two-way table holds the actual counts, as above, it is called a frequency distribution; when the counts are expressed as percentages, it is called a percentage distribution.
The three two-way tables below show the frequency and percentage distributions of two variables. Table 1 shows the frequency distribution, while Tables 2 and 3 show percentage distributions. Table 2 shows the column percentage distribution, i.e. the values are converted to percentages down each column, so each column sums to 100%. Table 3 shows the row percentage distribution, i.e. the values are converted to percentages along each row, so each row sums to 100%.
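The three kinds of two-way table described above can be sketched with pandas. This is a minimal illustration, assuming pandas is available; the records and the column names `gender` and `economic_class` are made up for the example and are not the article's data.

```python
import pandas as pd

# Hypothetical survey records (made up for illustration, not the article's data).
records = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Female",
               "Female", "Female", "Female", "Female", "Male"],
    "economic_class": ["Lower", "Middle", "Middle", "Upper", "Lower",
                       "Lower", "Middle", "Upper", "Upper", "Lower"],
})

# Frequency distribution: raw counts for each gender/class combination.
freq = pd.crosstab(records["gender"], records["economic_class"])

# Column percentage distribution: each column sums to 100%.
col_pct = pd.crosstab(records["gender"], records["economic_class"],
                      normalize="columns") * 100

# Row percentage distribution: each row sums to 100%.
row_pct = pd.crosstab(records["gender"], records["economic_class"],
                      normalize="index") * 100

print(freq)
print(col_pct.round(1))
print(row_pct.round(1))
```

Column percentages answer "within each economic class, what share is male or female?", while row percentages answer "within each gender, how are the economic classes distributed?".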
2.1.2 Graphical Bivariate Analysis is the diagrammatic representation of the relationship between two features (variables). The type of graph depends on the types of the two features (numerical, categorical, or ordinal).
Types of Graphical Bivariate Analysis
A boxplot is the graphical representation of the relationship between a categorical variable and a continuous variable. By convention, the continuous variable is placed on the vertical axis and the categorical variable on the horizontal axis.
A boxplot displays the 5-number summary: minimum (Q0, 0th percentile), lower quartile (Q1, 25th percentile), median (Q2, 50th percentile), upper quartile (Q3, 75th percentile), and maximum (Q4, 100th percentile). It also helps identify outliers in the continuous variable. In most plotting tools the whiskers stop at the most extreme values lying within 1.5 × IQR (the interquartile range, Q3 - Q1) of the quartiles, and values beyond the whiskers are drawn as individual points and treated as outliers. A boxplot can also be used in univariate (single-variable) analysis.
During data preprocessing, identified outliers are often removed from the data because they can add noise to your analysis and can undermine the quality of your predictive model by causing underestimation or overestimation.
The interquartile range (IQR) in a boxplot is the distance between the upper and lower quartiles (Q3 - Q1). The IQR is a measure of dispersion that indicates how spread out the middle 50% of the values of a continuous variable are around the median. A larger IQR means more spread; a smaller IQR means less spread.
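The quartiles, the IQR, and the outliers a boxplot would flag can also be computed directly. The sketch below assumes NumPy and uses the common 1.5 × IQR rule (Tukey's rule) for the outlier fences; the values are made up, and NumPy's default percentile interpolation may give slightly different quartiles than other tools.

```python
import numpy as np

# Hypothetical measurements (made up for illustration); 95 is deliberately extreme.
values = np.array([12, 15, 17, 18, 20, 21, 23, 24, 26, 95])

# Quartiles using NumPy's default (linear) interpolation.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Tukey's rule: values more than 1.5 * IQR beyond the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("min, Q1, median, Q3, max:", values.min(), q1, median, q3, values.max())
print("IQR:", iqr)
print("outliers:", outliers.tolist())
```

Note that the flagged value is also the maximum of the data: an outlier lies beyond the whisker, not beyond the observed minimum or maximum.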
A scatterplot is the graphical representation of the relationship between two numerical variables. It uses dots to indicate the pattern of the relationship.
If the scatterplot suggests that a relationship exists, the relationship can be positive (a direct correlation) or negative (an inverse correlation).
A positive correlation is one in which an increase in one variable is accompanied by an increase in the other, and a decrease in one by a decrease in the other; the two variables move in the same direction. A negative correlation is one in which a decrease in one variable is accompanied by an increase in the other, and vice versa; the two variables move in opposite directions.
To visually assess whether a relationship exists and in what direction, it is best to add a trendline to the chart. The graphs below (from Khan Academy) suggest the direction of the relationship between the X and Y variables. The first graph from the left suggests a positive relationship, the middle graph a negative relationship, and the last graph no relationship.
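The direction a trendline suggests can be checked numerically. The sketch below assumes NumPy, fits a straight least-squares line, and reports its slope alongside Pearson's correlation coefficient; the data and the helper name `trend` are hypothetical.

```python
import numpy as np

# Hypothetical paired observations (made up for illustration).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y_up = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])    # rises with x
y_down = np.array([15.9, 14.2, 11.8, 10.1, 8.2, 5.9, 4.1, 2.0])  # falls as x rises

def trend(x, y):
    """Return the slope of the least-squares trendline and Pearson's r."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, r

slope_up, r_up = trend(x, y_up)
slope_down, r_down = trend(x, y_down)

print(f"rising series:  slope={slope_up:.2f}, r={r_up:.3f}")
print(f"falling series: slope={slope_down:.2f}, r={r_down:.3f}")
```

A positive slope and a positive r go with a positive relationship, and a negative slope and a negative r with a negative one. Whether the pattern is statistically significant still needs the inferential step.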
A line chart is the graphical representation of the relationship between a categorical variable and a numerical variable. It is similar to a scatterplot, except that the dots are linked in sequence by a line, following the order of the x-axis values. It shows how the values of the numerical variable change across the categories, typically over time periods, seasons, etc.
A bar chart is the graphical representation of the relationship between two categorical variables. Bars are used to indicate the pattern of the relationship. A bar chart can also be used in univariate (single-variable) analysis. There are two major types of bivariate bar charts: grouped and stacked bar charts.
A grouped bar chart shows the distribution of the response-categories of one categorical variable within each response-category of the other, using separate bars placed side by side.
A stacked bar chart shows the distribution of the response-categories of one categorical variable within each response-category of the other, using a single segmented bar per category.
A bivariate choropleth map is the geographical representation of the relationship between two variables. It ties the relationship to the locations where the two variables interact, so the relationship can be compared across areas by its degree and spread. A bivariate choropleth map uses a color-coded progression to indicate the response-categories of the two variables.
2.2 Inferential Bivariate Analysis
This is the application of statistical methods to determine whether a relationship, association, or difference exists between two variables. It can be used alone, but is best used to complement descriptive bivariate analysis.
“In any case, it is best to combine a descriptive bivariate analysis with an inferential bivariate statistical test.”
Inferential bivariate analysis is carried out in the following sequence:
1. Create the null and alternative hypotheses. The null hypothesis (H0) states that no relationship/association exists between the two variables/features under review. The alternative hypothesis (H1) states that a relationship/association exists between the two variables.
2. Select the appropriate statistic (chi-square, t-test, correlation, etc.) for the data and the types of variables.
3. Compute the selected statistic and its p-value (together with the confidence interval).
4. Decide between the hypotheses based on the p-value. At the conventional 5% significance level, if p-value ≤ 0.05, reject the null hypothesis (H0) in favor of the alternative hypothesis (H1); if p-value > 0.05, fail to reject the null hypothesis (H0).
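The four steps can be sketched for two categorical variables using a chi-square test of independence. This is a minimal illustration, assuming SciPy is available; the observed counts are made up, and the 0.05 threshold is the conventional significance level rather than a fixed rule.

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts (made up for illustration).
# Rows: Male, Female; columns: Lower-class, Middle-class, Upper-class.
observed = [[30, 45, 25],
            [50, 35, 15]]

# Step 1: H0 = gender and economic class are independent;
#         H1 = gender and economic class are associated.
# Step 2: both variables are categorical, so use a chi-square test of independence.
# Step 3: compute the statistic and its p-value.
chi2, p_value, dof, expected = chi2_contingency(observed)

# Step 4: compare the p-value with the chosen significance level.
alpha = 0.05
decision = "reject H0" if p_value <= alpha else "fail to reject H0"

print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.4f} -> {decision}")
```

A p-value above the threshold would not show that the variables are unrelated; it would only mean the data do not provide enough evidence against H0.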
There are two types of inferential bivariate statistics — Parametric and Non-parametric inferential bivariate analysis.
“The appropriate statistic to use is determined by the parametric status (parametric or non-parametric) of the data being assessed and the types of the variables involved.”
2.2.1 Parametric inferential bivariate analysis is used when the populations from which the two variables are drawn are normally distributed (bell-shaped). The parametric inferential statistic to use depends on the types of the variables.
“As a common rule of thumb, you can use parametric inferential statistics irrespective of the distribution if your sample size is greater than or equal to 30 (applying the central limit theorem), although strongly skewed data may need a larger sample.”
2.2.2 Non-parametric inferential bivariate analyses are statistical methods used when the distribution of the population from which the two variables are drawn is unknown or not normally distributed. They can also be used when the sample size is less than 30. As with parametric tests, the non-parametric statistic to use is determined by the types of the variables.
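One way to act on the parametric/non-parametric distinction is to check normality first and then choose the test. The sketch below is an assumption of this example rather than the article's procedure: it uses SciPy's Shapiro-Wilk test as the normality check, Welch's t-test as the parametric option, and the Mann-Whitney U test as the non-parametric one. The helper `compare_two_groups` and both samples are hypothetical.

```python
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-sample test based on a Shapiro-Wilk normality check."""
    _, p_normal_a = stats.shapiro(a)
    _, p_normal_b = stats.shapiro(b)
    if p_normal_a > alpha and p_normal_b > alpha:
        # No evidence against normality in either sample: parametric test.
        _, p = stats.ttest_ind(a, b, equal_var=False)
        return "Welch t-test", p
    # At least one sample looks non-normal: rank-based non-parametric test.
    _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney U", p

# Hypothetical samples (made up for illustration); group_b is strongly right-skewed.
group_a = [48, 52, 50, 47, 53, 51, 49, 50, 52, 48, 51, 49, 50, 53, 47]
group_b = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 45, 90]

test_name, p_value = compare_two_groups(group_a, group_b)
print(test_name, p_value)
```

A formal normality test is only one possible criterion; histograms, Q-Q plots, and the sample size are often weighed alongside it.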
Types of Variables
Qualitative: variables naturally captured as open-ended text, with a possibly infinite number of responses; that is, their responses are not naturally codified or categorized. These types of variables are mostly handled in qualitative research and unsupervised learning.
Quantitative: variables naturally captured as closed-ended, with a finite set of response-categories; that is, their responses are codified or categorized. The two types of quantitative variables are numerical and categorical.
Types of Quantitative variable
Numerical variables: quantitative variables that are naturally captured as numbers. Examples are age, number of countries, height, weight, temperature, etc.
Categorical variables: quantitative variables that are naturally captured with a finite set of possible response-categories. For example, the possible response-categories for Gender are Male and Female; the possible response-categories for Economic Class are Lower-class, Middle-class, and Upper-class.
Types of Numerical variable
Continuous variables: numerical variables that can naturally be captured with decimals. They take real-number values, and their values are obtained by measurement. Examples are height, weight, temperature, etc.
Discrete variables: numerical variables that cannot naturally be captured with decimals, i.e. they take integer values only. Their values are obtained by counting. Examples are age in completed years, number of games played, and number of Arsenal players on loan.
Types of Categorical variable
Nominal variables: categorical variables that have a finite set of response-categories with no natural ordering. Examples are type of place of residence (Rural, Urban) and states in Nigeria.
Ordinal variables: categorical variables that have a finite set of naturally ordered response-categories. An ordinal variable places each category along a spectrum and represents each response-category with a number. That is, it uses numbers to indicate an increase or decrease in the degree of a qualitative phenomenon (e.g. a satisfaction measure, the position of students in a class based on grades, etc.). Rating and ranking the phenomenon being measured are the basis of ordinal variables. The difference between any two of these representative numbers does not measure the quantitative difference in the underlying phenomenon. Examples of ordinal variables are the Likert scale and the semantic differential scale.
Binary variables: categorical variables that have exactly two response-categories with no natural ordering. A binary variable is a nominal variable with only two response-categories. Examples are Gender (Male, Female) and type of place of residence (Rural, Urban).
Interval variables: a special type of numerical variable whose values can be ranked, like an ordinal variable, but where the distance between any two adjacent values is standardized to be equal. Also, the zero value is arbitrary rather than absolute, i.e. a zero does not mean the phenomenon is absent; it is just another value on the scale. There can also be values below zero. Examples are temperature in Celsius (2°C, 1°C, 0°C, -1°C, -2°C), the pH scale (below 7: acidic, 7: neutral, above 7: alkaline), and time.
Ratio variables: numerical variables that have the characteristics of an interval variable and also have an absolute zero. That is, a zero means the phenomenon is absent. Examples are weight, age, flow rate, and money.
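The difference between nominal and ordinal variables shows up directly in how they can be encoded. The sketch below assumes pandas and uses hypothetical survey answers: the ordinal variable is stored as an ordered categorical, so ranking and comparison work, while the nominal variable carries no order.

```python
import pandas as pd

# Ordinal: categories have a natural order (hypothetical satisfaction ratings).
satisfaction = pd.Categorical(
    ["Low", "High", "Medium", "Low"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

# Nominal: categories have no natural order (hypothetical places of residence).
residence = pd.Categorical(["Rural", "Urban", "Urban", "Rural"])

# Integer codes follow the declared order; they are ranks, not measured distances.
print(satisfaction.codes.tolist())

# Order-based operations are meaningful only for the ordered variable.
print(satisfaction.max())
print((satisfaction > "Low").tolist())
print(residence.ordered)
```

The codes 0, 1, 2 record only the ranking: the gap between "Low" and "Medium" is not assumed to equal the gap between "Medium" and "High".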
3. Selecting the appropriate Statistical Test based on Parametric Status and Type of Variable Compatibility
The table below indicates the appropriate statistical test to use, based on parametric status and type of variables.
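The same kind of selection can be expressed as a small lookup. The pairings below are commonly taught defaults and are an assumption of this sketch, not a reproduction of the article's table; the type labels (`numerical`, `binary`, `categorical`) and the helper `suggest_test` are hypothetical names.

```python
# Commonly taught pairings (an assumption of this sketch, not the article's table).
# Keys: (type of one variable, type of the other, parametric?).
# "binary" means a categorical variable with two groups; "categorical" with
# a numerical variable means three or more groups.
COMMON_TESTS = {
    ("numerical", "numerical", True): "Pearson correlation",
    ("numerical", "numerical", False): "Spearman rank correlation",
    ("binary", "numerical", True): "Independent-samples t-test",
    ("binary", "numerical", False): "Mann-Whitney U test",
    ("categorical", "numerical", True): "One-way ANOVA",
    ("categorical", "numerical", False): "Kruskal-Wallis test",
    ("categorical", "categorical", False): "Chi-square test of independence",
}

def suggest_test(x_type, y_type, parametric):
    """Look up a commonly used test; the order of the two types does not matter."""
    for key in ((x_type, y_type, parametric), (y_type, x_type, parametric)):
        if key in COMMON_TESTS:
            return COMMON_TESTS[key]
    return None  # no entry in this illustrative lookup

print(suggest_test("numerical", "binary", True))
print(suggest_test("categorical", "categorical", False))
```

A lookup like this is only a starting point; each test has further assumptions (for example, expected cell counts for chi-square) that should be checked before relying on the result.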
Summary
In this article —
- We conceptualized the different components of bivariate analysis.
- We re-emphasized that the gold standard for bivariate analysis is: “Use data visualization tools (charts or two-way tables) to visualize the possible existence of a relationship or association between the two features, then conduct a statistical test to confirm whether what the visual pattern suggests is true.”
- We indicated that choosing the right bivariate statistical test depends on the parametric status of your sample data and the types of the two variables.
- We highlighted, using a table, the different bivariate statistical tests by parametric status and types of variable.
That's it. I hope you find this useful. Please drop your comments and follow me on LinkedIn (Ayobami Akiode).
Visualization not enough to assess relationship between 2 variables; Combine with Statistical test was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.