Be ‘Visibly’ sure there are clusters in that dataset before profiling what the main themes are.
Customer and audience segmentation are cost-effective ways of making evidence-based decisions by finding themes in unlabeled data. However, running a cluster analysis on a dataset and making decisions with it, without first verifying that clusters actually exist, can be damaging and costly.
In this article, we will explain how you can use two visualization algorithms, VAT and iVAT, to assess the clustering tendency of an unlabeled dataset in Python before commencing any cluster analysis.
To look at the different aspects of VAT and iVAT, we will cover the following sections:
1. The Concept of Cluster Tendency
2. Cluster Tendency Assessment Algorithms
3. The VAT and iVAT Algorithms
4. How to Evaluate The Cluster Tendency of a Dataset Using VAT and iVAT Test in Python
5. Conclusion
6. Summary
1. The Concept of Cluster Tendency
Cluster tendency assessment is the process of checking a dataset for the possible existence of clusters. It is meant to help us answer this critical question: ‘Are there clusters in this dataset based on our research question?’
It is essential to answer this question because the answer determines whether going ahead with cluster analysis (K-means, hierarchical, etc.) on the dataset is necessary or not.
So, how do we answer this question?
The simple answer: we deploy cluster tendency assessment algorithms.
2. Cluster Tendency Assessment Algorithms
Cluster tendency assessment algorithms are methods that assess datasets for the possible existence of cluster patterns. There are two main families of methods for conducting these assessments: statistical and visual. An example of a statistical method is the Hopkins test, while examples of visual methods are the Visual Assessment of Tendency (VAT) and the Improved Visual Assessment of Tendency (iVAT).
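To make the statistical family concrete, here is a minimal sketch of a Hopkins-style statistic written with NumPy only. It is an illustration under stated assumptions, not the implementation used later in this article: the function name `hopkins_statistic`, the sample size rule, and the toy datasets are all hypothetical, and the sketch uses plain (unpowered) distances. Conventions also differ between implementations (some report the complement), so always check which one a library uses.

```python
import numpy as np

def hopkins_statistic(X, sample_size=None, seed=0):
    """Hopkins-style statistic. In THIS sketch's convention, values near 1
    suggest clusters and values near 0.5 suggest uniformly scattered points."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = sample_size or max(1, n // 10)

    # m real points, sampled without replacement
    idx = rng.choice(n, size=m, replace=False)
    # m artificial points, drawn uniformly over the bounding box of the data
    artificial = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    # w: distance from each sampled real point to its nearest OTHER real point
    real_dist = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    real_dist[np.arange(m), idx] = np.inf  # ignore the distance to itself
    w = real_dist.min(axis=1)

    # u: distance from each artificial point to its nearest real point
    u = np.linalg.norm(artificial[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return float(u.sum() / (u.sum() + w.sum()))

# Hypothetical toy data: two tight blobs versus uniform noise
rng = np.random.default_rng(42)
clustered = np.vstack([
    rng.normal([0, 0], 0.3, size=(100, 2)),
    rng.normal([10, 10], 0.3, size=(100, 2)),
])
uniform = rng.uniform(0, 10, size=(200, 2))

h_clustered = hopkins_statistic(clustered)
h_uniform = hopkins_statistic(uniform)
print(round(h_clustered, 2), round(h_uniform, 2))
```

The clustered data should score much closer to 1 than the uniform data, which is the kind of signal a statistical tendency test gives before any clustering is run.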
3. The VAT and iVAT Algorithms
3.1. VAT (Visual Assessment of Tendency)
VAT is a visual method of assessing the clustering tendency of a dataset. It suggests how many clusters may be present in a dataset, and also shows whether there are clusters within clusters (cluster hierarchies), by displaying dense dark squares along the main diagonal of a square map. The algorithm works by computing the pairwise dissimilarity between observations and then reordering the observations along a minimum spanning tree, so that observations that are close to each other end up next to each other. The reordered dissimilarity matrix is displayed as an image in which low dissimilarity appears dark, so each group of mutually close observations shows up as a dark square.
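The reordering idea described above can be sketched in a few lines of NumPy. This is a simplified illustration of a Prim-style ordering, not the code inside any particular library; the function name `vat_order` and the synthetic three-blob dataset are assumptions made for the example.

```python
import numpy as np

def vat_order(D):
    """Return a VAT-style ordering of a square dissimilarity matrix D."""
    n = D.shape[0]
    # Start from one end of the largest dissimilarity in the matrix
    start = int(np.unravel_index(np.argmax(D), D.shape)[0])
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    order = [start]
    # closest[j] = smallest dissimilarity between j and any visited point
    closest = D[start].astype(float).copy()
    for _ in range(n - 1):
        closest[visited] = np.inf          # never pick a visited point again
        j = int(np.argmin(closest))        # Prim-style step: nearest unvisited point
        order.append(j)
        visited[j] = True
        closest = np.minimum(closest, D[j])
    return np.array(order)

# Hypothetical data: three well-separated blobs, rows shuffled
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
labels = np.repeat([0, 1, 2], 20)
X = centers[labels] + rng.normal(0, 0.3, size=(60, 2))
shuffle = rng.permutation(60)
X, labels = X[shuffle], labels[shuffle]

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean
order = vat_order(D)
R = D[np.ix_(order, order)]  # reordered matrix: this is what gets drawn

# After reordering, each blob occupies one contiguous run of rows
jumps = int(np.sum(np.diff(labels[order]) != 0))
print(jumps)
# With matplotlib, plt.imshow(R, cmap="gray") would draw low dissimilarity as dark pixels
```

Because each blob ends up in one contiguous run of rows and columns, the low within-blob dissimilarities form one dark square per blob along the diagonal.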
Please note: not every dark region on the map is a square on the diagonal. Because the same ordering is applied to the rows and columns of the n × n dissimilarity matrix, the map itself is always square and the blocks on the main diagonal are square too. Dark regions away from the diagonal can be rectangular; they indicate that two groups lie close to each other rather than marking additional clusters.
Also, VAT and iVAT should not be seen as substitutes for methods (e.g., the Elbow and Silhouette methods) that are specifically designed to determine the number of clusters in a dataset. The main function of the VAT and iVAT algorithms is to suggest visually whether clusters exist in a dataset, so as to avoid the cost of conducting cluster analysis on datasets that have none. That is, VAT and iVAT maps can be used to validate and reinforce the main methods that estimate the number of clusters, but they should not be used as substitutes.
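For contrast, here is a rough sketch of how a dedicated method such as the Silhouette method picks the number of clusters. Everything in it is an assumption made for illustration: the tiny deterministic k-means (`farthest_first_kmeans`) stands in for a real clustering routine, `mean_silhouette` is a hand-written version of the silhouette score, and the three-blob data is synthetic. In practice you would more likely rely on scikit-learn's `KMeans` and `silhouette_score`.

```python
import numpy as np

def farthest_first_kmeans(X, k, n_iter=20):
    """Tiny k-means with deterministic farthest-first initialisation (illustrative)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def mean_silhouette(X, labels):
    """Average silhouette width over all points."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical data: three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [10, 0], [0, 10])])

scores = {k: mean_silhouette(X, farthest_first_kmeans(X, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)
```

A VAT or iVAT map showing the same number of dark squares as `best_k` would then reinforce the estimate rather than replace it.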
In general, the way to read the number of clusters from a VAT/iVAT map is to count the dense dark squares in the printed map.
Example of VAT maps: the figure below shows four maps (3a to 3d) with differing numbers of dark squares representing clusters. For illustration, the first map (3a) has three clearly visible dark squares, so it suggests three clusters. The last map (3d) has two dark squares, so it suggests two clusters.
3.2. iVAT (Improved Visual Assessment of Tendency)
iVAT is also a visual method of assessing the clustering tendency of a dataset. It is an improved variant of VAT that produces sharper and clearer dark squares/rectangles in its printed map. However, it costs more computing time.
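One way to see why iVAT maps look sharper: iVAT is usually described as replacing each dissimilarity with a path-based (minimax) distance before the VAT-style reordering. This article does not spell out that step, so treat the following as a sketch under that assumption; the function name `minimax_transform` and the one-dimensional toy points are hypothetical.

```python
import numpy as np

def minimax_transform(D):
    """Path-based distance: for each pair, the smallest possible 'largest hop'
    over all paths connecting them (a Floyd-Warshall style computation)."""
    P = np.array(D, dtype=float)
    for k in range(P.shape[0]):
        # Going from i to j through k costs the larger of the two legs
        via_k = np.maximum(P[:, [k]], P[[k], :])
        P = np.minimum(P, via_k)
    return P

# Hypothetical points on a line: a chain 0, 1, 2, 3 and a distant pair 10, 11
positions = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0])
D = np.abs(positions[:, None] - positions[None, :])
P = minimax_transform(D)

print(D[0, 3], P[0, 3])  # 3.0 1.0 : chained neighbours pull the group together
print(D[0, 4], P[0, 4])  # 10.0 7.0 : the gap between the groups remains large
```

Within a group, the transformed distances shrink to the size of the largest hop between neighbours, while the gap between groups stays large. That contrast is what makes the dark squares crisper, and the cubic-time loop hints at why the extra step costs more computing time.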
Just as with the VAT algorithm, the dark squares on the main diagonal represent the clusters, while any dark regions away from the diagonal may be rectangular and indicate groups that lie close to each other.
Example of iVAT maps: the figure below shows three maps (counting from the left) with differing numbers of dark squares representing clusters. For illustration, the first map from the left shows three clusters (though there are clusters within clusters too), and the second has three dark squares, so it also suggests three clusters.
4. How to Evaluate The Cluster Tendency of a Dataset Using VAT and iVAT Test in Python
In Python, VAT and iVAT assess the cluster tendency of a dataset visually, using a dissimilarity matrix. A dissimilarity matrix can be built for both numerical and categorical data types, since there are dissimilarity measures suited to each.
Now, let’s talk briefly about dissimilarity (and similarity) for those who are keenly interested in knowing what it is.
4.1. The Basic Concept Of Dissimilarity and Similarity
Dissimilarity is a numerical measure of how different two data objects/points are. The farther apart two data points are, the higher their dissimilarity, and vice versa. ‘Similarity’, a numerical measure of how alike two data objects/points are, is the direct opposite of dissimilarity. The closer two data objects are, the higher their similarity, and vice versa.
Therefore, if two data points have a high similarity value, they will have a low dissimilarity value; if they have a high dissimilarity value, they will have a low similarity value. This simply means that dissimilarity and similarity have an inverse relationship.
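A small numeric example may help here. The sketch below uses Euclidean distance as the dissimilarity and 1 / (1 + d) as one common way to turn it into a similarity; both choices, the helper names, and the three points are assumptions made for illustration.

```python
import numpy as np

def dissimilarity(p, q):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))

def similarity(p, q):
    """One possible conversion: 1 when the points coincide, approaching 0 as they move apart."""
    return 1.0 / (1.0 + dissimilarity(p, q))

# Hypothetical points: a and b are neighbours, c is far away
a = np.array([1.0, 2.0])
b = np.array([1.5, 2.5])
c = np.array([9.0, 8.0])

print(dissimilarity(a, b), similarity(a, b))  # small dissimilarity, high similarity
print(dissimilarity(a, c), similarity(a, c))  # large dissimilarity, low similarity
```

The pair (a, b) has the lower dissimilarity and the higher similarity, while the pair (a, c) shows the reverse, which is the inverse relationship described above.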
4.2. Dissimilarity and Clustering
Within the context of the VAT and iVAT algorithms in Python, very low dissimilarity among a group of data points produces a dense dark square/rectangle, indicating the existence of a cluster, while very high dissimilarity simply means the points are too far apart to be in the same cluster.
4.3. VAT and iVAT Algorithms in Python
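To see where the dark squares come from, the following sketch builds a dissimilarity matrix for two hypothetical tight groups of points and compares the within-group entries with the between-group entries. The data and variable names are made up for the illustration.

```python
import numpy as np

# Hypothetical data: two tight groups of five points each
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.3, size=(5, 2))
group_b = rng.normal(8.0, 0.3, size=(5, 2))
X = np.vstack([group_a, group_b])

# Pairwise Euclidean dissimilarity matrix (10 x 10)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Diagonal blocks hold within-group values, the off-diagonal block holds between-group values
within = np.concatenate([D[:5, :5].ravel(), D[5:, 5:].ravel()])
between = D[:5, 5:].ravel()

print(round(within.mean(), 2), round(between.mean(), 2))
print(np.round(D, 1))
```

The two 5 × 5 diagonal blocks contain only small values, which a map would draw as two dark squares, while the off-diagonal block contains large values and would appear light.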
To implement the VAT (and iVAT) algorithms in Python, we will carry out the following steps, in order, in our Anaconda environment:
- Import the dataset into the Anaconda working environment as a dataframe through pandas;
- Import the VAT and iVAT algorithms into our working environment from the pyclustertend module;
- Convert any categorical feature in the dataset into a numerical feature using either one-hot encoding (dummy coding, if the feature is nominal) or label encoding (if the feature is ordinal);
- Convert the dataframe into a NumPy array;
- Feed the algorithms with the NumPy array; and
- Get the results of the algorithms as maps.
Running the algorithms on our dataset will produce a square ordered dissimilarity map containing the dark squares (or rectangles).
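Before turning to the real dataset, the encoding and conversion steps above can be tried on a toy dataframe. This is only a sketch: the column names and values are invented and are not from the NDHS file, and an explicit rank mapping stands in for label encoding so that the order of the ordinal categories is under our control.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the NDHS dataset)
df = pd.DataFrame({
    "age": [15, 17, 19, 18],
    "residence": ["urban", "rural", "urban", "rural"],          # nominal
    "education": ["none", "primary", "secondary", "primary"],   # ordinal
})

# Ordinal feature: map each category to its rank
education_rank = {"none": 0, "primary": 1, "secondary": 2}
df["education"] = df["education"].map(education_rank)

# Nominal feature: one-hot (dummy) encoding
df = pd.get_dummies(df, columns=["residence"])

# Convert the dataframe to a NumPy array of floats (dummy columns become 0/1)
data_array = df.to_numpy(dtype=float)
print(list(df.columns))
print(data_array)

# The array is then what gets passed to the tendency functions, for example:
# from pyclustertend import vat, ivat
# vat(data_array)
# ivat(data_array)
```

The last three lines are left as comments so that the sketch runs without pyclustertend installed.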
4.4. Practical Session: Implementing the Algorithms
4.4a. Import the Packages
# import the packages
import pandas as pd  # working with data
import numpy as np  # working with arrays
from pyclustertend import vat
from pyclustertend import ivat
4.4b. Import the Dataset
# import the dataset into Anaconda through pandas
# Specify the path to the Stata dataset on your computer so as to import it into pandas
path = r"B:Current_Contraceptive_Users_Modified_NDHS2018_15_19.dta"
# Read the Stata dataset into a pandas dataframe named Young_Contraceptive_Users
Young_Contraceptive_Users = pd.read_stata(path)
4.4c. Convert Categorical Features to Numerical Features
There are 5 nominal (not ordinal) features in our dataset; we will convert them to numerical features using one-hot encoding (dummy coding).
# convert the categorical features in the dataset to numerical
Young_Contraceptive_Users = pd.get_dummies(
    Young_Contraceptive_Users,
    columns=['v025', 'v106', 'v130', 'MaritalStatus', 'SES'],
)
4.4d. Convert Pandas Dataframe to Numpy Array
# convert the pandas dataframe to a NumPy array (as floats, so the dummy columns become 0/1)
Young_Contraceptive_Users_array = Young_Contraceptive_Users.to_numpy(dtype=float)
4.4e. Feed the Array Data to the VAT Algorithm
# feed the array data to the VAT (Visual Assessment of Tendency) algorithm
vat(Young_Contraceptive_Users_array)
4.4f. Feed the Array Data to the iVAT Algorithm
# feed the same array data to the improved VAT (iVAT) algorithm
ivat(Young_Contraceptive_Users_array)
5. Conclusion
From the maps produced by both algorithms (VAT and iVAT), we observe that the Young Contraceptive Users dataset has clusters and should therefore be explored and profiled with cluster analysis. K-Prototypes cluster analysis is a convenient choice for this dataset since it has both numerical and categorical features (mixed data types). Also, as expected, the iVAT map appears clearer and more precise than the VAT map. The maps suggest that there are at least about seven clusters in the dataset.
Remember, there are specific methods designed to help you identify the number of clusters. The VAT and iVAT algorithms are just meant to help us decide whether to go ahead with cluster analysis. Though VAT and iVAT maps can be used to validate and reinforce the main methods that estimate the number of clusters, they should not be used as substitutes.
6. Summary
1. Visual Assessment of Tendency (VAT) and its iVAT variant are used to suggest, visually, whether there is a need to conduct cluster analysis on your dataset.
2. VAT and its variant should not be used on their own to determine the number of clusters in a dataset. Though they can be used to validate and reinforce the main methods that estimate the number of clusters, they should not be used as substitutes. Specific methods, such as the Elbow method and the Silhouette method, are available to help you do that.
3. Improved Visual Assessment of Tendency (iVAT) is the improved version of VAT because it produces more precise and clearer images. However, it costs more computing time; that is, it takes longer to run than VAT.
That’s it. I hope you find this useful. Please drop your comments and follow me on LinkedIn at Ayobami Akiode.
The Jupyter notebook used for this analysis is available here
Data source:
The data used in this article is freely available secondary data from the Demographic and Health Survey (DHS) for Nigeria collected in 2018, called NDHS 2018. The DHS is a nationally representative survey conducted in developing countries. In Nigeria, the survey is routinely conducted every 5 years and covers all the states of Nigeria, including the capital.
To request any country-level data from the DHS Program free of charge, kindly visit the DHS Program website, sign up, and submit a request.
Using Visualization Algorithms (VAT & iVAT) To Assess Cluster Tendency was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.