1. Import and Convert Our Dataset into a Pandas DataFrame

In [1]:
# Import the pandas library into our Jupyter notebook
import pandas as pd
In [2]:
# Load the NDHS 2018 dataset (a Stata .DTA file) into a pandas DataFrame
path = r"C:\Users\XXXXXXXXXXX\NDHS 2018\NGIR7AFL.DTA"
DHS_Dataset = pd.read_stata(path)

Key Insight - NDHS 2018 stands for the Nigeria Demographic and Health Survey conducted in 2018. It is a nationally representative survey carried out roughly every five years through the collaborative efforts of USAID, the Government of Nigeria, and other development organizations (WHO, UN agencies, etc.).

2. Check the Number of Observations and Variables in the Pandas DataFrame

In [3]:
# Determine the number of rows (observations) and columns (variables or features) in the data
DHS_Dataset.shape
Out[3]:
(41821, 5394)
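
With 5,394 columns the full file is very wide, while this demonstration only needs v190. As a minimal sketch, assuming only a few variables are wanted, `pd.read_stata` accepts a `columns` argument that restricts which variables are loaded. The toy file, its values, and the use of `v012` as a second column are made up for illustration so that the snippet runs without the NDHS file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the real survey file (values are made up).
toy = pd.DataFrame({
    "v012": [23, 35, 41, 19, 28],
    "v190": pd.Categorical(
        ["poorest", "richer", "middle", "richest", "poorer"],
        categories=["poorest", "poorer", "middle", "richer", "richest"],
        ordered=True,
    ),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small Stata file, then read back only the column of interest.
    toy_path = os.path.join(tmp_dir, "toy.dta")
    toy.to_stata(toy_path, write_index=False)
    subset = pd.read_stata(toy_path, columns=["v190"])

print(subset.shape)
```

On the real file the same idea would be `pd.read_stata(path, columns=["v190"])`, which should reduce load time and memory use.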

3. Identify The Categorical Variable (v190) For This Demonstration

In [4]:
DHS_Dataset['v190'].name
Out[4]:
'v190'
In [5]:
DHS_Dataset['v190'].dtypes
Out[5]:
CategoricalDtype(categories=['poorest', 'poorer', 'middle', 'richer', 'richest'], ordered=True)
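
Because v190 is an ordered categorical, its levels and their order can be inspected through the `.cat` accessor before recoding. The sketch below uses a toy Series with made-up values (not NDHS data), and the names `levels`, `wealth`, and `below_middle` are hypothetical.

```python
import pandas as pd

levels = ["poorest", "poorer", "middle", "richer", "richest"]

# Toy stand-in for DHS_Dataset['v190'] (values are made up).
wealth = pd.Series(
    pd.Categorical(
        ["poorest", "richer", "middle", "richest", "poorer", "poorer"],
        categories=levels,
        ordered=True,
    ),
    name="v190",
)

print(list(wealth.cat.categories))  # the levels, from lowest to highest
print(wealth.cat.ordered)           # whether the levels carry an order

# Ordered categoricals support comparisons against one of their levels.
below_middle = wealth < "middle"
print(int(below_middle.sum()))
```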

4. Merge Richest & Richer into Rich, and Poorest & Poorer into Poor

In [6]:
# Build a nested dictionary of the form {column: {old_value: new_value}}:
# "poorest" and "poorer" map to "poor",
# "middle" stays "middle",
# and "richer" and "richest" map to "rich".
merge_v190 = {
    "v190": {
        "poorest": "poor",
        "poorer": "poor",
        "middle": "middle",
        "richer": "rich",
        "richest": "rich",
    }
}
In [7]:
# Apply the mapping to the v190 column, modifying the DataFrame in place
DHS_Dataset.replace(merge_v190, inplace=True)
In [8]:
# Count the observations in each merged category
DHS_Dataset['v190'].value_counts()
Out[8]:
rich      16869
poor      16093
middle     8859
Name: v190, dtype: int64
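
The three counts sum to 41,821, which matches the row count reported by `shape`, so no observation was lost in the recode. As an alternative sketch (not the method used above), the recode can be scoped to the single column and the result stored as an ordered categorical, so that poor < middle < rich is preserved. The toy Series and the names `recode`, `mapped`, and `merged` are hypothetical, and the values are made up.

```python
import pandas as pd

# Toy stand-in for DHS_Dataset['v190'] (values are made up).
wealth = pd.Series(
    pd.Categorical(
        ["poorest", "richer", "middle", "richest", "poorer", "poorer"],
        categories=["poorest", "poorer", "middle", "richer", "richest"],
        ordered=True,
    ),
    name="v190",
)

recode = {
    "poorest": "poor",
    "poorer": "poor",
    "middle": "middle",
    "richer": "rich",
    "richest": "rich",
}

# Convert to plain labels first, so the many-to-one mapping returns ordinary values.
mapped = wealth.astype(object).map(recode)

# Re-declare the three merged levels as an ordered categorical.
merged_dtype = pd.CategoricalDtype(["poor", "middle", "rich"], ordered=True)
merged = mapped.astype(merged_dtype)

print(merged.value_counts(sort=False))
```

Keeping the order explicit means later steps such as sorting or comparing wealth groups behave as expected.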