Clustering on the World Happiness Report 2019
Overview¶
This analysis attempts to group countries based on nothing more than the professed happiness of their citizens. To do so, we will use data from the 2019 World Happiness Report to build several clustering models. The primary models discussed in this analysis are:
- K-means
- Agglomerative Clustering
- Affinity Propagation
- Gaussian Mixture
Additionally, these models will be briefly demonstrated:
- DBSCAN
- HDBSCAN
Each model will be visualized in 3 different forms:
- A scatter plot using unaltered data
- A scatter plot using scaled data
- A Boxplot of the unaltered data
Glossary:
- GWP - Gallup World Poll
- WVS - World Value Surveys
Imports¶
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, DBSCAN
from sklearn.mixture import GaussianMixture
import hdbscan
# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics.cluster import silhouette_score, silhouette_samples, calinski_harabasz_score, davies_bouldin_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer, InterclusterDistance
import geopandas
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
RS = 404 # Random state/seed
pd.set_option("display.max_columns",30) # Increase columns shown
Data¶
The 2019 World Happiness Report dataset may be obtained from
https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls
The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. This year’s World Happiness Report focuses on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.
It has 2 descriptor/ID variables (Country, Year), one response (Life Ladder), six proposed determinants of the response, and several additional variables that were either calculated or gathered from external sources.
Country
- Name of the country.
Year
- Year of data collection.
Life Ladder
- (survey,0-10) Cantril Life Ladder/Happiness score/subjective well-being. The national average response to the following question:
- "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"
Six Hypothesized Underlying Determinants:
Log_GDP
- (calculated-normalized,external), Log GDP per capita in purchasing power parity. Constant 2011 international dollar prices from World Development Indicators (November 14, 2018)
Life_Expectancy
- (partial-interpolated,external), Healthy life expectancies at birth are based on the data extracted from the World Health Organization's Global Health Observatory data repository
Social_support
- (survey,binary) National average response to:
- "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"
Freedom
- (survey,binary), National average response to:- "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"
Generosity
- (survey,binary,calculated-residual) is the residual of regressing national average of response to:- "Have you donated money to a charity in the past month?" on GDP per capita.
Corruption_Perception
- (survey,2x-binary), National average response to two questions:
- "Is corruption widespread throughout the government or not?"
- "Is corruption widespread within businesses or not?"
Additional inclusions:
Positive affect
- (survey,3x-binary), Average of three GWP positive affect measures (waves 3-7): happiness, laugh and enjoyment. Responses to the following three questions, respectively:
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Happiness?"
- "Did you smile or laugh a lot yesterday?"
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Enjoyment?"
Negative affect
- (survey,3x-binary), Average of three GWP negative affect measures: worry, sadness and anger. Responses to the following three questions, respectively:
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Worry?"
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Sadness?"
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Anger?"
"giniLadder" - (multi,calculated-metastat), Inequality/distribution statistics of happiness scores by WP5-year from the GWP release. WP5 is GWP's coding of countries, including some sub-country territories.
sdLadder
- Standard deviation of ladder by country-year
cvLadder
- Standard deviation/Mean of ladder by country-year
giniIncGallup
- (calculated-normalized,external), Household Income International Dollars. Income variables are created by converting local currency to International Dollars (ID) using purchasing power parity (PPP) ratios.
giniIncWB
- (partial,external), Unbalanced panel of yearly index. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments
giniIncWBavg
- (calculated,external), the average of giniIncWB in the period 2000-2016. Most countries are missing some gini index period data.
Confidence_natGovt
- (survey,binary,external), citizens' confidence in key institutions (WP139) Response to:
- "Do you have confidence in each of the following, or not? How about the national government?"
"WGI indicators of governance quality" - (survey, amalgam, calculated, external), based on over 30 individual data sources produced by a variety of survey institutes, think tanks, non-governmental organizations, international organizations, and private sector firms, enterprise, citizen and expert survey respondents.
Democratic Quality
- average of "Voice and Accountability" and "Political Stability and Absence of Violence"
Delivery Quality
- average of "Government Effectiveness", "Regulatory Quality", "Rule of Law", "Control of Corruption"
Expanded data:
trust_Gallup and trust_WVS*
- (survey,binary), Percentage of respondents with positive-trust response to:
- "Generally speaking, would you say that most people can be trusted or that you [have,need] to be [very] careful in dealing with people?"
Primary definitions:
https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix1.pdf
Definitions (Democratic Quality, Delivery Quality, Confidence_natGovt):
https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix2.pdf
Relevant files:
data/
- contains WHR2019.xls, the externally obtained data for this analysis.
whr = pd.read_excel('data/WHR2019.xls')
#whr = pd.read_csv('data/happiness_2016.csv')
whr.columns
# Shortened and cleaned names, most are derived from WHR2019 paper
full_colnames = [
'Country', 'Year', 'Life_Ladder',
'Log_GDP','Social_support', 'Life_Expectancy', 'Freedom', 'Generosity','Corruption_Perception',
'Positive_affect', 'Negative_affect',
'Confidence_natGovt', 'Democratic_Quality','Delivery_Quality',
'sdLadder','cvLadder',
'giniIncWB','giniIncWBavg','giniIncGallup',
'trust_Gallup',
'trust_WVS81_84','trust_WVS89_93','trust_WVS94_98','trust_WVS99_2004','trust_WVS2005_09','trust_WVS2010_14'
]
core_col = full_colnames[:9]
ext_col = full_colnames[:14] + full_colnames[17:19]
whr.columns = full_colnames
# Shorten and Clean names for dot access
whr.columns = whr.columns.str.replace('Most people can be trusted','trust_in_people')
whr.columns = whr.columns.str.replace(' ','_')
whr.columns = whr.columns.str.replace('[(),]','') # Strip parens and commas
whr.columns
whr.iloc[np.r_[0:3,-3:0]] # HeadTail
Exploratory Data Analysis¶
The dataset contains NA values; however, there are a few cases where a field's contribution to happiness is 0.0. This is likely a side effect of having a modeled rather than purely gathered dataset. One possibility is that if a country ranked the lowest for a particular characteristic, it was simply zeroed out.
In the report, each country's happiness score is decomposed into contributions from GDP per capita, social support, healthy life expectancy, freedom, generosity, perceived corruption, and a dystopia residual, within a margin of error.
Other than Country and Year, all of the variables are continuous floating point.
whr.info()
whr.isna().sum()
whr[whr[core_col].isna().any(axis=1)].shape # 188 entries have at least 1 missing value from the core attributes
We can see a substantial portion of the values are missing, particularly in WVS reports of people's perceived trust in others.
whr.describe()
Most of the data seems to be of a similar magnitude, with Life Ladder, Log GDP, and life expectancy being the largest exceptions.
fig, ax = plt.subplots(figsize=(10,8))
corrmat = whr.drop(columns='Year').corr() # Omit year
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0);
fig, ax = plt.subplots(figsize=(6,4))
corrmat = whr[core_col].drop(columns='Year').corr() # Omit year
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0, annot=True);
fig, ax = plt.subplots(figsize=(8,6))
corrmat = whr[ext_col].drop(columns='Year').corr() # Omit year
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0);
The upper left-hand square was likely by design: if happiness (Life_Ladder) is the statistic we are looking to understand, the attributes that immediately follow it are what most would consider its largest contributing factors.
Intuitive correlations:
- correlation between all trust statistics
- Democratic Quality <+> Delivery Quality
- SDMean Ladder <-> Ladder
- GINI index <+> GINI index mean
Potentially interesting correlations:
- trust WVS 81-84 <+> Democratic+Delivery Quality
- trust WVS 81-84 <+> log GDP
- trust WVS 81-84 <-> Perceptions of corruption
While trust WVS 81-84 could be an interesting variable to investigate, it is worth remembering that this attribute has the most null values of all, so these correlations should be taken with a grain of salt.
gov_col = ['Freedom', 'Corruption_Perception', 'Confidence_natGovt','Democratic_Quality', 'Delivery_Quality']
fig, ax = plt.subplots(figsize=(6,4))
corrmat = whr[gov_col].corr()
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0, annot=True);
whr_ext = whr[ext_col].copy() # Using an extended, but not quite full, version of dataset
whr_ext.groupby('Country').Year.count().describe()
whr_ext.groupby('Country').Year.count().hist(bins=13);
Almost half of all countries in the dataset have an entry for all years that the survey has been conducted. Additionally, 75% of countries have at least 9 years worth of data entries.
#From: 00BF11 -> BF2200
rygscale = [
[0,'rgb(191, 34, 0)'],
[0.2,'rgb(191, 75, 0)'],
[0.3,'rgb(191, 116, 0)'],
[0.4,'rgb(191, 156, 0)'],
[0.5,'rgb(184, 191, 0)'],
[0.6,'rgb(144, 191, 0)'],
[0.7,'rgb(103, 191, 0)'],
#[0.8,'rgb(63, 191, 0)'],
#[0.9,'rgb(22, 191, 0)'],
[0.8,'rgb(0, 191, 17)'],#new
[0.9,'rgb(30, 223, 29)'],#new
[1,'rgb(61, 255, 41)']]#new
# https://convertingcolors.com/rgb-color-191_34_0.html
# http://www.perbang.dk/rgbgradient/
whr_recent = whr_ext.iloc[whr_ext.groupby('Country').Year.idxmax()]
data = [
go.Choropleth(
locations = whr_recent.Country,#whrffl_imp.index,
locationmode = 'country names',
z = whr_recent['Life_Ladder'],
text = ['{} ({})'.format(c,y) for c, y in zip(whr_recent.Country, whr_recent.Year)],
hoverinfo='z+text',
colorscale = rygscale,
marker = go.choropleth.Marker(line = go.choropleth.marker.Line(color = 'rgb(255,255,255)',width = 0.15)),
colorbar = go.choropleth.ColorBar(title = 'Happiness Score')
)
]
layout = go.Layout(
title = go.layout.Title(text = 'World Happiness 2019<br>(Cantril Life Ladder)'),
geo = go.layout.Geo(
showcoastlines = True,
landcolor = 'lightgray',
showland = True,
projection = go.layout.geo.Projection(type = 'equirectangular')#'natural earth')
),
width=960,
annotations = [
go.layout.Annotation(x = 0.96,y = 0.01,xref = 'paper',yref = 'paper',
text = 'Source: <a href="https://worldhappiness.report/ed/2019/#read">World Happiness Report 2019</a>',
showarrow = False),
go.layout.Annotation(x = 0, y = -0.15, xref = 'paper',yref = 'paper',align='left',font={'size':9},
text = '''
"Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top.<br>
The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you.<br>
On which step of the ladder would you say you personally feel you stand at this time?"
''',
showarrow = False)
]
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'd3-world-map')
# format: whr.Country : world.name
whr_world = {
'Bahrain' : None,
'Bosnia and Herzegovina' : 'Bosnia and Herz.',
'Central African Republic' : 'Central African Rep.',
'Congo (Brazzaville)' : 'Congo',
'Congo (Kinshasa)' : 'Dem. Rep. Congo',
'Czech Republic' : 'Czech Rep.',
'Dominican Republic' : 'Dominican Rep.',
'Hong Kong S.A.R. of China' : None,
'Ivory Coast' : "Côte d'Ivoire",
'Laos' : 'Lao PDR',
'Malta' : None,
'Mauritius' : None,
'North Cyprus' : 'Cyprus',
'Palestinian Territories' : 'Palestine',
'Singapore' : None,
'Somaliland region' : 'Somalia',
'South Korea' : 'Korea',
'South Sudan' : 'S. Sudan',
'Taiwan Province of China' : 'Taiwan'}
Aggregating¶
Ideally, we would simply use the most up-to-date data that we have for each country; unfortunately, that causes some problems.
Take latest¶
whr_ext.iloc[whr_ext.groupby('Country').Year.idxmax()]
# Get latest year indices
latest_idx = whr_ext.groupby('Country').Year.idxmax()
whrl = whr_ext.iloc[latest_idx].set_index('Country')
# Check NAs in the core data set
whrl[whrl[core_col[1:]].isna().any(axis=1)]
That is quite a few missing values from the core attributes. Dropping these values would certainly degrade the quality of conclusions we are able to draw. Let's try another means of aggregating the data.
Mean across years¶
whr_mean = whr_ext.groupby('Country').mean()
whr_mean[whr_mean[core_col[1:]].isna().any(axis=1)]
We've improved in terms of NA quantity, but now we have a meaningless Year column and data that isn't representative of the most up to date information available. We need a method that can aggregate the data while still using the latest available information. Luckily, we already have most of the information needed to do this.
Forward Fill Latest¶
# Propagate the last available entry forward
whrffl = whr_ext.groupby('Country').ffill().iloc[latest_idx].set_index('Country')
whrffl[whrffl[core_col[1:]].isna().any(axis=1)]
Now this is where we want to be. Using forward fill, we are able to preserve the latest available information while still reducing NA values. To a certain extent, we've corrupted the accuracy of Year, since it no longer holds exact measures from that year, but rather the last available data in each column up to that year. We'll keep it around for now, but drop it before we do any modeling.
The NaNs we are left with indicate that certain countries have no data available in any year of surveying.
# Save NA country index for later use for future dataframe build
naidx = whrffl[whrffl[core_col[1:]].isna().any(axis=1)].index
# Save underlying 'natural' indices
naiidx = whrffl.reset_index()[whrffl.reset_index()[core_col[1:]].isna().any(axis=1)].index
#nacnty = whr_lateff[whr_lateff[core_col].isna().any(axis=1)].Country.values
whrffl[core_col[1:]].isna().sum()
We can't call it quits yet; these remaining missing values need to be addressed before we can do any sort of clustering. There are quite a few easy methods we could use to fill in these values (we could just enter 0 and move on), but let's try to be smart about this.
Imputation¶
Somewhere in between filling values with a constant and engineering values by hand is variable imputation. If the stakes were higher, we'd want to try things like crafting missing GDP values from giniInc, or even just pulling values from another external data source, but let's keep it local and let some algorithms do the work for us. Other options would include:
- looking at region relationships (China and Hong Kong, China)
- training models for each column of interest with a missing value
- pulling from external sources
- using Freedom, Confidence_natGovt, Democratic_Quality, Delivery_Quality to try to derive Corruption_Perception
There is a case to be made that forward filling prior to imputation is not optimal since we are not allowing the algorithm to fully transform true missing values. However, we must always consider what the data represents when making any decisions.
In the worst case, a country has a data entry in 2005 (the first survey year) and NaN values for every year thereafter. Using forward fill would then propagate that value all the way up to 2018 (the latest survey year), potentially meaning it is outdated and no longer relevant. The alternative is to allow the imputer to derive these missing values using the other non-missing values as input. The question we must ask is: do we value real but potentially outdated data over fake but temporally responsive data?
In all likelihood, the change that a given country experiences over a 13 year period is smaller than what we could accurately impute. If the dataset spanned, say, a generation (25 years) then perhaps more weight should be given to the imputation option.
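The worst case described above can be made concrete with a toy frame (hypothetical values, not WHR data): a country observed only in 2005 has that value carried all the way to 2018 by the grouped forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country 'A' only has a 2005 observation
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year':    [2005, 2010, 2018, 2005, 2010, 2018],
    'Value':   [3.2, np.nan, np.nan, 5.0, 5.5, 6.0],
})

# Forward fill within each country, mirroring whr_ext.groupby('Country').ffill()
filled = toy.assign(Value=toy.groupby('Country').Value.ffill())

# 'A' carries its 2005 value up to 2018, 13 years later
print(filled.loc[filled.Country.eq('A') & filled.Year.eq(2018), 'Value'].iloc[0])  # prints 3.2
```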
As with most decisions made during an analysis, this could be thoroughly vetted and a true optimal solution discovered, but again, the stakes are low enough to leave well enough alone. Additionally, as shown below, the difference between the two methods is largely insignificant.
# fit on non-aggregated extended data
imputer = IterativeImputer(estimator=BayesianRidge(),random_state=RS,max_iter=15).fit(whr_ext.iloc[:,1:])
We fit on the non-mutated data to maintain data purity and increase the number of samples the imputer has at its disposal.
FFill difference¶
# Impute on latest data
whrl_imp = pd.DataFrame(imputer.transform(whrl), columns=whrl.columns,index=whrl.index)
# Impute on latest forward filled data
whrffl_imp = pd.DataFrame(imputer.transform(whrffl), columns=whrffl.columns,index=whrffl.index)
# whrl.loc[naidx] # Take latest Before
whrffl.loc[naidx] # FFill-latest Before
# whrl_imp.loc[naidx] # Take latest After
whrffl_imp.loc[naidx] #FFill-latest After
whrffl_imp.loc[naidx] - whrl_imp.loc[naidx] # difference
With a few exceptions, there is no difference between the filled data fields. Looking at the core columns, Corruption_Perception is the only attribute with a non-negligible difference, and only for Turkmenistan.
Models¶
- K-means
- Agglomerative Clustering
- Affinity Propagation
- Gaussian Mixture
- DBSCAN
- HDBSCAN
Helper functions¶
def plot_cluster(x, y, data, title='',centers=None, **kwargs):
""" plot data from a clustering algorithm using dataframe column names
Args:
x, y : str
names of variables in ``data``
data : pandas.Dataframe
desired plotting data
title : str, optional
title of plot
centers : array-like or pd.DataFrame, optional
if provided, plots the given centers of the determined groups
**kwargs : keyword arguments, optional
arguments to pass to plt.scatter
Returns:
ax : matplotlib Axes
the Axes object with the plot drawn onto it.
"""
fig, ax = plt.subplots(figsize=(8,5))
labels = data[kwargs.get('c')]
nlabels = labels.nunique()
bounds = np.arange(labels.min(),nlabels+1)
# 20 distinct colors, more visible and differentiable than tab20
# https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
cset = ['#3cb44b', '#ffe119', '#4363d8','#e6194b',
'#f58231','#911eb4', '#46f0f0', '#f032e6', '#bcf60c',
'#fabebe', '#008080', '#e6beff','#800000', '#aaffc3'] # take 14
cm = (mpl.colors.ListedColormap(cset, N=nlabels) if labels.min() == 0
else mpl.colors.ListedColormap(['#000000']+cset, N=nlabels+1))
sct = ax.scatter(x,y,data=data,cmap=cm,edgecolors='face',**kwargs)
if centers is not None:
if isinstance(centers,np.ndarray):
for g in centers[:,[data.columns.get_loc(x),data.columns.get_loc(y)]]:
ax.plot(*g,'*r',markersize=12, alpha=0.6)
if isinstance(centers,pd.DataFrame):
ax.scatter(x,y,data=centers,marker='D',c=centers.index.values,cmap=cm,
s=np.exp(centers['Life_Ladder'])*75, # scale ♦ size by Life_Ladder score
#s=(labels.value_counts().sort_index()/len(labels))*np.sqrt(nlabels)*200, #scale ♦ sizes by n
edgecolors='black',linewidths=1,alpha=0.7)
ax.set_title('(color=group, ♦size=Happiness, ♦loc = group center)')
ax.set_xlabel(x)
ax.set_ylabel(y)
fig.suptitle(title, fontsize=14)
ax2 = fig.add_axes([0.95, 0.1, 0.03, 0.8]) # 'Magic' numbers for colorbar spacing
norm = mpl.colors.BoundaryNorm(bounds,cm.N)
cb = mpl.colorbar.ColorbarBase(ax2, cmap=cm, norm=norm,ticks=bounds+0.5, boundaries=bounds)
cb.set_ticklabels(bounds)
plt.show()
return ax
def plot_boxolin(x,y,data):
""" Plot a box plot and a violin plot.
Args:
x,y : str
columns in `data` to be plotted. x is the 'groupby' attribute.
data : pandas.DataFrame
DataFrame containing `x` and `y` columns
Returns:
axes : matplotlib Axes
the Axes object with the plot drawn onto it.
"""
fig,axes = plt.subplots(1,2,figsize=(10,5),sharey=True)
data.boxplot(column=y,by=[x],ax=axes[0]) # could use sns.boxplot, but why not try something different
sns.violinplot(x=x,y=y,data=data,scale='area',ax=axes[1])
axes[0].set_title(None)
axes[0].set_ylabel(axes[1].get_ylabel())
axes[1].set_ylabel(None)
plt.show()
return axes
def cluster(model, X, **kwargs):
""" Run a clustering model and return predictions.
Args:
model : {sklearn.cluster, sklearn.mixture, or hdbscan}
Model to fit and predict
X : pandas.DataFrame
Data used to fit `model`
**kwargs : `model`.fit_predict() args, optional
Keyword arguments to be passed into `model`.fit_predict()
Returns:
(labels,centers) : tuple(array, pandas.DataFrame)
A tuple containing cluster labels and a DataFrame of cluster centers formatted with X's columns
"""
clust_labels = model.fit_predict(X,**kwargs)
centers = X.assign(**{model.__class__.__name__ : clust_labels} # assign a temp column to X with model name
).groupby(model.__class__.__name__,sort=True).mean() # group on temp, gather mean of labels
return (clust_labels, centers)
def score_clusters(X,labels):
""" Calculate silhouette, calinski-harabasz, and davies-bouldin scores
Args:
X : array-like, shape (``n_samples``, ``n_features``)
List of ``n_features``-dimensional data points. Each row corresponds
to a single data point.
labels : array-like, shape (``n_samples``,)
Predicted labels for each sample.
Returns:
scores : dict
Dictionary containing the three metric scores
"""
scores = {'silhouette':silhouette_score(X,labels),
'calinski_harabasz':calinski_harabasz_score(X,labels),
'davies_bouldin':davies_bouldin_score(X,labels)
}
return scores
Model data¶
ss = StandardScaler()
whrX = pd.DataFrame(ss.fit_transform(whrffl_imp.drop(columns='Year')), columns=whrffl_imp.drop(columns='Year').columns, index=whrffl_imp.index)
whrX.head()
Most clustering algorithms are sensitive to the scale of the data, so standard scaling is advised.
whr_grps = whrX.copy() # clone to which we may append cluster groups
K-means¶
K-means clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.
https://scikit-learn.org/stable/modules/clustering.html#k-means
distortions = []
for n in range(2,10):
model = KMeans(n_clusters=n,random_state=RS).fit(whrX)
distortions.append(model.inertia_)
labs = model.labels_
print(f'n_clusters: {n}\n',score_clusters(whrX,labs))
fig,ax = plt.subplots(figsize=(6, 4))
ax.plot(range(2,10),distortions)
ax.set(**{'title':'Elbow curve','ylabel':'inertia','xlabel':'n_clusters'})
plt.show()
km = KMeans(n_clusters=3,random_state=RS)
clabels_km, cent_km = cluster(km, whrX)
whr_grps['KMeans'] = clabels_km
cent_km
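Because the model was fit on standardized data, cent_km is expressed in standard-deviation units; the fitted scaler can map centers back to the original units. A self-contained sketch of the idea (toy data with made-up column values, not the WHR numbers):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(404)
raw = pd.DataFrame({'Log_GDP': rng.normal(9, 1, 100),
                    'Freedom': rng.normal(0.7, 0.1, 100)})

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(raw), columns=raw.columns)

labels = KMeans(n_clusters=3, random_state=404).fit_predict(X)
centers_scaled = X.assign(k=labels).groupby('k').mean()

# Undo the scaling so each center reads in the original units
centers_orig = pd.DataFrame(scaler.inverse_transform(centers_scaled),
                            columns=centers_scaled.columns,
                            index=centers_scaled.index)

# Inverting the linear scaling recovers the per-cluster means of the raw data
assert np.allclose(centers_orig, raw.assign(k=labels).groupby('k').mean())
```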
plot_cluster('Log_GDP','Corruption_Perception', whr_grps, centers=cent_km, title='K-Means Cluster', c='KMeans');
The initial K-means cluster plot seems to indicate that populations with lower GDP per capita tend to believe there is more corruption in business/government and rate their lives lower on the Cantril Life Ladder. However, the relationship is not exact; as seen in cluster 0, happiness and GDP can increase to a certain extent while perceptions of corruption also increase.
SilhouetteVisualizer(km).fit(whrX).poof()# https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html
InterclusterDistance(km,random_state=RS).fit(whrX).poof()#https://www.scikit-yb.org/en/latest/api/cluster/icdm.html
plot_boxolin('KMeans','Log_GDP', data = whr_grps);
Agglomerative Clustering¶
Agglomerative clustering performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The hierarchy of clusters is represented as a tree or dendrogram where the root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.
https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
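The dendrogram mentioned above can be drawn with scipy; a sketch on synthetic blobs (whrX could be substituted for X):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=404)

# Ward linkage matrix: each row records one merge (n_samples - 1 merges total)
Z = linkage(X, method='ward')
print(Z.shape)  # (29, 4)

fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, ax=ax)
ax.set(title='Ward dendrogram', xlabel='sample index', ylabel='merge distance');
```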
ac = AgglomerativeClustering(n_clusters=3, affinity = 'euclidean', linkage = 'ward')
clabels_ac,cent_ac = cluster(ac, whrX)
whr_grps['AgglomerativeClustering'] = clabels_ac
cent_ac
plot_cluster('Log_GDP','Corruption_Perception',whr_grps, centers=cent_ac, title='Agglomerative Cluster',c='AgglomerativeClustering');
score_clusters(whrX,clabels_ac)
plot_boxolin('AgglomerativeClustering','Log_GDP',whr_grps);
Affinity Propagation¶
Affinity propagation creates clusters by sending messages between pairs of samples until convergence; unlike K-means, the number of clusters is not specified up front but emerges from the data (influenced by the damping and preference parameters).
https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
ap = AffinityPropagation(damping = 0.5, max_iter = 250, affinity = 'euclidean')
clabels_ap, cent_ap = cluster(ap,whrX)
whr_grps['AffinityPropagation'] = clabels_ap
cent_ap
plot_cluster('Log_GDP','Freedom',whr_grps, centers=cent_ap,title='Affinity Propagation',c='AffinityPropagation');
score_clusters(whrX,clabels_ap)
plot_boxolin('AffinityPropagation','Log_GDP',whr_grps);
The violin plot loses a fair bit of its aesthetic value when we ramp up the cluster count.
Gaussian Mixture¶
A probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
https://scikit-learn.org/stable/modules/mixture.html#mixture
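One advantage of a mixture model over K-means is soft assignment: predict_proba yields per-cluster membership probabilities rather than hard labels. A minimal sketch on synthetic blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=404)

gmm = GaussianMixture(n_components=3, init_params='kmeans', random_state=404).fit(X)

probs = gmm.predict_proba(X)   # shape (200, 3); each row sums to 1
hard = gmm.predict(X)          # argmax of the per-component probabilities

assert np.allclose(probs.sum(axis=1), 1)
assert (hard == probs.argmax(axis=1)).all()
```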
gm = GaussianMixture(n_components=3,init_params='kmeans', random_state=RS)
clabels_gm,cent_gm = cluster(gm,whrX)
whr_grps['GaussianMixture'] = clabels_gm
cent_gm
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_gm, title='Gaussian Mixture',c='GaussianMixture');
score_clusters(whrX,clabels_gm)
plot_boxolin('GaussianMixture','Log_GDP',whr_grps);
DBSCAN¶
Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them.
db = DBSCAN(eps=0.3)
clabels_db,cent_db = cluster(db,whrX)
whr_grps['DBSCAN'] = clabels_db
cent_db
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_db,title='DBSCAN',c='DBSCAN');
Running with a scaled version of the data failed to categorize any of the points at all; it considered every point to be noise.
The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.
The usage of centroid distance limits the distance metric to Euclidean space. A good value reported by this method does not imply the best information retrieval.
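Because DBSCAN labels outliers -1, scoring the raw labels mixes noise into these metrics. One common workaround (a sketch, not part of the original analysis) is to score only the non-noise points:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic blobs so DBSCAN finds clean clusters
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.3, random_state=404)

labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)

mask = labels != -1  # drop the noise points before scoring
if len(set(labels[mask])) > 1:
    score = silhouette_score(X[mask], labels[mask])
    print(round(score, 3))
```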
plot_boxolin('DBSCAN','Log_GDP',whr_grps);
Given that this dataset has relatively low density, this model had substandard performance relative to the other contenders.
HDBSCAN¶
Hierarchical Density-Based Spatial Clustering of Applications with Noise. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.
hd = hdbscan.HDBSCAN()
clabels_hd, cent_hd = cluster(hd,whrX)
whr_grps['HDBSCAN'] = clabels_hd
cent_hd
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_hd,title='HDBSCAN',c='HDBSCAN');
score_clusters(whrX,clabels_hd)
plot_boxolin('HDBSCAN','Log_GDP',whr_grps);
Conclusions¶
This notebook explored the World Happiness Dataset using a total of 6 models:
K-means, Agglomerative Clustering, Affinity Propagation, Gaussian Mixture, DBSCAN, and HDBSCAN.
There was a somewhat significant degree of variation between the examined models, but those that were not prescribed a certain number of clusters arrived at 9 or 10 groups. With this many groupings, however, it became much more difficult to see exactly how a model was making clustering decisions.
For the models which we assigned a group count of 3, K-means, Agglomerative Clustering, and Gaussian Mixture, two diagonal or vertical lines could nearly be drawn between decision boundaries by way of GDP considerations.
Only one version of one model failed to perform at all (DBSCAN with scaled data); the rest found some form of suitable clustering. However, the boxplots showed that when models were given free rein over the number of clusters, they tended to have one cluster explain a large range of values and another explain an extremely tightly grouped set with many outliers.
Future work¶
One thought that was left untested was whether a previous model's labels in any way influenced successive models. Cluster columns were appended to whr_grps (a copy) while the models were fit on whrX, so leakage should not have occurred, but it would be worth verifying.
A good deal more EDA could be done on this dataset by looking at relationships between variables in a market basket analysis fashion. As always, there are many different hyperparameters that could still be experimented with, as well as other clustering models like Spectral, Ward, and MeanShift that might yield interesting results. Additionally, and perhaps most poignantly, only Log_GDP and Corruption_Perception were explored in the plots; the rest were left untouched, leaving a significant portion of the data not explored to its fullest.
References¶
Code & Docs:¶
- https://www.kaggle.com/dhanyajothimani/basic-visualization-and-clustering-in-python/notebook
- https://stackoverflow.com/a/42505051
- https://matplotlib.org/api/colorbar_api.html?highlight=colorbar#module-matplotlib.colorbar
- https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
- https://stackoverflow.com/questions/14777066/matplotlib-discrete-colorbar
- https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
- https://plot.ly/python/choropleth-maps/
- https://community.plot.ly/t/additional-measures-in-hover-text/5747
Text:¶
- https://www.kaggle.com/unsdsn/world-happiness#2016.csv
- https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
- https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
- https://scikit-learn.org/stable/modules/clustering.html
Other:¶
- https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
- https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87
- https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
- https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779