Clustering on the World Happiness Report 2019

Overview

This analysis attempts to group countries together based on nothing more than the professed happiness of their citizens. To do so, we will use data from the 2019 World Happiness Report to build several clustering models. The primary models discussed in this analysis are:

  • K-means
  • Agglomerative Clustering
  • Affinity Propagation
  • Gaussian Mixture

Additionally, these models will be briefly demonstrated:

  • DBSCAN
  • HDBSCAN
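
As a rough orientation, the four primary models share a similar fit-and-label workflow in scikit-learn. The sketch below runs them on synthetic blobs rather than on the happiness data; the toy data, the choice of 3 clusters, and the variable names are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation
from sklearn.mixture import GaussianMixture

# Toy data standing in for country-level features (assumption, not WHR data)
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=404)
X_scaled = StandardScaler().fit_transform(X)

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=404),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "affinity": AffinityPropagation(random_state=404),  # picks its own cluster count
    "gmm": GaussianMixture(n_components=3, random_state=404),
}

# fit_predict returns one cluster label per row for each of these models
labels = {name: model.fit_predict(X_scaled) for name, model in models.items()}
for name, lab in labels.items():
    print(name, "clusters found:", len(np.unique(lab)))
```

Note that Affinity Propagation is the only one of the four that is not told how many clusters to look for.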

Each model will be visualized in 3 different forms:

  • A scatter plot using unaltered data
  • A scatter plot using scaled data
  • A box plot of the unaltered data

Glossary:

  • GWP - Gallup World Poll
  • WVS - World Values Survey

Imports

In [124]:
import numpy as np    
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

from sklearn.preprocessing import StandardScaler 
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, DBSCAN
from sklearn.mixture import GaussianMixture 

import hdbscan

# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics.cluster import silhouette_score, silhouette_samples, calinski_harabasz_score, davies_bouldin_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer, InterclusterDistance

import geopandas

# import plotly.plotly as py  # unused below; in plotly 4+ this module lives in the separate chart_studio package
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
In [2]:
RS = 404 # Random state/seed
pd.set_option("display.max_columns",30) # Increase columns shown

Data

The 2019 World Happiness Report dataset may be obtained from

https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls

The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. This year’s World Happiness Report focuses on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.

It has 2 descriptor/ID variables (Country, Year), one response (Life Ladder), six proposed determinants of the response, and several additional variables that were either calculated or gathered from external sources.

  • Country - Name of the country.
  • Year - Year of data collection.
  • Life Ladder - (survey,0-10) Cantril Life Ladder/Happiness score/subjective well-being. The national average response to the following question:
    • "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"

Six Hypothesized Underlying Determinants:

  • Log_GDP - (calculated-normalized,external), Log GDP per capita in purchasing power parity. Constant 2011 international dollar prices from World Development Indicators (November 14, 2018)
  • Life_Expectancy - (partial-interpolated,external), Healthy life expectancies at birth are based on the data extracted from the World Health Organization's Global Health Observatory data repository
  • Social_support - (survey,binary) National average response to:
    • "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"
  • Freedom - (survey,binary), National average response to:
    • "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"
  • Generosity - (survey,binary,calculated-residual) is the residual of regressing national average of response to:
    • "Have you donated money to a charity in the past month?" on GDP per capita.
  • Corruption_Perception - (survey,2x-binary), National average response to two questions:
    • "Is corruption widespread throughout the government or not?"
    • "Is corruption widespread within businesses or not?"
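
Because Generosity is defined as a regression residual, it can be negative: a country scores below zero when its citizens donate less often than its GDP per capita would predict. The sketch below illustrates the idea with made-up national averages and a simple least-squares line; the numbers and variable names are hypothetical, and the report's exact regression specification is not reproduced here.

```python
import numpy as np

# Hypothetical national averages (not real WHR values)
gdp_per_capita = np.array([1.5, 4.0, 9.0, 20.0, 45.0])    # thousands of dollars
donated_share = np.array([0.10, 0.22, 0.25, 0.35, 0.60])  # share answering "yes"

# Ordinary least squares fit: donated_share ~ intercept + slope * gdp_per_capita
slope, intercept = np.polyfit(gdp_per_capita, donated_share, deg=1)
predicted = intercept + slope * gdp_per_capita

# The residual is what remains after removing the part explained by GDP
generosity = donated_share - predicted
print(generosity.round(3))
```

With an intercept in the model, the residuals sum to zero, so a residual-based column is centered on zero by construction.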

Additional inclusions:

  • Positive affect - (survey,3x-binary), Average of three GWP positive affect measures: happiness, laugh and enjoyment (waves 3-7). Responses to the following three questions, respectively:

    • "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Happiness?",
    • "Did you smile or laugh a lot yesterday?"
    • "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Enjoyment?"
  • Negative affect - (survey,3x-binary), Average of three GWP negative affect measures: worry, sadness and anger. Responses to the following three questions, respectively:

    • "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Worry?",
    • "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Sadness?"
    • "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Anger?"
  • "giniLadder" - (multi,calculated-metastat), Inequality/distribution statistics of happiness scores by WP5-year from the GWP release. WP5 is GWP's coding of countries, including some sub-country territories.

    • sdLadder Standard deviation of ladder by country-year
    • cvLadder Standard deviation/Mean of ladder by country-year
  • giniIncGallup - (calculated-normalized,external), Household Income International Dollars. Income variables are created by converting local currency to International Dollars (ID) using purchasing power parity (PPP) ratios.

  • giniIncWB - (partial,external), Unbalanced panel of yearly index. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments

  • giniIncWBavg - (calculated,external), the average of giniIncWB in the period 2000-2016. Most countries are missing some gini index period data.

  • Confidence_natGovt - (survey,binary,external), citizens' confidence in key institutions (WP139) Response to:

    • "Do you have confidence in each of the following, or not? How about the national government?"
  • "WGI indicators of governance quality" - (survey, amalgam, calculated, external), based on over 30 individual data sources produced by a variety of survey institutes, think tanks, non-governmental organizations, international organizations, and private sector firms, enterprise, citizen and expert survey respondents.

    • Democratic Quality - average "Voice and Accountability" and "Political Stability and Absence of Violence"
    • Delivery Quality - average "Government Effectiveness", "Regulatory Quality", "Rule of Law", "Control of Corruption"
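
To make the two ladder-spread statistics concrete, the sketch below computes a standard deviation and a standard deviation/mean ratio from a handful of made-up individual ladder answers. The responses are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption; the report's exact convention is not stated here.

```python
import numpy as np

# Hypothetical individual ladder answers (0-10) for one country-year
ladder = np.array([3, 5, 5, 6, 7, 8, 8, 9])

mean_ladder = ladder.mean()
sd_ladder = ladder.std(ddof=1)       # sample standard deviation (assumed convention)
cv_ladder = sd_ladder / mean_ladder  # "Standard deviation/Mean of ladder"

print(mean_ladder, round(sd_ladder, 3), round(cv_ladder, 3))
```

The ratio rescales the spread by the average score, so two countries with the same standard deviation but different average happiness get different cvLadder values.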

Expanded data:

  • trust_Gallup and trust_WVS* - (survey,binary), Percentage of respondents with positive-trust response to:
    • "Generally speaking, would you say that most people can be trusted or that you [have,need] to be [very] careful in dealing with people?"

Primary definitions:
https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix1.pdf

Definitions (Democratic Quality, Delivery Quality, Confidence_natGovt):
https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix2.pdf

Relevant files: data/ - contains WHR2019.xls, the externally obtained data for this analysis.

In [3]:
whr = pd.read_excel('data/WHR2019.xls')
In [4]:
#whr = pd.read_csv('data/happiness_2016.csv')
whr.columns
Out[4]:
Index(['Country name', 'Year', 'Life Ladder', 'Log GDP per capita',
       'Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Positive affect', 'Negative affect',
       'Confidence in national government', 'Democratic Quality',
       'Delivery Quality', 'Standard deviation of ladder by country-year',
       'Standard deviation/Mean of ladder by country-year',
       'GINI index (World Bank estimate)',
       'GINI index (World Bank estimate), average 2000-16',
       'gini of household income reported in Gallup, by wp5-year',
       'Most people can be trusted, Gallup',
       'Most people can be trusted, WVS round 1981-1984',
       'Most people can be trusted, WVS round 1989-1993',
       'Most people can be trusted, WVS round 1994-1998',
       'Most people can be trusted, WVS round 1999-2004',
       'Most people can be trusted, WVS round 2005-2009',
       'Most people can be trusted, WVS round 2010-2014'],
      dtype='object')
In [5]:
# Shortened and cleaned names, most are derived from WHR2019 paper
full_colnames = [
    'Country', 'Year', 'Life_Ladder', 
    'Log_GDP','Social_support', 'Life_Expectancy', 'Freedom', 'Generosity','Corruption_Perception', 
    'Positive_affect', 'Negative_affect',
    'Confidence_natGovt', 'Democratic_Quality','Delivery_Quality', 
    'sdLadder','cvLadder',
    'giniIncWB','giniIncWBavg','giniIncGallup',
    'trust_Gallup',
    'trust_WVS81_84','trust_WVS89_93','trust_WVS94_98','trust_WVS99_2004','trust_WVS2005_09','trust_WVS2010_14'
]
core_col = full_colnames[:9]
ext_col = full_colnames[:14] + full_colnames[17:19]
In [6]:
whr.columns = full_colnames
In [23]:
# Shorten and Clean names for dot access
whr.columns = whr.columns.str.replace('Most people can be trusted','trust_in_people')
whr.columns = whr.columns.str.replace(' ','_')
whr.columns = whr.columns.str.replace('[(),]','',regex=True) # Strip parens and commas (pattern is a regex)
whr.columns
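
The two slices that define core_col and ext_col are easy to misread, so a quick check of what they select can help. The sketch below repeats the name list from the cell above and prints the boundaries of each slice; only the print statements are new.

```python
full_colnames = [
    'Country', 'Year', 'Life_Ladder',
    'Log_GDP', 'Social_support', 'Life_Expectancy', 'Freedom', 'Generosity', 'Corruption_Perception',
    'Positive_affect', 'Negative_affect',
    'Confidence_natGovt', 'Democratic_Quality', 'Delivery_Quality',
    'sdLadder', 'cvLadder',
    'giniIncWB', 'giniIncWBavg', 'giniIncGallup',
    'trust_Gallup',
    'trust_WVS81_84', 'trust_WVS89_93', 'trust_WVS94_98', 'trust_WVS99_2004', 'trust_WVS2005_09', 'trust_WVS2010_14'
]

# IDs, the response, and the six hypothesized determinants
core_col = full_colnames[:9]
# Everything up to Delivery_Quality, plus the two better-populated gini columns
ext_col = full_colnames[:14] + full_colnames[17:19]

print(len(core_col), core_col[-1])
print(len(ext_col), ext_col[-2:])
```

The extended set therefore skips sdLadder, cvLadder, the sparse giniIncWB, and all trust columns.
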
In [7]:
whr.iloc[np.r_[0:3,-3:0]] # HeadTail
Out[7]:
Country Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality sdLadder cvLadder giniIncWB giniIncWBavg giniIncGallup trust_Gallup trust_WVS81_84 trust_WVS89_93 trust_WVS94_98 trust_WVS99_2004 trust_WVS2005_09 trust_WVS2010_14
0 Afghanistan 2008 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686 0.517637 0.258195 0.612072 -1.929690 -1.655084 1.774662 0.476600 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan 2009 4.401778 7.333790 0.552308 51.200001 0.678896 0.200178 0.850035 0.583926 0.237092 0.611545 -2.044093 -1.635025 1.722688 0.391362 NaN NaN 0.441906 0.286315 NaN NaN NaN NaN NaN NaN
2 Afghanistan 2010 4.758381 7.386629 0.539075 51.599998 0.600127 0.134353 0.706766 0.618265 0.275324 0.299357 -1.991810 -1.617176 1.878622 0.394803 NaN NaN 0.327318 0.275833 NaN NaN NaN NaN NaN NaN
1701 Zimbabwe 2016 3.735400 7.538829 0.768425 54.400002 0.732971 -0.068105 0.723612 0.737636 0.208555 0.699344 -0.900649 -1.374650 2.776363 0.743257 NaN 0.432 0.596690 NaN NaN NaN NaN 0.116683 NaN 0.082942
1702 Zimbabwe 2017 3.638300 7.549491 0.754147 55.000000 0.752826 -0.069670 0.751208 0.806428 0.224051 0.682647 -0.988153 -1.350867 2.656848 0.730244 NaN 0.432 0.581484 NaN NaN NaN NaN 0.116683 NaN 0.082942
1703 Zimbabwe 2018 3.616480 7.553395 0.775388 55.599998 0.762675 -0.038384 0.844209 0.710119 0.211726 0.550508 NaN NaN 2.498696 0.690919 NaN 0.432 0.541772 NaN NaN NaN NaN 0.116683 NaN 0.082942

Exploratory Data Analysis

The dataset contains NA values, which are counted below. Earlier releases of the data (such as happiness_2016.csv) appear to have been organized differently: there, Happiness_Score is the summation of Economy_GDP_per_Capita, Family, Health_Life_Expectancy, Freedom, Trust_Government_Corruption, Generosity, and Dystopia_Residual within a margin of error between the confidence intervals, and a field's contribution to happiness is occasionally 0.0, possibly because the lowest-ranked country for that characteristic was simply zeroed out. None of those columns exist in the 2019 table loaded here, where Life_Ladder is the response and the remaining columns hold the measurements themselves.

Other than Country and Year, all of the variables are continuous floating point.

In [8]:
whr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 26 columns):
Country                  1704 non-null object
Year                     1704 non-null int64
Life_Ladder              1704 non-null float64
Log_GDP                  1676 non-null float64
Social_support           1691 non-null float64
Life_Expectancy          1676 non-null float64
Freedom                  1675 non-null float64
Generosity               1622 non-null float64
Corruption_Perception    1608 non-null float64
Positive_affect          1685 non-null float64
Negative_affect          1691 non-null float64
Confidence_natGovt       1530 non-null float64
Democratic_Quality       1558 non-null float64
Delivery_Quality         1559 non-null float64
sdLadder                 1704 non-null float64
cvLadder                 1704 non-null float64
giniIncWB                643 non-null float64
giniIncWBavg             1502 non-null float64
giniIncGallup            1335 non-null float64
trust_Gallup             180 non-null float64
trust_WVS81_84           125 non-null float64
trust_WVS89_93           220 non-null float64
trust_WVS94_98           618 non-null float64
trust_WVS99_2004         491 non-null float64
trust_WVS2005_09         630 non-null float64
trust_WVS2010_14         671 non-null float64
dtypes: float64(24), int64(1), object(1)
memory usage: 346.2+ KB
In [9]:
whr.isna().sum()
Out[9]:
Country                     0
Year                        0
Life_Ladder                 0
Log_GDP                    28
Social_support             13
Life_Expectancy            28
Freedom                    29
Generosity                 82
Corruption_Perception      96
Positive_affect            19
Negative_affect            13
Confidence_natGovt        174
Democratic_Quality        146
Delivery_Quality          145
sdLadder                    0
cvLadder                    0
giniIncWB                1061
giniIncWBavg              202
giniIncGallup             369
trust_Gallup             1524
trust_WVS81_84           1579
trust_WVS89_93           1484
trust_WVS94_98           1086
trust_WVS99_2004         1213
trust_WVS2005_09         1074
trust_WVS2010_14         1033
dtype: int64
In [10]:
whr[whr[core_col].isna().any(axis=1)].shape # 188 entries have at least 1 missing value from the core attributes
Out[10]:
(188, 26)

We can see a substantial portion of the values are missing, particularly in WVS reports of people's perceived trust in others.
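
Raw counts are easier to compare as fractions of the table. For instance, trust_WVS81_84 is missing in 1579 of 1704 rows, or about 93%. The sketch below shows one way to rank columns by missing share on a toy frame; the toy values and the 50% cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values)
toy = pd.DataFrame({
    "Life_Ladder": [5.1, 6.2, 4.8, 7.0],
    "Log_GDP": [9.1, np.nan, 8.2, 10.5],
    "trust_WVS81_84": [np.nan, np.nan, np.nan, 0.4],
})

# isna().mean() gives the fraction of missing rows per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns missing in more than half of the rows (threshold is an assumption)
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
print(sparse_cols)
```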

In [11]:
whr.describe()
Out[11]:
Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality sdLadder cvLadder giniIncWB giniIncWBavg giniIncGallup trust_Gallup trust_WVS81_84 trust_WVS89_93 trust_WVS94_98 trust_WVS99_2004 trust_WVS2005_09 trust_WVS2010_14
count 1704.000000 1704.000000 1676.000000 1691.000000 1676.000000 1675.000000 1622.000000 1608.000000 1685.000000 1691.000000 1530.000000 1558.000000 1559.000000 1704.000000 1704.000000 643.000000 1502.000000 1335.000000 180.000000 125.000000 220.000000 618.000000 491.000000 630.000000 671.000000
mean 2012.332160 5.437155 9.222456 0.810570 63.111971 0.733829 0.000079 0.751315 0.709368 0.265679 0.481973 -0.136053 -0.001390 2.026707 0.392121 0.370000 0.385438 0.447771 0.226295 0.390480 0.283925 0.249574 0.268070 0.264336 0.237493
std 3.688072 1.121149 1.185794 0.119210 7.583622 0.144115 0.163365 0.186074 0.107984 0.084707 0.192059 0.876074 0.975849 0.401484 0.124661 0.083232 0.082396 0.108505 0.119079 0.123309 0.113226 0.118126 0.145120 0.160169 0.157482
min 2005.000000 2.661718 6.457201 0.290184 32.299999 0.257534 -0.336385 0.035198 0.362498 0.083426 0.068769 -2.448228 -2.144974 0.863034 0.133908 0.240000 0.211000 0.200969 0.066618 0.176535 0.066020 0.048720 0.075872 0.038242 0.031518
25% 2009.000000 4.610970 8.304428 0.747512 58.299999 0.638436 -0.115534 0.696083 0.621855 0.205414 0.334735 -0.790461 -0.711416 1.743369 0.310139 0.305000 0.321429 0.368424 0.139773 0.290300 0.223553 0.176876 0.155833 0.144976 0.118725
50% 2012.000000 5.339557 9.406206 0.833098 65.000000 0.752731 -0.022080 0.805775 0.718541 0.254544 0.464109 -0.227386 -0.218633 1.973070 0.372744 0.352000 0.371000 0.426541 0.198450 0.380174 0.292383 0.229924 0.232000 0.198380 0.193531
75% 2015.000000 6.273522 10.193060 0.904432 68.300003 0.848155 0.093522 0.876458 0.801530 0.314896 0.614862 0.650468 0.699971 2.242300 0.456311 0.428000 0.432200 0.514803 0.281627 0.478149 0.341741 0.294242 0.385469 0.391370 0.335000
max 2018.000000 8.018934 11.770276 0.987343 76.800003 0.985178 0.677743 0.983276 0.943621 0.704590 0.993604 1.575009 2.184725 3.718958 1.022769 0.634000 0.626000 0.961435 0.640332 0.571719 0.594595 0.647737 0.637185 0.737305 0.661757

Most of the data seems to be of a similar magnitude, with Life_Ladder, Log_GDP, and Life_Expectancy being the largest exceptions.
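
Differences in magnitude matter for distance-based clustering, since a column measured in years would otherwise dominate a column measured as a 0-1 share. The sketch below standardizes four columns using a few rows copied from the head/tail output above; the choice of rows and columns is only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Rows copied from the head/tail output (Afghanistan 2008-2010, Zimbabwe 2018)
sample = pd.DataFrame({
    "Life_Ladder": [3.723590, 4.401778, 4.758381, 3.616480],
    "Log_GDP": [7.168690, 7.333790, 7.386629, 7.553395],
    "Life_Expectancy": [50.799999, 51.200001, 51.599998, 55.599998],
    "Freedom": [0.718114, 0.678896, 0.600127, 0.762675],
})

# StandardScaler subtracts each column's mean and divides by its standard deviation
scaled = pd.DataFrame(StandardScaler().fit_transform(sample), columns=sample.columns)

print(np.round(scaled.mean().to_numpy(), 6))        # column means after scaling
print(np.round(scaled.std(ddof=0).to_numpy(), 6))   # column standard deviations after scaling
```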

In [12]:
fig, ax = plt.subplots(figsize=(10,8))
corrmat = whr.drop(columns=['Country','Year']).corr() # Omit the non-numeric Country and the Year
sns.heatmap(corrmat,vmin=-1,vmax=1,ax=ax,center=0);
In [13]:
fig, ax = plt.subplots(figsize=(6,4))
corrmat = whr[core_col].drop(columns=['Country','Year']).corr() # Omit the non-numeric Country and the Year
sns.heatmap(corrmat,vmin=-1,vmax=1,ax=ax,center=0,annot=True);
In [14]:
fig, ax = plt.subplots(figsize=(8,6))
corrmat = whr[ext_col].drop(columns=['Country','Year']).corr() # Omit the non-numeric Country and the Year
sns.heatmap(corrmat,vmin=-1,vmax=1,ax=ax,center=0);

The upper left-hand square was likely by design: if happiness (Life_Ladder) is the statistic we are looking to understand, the attributes that immediately follow it are likely the ones most people consider large contributing factors.

Intuitive correlations:

  • correlation between all trust statistics
  • Democratic Quality <+> Delivery Quality
  • SD/Mean of Ladder (cvLadder) <-> Ladder
  • GINI index <+> GINI index mean

Potentially interesting correlations:

  • trust WVS 81-84 <+> Democratic+Delivery Quality
  • trust WVS 81-84 <+> log GDP
  • trust WVS 81-84 <-> Perceptions of corruption

While trust WVS 81-84 could be an interesting variable to investigate, it is also worth remembering that this attribute has the most null values of any column, so these correlations should be taken with a grain of salt.
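
One way to quantify that caution is to count how many rows each pair of columns actually shares, since pandas computes each correlation only from rows where both values are present. The sketch below uses a toy frame; the values and the minimum of 4 shared rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one sparsely populated column
toy = pd.DataFrame({
    "Log_GDP": [7.2, 8.1, 9.4, 10.2, 10.9, 9.0],
    "trust_WVS81_84": [np.nan, np.nan, 0.25, 0.40, 0.55, np.nan],
    "Corruption_Perception": [0.90, 0.85, 0.80, 0.50, 0.30, np.nan],
})

# Number of rows in which both columns of a pair are present
present = toy.notna().astype(int)
pair_counts = present.T @ present
print(pair_counts)

# Report a correlation only when at least 4 rows are shared (threshold assumed)
corr = toy.corr(min_periods=4)
print(corr)
```

Pairs backed by too few shared rows come out as NaN instead of as a confident-looking coefficient.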

In [15]:
gov_col = ['Freedom', 'Corruption_Perception', 'Confidence_natGovt','Democratic_Quality', 'Delivery_Quality']
fig, ax = plt.subplots(figsize=(6,4))
corrmat = whr[gov_col].corr() # Governance-related columns only
sns.heatmap(corrmat,vmin=-1,vmax=1,ax=ax,center=0,annot=True);
In [ ]:
whr_ext = whr[ext_col].copy() # Using an extended, but not quite full, version of dataset
In [16]:
whr_ext.groupby('Country').Year.count().describe()
Out[16]:
count    165.000000
mean      10.327273
std        3.371624
min        1.000000
25%        9.000000
50%       12.000000
75%       13.000000
max       13.000000
Name: Year, dtype: float64
In [17]:
whr_ext.groupby('Country').Year.count().hist(bins=13);

Almost half of all countries in the dataset have the maximum of 13 yearly entries. Additionally, 75% of countries have at least 9 years' worth of data entries.
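
Shares like these can be read off the histogram, but they can also be computed directly from the per-country counts. The sketch below does so on a toy frame; the countries, years, and the threshold of 2 entries are hypothetical.

```python
import pandas as pd

# Toy panel: countries A and B have 3 entries, C has 1, D has 2
toy = pd.DataFrame({
    "Country": ["A"] * 3 + ["B"] * 3 + ["C"] * 1 + ["D"] * 2,
    "Year": [2016, 2017, 2018, 2016, 2017, 2018, 2018, 2017, 2018],
})

entries = toy.groupby("Country").Year.count()

# Share of countries with the maximum number of entries
share_complete = (entries == entries.max()).mean()
# Share of countries with at least 2 entries (threshold is an assumption)
share_at_least_2 = (entries >= 2).mean()

print(share_complete, share_at_least_2)
```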

In [300]:
#From:  00BF11 -> BF2200
rygscale = [
    [0,'rgb(191, 34, 0)'],
    [0.2,'rgb(191, 75, 0)'],
    [0.3,'rgb(191, 116, 0)'],
    [0.4,'rgb(191, 156, 0)'],
    [0.5,'rgb(184, 191, 0)'],
    [0.6,'rgb(144, 191, 0)'],
    [0.7,'rgb(103, 191, 0)'],
    #[0.8,'rgb(63, 191, 0)'],
    #[0.9,'rgb(22, 191, 0)'],
    [0.8,'rgb(0, 191, 17)'],#new
    [0.9,'rgb(30, 223, 29)'],#new
    [1,'rgb(61, 255, 41)']]#new
# https://convertingcolors.com/rgb-color-191_34_0.html
# http://www.perbang.dk/rgbgradient/
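
A colorscale in this format can also be generated rather than typed by hand. The sketch below linearly interpolates between the two endpoint colors named in the hex comment above; the helper name linear_colorscale is hypothetical, and a straight RGB interpolation does not pass through yellow, so it will not reproduce the hand-tuned scale exactly.

```python
def linear_colorscale(start_rgb, end_rgb, steps):
    """Return a Plotly-style colorscale: [[position, 'rgb(r, g, b)'], ...]."""
    scale = []
    for i in range(steps):
        t = i / (steps - 1)  # position in [0, 1]
        rgb = [round(s + t * (e - s)) for s, e in zip(start_rgb, end_rgb)]
        scale.append([t, 'rgb({}, {}, {})'.format(*rgb)])
    return scale

# Endpoints from the hex comment: BF2200 is (191, 34, 0), 00BF11 is (0, 191, 17)
demo_scale = linear_colorscale((191, 34, 0), (0, 191, 17), steps=5)
for stop in demo_scale:
    print(stop)
```
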
In [301]:
whr_recent = whr_ext.loc[whr_ext.groupby('Country').Year.idxmax()] # idxmax returns index labels, so use .loc
data = [
    go.Choropleth(
        locations = whr_recent.Country,#whrffl_imp.index, 
        locationmode = 'country names',
        z = whr_recent['Life_Ladder'],
        text = ['{} ({})'.format(c,y) for c, y in zip(whr_recent.Country, whr_recent.Year)],
        hoverinfo='z+text',
        colorscale = rygscale,
        marker = go.choropleth.Marker(line = go.choropleth.marker.Line(color = 'rgb(255,255,255)',width = 0.15)),
        colorbar = go.choropleth.ColorBar(title = 'Happiness Score')
    )
]

layout = go.Layout(
    title = go.layout.Title(text = 'World Happiness 2019<br>(Cantril Life Ladder)'),
    geo = go.layout.Geo(
        showcoastlines = True,
        landcolor = 'lightgray',
        showland = True,
        projection = go.layout.geo.Projection(type = 'equirectangular')#'natural earth')
    ),
    width=960,
    annotations = [
        go.layout.Annotation(x = 0.96,y = 0.01,xref = 'paper',yref = 'paper',
        text = 'Source: <a href="https://worldhappiness.report/ed/2019/#read">World Happiness Report 2019</a>',
        showarrow = False),
        go.layout.Annotation(x = 0, y = -0.15, xref = 'paper',yref = 'paper',align='left',font={'size':9},
        text = '''
        "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top.<br>
         The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you.<br>
         On which step of the ladder would you say you personally feel you stand at this time?"
        ''',
        showarrow = False)
    ]
)

fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'd3-world-map')
In [112]:
# format: whr.Country : world.name
whr_world = {
    'Bahrain' : None,
    'Bosnia and Herzegovina' : 'Bosnia and Herz.',
    'Central African Republic' : 'Central African Rep.',
    'Congo (Brazzaville)' : 'Congo',
    'Congo (Kinshasa)' : 'Dem. Rep. Congo',
    'Czech Republic' : 'Czech Rep.',
    'Dominican Republic' : 'Dominican Rep.',
    'Hong Kong S.A.R. of China' : None,
    'Ivory Coast' : "Côte d'Ivoire",
    'Laos' : 'Lao PDR',
    'Malta' : None,
    'Mauritius' : None,
    'North Cyprus' : 'Cyprus',
    'Palestinian Territories' : 'Palestine',
    'Singapore' : None,
    'Somaliland region' : 'Somalia',
    'South Korea' : 'Korea',
    'South Sudan' : 'S. Sudan',
    'Taiwan Province of China' : 'Taiwan'}
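
A mapping like this is typically applied before joining the happiness data to a table of country shapes. The sketch below shows one way to do that with pandas on a few names; it assumes that None marks a country with no counterpart in the shape table and that unlisted names already match, neither of which is confirmed by the notebook.

```python
import pandas as pd

# Subset of the mapping above; None is assumed to mean "no match available"
name_map = {
    'Czech Republic': 'Czech Rep.',
    'South Korea': 'Korea',
    'Singapore': None,
}

countries = pd.Series(['Czech Republic', 'South Korea', 'Singapore', 'Sweden'])

# Names absent from the mapping are kept unchanged; None entries become missing
world_names = countries.map(lambda c: name_map.get(c, c))
matched = world_names.dropna()

print(matched.tolist())
```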

Aggregating

Ideally, we would simply use the most up-to-date data that we have for each country. Unfortunately, that causes some problems.
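
The main problem is that a country's most recent row is not necessarily its most complete one. The sketch below reproduces the situation on a toy frame: the latest row per country is selected with groupby and idxmax, and the rows that still contain gaps are listed. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2017, 2018, 2016, 2017],
    "Life_Ladder": [5.0, 5.2, 6.1, 6.3],
    "Democratic_Quality": [0.4, np.nan, -0.2, -0.1],
})

# idxmax returns index labels, so .loc is the matching accessor
latest = toy.loc[toy.groupby("Country").Year.idxmax()].set_index("Country")
print(latest)

# Countries whose most recent row still has at least one missing value
incomplete = latest[latest.isna().any(axis=1)].index.tolist()
print(incomplete)
```

Here country A has a value for 2017 that is lost when only its 2018 row is kept.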

Take latest

In [221]:
whr_ext.loc[whr_ext.groupby('Country').Year.idxmax()]
#whrl = whr_ext.iloc[latest_idx].set_index('Country')
Out[221]:
Country Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
10 Afghanistan 2018 2.694303 7.494588 0.507516 52.599998 0.373536 -0.084888 0.927606 0.424125 0.404904 0.364666 NaN NaN NaN 0.290681
21 Albania 2018 5.004403 9.412399 0.683592 68.699997 0.824212 0.005385 0.899129 0.713300 0.318997 0.435338 NaN NaN 0.303250 0.456174
28 Algeria 2018 5.043086 9.557952 0.798651 65.900002 0.583381 -0.172413 0.758704 0.591043 0.292946 NaN NaN NaN 0.276000 0.667872
32 Angola 2014 3.794838 8.741481 0.754615 54.599998 0.374542 -0.157062 0.834076 0.578517 0.367864 0.572346 -0.739363 -1.168539 0.473500 0.440699
45 Argentina 2018 5.792797 9.809972 0.899912 68.800003 0.845895 -0.206937 0.855255 0.820310 0.320502 0.261352 NaN NaN 0.460938 0.405356
58 Armenia 2018 5.062449 9.119424 0.814449 66.900002 0.807644 -0.149109 0.676826 0.581488 0.454840 0.670828 NaN NaN 0.319250 0.406403
70 Australia 2018 7.176993 10.721021 0.940137 73.599998 0.916028 0.137795 0.404647 0.759019 0.187456 0.468837 NaN NaN 0.342750 0.429814
81 Austria 2018 7.396002 10.741893 0.911668 73.000000 0.904112 0.051552 0.523061 0.752350 0.226059 0.488679 NaN NaN 0.302692 0.299504
94 Azerbaijan 2018 5.167995 9.678014 0.781230 65.500000 0.772449 -0.251795 0.561206 0.592575 0.191392 0.834372 NaN NaN 0.211000 0.260410
103 Bahrain 2017 6.227321 10.675694 0.875747 68.500000 0.905859 0.128193 NaN 0.813571 0.289760 NaN -1.167434 0.226644 NaN 0.446609
116 Bangladesh 2018 4.499217 8.220746 0.705556 64.300003 0.901471 -0.038008 0.701421 0.541345 0.361238 0.831693 NaN NaN 0.327750 0.367609
129 Belarus 2018 5.233770 9.778739 0.904569 66.099998 0.643602 -0.181865 0.718455 0.450333 0.235729 0.421279 NaN NaN 0.281294 0.293444
141 Belgium 2018 6.892172 10.672445 0.929816 72.000000 0.808387 -0.127278 0.630412 0.749563 0.250297 0.441945 NaN NaN 0.284308 0.299525
143 Belize 2014 5.955647 8.987144 0.756932 62.220001 0.873569 0.004827 0.782105 0.754977 0.281604 0.384267 0.284336 -0.524305 NaN 0.446026
153 Benin 2018 5.819827 7.663907 0.503544 54.299999 0.713264 0.024661 0.746511 0.646655 0.467872 0.639220 NaN NaN 0.432667 0.606243
156 Bhutan 2015 5.082129 8.954588 0.847574 60.200001 0.830102 0.286635 0.633956 0.809641 0.311589 0.946393 0.469945 0.309431 0.392667 0.422514
169 Bolivia 2018 5.915734 8.860531 0.827159 63.599998 0.863247 -0.087568 0.786045 0.741973 0.387469 0.399588 NaN NaN 0.521600 0.450113
180 Bosnia and Herzegovina 2018 5.887401 9.402726 0.835890 67.800003 0.658846 0.124627 0.912858 0.642940 0.277365 0.254097 NaN NaN 0.325600 0.385817
191 Botswana 2018 3.461366 9.680226 0.794936 58.900002 0.817621 -0.259084 0.806945 0.729643 0.267084 0.718788 NaN NaN 0.626000 0.616160
204 Brazil 2018 6.190922 9.557933 0.881505 66.400002 0.750609 -0.126327 0.763251 0.749728 0.349656 0.168187 NaN NaN 0.547286 0.420397
214 Bulgaria 2018 5.098814 9.873219 0.923853 66.800003 0.724336 -0.179110 0.952014 0.639022 0.189091 0.218996 NaN NaN 0.354667 0.341207
226 Burkina Faso 2018 4.927236 7.470520 0.664859 53.900002 0.720743 -0.004381 0.757399 0.710884 0.342866 0.622255 NaN NaN 0.394667 0.605102
231 Burundi 2018 3.775283 6.541033 0.484715 53.400002 0.646399 -0.019334 0.598608 0.666442 0.362767 NaN NaN NaN 0.360000 0.680813
244 Cambodia 2018 5.121838 8.253352 0.794605 61.599998 0.958305 0.033787 NaN 0.844593 0.414346 NaN NaN NaN NaN 0.603439
257 Cameroon 2018 5.250738 8.133471 0.676825 52.700001 0.816305 0.032507 0.884442 0.642437 0.355642 0.645226 NaN NaN 0.438333 0.521751
270 Canada 2018 7.175497 10.701248 0.922719 73.599998 0.945783 0.097966 0.371741 0.823669 0.259398 0.610467 NaN NaN 0.336800 0.465442
275 Central African Republic 2017 3.475862 6.494117 0.319589 45.200001 0.645252 0.093754 0.889566 0.613865 0.599335 0.650285 -1.523122 -1.538733 0.499000 0.715371
288 Chad 2018 4.486325 7.472575 0.577254 48.200001 0.650355 0.011340 0.762879 0.552737 0.543836 0.577436 NaN NaN 0.415500 0.607655
301 Chile 2018 6.436221 10.065920 0.890085 69.900002 0.788530 -0.070616 0.816297 0.832562 0.275820 0.334744 NaN NaN 0.491571 0.434313
314 China 2018 5.131434 9.694376 0.787605 69.300003 0.895378 -0.174899 NaN 0.855784 0.189640 NaN NaN NaN 0.425000 0.538206
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1398 South Sudan 2017 2.816622 NaN 0.556823 51.000000 0.456011 NaN 0.761270 0.585602 0.517364 0.461551 -2.138769 -2.018497 0.463000 0.703008
1411 Spain 2018 6.513371 10.465594 0.910315 74.400002 0.722251 -0.079351 0.776504 0.659188 0.357191 0.285196 NaN NaN 0.345385 0.359144
1423 Sri Lanka 2018 4.400223 9.400388 0.828065 67.199997 0.852628 0.086762 0.858017 0.831293 0.301279 0.576130 NaN NaN 0.393400 0.408002
1428 Sudan 2014 4.138673 8.340058 0.810616 55.119999 0.390096 -0.072781 0.793785 0.540845 0.302725 NaN -2.049856 -1.397204 0.354000 0.455439
1429 Suriname 2012 6.269287 9.624583 0.797262 62.240002 0.885488 -0.076575 0.751283 0.764223 0.250365 0.721765 0.210094 -0.218572 NaN 0.367604
1431 Swaziland 2018 4.211565 8.946771 0.779270 NaN 0.709974 -0.179938 0.692341 0.824355 0.252339 0.689549 NaN NaN 0.523000 0.732568
1444 Sweden 2018 7.374792 10.766932 0.930680 72.599998 0.941725 0.069573 0.262797 0.822676 0.160755 0.494396 NaN NaN 0.274154 0.381510
1452 Switzerland 2018 7.508587 10.975945 0.930291 74.099998 0.926415 0.096369 0.301260 0.792226 0.191520 0.849979 NaN NaN 0.328100 0.320725
1459 Syria 2015 3.461913 NaN 0.463913 55.200001 0.448271 NaN 0.685237 0.369440 0.642589 NaN -2.448228 -1.548680 0.358000 0.525934
1470 Taiwan Province of China 2018 6.467005 NaN 0.896459 NaN 0.741033 NaN 0.735971 0.848399 0.092696 0.311723 NaN NaN NaN 0.330178
1482 Tajikistan 2017 5.829234 7.971401 0.662693 63.799999 0.832002 0.122264 0.718337 0.602668 0.277725 0.929793 -1.195176 -1.216369 0.326600 0.367954
1495 Tanzania 2018 3.445023 7.928911 0.675330 57.500000 0.807142 0.141757 0.611534 0.762089 0.221005 0.914648 NaN NaN 0.384667 0.568629
1508 Thailand 2018 6.011562 9.734829 0.873052 67.199997 0.904828 0.251650 0.906596 0.843489 0.198190 0.605364 NaN NaN 0.396692 0.482407
1516 Togo 2018 4.022895 7.287405 0.596354 54.700001 0.611966 -0.007063 0.808538 0.608449 0.446454 0.323221 NaN NaN 0.437667 0.444904
1521 Trinidad and Tobago 2017 6.191860 10.266848 0.916029 63.500000 0.859140 -0.004833 0.911336 0.846467 0.248099 0.272541 0.420911 -0.046981 NaN 0.415465
1531 Tunisia 2018 4.741132 9.304474 0.732954 66.900002 0.649680 -0.203249 0.840117 0.591727 0.365014 0.349490 NaN NaN 0.381000 0.431391
1544 Turkey 2018 5.185689 10.148917 0.847027 66.800003 0.528629 -0.181654 0.804879 0.434654 0.350773 0.513677 NaN NaN 0.405800 0.362624
1553 Turkmenistan 2018 4.620602 9.749464 0.984489 62.200001 0.857774 0.237280 NaN 0.612210 0.189025 NaN NaN NaN NaN 0.271070
1566 Uganda 2018 4.321715 7.458709 0.739841 55.700001 0.728513 0.088241 0.856106 0.685169 0.390319 0.503853 NaN NaN 0.432200 0.655503
1579 Ukraine 2018 4.661909 9.012027 0.900937 64.599998 0.663055 -0.055042 0.942961 0.608771 0.221851 0.079710 NaN NaN 0.265000 0.349749
1590 United Arab Emirates 2018 6.603744 11.127678 0.851041 67.099998 0.943664 0.036494 NaN 0.787243 0.302042 NaN NaN NaN NaN 0.721079
1603 United Kingdom 2018 7.233445 10.596948 0.928484 72.300003 0.837508 0.221998 0.404276 0.783172 0.228276 0.420860 NaN NaN 0.341083 0.417473
1616 United States 2018 6.882685 10.922465 0.903856 68.300003 0.824607 0.107713 0.709928 0.815383 0.292226 0.313816 NaN NaN 0.408167 0.701418
1629 Uruguay 2018 6.371715 9.959661 0.917316 69.000000 0.876211 -0.108451 0.682916 0.876920 0.274946 0.361706 NaN NaN 0.427364 0.437542
1641 Uzbekistan 2018 6.205460 8.773365 0.920821 65.099998 0.969898 0.311695 0.520360 0.825422 0.208660 0.969356 NaN NaN 0.348000 0.384974
1654 Venezuela 2018 5.005663 9.270281 0.886882 66.500000 0.610855 -0.176156 0.827560 0.759221 0.373658 0.260700 NaN NaN 0.497167 NaN
1667 Vietnam 2018 5.295547 8.783416 0.831945 67.900002 0.909260 -0.039124 0.808423 0.692222 0.191061 NaN NaN NaN 0.362750 0.415666
1678 Yemen 2018 3.057514 NaN 0.789422 56.700001 0.552726 NaN 0.792587 0.461114 0.314870 0.308151 NaN NaN 0.357000 0.448597
1690 Zambia 2018 4.041488 8.223958 0.717720 55.299999 0.790626 0.036644 0.810731 0.702698 0.350963 0.606715 NaN NaN 0.527400 0.619443
1703 Zimbabwe 2018 3.616480 7.553395 0.775388 55.599998 0.762675 -0.038384 0.844209 0.710119 0.211726 0.550508 NaN NaN 0.432000 0.541772

165 rows × 16 columns

In [19]:
# Get latest year indices
latest_idx = whr_ext.groupby('Country').Year.idxmax()
whrl = whr_ext.loc[latest_idx].set_index('Country') # idxmax returns index labels, so use .loc
# Check NAs in the core data set
whrl[whrl[core_col[1:]].isna().any(axis=1)]
Out[19]:
Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
Country
Bahrain 2017 6.227321 10.675694 0.875747 68.500000 0.905859 0.128193 NaN 0.813571 0.289760 NaN -1.167434 0.226644 NaN 0.446609
Cambodia 2018 5.121838 8.253352 0.794605 61.599998 0.958305 0.033787 NaN 0.844593 0.414346 NaN NaN NaN NaN 0.603439
China 2018 5.131434 9.694376 0.787605 69.300003 0.895378 -0.174899 NaN 0.855784 0.189640 NaN NaN NaN 0.425000 0.538206
Cuba 2006 5.417869 9.676425 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602 0.513176 -0.706359 -0.543394 NaN NaN
Cyprus 2018 6.276443 NaN 0.825573 73.699997 0.794215 NaN 0.848337 0.750122 0.298021 0.352440 NaN NaN 0.326167 0.448661
Egypt 2018 4.005451 9.293960 0.758824 61.700001 0.681654 -0.222930 NaN 0.492261 0.285184 NaN NaN NaN 0.312000 0.323929
Gambia 2018 4.922099 7.376554 0.684800 55.000000 0.718729 NaN 0.691070 0.804012 0.379208 0.757543 NaN NaN 0.422667 0.592391
Jordan 2018 4.638934 9.024435 0.799544 66.800003 0.762420 -0.183490 NaN NaN NaN NaN NaN NaN 0.343000 0.391051
Kosovo 2018 6.391826 NaN 0.822407 65.149826 0.889737 NaN 0.922078 0.778271 0.170248 0.347547 NaN NaN 0.289909 0.402302
Kuwait 2017 6.093905 11.090272 0.853491 66.500000 0.884182 -0.039014 NaN 0.692072 0.307321 NaN -0.323727 -0.114578 NaN 0.591861
Libya 2018 5.493978 NaN 0.824165 62.299999 0.780559 NaN 0.645839 0.705535 0.398903 NaN NaN NaN NaN 0.596754
Malta 2018 6.909711 NaN 0.931542 72.199997 0.927341 NaN 0.595200 0.721224 0.295699 0.757714 NaN NaN 0.291100 0.385407
North Cyprus 2018 5.608056 NaN 0.837392 NaN 0.797066 NaN 0.613837 0.480453 0.261868 0.378324 NaN NaN NaN 0.200969
Oman 2011 6.852982 10.648312 NaN 65.500000 0.916293 -0.008942 NaN NaN 0.295164 NaN -0.314025 0.295601 NaN 0.494790
Palestinian Territories 2018 4.553922 NaN 0.819479 NaN 0.654535 NaN 0.813780 0.610405 0.418929 0.392373 NaN NaN NaN 0.482421
Poland 2017 6.201268 10.211576 0.881854 68.900002 0.830843 -0.127978 NaN 0.677436 0.203388 0.502480 0.651249 0.678493 NaN 0.260088
Qatar 2015 6.374529 11.693157 NaN 68.300003 NaN NaN NaN NaN NaN NaN -0.074040 0.823927 NaN 0.653175
Saudi Arabia 2018 6.356393 10.797972 0.867848 66.300003 0.854922 -0.209564 NaN 0.764405 0.288380 NaN NaN NaN NaN 0.472084
Singapore 2018 6.374564 NaN 0.902841 76.800003 0.916078 NaN 0.096563 0.787093 0.106871 0.892469 NaN NaN NaN 0.399788
Somalia 2016 4.667941 NaN 0.594417 50.000000 0.917323 NaN 0.440802 0.891423 0.193282 0.700682 -2.134841 -2.125518 NaN 0.491746
Somaliland region 2012 5.057314 NaN 0.786291 NaN 0.758219 NaN 0.333832 0.735189 0.152428 0.651242 NaN NaN NaN 0.533575
South Sudan 2017 2.816622 NaN 0.556823 51.000000 0.456011 NaN 0.761270 0.585602 0.517364 0.461551 -2.138769 -2.018497 0.463000 0.703008
Swaziland 2018 4.211565 8.946771 0.779270 NaN 0.709974 -0.179938 0.692341 0.824355 0.252339 0.689549 NaN NaN 0.523000 0.732568
Syria 2015 3.461913 NaN 0.463913 55.200001 0.448271 NaN 0.685237 0.369440 0.642589 NaN -2.448228 -1.548680 0.358000 0.525934
Taiwan Province of China 2018 6.467005 NaN 0.896459 NaN 0.741033 NaN 0.735971 0.848399 0.092696 0.311723 NaN NaN NaN 0.330178
Turkmenistan 2018 4.620602 9.749464 0.984489 62.200001 0.857774 0.237280 NaN 0.612210 0.189025 NaN NaN NaN NaN 0.271070
United Arab Emirates 2018 6.603744 11.127678 0.851041 67.099998 0.943664 0.036494 NaN 0.787243 0.302042 NaN NaN NaN NaN 0.721079
Yemen 2018 3.057514 NaN 0.789422 56.700001 0.552726 NaN 0.792587 0.461114 0.314870 0.308151 NaN NaN 0.357000 0.448597

That is quite a few missing values in the core attributes. Dropping these rows would certainly degrade the quality of the conclusions we are able to draw, so let's try another means of aggregating the data.

Mean across years

In [20]:
whr_mean = whr_ext.groupby('Country').mean()
whr_mean[whr_mean[core_col[1:]].isna().any(axis=1)]
Out[20]:
Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
Country
China 2012.000000 4.984993 9.284457 0.780020 67.969230 0.829893 -0.191763 NaN 0.818346 0.158571 NaN -1.081073 -0.229781 0.425 0.519979
Cuba 2006.000000 5.417869 9.676425 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602 0.513176 -0.706359 -0.543394 NaN NaN
North Cyprus 2014.666667 5.682304 NaN 0.829782 NaN 0.779380 NaN 0.700919 0.646726 0.347834 0.414453 NaN NaN NaN 0.332236
Oman 2011.000000 6.852982 10.648312 NaN 65.500000 0.916293 -0.008942 NaN NaN 0.295164 NaN -0.314025 0.295601 NaN 0.494790
Somalia 2015.000000 5.183286 NaN 0.601511 49.899999 0.919690 NaN 0.435836 0.875515 0.195745 0.701591 -2.213750 -2.112246 NaN 0.508235
Somaliland region 2010.500000 4.909162 NaN 0.820706 NaN 0.795702 NaN 0.418910 0.768032 0.117528 0.634423 NaN NaN NaN 0.515988
Swaziland 2014.500000 4.539328 8.922841 0.808210 NaN 0.658565 -0.126601 0.804796 0.822484 0.251696 0.521415 -0.897581 -0.541850 0.523 0.710633
Turkmenistan 2013.888889 5.614522 9.517484 0.930100 60.826667 0.760401 -0.001878 NaN 0.641446 0.199920 NaN -1.031149 -1.533028 NaN 0.263699

We've reduced the number of NAs, but now we have a meaningless Year column and data that no longer represents the most up-to-date information available. We need a method that can aggregate the data while still using the latest available information. Luckily, we already have most of what we need to do this.

Forward Fill Latest

In [21]:
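# Propagate the last available entry forward within each country,
# then keep each country's latest row (latest_idx holds index labels, hence .loc)
whrffl = whr_ext.groupby('Country').ffill().loc[latest_idx].set_index('Country')
whrffl[whrffl[core_col[1:]].isna().any(axis=1)]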
Out[21]:
Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
Country
China 2018 5.131434 9.694376 0.787605 69.300003 0.895378 -0.174899 NaN 0.855784 0.189640 NaN -0.877810 -0.064555 0.425 0.538206
Cuba 2006 5.417869 9.676425 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602 0.513176 -0.706359 -0.543394 NaN NaN
North Cyprus 2018 5.608056 NaN 0.837392 NaN 0.797066 NaN 0.613837 0.480453 0.261868 0.378324 NaN NaN NaN 0.200969
Oman 2011 6.852982 10.648312 NaN 65.500000 0.916293 -0.008942 NaN NaN 0.295164 NaN -0.314025 0.295601 NaN 0.494790
Somalia 2016 4.667941 NaN 0.594417 50.000000 0.917323 NaN 0.440802 0.891423 0.193282 0.700682 -2.134841 -2.125518 NaN 0.491746
Somaliland region 2012 5.057314 NaN 0.786291 NaN 0.758219 NaN 0.333832 0.735189 0.152428 0.651242 NaN NaN NaN 0.533575
Swaziland 2018 4.211565 8.946771 0.779270 NaN 0.709974 -0.179938 0.692341 0.824355 0.252339 0.689549 -0.897581 -0.541850 0.523 0.732568
Turkmenistan 2018 4.620602 9.749464 0.984489 62.200001 0.857774 0.237280 NaN 0.612210 0.189025 NaN -1.153554 -1.545981 NaN 0.271070

Now this is where we want to be. Using forward fill, we are able to preserve the latest available information while still reducing NA values. To a certain extent, we've corrupted the accuracy of Year, since a row no longer holds exact measures from that year, but rather the last available value in each column up to that year. We'll keep it around for now, but drop it before we do any modeling.

The NaNs we are left with indicate that certain countries have no data available in any year of surveying.
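The forward-fill-then-take-latest step can be illustrated on a toy frame. This is a minimal sketch with made-up values (the `toy` frame and its numbers are hypothetical, not taken from the report). It selects the value columns explicitly so that it does not depend on how a given pandas version treats the grouping column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country A has an older Log_GDP value, country B has none at all
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B'],
    'Year': [2016, 2017, 2018, 2017, 2018],
    'Log_GDP': [9.1, 9.2, np.nan, np.nan, np.nan],
    'Freedom': [0.8, np.nan, 0.7, 0.5, 0.6],
}).sort_values(['Country', 'Year'])

value_cols = ['Log_GDP', 'Freedom']

# Forward fill within each country so older values carry into later years
filled = toy.copy()
filled[value_cols] = toy.groupby('Country')[value_cols].ffill()

# Keep only the latest row per country (idxmax returns index labels, hence .loc)
latest = filled.loc[filled.groupby('Country')['Year'].idxmax()].set_index('Country')
print(latest)
```

Country A ends up with its 2017 Log_GDP carried into the 2018 row, while country B keeps a NaN because it never reported a value, which is the situation the remaining NaNs above describe.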

In [22]:
# Save the index (country names) of rows with NAs for later dataframe builds
naidx = whrffl[whrffl[core_col[1:]].isna().any(axis=1)].index
# Save the underlying positional ('natural') indices of the same rows
naiidx = whrffl.reset_index()[whrffl.reset_index()[core_col[1:]].isna().any(axis=1)].index
whrffl[core_col[1:]].isna().sum()
Out[22]:
Year                     0
Life_Ladder              0
Log_GDP                  3
Social_support           1
Life_Expectancy          3
Freedom                  0
Generosity               4
Corruption_Perception    4
dtype: int64

We can't call it quits yet: these remaining missing values need to be addressed before we can do any sort of clustering. There are quite a few easy methods we could use to fill them in. We could just enter 0 and move on, but let's try to be smart about this.

Imputation

Somewhere between filling values with a constant and engineering values by hand lies variable imputation. If the stakes were higher, we'd want to try things like deriving missing GDP values from the giniInc columns, or even pulling values from an external data source, but let's keep it local and let some algorithms do the work for us. The more involved options would include:

  • looking at region relationships (China and HongKong China)
  • training models for each column of interest with a missing value
  • pulling from external sources
  • using Freedom, Confidence_natGovt, Democratic_Quality, and Delivery_Quality to try to derive Corruption_Perception

There is a case to be made that forward filling prior to imputation is not optimal since we are not allowing the algorithm to fully transform true missing values. However, we must always consider what the data represents when making any decisions.

In the worst case, a country has a data entry in 2005 (the first survey year) and NaN values for every year thereafter. In such a case, forward fill would propagate the 2005 value up to 2018 (the latest survey year), potentially leaving us with a value that is outdated and no longer relevant. The alternative is to allow the imputer to derive these missing values with the other non-missing values as input. The question we must ask is whether we value real, but potentially outdated, data over synthetic, but temporally responsive, data.

In all likelihood, the change a given country experiences over a 13-year period is smaller than the error an imputer would introduce. If the dataset spanned, say, a generation (25 years), then perhaps more weight should be given to the imputation option.

As with most decisions made during an analysis, this could be thoroughly vetted and a truly optimal solution discovered, but again, the stakes are low enough to leave well enough alone. Additionally, as shown below, the difference between the two methods is largely insignificant.

In [23]:
# fit on non-aggregated extended data
imputer = IterativeImputer(estimator=BayesianRidge(),random_state=RS,max_iter=15).fit(whr_ext.iloc[:,1:])

We fit on the non-mutated data to maintain data purity and increase the number of samples the imputer has at its disposal.
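The fit-on-everything, transform-a-subset pattern can be sketched on synthetic data. This is only an illustration under stated assumptions: the columns `a`, `b`, `c` and the way values are knocked out are hypothetical, and only the imputer settings mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
# Hypothetical correlated columns, so the imputer has something to learn from
full = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=n),
    'c': -a + rng.normal(scale=0.1, size=n),
})

# Knock out every tenth value of 'b' to simulate missing entries
full_missing = full.copy()
full_missing.loc[np.arange(0, n, 10), 'b'] = np.nan

# Fit on all rows, then transform only the subset we care about
imputer = IterativeImputer(estimator=BayesianRidge(), random_state=0, max_iter=15).fit(full_missing)
subset = full_missing.iloc[:20]
imputed = pd.DataFrame(imputer.transform(subset), columns=subset.columns, index=subset.index)

print(imputed.loc[[0, 10], 'b'])  # imputed values
print(full.loc[[0, 10], 'b'])     # the values that were removed
```

Observed entries pass through unchanged. Only the NaN cells are replaced, with predictions from a regression on the other columns.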

FFill difference

In [24]:
# Impute on latest data
whrl_imp = pd.DataFrame(imputer.transform(whrl), columns=whrl.columns,index=whrl.index)
# Impute on latest forward filled data
whrffl_imp = pd.DataFrame(imputer.transform(whrffl), columns=whrffl.columns,index=whrffl.index)
In [25]:
# whrl.loc[naidx] # Take latest Before
whrffl.loc[naidx] # FFill-latest Before
Out[25]:
Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
Country
China 2018 5.131434 9.694376 0.787605 69.300003 0.895378 -0.174899 NaN 0.855784 0.189640 NaN -0.877810 -0.064555 0.425 0.538206
Cuba 2006 5.417869 9.676425 0.969595 68.440002 0.281458 NaN NaN 0.646712 0.276602 0.513176 -0.706359 -0.543394 NaN NaN
North Cyprus 2018 5.608056 NaN 0.837392 NaN 0.797066 NaN 0.613837 0.480453 0.261868 0.378324 NaN NaN NaN 0.200969
Oman 2011 6.852982 10.648312 NaN 65.500000 0.916293 -0.008942 NaN NaN 0.295164 NaN -0.314025 0.295601 NaN 0.494790
Somalia 2016 4.667941 NaN 0.594417 50.000000 0.917323 NaN 0.440802 0.891423 0.193282 0.700682 -2.134841 -2.125518 NaN 0.491746
Somaliland region 2012 5.057314 NaN 0.786291 NaN 0.758219 NaN 0.333832 0.735189 0.152428 0.651242 NaN NaN NaN 0.533575
Swaziland 2018 4.211565 8.946771 0.779270 NaN 0.709974 -0.179938 0.692341 0.824355 0.252339 0.689549 -0.897581 -0.541850 0.523 0.732568
Turkmenistan 2018 4.620602 9.749464 0.984489 62.200001 0.857774 0.237280 NaN 0.612210 0.189025 NaN -1.153554 -1.545981 NaN 0.271070
In [26]:
# whrl_imp.loc[naidx] # Take latest After
whrffl_imp.loc[naidx] #FFill-latest After
Out[26]:
Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
Country
China 2018.0 5.131434 9.694376 0.787605 69.300003 0.895378 -0.174899 0.598125 0.855784 0.189640 0.681986 -0.877810 -0.064555 0.425000 0.538206
Cuba 2006.0 5.417869 9.676425 0.969595 68.440002 0.281458 -0.113951 0.842834 0.646712 0.276602 0.513176 -0.706359 -0.543394 0.325290 0.271920
North Cyprus 2018.0 5.608056 10.096858 0.837392 70.319717 0.797066 -0.142033 0.613837 0.480453 0.261868 0.378324 -0.045114 0.384631 0.209252 0.200969
Oman 2011.0 6.852982 10.648312 0.905146 65.500000 0.916293 -0.008942 0.621081 0.795146 0.295164 0.607523 -0.314025 0.295601 0.427632 0.494790
Somalia 2016.0 4.667941 6.700755 0.594417 50.000000 0.917323 0.130703 0.440802 0.891423 0.193282 0.700682 -2.134841 -2.125518 0.481401 0.491746
Somaliland region 2012.0 5.057314 8.887497 0.786291 60.650726 0.758219 0.074972 0.333832 0.735189 0.152428 0.651242 -0.133132 0.440827 0.355993 0.533575
Swaziland 2018.0 4.211565 8.946771 0.779270 57.651606 0.709974 -0.179938 0.692341 0.824355 0.252339 0.689549 -0.897581 -0.541850 0.523000 0.732568
Turkmenistan 2018.0 4.620602 9.749464 0.984489 62.200001 0.857774 0.237280 0.896775 0.612210 0.189025 0.649490 -1.153554 -1.545981 0.292400 0.271070
In [27]:
whrffl_imp.loc[naidx] - whrl_imp.loc[naidx] # difference
Out[27]:
Year Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
Country
China 0.0 0.0 0.0 0.000000e+00 0.000000 0.0 0.0 -0.045064 0.000000e+00 0.0 0.083032 -1.031326 -0.434279 0.000000 0.0
Cuba 0.0 0.0 0.0 0.000000e+00 0.000000 0.0 0.0 0.000000 0.000000e+00 0.0 0.000000 0.000000 0.000000 0.000000 0.0
North Cyprus 0.0 0.0 0.0 0.000000e+00 0.000000 0.0 0.0 0.000000 0.000000e+00 0.0 0.000000 0.000000 0.000000 0.000000 0.0
Oman 0.0 0.0 0.0 4.440892e-16 0.000000 0.0 0.0 0.000000 1.110223e-16 0.0 0.000000 0.000000 0.000000 0.000000 0.0
Somalia 0.0 0.0 0.0 0.000000e+00 0.000000 0.0 0.0 0.000000 0.000000e+00 0.0 0.000000 0.000000 0.000000 0.000000 0.0
Somaliland region 0.0 0.0 0.0 0.000000e+00 0.000000 0.0 0.0 0.000000 0.000000e+00 0.0 0.000000 0.000000 0.000000 0.000000 0.0
Swaziland 0.0 0.0 0.0 0.000000e+00 -0.099673 0.0 0.0 0.000000 0.000000e+00 0.0 0.000000 -0.503161 -0.216707 0.000000 0.0
Turkmenistan 0.0 0.0 0.0 0.000000e+00 0.000000 0.0 0.0 0.169159 0.000000e+00 0.0 -0.001948 -1.049260 -1.661919 0.045808 0.0

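With a few exceptions, there is no difference between the two sets of imputed values. Among the core columns, only Corruption_Perception (China and Turkmenistan) and Life_Expectancy (Swaziland) differ, and the largest gap is Turkmenistan's Corruption_Perception at about 0.17. The largest differences overall fall in the non-core Democratic_Quality and Delivery_Quality columns.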

Models

  • K-means
  • Agglomerative Clustering
  • Affinity Propagation
  • Gaussian Mixture
  • DBSCAN
  • HDBSCAN

Helper functions

In [28]:
def plot_cluster(x, y, data, title='',centers=None, **kwargs):
    """ plot data from a clustering algorithm using dataframe column names
    
    Args:
        x, y : str 
            names of variables in ``data``
        data : pandas.DataFrame 
            desired plotting data
        title : str, optional 
            title of plot
        centers : array-like or pd.DataFrame, optional
            if provided, plots the given centers of the determined groups
        **kwargs : keyword arguments, optional
            arguments to pass to plt.scatter
    
    Returns:        
        ax : matplotlib Axes
            the Axes object with the plot drawn onto it.
    """
    
    fig, ax = plt.subplots(figsize=(8,5))
    
    labels = data[kwargs.get('c')]
    nlabels = labels.nunique()
    bounds = np.arange(labels.min(),nlabels+1)
    
    # 20 distinct colors, more visible and distinguishable than tab20 
    # https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
    cset = ['#3cb44b', '#ffe119', '#4363d8','#e6194b', 
        '#f58231','#911eb4', '#46f0f0', '#f032e6', '#bcf60c', 
        '#fabebe', '#008080', '#e6beff','#800000', '#aaffc3'] # take 14

    cm = (mpl.colors.ListedColormap(cset, N=nlabels) if labels.min() == 0 
          else mpl.colors.ListedColormap(['#000000']+cset, N=nlabels+1))    
    
    sct = ax.scatter(x,y,data=data,cmap=cm,edgecolors='face',**kwargs)
    
    if centers is not None:
        if isinstance(centers,np.ndarray):
            for g in centers[:,[data.columns.get_loc(x),data.columns.get_loc(y)]]:
                ax.plot(*g,'*r',markersize=12, alpha=0.6)
                
        if isinstance(centers,pd.DataFrame):
            ax.scatter(x,y,data=centers,marker='D',c=centers.index.values,cmap=cm,
                       s=np.exp(centers['Life_Ladder'])*75, # scale ♦ size by Life_Ladder score 
                       #s=(labels.value_counts().sort_index()/len(labels))*np.sqrt(nlabels)*200, #scale ♦ sizes by n
                       edgecolors='black',linewidths=1,alpha=0.7)
            
        ax.set_title('(color=group, ♦size=Happiness, ♦loc = group center)')
            
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    
    fig.suptitle(title, fontsize=14)
    ax2 = fig.add_axes([0.95, 0.1, 0.03, 0.8]) # 'Magic' numbers for colorbar spacing
    norm = mpl.colors.BoundaryNorm(bounds,cm.N)
    cb = mpl.colorbar.ColorbarBase(ax2, cmap=cm, norm=norm,ticks=bounds+0.5, boundaries=bounds)
    cb.set_ticklabels(bounds)

    plt.show()
    return ax
In [29]:
def plot_boxolin(x,y,data):
    """ Plot a box plot and a violin plot.
    
    Args:
        x,y : str
            columns in `data` to be plotted. x is the 'groupby' attribute.
        data : pandas.DataFrame
            DataFrame containing `x` and `y` columns
    
    Returns:        
        axes : matplotlib Axes
            the Axes object with the plot drawn onto it.
    """
    fig,axes = plt.subplots(1,2,figsize=(10,5),sharey=True)
    # use the `data` argument rather than the global whr_grps
    data.boxplot(column=y,by=[x],ax=axes[0]) # could use sns.boxplot, but why not try something different
    sns.violinplot(x=x,y=y,data=data,scale='area',ax=axes[1])
    axes[0].set_title(None)
    axes[0].set_ylabel(axes[1].get_ylabel())
    axes[1].set_ylabel(None)
    plt.show()
    return axes
In [30]:
def cluster(model, X, **kwargs):
    """ Run a clustering model and return predictions.
    
    Args:
        model : {sklearn.cluster, sklearn.mixture, or hdbscan}
            Model to fit and predict
        X : pandas.DataFrame
            Data used to fit `model`
        **kwargs : `model`.fit_predict() args, optional
            Keyword arguments to be passed into `model`.fit_predict()
    Returns:
        (labels,centers) : tuple(array, pandas.DataFrame)
            A tuple containing cluster labels and a DataFrame of cluster centers formatted with X columns
    """
    clust_labels = model.fit_predict(X,**kwargs)
    centers = X.assign(**{model.__class__.__name__ : clust_labels} # assign a temp column to X with model name
                      ).groupby(model.__class__.__name__,sort=True).mean() # group on temp column, take the mean of each feature per cluster
    
    return (clust_labels, centers)
In [31]:
def score_clusters(X,labels):
    """ Calculate silhouette, calinski-harabasz, and davies-bouldin scores
    
    Args:
        X : array-like, shape (``n_samples``, ``n_features``)
            List of ``n_features``-dimensional data points. Each row corresponds
            to a single data point.

        labels : array-like, shape (``n_samples``,)
            Predicted labels for each sample.
    Returns:
        scores : dict
            Dictionary containing the three metric scores
    """
    scores = {'silhouette':silhouette_score(X,labels),
              'calinski_harabasz':calinski_harabasz_score(X,labels),
              'davies_bouldin':davies_bouldin_score(X,labels)
             }
    return scores
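When reading these metrics, note that silhouette (range -1 to 1) and Calinski-Harabasz are better when higher, while Davies-Bouldin is better when lower. To make the silhouette score less of a black box, here is a hand-rolled sketch on four hypothetical 1-D points. The helper `silhouette_by_hand` is illustrative only and assumes every cluster has at least two members.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette: (b - a) / max(a, b) averaged over all samples."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        a = D[i, same].mean()  # mean distance to its own cluster
        # mean distance to the nearest other cluster
        b = min(D[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well separated groups should score close to 1
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
score = silhouette_by_hand(X, labels)
print(round(score, 3))
```

For comparison, the silhouette values reported in this notebook sit around 0.2, which suggests clusters that overlap considerably.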

Model data

In [32]:
ss = StandardScaler()
whrX = pd.DataFrame(ss.fit_transform(whrffl_imp.drop(columns='Year')), columns=whrffl_imp.drop(columns='Year').columns, index=whrffl_imp.index)
whrX.head()
Out[32]:
Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
Country
Afghanistan -2.499994 -1.435267 -2.398592 -1.636104 -3.047681 -0.458046 1.095187 -2.485333 1.123785 -0.792207 -1.930070 -1.417846 -1.260899 -1.406094
Albania -0.412826 0.132356 -0.972327 0.670231 0.356675 0.141452 0.942003 0.052647 0.236524 -0.439633 0.535487 -0.077580 -1.003994 -0.090691
Algeria -0.377876 0.251331 -0.040310 0.269130 -1.462538 -1.039297 0.186609 -1.020356 -0.032534 -0.890863 -0.848822 -0.781312 -1.333314 1.591967
Angola -1.505664 -0.416054 -0.397013 -1.349603 -3.040082 -0.937353 0.592058 -1.130291 0.741230 0.243888 -0.636459 -1.141828 1.053499 -0.213697
Argentina 0.299485 0.457333 0.779928 0.684557 0.520461 -1.268567 0.705990 0.991837 0.252072 -1.307633 0.593400 -0.106015 0.901679 -0.494615

Most clustering algorithms are sensitive to the scale of the data, so standard scaling is advised.
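To see why scale matters, consider Euclidean distances between three hypothetical countries described by life expectancy (in years) and freedom (a 0 to 1 score). The numbers below are made up for illustration. The z-score is computed by hand with the population standard deviation, which is what StandardScaler uses.

```python
import numpy as np

# Hypothetical rows: [Life_Expectancy, Freedom]
X = np.array([
    [55.0, 0.90],
    [56.0, 0.30],   # similar life expectancy, very different freedom
    [75.0, 0.88],   # similar freedom, very different life expectancy
])

def dist(a, b):
    return float(np.linalg.norm(a - b))

# Raw distances are dominated by the column with the larger units
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# Standard scaling: subtract the column mean, divide by the column std (ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z_01, z_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(raw_01, raw_02)
print(z_01, z_02)
```

On the raw data the freedom gap barely registers next to the life expectancy gap. After scaling, the two gaps contribute on a comparable footing.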

In [33]:
whr_grps = whrX.copy() # clone which we may append cluster groups to.

K-means

K-means clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares.

https://scikit-learn.org/stable/modules/clustering.html#k-means
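The inertia criterion is easy to compute by hand. The sketch below uses four hypothetical 2-D points and a small helper, `inertia`, written for illustration only. It sums the squared distance of every point to the mean of its assigned cluster.

```python
import numpy as np

def inertia(X, labels):
    """Within-cluster sum of squares for a given labelling."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

X = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = inertia(X, [0, 0, 1, 1])  # groups points that are close together
bad = inertia(X, [0, 1, 0, 1])   # groups points that are far apart
print(good, bad)
```

K-means searches for the labelling with the lowest inertia for a fixed number of clusters. Because inertia always falls as clusters are added, the elbow curve looks for the point where the improvement levels off rather than for a minimum.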

In [219]:
distortions = []
for n in range(2,10):
    model = KMeans(n_clusters=n,random_state=RS).fit(whrX)
    distortions.append(model.inertia_)
    labs = model.labels_
    print(f'n_clusters: {n}\n',score_clusters(whrX,labs))
    
fig,ax = plt.subplots(figsize=(6, 4))
ax.plot(range(2,10),distortions)
ax.set(**{'title':'Elbow curve','ylabel':'inertia','xlabel':'n_clusters'})
plt.show()
n_clusters: 2
 {'silhouette': 0.25062672721911344, 'calinski_harabasz': 68.24034195153395, 'davies_bouldin': 1.4735832588278601}
n_clusters: 3
 {'silhouette': 0.21842473200144147, 'calinski_harabasz': 56.332406869431544, 'davies_bouldin': 1.4582987467383164}
n_clusters: 4
 {'silhouette': 0.22315489072181924, 'calinski_harabasz': 51.81545435941449, 'davies_bouldin': 1.4672671574548002}
n_clusters: 5
 {'silhouette': 0.20704926195914414, 'calinski_harabasz': 46.643072783557756, 'davies_bouldin': 1.5650984113109518}
n_clusters: 6
 {'silhouette': 0.21286049487400313, 'calinski_harabasz': 42.543303221115266, 'davies_bouldin': 1.489634800160517}
n_clusters: 7
 {'silhouette': 0.18704542203305513, 'calinski_harabasz': 39.19929042406028, 'davies_bouldin': 1.5316037117968835}
n_clusters: 8
 {'silhouette': 0.1798248741675178, 'calinski_harabasz': 36.34018018648525, 'davies_bouldin': 1.6167793535472728}
n_clusters: 9
 {'silhouette': 0.18559846200887306, 'calinski_harabasz': 33.98962931508475, 'davies_bouldin': 1.5670829735303966}
In [35]:
km = KMeans(n_clusters=3,random_state=RS)
clabels_km, cent_km = cluster(km, whrX)
whr_grps['KMeans'] = clabels_km
cent_km
Out[35]:
Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
KMeans
0 0.217656 0.320396 0.434670 0.369709 0.055862 -0.362711 0.395972 0.082644 -0.247264 -0.364254 0.156262 0.051710 0.069050 -0.303925
1 1.392846 1.305680 0.945599 1.171077 0.884960 0.567926 -1.666063 0.737726 -0.851829 0.277077 1.332486 1.721006 -0.876892 -0.426572
2 -0.969489 -1.088649 -1.101580 -1.103676 -0.488587 0.302405 0.145345 -0.463053 0.769330 0.436994 -0.847172 -0.862191 0.291874 0.663599
In [36]:
plot_cluster('Log_GDP','Corruption_Perception', whr_grps, centers=cent_km, title='K-Means Cluster', c='KMeans');

The initial K-Means cluster plot seems to indicate that populations with lower GDP per capita tend to believe there is more corruption in business and government, and rate their lives lower on the Cantril Life Ladder. However, the relationship is not exact: cluster 0 has higher happiness and GDP than cluster 2, yet also a higher average perception of corruption.

In [37]:
SilhouetteVisualizer(km).fit(whrX).poof()# https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html
In [38]:
InterclusterDistance(km,random_state=RS).fit(whrX).poof()#https://www.scikit-yb.org/en/latest/api/cluster/icdm.html
In [39]:
plot_boxolin('KMeans','Log_GDP', data = whr_grps);

Agglomerative Clustering

Agglomerative clustering performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The hierarchy of clusters is represented as a tree or dendrogram where the root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.

https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
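The bottom-up merging idea can be sketched in a few lines of plain Python. Note the assumptions: this toy uses single linkage (distance between the closest members) on hypothetical 1-D points because it is the simplest rule to write down, whereas the model below uses Ward linkage. The function `agglomerate` is illustrative and is not how scikit-learn implements it.

```python
def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering on 1-D points."""
    # Every observation starts in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair and continue
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

groups = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], n_clusters=3)
print(groups)
```

Stopping the merges at a chosen cluster count is equivalent to cutting the dendrogram at that level.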

In [40]:
ac = AgglomerativeClustering(n_clusters=3, affinity = 'euclidean', linkage = 'ward')
clabels_ac,cent_ac = cluster(ac, whrX)
whr_grps['AgglomerativeClustering'] = clabels_ac
cent_ac
Out[40]:
Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
AgglomerativeClustering
0 0.402045 0.454050 0.533844 0.480147 0.195623 -0.186684 0.266133 0.211264 -0.300157 -0.332329 0.214791 0.159060 -0.044172 -0.375563
1 -0.937664 -0.963593 -1.003734 -1.001937 -0.525592 0.081686 0.148570 -0.472275 0.663155 0.332682 -0.675974 -0.688059 0.324815 0.663159
2 1.517015 1.319089 1.016879 1.324479 1.024796 0.757579 -2.175405 0.712655 -0.980121 0.574577 1.533921 1.910826 -1.083430 -0.537756
In [41]:
plot_cluster('Log_GDP','Corruption_Perception',whr_grps, centers=cent_ac, title='Agglomerative Cluster',c='AgglomerativeClustering');
In [42]:
score_clusters(whrX,clabels_ac)
Out[42]:
{'silhouette': 0.19685197257469053,
 'calinski_harabasz': 48.287898116425936,
 'davies_bouldin': 1.4184695148698854}
In [43]:
plot_boxolin('AgglomerativeClustering','Log_GDP',whr_grps);

Affinity Propagation

In [44]:
ap = AffinityPropagation(damping = 0.5, max_iter = 250, affinity = 'euclidean')
clabels_ap, cent_ap = cluster(ap,whrX)
whr_grps['AffinityPropagation'] = clabels_ap
cent_ap
Out[44]:
Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
AffinityPropagation
0 -0.500617 -0.567265 -0.493516 -0.063212 0.293655 -0.052284 0.059636 -0.889806 0.309342 1.238099 -0.803796 -0.438167 -0.705119 -0.571035
1 -0.747142 -1.272132 -1.024236 -1.285140 -0.239862 0.221882 0.249245 0.015651 0.581525 0.484327 -0.424674 -0.730613 0.567298 0.978746
2 -1.279836 -1.313996 -1.903009 -1.561613 -1.223334 0.629845 0.294700 -1.372540 2.123510 -0.048624 -1.307729 -1.353406 0.344303 1.161366
3 0.222655 0.477397 0.609701 0.623043 -0.670309 -0.611355 0.745696 -0.853739 -0.286430 -1.206889 0.423385 0.222335 -0.646431 -0.864157
4 0.466757 0.073115 0.479527 0.338572 0.494534 -0.339993 0.483445 1.031721 -0.023339 -0.865418 0.270854 -0.206464 1.170024 0.098470
5 1.545599 1.302150 1.021691 1.311734 1.017160 0.768865 -1.940286 0.743073 -0.972178 0.444538 1.533331 1.879546 -1.083620 -0.663310
6 0.165519 -0.073751 0.656839 0.041118 0.987783 1.986966 0.606842 0.647387 -0.879889 0.636642 -0.532076 -0.595852 -0.587825 -0.545105
7 -1.006123 0.039446 0.132144 -0.950234 -0.139271 -1.012953 0.390493 0.606351 -0.366652 0.578280 0.364302 0.126509 2.534934 1.858689
8 0.723209 0.964495 0.369699 0.405695 0.909633 -0.012255 -0.688852 0.755897 -0.082930 0.769516 -0.092241 0.444170 0.454449 0.716381
9 0.730402 0.902029 0.792365 0.911248 0.224374 -0.714315 0.244477 0.061031 -0.770132 -0.517626 0.954993 0.963504 -0.925618 -1.038152
10 -1.408119 -0.783961 -0.254358 -0.947068 -2.798362 -0.613527 0.499679 -1.359079 0.285656 -0.599311 -1.384850 -1.129199 -0.317092 -0.555974
11 -0.898008 -0.861645 -0.733028 -0.804342 0.618704 1.440850 -0.997173 0.791620 -0.404495 1.501329 -0.515023 -0.413840 0.259245 0.480501
12 -0.567129 0.163784 -0.343996 0.153978 -0.790771 -0.888230 0.515507 -0.985809 0.473750 -0.640716 -0.706652 -0.416651 -0.116206 -0.114559
In [45]:
plot_cluster('Log_GDP','Freedom',whr_grps, centers=cent_ap,title='Affinity Propagation',c='AffinityPropagation');
In [46]:
score_clusters(whrX,clabels_ap)
Out[46]:
{'silhouette': 0.1807914178898567,
 'calinski_harabasz': 27.84691838858024,
 'davies_bouldin': 1.5462700501887794}
In [47]:
plot_boxolin('AffinityPropagation','Log_GDP',whr_grps);

The violin plot loses much of its readability once the cluster count ramps up.

Gaussian Mixture

A probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

https://scikit-learn.org/stable/modules/mixture.html#mixture
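One practical difference from K-means is that a Gaussian mixture yields a membership probability for every component rather than only a hard label. A minimal sketch on synthetic 1-D data (two hypothetical blobs, not the happiness data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well separated synthetic blobs centred at 0 and 10
X = np.concatenate([
    rng.normal(0.0, 1.0, 100),
    rng.normal(10.0, 1.0, 100),
]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, init_params='kmeans', random_state=0).fit(X)

hard = gm.predict(X)        # most likely component per sample
soft = gm.predict_proba(X)  # probability of each component per sample

print(soft[:3].round(3))
print(gm.means_.ravel())
```

The hard labels used in the cells below are simply the argmax of these probabilities, so samples near a boundary may carry noticeable uncertainty that the plots do not show.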

In [48]:
gm = GaussianMixture(n_components=3,init_params='kmeans', random_state=RS)
clabels_gm,cent_gm = cluster(gm,whrX)
whr_grps['GaussianMixture'] = clabels_gm
cent_gm
Out[48]:
Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
GaussianMixture
0 0.121003 0.290848 0.329489 0.305264 0.128359 -0.278011 0.404304 0.102988 -0.194147 -0.334905 0.082581 -0.020472 0.155404 -0.172694
1 1.366164 1.290100 0.946226 1.152916 0.845298 0.504250 -1.563515 0.725152 -0.836110 0.228849 1.328231 1.680884 -0.898514 -0.469166
2 -0.848250 -1.078975 -0.974230 -1.035616 -0.609042 0.194823 0.116398 -0.511258 0.708173 0.416979 -0.769506 -0.777091 0.188000 0.497728
In [49]:
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_gm, title='Gaussian Mixture',c='GaussianMixture');
In [50]:
score_clusters(whrX,clabels_gm)
Out[50]:
{'silhouette': 0.17602676065311476,
 'calinski_harabasz': 46.98962716751056,
 'davies_bouldin': 1.6456170460570363}
In [51]:
plot_boxolin('GaussianMixture','Log_GDP',whr_grps);

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them.

In [52]:
db = DBSCAN(eps=0.3)
clabels_db,cent_db = cluster(db,whrX)
whr_grps['DBSCAN'] = clabels_db
cent_db
Out[52]:
Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
DBSCAN
-1 3.095167e-16 -5.800074e-16 -2.508095e-16 -8.504981e-16 -5.887546e-17 -4.642751e-17 2.839479e-16 -4.750409e-16 3.297026e-17 5.070018e-16 1.480297e-17 -1.076580e-17 2.893308e-16 6.331636e-16
In [53]:
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_db,title='DBSCAN',c='DBSCAN');

Running DBSCAN on the scaled data failed to cluster any of the points at all. With eps=0.3, it considered every point to be noise (label -1) rather than a member of any group.
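The all-noise result is what one would expect when eps is small relative to typical neighbor distances in a 14-dimensional standardized space. The sketch below uses synthetic standard-normal data as a stand-in for the scaled features (an assumption; it is not the happiness data) and compares eps=0.3 with an eps taken from the distribution of distances to the 4th nearest neighbor, which together with the point itself makes the default min_samples of 5.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-in: 165 samples, 14 standardized features
X = rng.standard_normal((165, 14))

# Sorted pairwise distances; column 0 is each point's distance to itself
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
D.sort(axis=1)
kth = D[:, 4]  # distance to the 4th nearest other point

labels_small = DBSCAN(eps=0.3).fit_predict(X)

# An eps at which roughly 90% of points qualify as core points
eps_big = float(np.percentile(kth, 90))
labels_big = DBSCAN(eps=eps_big).fit_predict(X)

print('smallest 4th-neighbor distance:', kth.min())
print('noise points at eps=0.3:', int((labels_small == -1).sum()))
print('noise points at eps=%.2f:' % eps_big, int((labels_big == -1).sum()))
```

Inspecting such neighbor distances on the real scaled data would be one way to pick a more workable eps, though a single dense blob may then be all that DBSCAN finds.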

The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.

The usage of centroid distance limits the distance metric to Euclidean space. A good value reported by this method does not imply the best information retrieval.

In [55]:
plot_boxolin('DBSCAN','Log_GDP',whr_grps);

Given that this dataset has relatively low density, this model performed poorly relative to the other contenders.

HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.

In [56]:
hd = hdbscan.HDBSCAN()
clabels_hd, cent_hd = cluster(hd,whrX)
whr_grps['HDBSCAN'] = clabels_hd
cent_hd
Out[56]:
Life_Ladder Log_GDP Social_support Life_Expectancy Freedom Generosity Corruption_Perception Positive_affect Negative_affect Confidence_natGovt Democratic_Quality Delivery_Quality giniIncWBavg giniIncGallup
HDBSCAN
-1 -0.325745 -0.191294 -0.233152 -0.234759 -0.124759 0.082251 0.087954 -0.227406 0.115121 0.213518 -0.318442 -0.256041 0.006019 0.152915
0 -0.594262 -1.306554 -1.067221 -1.435553 -0.471403 0.337794 0.436063 0.145287 0.848195 0.262831 -0.547068 -0.673820 0.313713 0.926564
1 0.548691 0.145915 0.517590 0.398214 0.540105 -0.527929 0.466860 1.033615 -0.025376 -0.973451 0.332459 -0.177761 1.218192 0.047480
2 1.012796 0.955901 0.821159 0.996843 0.254934 -0.072167 -0.715967 0.167328 -0.629233 -0.261801 1.103414 1.216625 -0.836442 -0.831162
In [57]:
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_hd,title='HDBSCAN',c='HDBSCAN');
In [58]:
score_clusters(whrX,clabels_hd)
Out[58]:
{'silhouette': -0.06289074714996327,
 'calinski_harabasz': 14.315184058617332,
 'davies_bouldin': 1.9554553185648444}
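A caveat on the negative silhouette above: if score_clusters passes the labels through unchanged (an assumption, since its body is not shown here), the noise label -1 is scored as though it were a cluster of its own. The sketch below uses a hypothetical helper, silhouette_without_noise, and synthetic data to show how much that can pull the score down.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_without_noise(X, labels, noise_label=-1):
    """Silhouette over clustered points only; NaN if fewer than 2 clusters remain."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    keep = labels != noise_label
    if np.unique(labels[keep]).size < 2:
        return float("nan")
    return silhouette_score(X[keep], labels[keep])

# Synthetic example: three tight blobs plus scattered points labeled as noise
rng = np.random.RandomState(404)
blob_centers = np.array([[0, 0], [5, 5], [-5, 5]])
blobs = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in blob_centers])
noise = rng.uniform(low=[-8, -3], high=[8, 8], size=(50, 2))

X_demo = np.vstack([blobs, noise])
labels_demo = np.concatenate([np.repeat([0, 1, 2], 50), np.full(50, -1)])

with_noise = silhouette_score(X_demo, labels_demo)
without_noise = silhouette_without_noise(X_demo, labels_demo)
print("silhouette with noise as a cluster:", with_noise)
print("silhouette on clustered points only:", without_noise)
```

Reporting the share of points labeled as noise next to the filtered score keeps the comparison with the other models honest.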
In [59]:
plot_boxolin('HDBSCAN','Log_GDP',whr_grps);

Conclusions

This notebook explored the World Happiness Dataset using a total of 6 models:
K-means, Agglomerative Clustering, Affinity Propagation, Gaussian Mixture, DBSCAN, and HDBSCAN.

There was a considerable degree of variation between the examined models, but most of those that were not prescribed a number of clusters arrived at 9 or 10 groups (HDBSCAN, with three clusters plus a noise group, was the exception). With this many groupings, however, it became much more difficult to see exactly how a model was making its clustering decisions.

For the models to which we assigned a group count of 3 (K-means, Agglomerative Clustering, and Gaussian Mixture), two diagonal or vertical lines could nearly be drawn along the decision boundaries, with Log_GDP largely determining where they fall.

Only one version of one model failed to perform at all: DBSCAN with scaled data. The rest found some form of suitable clustering. However, the boxplots showed that when models were given free rein over the number of clusters, they tended to have one cluster explain a large range of values and another explain an extremely tightly grouped set with many outliers.
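The boxplot observation can also be put into numbers. The sketch below uses a hypothetical helper, cluster_spread, with made-up values; it reports each cluster's range, interquartile range, and the number of points outside the usual 1.5 x IQR whiskers.

```python
import pandas as pd

def cluster_spread(df, label_col, value_col):
    """Per-cluster size, range, IQR, and count of points beyond 1.5 * IQR."""
    rows = {}
    for label, values in df.groupby(label_col)[value_col]:
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        rows[label] = {
            "n": len(values),
            "range": values.max() - values.min(),
            "iqr": iqr,
            "outliers": int(outside.sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up example: group 0 is tight with one extreme value, group 1 is evenly spread
demo = pd.DataFrame({
    "grp": [0] * 6 + [1] * 4,
    "val": [1, 2, 3, 4, 5, 100, 10, 20, 30, 40],
})
spread = cluster_spread(demo, "grp", "val")
print(spread)
```

On the real data, the call would look something like cluster_spread(whr_grps, 'HDBSCAN', 'Log_GDP'), assuming whr_grps holds both columns.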

Future work

One idea left untested was whether the results of earlier models influenced successive models in any way. Since each model appended a new column to the dataset and that column was never removed, there is indeed a possibility of this happening. A good deal more EDA could be done on this dataset by looking at relationships between variables in a market basket analysis fashion. As always, there are many hyperparameters that could still be experimented with, as well as other clustering models like Spectral, Ward, and MeanShift that might yield interesting results. Additionally, and perhaps most importantly, only Log_GDP and Corruption_Perception were explored; the rest were left untouched, leaving a significant portion of the data not explored to its fullest.
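One way to rule out the first concern is to freeze the feature columns up front and keep every model's labels in a separate frame. This is a minimal sketch on synthetic data; the names features, feature_cols, and labels are placeholders rather than objects from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering

# Synthetic stand-in for the feature matrix (assumption: 60 rows, 4 features)
rng = np.random.RandomState(404)
features = pd.DataFrame(rng.normal(size=(60, 4)), columns=["a", "b", "c", "d"])

# Freeze the list of input columns before any model is fitted
feature_cols = list(features.columns)

# Labels live in their own frame, so they can never leak into the inputs
labels = pd.DataFrame(index=features.index)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=404),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
}
for name, model in models.items():
    X = features[feature_cols].to_numpy()  # identical inputs for every model
    labels[name] = model.fit_predict(X)

print(labels.head())
```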