Clustering on the World Happiness Report 2019
Overview¶
This analysis attempts to group countries based on nothing more than the professed happiness of their citizens. To do so, we will use data from the 2019 World Happiness Report to build several clustering models. The primary models discussed in this analysis are:
- K-means
- Agglomerative Clustering
- Affinity Propagation
- Gaussian Mixture
Additionally, these models will be briefly demonstrated:
- DBSCAN
- HDBSCAN
Each model will be visualized in 3 different forms:
- A scatter plot using unaltered data
- A scatter plot using scaled data
- A Boxplot of the unaltered data
Glossary:
- GWP - Gallup World Poll
- WVS - World Value Surveys
Imports¶
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, DBSCAN
from sklearn.mixture import GaussianMixture
import hdbscan
# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics.cluster import silhouette_score, silhouette_samples, calinski_harabasz_score, davies_bouldin_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer, InterclusterDistance
import geopandas
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
RS = 404 # Random state/seed
pd.set_option("display.max_columns",30) # Increase columns shown
Data¶
The 2019 World Happiness Report dataset may be obtained from
https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls
The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. This year’s World Happiness Report focuses on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.
It has 2 descriptor/ID variables (Country, Year), one response (Life Ladder), six proposed determinants of the response, and several additional variables that were either calculated or gathered from external sources.
Country
- Name of the country.
Year
- Year of data collection.
Life Ladder
- (survey,0-10) Cantril Life Ladder/Happiness score/subjective well-being. The national average response to the following question:
- "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?"
Six Hypothesized Underlying Determinants:
Log_GDP
- (calculated-normalized,external), Log GDP per capita in purchasing power parity. Constant 2011 international dollar prices from World Development Indicators (November 14, 2018)
Life_Expectancy
- (partial-interpolated,external), Healthy life expectancies at birth are based on the data extracted from the World Health Organization's Global Health Observatory data repository
Social_support
- (survey,binary) National average response to:
- "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?"
Freedom
- (survey,binary), National average response to:- "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"
Generosity
- (survey,binary,calculated-residual) is the residual of regressing national average of response to:- "Have you donated money to a charity in the past month?" on GDP per capita.
Corruption_Perception
- (survey,2x-binary), National average response to two questions:
- "Is corruption widespread throughout the government or not?"
- "Is corruption widespread within businesses or not?"
Additional inclusions:
Positive affect
- (survey,3x-binary), Average of three GWP positive affect measures (waves 3-7): happiness, laugh and enjoyment. Responses to the following three questions, respectively:
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Happiness?"
- "Did you smile or laugh a lot yesterday?"
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Enjoyment?"
Negative affect
- (survey,3x-binary), Average of three GWP negative affect measures: worry, sadness and anger. Responses to the following three questions, respectively:
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Worry?"
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Sadness?"
- "Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Anger?"
"giniLadder" - (multi,calculated-metastat), Inequality/distribution statistics of happiness scores by WP5-year from the GWP release. WP5 is GWP's coding of countries, including some sub-country territories.
sdLadder
- Standard deviation of ladder by country-year
cvLadder
- Standard deviation/Mean of ladder by country-year
giniIncGallup
- (calculated-normalized,external), Household Income International Dollars. Income variables are created by converting local currency to International Dollars (ID) using purchasing power parity (PPP) ratios.
giniIncWB
- (partial,external), Unbalanced panel of yearly index. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments
giniIncWBavg
- (calculated,external), the average of giniIncWB in the period 2000-2016. Most countries are missing some gini index period data.
Confidence_natGovt
- (survey,binary,external), citizens' confidence in key institutions (WP139) Response to:
- "Do you have confidence in each of the following, or not? How about the national government?"
"WGI indicators of governance quality" - (survey, amalgam, calculated, external), based on over 30 individual data sources produced by a variety of survey institutes, think tanks, non-governmental organizations, international organizations, and private sector firms, enterprise, citizen and expert survey respondents.
Democratic Quality
- average of "Voice and Accountability" and "Political Stability and Absence of Violence"
Delivery Quality
- average of "Government Effectiveness", "Regulatory Quality", "Rule of Law", "Control of Corruption"
Expanded data:
trust_Gallup and trust_WVS*
- (survey,binary), Percentage of respondents with positive-trust response to:
- "Generally speaking, would you say that most people can be trusted or that you [have,need] to be [very] careful in dealing with people?"
Primary definitions:
https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix1.pdf
Definitions (Democratic Quality, Delivery Quality, Confidence_natGovt):
https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix2.pdf
Relevant files:
data/
- contains WHR2019.xls, the externally obtained data for this analysis.
whr = pd.read_excel('data/WHR2019.xls')
#whr = pd.read_csv('data/happiness_2016.csv')
whr.columns
# Shortened and cleaned names, most are derived from WHR2019 paper
full_colnames = [
'Country', 'Year', 'Life_Ladder',
'Log_GDP','Social_support', 'Life_Expectancy', 'Freedom', 'Generosity','Corruption_Perception',
'Positive_affect', 'Negative_affect',
'Confidence_natGovt', 'Democratic_Quality','Delivery_Quality',
'sdLadder','cvLadder',
'giniIncWB','giniIncWBavg','giniIncGallup',
'trust_Gallup',
'trust_WVS81_84','trust_WVS89_93','trust_WVS94_98','trust_WVS99_2004','trust_WVS2005_09','trust_WVS2010_14'
]
core_col = full_colnames[:9]
ext_col = full_colnames[:14] + full_colnames[17:19]
whr.columns = full_colnames
# Shorten and Clean names for dot access
whr.columns = whr.columns.str.replace('Most people can be trusted','trust_in_people')
whr.columns = whr.columns.str.replace(' ','_')
whr.columns = whr.columns.str.replace('[(),]','') # Strip parens and commas
whr.columns
whr.iloc[np.r_[0:3,-3:0]] # HeadTail
Exploratory Data Analysis¶
The dataset contains NA values; however, there are a few cases where a field's contribution to happiness is 0.0. This is likely a side effect of having a modeled rather than purely gathered dataset. One possibility is that if a country ranked the lowest for a particular characteristic, it was simply zeroed out.
In the report, each country's happiness score is decomposed into contributions from GDP per capita, social support, healthy life expectancy, freedom, generosity, perceived corruption, and a dystopia residual, within a margin of error.
Other than Country and Year, all of the variables are continuous floating point.
whr.info()
whr.isna().sum()
whr[whr[core_col].isna().any(axis=1)].shape # 188 entries have at least 1 missing value from the core attributes
We can see a substantial portion of the values are missing, particularly in WVS reports of people's perceived trust in others.
whr.describe()
Most of the data seems to be of a similar magnitude, with Life Ladder, Log GDP, and life expectancy being the largest exceptions.
fig, ax = plt.subplots(figsize=(10,8))
corrmat = whr.drop(columns='Year').corr() # Omit year
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0);
fig, ax = plt.subplots(figsize=(6,4))
corrmat = whr[core_col].drop(columns='Year').corr() # Omit year
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0, annot=True);
fig, ax = plt.subplots(figsize=(8,6))
corrmat = whr[ext_col].drop(columns='Year').corr() # Omit year
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0);
The upper left-hand square was likely by design: if happiness (Life_Ladder) is the statistic we are looking to understand, the attributes that immediately follow it are what most would consider its largest contributing factors.
Intuitive correlations:
- correlation between all trust statistics
- Democratic Quality <+> Delivery Quality
- SDMean Ladder <-> Ladder
- GINI index <+> GINI index mean
Potentially interesting correlations:
- trust WVS 81-84 <+> Democratic+Delivery Quality
- trust WVS 81-84 <+> log GDP
- trust WVS 81-84 <-> Perceptions of corruption
While trust WVS 81-84 could be an interesting variable to investigate, it is worth remembering that this attribute has the most null values of all, so these correlations should be taken with a grain of salt.
gov_col = ['Freedom', 'Corruption_Perception', 'Confidence_natGovt','Democratic_Quality', 'Delivery_Quality']
fig, ax = plt.subplots(figsize=(6,4))
corrmat = whr[gov_col].corr()
sns.heatmap(corrmat, vmin=-1, vmax=1, ax=ax, center=0, annot=True);
whr_ext = whr[ext_col].copy() # Using an extended, but not quite full, version of dataset
whr_ext.groupby('Country').Year.count().describe()
whr_ext.groupby('Country').Year.count().hist(bins=13);
Almost half of all countries in the dataset have an entry for all years that the survey has been conducted. Additionally, 75% of countries have at least 9 years worth of data entries.
#From: 00BF11 -> BF2200
rygscale = [
[0,'rgb(191, 34, 0)'],
[0.2,'rgb(191, 75, 0)'],
[0.3,'rgb(191, 116, 0)'],
[0.4,'rgb(191, 156, 0)'],
[0.5,'rgb(184, 191, 0)'],
[0.6,'rgb(144, 191, 0)'],
[0.7,'rgb(103, 191, 0)'],
#[0.8,'rgb(63, 191, 0)'],
#[0.9,'rgb(22, 191, 0)'],
[0.8,'rgb(0, 191, 17)'],#new
[0.9,'rgb(30, 223, 29)'],#new
[1,'rgb(61, 255, 41)']]#new
# https://convertingcolors.com/rgb-color-191_34_0.html
# http://www.perbang.dk/rgbgradient/
whr_recent = whr_ext.iloc[whr_ext.groupby('Country').Year.idxmax()]
data = [
go.Choropleth(
locations = whr_recent.Country,#whrffl_imp.index,
locationmode = 'country names',
z = whr_recent['Life_Ladder'],
text = ['{} ({})'.format(c,y) for c, y in zip(whr_recent.Country, whr_recent.Year)],
hoverinfo='z+text',
colorscale = rygscale,
marker = go.choropleth.Marker(line = go.choropleth.marker.Line(color = 'rgb(255,255,255)',width = 0.15)),
colorbar = go.choropleth.ColorBar(title = 'Happiness Score')
)
]
layout = go.Layout(
title = go.layout.Title(text = 'World Happiness 2019<br>(Cantril Life Ladder)'),
geo = go.layout.Geo(
showcoastlines = True,
landcolor = 'lightgray',
showland = True,
projection = go.layout.geo.Projection(type = 'equirectangular')#'natural earth')
),
width=960,
annotations = [
go.layout.Annotation(x = 0.96,y = 0.01,xref = 'paper',yref = 'paper',
text = 'Source: <a href="https://worldhappiness.report/ed/2019/#read">World Happiness Report 2019</a>',
showarrow = False),
go.layout.Annotation(x = 0, y = -0.15, xref = 'paper',yref = 'paper',align='left',font={'size':9},
text = '''
"Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top.<br>
The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you.<br>
On which step of the ladder would you say you personally feel you stand at this time?"
''',
showarrow = False)
]
)
fig = go.Figure(data = data, layout = layout)
iplot(fig, filename = 'd3-world-map')
# format: whr.Country : world.name
whr_world = {
'Bahrain' : None,
'Bosnia and Herzegovina' : 'Bosnia and Herz.',
'Central African Republic' : 'Central African Rep.',
'Congo (Brazzaville)' : 'Congo',
'Congo (Kinshasa)' : 'Dem. Rep. Congo',
'Czech Republic' : 'Czech Rep.',
'Dominican Republic' : 'Dominican Rep.',
'Hong Kong S.A.R. of China' : None,
'Ivory Coast' : "Côte d'Ivoire",
'Laos' : 'Lao PDR',
'Malta' : None,
'Mauritius' : None,
'North Cyprus' : 'Cyprus',
'Palestinian Territories' : 'Palestine',
'Singapore' : None,
'Somaliland region' : 'Somalia',
'South Korea' : 'Korea',
'South Sudan' : 'S. Sudan',
'Taiwan Province of China' : 'Taiwan'}
Aggregating¶
Ideally, we would simply use the most up-to-date data that we have for each country; unfortunately, that causes some problems.
Take latest¶
whr_ext.iloc[whr_ext.groupby('Country').Year.idxmax()]
# Get latest year indices
latest_idx = whr_ext.groupby('Country').Year.idxmax()
whrl = whr_ext.iloc[latest_idx].set_index('Country')
# Check NAs in the core data set
whrl[whrl[core_col[1:]].isna().any(axis=1)]
That is quite a few missing values from the core attributes. Dropping these values would certainly degrade the quality of conclusions we are able to draw. Let's try another means of aggregating the data.
Mean across years¶
whr_mean = whr_ext.groupby('Country').mean()
whr_mean[whr_mean[core_col[1:]].isna().any(axis=1)]
We've improved in terms of NA quantity, but now we have a meaningless Year column and data that isn't representative of the most up to date information available. We need a method that can aggregate the data while still using the latest available information. Luckily, we already have most of the information needed to do this.
Forward Fill Latest¶
# Propagate the last available entry forward
whrffl = whr_ext.groupby('Country').ffill().iloc[latest_idx].set_index('Country')
whrffl[whrffl[core_col[1:]].isna().any(axis=1)]
Now this is where we want to be. Using forward fill, we are able to preserve the latest available information while still reducing NA values. To a certain extent, we've corrupted the accuracy of Year, since it no longer holds exact measures from that year, but rather the last available data in each column up to that year. We'll keep it around for now, but drop it before we do any modeling.
The NaNs we are left with indicate that certain countries have no data available in any year of surveying.
# Save NA country index for later use for future dataframe build
naidx = whrffl[whrffl[core_col[1:]].isna().any(axis=1)].index
# Save underlying 'natural' indices
naiidx = whrffl.reset_index()[whrffl.reset_index()[core_col[1:]].isna().any(axis=1)].index
#nacnty = whr_lateff[whr_lateff[core_col].isna().any(axis=1)].Country.values
whrffl[core_col[1:]].isna().sum()
We can't call it quits yet; these remaining missing values need to be addressed before we can do any sort of clustering. There are quite a few easy methods we could use to fill in these values (we could just enter 0 and move on), but let's try to be smart about this.
Imputation¶
Somewhere in between filling values with a constant and engineering values by hand is variable imputation. If the stakes were higher, we'd want to try things like crafting missing GDP values from giniInc, or even just pulling values from another external data source, but let's keep it local and let some algorithms do the work for us. Other options would include:
- looking at region relationships (China and Hong Kong, China)
- training models for each column of interest with a missing value
- pulling from external sources
- using Freedom, Confidence_natGovt, Democratic_Quality, Delivery_Quality to try to derive Corruption_Perception
There is a case to be made that forward filling prior to imputation is not optimal since we are not allowing the algorithm to fully transform true missing values. However, we must always consider what the data represents when making any decisions.
In the worst case, a country has a data entry in 2005 (the first survey year) and NaN values for every year thereafter. Using forward fill would then propagate that value all the way up to 2018 (the latest survey year), potentially meaning it is outdated and no longer relevant. The alternative is to allow the imputer to derive these missing values using the other non-missing values as input. The question we must ask is: do we value real but potentially outdated data over fake but temporally responsive data?
In all likelihood, the change that a given country experiences over a 13 year period is smaller than what we could accurately impute. If the dataset spanned, say, a generation (25 years) then perhaps more weight should be given to the imputation option.
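The worst case described above can be made concrete with a toy frame (hypothetical values, not WHR data): a country observed only in 2005 has that value carried all the way to 2018 by the grouped forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country 'A' only has a 2005 observation
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year':    [2005, 2010, 2018, 2005, 2010, 2018],
    'Value':   [3.2, np.nan, np.nan, 5.0, 5.5, 6.0],
})

# Forward fill within each country, mirroring whr_ext.groupby('Country').ffill()
filled = toy.assign(Value=toy.groupby('Country').Value.ffill())

# 'A' carries its 2005 value up to 2018, 13 years later
print(filled.loc[filled.Country.eq('A') & filled.Year.eq(2018), 'Value'].iloc[0])  # prints 3.2
```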
As with most decisions made during an analysis, this could be thoroughly vetted and a true optimal solution discovered, but again, the stakes are low enough to leave well enough alone. Additionally, as shown below, the difference between the two methods is largely insignificant.
# fit on non-aggregated extended data
imputer = IterativeImputer(estimator=BayesianRidge(),random_state=RS,max_iter=15).fit(whr_ext.iloc[:,1:])
We fit on the non-mutated data to maintain data purity and increase the number of samples the imputer has at its disposal.
FFill difference¶
# Impute on latest data
whrl_imp = pd.DataFrame(imputer.transform(whrl), columns=whrl.columns,index=whrl.index)
# Impute on latest forward filled data
whrffl_imp = pd.DataFrame(imputer.transform(whrffl), columns=whrffl.columns,index=whrffl.index)
# whrl.loc[naidx] # Take latest Before
whrffl.loc[naidx] # FFill-latest Before
# whrl_imp.loc[naidx] # Take latest After
whrffl_imp.loc[naidx] #FFill-latest After
whrffl_imp.loc[naidx] - whrl_imp.loc[naidx] # difference
With a few exceptions, there is no difference between the filled data fields. Looking at the core columns, Corruption_Perception is the only attribute with a non-negligible difference, and only for Turkmenistan.
Models¶
- K-means
- Agglomerative Clustering
- Affinity Propagation
- Gaussian Mixture
- DBSCAN
- HDBSCAN
Helper functions¶
def plot_cluster(x, y, data, title='',centers=None, **kwargs):
""" plot data from a clustering algorithm using dataframe column names
Args:
x, y : str
names of variables in ``data``
data : pandas.Dataframe
desired plotting data
title : str, optional
title of plot
centers : array-like or pd.DataFrame, optional
if provided, plots the given centers of the determined groups
**kwargs : keyword arguments, optional
arguments to pass to plt.scatter
Returns:
ax : matplotlib Axes
the Axes object with the plot drawn onto it.
"""
fig, ax = plt.subplots(figsize=(8,5))
labels = data[kwargs.get('c')]
nlabels = labels.nunique()
bounds = np.arange(labels.min(),nlabels+1)
# 20 distinct colors, more visible and differentiable than tab20
# https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
cset = ['#3cb44b', '#ffe119', '#4363d8','#e6194b',
'#f58231','#911eb4', '#46f0f0', '#f032e6', '#bcf60c',
'#fabebe', '#008080', '#e6beff','#800000', '#aaffc3'] # take 14
cm = (mpl.colors.ListedColormap(cset, N=nlabels) if labels.min() == 0
else mpl.colors.ListedColormap(['#000000']+cset, N=nlabels+1))
sct = ax.scatter(x,y,data=data,cmap=cm,edgecolors='face',**kwargs)
if centers is not None:
if isinstance(centers,np.ndarray):
for g in centers[:,[data.columns.get_loc(x),data.columns.get_loc(y)]]:
ax.plot(*g,'*r',markersize=12, alpha=0.6)
if isinstance(centers,pd.DataFrame):
ax.scatter(x,y,data=centers,marker='D',c=centers.index.values,cmap=cm,
s=np.exp(centers['Life_Ladder'])*75, # scale ♦ size by Life_Ladder score
#s=(labels.value_counts().sort_index()/len(labels))*np.sqrt(nlabels)*200, #scale ♦ sizes by n
edgecolors='black',linewidths=1,alpha=0.7)
ax.set_title('(color=group, ♦size=Happiness, ♦loc = group center)')
ax.set_xlabel(x)
ax.set_ylabel(y)
fig.suptitle(title, fontsize=14)
ax2 = fig.add_axes([0.95, 0.1, 0.03, 0.8]) # 'Magic' numbers for colorbar spacing
norm = mpl.colors.BoundaryNorm(bounds,cm.N)
cb = mpl.colorbar.ColorbarBase(ax2, cmap=cm, norm=norm,ticks=bounds+0.5, boundaries=bounds)
cb.set_ticklabels(bounds)
plt.show()
return ax
def plot_boxolin(x,y,data):
""" Plot a box plot and a violin plot.
Args:
x,y : str
columns in `data` to be plotted. x is the 'groupby' attribute.
data : pandas.DataFrame
DataFrame containing `x` and `y` columns
Returns:
axes : matplotlib Axes
the Axes object with the plot drawn onto it.
"""
fig,axes = plt.subplots(1,2,figsize=(10,5),sharey=True)
data.boxplot(column=y,by=[x],ax=axes[0]) # could use sns.boxplot, but why not try something different
sns.violinplot(x=x,y=y,data=data,scale='area',ax=axes[1])
axes[0].set_title(None)
axes[0].set_ylabel(axes[1].get_ylabel())
axes[1].set_ylabel(None)
plt.show()
return axes
def cluster(model, X, **kwargs):
""" Run a clustering model and return predictions.
Args:
model : {sklearn.cluster, sklearn.mixture, or hdbscan}
Model to fit and predict
X : pandas.DataFrame
Data used to fit `model`
**kwargs : `model`.fit_predict() args, optional
Keyword arguments to be passed into `model`.fit_predict()
Returns:
(labels,centers) : tuple(array, pandas.DataFrame)
A tuple containing cluster labels and a DataFrame of cluster centers formatted with X's columns
"""
clust_labels = model.fit_predict(X,**kwargs)
centers = X.assign(**{model.__class__.__name__ : clust_labels} # assign a temp column to X with model name
).groupby(model.__class__.__name__,sort=True).mean() # group on temp, gather mean of labels
return (clust_labels, centers)
def score_clusters(X,labels):
""" Calculate silhouette, calinski-harabasz, and davies-bouldin scores
Args:
X : array-like, shape (``n_samples``, ``n_features``)
List of ``n_features``-dimensional data points. Each row corresponds
to a single data point.
labels : array-like, shape (``n_samples``,)
Predicted labels for each sample.
Returns:
scores : dict
Dictionary containing the three metric scores
"""
scores = {'silhouette':silhouette_score(X,labels),
'calinski_harabasz':calinski_harabasz_score(X,labels),
'davies_bouldin':davies_bouldin_score(X,labels)
}
return scores
Model data¶
ss = StandardScaler()
whrX = pd.DataFrame(ss.fit_transform(whrffl_imp.drop(columns='Year')), columns=whrffl_imp.drop(columns='Year').columns, index=whrffl_imp.index)
whrX.head()
Most clustering algorithms are sensitive to the scale of the data, so standard scaling is advised.
whr_grps = whrX.copy() # clone to which we may append cluster groups
K-means¶
K-means clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.
https://scikit-learn.org/stable/modules/clustering.html#k-means
distortions = []
for n in range(2,10):
model = KMeans(n_clusters=n,random_state=RS).fit(whrX)
distortions.append(model.inertia_)
labs = model.labels_
print(f'n_clusters: {n}\n',score_clusters(whrX,labs))
fig,ax = plt.subplots(figsize=(6, 4))
ax.plot(range(2,10),distortions)
ax.set(**{'title':'Elbow curve','ylabel':'inertia','xlabel':'n_clusters'})
plt.show()
km = KMeans(n_clusters=3,random_state=RS)
clabels_km, cent_km = cluster(km, whrX)
whr_grps['KMeans'] = clabels_km
cent_km
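Because the model was fit on standardized data, cent_km is expressed in standard-deviation units; the fitted scaler can map centers back to the original units. A self-contained sketch of the idea (toy data with made-up column values, not the WHR numbers):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(404)
raw = pd.DataFrame({'Log_GDP': rng.normal(9, 1, 100),
                    'Freedom': rng.normal(0.7, 0.1, 100)})

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(raw), columns=raw.columns)

labels = KMeans(n_clusters=3, random_state=404).fit_predict(X)
centers_scaled = X.assign(k=labels).groupby('k').mean()

# Undo the scaling so each center reads in the original units
centers_orig = pd.DataFrame(scaler.inverse_transform(centers_scaled),
                            columns=centers_scaled.columns,
                            index=centers_scaled.index)

# Inverting the linear scaling recovers the per-cluster means of the raw data
assert np.allclose(centers_orig, raw.assign(k=labels).groupby('k').mean())
```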
plot_cluster('Log_GDP','Corruption_Perception', whr_grps, centers=cent_km, title='K-Means Cluster', c='KMeans');
The initial K-means cluster plot seems to indicate that populations with lower GDP per capita tend to believe there is more corruption in business/government and rate their lives lower on the Cantril Life Ladder. However, the relationship is not exact; as seen in cluster 0, happiness and GDP can increase to a certain extent while perceptions of corruption also increase.
SilhouetteVisualizer(km).fit(whrX).poof()# https://www.scikit-yb.org/en/latest/api/cluster/silhouette.html
InterclusterDistance(km,random_state=RS).fit(whrX).poof()#https://www.scikit-yb.org/en/latest/api/cluster/icdm.html
plot_boxolin('KMeans','Log_GDP', data = whr_grps);
Agglomerative Clustering¶
Agglomerative clustering performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The hierarchy of clusters is represented as a tree or dendrogram where the root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.
https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
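The dendrogram mentioned above can be drawn with scipy; a sketch on synthetic blobs (whrX could be substituted for X):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=404)

# Ward linkage matrix: each row records one merge (n_samples - 1 merges total)
Z = linkage(X, method='ward')
print(Z.shape)  # (29, 4)

fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, ax=ax)
ax.set(title='Ward dendrogram', xlabel='sample index', ylabel='merge distance');
```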
ac = AgglomerativeClustering(n_clusters=3, affinity = 'euclidean', linkage = 'ward')
clabels_ac,cent_ac = cluster(ac, whrX)
whr_grps['AgglomerativeClustering'] = clabels_ac
cent_ac
plot_cluster('Log_GDP','Corruption_Perception',whr_grps, centers=cent_ac, title='Agglomerative Cluster',c='AgglomerativeClustering');
score_clusters(whrX,clabels_ac)
plot_boxolin('AgglomerativeClustering','Log_GDP',whr_grps);
Affinity Propagation¶
Affinity propagation creates clusters by sending messages between pairs of samples until convergence; unlike K-means, the number of clusters is not specified up front but emerges from the data (influenced by the damping and preference parameters).
https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
ap = AffinityPropagation(damping = 0.5, max_iter = 250, affinity = 'euclidean')
clabels_ap, cent_ap = cluster(ap,whrX)
whr_grps['AffinityPropagation'] = clabels_ap
cent_ap
plot_cluster('Log_GDP','Freedom',whr_grps, centers=cent_ap,title='Affinity Propagation',c='AffinityPropagation');
score_clusters(whrX,clabels_ap)
plot_boxolin('AffinityPropagation','Log_GDP',whr_grps);
The violin plot loses a fair bit of its aesthetic value when we ramp up the cluster count.
Gaussian Mixture¶
A probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
https://scikit-learn.org/stable/modules/mixture.html#mixture
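One advantage of a mixture model over K-means is soft assignment: predict_proba yields per-cluster membership probabilities rather than hard labels. A minimal sketch on synthetic blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=404)

gmm = GaussianMixture(n_components=3, init_params='kmeans', random_state=404).fit(X)

probs = gmm.predict_proba(X)   # shape (200, 3); each row sums to 1
hard = gmm.predict(X)          # argmax of the per-component probabilities

assert np.allclose(probs.sum(axis=1), 1)
assert (hard == probs.argmax(axis=1)).all()
```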
gm = GaussianMixture(n_components=3,init_params='kmeans', random_state=RS)
clabels_gm,cent_gm = cluster(gm,whrX)
whr_grps['GaussianMixture'] = clabels_gm
cent_gm
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_gm, title='Gaussian Mixture',c='GaussianMixture');
score_clusters(whrX,clabels_gm)
plot_boxolin('GaussianMixture','Log_GDP',whr_grps);
DBSCAN¶
Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them.
db = DBSCAN(eps=0.3)
clabels_db,cent_db = cluster(db,whrX)
whr_grps['DBSCAN'] = clabels_db
cent_db
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_db,title='DBSCAN',c='DBSCAN');
Running with a scaled version of the data failed to categorize any of the points at all; it considered every point to be noise.
The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.
The usage of centroid distance limits the distance metric to Euclidean space. A good value reported by this method does not imply the best information retrieval.
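Because DBSCAN labels outliers -1, scoring the raw labels mixes noise into these metrics. One common workaround (a sketch, not part of the original analysis) is to score only the non-noise points:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic blobs so DBSCAN finds clean clusters
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.3, random_state=404)

labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)

mask = labels != -1  # drop the noise points before scoring
if len(set(labels[mask])) > 1:
    score = silhouette_score(X[mask], labels[mask])
    print(round(score, 3))
```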
plot_boxolin('DBSCAN','Log_GDP',whr_grps);
Given that this dataset has relatively low density, this model had substandard performance relative to the other contenders.
HDBSCAN¶
Hierarchical Density-Based Spatial Clustering of Applications with Noise. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.
hd = hdbscan.HDBSCAN()
clabels_hd, cent_hd = cluster(hd,whrX)
whr_grps['HDBSCAN'] = clabels_hd
cent_hd
plot_cluster('Log_GDP','Corruption_Perception',whr_grps,centers=cent_hd,title='HDBSCAN',c='HDBSCAN');
score_clusters(whrX,clabels_hd)
plot_boxolin('HDBSCAN','Log_GDP',whr_grps);
Conclusions¶
This notebook explored the World Happiness Dataset using a total of 6 models:
K-means, Agglomerative Clustering, Affinity Propagation, Gaussian Mixture, DBSCAN, and HDBSCAN.
There was a somewhat significant degree of variation between the examined models, but those that were not prescribed a certain number of clusters arrived at 9 or 10 groups. With this many groupings, however, it became much more difficult to see exactly how a model was making clustering decisions.
For the models which we assigned a group count of 3, K-means, Agglomerative Clustering, and Gaussian Mixture, two diagonal or vertical lines could nearly be drawn between decision boundaries by way of GDP considerations.
Only one version of one model failed to perform at all (DBSCAN with scaled data); the rest found some form of suitable clustering. However, the boxplots showed that when models were given free rein over the number of clusters, they tended to have one cluster explain a large range of values and another explain an extremely tightly grouped set with many outliers.
Future work¶
One thought that was left untested was whether a previous model's labels in any way influenced successive models. Cluster columns were appended to whr_grps (a copy) while the models were fit on whrX, so leakage should not have occurred, but it would be worth verifying.
A good deal more EDA could be done on this dataset by looking at relationships between variables in a market basket analysis fashion. As always, there are many different hyperparameters that could still be experimented with, as well as other clustering models like Spectral, Ward, and MeanShift that might yield interesting results. Additionally, and perhaps most poignantly, only Log_GDP and Corruption_Perception were explored in the plots; the rest were left untouched, leaving a significant portion of the data not explored to its fullest.
References¶
Code & Docs:¶
- https://www.kaggle.com/dhanyajothimani/basic-visualization-and-clustering-in-python/notebook
- https://stackoverflow.com/a/42505051
- https://matplotlib.org/api/colorbar_api.html?highlight=colorbar#module-matplotlib.colorbar
- https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
- https://stackoverflow.com/questions/14777066/matplotlib-discrete-colorbar
- https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
- https://plot.ly/python/choropleth-maps/
- https://community.plot.ly/t/additional-measures-in-hover-text/5747
Text:¶
- https://www.kaggle.com/unsdsn/world-happiness#2016.csv
- https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
- https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
- https://scikit-learn.org/stable/modules/clustering.html
Other:¶
- https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
- https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87
- https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
- https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779