Dimensionality Reduction and Feature Analysis

25 minute read


This notebook will take a second look at the 2016 Kaggle Caravan Insurance Challenge.

This time around, we will look at dimensionality reduction with:

  • Principal Component Analysis (PCA)
  • t-SNE
  • UMAP

As well as a few methods to feature selection:

  • Stepwise feature selection
  • Recursive feature elimination
  • Feature importance analysis

performed on a Logistic Regressor and Random Forest Classifier where applicable.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import statsmodels.api as sm
import scipy.stats as ss
import joblib
from mlxtend.feature_selection import SequentialFeatureSelector
from openTSNE import TSNE # formerly fastTSNE
from openTSNE.callbacks import ErrorLogger
from umap import UMAP
C:\Users\Rygu\Anaconda3\envs\i4061\lib\site-packages\sklearn\externals\joblib\__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
In [2]:
RS = 404 # random_state
IS_AWS = False # conditional calls when running locally vs on Amazon Web Service
In [3]:
if not IS_AWS:
    with warnings.catch_warnings():
        # Ignore UserWarning sklearn 0.20.3 vs 0.21 
        lbfgs_bwd = joblib.load('saves/lbfgs_backward.pkl')
        lbfgs_fwd = joblib.load('saves/lbfgs_forward.pkl')
        lbfgs_pfwd = joblib.load('saves/lbfgs_pforward.pkl')
        rfc_bwd = joblib.load('saves/rf_backward.pkl')
        rfc_fwd = joblib.load('saves/rf_forward.pkl')


The dataset used is from the CoIL Challenge 2000 datamining competition. It may be obtained from: https://www.kaggle.com/uciml/caravan-insurance-challenge

It contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes.

A description of each variable may be found at the link listed above.

Each number corresponds with a certain key, specific to each variable. There are 5 levels of keys, L0-L4 each key represents a different group range. As a sample:

L0 - Customer subtype (1-41)
    1: High Income, expensive child
    2: Very Important Provincials
    3: High status seniors
L1 - average age keys (1-6):
    1: 20-30 years 
    2: 30-40 years 
    3: 40-50 years 
L2 - customer main type keys (1-10):
    1: Successful hedonists
    2: Driven Growers
    3: Average Family
L3 - percentage keys (0-9):
    0: 0%
    1: 1 - 10%
    2: 11 - 23%
    3: 24 - 36%
L4 - total number keys (0-9):
    0: 0
    1: 1 - 49
    2: 50 - 99
    3: 100 - 199

The variable descriptions are quite important as it appears as though the variable names themselves are abbreviated in Dutch. One helpful pattern to notice is the letters that variables begin with:

  • M - demographic statistics of the postal code
  • A - numerical representation of product ownership insurance statistics
  • P - percentage representation of product ownership insurance statistics


P. van der Putten and M. van Someren (eds) . CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000.

In [4]:
coil2k_raw = pd.read_csv('data/caravan-insurance-challenge.csv')
coil = coil2k_raw.drop(columns=['ORIGIN']).copy()
X,y = coil.drop(columns='CARAVAN'), coil.CARAVAN

Exploratory Data Analysis

No NA values, all variables are type of int64. The data is peculiar in that every numeric value stands for an attribute of a postal code demographic. Even variables that could be continuous, such as income, have been binned. In this sense, this dataset is entirely comprised of Categorical and Ordinal values. Other than potential collinearity between percentage and range values, the data is mostly clean.

In contrast with the previous notebook, this time around we will not try and mimic competition settings and instead just use the whole dataset throughout since we are more focused on describing the data rather than predicting.

In this section, we must be careful not to read to closely into the outputs. We are dealing with demographic data, so do not have the level of granularity require to make claims about the exact type of person that would buy caravan insurance. At best, we can say that a person who lives in a postal code where most people have these attributes may be more inclined to purchase caravan insurance.

Dimensionality Reduction

In [5]:
# plotting helper function
def scatter_density(data, labels, sca_title='', den_title='', **kwargs):
    """Plot a scatter plot and a density plot
        data : 2-d array, shape (n_samples,2)
        labels: array-like, class labels to be used for coloring scatter plot
        sca_title: str, scatter plot title
        den_title: str, density plot title
        **kwargs: keyword arguments passed to seaborn.kdeplot
        ax, matplotlib axis object
    fig, ax = plt.subplots(1,2,figsize=(10,4),sharey=True,sharex=True)#,gridspec_kw={'width_ratios':[48,48,4]})
    dataneg = data[labels == 0]
    datapos = data[labels == 1]
    sns.scatterplot(data[:,0], data[:,1],hue=labels, ax=ax[0])
    #sns.scatterplot(dataneg[:,0], dataneg[:,1], palette='Blues', ax=ax[0],alpha=0.06)
    #sns.scatterplot(datapos[:,0], datapos[:,1], palette='Oranges', ax=ax[0],alpha=1)

    sns.kdeplot(datapos[:,0], datapos[:,1], ax=ax[1], cmap='Oranges',**kwargs) #,cbar=True,cbar_ax=ax[2])
    sns.kdeplot(dataneg[:,0], dataneg[:,1], ax=ax[1], cmap='Blues',n_levels=30,**kwargs,shade=True,shade_lowest=False)#,cbar=True,cbar_ax=ax[2])

    return ax

PCA (Principal component analysis)

Since PCA is effected by differences in magnitude well begin by scaling the data.

In [6]:
Xs = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
In [7]:
pca = PCA(random_state=RS)

As a brief demonstration, we can see what happens when attempting perform PCA without proper scaling.

In [8]:
_Xpca_raw = PCA(n_components=2, random_state=RS).fit_transform(X)
scatter_density(_Xpca_raw, y, 'PCA Scatter Unscaled', 'PCA Density Unscaled');

The density plot shows a clear separation between two groups and even the scatter plot shows some degree of misleading grouping. Properly scaled, things will look quite a bit different.

In [9]:
scatter_density(Xpca, y, 'PCA Scaled: Scatter', 'PCA Scaled: Density');

Now we are dealing with the accurate representation of the data, an amorphous point mass. This is why it is so important to check that the assumptions are model are met, otherwise it is all too easy to head down a path leading to dead ends or invalid conclusions.

In [10]:
array([0.11035515, 0.05773411])

About 16% of variance can be explained by these first two abstract components.

In [11]:
plt.annotate('(64, 0.993)', xy=(64, 0.993), xytext=(64,0.8), fontsize='medium',arrowprops={'arrowstyle':'->','mutation_scale':15})
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.title('Explained variance')

From this plot, we can see that about 50% of our variance is explained with the first 16 components. By 40 components, around 90% of variance is explained and at ~60 components, just about all variance is accounted for.

In the previous analysis, we discussed that having both a percentage and numeric representation of the same data was likely redundant and unnecessary for modeling. There are 21 number/percent variables with assumed overlapped, 99.3% of variance can be explained with 64 (85-21) components, this further supports our previous hypothesis. If the point selection seems somewhat arbitrary, know there is some method to the madness. The 64th component marks the first occurrence where adding an additional component does not show any change to the 3rd decimal.

t-SNE (T-distributed Stochastic Neighbor Embedding)

In contrast with PCA, t-SNE is a non-linear dimensionality reduction technique that maps data in 2 or 3 dimensions in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The largest downside to t-SNE is that it runs quite slowly, running in quadric time prior to optimization. There have been some advances by way of Barnes-Hut implementations and Fast Fourier Transform interpolation but it is still orders of magnitude slower than PCA.

In [12]:
tsne = TSNE(perplexity=75, learning_rate=500, n_iter=1000, metric='euclidean', 
            negative_gradient_method='bh', n_jobs=4, callbacks=ErrorLogger(), random_state=RS)
Xembd = tsne.fit(Xs)
Iteration   50, KL divergence  4.4702, 50 iterations in 2.7536 sec
Iteration  100, KL divergence  4.4649, 50 iterations in 2.7606 sec
Iteration  150, KL divergence  4.4645, 50 iterations in 2.9711 sec
Iteration  200, KL divergence  4.4643, 50 iterations in 3.2134 sec
Iteration  250, KL divergence  4.4642, 50 iterations in 3.0040 sec
Iteration   50, KL divergence  2.0479, 50 iterations in 2.6928 sec
Iteration  100, KL divergence  1.7084, 50 iterations in 2.7117 sec
Iteration  150, KL divergence  1.5606, 50 iterations in 2.7586 sec
Iteration  200, KL divergence  1.4822, 50 iterations in 2.7945 sec
Iteration  250, KL divergence  1.4382, 50 iterations in 2.8703 sec
Iteration  300, KL divergence  1.4125, 50 iterations in 2.9611 sec
Iteration  350, KL divergence  1.3957, 50 iterations in 2.8963 sec
Iteration  400, KL divergence  1.3847, 50 iterations in 3.0688 sec
Iteration  450, KL divergence  1.3764, 50 iterations in 2.9352 sec
Iteration  500, KL divergence  1.3705, 50 iterations in 2.9571 sec
Iteration  550, KL divergence  1.3651, 50 iterations in 3.0449 sec
Iteration  600, KL divergence  1.3614, 50 iterations in 3.2054 sec
Iteration  650, KL divergence  1.3579, 50 iterations in 3.0937 sec
Iteration  700, KL divergence  1.3553, 50 iterations in 2.9371 sec
Iteration  750, KL divergence  1.3523, 50 iterations in 3.1635 sec
Iteration  800, KL divergence  1.3501, 50 iterations in 3.0538 sec
Iteration  850, KL divergence  1.3485, 50 iterations in 2.9910 sec
Iteration  900, KL divergence  1.3471, 50 iterations in 2.8923 sec
Iteration  950, KL divergence  1.3454, 50 iterations in 2.9172 sec
Iteration  1000, KL divergence  1.3443, 50 iterations in 2.8793 sec

Here we have chosen to set negative_gradient_method='bh' to use the Barnes-Hut algorithm over the interpolation method due to the number of samples present in the data set. Generally, until the observations begins to exceed ~10k the overhead required to set up interpolation outweighs the efficiency gains.

In [13]:
scatter_density(Xembd, y,'t-SNE scatter','t-SNE density');

Although we do begin to see some small clusters taking shape, one must exercise caution before making any definitive claims. Depending on parameter choice, t-SNE has been shown to spuriously cluster entirely random data. In any case, the highest density areas overlap between positive and negative samples and there are only a few small pockets where they have successfully separated.

UMAP (Uniform Manifold Approximation and Projection)

UMAP is a relatively recent development in non-linear dimensionality reduction. Its authors claims arguably better clustering and substantially faster performance. From the analysis below, the latter has proven true and the former is up to interpretation.

In [14]:
ump = UMAP(n_neighbors=30, min_dist=0.2, random_state=RS, verbose=True)
Xumap = ump.fit_transform(Xs,y)
UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
     learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=0.2, n_components=2, n_epochs=None,
     n_neighbors=30, negative_sample_rate=5, random_state=404,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=True)
Construct fuzzy simplicial set
Tue Jun 18 11:10:17 2019 Finding Nearest Neighbors
Tue Jun 18 11:10:17 2019 Building RP forest with 10 trees
Tue Jun 18 11:10:20 2019 NN descent for 13 iterations
	 0  /  13
	 1  /  13
	 2  /  13
	 3  /  13
	 4  /  13
Tue Jun 18 11:10:34 2019 Finished Nearest Neighbor Search
Tue Jun 18 11:10:39 2019 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Tue Jun 18 11:11:29 2019 Finished embedding
In [15]:
scatter_density(Xumap, y, 'UMAP: Scatter', 'UMAP: Density');

UMAP managed to tightly cluster most positive samples distinct from the primary negative grouping, of course, by passing the labels to the algorithm we have effectively turned this into a supervised problem trained on the full dataset.

In [16]:
ump = UMAP(n_neighbors=30, min_dist=0.2, random_state=RS, verbose=False)
Xumap = ump.fit_transform(Xs)
In [17]:
scatter_density(Xumap, y, 'UMAP: Scatter', 'UMAP: Density');

Unsurprisingly, depriving the model of the target values dramatically changes the output. The separation seen in the previous has entirely vanished and the highest density areas are now heavily overlapping.

This problem is not one that can likely be solved by using dimensionality reduction techniques such as these. There are a few points working against us when trying to solve using these types of approaches:

  1. The data lacks the appropriate level granularity to find latent space. We may see better results if the data modeled an individual person rather than a demographic statistic.
  2. It is a heavily imbalanced dataset which means even less subtext can be gleaned from the already coarse data.
  3. The differences between the demographics data of those who purchased caravan insurance and those who did not is rather minimal.


Stepwise feature selection

2 Models will be used, both from the sci-kit learn library, a RandomForestClassifer and a LogisticRegression model.

Stepwise selection is an iterative process that repeatedly constructs new models by adding or removing a single predictor and observing the impact on a specified criterion. A threshold can be specified to alter the process by which candidate features are kept or discarded.

In [18]:
# minor QoL tweaks to mlxtend.plotting.plot_sequential_feature_selection
def plot_sfs(selector, title='', kind='std_dev', color='blue', bcolor='steelblue', marker='o', 
             alpha=0.2, ylabel=None, confidence_interval=0.95, ax=None, **kwargs):
    """Plot feature selection results.

    selector : a fitted mlxtend.SequentialFeatureSelector object
    title : str (default: '')
        plot title
    kind : str (default: "std_dev")
        The kind of error bar or confidence interval in
        {'std_dev', 'std_err', 'ci', None}.
    color : str (default: "blue")
        Color of the lineplot (accepts any matplotlib color name)
    bcolor : str (default: "steelblue").
        Color of the error bars / confidence intervals
        (accepts any matplotlib color name).
    marker : str (default: "o")
        Marker of the line plot
        (accepts any matplotlib marker name).
    alpha : float in [0, 1] (default: 0.2)
        Transparency of the error bars / confidence intervals.
    ylabel : str (default: None)
        Y-axis label, if not overriden, will be set to ``selector``.scoring
    confidence_interval : float (default: 0.95)
        Confidence level if `kind='ci'`.
    ax : matplotlib Axes, optional
        Axes object to draw the plot onto, otherwise uses the current Axes.  
    **kwargs : key, value mappings
        Other keyword arguments are passed through to ``matplotlib.pyplot.subplots``.

    fig, ax : matplotlib.pyplot subplot objects
        Figure and axis elements of the subplot.

    For usage examples, please see


    allowed = {'std_dev', 'std_err', 'ci', None}
    if kind not in allowed:
        raise AttributeError('kind not in %s' % allowed)

    fig, ax = (plt.gcf(), ax) if ax is not None else plt.subplots(**kwargs)
    metric_dict = selector.get_metric_dict()
    k_feat = sorted(metric_dict.keys())
    avg = [metric_dict[k]['avg_score'] for k in k_feat]

    if kind:
        upper, lower = [], []
        if kind == 'ci':
            kind = 'ci_bound'

        for k in k_feat:
            upper.append(metric_dict[k]['avg_score'] + metric_dict[k][kind])
            lower.append(metric_dict[k]['avg_score'] - metric_dict[k][kind])

        ax.fill_between(k_feat, upper, lower, alpha=alpha, color=bcolor, lw=1)

        if kind == 'ci_bound':
            kind = 'Confidence Interval (%d%%)' % (confidence_interval * 100)

    ax.plot(k_feat, avg, color=color, marker=marker)
    ax.set_ylabel(ylabel if ylabel else selector.scoring)
    ax.set_xlabel('Number of Features')
    feature_min = len(metric_dict[k_feat[0]]['feature_idx'])
    feature_max = len(metric_dict[k_feat[-1]]['feature_idx'])
    ax.set_xticks(range(feature_min, feature_max + 1), range(feature_min, feature_max + 1))
    return ax

p-value criterion

Define a stepwise p-value based feature selection class. The class is a heavily-modified recursive implementation of this stackexchange post which had a troublesome while True statement causing issues with infinite looping.

In [19]:
class StepSelect:
    """Perform a forward-backward feature selection based on p-value from statsmodels.api.OLS
        X : pandas.DataFrame, containing candidate features
        y : list-like, target/response
        threshold_in : float, if p-value < threshold_in, include feature
        threshold_out : float, if p-value > threshold_out exclude feature
        verbose : boolean, if True print the sequence of inclusions and exclusions
        list : selected features     
    See also:
    def __init__(self, X, y, threshold_in=0.01, threshold_out=0.05, verbose=True):
        self.X = X
        self.y = y 
        self.threshold_in = threshold_in
        self.threshold_out = threshold_out
        self.verbose = verbose
        assert self.threshold_in < threshold_out, "inclusive p-value should be smaller than exclusive"
        self._init_list = []
        self._dlist=[] # dropped values
        self._updated = False # sentinel value
    def __call__(self):
        return self._step(self.X, self.y, self._init_list)
    def _drop_list(self,add_drop=None):
        """Tracks dropped values"""
        if add_drop is not None:
        return self._dlist
    def _forward(self, included, excluded):
        # forward step
        new_pval = pd.Series(index=excluded)
        for new_col in excluded:
            model = sm.OLS(self.y, sm.add_constant(pd.DataFrame(self.X[included+[new_col]]))).fit()
            new_pval[new_col] = model.pvalues[new_col] # build Series - feat : pvalue

        best_pval = new_pval.min()
        if best_pval < self.threshold_in:
            feat_added = new_pval.idxmin() # name of feature with lowest p-value
            if self.verbose:
                print('Add  {:30} with p-value {:.6}'.format(feat_added, best_pval))
        return included
    def _backward(self, included):
        # backward step
        model = sm.OLS(self.y, sm.add_constant(pd.DataFrame(self.X[included]))).fit()
        pvalues = model.pvalues.iloc[1:] 

        worst_pval = pvalues.max()
        if worst_pval > self.threshold_out:
            feat_removed = pvalues.idxmax() # name of feature with highest p-value
            self._drop_list(feat_removed) # add to drop list
            included.remove(feat_removed) # remove from inclusions
            self._updated = True
            if self.verbose:
                print('Drop {:30} with p-value {:.6}'.format(feat_removed, worst_pval))
        return included
    def _step(self, X, y, included):
        included = list(included)
        excluded = list(set(X.columns)-set(included))

        included = self._forward(included, excluded)
        included = self._backward(included)

        # sanity check for infinite loop, if any value has been dropped 
        # more than twice the number of unique values, return included
        drop_vc = pd.Series(self._drop_list()).value_counts()
        is_inf_loop = (drop_vc > 2*len(drop_vc)).any()
        if self._updated and not is_inf_loop:
            return self._step(X, y, included) 
        return np.array(included)
In [20]:
with warnings.catch_warnings():
    warnings.simplefilter('ignore',FutureWarning) # Ignore FutureWarning numpy.ptp  
    selected_pval = StepSelect(Xs,y,threshold_in=0.01,threshold_out=0.05)()
Add  PPERSAUT                       with p-value 2.14684e-42
Add  MKOOPKLA                       with p-value 1.36739e-21
Add  PWAPART                        with p-value 3.66711e-15
Add  APLEZIER                       with p-value 8.20766e-15
Add  MOPLHOOG                       with p-value 4.25236e-06
Add  PBRAND                         with p-value 3.92829e-06
Add  MBERBOER                       with p-value 8.31838e-06
Add  MRELGE                         with p-value 1.41977e-05
Add  PWALAND                        with p-value 0.000361295
Add  ABRAND                         with p-value 0.000937601
Add  AZEILPL                        with p-value 0.00153041
Add  MINK123M                       with p-value 0.00152554
Add  PBYSTAND                       with p-value 0.00243579
Add  PGEZONG                        with p-value 0.00485648
Add  AGEZONG                        with p-value 0.00450709
Add  MHHUUR                         with p-value 0.00630075

Now that we've gone through and wrapped the function in a nice reusable class, rest assured, we will never be using it again. There is a reason that no popular python packages implement p-value based stepwise selection, and it's certainly not because it's a novel idea. Stepwise selection is rarely a good idea and basing it on p-values is much worse.

Traditionally, stepwise selection uses F-tests/t-tests, AIC/BIC, or Adjusted $R^2$ for feature selection criterion, and even then, it is still prone to overfitting and can provide biased, if not completely flawed results especially in the presence of multicollinearity.

By using p-values as our criterion, not only do we have the aforementioned issues, now we introduce the Multiple Comparisons Problem, so it stands to reason that the features selected by this method disagree more so than the others.

Stepwise selection ROC AUC

As an alternative to p-value, we can instead directly optimize for a chosen evaluation metric, in this case roc_auc.

In [21]:
lr_lbfgs = LogisticRegression(class_weight='balanced', solver='lbfgs', max_iter=1000)

We'll be using the lbfgs solver here due to its performance on larger datasets. As a brief note, experimentation with liblinear was also previously performed but was removed after finding a negligible improvement in AUC and having a significantly longer run time. All timings below were carried on a c5.4xlarge instance (16 core, 32gb ram) from AWS, attempting these locally on a 4 core machine took well over 30 minutes per linear selector.

LBFGS Forward Best
In [15]:
lbfgs_fwd = SequentialFeatureSelector(lr_lbfgs, k_features="best", forward=True, verbose=0, scoring='roc_auc', n_jobs=-1)
lbfgs_fwd = lbfgs_fwd.fit(Xs, y, custom_feature_names=Xs.columns)
CPU times: user 21.2 s, sys: 1.85 s, total: 23.1 s
Wall time: 2min 58s
In [22]:
print('Forward Best, nfeat={} (AUC: {:.3f}):\n{}'.format(
    len(lbfgs_fwd.k_feature_names_), lbfgs_fwd.k_score_, lbfgs_fwd.k_feature_names_))
Forward Best, nfeat=29 (AUC: 0.755):
In [23]:
plot_sfs(lbfgs_fwd, 'lbfgs : Forward best', figsize=(6,4));
In [24]:
pd.DataFrame(lbfgs_fwd.get_metric_dict()).T.query('avg_score > 0.75').head(10)
avg_score ci_bound cv_scores feature_idx feature_names std_dev std_err
14 0.750819 0.011137 [0.750527368112114, 0.764957264957265, 0.75303... (10, 15, 20, 31, 36, 40, 42, 43, 45, 46, 58, 6... (MRELSA, MOPLHOOG, MBERBOER, MAUT1, MINKM30, M... 0.00866497 0.00433248
15 0.75156 0.0112257 [0.751123523369286, 0.7660030819207865, 0.7539... (10, 15, 20, 31, 36, 40, 42, 43, 45, 46, 52, 5... (MRELSA, MOPLHOOG, MBERBOER, MAUT1, MINKM30, M... 0.00873395 0.00436697
16 0.752097 0.0116983 [0.7516646489104116, 0.7670743501820925, 0.755... (1, 10, 15, 20, 31, 36, 40, 42, 43, 45, 46, 52... (MAANTHUI, MRELSA, MOPLHOOG, MBERBOER, MAUT1, ... 0.0091017 0.00455085
17 0.752578 0.0113802 [0.7512610976594027, 0.7675394148052513, 0.754... (1, 10, 15, 20, 31, 36, 40, 42, 43, 45, 46, 52... (MAANTHUI, MRELSA, MOPLHOOG, MBERBOER, MAUT1, ... 0.00885417 0.00442709
18 0.753024 0.011405 [0.7458246202949592, 0.7683237775278923, 0.755... (0, 1, 10, 15, 20, 31, 36, 40, 42, 43, 45, 46,... (MOSTYPE, MAANTHUI, MRELSA, MOPLHOOG, MBERBOER... 0.00887347 0.00443674
19 0.75353 0.0119217 [0.7440361545234426, 0.7693927320348544, 0.755... (0, 1, 10, 15, 20, 31, 36, 40, 42, 43, 45, 46,... (MOSTYPE, MAANTHUI, MRELSA, MOPLHOOG, MBERBOER... 0.00927553 0.00463776
20 0.753897 0.0120354 [0.744311303103676, 0.7697444226951536, 0.7565... (0, 1, 10, 15, 20, 31, 36, 40, 42, 43, 45, 46,... (MOSTYPE, MAANTHUI, MRELSA, MOPLHOOG, MBERBOER... 0.00936396 0.00468198
21 0.754208 0.0131567 [0.7456732885758309, 0.7710378113734908, 0.757... (0, 1, 10, 15, 17, 20, 31, 36, 40, 42, 43, 45,... (MOSTYPE, MAANTHUI, MRELSA, MOPLHOOG, MOPLLAAG... 0.0102364 0.00511818
22 0.754406 0.0124859 [0.7452926663731749, 0.7708064359390836, 0.757... (0, 1, 10, 15, 17, 20, 31, 36, 40, 42, 43, 45,... (MOSTYPE, MAANTHUI, MRELSA, MOPLHOOG, MOPLLAAG... 0.00971448 0.00485724
23 0.754573 0.0135995 [0.7465835717954362, 0.7722802974562585, 0.758... (0, 1, 10, 15, 17, 20, 31, 36, 40, 42, 43, 45,... (MOSTYPE, MAANTHUI, MRELSA, MOPLHOOG, MOPLLAAG... 0.0105809 0.00529043

Even though 32 features was selected as the optimal value, we can see that only 14 features are required to reach a roc_auc of 0.75. The question then becomes a matter of simplicity or accuracy, does the use case warrant doubling the feature set in exchange for a 0.6% increase in score? If not, SFS provides an out of the box method to optimize for smaller feature sets.

LBFGS Forward parsimonious
In [19]:
lbfgs_pfwd = SequentialFeatureSelector(lr_lbfgs, k_features="parsimonious", forward=True, verbose=0, scoring='roc_auc', n_jobs=-1)
lbfgs_pfwd = lbfgs_pfwd.fit(Xs, y, custom_feature_names=Xs.columns)
CPU times: user 21.9 s, sys: 1.39 s, total: 23.3 s
Wall time: 2min 45s
In [25]:
print('Forward Parsimonious, nfeat={} (AUC: {:.3f}):\n{}'.format(
    len(lbfgs_pfwd.k_feature_names_),lbfgs_pfwd.k_score_, lbfgs_pfwd.k_feature_names_))
Forward Parsimonious, nfeat=19 (AUC: 0.754):

In setting k_features='parsimonious' we are effectively attempting to minimize the the number of features while maximizing the score or, as the documentation states, "the smallest feature subset that is within one standard error of the cross-validation performance will be selected".

LBFGS backward best
In [21]:
lbfgs_bwd = SequentialFeatureSelector(lr_lbfgs, k_features="best", forward=False, verbose=0, scoring='roc_auc', n_jobs=-1)
lbfgs_bwd = lbfgs_bwd.fit(Xs, y, custom_feature_names=Xs.columns)
CPU times: user 45.3 s, sys: 14.2 s, total: 59.5 s
Wall time: 8min 41s
In [26]:
print('Backward Best, nfeat={} (AUC: {:.3f}):\n{}'.format(
    len(lbfgs_bwd.k_feature_names_),lbfgs_bwd.k_score_, lbfgs_bwd.k_feature_names_))
Backward Best, nfeat=35 (AUC: 0.755):

Backwards stepping with lbfgs produced the largest feature set thus far saw no metric improvement to compensate. This likely has to do with the model initially failing to converge, a symptom not present when starting with few features and incrementally adding more.

In [27]:
plot_sfs(lbfgs_bwd,'lbfgs : Backward Best',ylabel='AUC',figsize=(6,4));

Now that we've had our fun with stepwise selection on linear models, we'll move on to a non linear method, namely, random forests.

In [28]:
# Obtained from RandomizedSearchCV. Omitted: out of scope for analysis
rfparams = {
    'n_estimators': 891,
    'min_samples_split': 2,
    'min_samples_leaf': 4,
    'max_features': 'auto',
    'max_depth': 3,
    'bootstrap': False,
In [29]:
rfc = RandomForestClassifier(**rfparams)
Random Forest forward best
In [30]:
rfc_fwd = SequentialFeatureSelector(rfc, scoring='roc_auc', verbose=0, k_features="best", forward=True, n_jobs=-1)
rfc_fwd = rfc_fwd.fit(X.values, y, custom_feature_names=X.columns)

print('Forward RF Best, nfeat={} (AUC: {:.3f}):\n{}'.format(
    len(rfc_fwd.k_feature_names_),rfc_fwd.k_score_, rfc_fwd.k_feature_names_))
Forward RF Best, nfeat=24 (AUC: 0.765):
CPU times: user 22.6 s, sys: 892 ms, total: 23.5 s
Wall time: 1h 3min 46s

Perhaps the most striking difference between a random forest based approach and logistic regression is the number of features the model determined to be significant. The maximum roc_auc achieved decreased by ~0.07 but only 6 features were selected for the resultant best model.

In [30]:
plot_sfs(rfc_fwd,'RF : Forward Best', figsize=(6,4));

A quick illustration reveals how the selection took place, quickly shooting up to the best score, plateauing, and then beginning to degrade. It should be noted that there is a significantly wider error band compared to previous models, factoring this into consideration, the up tick at ~50-55 features begins to look more trustworthy.

Random Forest backward best
In [36]:
rfc_bwd = SequentialFeatureSelector(rfc, scoring='roc_auc', verbose=0, k_features="best", forward=False, n_jobs=-1)
rfc_bwd = rfc_bwd.fit(X.values, y, custom_feature_names=X.columns)

print('Backward RF Best, nfeat={} (AUC: {:.3f}):\n{}'.format(
    len(rfc_bwd.k_feature_names_),rfc_bwd.k_score_, rfc_bwd.k_feature_names_))
Backward RF Best, nfeat=30 (AUC: 0.764):
CPU times: user 49.5 s, sys: 868 ms, total: 50.4 s
Wall time: 1h 28min 34s

Taking a backward stepping approach yields even few features at the cost of 0.006 off the final evaluation. The inconsistencies between forward and backward selection does not shine a favorable light on this method of feature selection. Since this is not an exhaustive search over all feature combinations, we are greatly limited by selection order.

In [31]:
if IS_AWS:

Recursive Feature Elimination

This method is very similar to the approach that SequentialFeatureSelector uses but rather than specifying a metric to test against, RFE will use weight coefficients in the case of linear models and feature importance for tree-based models.

In [32]:
rfe_lr = RFE(LogisticRegression(solver='lbfgs',class_weight='balanced',random_state=RS,max_iter=1000)).fit(Xs,y)
print('Num features:',rfe_lr.n_features_)
feats_lr = X.columns[rfe_lr.support_];feats_lr
Num features: 42
       'MAUT1', 'MAUT2', 'MAUT0', 'MZFONDS', 'MZPART', 'MINKM30', 'MINK3045',
In [33]:
rfe_rf = RFE(RandomForestClassifier(n_jobs=-1, **rfparams)).fit(X,y)
print('Num features:',rfe_rf.n_features_)
feats_rf = X.columns[rfe_rf.support_];feats_rf
Num features: 42
       'MINKM30', 'MINK3045', 'MINK4575', 'MINKGEM', 'MKOOPKLA', 'PWAPART',
In [34]:
print('Common Features:',len(feats_lr & feats_rf))
feats_lr & feats_rf
Common Features: 26
       'MZPART', 'MINKM30', 'MINK3045', 'MINK4575', 'MINKGEM', 'MKOOPKLA',

class_weight is set to balanced to help deal with the imbalance of this dataset since this notebook we forwent any form of undersampling. RFE with both Logistic Regression and Random Forest Classification left us with 42 features to consider. However, there is only an overlap of 26 features between the two sets, a clear demonstration of how using multiple models can better an analysis.

In [35]:
kb_chi = SelectKBest(score_func=chi2, k='all').fit(X,y)

There are two potential problems with choosing Chi-Square with SelectKBest. First, though the dataset is technically all categorical variables, some represent binned continuous values, which is not ideal. Second, we did not address issues with multicollinearity before this test so the results could potentially be off. However, since this will just be used as a means of comparison and not a direct indicator, it is fine for our purposes.

Feature Importance

In sklearn, the default measure of feature importance for random forests classifiers is measure by gini-impurity. More specifically, it is the total decrease in node impurity, weighted by the proportion of samples reaching that node, averaged over all trees. This approach is not without its limitations, it has a tendency to weight higher cardinality variable more heavily than others, for instance. In the next notebook, we will be addressing this idea in greater depth using SHAP, permutation importance, and LIME among others.

In [36]:
rfc = RandomForestClassifier(n_jobs=-1, **rfparams).fit(X,y)
In [37]:
fi = pd.DataFrame({'feature':X.columns,'importance':rfc.feature_importances_,'pvalue':kb_chi.pvalues_}).sort_values('importance', ascending=False)
In [38]:
feature importance pvalue
46 PPERSAUT 0.189921 7.450789e-118
67 APERSAUT 0.135713 1.284432e-24
58 PBRAND 0.096029 4.055311e-40
43 PWAPART 0.067278 9.737212e-27
42 MKOOPKLA 0.052901 2.037681e-21
In [39]:
In [40]:
In [41]:
sns.relplot('feature','importance',data=fi[:42],hue='pvalue', palette='coolwarm',
            aspect=1.75, hue_norm=(0,0.05)).set_xticklabels(rotation=90);

3 features, PPERSAUT, APERSAUT, and PBRAND have far more importance than any others, alone they represent about 40% of over all importance of the classifier. Save for a few exceptions, the same features that were important to the Random Forest also had very low p-values with our KBest chi-square selector.

One interesting thing to note is PPERSAUT and APERSAUT are different forms of the same data, both pertaining to car policies. So, in reality car policies is the likely strongest indicator of who will purchase caravan insurance.

In [42]:
g = sns.lineplot(data=np.cumsum(fi.importance).values)
g.set_xlabel('Num Features')
g.set_ylabel('Cumulative Importance')
g.set_title('Random Forest Feature Importance');

Though there still exists some commonality between the cumulative sum of explained variance and feature importance, the initial slope is significantly steeper. The 90% threshold for feature importance occurred a full 20 features prior to the variance and only ~40 features were required to account for nearly all importance compared to 60 for explained variance.

In [44]:
seqsels = [lbfgs_bwd,lbfgs_fwd, lbfgs_pfwd, rfc_bwd,rfc_fwd]
selfeats = [np.array(s.k_feature_names_) for s in seqsels]
allfeats = [x.astype('object') for x in [*selfeats,selected_pval,feats_lr.values,feats_rf.values]]
feats_ser = pd.Series([i for s in allfeats for i in s if i])
In [49]:
PBRAND      8
MRELGE      7
MINK123M    7
MINKM30     6
dtype: int64

A total of 4 features (PPERSAUT, PBRAND, MBERBOER, and APLEZIER) were present in all 8 models. If we are to trust the 'wisdom of the crowd', so to speak, these features should be given most attention when attempting to determine who is most likely to purchase caravan insurance.

In [50]:
print(len(feats_ser.value_counts()[feats_ser.value_counts() == 1])) # only once occurrance
X.columns ^ feats_ser.unique()

These 11 features were never chosen in any of the various feature selection models, and 18 others only were included once. Following the same logic, it may be safe to factor these attributes into less consideration when differentiating prospective caravan customers.


This notebook was a complement to, or extension of, the previous previous notebook looking at the CoIL 2000 dataset. The focus this time was more on finding interesting features rather than trying many models. We used p-values from Ordinary Least Squares to perform stepwise feature selection and found 16 statistically significant features.

Thereafter, we performed PCA on both a non-standardized and standardized dataset and plotted scatter and density for each. Following that, we used two non-linear dimensionality reduction techniques, t-SNE and UMAP and looked at how those faired with the data.

We closeout with by taking a look at feature importances with a Random Forest model and doing some Recursive Feature Elimination on both that and a Logistic Regressor. We find that the two models share a handful of features in common after elimination, both founding 42 variables worth keeping but only 26 of which were common between them.

Across all models we found 4 features to consistently appear: PPERSAUT, PBRAND, MBERBOER, and APLEZIER. Additionally, 11 features were never included in any selection technique and 18 only appeared once.

Future work:

There are quite a few other dimensionality reduction techniques that could be tried, ISOMAP, SOM, Sammson mapping, and many more, all of which could provide more insight into the data. The question still remains as to whether or not the presumed redundant variables can be removed without negatively impacting predictive power. This should be explored in great depth now that we have more evidence supporting the claim. Additionally, it would be interesting to see if some form of semi-supervised learning approach could be used via k-means clustering or a similar approach.