Python Counterpart

Overview

This notebook will take a second look at the 2016 Kaggle Caravan Insurance Challenge.

This time around, EDA will include:

  • Stepwise feature selection based on p-values
  • Principal component analysis
  • t-SNE (optional)
  • UMAP (optional)

As well as modeling with:

  • Recursive feature elimination
  • Feature importance analysis

performed on a Logistic Regressor and Random Forest Classifier where applicable.

Imports

library(tidyverse)
library(caret)
library(MASS)
library(SignifReg) # p-value stepwise feature selection
library(factoextra) # pretty PCA plot
# setup for parallel processing
library(doParallel)
cl <- makePSOCKcluster(8)
registerDoParallel(cl)

Data

The dataset used is from the CoIL Challenge 2000 datamining competition. It may be obtained from: https://www.kaggle.com/uciml/caravan-insurance-challenge

It contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data derived from zip area codes.

A description of each variable may be found at the link listed above.

The variable descriptions are quite important, as the variable names themselves appear to be abbreviations of Dutch words. One helpful pattern to notice is the letter each variable begins with:

  • M - primary demographics; no guess for the abbreviation
  • A - numbers, likely from the Dutch word aantal
  • P - percentages, likely from the Dutch word procent

Acknowledgements:

P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000.

head(cic <- read.csv('data/caravan-insurance-challenge.csv'))
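
As a quick sanity check on the prefix pattern described above, we can tally the first letter of every predictor name (a small sketch; it assumes only that the column names follow the documented scheme):

# Count predictor names by their first letter (the M/A/P pattern noted above)
table(substr(setdiff(names(cic), c("ORIGIN", "CARAVAN")), 1, 1))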

Exploratory Data Analysis

No NA values, and all variables are integers. The data is peculiar in that every numeric value stands for an attribute of a person. Even variables that could be continuous, such as income, have been binned. In this sense, the dataset is composed entirely of categorical and ordinal values. Other than potential collinearity between the percentage-coded and count-coded variables, the data is mostly clean.
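
Those claims are easy to verify directly (a quick sketch; output omitted):

# No missing values anywhere in the frame
sum(is.na(cic))
# Every column is an integer except the ORIGIN train/test flag
table(sapply(cic, class))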

In contrast with the previous notebook, this time around we will not try to mimic competition settings; instead we will use the whole dataset throughout, since we are more focused on describing the data than on predicting.

cic_data <- subset(cic, select=-c(ORIGIN))

Stepwise Feature Selection

We will be using the SignifReg package to carry out both forward and backward stepwise feature selection. This process iteratively builds a linear model by examining how the model responds to modifications in the predictor variable set. The parameter alpha serves as our threshold p-value, and adjust.method specifies the correction applied to the p-values to help mitigate the multiple comparisons problem.

In the forward case, the model is built up one variable at a time, each step selecting the variable whose addition resulted in the lowest overall p-value. The process continues until there are no possible additions that can be made that satisfy the p-value threshold criterion.

When the backward method is used, the model begins with the full set of predictor variables then, one by one, begins reducing the feature set by removing the variable with the highest p-value. The process is repeated until there are no variables left whose p-value is greater than the alpha threshold criterion.
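
To make the mechanics concrete, here is a minimal hand-rolled sketch of backward elimination on raw p-values (illustration only; the SignifReg calls below are what we actually use, and they additionally apply a multiplicity correction):

# Illustrative sketch: drop the least significant predictor until all p-values <= alpha
backward_by_pvalue <- function(data, response = "CARAVAN", alpha = 0.05) {
  preds <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(preds, response), data = data)
    pv  <- summary(fit)$coefficients[-1, 4]    # p-values, intercept dropped
    if (max(pv) <= alpha || length(preds) == 1) return(fit)
    preds <- setdiff(preds, names(which.max(pv)))
  }
}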

Using p-values

# Forward pass
step.pvmodel.fwd <- SignifReg(lm(CARAVAN ~ ., data=cic_data),
                    alpha=0.05, direction="forward", 
                    criterion="p-value",adjust.method="fdr", trace = FALSE)

summary(step.pvmodel.fwd)
## 
## Call:
## lm(formula = CARAVAN ~ ., data = cic_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59686 -0.08677 -0.04731 -0.00935  1.04207 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5783847  0.3591997   1.610 0.107386    
## MOSTYPE      0.0017034  0.0017257   0.987 0.323620    
## MAANTHUI    -0.0030908  0.0058102  -0.532 0.594765    
## MGEMOMV      0.0018860  0.0055376   0.341 0.733425    
## MGEMLEEF     0.0081546  0.0038467   2.120 0.034037 *  
## MOSHOOFD    -0.0058416  0.0077350  -0.755 0.450137    
## MGODRK      -0.0011922  0.0043126  -0.276 0.782208    
## MGODPR       0.0016262  0.0046996   0.346 0.729335    
## MGODOV       0.0033623  0.0042110   0.798 0.424622    
## MGODGE      -0.0006484  0.0045023  -0.144 0.885486    
## MRELGE       0.0073003  0.0058266   1.253 0.210267    
## MRELSA       0.0017438  0.0055644   0.313 0.754002    
## MRELOV       0.0058132  0.0058812   0.988 0.322965    
## MFALLEEN    -0.0068756  0.0049762  -1.382 0.167098    
## MFGEKIND    -0.0076840  0.0051113  -1.503 0.132780    
## MFWEKIND    -0.0055513  0.0053244  -1.043 0.297159    
## MOPLHOOG     0.0049231  0.0052219   0.943 0.345822    
## MOPLMIDD    -0.0014082  0.0054568  -0.258 0.796371    
## MOPLLAAG    -0.0058663  0.0056023  -1.047 0.295063    
## MBERHOOG    -0.0013785  0.0034649  -0.398 0.690743    
## MBERZELF     0.0031623  0.0040297   0.785 0.432620    
## MBERBOER    -0.0105737  0.0038122  -2.774 0.005554 ** 
## MBERMIDD     0.0019283  0.0034542   0.558 0.576681    
## MBERARBG    -0.0005993  0.0034539  -0.174 0.862241    
## MBERARBO    -0.0002782  0.0034164  -0.081 0.935095    
## MSKA        -0.0022806  0.0040211  -0.567 0.570614    
## MSKB1       -0.0039062  0.0038612  -1.012 0.311728    
## MSKB2       -0.0033504  0.0034535  -0.970 0.331988    
## MSKC        -0.0019335  0.0037918  -0.510 0.610116    
## MSKD        -0.0005069  0.0036463  -0.139 0.889447    
## MHHUUR      -0.0238305  0.0307539  -0.775 0.438432    
## MHKOOP      -0.0216805  0.0307182  -0.706 0.480337    
## MAUT1        0.0116511  0.0057850   2.014 0.044037 *  
## MAUT2        0.0080192  0.0052851   1.517 0.129216    
## MAUT0        0.0077346  0.0055589   1.391 0.164140    
## MZFONDS     -0.0589186  0.0369015  -1.597 0.110377    
## MZPART      -0.0600217  0.0368420  -1.629 0.103311    
## MINKM30      0.0055613  0.0039235   1.417 0.156386    
## MINK3045     0.0050760  0.0037877   1.340 0.180234    
## MINK4575     0.0034224  0.0038337   0.893 0.372026    
## MINK7512     0.0013611  0.0040272   0.338 0.735389    
## MINK123M    -0.0130376  0.0053561  -2.434 0.014944 *  
## MINKGEM      0.0076177  0.0035222   2.163 0.030583 *  
## MKOOPKLA     0.0043055  0.0017605   2.446 0.014477 *  
## PWAPART      0.0352508  0.0126978   2.776 0.005511 ** 
## PWABEDR     -0.0112595  0.0165518  -0.680 0.496358    
## PWALAND     -0.0149981  0.0263041  -0.570 0.568568    
## PPERSAUT     0.0094506  0.0019336   4.887 1.04e-06 ***
## PBESAUT      0.0009264  0.0097120   0.095 0.924010    
## PMOTSCO     -0.0076253  0.0068607  -1.111 0.266405    
## PVRAAUT     -0.0180541  0.0231880  -0.779 0.436237    
## PAANHANG     0.0335455  0.0432778   0.775 0.438287    
## PTRACTOR     0.0066051  0.0099400   0.664 0.506387    
## PWERKT      -0.0165037  0.0278616  -0.592 0.553633    
## PBROM        0.0048050  0.0114610   0.419 0.675042    
## PLEVEN      -0.0114336  0.0049549  -2.308 0.021047 *  
## PPERSONG    -0.0158814  0.0285207  -0.557 0.577652    
## PGEZONG      0.1843040  0.0547670   3.365 0.000768 ***
## PWAOREG      0.0290693  0.0222813   1.305 0.192043    
## PBRAND       0.0148741  0.0027718   5.366 8.22e-08 ***
## PZEILPL     -0.0480886  0.1239554  -0.388 0.698061    
## PPLEZIER    -0.0011871  0.0216523  -0.055 0.956279    
## PFIETS      -0.0179992  0.0413143  -0.436 0.663090    
## PINBOED     -0.0376564  0.0220828  -1.705 0.088182 .  
## PBYSTAND     0.0351617  0.0224860   1.564 0.117916    
## AWAPART     -0.0501723  0.0247235  -2.029 0.042451 *  
## AWABEDR      0.0202559  0.0451784   0.448 0.653908    
## AWALAND     -0.0095720  0.0925739  -0.103 0.917649    
## APERSAUT    -0.0008198  0.0092918  -0.088 0.929698    
## ABESAUT     -0.0287920  0.0436043  -0.660 0.509076    
## AMOTSCO      0.0171126  0.0272229   0.629 0.529619    
## AVRAAUT      0.0438739  0.0807139   0.544 0.586748    
## AAANHANG    -0.0352543  0.0749002  -0.471 0.637877    
## ATRACTOR    -0.0226762  0.0233543  -0.971 0.331590    
## AWERKT       0.0219802  0.0561490   0.391 0.695465    
## ABROM       -0.0110523  0.0346947  -0.319 0.750068    
## ALEVEN       0.0340398  0.0116835   2.914 0.003582 ** 
## APERSONG     0.0193591  0.0793315   0.244 0.807214    
## AGEZONG     -0.3776992  0.1322457  -2.856 0.004299 ** 
## AWAOREG     -0.1183782  0.1172535  -1.010 0.312716    
## ABRAND      -0.0249035  0.0089953  -2.769 0.005642 ** 
## AZEILPL      0.3166501  0.2350619   1.347 0.177982    
## APLEZIER     0.2220372  0.0672979   3.299 0.000973 ***
## AFIETS       0.0328653  0.0310300   1.059 0.289560    
## AINBOED      0.0658758  0.0506052   1.302 0.193029    
## ABYSTAND    -0.0638010  0.0758703  -0.841 0.400413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2303 on 9736 degrees of freedom
## Multiple R-squared:  0.06284,    Adjusted R-squared:  0.05465 
## F-statistic:  7.68 on 85 and 9736 DF,  p-value: < 2.2e-16
# Reduce down to best/worst to save time
sumfwd <- summary(step.pvmodel.fwd)
pvals <- sumfwd$coefficients[,4]
feat_extremes <- pvals[pvals < 0.1 | pvals > 0.9] %>% names() %>% .[. != "(Intercept)"]

fmla <- as.formula(paste("CARAVAN ~ ", paste(feat_extremes, collapse= "+")))
fit_extreme <- lm(fmla, data = cic_data)
fit_extreme
## 
## Call:
## lm(formula = fmla, data = cic_data)
## 
## Coefficients:
## (Intercept)     MGEMLEEF     MBERBOER     MBERARBO        MAUT1     MINK123M  
##  -0.0763392    0.0056824   -0.0102959   -0.0020071    0.0040516   -0.0130215  
##     MINKGEM     MKOOPKLA      PWAPART     PPERSAUT      PBESAUT       PLEVEN  
##   0.0089861    0.0059283    0.0367790    0.0090817   -0.0055774   -0.0114734  
##     PGEZONG       PBRAND     PPLEZIER      PINBOED      AWAPART      AWALAND  
##   0.1808067    0.0157212    0.0064432   -0.0096923   -0.0524946   -0.0688962  
##    APERSAUT       ALEVEN      AGEZONG       ABRAND     APLEZIER  
##   0.0004565    0.0344180   -0.3590850   -0.0269530    0.2136198
# Backward pass
step.pvmodel.bkwd <- SignifReg(fit = fit_extreme,
                      alpha=0.05, direction="backward",
                      criterion="p-value", adjust.method="fdr", trace = FALSE)

summary(step.pvmodel.bkwd)
## 
## Call:
## lm(formula = CARAVAN ~ MBERBOER + MBERARBO + MINK123M + MKOOPKLA + 
##     PWAPART + PGEZONG + PBRAND + AGEZONG, data = cic_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33462 -0.08085 -0.05370 -0.02634  1.00847 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.009551   0.008330   1.147 0.251577    
## MBERBOER    -0.011691   0.002191  -5.336 9.73e-08 ***
## MBERARBO    -0.003975   0.001506  -2.640 0.008314 ** 
## MINK123M    -0.010126   0.004279  -2.366 0.017978 *  
## MKOOPKLA     0.009571   0.001290   7.420 1.27e-13 ***
## PWAPART      0.015597   0.002844   5.485 4.24e-08 ***
## PGEZONG      0.191883   0.055175   3.478 0.000508 ***
## PBRAND       0.007772   0.001455   5.343 9.36e-08 ***
## AGEZONG     -0.373704   0.132862  -2.813 0.004922 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2337 on 9813 degrees of freedom
## Multiple R-squared:  0.02724,    Adjusted R-squared:  0.02644 
## F-statistic: 34.34 on 8 and 9813 DF,  p-value: < 2.2e-16

There is a very good reason this p-value-based approach is not included in more common packages like stats or MASS: the results can often be very misleading. If a stepwise feature selection approach is used, it is far better to use a different scoring criterion, such as AIC or Mallows’ Cp. Even with those criteria, however, stepwise selection is still prone to overfitting and can produce biased, if not completely flawed, results, especially in the presence of multicollinearity. By using p-values as our criterion we not only inherit the aforementioned issues but also introduce the multiple comparisons problem, so it stands to reason that the features selected by this method disagree with the others more than usual.
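
As a small illustration of that multiple comparisons issue, we can apply the same FDR correction passed to SignifReg (adjust.method = "fdr") to the coefficient p-values extracted earlier and see how few survive adjustment (sketch):

# FDR-adjusted p-values for the forward model's coefficients, smallest first
head(sort(p.adjust(pvals, method = "fdr")), 10)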

Using AIC

full_model <- lm(CARAVAN ~., data = cic_data)
step.aicmodel.both <- stepAIC(full_model, direction = "both", trace = FALSE, steps=10)

Using AIC as the evaluation metric rather than the p-value allows us to take a bidirectional pass through the feature set, providing a more thoroughly vetted set of predictor variables.

summary(step.aicmodel.both)
## 
## Call:
## lm(formula = CARAVAN ~ MOSTYPE + MAANTHUI + MGEMOMV + MGEMLEEF + 
##     MOSHOOFD + MGODPR + MGODOV + MRELGE + MRELSA + MRELOV + MFALLEEN + 
##     MFGEKIND + MFWEKIND + MOPLHOOG + MOPLMIDD + MOPLLAAG + MBERHOOG + 
##     MBERZELF + MBERBOER + MBERMIDD + MSKA + MSKB1 + MSKB2 + MSKC + 
##     MHHUUR + MHKOOP + MAUT1 + MAUT2 + MAUT0 + MZFONDS + MZPART + 
##     MINKM30 + MINK3045 + MINK4575 + MINK7512 + MINK123M + MINKGEM + 
##     MKOOPKLA + PWAPART + PWABEDR + PWALAND + PPERSAUT + PMOTSCO + 
##     PVRAAUT + PAANHANG + PTRACTOR + PWERKT + PBROM + PLEVEN + 
##     PPERSONG + PGEZONG + PWAOREG + PBRAND + PZEILPL + PFIETS + 
##     PINBOED + PBYSTAND + AWAPART + AWABEDR + ABESAUT + AMOTSCO + 
##     AVRAAUT + AAANHANG + ATRACTOR + AWERKT + ABROM + ALEVEN + 
##     AGEZONG + AWAOREG + ABRAND + AZEILPL + APLEZIER + AFIETS + 
##     AINBOED + ABYSTAND, data = cic_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59327 -0.08685 -0.04757 -0.00938  1.04216 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.5724299  0.3570989   1.603 0.108967    
## MOSTYPE      0.0016864  0.0017196   0.981 0.326764    
## MAANTHUI    -0.0030670  0.0057924  -0.529 0.596476    
## MGEMOMV      0.0019597  0.0055155   0.355 0.722372    
## MGEMLEEF     0.0081143  0.0038394   2.113 0.034588 *  
## MOSHOOFD    -0.0057459  0.0077099  -0.745 0.456132    
## MGODPR       0.0024028  0.0015469   1.553 0.120392    
## MGODOV       0.0039067  0.0025186   1.551 0.120896    
## MRELGE       0.0069933  0.0056925   1.228 0.219289    
## MRELSA       0.0014456  0.0054296   0.266 0.790054    
## MRELOV       0.0055055  0.0057076   0.965 0.334774    
## MFALLEEN    -0.0069371  0.0049260  -1.408 0.159089    
## MFGEKIND    -0.0077321  0.0050739  -1.524 0.127565    
## MFWEKIND    -0.0056485  0.0052725  -1.071 0.284052    
## MOPLHOOG     0.0046980  0.0051598   0.911 0.362578    
## MOPLMIDD    -0.0016588  0.0053805  -0.308 0.757868    
## MOPLLAAG    -0.0061594  0.0054637  -1.127 0.259636    
## MBERHOOG    -0.0010733  0.0024275  -0.442 0.658383    
## MBERZELF     0.0033968  0.0035678   0.952 0.341090    
## MBERBOER    -0.0102245  0.0029125  -3.511 0.000449 ***
## MBERMIDD     0.0023020  0.0018508   1.244 0.213612    
## MSKA        -0.0019936  0.0035439  -0.563 0.573766    
## MSKB1       -0.0036625  0.0030206  -1.213 0.225343    
## MSKB2       -0.0031161  0.0025099  -1.242 0.214439    
## MSKC        -0.0017154  0.0023368  -0.734 0.462925    
## MHHUUR      -0.0235186  0.0305618  -0.770 0.441589    
## MHKOOP      -0.0214107  0.0305246  -0.701 0.483055    
## MAUT1        0.0113245  0.0056760   1.995 0.046055 *  
## MAUT2        0.0077296  0.0051937   1.488 0.136716    
## MAUT0        0.0073747  0.0054296   1.358 0.174414    
## MZFONDS     -0.0588420  0.0367675  -1.600 0.109546    
## MZPART      -0.0599352  0.0367201  -1.632 0.102666    
## MINKM30      0.0053045  0.0038309   1.385 0.166194    
## MINK3045     0.0048150  0.0036846   1.307 0.191324    
## MINK4575     0.0031321  0.0037030   0.846 0.397675    
## MINK7512     0.0010579  0.0039042   0.271 0.786433    
## MINK123M    -0.0134459  0.0052262  -2.573 0.010103 *  
## MINKGEM      0.0076835  0.0035113   2.188 0.028679 *  
## MKOOPKLA     0.0043447  0.0017496   2.483 0.013037 *  
## PWAPART      0.0351616  0.0126789   2.773 0.005561 ** 
## PWABEDR     -0.0110759  0.0165222  -0.670 0.502640    
## PWALAND     -0.0176335  0.0057591  -3.062 0.002206 ** 
## PPERSAUT     0.0093065  0.0008407  11.070  < 2e-16 ***
## PMOTSCO     -0.0076492  0.0068517  -1.116 0.264284    
## PVRAAUT     -0.0178551  0.0230878  -0.773 0.439329    
## PAANHANG     0.0334802  0.0431691   0.776 0.438029    
## PTRACTOR     0.0065646  0.0099107   0.662 0.507745    
## PWERKT      -0.0156597  0.0268700  -0.583 0.560046    
## PBROM        0.0047397  0.0114522   0.414 0.678980    
## PLEVEN      -0.0113667  0.0049440  -2.299 0.021521 *  
## PPERSONG    -0.0095662  0.0124760  -0.767 0.443241    
## PGEZONG      0.1844539  0.0546213   3.377 0.000736 ***
## PWAOREG      0.0292366  0.0222448   1.314 0.188771    
## PBRAND       0.0148643  0.0027664   5.373 7.92e-08 ***
## PZEILPL     -0.0480655  0.1229430  -0.391 0.695837    
## PFIETS      -0.0180838  0.0412741  -0.438 0.661296    
## PINBOED     -0.0378742  0.0220526  -1.717 0.085929 .  
## PBYSTAND     0.0354185  0.0224146   1.580 0.114103    
## AWAPART     -0.0500644  0.0246897  -2.028 0.042613 *  
## AWABEDR      0.0200906  0.0451182   0.445 0.656122    
## ABESAUT     -0.0251562  0.0198435  -1.268 0.204925    
## AMOTSCO      0.0171959  0.0271983   0.632 0.527243    
## AVRAAUT      0.0430724  0.0802005   0.537 0.591239    
## AAANHANG    -0.0352425  0.0747284  -0.472 0.637218    
## ATRACTOR    -0.0226610  0.0232812  -0.973 0.330398    
## AWERKT       0.0197019  0.0534293   0.369 0.712325    
## ABROM       -0.0108239  0.0346618  -0.312 0.754841    
## ALEVEN       0.0339156  0.0116517   2.911 0.003613 ** 
## AGEZONG     -0.3782742  0.1317989  -2.870 0.004112 ** 
## AWAOREG     -0.1188004  0.1171097  -1.014 0.310399    
## ABRAND      -0.0248331  0.0089837  -2.764 0.005717 ** 
## AZEILPL      0.3163373  0.2324447   1.361 0.173572    
## APLEZIER     0.2186388  0.0300053   7.287 3.42e-13 ***
## AFIETS       0.0330478  0.0310011   1.066 0.286442    
## AINBOED      0.0664611  0.0504942   1.316 0.188133    
## ABYSTAND    -0.0646050  0.0755661  -0.855 0.392602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2302 on 9746 degrees of freedom
## Multiple R-squared:  0.06281,    Adjusted R-squared:  0.0556 
## F-statistic: 8.709 on 75 and 9746 DF,  p-value: < 2.2e-16

A quick summary shows how this method chose to keep far more variables than the stepwise p-value approach.

(feats.pv <- intersect(names(step.pvmodel.bkwd$coefficients), names(step.pvmodel.fwd$coefficients)))
## [1] "(Intercept)" "MBERBOER"    "MBERARBO"    "MINK123M"    "MKOOPKLA"   
## [6] "PWAPART"     "PGEZONG"     "PBRAND"      "AGEZONG"
feats.shared <- intersect(names(step.aicmodel.both$coefficients),feats.pv)
length(feats.shared)
## [1] 8
(feats.aic <- setdiff(names(step.aicmodel.both$coefficients),feats.pv))
##  [1] "MOSTYPE"  "MAANTHUI" "MGEMOMV"  "MGEMLEEF" "MOSHOOFD" "MGODPR"  
##  [7] "MGODOV"   "MRELGE"   "MRELSA"   "MRELOV"   "MFALLEEN" "MFGEKIND"
## [13] "MFWEKIND" "MOPLHOOG" "MOPLMIDD" "MOPLLAAG" "MBERHOOG" "MBERZELF"
## [19] "MBERMIDD" "MSKA"     "MSKB1"    "MSKB2"    "MSKC"     "MHHUUR"  
## [25] "MHKOOP"   "MAUT1"    "MAUT2"    "MAUT0"    "MZFONDS"  "MZPART"  
## [31] "MINKM30"  "MINK3045" "MINK4575" "MINK7512" "MINKGEM"  "PWABEDR" 
## [37] "PWALAND"  "PPERSAUT" "PMOTSCO"  "PVRAAUT"  "PAANHANG" "PTRACTOR"
## [43] "PWERKT"   "PBROM"    "PLEVEN"   "PPERSONG" "PWAOREG"  "PZEILPL" 
## [49] "PFIETS"   "PINBOED"  "PBYSTAND" "AWAPART"  "AWABEDR"  "ABESAUT" 
## [55] "AMOTSCO"  "AVRAAUT"  "AAANHANG" "ATRACTOR" "AWERKT"   "ABROM"   
## [61] "ALEVEN"   "AWAOREG"  "ABRAND"   "AZEILPL"  "APLEZIER" "AFIETS"  
## [67] "AINBOED"  "ABYSTAND"
length(feats.aic )
## [1] 68

The forward and backward p-value passes agreed on 8 statistically significant predictors, 7 of which (plus the intercept) were also retained by the bidirectional AIC model. On top of those, the AIC approach kept 68 additional features that did not meet the 0.05 p-value threshold but did satisfy the AIC criterion.

Principal Components Analysis (PCA)

In contrast to the previous stepwise approach to feature selection, PCA does not filter out variables based on a metric; rather, it attempts to combine them in such a manner that the composite variables explain the largest amount of variance.

In using PCA, we are not attempting to reduce the overall feature set; we are reducing the dimensionality of the dataset. In this example, the coil2k dataset has 86 columns (85 predictors + 1 response), and plotting this many features in 2- or even 3-dimensional space is simply not practical. With PCA, we extract useful combinations of variables (eigenvectors) that can be plotted in these lower-dimensional spaces.

cic.pca <- prcomp(~ . -CARAVAN, data=cic_data, center = TRUE, scale. = TRUE)
(spca <- summary(cic.pca))
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5    PC6     PC7
## Standard deviation     3.0627 2.21527 1.99574 1.82836 1.71113 1.6259 1.50398
## Proportion of Variance 0.1104 0.05773 0.04686 0.03933 0.03445 0.0311 0.02661
## Cumulative Proportion  0.1104 0.16809 0.21495 0.25428 0.28872 0.3198 0.34643
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.49545 1.46284 1.4520 1.41766 1.39902 1.38336 1.37946
## Proportion of Variance 0.02631 0.02518 0.0248 0.02364 0.02303 0.02251 0.02239
## Cumulative Proportion  0.37274 0.39792 0.4227 0.44637 0.46939 0.49191 0.51430
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     1.36700 1.35872 1.34751 1.34003 1.31395 1.29922 1.28370
## Proportion of Variance 0.02198 0.02172 0.02136 0.02113 0.02031 0.01986 0.01939
## Cumulative Proportion  0.53628 0.55800 0.57936 0.60049 0.62080 0.64066 0.66004
##                           PC22    PC23   PC24    PC25    PC26    PC27    PC28
## Standard deviation     1.27285 1.24669 1.2056 1.19470 1.17662 1.14819 1.12893
## Proportion of Variance 0.01906 0.01829 0.0171 0.01679 0.01629 0.01551 0.01499
## Cumulative Proportion  0.67910 0.69739 0.7145 0.73128 0.74757 0.76308 0.77807
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     1.09927 1.09201 1.06524 1.02851 0.98657 0.97396 0.93596
## Proportion of Variance 0.01422 0.01403 0.01335 0.01245 0.01145 0.01116 0.01031
## Cumulative Proportion  0.79229 0.80632 0.81967 0.83212 0.84357 0.85473 0.86503
##                           PC36    PC37    PC38   PC39    PC40    PC41    PC42
## Standard deviation     0.92309 0.90232 0.89200 0.8793 0.86819 0.84464 0.82404
## Proportion of Variance 0.01002 0.00958 0.00936 0.0091 0.00887 0.00839 0.00799
## Cumulative Proportion  0.87506 0.88464 0.89400 0.9031 0.91196 0.92035 0.92834
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.78679 0.77184 0.74966 0.68517 0.68063 0.61773 0.58161
## Proportion of Variance 0.00728 0.00701 0.00661 0.00552 0.00545 0.00449 0.00398
## Cumulative Proportion  0.93562 0.94263 0.94924 0.95477 0.96022 0.96471 0.96869
##                           PC50    PC51    PC52    PC53   PC54    PC55    PC56
## Standard deviation     0.56714 0.44414 0.41342 0.38276 0.3802 0.36606 0.35593
## Proportion of Variance 0.00378 0.00232 0.00201 0.00172 0.0017 0.00158 0.00149
## Cumulative Proportion  0.97247 0.97479 0.97680 0.97853 0.9802 0.98180 0.98329
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.33961 0.32794 0.32222 0.31307 0.30212 0.29990 0.28570
## Proportion of Variance 0.00136 0.00127 0.00122 0.00115 0.00107 0.00106 0.00096
## Cumulative Proportion  0.98465 0.98591 0.98714 0.98829 0.98936 0.99042 0.99138
##                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
## Standard deviation     0.27841 0.27363 0.25798 0.24969 0.23323 0.20981 0.19857
## Proportion of Variance 0.00091 0.00088 0.00078 0.00073 0.00064 0.00052 0.00046
## Cumulative Proportion  0.99229 0.99317 0.99396 0.99469 0.99533 0.99585 0.99631
##                           PC71    PC72    PC73    PC74    PC75    PC76    PC77
## Standard deviation     0.19012 0.18577 0.17967 0.17695 0.16922 0.16705 0.16225
## Proportion of Variance 0.00043 0.00041 0.00038 0.00037 0.00034 0.00033 0.00031
## Cumulative Proportion  0.99674 0.99714 0.99752 0.99789 0.99823 0.99856 0.99887
##                           PC78    PC79    PC80   PC81    PC82    PC83    PC84
## Standard deviation     0.14070 0.13942 0.13586 0.1305 0.12348 0.07384 0.02532
## Proportion of Variance 0.00023 0.00023 0.00022 0.0002 0.00018 0.00006 0.00001
## Cumulative Proportion  0.99910 0.99933 0.99955 0.9998 0.99993 0.99999 1.00000
##                           PC85
## Standard deviation     0.01621
## Proportion of Variance 0.00000
## Cumulative Proportion  1.00000

Looking at the cumulative proportion of variance explained, it does not look like this dataset will provide a great deal of insight from PCA plotting. Only 16.8% of the overall variance is explained by the first two components; ideally we’d like that to be quite a bit larger.

spca$importance[3,] %>% enframe(value = "var_explained") %>% 
  ggplot(aes(x=1:85, y=var_explained)) +
  geom_line() +
  geom_hline(aes(yintercept = .90), color="red", linetype=2) +
  scale_x_continuous(breaks = seq(0,85,5)) +
  xlab("n_components") + 
  ylab("variance explained") +
  ggtitle("Cumulative Variance Explained")

However, after only 39 components we are able to account for 90% of total variance, and by 62 components we reach 99%. This may be an indication that some variables in the dataset provide little additional information to the analysis.
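
Those cutoffs can be read straight from the summary object rather than eyeballed off the plot (a small sketch using the spca object from above):

# First component at which cumulative variance explained crosses 90% and 99%
cumvar <- spca$importance["Cumulative Proportion", ]
min(which(cumvar >= 0.90))
min(which(cumvar >= 0.99))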

fviz_pca_ind(cic.pca, geom = "point", 
             pointshape = 21, 
             fill.ind = as.factor(cic_data$CARAVAN), 
             palette = "jco", 
             addEllipses = TRUE,
             repel = TRUE,
             legend.title = "CARAVAN") +
  ggtitle("PCA-plot full feature set") +
  theme(plot.title = element_text(hjust = 0.5))

Using all 85 features makes our illustration a cluttered, overlapping mess; instead, let’s reduce the feature set to the small group of features shared between the stepwise selection methods above.

cic_sm <- cic_data[feats.shared[2:length(feats.shared)]]
cic_sm.pca <- prcomp(~ ., data=cic_sm, center = TRUE, scale. = TRUE)
biplot(cic_sm.pca)

Even reducing the feature set down to just a handful of variables didn’t clear up the component plot all that much. Let’s step up to a more polished plotting method to see if we can coax some information out of this analysis.

fviz_pca_biplot(cic_sm.pca, geom="point",
          pointshape=21,
          fill.ind = as.factor(cic_data$CARAVAN),
          col.ind = as.factor(cic_data$CARAVAN),
          col.var = "black",
          legend.title="CARAVAN",
          palette = "jco",
          addEllipses = TRUE,
          repel = TRUE, title="PCA-2d plot, 15 feature set")

Using factoextra’s biplot certainly improves the plot aesthetics, but whether it uncovers additional useful information is debatable. We get a better sense of how disjoint the two groups are, but we would need to dig deeper by experimenting with other features to realistically understand whether there is any insight to be gained here. At best, we can say we have slightly stronger evidence supporting the idea that the A and P prefixes are just two ways of representing the same value, but the correlation matrix in the previous notebook still supports that claim far better.
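
A quick spot check of that P/A redundancy claim, without rebuilding the full correlation matrix (a sketch on two product pairs chosen arbitrarily):

# Percent-coded vs. count-coded versions of the same product should be highly correlated
cor(cic_data$PPERSAUT, cic_data$APERSAUT)
cor(cic_data$PBRAND, cic_data$ABRAND)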

Recursive feature elimination

Recursive feature elimination is a greedy optimization algorithm that serves the same purpose as stepwise feature selection but does so in a different manner.

Rather than using a rigidly defined criterion such as the p-value or AIC, RFE uses feature importance as determined by the model on which it is run. Both methods iterate through the model multiple times, making modifications to the feature set, but RFE restores all features after each iteration, keeping track of how important each variable has been to the model in previous generations.

# Swap to random forest instead of linear model
ctrl <- rfeControl(method = "cv", number = 5, returnResamp = "all", functions = rfFuncs) 

rfProfile <- rfe(subset(cic_data, select = -CARAVAN), 
                 as.factor(cic_data$CARAVAN),
                 sizes=c(4,8,16,32,64,85),
                 rfeControl = ctrl)

rfProfile
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy   Kappa AccuracySD KappaSD Selected
##          4   0.9405 0.01225  0.0004134 0.01276        *
##          8   0.9402 0.01154  0.0007610 0.01258         
##         16   0.9326 0.06899  0.0027723 0.01965         
##         32   0.9331 0.08176  0.0034233 0.02369         
##         64   0.9337 0.06494  0.0030347 0.01088         
##         85   0.9340 0.05360  0.0030838 0.01741         
## 
## The top 4 variables (out of 4):
##    PPERSAUT, PBRAND, APLEZIER, MOSTYPE
plot(rfProfile, type = c("g", "o"))

Interestingly, we can see here how additional variables actually decrease the accuracy of the random forest model we are using. As mentioned in the previous notebook’s analysis of this dataset, 94% is the accuracy score a model would achieve by simply predicting all FALSE values. So, to a certain degree, it is a good thing that the accuracy decreased at first. Otherwise, if the accuracy stayed at 94%, the model would not have even attempted to classify any TRUE values at all.
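
For reference, that all-FALSE baseline can be computed directly from the response (sketch):

# Accuracy of always predicting that no one buys a caravan policy
mean(cic_data$CARAVAN == 0)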

plot1 <- xyplot(rfProfile, 
                type = c("g", "p", "smooth"), 
                ylab = "RMSE CV Estimates")
plot2 <- densityplot(rfProfile, 
                     subset = Variables < 5, 
                     adjust = 1.25, 
                     as.table = TRUE, 
                     xlab = "RMSE CV Estimates", 
                     pch = "|")
print(plot1, split=c(1,1,1,2), more=TRUE)
print(plot2, split=c(1,2,1,2))

Again, here we can see how 0.940 is the central accuracy score around which all others are distributed. With fewer variables, each fold of the cross-validation is clustered closely around this score, but as new variables are introduced, the scores begin to spread across a slightly larger range.

ctrl <- rfeControl(method = "cv", number = 5, returnResamp = "all", functions = rfFuncs)

ctrl$returnResamp <- "all"
rfProfile <- rfe(subset(cic_data, select = -CARAVAN),
                 as.factor(cic_data$CARAVAN),
                 sizes=c(4,8,16,32,64,85),
                 rfeControl = ctrl)
rfProfile
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy    Kappa AccuracySD KappaSD Selected
##          4   0.9403 0.008826  0.0009731 0.02143        *
##          8   0.9398 0.013427  0.0004208 0.01682         
##         16   0.9305 0.056817  0.0013730 0.01397         
##         32   0.9310 0.051068  0.0015589 0.02396         
##         64   0.9331 0.046476  0.0025591 0.03512         
##         85   0.9349 0.050660  0.0017526 0.02981         
## 
## The top 4 variables (out of 4):
##    PPERSAUT, PBRAND, MOSTYPE, APLEZIER

T-SNE

library(Rtsne)
# Rtsne requires no duplicate rows in the data, even though duplicates here are valid, distinct customers
cic_unq <- unique(subset(cic_data))
cic.tsne <- Rtsne(cic_unq, 
                  theta=0.1, # precision where 0=exact t-sne, 1=fastest
                  num_threads = 4, #  multicore
                  partial_pca = TRUE, # Fast PCA
                  pca_scale=TRUE,
                  initial_dims = 2, 
                  perplexity=30, 
                  verbose=TRUE, 
                  max_iter = 1000)
## Performing PCA
## Read the 8380 x 2 data matrix successfully!
## OpenMP is working. 4 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.100000
## Computing input similarities...
## Building tree...
## Done in 0.29 seconds (sparsity = 0.012404)!
## Learning embedding...
## Iteration 50: error is 96.981510 (50 iterations in 2.21 seconds)
## Iteration 100: error is 78.790863 (50 iterations in 2.26 seconds)
## Iteration 150: error is 72.329301 (50 iterations in 1.82 seconds)
## Iteration 200: error is 69.006548 (50 iterations in 1.70 seconds)
## Iteration 250: error is 67.163294 (50 iterations in 1.83 seconds)
## Iteration 300: error is 2.232485 (50 iterations in 1.87 seconds)
## Iteration 350: error is 1.757471 (50 iterations in 1.83 seconds)
## Iteration 400: error is 1.477625 (50 iterations in 1.81 seconds)
## Iteration 450: error is 1.294077 (50 iterations in 1.80 seconds)
## Iteration 500: error is 1.165966 (50 iterations in 1.76 seconds)
## Iteration 550: error is 1.071931 (50 iterations in 1.76 seconds)
## Iteration 600: error is 1.000255 (50 iterations in 1.75 seconds)
## Iteration 650: error is 0.944040 (50 iterations in 1.74 seconds)
## Iteration 700: error is 0.898971 (50 iterations in 1.76 seconds)
## Iteration 750: error is 0.862316 (50 iterations in 1.71 seconds)
## Iteration 800: error is 0.831997 (50 iterations in 1.72 seconds)
## Iteration 850: error is 0.806537 (50 iterations in 1.69 seconds)
## Iteration 900: error is 0.784908 (50 iterations in 1.69 seconds)
## Iteration 950: error is 0.766340 (50 iterations in 1.69 seconds)
## Iteration 1000: error is 0.750230 (50 iterations in 1.69 seconds)
## Fitting performed in 36.10 seconds.
cic.tsne$Y %>% 
  as.data.frame() %>% 
  ggplot(aes(x=V1, y=V2, color=as.factor(cic_unq$CARAVAN))) + 
  geom_point() +
  scale_color_manual(values=c("gray", "red2")) +
  labs(title="T-SNE 2-dims", color="CARAVAN")

t-SNE does little to help delineate the positive and negative samples. There are many small clusters, but each contains both 0 and 1 response outcomes. That said, one advantage t-SNE offers is a high degree of tweakability; adjustments to initial_dims or perplexity could yield more interesting results, but nothing immediately suggests these options are worth exploring in great depth.

UMAP

library(umap)

This R implementation of UMAP has some quirks; the easiest fix is just to make our own thin wrapper around the function so we can see at a glance what is going on.

# Extremely thin wrapper to be able to see all the possible config params at a glance
umap.wrap <- 
  function(d, # matrix; input data
           n_neighbors = 15, # integer; number of nearest neighbors
           n_components = 2, # integer; dimension of target (output) space
           metric = "euclidean", # see below
           n_epochs = 200, # integer; number of iterations performed during layout optimization
           input = "data", # c("data", "dist"); determines whether `d` is treated as a data matrix or a distance matrix
           init = "spectral", # see below
           min_dist = 0.1, #  determines how close points appear in the final layout
           set_op_mix_ratio = 1, # range [0,1]; determines how the knn-graph is used to create a fuzzy simplicial graph
           local_connectivity = 1, # used during construction of fuzzy simplicial set
           bandwidth = 1, # used during construction of fuzzy simplicial set
           alpha = 1, # initial value of "learning rate" of layout optimization
           gamma = 1, # determines, together with alpha, the learning rate of layout optimization
           negative_sample_rate = 5, # determines how many non-neighbor points used per point/iteration during layout optimization
           a = NA, # contributes to gradient calculations during layout optimization. Auto calcs when = NA
           b = NA, # contributes to gradient calculations during layout optimization
           spread = 1, # used during automatic estimation of a/b parameters.
           random_state = NA, # seed for random number generation used during umap()
           transform_state = NA, # seed for random number generation used during predict()
           knn=NA, # object of class umap.knn; precomputed nearest neighbors
           knn_repeats = 1, # number of times to restart knn search
           verbose = FALSE, # determines whether to show progress messages
           umap_learn_args = NA, # vector of arguments to python package umap-learn
           method= c("naive", "umap-learn")) # see below
  {
  params <- as.list(environment())
  conf <- params[2:(length(params)-1)]
  class(conf) <- "umap.config"
  
  return(umap(d=d, config=conf, method=method))
  }
  • metric: character or function; determines how distances between data points are computed. When using a string, available metrics are: c(“euclidean”, “manhattan”). Other available generalized metrics are: c(“cosine”, “pearson”, “pearson2”). Note: the triangle inequality may not be satisfied by some generalized metrics, hence knn search may not be optimal. When using metric.function as a function, the signature must be function(matrix, origin, target) and should compute a distance between the origin column and the target columns

  • init: character or matrix; the default string “spectral” computes an initial embedding using eigenvectors of the connectivity graph matrix. An alternative is the string “random”, which creates an initial layout based on random coordinates. This setting can also be set to a matrix, in which case layout optimization begins from the provided coordinates.

  • method: character; implementation. Available methods are ‘naive’ (an implementation written in pure R) and ‘umap-learn’ (requires python package ‘umap-learn’)

Taken directly from https://cran.r-project.org/web/packages/umap/umap.pdf with slight stylistic modifications

cic.umap <- umap.wrap(subset(cic_data, select = -CARAVAN), n_epochs = 750, alpha=500, min_dist=.2)
cic.umap$layout %>% 
  as.data.frame() %>% 
  ggplot(aes(V1, V2)) +
  geom_point(aes(color=as.factor(cic_data$CARAVAN)))

Our initial experimentation with UMAP does not indicate any clearly distinguishable difference between the classes. Perhaps with a bit of parameter tweaking we can coerce useful information out of this algorithm.

cic.umap.lt <- umap.wrap(d=subset(cic_data, select = -CARAVAN), n_neighbors = 10, n_epochs=350, verbose = TRUE)
cic.umap.lt$layout %>% as.data.frame() %>% 
  ggplot(aes(V1, V2)) +
  geom_point(aes(color=as.factor(cic_data$CARAVAN)))

An interesting pattern, to say the least, but the large intermixed cluster at (0,0) shows that UMAP is also not yielding ideal results in terms of interpretability.

# TODO : Feature importance random forest
# rfFuncs$rank()
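
A minimal sketch of what that TODO could look like, assuming the randomForest package (the tree count is arbitrary, and rfFuncs$rank is the ranking helper caret uses internally during RFE):

library(randomForest)

# Fit a forest on the full predictor set and pull its importance measures
rf_full <- randomForest(x = subset(cic_data, select = -CARAVAN),
                        y = as.factor(cic_data$CARAVAN),
                        ntree = 500, importance = TRUE)
imp <- importance(rf_full)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ])

# caret's own ranking helper, as used by rfe with rfFuncs
head(rfFuncs$rank(rf_full,
                  subset(cic_data, select = -CARAVAN),
                  as.factor(cic_data$CARAVAN)))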

Conclusions

This notebook was a complement to, or extension of, the previous notebook looking at the CoIL 2000 dataset. The focus this time was more on finding interesting features rather than trying many models. We used p-values from ordinary least squares fits to perform stepwise feature selection and ended up with 8 statistically significant features.

Thereafter, we performed PCA on both the full and the reduced feature sets and plotted the resulting components. Following that, we used two non-linear dimensionality reduction techniques, t-SNE and UMAP, and looked at how they fared with the data.

We close out by performing recursive feature elimination with a Random Forest classifier and looking at which features it ranks as most important. The RFE profile favors a very small subset, with PPERSAUT, PBRAND, APLEZIER, and MOSTYPE consistently appearing as the top variables.

The sweet spot seems to lie somewhere around 39 to 62 principal components, as that range explains 90-99% of the total variance, reinforcing the impression that a substantially reduced feature set could carry most of the information.

Future work:

There are quite a few other dimensionality reduction techniques that could be tried: ISOMAP, SOM, Sammon mapping, and many more, all of which could provide more insight into the data. The question still remains as to whether the presumed redundant variables can be removed without negatively impacting predictive power; this should be explored in greater depth now that we have more evidence supporting the claim. Additionally, it would be interesting to see whether some form of semi-supervised learning could be applied via k-means clustering or a similar approach.
