Updated:

20 minute read

Python Counterpart

Overview

This document will fit a multiple linear model on two separate datasets: Boston from the MASS library, and Carseats from the ISLR library.

Various methods will be used to better the models created including:

  • Removal of insignificant predictors
  • Removal of highly collinear predictors
  • Addition of polynomial terms
  • Addition of interaction terms

Imports

library(MASS)
library(ISLR)
library(car)
#library(ggplot2)
library(tidyverse)
library(magrittr)

Boston

Data

This dataset is from the MASS library, it contains information collected by the U.S Census Service in 1970 concerning housing in the area of Boston Mass.

The dataset contains 506 and 14 columns:

  • crim - per capita crime rate by town.
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town.
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox - nitrogen oxides concentration (parts per 10 million).
  • rm - average number of rooms per dwelling.
  • age - proportion of owner-occupied units built prior to 1940.
  • dis - weighted mean of distances to five Boston employment centres.
  • rad - index of accessibility to radial highways.
  • tax - full-value property-tax rate per $10,000.
  • ptratio - pupil-teacher ratio by town.
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat - lower status of the population (percent).
  • medv - median value of owner-occupied homes in $1000s.

Additional information can be found:
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.926.5532&rep=rep1&type=pdf

or by typing ?Boston after importing MASS

head(bm <- MASS::Boston, n=10)

Exploratory Data Analysis

The Boston dataset has no missing values and contains only numeric data, though this far from an extensive data assessment, for the purposes of this demonstration, the data can be assumed clean enough. A more thorough assessment could be performed by reading the original paper to see if the author filled in any missing values or specified any other manipulations.

all(complete.cases(bm))
## [1] TRUE
str(bm)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
plot(bm)

From a quick plot, we can see a quite a few possible linear relationship candidates: rm:zn, dis:zn, nox:dis, nox:age, rm:lstat, …

Looking exclusively at medv relationships:
medv:lstat, medv:rm, medv:age, medv:zn

par(mfrow=c(2,2))
plot(medv~lstat + rm + age + zn, data=bm)

summary(lm(bm$medv ~ . ,data=bm))
## 
## Call:
## lm(formula = bm$medv ~ ., data = bm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Taking a more scientifically rigorous approach, we can see why simply looking at plots can be misleading.
lstat and rm are extremely good, with a p-value even smaller than machine epsilon, zn meets the p=0.05 threshold, but age turns out to be well beyond any useful significance level.

Additionally, there were several variables whose plots did not immediately show obvious correlation that also have extremely low p-values:

        t-value  p-value
dis      -7.398  6.01e-13 ***
ptratio  -7.283  1.31e-12 ***
par(mfrow=c(1,2))
plot(medv ~ dis + ptratio, data=bm)

Check the data for correlations with Variance Inflation Factors

vif(lm(medv ~ ., data=bm))
##     crim       zn    indus     chas      nox       rm      age      dis 
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
##      rad      tax  ptratio    black    lstat 
## 7.484496 9.008554 1.799084 1.348521 2.941491
vif(lm(medv ~ rad+tax, data=bm))
##      rad      tax 
## 5.831426 5.831426

There are some highly correlated values among predictors, before we are able to make a suitable model, this issue will need to be addressed. As a general rule of thumb, VIF of > 5 means that the variable should be looked at to determine if there are issues, VIF > 10, that variable is almost certainly causing colinearity issues.

AIC(lm(medv ~ . ,data=bm))
## [1] 3027.609
BIC(lm(medv ~ . ,data=bm))
## [1] 3091.007

Models

The baseline values for our multiple linear model are:
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338, AIC: 3027.609, BIC: 3091.007

After a bit of feature selection and added polynomial terms the final model values are: Multiple R-squared: 0.8168, Adjusted R-squared: 0.8120, AIC: 2851.624, BIC: 2915.022

A fairly sizable improvement across the board

# exclude variables where p>0.05
summary(lm(medv ~. - age - indus, data=bm))
## 
## Call:
## lm(formula = medv ~ . - age - indus, data = bm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

Excluding age and indus from the model slightly improves the adjusted R-squared value.

smry <- summary(lm(medv~. ,data=bm))
smry$coefficients
##                  Estimate  Std. Error      t value     Pr(>|t|)
## (Intercept)  3.645949e+01 5.103458811   7.14407419 3.283438e-12
## crim        -1.080114e-01 0.032864994  -3.28651687 1.086810e-03
## zn           4.642046e-02 0.013727462   3.38157628 7.781097e-04
## indus        2.055863e-02 0.061495689   0.33431004 7.382881e-01
## chas         2.686734e+00 0.861579756   3.11838086 1.925030e-03
## nox         -1.776661e+01 3.819743707  -4.65125741 4.245644e-06
## rm           3.809865e+00 0.417925254   9.11614020 1.979441e-18
## age          6.922246e-04 0.013209782   0.05240243 9.582293e-01
## dis         -1.475567e+00 0.199454735  -7.39800360 6.013491e-13
## rad          3.060495e-01 0.066346440   4.61289977 5.070529e-06
## tax         -1.233459e-02 0.003760536  -3.28000914 1.111637e-03
## ptratio     -9.527472e-01 0.130826756  -7.28251056 1.308835e-12
## black        9.311683e-03 0.002685965   3.46679256 5.728592e-04
## lstat       -5.247584e-01 0.050715278 -10.34714580 7.776912e-23
summary(lm(medv~. ,data=bm))
## 
## Call:
## lm(formula = medv ~ ., data = bm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
head(bm[smry$coefficients[,4] < 5e-4],20)
summary(lm(medv ~. -age -indus -crim -chas -zn -tax -rad -black, data=bm))
## 
## Call:
## lm(formula = medv ~ . - age - indus - crim - chas - zn - tax - 
##     rad - black, data = bm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7765  -3.0186  -0.6481   1.9752  27.7625 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.49920    4.61295   8.129 3.43e-15 ***
## nox         -17.99657    3.26095  -5.519 5.49e-08 ***
## rm            4.16331    0.41203  10.104  < 2e-16 ***
## dis          -1.18466    0.16842  -7.034 6.64e-12 ***
## ptratio      -1.04577    0.11352  -9.212  < 2e-16 ***
## lstat        -0.58108    0.04794 -12.122  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.994 on 500 degrees of freedom
## Multiple R-squared:  0.7081, Adjusted R-squared:  0.7052 
## F-statistic: 242.6 on 5 and 500 DF,  p-value: < 2.2e-16
# summary(lm(medv ~ nox +rm +dis +ptratio +lstat, data=bm)) # equivalently

Dropping all but the strongest indicators has reduced the R? value from 0.7348 to 0.7052 but has greatly simplified the model. Depending on the use case this may or may not be desirable, for the purposes of this demonstration, the simpler model will suffice.

vif(lm(bm$medv ~ lstat +rm +ptratio +dis +nox, data=bm))
##    lstat       rm  ptratio      dis      nox 
## 2.373021 1.697104 1.223051 2.546968 2.891388
vif(lm(medv ~ dis +nox, data=bm))
##      dis      nox 
## 2.449269 2.449269

By reducing the number of parameters, much of the issues with collinearity among predictors has been resolved. A moderate correlation still exists in dis and nox, but not likely strong enough to cause large distortions in the model.

# build a model from the five lowest p-values
lm_lrpdn <- lm(medv ~ lstat +rm +ptratio +dis +nox, data=bm)
summary(lm_lrpdn)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox, data = bm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7765  -3.0186  -0.6481   1.9752  27.7625 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.49920    4.61295   8.129 3.43e-15 ***
## lstat        -0.58108    0.04794 -12.122  < 2e-16 ***
## rm            4.16331    0.41203  10.104  < 2e-16 ***
## ptratio      -1.04577    0.11352  -9.212  < 2e-16 ***
## dis          -1.18466    0.16842  -7.034 6.64e-12 ***
## nox         -17.99657    3.26095  -5.519 5.49e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.994 on 500 degrees of freedom
## Multiple R-squared:  0.7081, Adjusted R-squared:  0.7052 
## F-statistic: 242.6 on 5 and 500 DF,  p-value: < 2.2e-16
lm_poly <- lm(medv ~ poly(lstat,5) +poly(rm,5) +ptratio +dis +nox, data=bm)
summary(lm_poly)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5) + poly(rm, 5) + ptratio + 
##     dis + nox, data = bm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4690  -2.0743  -0.2638   1.5585  27.8069 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       46.48539    2.77936  16.725  < 2e-16 ***
## poly(lstat, 5)1 -110.31734    6.72048 -16.415  < 2e-16 ***
## poly(lstat, 5)2   27.60368    5.37979   5.131 4.15e-07 ***
## poly(lstat, 5)3   -5.04740    4.46669  -1.130 0.259022    
## poly(lstat, 5)4   12.78329    4.17450   3.062 0.002317 ** 
## poly(lstat, 5)5  -10.93446    4.22013  -2.591 0.009853 ** 
## poly(rm, 5)1      55.28702    5.85973   9.435  < 2e-16 ***
## poly(rm, 5)2      42.95462    4.85297   8.851  < 2e-16 ***
## poly(rm, 5)3      -0.50354    4.27075  -0.118 0.906191    
## poly(rm, 5)4     -19.46204    4.58248  -4.247 2.59e-05 ***
## poly(rm, 5)5     -15.27024    4.41022  -3.462 0.000582 ***
## ptratio           -0.67257    0.09579  -7.021 7.34e-12 ***
## dis               -0.91151    0.13711  -6.648 7.90e-11 ***
## nox              -14.56799    2.67333  -5.449 8.00e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.988 on 492 degrees of freedom
## Multiple R-squared:  0.8168, Adjusted R-squared:  0.812 
## F-statistic: 168.8 on 13 and 492 DF,  p-value: < 2.2e-16

From only two polynomial transforms we’ve improved the R-squared value significantly, now at 0.812, well beyond the original R-squared value of 0.7348 while using far fewer input variables.

# Before adding polynomial
print(paste('AIC:',round(AIC(lm_lrpdn),3), "BIC:",round(BIC(lm_lrpdn),3)),quote=FALSE)
## [1] AIC: 3071.439 BIC: 3101.024
# After adding polynomial
print(paste('AIC:',round(AIC(lm_poly),3), "BIC:",round(BIC(lm_poly),3)), quote=FALSE)
## [1] AIC: 2851.624 BIC: 2915.022

The poly model also saw a decrease in AIC and BIC values

Akaike’s information criterion (AIC) and Bayesian information criterion (BIC) can be used to estimate the ‘goodness’ of a statistical model, both addressing the trade-off between goodness of fit and complexity in slightly different ways.

http://r-statistics.co/Linear-Regression.html

par(mfrow=c(2,2))
plot(lm_poly)

Carseats

Data

The data used in this notebook is from ISLR’s Carseats dataset. It represents simulated data of child carseat sales at 400 different stores with 11 variables:

  • Sales - Unit sales (in thousands) at each location
  • CompPrice - Price charged by competitor at each location
  • Income - Community income level (in thousands of dollars)
  • Advertising - Local advertising budget for company at each location (in thousands of dollars)
  • Population - Population size in region (in thousands)
  • Price - Price company charges for car seats at each site
  • ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
  • Age - Average age of the local population
  • Education - Education level at each location
  • Urban - A factor with levels No and Yes to indicate whether the store is in an urban or rural location
  • US - A factor with levels No and Yes to indicate whether the store is in the US or not

Additional information can be found:
https://www.rdocumentation.org/packages/ISLR/versions/1.2/topics/Carseats

or by typing ?Carseats after importing ISLR

head(cs <- Carseats,n=20)

Exploratory Data Analysis

Carseats is a simulated data set with no missing values. Being artificial, there is no reason to believe that the data is anything but clean. However, not all of the variables are numeric, so that is something worth keeping in mind when constructing models. There are 3 categorical variables, 2 binary (Urban,US) and 1 ordinal (ShelveLoc)

all(complete.cases(cs))
## [1] TRUE
str(cs)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
plot(cs)

summary(lm(Sales ~ ., data=cs))
## 
## Call:
## lm(formula = Sales ~ ., data = cs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8692 -0.6908  0.0211  0.6636  3.4115 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.6606231  0.6034487   9.380  < 2e-16 ***
## CompPrice        0.0928153  0.0041477  22.378  < 2e-16 ***
## Income           0.0158028  0.0018451   8.565 2.58e-16 ***
## Advertising      0.1230951  0.0111237  11.066  < 2e-16 ***
## Population       0.0002079  0.0003705   0.561    0.575    
## Price           -0.0953579  0.0026711 -35.700  < 2e-16 ***
## ShelveLocGood    4.8501827  0.1531100  31.678  < 2e-16 ***
## ShelveLocMedium  1.9567148  0.1261056  15.516  < 2e-16 ***
## Age             -0.0460452  0.0031817 -14.472  < 2e-16 ***
## Education       -0.0211018  0.0197205  -1.070    0.285    
## UrbanYes         0.1228864  0.1129761   1.088    0.277    
## USYes           -0.1840928  0.1498423  -1.229    0.220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 388 degrees of freedom
## Multiple R-squared:  0.8734, Adjusted R-squared:  0.8698 
## F-statistic: 243.4 on 11 and 388 DF,  p-value: < 2.2e-16

R handles the dummy encoding of categorical variables for us so long as they are designated as factors rather than chars

summary(lm(cs$Sales ~. -Population -Education -Urban -US, data=cs))
## 
## Call:
## lm(formula = cs$Sales ~ . - Population - Education - Urban - 
##     US, data = cs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7728 -0.6954  0.0282  0.6732  3.3292 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.475226   0.505005   10.84   <2e-16 ***
## CompPrice        0.092571   0.004123   22.45   <2e-16 ***
## Income           0.015785   0.001838    8.59   <2e-16 ***
## Advertising      0.115903   0.007724   15.01   <2e-16 ***
## Price           -0.095319   0.002670  -35.70   <2e-16 ***
## ShelveLocGood    4.835675   0.152499   31.71   <2e-16 ***
## ShelveLocMedium  1.951993   0.125375   15.57   <2e-16 ***
## Age             -0.046128   0.003177  -14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 392 degrees of freedom
## Multiple R-squared:  0.872,  Adjusted R-squared:  0.8697 
## F-statistic: 381.4 on 7 and 392 DF,  p-value: < 2.2e-16

Following a similar process as before, features outside of the statistical significance threshold are excluded.

This time, the adjusted R-squared value decreased but by so little, it could be safely written off as noise.

vif(lm(Sales ~., data=cs))
##                 GVIF Df GVIF^(1/(2*Df))
## CompPrice   1.554618  1        1.246843
## Income      1.024731  1        1.012290
## Advertising 2.103136  1        1.450219
## Population  1.145534  1        1.070296
## Price       1.537068  1        1.239785
## ShelveLoc   1.033891  2        1.008367
## Age         1.021051  1        1.010471
## Education   1.026342  1        1.013086
## Urban       1.022705  1        1.011289
## US          1.980720  1        1.407380
scs <- subset(cs, select=-c(Population, Education, Urban, US))
vif(lm(Sales ~., data=scs))
##                 GVIF Df GVIF^(1/(2*Df))
## CompPrice   1.534883  1        1.238904
## Income      1.015448  1        1.007694
## Advertising 1.012935  1        1.006447
## Price       1.534425  1        1.238719
## ShelveLoc   1.015139  2        1.003763
## Age         1.016830  1        1.008380

Neither the original nor the reduced data appear to have issues with high collinearity, so no further action is required before modeling

plot(scs)

A quick plot shows potential for a linear relationships between Price and CompPrice/Sales

Models

The baseline values for our multiple linear model are:
Multiple R-squared: 0.8734, Adjusted R-squared: 0.8698, AIC: 1160.470, BIC: 1196.393

After feature selection, added polynomial and interaction terms, the final model values are:
Multiple R-squared: 0.8768, Adjusted R-squared: 0.8730, AIC: 1155.221, BIC: 1211.101

These results are far from ideal, only 3/4 metrics improved and only by a very small amount.

summary(lm(Sales ~ ., data=scs))
## 
## Call:
## lm(formula = Sales ~ ., data = scs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7728 -0.6954  0.0282  0.6732  3.3292 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.475226   0.505005   10.84   <2e-16 ***
## CompPrice        0.092571   0.004123   22.45   <2e-16 ***
## Income           0.015785   0.001838    8.59   <2e-16 ***
## Advertising      0.115903   0.007724   15.01   <2e-16 ***
## Price           -0.095319   0.002670  -35.70   <2e-16 ***
## ShelveLocGood    4.835675   0.152499   31.71   <2e-16 ***
## ShelveLocMedium  1.951993   0.125375   15.57   <2e-16 ***
## Age             -0.046128   0.003177  -14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 392 degrees of freedom
## Multiple R-squared:  0.872,  Adjusted R-squared:  0.8697 
## F-statistic: 381.4 on 7 and 392 DF,  p-value: < 2.2e-16
summary(lm(Sales ~ . + Income:Advertising + Price:Age, data=scs))# ILSR 3.6.6
## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = scs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8199 -0.7116  0.0023  0.6831  3.2763 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.4243499  0.9646032   6.660 9.33e-11 ***
## CompPrice           0.0928079  0.0040924  22.678  < 2e-16 ***
## Income              0.0108875  0.0025863   4.210 3.18e-05 ***
## Advertising         0.0637965  0.0208420   3.061 0.002359 ** 
## Price              -0.1010926  0.0073986 -13.664  < 2e-16 ***
## ShelveLocGood       4.8326896  0.1522867  31.734  < 2e-16 ***
## ShelveLocMedium     1.9461356  0.1250819  15.559  < 2e-16 ***
## Age                -0.0586996  0.0158383  -3.706 0.000241 ***
## Income:Advertising  0.0007561  0.0002776   2.724 0.006748 ** 
## Price:Age           0.0001130  0.0001325   0.853 0.394279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 390 degrees of freedom
## Multiple R-squared:  0.8748, Adjusted R-squared:  0.8719 
## F-statistic: 302.7 on 9 and 390 DF,  p-value: < 2.2e-16

Including a couple of interaction terms increases the adjusted R-squared value by a small amount

lm_fin <- lm(Sales ~ CompPrice +ShelveLoc +Income*Advertising +Price*Age +poly(Price,4), data=scs)
summary(lm_fin)
## 
## Call:
## lm(formula = Sales ~ CompPrice + ShelveLoc + Income * Advertising + 
##     Price * Age + poly(Price, 4), data = scs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73745 -0.70109  0.01392  0.68709  3.10631 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.4466871  0.9638178   6.689 7.90e-11 ***
## CompPrice           0.0924781  0.0040927  22.596  < 2e-16 ***
## ShelveLocGood       4.8092343  0.1520931  31.620  < 2e-16 ***
## ShelveLocMedium     1.9504509  0.1249464  15.610  < 2e-16 ***
## Income              0.0103652  0.0025883   4.005 7.45e-05 ***
## Advertising         0.0622859  0.0207701   2.999 0.002885 ** 
## Price              -0.1007729  0.0073871 -13.642  < 2e-16 ***
## Age                -0.0579845  0.0158352  -3.662 0.000285 ***
## poly(Price, 4)1            NA         NA      NA       NA    
## poly(Price, 4)2     0.8416354  1.0121660   0.832 0.406193    
## poly(Price, 4)3    -0.4334704  1.0185101  -0.426 0.670641    
## poly(Price, 4)4    -2.3557643  1.0143846  -2.322 0.020732 *  
## Income:Advertising  0.0007827  0.0002768   2.827 0.004935 ** 
## Price:Age           0.0001097  0.0001324   0.829 0.407778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 387 degrees of freedom
## Multiple R-squared:  0.8768, Adjusted R-squared:  0.873 
## F-statistic: 229.5 on 12 and 387 DF,  p-value: < 2.2e-16

By using both polynomial and interaction terms, the highest R-squared we are able to achieve is 0.873. However, investing more time into deeper exploration of transforms and interactions would almost certainly garner better results.

AIC(lm(Sales ~ ., data=scs))
## [1] 1160.47
BIC(lm(Sales ~ ., data=scs))
## [1] 1196.393
AIC(lm_fin)
## [1] 1155.221
BIC(lm_fin)
## [1] 1211.101

Because the model complexity increased a fair deal from these modifications, the BIC value increased and AIC only decreased slightly. Thus, it is debatable whether making these changes actually would improve the model’s predictive power.

par(mfrow=c(2,2))
plot(lm_fin)

Conclusions

This notebook explored two datasets, Boston and Carseats, building multiple linear models for each.

The process was very similar for both datasets:

determine variable importance -> deal with collinearity -> dimension reduction -> add transforms/interactions -> assess

The results of following this process differed quite a bit between the two datasets, however. The Boston dataset responding very well to the transformed terms added, increasing the R-squared value significantly. The Carseats dataset was rather unresponsive to the applied transforms.

There could be several different reasons for the alternate outcomes, could be because one dataset was real and the other contrived, or because one had all continuous variables and the other had some categorical.

Future Work: A great deal more could be done with these datasets. For starters, a future task could be to try and make predictions with these models, adding more interactions or transforms to improve the R-squared values, stepping out of linear models and try using a GBM or RF to make more sense of the data, and more visualizations could be toyed with, expanding out into 3d or plotting on a map in the case of the Boston dataset.

Tags: , , ,

Categories: ,