# Multiple Linear Regression

## Overview¶

This document will demonstrate multiple approaches toward producing linear models in python using the US DoT airfare dataset.

Two Basic Linear Regression Models will are used:

Numpy's

`polyfit`

Scipy Stats'

`linregress`

Two Multiple Linear Regression Models are used:

sklearn's

`LinearRegression`

statsmodels'

`ols`

The most performant models in terms of adjusted R-squared often have some basic interaction terms applied that add useful complexity at the cost of a bit of interpretability.

#### Imports¶

```
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import sklearn.linear_model as lm
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
```

## Data¶

Dataset: airq402.dat

Source: U.S. Department of Transportation

Description: Airfares and passengers for U.S. Domestic Routes for 4th Quarter of 2002.

Variables/Columns:

`City1 (string)`

- Origin (From) city`City2 (string)`

- Destination (to) city`Average Fare`

- average cost (in USD) Origin -> Destination across all airlines`Distance (int)`

- distance (in miles) Origin -> Destination`Average weekly passengers`

- average number of people flying Origin -> Destination per week`market leading airline (string)`

- abbreviated name of airline with the largest % market share that offers Origin -> Destination flight`market share`

- percentage of market that the leading airline possesses`Average fare`

- average cost (in USD) that the leading airline charges for Origin -> Destination`Low price airline (string)`

- abbreviated name of airline offering the lowest price for Origin -> Destination flight`market share`

- percentage of market that the lowest price airline possesses`price`

- actual price paid for this flight instance Origin -> Destination

Columns with unspecified data types are numeric floating point

Data may be obtained from:

http://users.stat.ufl.edu/~winner/data/airq402.dat

Description provided by:

http://users.stat.ufl.edu/~winner/data/airq402.txt

```
### Uncomment line to download dataset locally ###
#!wget http://users.stat.ufl.edu/~winner/data/airq402.dat
```

```
# Assign new, compressed names to the columns
colnames=['CityFrom','CityTo','AvgFare','Distance','AvgWeekPsgrs',
'MktLeadArLn','MktShare','AvgFareLead','LowPriceArLn','MktShareLow','Price']
#airq = pd.read_fwf('airq402.dat', names=colnames)
airq = pd.read_fwf('http://users.stat.ufl.edu/~winner/data/airq402.dat', names=colnames)
```

```
with open('D:\\Desktop\\pandar_tables\\panda_df.html', mode='w') as f:
f.write(airq.to_html())
```

```
airq.head()
```

## Exploratory Data Analysis¶

The dataset has no missing values and 4 columns with object values, these could be treated as categorical, but they do have a high cardinality. Other than the handling of the object values, there is no preprocessing required, the data is clean enough for immediate analysis.

```
airq.info()
```

```
airq.select_dtypes(include=['object']).nunique()
```

```
cats = ['CityFrom','CityTo','MktLeadArLn','LowPriceArLn']
```

```
city_stack = airq.CityFrom.append(airq.CityTo)
airln_stack = airq.MktLeadArLn.append(airq.LowPriceArLn)
print('unique locations: {} | unique airlines: {}'.format(city_stack.nunique(), airln_stack.nunique()))
```

Unsurprisingly, there is a good deal of overlap between the two city columns and two airline columns. With only 19 total airlines, the model would not be *too* convoluted if we were to one-hot encode them. For locations, however, a cardinality of 104 would quickly clutter the model, and since one of the most attractive aspects of linear regression is its interpretability, this is likely a poor decision.

```
airq[cats] = airq[cats].astype('category')
```

What can be experimented with is a simple categorical encoding, wherein each unique entry is assigned it's own number. Pandas does with relative ease by assigning desired object columns to a `category`

dtype.

```
airq.dtypes
```

```
airq_enc = airq.apply(lambda x: x.cat.codes if x.dtype.name == 'category' else x)
```

In order to actually use the numeric representation, we need to get the underlying `cat.codes`

from pandas. Note the `.name`

in the equivalence check, because of the way pandas handles categorical dtypes, simply comparing like one would with a float or int does not work.

```
airq_enc.dtypes
```

```
sns.pairplot(airq);
```

From these graphs, we can see very strong linear relationships among the `AvgFare`

s and `Price`

. This comes as no surprise as quite often, the leading airline controls such a large portion of the market share that they effectively set the price.

```
fig,ax = plt.subplots(figsize=(8,6))
sns.heatmap(airq_enc.corr(),vmin=-0.8, annot=True, cmap='coolwarm',ax=ax);
```

A correlation matrix provides a bit more evidence to the previous graphical analysis. We see that `AvgFareLead`

and `AvgFare`

are nearly perfectly correlated, and price is also strongly correlated with each of these variables.

## Models¶

2 Simple linear regression models are made with `numpy`

and `scipy.stats`

followed by 2 Multiple linear regressions models in `sklearn`

and `StatModels`

.

Using only 1 variable yielded an R-squared of ~0.75 for the basic models.

Using `sklearn's`

an R-squared of ~0.816 is found.

Introducing interaction terms with StatModels' `ols`

found the highest R-squared value at 0.900

```
# Define the model inputs
X, y = airq_enc.drop(columns=['Price']), airq_enc.Price
X_nc = X.drop(columns=X.select_dtypes('int8'))
```

#### Numpy's LinReg¶

numpy's in-built least squares polynomial fit can be used with `deg=1`

to mimic the most basic form of linear regression. We are given back just two things, the slope and intercept of the fit line.

```
np.polyfit(X.AvgFare, y, deg=1)
```

#### Scipy stats' LinReg¶

Scipy's `linregress`

offers similar capabilities to that of numpy's `polyfit`

, but returns a far richer representation of the model, also including the p-value, standard error and r value.

```
ss.linregress(X.AvgFare,y)
```

```
slope, intercept, r_value, p_value, std_err = ss.linregress(X.AvgFare,y)
print(f'Slope: {slope}; Intercept: {intercept}; R-squared: %0.3f' % r_value**2)
plt.plot(X.AvgFare,y, 'o', label='original data')
plt.plot(X.AvgFare, intercept + slope*X.AvgFare, 'r', label='fitted line')
plt.legend()
plt.show()
```

seaborn also has `lmplot`

which serves as a much nicer way to handle plotting of a simple linear model.

```
sns.lmplot('AvgFare','Price',airq);
```

#### Sklearn's linear regression¶

Moving up to the multi-linear regression models, sklearn's `LinearRegression`

uses Ordinary Least Squares method with very little set up required, it's quick and effective.

```
linreg = lm.LinearRegression()
model = linreg.fit(X,y)
print('R-squared: %0.5f' % model.score(X,y))
```

```
print('R-squared: %0.5f' % lm.LinearRegression().fit(X_nc,y).score(X_nc,y))
```

Using the data without any categorical values only slightly decreased the $R^2$ value, entirely excluding them from a model is an option worth considering.

#### StatsModels' linear regression¶

StatsModels arguably provides the most feature rich, robust methods of linear regression. It supports two different syntaxes for modeling, a traditional pythonic approach and, for those accustomed to *R*, it has a formula based approach. Here, we will be using the latter as it is slightly more convenient.

```
rform = " + ".join(X.columns); rform # setup R-style RHS/X/exogenous variables
```

```
resf = smf.ols(f'Price ~ {rform}', data=airq).fit()
rtab = resf.summary()
```

```
from IPython.display import display_html
display_html(rtab.tables[0], rtab.tables[2])
print(rtab.extra_txt)
```

StatsModels resembles R's style very closely, meaning there is no need to pre-encode the categorical values as that is handled for us. Another nice feature that SM provides is warnings; a warning was displayed mentioning the possibility of strong multicollinearity.

Luckily, it also has a method for calculating VIF

```
exv = resf.model.exog
exv_names = resf.model.exog_names
vifs = [vif(exv, i) for i in range(exv.shape[1])]
```

```
vif_df = pd.DataFrame(data=[*zip(exv_names, vifs)],columns=['feature', 'VIF'])
vif_df[vif_df.VIF > 5].sort_values(by=['VIF'], ascending=False).head(10)
```

Looks like there are some very large issues with collinearity, this may indicate that allowing the default means of handling categorical variables doesn't quite work with cardinalities as high as they are.

```
resf.pvalues[resf.pvalues < 0.05].sort_values()
```

Though `LowPriceArLn`

and a handful of cities do fall within the realm of statistical significance, we add a large amount of unnecessary multicolinearity and clutter the model quite badly. So, lets try the model without any of the categorical variables and add in interaction terms.

```
rform_nc = '*'.join(X_nc.columns); rform_nc
```

Using `*`

rather than `+`

builds up interaction terms (`:`

) for every variable, as well as including the original variables.

```
resf_nc = smf.ols(f"Price ~ {rform_nc}", data=airq).fit()
rtab_nc = resf_nc.summary()
display_html(rtab_nc.tables[0], rtab_nc.tables[2])
print(rtab_nc.extra_txt)
```

After simply doing away with all non-numeric variables, we are left with an Adjusted $R^2$ of 0.911, a significant improvement over the previous, non-interacting model. We also can see that both the AIC and BIC scores have decreased, due to the reduction in model complexity.

```
Xsm = sm.add_constant(X_nc)
model_ols = sm.OLS(y, Xsm).fit()
```

```
model_ols.pvalues.sort_values()
```

## Conclusions¶

In this notebook several different means of doing linear regression in python were explored.

Basic linear regression was done in `numpy`

and `scipy.stats`

, multiple linear regression was performed with `sklearn`

and `StatsModels`

After spending a large amount of time considering the best way to handle all the string values in the data, it turned out that the best was not to deal with them at all. Dropping any non-numeric values improved the model significantly.

There were many issues with collinearity across variables. Average Fare and Price are extremely closely related, which makes intuitive sense, considering Average Fare is just the mean price for that particular trip.

#### Future work¶

Sklearn has many other learner models that were left unexplored, Lasso, Ridge, Huber, all potential candidates for further exploration. Additionally, there are many more transformation that could be tried on the data to fit a line better. Only interaction terms were tried between the `AvgFares`

and `MktShares`

, many other combinations could be attempted as well as polynomial transformations.

### References¶

#### Code & Docs¶

- http://www.statsmodels.org/stable/index.html
- https://etav.github.io/python/vif_factor_python.html
- https://medium.freecodecamp.org/data-science-with-python-8-ways-to-do-linear-regression-and-measure-their-speed-b5577d75f8b
- http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
- https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.polyfit.html
- http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model