Multiple Linear Regression
Overview¶
This document will demonstrate multiple approaches toward producing linear models in python using the US DoT airfare dataset.
Two Basic Linear Regression Models will are used:
Numpy's
polyfit
Scipy Stats'
linregress
Two Multiple Linear Regression Models are used:
sklearn's
LinearRegression
statsmodels'
ols
The most performant models in terms of adjusted R-squared often have some basic interaction terms applied that add useful complexity at the cost of a bit of interpretability.
Imports¶
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import sklearn.linear_model as lm
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
Data¶
Dataset: airq402.dat
Source: U.S. Department of Transportation
Description: Airfares and passengers for U.S. Domestic Routes for 4th Quarter of 2002.
Variables/Columns:
City1 (string)
- Origin (From) cityCity2 (string)
- Destination (to) cityAverage Fare
- average cost (in USD) Origin -> Destination across all airlinesDistance (int)
- distance (in miles) Origin -> DestinationAverage weekly passengers
- average number of people flying Origin -> Destination per weekmarket leading airline (string)
- abbreviated name of airline with the largest % market share that offers Origin -> Destination flightmarket share
- percentage of market that the leading airline possessesAverage fare
- average cost (in USD) that the leading airline charges for Origin -> DestinationLow price airline (string)
- abbreviated name of airline offering the lowest price for Origin -> Destination flightmarket share
- percentage of market that the lowest price airline possessesprice
- actual price paid for this flight instance Origin -> Destination
Columns with unspecified data types are numeric floating point
Data may be obtained from:
http://users.stat.ufl.edu/~winner/data/airq402.dat
Description provided by:
http://users.stat.ufl.edu/~winner/data/airq402.txt
### Uncomment line to download dataset locally ###
#!wget http://users.stat.ufl.edu/~winner/data/airq402.dat
# Assign new, compressed names to the columns
colnames=['CityFrom','CityTo','AvgFare','Distance','AvgWeekPsgrs',
'MktLeadArLn','MktShare','AvgFareLead','LowPriceArLn','MktShareLow','Price']
#airq = pd.read_fwf('airq402.dat', names=colnames)
airq = pd.read_fwf('http://users.stat.ufl.edu/~winner/data/airq402.dat', names=colnames)
with open('D:\\Desktop\\pandar_tables\\panda_df.html', mode='w') as f:
f.write(airq.to_html())
airq.head()
Exploratory Data Analysis¶
The dataset has no missing values and 4 columns with object values, these could be treated as categorical, but they do have a high cardinality. Other than the handling of the object values, there is no preprocessing required, the data is clean enough for immediate analysis.
airq.info()
airq.select_dtypes(include=['object']).nunique()
cats = ['CityFrom','CityTo','MktLeadArLn','LowPriceArLn']
city_stack = airq.CityFrom.append(airq.CityTo)
airln_stack = airq.MktLeadArLn.append(airq.LowPriceArLn)
print('unique locations: {} | unique airlines: {}'.format(city_stack.nunique(), airln_stack.nunique()))
Unsurprisingly, there is a good deal of overlap between the two city columns and two airline columns. With only 19 total airlines, the model would not be too convoluted if we were to one-hot encode them. For locations, however, a cardinality of 104 would quickly clutter the model, and since one of the most attractive aspects of linear regression is its interpretability, this is likely a poor decision.
airq[cats] = airq[cats].astype('category')
What can be experimented with is a simple categorical encoding, wherein each unique entry is assigned it's own number. Pandas does with relative ease by assigning desired object columns to a category
dtype.
airq.dtypes
airq_enc = airq.apply(lambda x: x.cat.codes if x.dtype.name == 'category' else x)
In order to actually use the numeric representation, we need to get the underlying cat.codes
from pandas. Note the .name
in the equivalence check, because of the way pandas handles categorical dtypes, simply comparing like one would with a float or int does not work.
airq_enc.dtypes
sns.pairplot(airq);
From these graphs, we can see very strong linear relationships among the AvgFare
s and Price
. This comes as no surprise as quite often, the leading airline controls such a large portion of the market share that they effectively set the price.
fig,ax = plt.subplots(figsize=(8,6))
sns.heatmap(airq_enc.corr(),vmin=-0.8, annot=True, cmap='coolwarm',ax=ax);
A correlation matrix provides a bit more evidence to the previous graphical analysis. We see that AvgFareLead
and AvgFare
are nearly perfectly correlated, and price is also strongly correlated with each of these variables.
Models¶
2 Simple linear regression models are made with numpy
and scipy.stats
followed by 2 Multiple linear regressions models in sklearn
and StatModels
.
Using only 1 variable yielded an R-squared of ~0.75 for the basic models.
Using sklearn's
an R-squared of ~0.816 is found.
Introducing interaction terms with StatModels' ols
found the highest R-squared value at 0.900
# Define the model inputs
X, y = airq_enc.drop(columns=['Price']), airq_enc.Price
X_nc = X.drop(columns=X.select_dtypes('int8'))
Numpy's LinReg¶
numpy's in-built least squares polynomial fit can be used with deg=1
to mimic the most basic form of linear regression. We are given back just two things, the slope and intercept of the fit line.
np.polyfit(X.AvgFare, y, deg=1)
Scipy stats' LinReg¶
Scipy's linregress
offers similar capabilities to that of numpy's polyfit
, but returns a far richer representation of the model, also including the p-value, standard error and r value.
ss.linregress(X.AvgFare,y)
slope, intercept, r_value, p_value, std_err = ss.linregress(X.AvgFare,y)
print(f'Slope: {slope}; Intercept: {intercept}; R-squared: %0.3f' % r_value**2)
plt.plot(X.AvgFare,y, 'o', label='original data')
plt.plot(X.AvgFare, intercept + slope*X.AvgFare, 'r', label='fitted line')
plt.legend()
plt.show()
seaborn also has lmplot
which serves as a much nicer way to handle plotting of a simple linear model.
sns.lmplot('AvgFare','Price',airq);
Sklearn's linear regression¶
Moving up to the multi-linear regression models, sklearn's LinearRegression
uses Ordinary Least Squares method with very little set up required, it's quick and effective.
linreg = lm.LinearRegression()
model = linreg.fit(X,y)
print('R-squared: %0.5f' % model.score(X,y))
print('R-squared: %0.5f' % lm.LinearRegression().fit(X_nc,y).score(X_nc,y))
Using the data without any categorical values only slightly decreased the $R^2$ value, entirely excluding them from a model is an option worth considering.
StatsModels' linear regression¶
StatsModels arguably provides the most feature rich, robust methods of linear regression. It supports two different syntaxes for modeling, a traditional pythonic approach and, for those accustomed to R, it has a formula based approach. Here, we will be using the latter as it is slightly more convenient.
rform = " + ".join(X.columns); rform # setup R-style RHS/X/exogenous variables
resf = smf.ols(f'Price ~ {rform}', data=airq).fit()
rtab = resf.summary()
from IPython.display import display_html
display_html(rtab.tables[0], rtab.tables[2])
print(rtab.extra_txt)
StatsModels resembles R's style very closely, meaning there is no need to pre-encode the categorical values as that is handled for us. Another nice feature that SM provides is warnings; a warning was displayed mentioning the possibility of strong multicollinearity.
Luckily, it also has a method for calculating VIF
exv = resf.model.exog
exv_names = resf.model.exog_names
vifs = [vif(exv, i) for i in range(exv.shape[1])]
vif_df = pd.DataFrame(data=[*zip(exv_names, vifs)],columns=['feature', 'VIF'])
vif_df[vif_df.VIF > 5].sort_values(by=['VIF'], ascending=False).head(10)
Looks like there are some very large issues with collinearity, this may indicate that allowing the default means of handling categorical variables doesn't quite work with cardinalities as high as they are.
resf.pvalues[resf.pvalues < 0.05].sort_values()
Though LowPriceArLn
and a handful of cities do fall within the realm of statistical significance, we add a large amount of unnecessary multicolinearity and clutter the model quite badly. So, lets try the model without any of the categorical variables and add in interaction terms.
rform_nc = '*'.join(X_nc.columns); rform_nc
Using *
rather than +
builds up interaction terms (:
) for every variable, as well as including the original variables.
resf_nc = smf.ols(f"Price ~ {rform_nc}", data=airq).fit()
rtab_nc = resf_nc.summary()
display_html(rtab_nc.tables[0], rtab_nc.tables[2])
print(rtab_nc.extra_txt)
After simply doing away with all non-numeric variables, we are left with an Adjusted $R^2$ of 0.911, a significant improvement over the previous, non-interacting model. We also can see that both the AIC and BIC scores have decreased, due to the reduction in model complexity.
Xsm = sm.add_constant(X_nc)
model_ols = sm.OLS(y, Xsm).fit()
model_ols.pvalues.sort_values()
Conclusions¶
In this notebook several different means of doing linear regression in python were explored.
Basic linear regression was done in numpy
and scipy.stats
, multiple linear regression was performed with sklearn
and StatsModels
After spending a large amount of time considering the best way to handle all the string values in the data, it turned out that the best was not to deal with them at all. Dropping any non-numeric values improved the model significantly.
There were many issues with collinearity across variables. Average Fare and Price are extremely closely related, which makes intuitive sense, considering Average Fare is just the mean price for that particular trip.
Future work¶
Sklearn has many other learner models that were left unexplored, Lasso, Ridge, Huber, all potential candidates for further exploration. Additionally, there are many more transformation that could be tried on the data to fit a line better. Only interaction terms were tried between the AvgFares
and MktShares
, many other combinations could be attempted as well as polynomial transformations.
References¶
Code & Docs¶
- http://www.statsmodels.org/stable/index.html
- https://etav.github.io/python/vif_factor_python.html
- https://medium.freecodecamp.org/data-science-with-python-8-ways-to-do-linear-regression-and-measure-their-speed-b5577d75f8b
- http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
- https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.polyfit.html
- http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model