A Regression Analysis: The Impact of Vehicle Transmission Type and Other Factors on Fuel Efficiency

Executive Summary

This report analyses the mtcars dataset (Motor Trend Car Road Tests) from the R datasets package and uses a multivariate regression model to understand the relationship between miles per gallon (mpg) and the other variables. The effect of transmission type is of particular interest. Several techniques are used to help select the best-fitting regression model. The results show that transmission type has a statistically significant effect on mpg at the 5% level (p = 0.047). On average, holding weight and quarter-mile time constant, manual transmission cars get 2.94 miles per gallon more than automatic transmission cars.

1. Load Data and Perform Exploratory Data Analysis / Summary

library(datasets); data(mtcars); str(mtcars);
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The dataset contains 32 observations: 19 cars with automatic and 13 with manual transmission. There are 11 variables. mpg is the outcome (dependent variable, DV), which leaves the other 10 as potential independent variables (IVs). See Figure 1 (appendix) for a pair plot of the relationships among all variables.
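The transmission counts quoted above can be checked directly. This is a minimal sketch, assuming the unmodified mtcars data in which am is coded 0 (automatic) and 1 (manual); the name am_counts is introduced here only for illustration.

```r
library(datasets)

# In the raw data am is numeric: 0 = automatic, 1 = manual.
# Labelling the levels makes the frequency table self-explanatory.
am_counts <- table(factor(mtcars$am, levels = c(0, 1),
                          labels = c("automatic", "manual")))
am_counts
```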

The initial data exploration shows that: 1) some variables should be converted to categorical: cyl, am, vs and gear; 2) in addition to transmission type (am), the coefficients should be adjusted for other potentially confounding variables; 3) some IVs are highly correlated with each other, for example cyl and disp, or cyl and hp.

mtcarsO<- mtcars  # keep the original (all-numeric) dataset
mtcars<- transform(mtcars, cyl=factor(cyl), vs=factor(vs), am=factor(am), gear=factor(gear))  # convert to factors
levels(mtcars$am)<- c("automatic", "manual")  # 0 = automatic, 1 = manual

2. Regression Model Selection

2.1 Simple Linear Regression Model

fit1<- lm(mpg ~ am, mtcars); summary(fit1)$coef # only look at transmission type and mpg

Ignoring all other variables, this model explains only 35.98% of the variance in mpg. To fit a multivariate regression model, several techniques are used to check that the variables are selected appropriately.
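The single-predictor fit can be unpacked a little further. The sketch below, which assumes a labelled copy of the data called cars (a name used only for this illustration), extracts the R-squared and shows that the am coefficient is simply the unadjusted difference between the two group means.

```r
library(datasets)

# Illustrative copy of the data with am as a labelled factor
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

fit_am <- lm(mpg ~ am, data = cars)

# Share of the variance in mpg explained by transmission type alone
r2_am <- summary(fit_am)$r.squared

# Unadjusted mean mpg per transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

r2_am
group_means
coef(fit_am)  # intercept = automatic mean; ammanual = manual mean minus automatic mean
```

Because no other variable is adjusted for, this difference is larger than the adjusted estimate obtained from the multivariate model.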

2.1 Use R function step() : Choose a model by AIC in a Stepwise Algorithm

fitAll<- lm(mpg ~., mtcars); fitBest<- step(fitAll)

step() suggests " mpg ~ wt + qsec + am " as the best-fitting model, with an overall p-value of 1.21e-11 and R-squared = 0.85.

2.2 Check VIF

The IVs are ranked by the absolute value of their correlation with mpg, regardless of direction. The top 5 are shown:

corAll<- cor(mtcarsO); mpgRank<- names(sort(abs(corAll["mpg",]), decreasing=TRUE))
mpgCor<- corAll["mpg", mpgRank[2:6]]; mpgCor  # position 1 is mpg itself, so take ranks 2 to 6
##         wt        cyl       disp         hp       drat 
## -0.8676594 -0.8521620 -0.8475514 -0.7761684  0.6811719

Of these highly correlated variables, only wt is retained in the suggested model fitBest; cyl, disp, hp and drat are omitted.

library(car)
vif1<- vif(lm(mpg ~ wt + qsec + am + cyl + disp + hp + drat, mtcars)) # fitBest + cyl + disp + hp + drat
vif2<- vif(lm(mpg ~ wt + qsec + am + cyl + disp, mtcars)) # omit drat and hp
vif3<- vif(fitBest) # also omit disp and cyl (= fitBest)
# cyl is a 3-level factor, so vif1 and vif2 are matrices whose first column holds the GVIF
list(full=vif1[,1], omit.drat=vif2[,1], omit.hp=vif3)
## $full
##        wt      qsec        am       cyl      disp        hp      drat 
##  8.470935  5.738818  3.970748 16.420690 13.587284  5.850384  3.160280 
## 
## $omit.drat
##        wt      qsec        am       cyl      disp 
##  7.479021  4.541648  3.702063 12.831337 13.334837 
## 
## $omit.hp
##       wt     qsec       am 
## 2.482952 1.364339 2.541437

The VIFs of wt and qsec decrease markedly once cyl, disp, hp and drat are omitted (from 8.47 to 2.48 and from 5.74 to 1.36). The VIF of am also decreases somewhat (from 3.97 to 2.54).
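To make the VIF values less of a black box, the one for wt in the selected model can be reproduced by hand from its definition, VIF = 1 / (1 - R^2), where R^2 comes from regressing wt on the other predictors. This is a sketch under the assumption that every term has a single degree of freedom, as is the case for wt, qsec and am; the names cars, r2_wt and vif_wt are illustrative. The result should agree with the value reported for wt under $omit.hp above.

```r
library(datasets)

# Illustrative copy of the data with am as a labelled factor
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Regress wt on the remaining predictors of the selected model
r2_wt <- summary(lm(wt ~ qsec + am, data = cars))$r.squared

# Variance inflation factor: how much the variance of the wt coefficient
# is inflated by its correlation with qsec and am
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```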

2.3 Check residuals for normality

# Test residuals for normality
shapiro.test(fitBest$residual)
## 
##  Shapiro-Wilk normality test
## 
## data:  fitBest$residual
## W = 0.9411, p-value = 0.08043

With a p-value of 0.08, the Shapiro-Wilk test fails to reject the hypothesis that the residuals are normally distributed at the 5% level.
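As a further check on the model selection in this section, the selected model can be compared against the transmission-only model with a nested F-test. This is a sketch, not part of the original analysis; the names cars, fit_small, fit_sel and cmp are illustrative. Note that AIC() and the AIC printed by step() differ by an additive constant, so only differences between models are comparable.

```r
library(datasets)

# Illustrative copy of the data with am as a labelled factor
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

fit_small <- lm(mpg ~ am, data = cars)              # transmission type only
fit_sel   <- lm(mpg ~ wt + qsec + am, data = cars)  # model suggested by step()

# Nested F-test: does adding wt and qsec significantly reduce the residual error?
cmp <- anova(fit_small, fit_sel)
cmp

# A lower AIC indicates the preferred model
c(AIC_small = AIC(fit_small), AIC_selected = AIC(fit_sel))
```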

3. Result

Selected model : " mpg ~ wt + qsec + am "

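Center the predictors wt and qsec so that the intercept is the expected mpg of an automatic car of average weight and average quarter-mile time: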

fitCenter<-lm(mpg ~ I(wt-mean(wt)) + I(qsec-mean(qsec)) + am, mtcars)
coef<- summary(fitCenter)$coef; summary(fitCenter)
## 
## Call:
## lm(formula = mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) + 
##     am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           18.8979     0.7194  26.271  < 2e-16 ***
## I(wt - mean(wt))      -3.9165     0.7112  -5.507 6.95e-06 ***
## I(qsec - mean(qsec))   1.2259     0.2887   4.247 0.000216 ***
## ammanual               2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

See Figure 2 for the residual plot.

3.1 Is an automatic or manual transmission better for MPG?

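Manual transmission is better for MPG: with weight and quarter-mile time held constant, the estimated coefficient for manual transmission is positive (2.94) and significant at the 5% level (p = 0.047).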

3.2 Quantify the MPG difference between automatic and manual transmissions

The model has an overall p-value of 1.21e-11 and \(R^2 = 0.8497\) (it explains 84.97% of the variance). At the average wt and average qsec, the expected MPG of an automatic transmission car is 18.9 (95% confidence interval: 17.42, 20.37). The MPG increases by 2.94 if the car has a manual transmission (95% confidence interval: 0.05, 5.83).
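The intervals quoted above can be reproduced with confint(). The following sketch refits the centered model on an illustrative copy of the data (cars, fit_c and ci are names introduced here, not in the original code).

```r
library(datasets)

# Illustrative copy of the data with am as a labelled factor
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Same centered model as fitCenter
fit_c <- lm(mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) + am, data = cars)

# 95% confidence intervals for all coefficients
ci <- confint(fit_c, level = 0.95)

# Intercept: expected mpg of an automatic car at average wt and qsec
# ammanual:  adjusted mpg difference, manual minus automatic
round(ci[c("(Intercept)", "ammanual"), ], 2)
```

The lower bound for ammanual is only just above zero, which matches the p-value of 0.047: the effect is significant at the 5% level, but not by a wide margin.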


Appendix

library(ggplot2); library(GGally); #load package
fnum<- 0 # initialise the figure counter (it is not set in the code shown)
fnum<- fnum+1 #figure counter
ggpairs(mtcarsO, lower=list(continuous=wrap("smooth")),
        axisLabels = "none", title="mtcars Dataset Correlation") +
    theme_bw() + 
    theme(panel.grid.major = element_blank())

Figure 1 mtcars Dataset Correlation


fnum<- fnum+1 #figure counter
plot(fitBest, which=1)

Figure 2 Residuals Plot