Executive Summary
This report analyses the mtcars dataset (Motor Trend Car Road Tests) from the R “datasets” package and uses a multiple regression model to understand the relationship between miles per gallon (mpg) and the other variables. The effect of transmission type is of particular interest. Several techniques are used to help select the best-fitting regression model. The results show that transmission type has a statistically significant effect on mpg: on average, manual transmission cars get 2.94 miles per gallon more than automatic transmission cars, after adjusting for weight and quarter-mile time.
1. Load Data and Perform Exploratory Data Analysis / Summary
library(datasets); data(mtcars); str(mtcars);
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The dataset contains 32 observations, 19 with automatic and 13 with manual transmission, and 11 variables. mpg is the outcome (dependent variable, DV), which leaves the other 10 as potential independent variables (IVs). See Figure 1 (appendix) for a pair plot of the relationships among all variables.
The initial data exploration shows that: 1) some variables should be converted to categorical: cyl, am, vs and gear; 2) in addition to transmission type (am), the model should adjust for other potentially confounding variables; 3) some IVs are highly correlated with each other, for example cyl and disp, or cyl and hp.
mtcarsO<- mtcars #keep the original dataset
mtcars<- transform(mtcars, cyl=factor(cyl), vs=factor(vs), am=factor(am),gear=factor(gear)) # transform to factor vectors
levels(mtcars$am)=c("automatic", "manual")
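The two observations above (the 19/13 split and the correlation among engine-related IVs) can be checked with a minimal sketch. The names `cars`, `am_counts` and `engine_cor` are introduced here for illustration only and are not part of the original analysis; the sketch works on a fresh copy of the data so it does not depend on the transformation above.

```r
# Fresh copy of the raw data (am is still numeric: 0 = automatic, 1 = manual).
cars <- datasets::mtcars

# Number of cars per transmission type.
am_counts <- table(factor(cars$am, levels = c(0, 1),
                          labels = c("automatic", "manual")))
print(am_counts)

# Pairwise correlations among three engine-related predictors.
engine_cor <- cor(cars[, c("cyl", "disp", "hp")])
print(round(engine_cor, 2))
```

Large off-diagonal entries in `engine_cor` are the reason multicollinearity is checked before settling on a model.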
2. Regression Model Selection
2.1 Simple Linear Regression Model
fit1<- lm(mpg ~ am, mtcars); summary(fit1)$coef # only look at transmission type and mpg
When all other variables are ignored, the model explains only 35.98% of the variance in mpg. To fit a multiple regression model, several techniques are used to check that the variables are selected appropriately.
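As a rough illustration of what the single-predictor model measures, the sketch below recomputes its R-squared and shows that the slope on transmission type is simply the difference between the two group means. The object names (`cars`, `simple_fit`, `r2_simple`, `mean_diff`, `slope`) are assumptions made for this example.

```r
cars <- datasets::mtcars

# Same model as fit1, built from the raw data (am coded 0/1).
simple_fit <- lm(mpg ~ factor(am), data = cars)

# Share of the variance in mpg explained by transmission type alone.
r2_simple <- summary(simple_fit)$r.squared

# With a single two-level predictor, the slope equals the difference
# between the manual and automatic group means.
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["1"] - group_means["0"])
slope <- unname(coef(simple_fit)[2])

print(c(r.squared = r2_simple, mean.diff = mean_diff, slope = slope))
```

Because this raw difference does not adjust for any other variable, it is larger than the adjusted estimate reported in the Result section.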
2.1 Use R function step() : Choose a model by AIC in a Stepwise Algorithm
fitAll<- lm(mpg ~ ., mtcars); fitBest<- step(fitAll) # backward stepwise selection by AIC, starting from the full model
step() suggests " mpg ~ wt + qsec + am " as the best-fitting model, with an overall p-value of 1.21e-11 and R-squared = 0.85.
2.2 Check VIF
The correlations between mpg and the IVs are ranked by absolute value, regardless of direction. The top five are:
corAll<- cor(mtcarsO); mpgRank<- names(sort(abs(corAll["mpg",]), decreasing=TRUE)) # rank by |correlation|
mpgCor<- corAll["mpg", mpgRank[2:6]]; mpgCor # skip position 1 (mpg itself), keep the signed values
## wt cyl disp hp drat
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 0.6811719
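Of these highly correlated variables, only wt is kept in the suggested model fitBest; cyl, disp, hp and drat are omitted. The variance inflation factors (VIF) below show the effect of dropping them.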
library(car);
vif1<- vif(lm(mpg ~ wt + qsec + am + cyl + disp + hp + drat, mtcars)) # fitBest terms + cyl + disp + hp + drat
vif2<- vif(lm(mpg ~ wt + qsec + am + cyl + disp, mtcars)) # omit drat and hp
vif3<- vif(fitBest) # also omit disp and cyl (= fitBest)
list(full=vif1[,1], omit.drat=vif2[,1], omit.hp=vif3)
## $full
## wt qsec am cyl disp hp drat
## 8.470935 5.738818 3.970748 16.420690 13.587284 5.850384 3.160280
##
## $omit.drat
## wt qsec am cyl disp
## 7.479021 4.541648 3.702063 12.831337 13.334837
##
## $omit.hp
## wt qsec am
## 2.482952 1.364339 2.541437
Dropping cyl, disp, hp and drat markedly lowers the VIFs of wt (8.47 to 2.48) and qsec (5.74 to 1.36); the VIF of am also falls somewhat (3.97 to 2.54). The last entry in the output ($omit.hp) is the fitBest model.
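For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below recomputes the three fitBest values this way without the car package. It is an illustration under the assumption that am is kept as a 0/1 numeric column, which for a two-level variable gives the same VIF as the factor coding; `cars`, `predictors` and `manual_vif` are names made up for this example.

```r
cars <- datasets::mtcars
predictors <- c("wt", "qsec", "am")

# For each predictor, regress it on the other predictors and
# convert the auxiliary R-squared into a variance inflation factor.
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = cars)
  1 / (1 - summary(aux)$r.squared)
})

print(round(manual_vif, 3))
```

The values should agree with the vif(fitBest) output shown above.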
2.3 Check residuals for normality
# Test residuals for normality
shapiro.test(fitBest$residual)
##
## Shapiro-Wilk normality test
##
## data: fitBest$residual
## W = 0.9411, p-value = 0.08043
With a Shapiro-Wilk p-value of 0.08, the null hypothesis of normally distributed residuals is not rejected at the 5% level.
3. Result
Selected model : " mpg ~ wt + qsec + am "
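As a complementary check that is not part of the original selection procedure, the selected model can be compared with the transmission-only model using a nested-model F test, and its overall p-value can be recovered from the F statistic. This is a sketch on the raw data; `fit_am`, `fit_sel`, `overall_p` and `cmp` are hypothetical names.

```r
cars <- datasets::mtcars

fit_am  <- lm(mpg ~ am, data = cars)              # transmission type only
fit_sel <- lm(mpg ~ wt + qsec + am, data = cars)  # selected model

# Overall F test of the selected model (the p-value printed by summary()).
s <- summary(fit_sel)
f <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Nested-model F test: do wt and qsec add explanatory power beyond am?
cmp <- anova(fit_am, fit_sel)

print(c(r.squared = s$r.squared, overall.p = overall_p))
print(cmp)
```

A small p-value in the anova() table indicates that wt and qsec improve the fit over the transmission-only model.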
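Center wt and qsec at their means, so that the intercept is the expected mpg of an automatic transmission car of average weight and average quarter-mile time: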
fitCenter<-lm(mpg ~ I(wt-mean(wt)) + I(qsec-mean(qsec)) + am, mtcars)
coef<- summary(fitCenter)$coef; summary(fitCenter)
##
## Call:
## lm(formula = mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) +
## am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.8979 0.7194 26.271 < 2e-16 ***
## I(wt - mean(wt)) -3.9165 0.7112 -5.507 6.95e-06 ***
## I(qsec - mean(qsec)) 1.2259 0.2887 4.247 0.000216 ***
## ammanual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
See Figure 2 for the residual plot.
3.1 Is an automatic or manual transmission better for MPG?
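Manual transmission is better for MPG: holding weight and quarter-mile time fixed, the coefficient for manual transmission is positive (2.94) and significant at the 5% level (p = 0.047).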
3.2 Quantify the MPG difference between automatic and manual transmissions
The overall model has a p-value of 1.21e-11 and \(R^2=\) 0.8497 (it explains 84.97% of the variance). At average wt and average qsec, an automatic transmission car is expected to get 18.9 mpg (95% confidence interval: 17.42, 20.37). A manual transmission adds 2.94 mpg (95% confidence interval: 0.05, 5.83), holding wt and qsec constant. The lower bound of that interval is close to zero, so the size of the effect is estimated imprecisely.
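The intervals quoted above can be reproduced with confint(). The sketch below refits the centered model on a fresh copy of the data; `cars`, `fit_centered` and `ci` are names chosen for this example rather than objects from the original code.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Centered model: the intercept refers to an automatic car
# with average weight and average quarter-mile time.
fit_centered <- lm(mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) + am,
                   data = cars)

# 95% confidence intervals for all coefficients.
ci <- confint(fit_centered, level = 0.95)
print(round(ci, 2))
```

The "(Intercept)" row gives the interval for the automatic baseline and the "ammanual" row gives the interval for the manual-versus-automatic difference.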
Appendix
library(ggplot2); library(GGally) # load plotting packages
fnum<- 1 # figure counter, initialised here: Figure 1 (pair plot of all variables)
ggpairs(mtcarsO, lower=list(continuous=wrap("smooth")),
axisLabels = "none", title="mtcars Dataset Correlation") +
theme_bw() +
theme(panel.grid.major = element_blank())
fnum<- fnum+1 #figure counter
plot(fitBest, which=1)