Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(corrplot)

Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).

Section 1.1 Quantile Plot - Response Variable

Firstly, we will have to check whether the response variable follows a normal distribution by charting a quantile plot. This can be compared to the quantile plot of the logarithm of the price variable.

par(mfrow = c(1,2))
qqnorm(ames_train$price, pch = 1, main = "Q-Q Plot - Price")
qqline(ames_train$price)
qqnorm(log(ames_train$price), pch = 1, main = "Q-Q Plot - log(Price)")
qqline(log(ames_train$price))

par(mfrow = c(1,2))
qqnorm(ames_train$area, pch = 1, main = "Q-Q Plot - Area")
qqline(ames_train$area)
qqnorm(log(ames_train$area), pch = 1, main = "Q-Q Plot - log(Area)")
qqline(log(ames_train$area))

It can be observed from the above 4 plots that the price variable and the area variable have a much larger deviation from the qqline than the log(price) and log(area). This analysis is done to show that the log(price) and log(area) variables should be used in the final model generation.

Section 1.2 Boxplot - Price vs Neighborhood

Let us try to understand how the neighborhood affects the price of the house. Since the mantra in real estate is “Location, Location and Location”, the neighborhood’s effect is explored with the price of the house.The box plot is ordered in descending order based on the median of the price of the houses grouped by their corresponding neighborhood.

ggplot(ames_train,aes(x=price,y=reorder(Neighborhood,price))) + geom_boxplot(aes(fill = reorder(Neighborhood,price)) )+ labs(title = "Price of Houses - Neighborhood", x = "Prices of House", y ='Neighborhood') + scale_fill_discrete(guide = guide_legend(title = "Neighborhood"))

We can also extract the costliest and cheapest neighborhood.

sub_data = ames_train %>% select(Neighborhood, price) %>% group_by(Neighborhood) %>% summarise(Price_Median = median(price), Price_iqr = IQR(price))
sub_data[sub_data$Price_Median == max(sub_data$Price_Median),]

## # A tibble: 1 x 3
##   Neighborhood Price_Median Price_iqr
##   <fct>               <dbl>     <dbl>
## 1 StoneBr           340692.    151358

sub_data[sub_data$Price_Median == min(sub_data$Price_Median),]

## # A tibble: 1 x 3
##   Neighborhood Price_Median Price_iqr
##   <fct>               <dbl>     <dbl>
## 1 MeadowV             85750     20150

sub_data[sub_data$Price_iqr == max(sub_data$Price_iqr),]

## # A tibble: 1 x 3
##   Neighborhood Price_Median Price_iqr
##   <fct>               <dbl>     <dbl>
## 1 StoneBr           340692.    151358

It can be seen that the Stone Brook neighborhood is the most expensive locality [Price = 340,692] and the meadow village has the lowest priced houses [85,750] in general. It can also be observed that the the stonebrook neighborhood has the most variation when it comes to house prices with an inter-quartile range = 151,358.

Section 1.3 Boxplot - Price vs Overall Quality

The overall quality of a house encompasses all the other quality variables and trying to understand its correlation with the price of a house will be insightful and help in building the prediction model.

ggplot(ames_train,aes(x=price,y=reorder(Overall.Qual,price))) + geom_boxplot(aes(fill = reorder(Overall.Qual,price)))+ labs(title = "Price of Houses - Overall Quality", x = "Prices of House", y ='Overall Quality') + scale_fill_discrete(guide = guide_legend(title = "Overall Quality"))

The box plot clearly shows that as the price of the house increases, the overall quality increases. This is major contributing factor for including the “overall quality” variable in the final model.

Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).

The variables initially included in the model building are log(Area), log(Lot.Area), Year.Built, Year.Remod.Add, Overall.Qual, Overall.Cond , Neighborhood, MS.Zoning.

The above mentioned variables have a strong correlation with the price variable due to which they have been included in the final model. The Lot.Area and Overall Condition have a low correlation but nevertheless are included in the final model as they are factors that determine the price of a house [Intuitive Guess]. The above observation can be seen in the correlation plot below.

sub = ames_train %>% select(price, area, Lot.Area, Year.Built, Year.Remod.Add, Overall.Qual, Overall.Cond)
sub = na.omit(sub)
corrplot(cor(sub))

The model is fitted as shown below.

fit = lm(log(price)~log(area) + log(Lot.Area) + Year.Built + Year.Remod.Add + Overall.Qual + Overall.Cond + Neighborhood + MS.Zoning, data = ames_train)
summary(fit)

## 
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + Year.Built + 
##     Year.Remod.Add + Overall.Qual + Overall.Cond + Neighborhood + 
##     MS.Zoning, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47112 -0.07093  0.00515  0.08422  0.47485 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.4018763  0.9377834  -4.694 3.07e-06 ***
## log(area)            0.4265605  0.0212401  20.083  < 2e-16 ***
## log(Lot.Area)        0.1324598  0.0139135   9.520  < 2e-16 ***
## Year.Built           0.0047180  0.0004038  11.683  < 2e-16 ***
## Year.Remod.Add       0.0008538  0.0003566   2.394 0.016850 *  
## Overall.Qual         0.0867910  0.0062595  13.866  < 2e-16 ***
## Overall.Cond         0.0641653  0.0054020  11.878  < 2e-16 ***
## NeighborhoodBlueste -0.0205572  0.1000474  -0.205 0.837245    
## NeighborhoodBrDale  -0.1386011  0.0708885  -1.955 0.050849 .  
## NeighborhoodBrkSide -0.0079625  0.0613170  -0.130 0.896706    
## NeighborhoodClearCr -0.0300840  0.0669845  -0.449 0.653447    
## NeighborhoodCollgCr -0.1045045  0.0495554  -2.109 0.035216 *  
## NeighborhoodCrawfor  0.0818480  0.0583171   1.403 0.160791    
## NeighborhoodEdwards -0.1253812  0.0529036  -2.370 0.017985 *  
## NeighborhoodGilbert -0.1619510  0.0517729  -3.128 0.001812 ** 
## NeighborhoodGreens   0.1217758  0.0874735   1.392 0.164200    
## NeighborhoodGrnHill  0.3579906  0.1172968   3.052 0.002336 ** 
## NeighborhoodIDOTRR  -0.0691103  0.0672505  -1.028 0.304371    
## NeighborhoodMeadowV -0.1133437  0.0656519  -1.726 0.084591 .  
## NeighborhoodMitchel -0.0626478  0.0534502  -1.172 0.241455    
## NeighborhoodNAmes   -0.0685242  0.0515171  -1.330 0.183792    
## NeighborhoodNoRidge  0.0104211  0.0550301   0.189 0.849842    
## NeighborhoodNPkVill -0.0409177  0.0873188  -0.469 0.639460    
## NeighborhoodNridgHt  0.0871940  0.0507142   1.719 0.085878 .  
## NeighborhoodNWAmes  -0.1460961  0.0540022  -2.705 0.006943 ** 
## NeighborhoodOldTown -0.0595421  0.0629252  -0.946 0.344265    
## NeighborhoodSawyer  -0.0797240  0.0534144  -1.493 0.135881    
## NeighborhoodSawyerW -0.1713355  0.0518786  -3.303 0.000993 ***
## NeighborhoodSomerst  0.0338595  0.0586183   0.578 0.563652    
## NeighborhoodStoneBr  0.1244506  0.0574350   2.167 0.030494 *  
## NeighborhoodSWISU   -0.0374282  0.0684672  -0.547 0.584740    
## NeighborhoodTimber   0.0005126  0.0584865   0.009 0.993009    
## NeighborhoodVeenker -0.0563801  0.0679205  -0.830 0.406694    
## MS.ZoningFV          0.2388192  0.0727262   3.284 0.001061 ** 
## MS.ZoningI (all)     0.2308595  0.1594085   1.448 0.147880    
## MS.ZoningRH          0.3107010  0.0834037   3.725 0.000206 ***
## MS.ZoningRL          0.3140522  0.0605050   5.191 2.56e-07 ***
## MS.ZoningRM          0.2487574  0.0556829   4.467 8.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1476 on 962 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8768 
## F-statistic: 193.1 on 37 and 962 DF,  p-value: < 2.2e-16

The final summary gives an adjusted R-Squared = 0.876 which can be interpreted as the model explaining 87.6% of the variation in the log(price) variable. The significant variables can be obtained from the p-values in the last column in the summary table. We can see that log(Area), log(Lot.Area), Year.Built, Overall.Qual, Overall.Cond , Neighborhood and MS.Zoning are the significant variables as they have a p-value less than 0.05. Moreover, since atleast one class of Neighborhood and MS.Zoning has a p-value less than 0.05, both the variables are included.

Section 2.2 Model Selection

Now either using BAS or another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?

The model selection methods used for comparison are “AIC” and “BIC” instead of BAS. Firstly, lets use the AIC assessment criteria to check if the current model is the optimal solution.

ames_aic = step(fit,k=2,direction = "backward")

## Start:  AIC=-3788.63
## log(price) ~ log(area) + log(Lot.Area) + Year.Built + Year.Remod.Add + 
##     Overall.Qual + Overall.Cond + Neighborhood + MS.Zoning
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        20.971 -3788.6
## - Year.Remod.Add  1    0.1249 21.096 -3784.7
## - MS.Zoning       5    0.6610 21.631 -3767.6
## - log(Lot.Area)   1    1.9757 22.946 -3700.6
## - Year.Built      1    2.9755 23.946 -3657.9
## - Overall.Cond    1    3.0756 24.046 -3653.8
## - Neighborhood   26    4.4318 25.402 -3648.9
## - Overall.Qual    1    4.1909 25.162 -3608.4
## - log(area)       1    8.7919 29.762 -3440.5

The above summary shows that removal of no variables gives the minimum AIC value. Hence, according to this model, all variables are included in the model through the AIC criteria. This can be confirmed by checking the Adjusted R-squared value after running the stepwise AIC method.

summary(ames_aic)

## 
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + Year.Built + 
##     Year.Remod.Add + Overall.Qual + Overall.Cond + Neighborhood + 
##     MS.Zoning, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47112 -0.07093  0.00515  0.08422  0.47485 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.4018763  0.9377834  -4.694 3.07e-06 ***
## log(area)            0.4265605  0.0212401  20.083  < 2e-16 ***
## log(Lot.Area)        0.1324598  0.0139135   9.520  < 2e-16 ***
## Year.Built           0.0047180  0.0004038  11.683  < 2e-16 ***
## Year.Remod.Add       0.0008538  0.0003566   2.394 0.016850 *  
## Overall.Qual         0.0867910  0.0062595  13.866  < 2e-16 ***
## Overall.Cond         0.0641653  0.0054020  11.878  < 2e-16 ***
## NeighborhoodBlueste -0.0205572  0.1000474  -0.205 0.837245    
## NeighborhoodBrDale  -0.1386011  0.0708885  -1.955 0.050849 .  
## NeighborhoodBrkSide -0.0079625  0.0613170  -0.130 0.896706    
## NeighborhoodClearCr -0.0300840  0.0669845  -0.449 0.653447    
## NeighborhoodCollgCr -0.1045045  0.0495554  -2.109 0.035216 *  
## NeighborhoodCrawfor  0.0818480  0.0583171   1.403 0.160791    
## NeighborhoodEdwards -0.1253812  0.0529036  -2.370 0.017985 *  
## NeighborhoodGilbert -0.1619510  0.0517729  -3.128 0.001812 ** 
## NeighborhoodGreens   0.1217758  0.0874735   1.392 0.164200    
## NeighborhoodGrnHill  0.3579906  0.1172968   3.052 0.002336 ** 
## NeighborhoodIDOTRR  -0.0691103  0.0672505  -1.028 0.304371    
## NeighborhoodMeadowV -0.1133437  0.0656519  -1.726 0.084591 .  
## NeighborhoodMitchel -0.0626478  0.0534502  -1.172 0.241455    
## NeighborhoodNAmes   -0.0685242  0.0515171  -1.330 0.183792    
## NeighborhoodNoRidge  0.0104211  0.0550301   0.189 0.849842    
## NeighborhoodNPkVill -0.0409177  0.0873188  -0.469 0.639460    
## NeighborhoodNridgHt  0.0871940  0.0507142   1.719 0.085878 .  
## NeighborhoodNWAmes  -0.1460961  0.0540022  -2.705 0.006943 ** 
## NeighborhoodOldTown -0.0595421  0.0629252  -0.946 0.344265    
## NeighborhoodSawyer  -0.0797240  0.0534144  -1.493 0.135881    
## NeighborhoodSawyerW -0.1713355  0.0518786  -3.303 0.000993 ***
## NeighborhoodSomerst  0.0338595  0.0586183   0.578 0.563652    
## NeighborhoodStoneBr  0.1244506  0.0574350   2.167 0.030494 *  
## NeighborhoodSWISU   -0.0374282  0.0684672  -0.547 0.584740    
## NeighborhoodTimber   0.0005126  0.0584865   0.009 0.993009    
## NeighborhoodVeenker -0.0563801  0.0679205  -0.830 0.406694    
## MS.ZoningFV          0.2388192  0.0727262   3.284 0.001061 ** 
## MS.ZoningI (all)     0.2308595  0.1594085   1.448 0.147880    
## MS.ZoningRH          0.3107010  0.0834037   3.725 0.000206 ***
## MS.ZoningRL          0.3140522  0.0605050   5.191 2.56e-07 ***
## MS.ZoningRM          0.2487574  0.0556829   4.467 8.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1476 on 962 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8768 
## F-statistic: 193.1 on 37 and 962 DF,  p-value: < 2.2e-16

The adjusted R-Squared value = 0.8768 as shown above is similar to the adjusted R-squared value of the original model.

The second model selection method is the BIC assessment method. The below code implements the BIC method.

ames_bic = step(fit,k=log(1000),direction = "backward",)

## Start:  AIC=-3602.14
## log(price) ~ log(area) + log(Lot.Area) + Year.Built + Year.Remod.Add + 
##     Overall.Qual + Overall.Cond + Neighborhood + MS.Zoning
## 
##                  Df Sum of Sq    RSS     AIC
## - MS.Zoning       5    0.6610 21.631 -3605.6
## - Year.Remod.Add  1    0.1249 21.096 -3603.1
## <none>                        20.971 -3602.1
## - Neighborhood   26    4.4318 25.402 -3590.0
## - log(Lot.Area)   1    1.9757 22.946 -3519.0
## - Year.Built      1    2.9755 23.946 -3476.4
## - Overall.Cond    1    3.0756 24.046 -3472.2
## - Overall.Qual    1    4.1909 25.162 -3426.9
## - log(area)       1    8.7919 29.762 -3258.9
## 
## Step:  AIC=-3605.65
## log(price) ~ log(area) + log(Lot.Area) + Year.Built + Year.Remod.Add + 
##     Overall.Qual + Overall.Cond + Neighborhood
## 
##                  Df Sum of Sq    RSS     AIC
## - Year.Remod.Add  1    0.1087 21.740 -3607.5
## <none>                        21.631 -3605.6
## - Neighborhood   26    4.7789 26.410 -3585.6
## - log(Lot.Area)   1    2.6270 24.259 -3497.9
## - Year.Built      1    2.9774 24.609 -3483.6
## - Overall.Cond    1    3.3319 24.963 -3469.3
## - Overall.Qual    1    4.3683 26.000 -3428.6
## - log(area)       1    8.6694 30.301 -3275.5
## 
## Step:  AIC=-3607.54
## log(price) ~ log(area) + log(Lot.Area) + Year.Built + Overall.Qual + 
##     Overall.Cond + Neighborhood
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       21.740 -3607.5
## - Neighborhood  26    4.7638 26.504 -3589.0
## - log(Lot.Area)  1    2.6022 24.342 -3501.4
## - Year.Built     1    3.5541 25.294 -3463.0
## - Overall.Qual   1    4.4713 26.212 -3427.4
## - Overall.Cond   1    4.5324 26.273 -3425.1
## - log(area)      1    9.3779 31.118 -3255.8

The above summary shows that removal of “Year.Remod.Add” and “MS.Zoning” reduces the AIC value. Hence, according to this model these 2 variables are redundant and can be removed from the final model.

summary(ames_bic)

## 
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + Year.Built + 
##     Overall.Qual + Overall.Cond + Neighborhood, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45156 -0.07037  0.00251  0.08251  0.47209 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.8823227  0.8192606  -3.518 0.000455 ***
## log(area)            0.4313399  0.0211087  20.434  < 2e-16 ***
## log(Lot.Area)        0.1431479  0.0132987  10.764  < 2e-16 ***
## Year.Built           0.0048845  0.0003883  12.580  < 2e-16 ***
## Overall.Qual         0.0886864  0.0062854  14.110  < 2e-16 ***
## Overall.Cond         0.0704467  0.0049590  14.206  < 2e-16 ***
## NeighborhoodBlueste -0.0891913  0.0986241  -0.904 0.366032    
## NeighborhoodBrDale  -0.2114841  0.0674194  -3.137 0.001759 ** 
## NeighborhoodBrkSide -0.0742913  0.0582385  -1.276 0.202389    
## NeighborhoodClearCr -0.0595757  0.0672959  -0.885 0.376226    
## NeighborhoodCollgCr -0.1176472  0.0500900  -2.349 0.019039 *  
## NeighborhoodCrawfor  0.0464654  0.0582430   0.798 0.425191    
## NeighborhoodEdwards -0.1508086  0.0530717  -2.842 0.004583 ** 
## NeighborhoodGilbert -0.1729601  0.0523546  -3.304 0.000989 ***
## NeighborhoodGreens   0.1110296  0.0886570   1.252 0.210745    
## NeighborhoodGrnHill  0.2797081  0.1159052   2.413 0.015996 *  
## NeighborhoodIDOTRR  -0.2141527  0.0596278  -3.591 0.000345 ***
## NeighborhoodMeadowV -0.1800227  0.0615783  -2.923 0.003542 ** 
## NeighborhoodMitchel -0.0886748  0.0536504  -1.653 0.098690 .  
## NeighborhoodNAmes   -0.0951244  0.0515763  -1.844 0.065438 .  
## NeighborhoodNoRidge -0.0075672  0.0555292  -0.136 0.891633    
## NeighborhoodNPkVill -0.0518181  0.0883032  -0.587 0.557461    
## NeighborhoodNridgHt  0.0783688  0.0513035   1.528 0.126950    
## NeighborhoodNWAmes  -0.1742615  0.0540764  -3.223 0.001313 ** 
## NeighborhoodOldTown -0.1346813  0.0582285  -2.313 0.020933 *  
## NeighborhoodSawyer  -0.1066211  0.0535412  -1.991 0.046720 *  
## NeighborhoodSawyerW -0.1847387  0.0523059  -3.532 0.000432 ***
## NeighborhoodSomerst -0.0254272  0.0491033  -0.518 0.604695    
## NeighborhoodStoneBr  0.1125059  0.0581039   1.936 0.053123 .  
## NeighborhoodSWISU   -0.0677370  0.0686424  -0.987 0.323983    
## NeighborhoodTimber  -0.0142732  0.0591225  -0.241 0.809282    
## NeighborhoodVeenker -0.0839978  0.0684108  -1.228 0.219804    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1499 on 968 degrees of freedom
## Multiple R-squared:  0.877,  Adjusted R-squared:  0.873 
## F-statistic: 222.6 on 31 and 968 DF,  p-value: < 2.2e-16

As we can see with the removal of “Year.Remod.Add” and “MS.Zoning” the adjusted R-Squared value does not change much from the original model. The AIC model predicted no change in the model because the AIC value decrease while removing any one variable but when closely observed it can be seen that the 2 least significant variables, according to the AIC method, are “Year.Remod.Add” and “MS.Zoning” as the decrease in AIC after removing either of these 2 variables are very less. Hence since the BIC model selection method penalizes the parameters more than the AIC model selection method the least 2 significant variables are removed.

Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.

The preferred model is the BIC assessed model - “ames_bic” since it is a more parsimonious model and has less number of explanatory variables in comparison to the AIC model.

plot(ames_bic, which = 1, sub.caption = "")

The residual plot clearly shows the residuals to be randomly scattered around the linear line which proves the linearity condition. The major outliers in the plot are points - (181,428,310) and these are explored below

sub[c(181,428,310),]

## # A tibble: 3 x 7
##    price  area Lot.Area Year.Built Year.Remod.Add Overall.Qual Overall.Cond
##    <int> <int>    <int>      <int>          <int>        <int>        <int>
## 1  82500  1411    11900       1977           1977            7            5
## 2  12789   832     9656       1923           1970            2            2
## 3 184750  4676    40094       2007           2008           10            5

These 3 points can be safely assumed to be outliers. The point 428 has a price 12,789 dollars which is extremely low for a house and clearly shows it is an outlier. The point 310 which has a house priced at 184750 is close to the mean = 181190 whereas it has an overall quality = 10 and has an area approximately 3 times the mean area [1476] giving the model to predict at a very high rate. This shows that the point 310 is severly undervalued and has good chances for an investment.

Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).

rmse_inin = sqrt(mean((exp(predict(ames_bic,ames_train)) - ames_train$price)**2))
rmse_inin

## [1] 30557.52

The root mean square error after transforming the variable = 30,557 dollars.

Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?

The test data set contains the neighborhood - Landmrk which is not present in the training data set and hence it was removed.

ames_test = ames_test %>% filter(Neighborhood!='Landmrk')
pred = predict(ames_bic,ames_test,estimator = "BMA")
rmse_inout = sqrt(mean((exp(pred) - ames_test$price)**2))
rmse_inout

## [1] 25409.34

The out of sample root mean squared error = 25,409 is relatively less when compared to the in sample root mean squared error = 30,557 which shows that the model has not overfit the training data and performs relatively well on new data. The performance of the model on the test data set shows that the model is more accurate for the test data set when compared to the training data set. The reason for the model to perform better on the test set may be attributed to extracting the right and significant variables while building the model using the training set. Moreover by removing the uneccesary variables using the BIC model selection method we might have prevented overfitting the model leading to better predictions with the test set.

Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

Section 3.1 Final Model

Provide the summary table for your model.

The additional variables added to the final model in addition to the original model are Garage.Area, Exter.Qual, Exter.Cond, Total.Bsmt.SF, Central.Air and Bedroom.AbvGr.

fit_final = lm(log(price)~log(area) + log(Lot.Area) + Year.Built + Overall.Qual + Overall.Cond + Neighborhood + Garage.Area + area*Overall.Qual + Exter.Qual + Exter.Cond + Total.Bsmt.SF + Central.Air + Bedroom.AbvGr, data = ames_train)
model=summary(fit_final)
model

## 
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + Year.Built + 
##     Overall.Qual + Overall.Cond + Neighborhood + Garage.Area + 
##     area * Overall.Qual + Exter.Qual + Exter.Cond + Total.Bsmt.SF + 
##     Central.Air + Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29116 -0.06326  0.00417  0.07357  0.51288 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.309e+00  9.560e-01   1.370 0.171120    
## log(area)            2.215e-01  9.211e-02   2.405 0.016374 *  
## log(Lot.Area)        9.414e-02  1.324e-02   7.109 2.29e-12 ***
## Year.Built           3.503e-03  3.784e-04   9.258  < 2e-16 ***
## Overall.Qual         1.117e-01  1.526e-02   7.321 5.21e-13 ***
## Overall.Cond         6.649e-02  5.045e-03  13.180  < 2e-16 ***
## NeighborhoodBlueste -3.825e-02  9.089e-02  -0.421 0.673936    
## NeighborhoodBrDale  -1.657e-01  6.253e-02  -2.649 0.008200 ** 
## NeighborhoodBrkSide -8.670e-03  5.380e-02  -0.161 0.872003    
## NeighborhoodClearCr  1.252e-02  6.170e-02   0.203 0.839258    
## NeighborhoodCollgCr -3.641e-02  4.589e-02  -0.793 0.427756    
## NeighborhoodCrawfor  9.937e-02  5.346e-02   1.859 0.063355 .  
## NeighborhoodEdwards -7.683e-02  4.913e-02  -1.564 0.118196    
## NeighborhoodGilbert -4.571e-02  4.877e-02  -0.937 0.348834    
## NeighborhoodGreens   7.850e-02  8.106e-02   0.968 0.333116    
## NeighborhoodGrnHill  3.799e-01  1.059e-01   3.586 0.000353 ***
## NeighborhoodIDOTRR  -1.452e-01  5.520e-02  -2.631 0.008658 ** 
## NeighborhoodMeadowV -1.314e-01  5.763e-02  -2.281 0.022791 *  
## NeighborhoodMitchel -2.697e-02  4.969e-02  -0.543 0.587379    
## NeighborhoodNAmes   -4.080e-02  4.772e-02  -0.855 0.392782    
## NeighborhoodNoRidge  8.043e-02  5.127e-02   1.569 0.117075    
## NeighborhoodNPkVill -3.784e-02  8.096e-02  -0.467 0.640294    
## NeighborhoodNridgHt  8.735e-02  4.730e-02   1.847 0.065109 .  
## NeighborhoodNWAmes  -9.670e-02  5.001e-02  -1.934 0.053441 .  
## NeighborhoodOldTown -9.125e-02  5.354e-02  -1.704 0.088617 .  
## NeighborhoodSawyer  -2.371e-02  4.988e-02  -0.475 0.634705    
## NeighborhoodSawyerW -8.574e-02  4.813e-02  -1.782 0.075136 .  
## NeighborhoodSomerst  2.721e-02  4.477e-02   0.608 0.543510    
## NeighborhoodStoneBr  1.512e-01  5.310e-02   2.848 0.004488 ** 
## NeighborhoodSWISU   -4.258e-03  6.307e-02  -0.068 0.946187    
## NeighborhoodTimber   2.063e-02  5.375e-02   0.384 0.701178    
## NeighborhoodVeenker -2.124e-02  6.262e-02  -0.339 0.734558    
## Garage.Area          1.059e-04  2.898e-05   3.656 0.000270 ***
## area                 3.595e-04  1.095e-04   3.282 0.001069 ** 
## Exter.QualFa        -4.331e-02  5.751e-02  -0.753 0.451594    
## Exter.QualGd        -1.007e-01  2.924e-02  -3.445 0.000597 ***
## Exter.QualTA        -1.228e-01  3.376e-02  -3.638 0.000290 ***
## Exter.CondFa        -1.410e-01  7.922e-02  -1.780 0.075454 .  
## Exter.CondGd         3.591e-03  7.034e-02   0.051 0.959297    
## Exter.CondTA         2.848e-02  6.963e-02   0.409 0.682646    
## Total.Bsmt.SF        1.272e-04  1.475e-05   8.626  < 2e-16 ***
## Central.AirY         8.905e-02  2.325e-02   3.831 0.000136 ***
## Bedroom.AbvGr       -3.416e-02  7.561e-03  -4.518 7.03e-06 ***
## Overall.Qual:area   -3.191e-05  9.126e-06  -3.496 0.000494 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1358 on 954 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.9002, Adjusted R-squared:  0.8957 
## F-statistic: 200.1 on 43 and 954 DF,  p-value: < 2.2e-16

The final adjusted R-squared increases to 0.895 from 0.87 in the initial model.

Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.

Yes two variables are transformed - area and Lot.Area. The main reason for this transformation is that both these parameters are not normally distributed and heavily rightly skewed. The logarithmic transformation is used to normalize the skewed parameters.

Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.

I have used one variable interaction in the model - Area * Overall Quality. The main reason being area and quality together will have an impact on the price of an house. Sometimes people purchase houses not just when the size is large but also when the quality of the house is good. So together these variables might have an impact on the price of an house. From the summary table it can be seen that the p-value for this parameter is less than 0.05 making it significant.

Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.

I have used the AIC model selection method to include the variables in the final model. The reason for using this method is that this method penalizes the model for excessive parameters. Hence to achieve a parsimonious model and highly predictive model, this selection method is incorporated.

ames_aic_final = step(fit_final,k=2,direction = "backward")

## Start:  AIC=-3942.72
## log(price) ~ log(area) + log(Lot.Area) + Year.Built + Overall.Qual + 
##     Overall.Cond + Neighborhood + Garage.Area + area * Overall.Qual + 
##     Exter.Qual + Exter.Cond + Total.Bsmt.SF + Central.Air + Bedroom.AbvGr
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           17.584 -3942.7
## - log(area)          1    0.1066 17.690 -3938.7
## - Overall.Qual:area  1    0.2253 17.809 -3932.0
## - Exter.Qual         3    0.2972 17.881 -3932.0
## - Garage.Area        1    0.2464 17.830 -3930.8
## - Central.Air        1    0.2705 17.854 -3929.5
## - Exter.Cond         3    0.4439 18.027 -3923.8
## - Bedroom.AbvGr      1    0.3762 17.960 -3923.6
## - log(Lot.Area)      1    0.9314 18.515 -3893.2
## - Total.Bsmt.SF      1    1.3714 18.955 -3869.8
## - Year.Built         1    1.5798 19.163 -3858.9
## - Neighborhood      26    2.9813 20.565 -3838.4
## - Overall.Cond       1    3.2019 20.785 -3777.8

summary(ames_aic_final)

## 
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + Year.Built + 
##     Overall.Qual + Overall.Cond + Neighborhood + Garage.Area + 
##     area * Overall.Qual + Exter.Qual + Exter.Cond + Total.Bsmt.SF + 
##     Central.Air + Bedroom.AbvGr, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29116 -0.06326  0.00417  0.07357  0.51288 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.309e+00  9.560e-01   1.370 0.171120    
## log(area)            2.215e-01  9.211e-02   2.405 0.016374 *  
## log(Lot.Area)        9.414e-02  1.324e-02   7.109 2.29e-12 ***
## Year.Built           3.503e-03  3.784e-04   9.258  < 2e-16 ***
## Overall.Qual         1.117e-01  1.526e-02   7.321 5.21e-13 ***
## Overall.Cond         6.649e-02  5.045e-03  13.180  < 2e-16 ***
## NeighborhoodBlueste -3.825e-02  9.089e-02  -0.421 0.673936    
## NeighborhoodBrDale  -1.657e-01  6.253e-02  -2.649 0.008200 ** 
## NeighborhoodBrkSide -8.670e-03  5.380e-02  -0.161 0.872003    
## NeighborhoodClearCr  1.252e-02  6.170e-02   0.203 0.839258    
## NeighborhoodCollgCr -3.641e-02  4.589e-02  -0.793 0.427756    
## NeighborhoodCrawfor  9.937e-02  5.346e-02   1.859 0.063355 .  
## NeighborhoodEdwards -7.683e-02  4.913e-02  -1.564 0.118196    
## NeighborhoodGilbert -4.571e-02  4.877e-02  -0.937 0.348834    
## NeighborhoodGreens   7.850e-02  8.106e-02   0.968 0.333116    
## NeighborhoodGrnHill  3.799e-01  1.059e-01   3.586 0.000353 ***
## NeighborhoodIDOTRR  -1.452e-01  5.520e-02  -2.631 0.008658 ** 
## NeighborhoodMeadowV -1.314e-01  5.763e-02  -2.281 0.022791 *  
## NeighborhoodMitchel -2.697e-02  4.969e-02  -0.543 0.587379    
## NeighborhoodNAmes   -4.080e-02  4.772e-02  -0.855 0.392782    
## NeighborhoodNoRidge  8.043e-02  5.127e-02   1.569 0.117075    
## NeighborhoodNPkVill -3.784e-02  8.096e-02  -0.467 0.640294    
## NeighborhoodNridgHt  8.735e-02  4.730e-02   1.847 0.065109 .  
## NeighborhoodNWAmes  -9.670e-02  5.001e-02  -1.934 0.053441 .  
## NeighborhoodOldTown -9.125e-02  5.354e-02  -1.704 0.088617 .  
## NeighborhoodSawyer  -2.371e-02  4.988e-02  -0.475 0.634705    
## NeighborhoodSawyerW -8.574e-02  4.813e-02  -1.782 0.075136 .  
## NeighborhoodSomerst  2.721e-02  4.477e-02   0.608 0.543510    
## NeighborhoodStoneBr  1.512e-01  5.310e-02   2.848 0.004488 ** 
## NeighborhoodSWISU   -4.258e-03  6.307e-02  -0.068 0.946187    
## NeighborhoodTimber   2.063e-02  5.375e-02   0.384 0.701178    
## NeighborhoodVeenker -2.124e-02  6.262e-02  -0.339 0.734558    
## Garage.Area          1.059e-04  2.898e-05   3.656 0.000270 ***
## area                 3.595e-04  1.095e-04   3.282 0.001069 ** 
## Exter.QualFa        -4.331e-02  5.751e-02  -0.753 0.451594    
## Exter.QualGd        -1.007e-01  2.924e-02  -3.445 0.000597 ***
## Exter.QualTA        -1.228e-01  3.376e-02  -3.638 0.000290 ***
## Exter.CondFa        -1.410e-01  7.922e-02  -1.780 0.075454 .  
## Exter.CondGd         3.591e-03  7.034e-02   0.051 0.959297    
## Exter.CondTA         2.848e-02  6.963e-02   0.409 0.682646    
## Total.Bsmt.SF        1.272e-04  1.475e-05   8.626  < 2e-16 ***
## Central.AirY         8.905e-02  2.325e-02   3.831 0.000136 ***
## Bedroom.AbvGr       -3.416e-02  7.561e-03  -4.518 7.03e-06 ***
## Overall.Qual:area   -3.191e-05  9.126e-06  -3.496 0.000494 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1358 on 954 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.9002, Adjusted R-squared:  0.8957 
## F-statistic: 200.1 on 43 and 954 DF,  p-value: < 2.2e-16

As shown above, AIC is minimum when none of the variables are removed making this model the optimum model for predicting the price of an house.

Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.

Here the test data set is filtered by removing the factor ‘Po’ from the test data set under the column exterior condition as it is not present in the training data set.

ames_test = ames_test %>% filter(Exter.Cond!='Po')

rmse_finin = sqrt(mean((exp(predict(ames_aic_final,ames_train)) - ames_train$price)**2, na.rm = TRUE))
rmse_finin

## [1] 26969.39

rmse_finout = sqrt(mean((exp(predict(ames_aic_final,ames_test)) - ames_test$price)**2, na.rm = TRUE))
rmse_finout

## [1] 22616.27

Let us compare the in-sample and out-of-sample error for both the initial and final model.

tab = matrix(c(30557,25409,26969,22616),ncol = 2)
colnames(tab) = c('Initial', 'Final')
rownames(tab) = c('In-sample', 'Out-of-sample')
tab

##               Initial Final
## In-sample       30557 26969
## Out-of-sample   25409 22616

From the table above it can be observed that the in-sample root mean square error decreases from the initial to the final model and the out-of-sample root mean square error also decreases from the initial to the final model. Moreover, the RMSE also decreases from the in-sample to the out-sample data in the final model. Hence the final model constructed is the optimal solution.

Part 4 Final Model Assessment

Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.

The residual plots are plotted as shown below. Two plots are majorly discussed - Residuals vs Fitted and the Q-Q plot

plot(ames_aic_final,which = 1,sub.caption = "")

plot(ames_aic_final,which = 2,sub.caption = "")

The residuals are scattered randomly around the linear line which confirms the homoscedasticity condition of linear regression. This plot is similar to the initial model’s residual plot with more randomness in the plot.
The Q-Q plot confirms that the residuals follow a normal distribution with certain deviations in the beginning and the end of the line.

Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.

This is the same analysis as calculated in Section 3. The final RMSE for the in-sample and the out-of-sample data are calculated and tabulated in a matrix.

rmse_finin = sqrt(mean((exp(predict(ames_aic_final,ames_train)) - ames_train$price)**2, na.rm = TRUE))

rmse_finout = sqrt(mean((exp(predict(ames_aic_final,ames_test)) - ames_test$price)**2, na.rm = TRUE))

Let us compare the in-sample and out-of-sample error for both the initial and final model.

tab = matrix(c(rmse_inin,rmse_inout,rmse_finin,rmse_finout),ncol = 2)
colnames(tab) = c('Initial', 'Final')
rownames(tab) = c('In-sample', 'Out-of-sample')
tab

##                Initial    Final
## In-sample     30557.52 26969.39
## Out-of-sample 25409.34 22616.27

Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?

Strength:

The major strength of the model is that out of 81 variables we have achieved an adjusted R-squared = 0.90 approximately with just 13 input variables [12 if variable interaction is excluded]. This shows that 90% of the output, which is the price of the house, can be explained by the model.
Secondly, the root mean square error in the initial and final model decrease from the in-sample to the out-of-sample data which shows that the model has not overfit the training set and can be used to predict the outcome for a new data point.

Weakness:

Majority of the variables remain unexplored in the model. Approximately 20 variables have been explored and studied out of the total 81 variables. There are good chances of the predictive power improving by increasing the number of variables in the model.
The model takes a frequentist approach and not a bayesian approach which might have improved the model.

Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

rmse_valid = sqrt(mean((exp(predict(ames_aic_final,ames_validation)) - ames_validation$price)**2, na.rm = TRUE))
rmse_valid

## [1] 22201.46

The final root mean square error on the validation data set is 22,201 dollars.

tab_final = matrix(c(rmse_finin,rmse_finout,rmse_valid),ncol = 3,nrow = 1)
colnames(tab_final) = c('Train', 'Test', 'Validation')
rownames(tab_final) = c('RMSE')
tab_final

##         Train     Test Validation
## RMSE 26969.39 22616.27   22201.46

We can clearly see a decrease in the RMSE value from the training set to the test set and finally to the validation set. This shows that the model has not overfit the training set and has a good predictive power for the house prices.

pred_valid = pred_valid = predict(ames_aic_final,ames_validation)
cov = (sum(exp(quantile(pred_valid,c(0.025,0.975)))[1]<ames_validation$price & exp(quantile(pred_valid,c(0.025,0.975)))[2]>ames_validation$price))/nrow(ames_validation)
# Coverage probability
cov*100

## [1] 95.01966

95 percentage of the 95% predictive confidence intervals contain the true price of the house in the validation data set. The model reflects the uncertainty in a proper way as the percentage of true values that do not fall in the interval is less than 5%.

Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.

The AIC model selection method used to identify the significant variables helped in building a parsimonious model to predict the price of the houses. The results achieved can be used by the investment firm to understand the trend in the prices of the various houses. The main reason for to trust the model is that the RMSE has decreased from the training set to the test set and finally the validation set. This shows that the model has not overfit the data and it can be used to, more or less, accurately estimate the price of the houses. Furthermore, this belief is reinforced by the fact that the coverage probability is more than 95% making it a trustworthy model.

An interesting analysis was transforming the price and area variable using a logarithmic transformation to make it a normal distribution. This form of analysis has been done rarely been applied before and this can further help in understanding on how to mutate other variables. The models can further be improved by including many more variables but since we have used an AIC method, there will be a penalty on the assessment criteria. This does not mean that the R-Squared will decrease as it is a major parameter influencing the predictive power of the model. So further studies can include many more variables to increase the R-Squared to get a better prediction.

Peer Assessment II

Background