Predictions of Ames Housing Prices

Aug 28, 2018

A project that required data cleaning, feature engineering, exploratory data analysis, building a model, and using regression techniques to accurately predict the sale prices in Ames, IA.

The Objective

During my time at General Assembly’s Data Immersive Program, we completed projects to apply and demonstrate what we learned. The goal of the project I will be talking about today was to predict the sale prices of homes in Ames using a data set taken from Kaggle. The project challenged us to think about how we preprocess the data so that we can build the best model for predicting housing prices.

Data Preprocessing

I first loaded my data, which was split into a training set and a test set. After checking the data, I found a few things that needed to be done before I could fit the features into a model. For categorical columns, I ranked the categories by their mean sale price (e.g. ranking each neighborhood by its mean sale price). In my opinion, this made the most sense for finding any relationship between the categorical features and sale price. With everything in numerical form, I made sure that outliers were taken care of and missing values were filled in. I also wanted to create some features to further capture certain aspects of the house that I thought would impact sale prices, for example total square footage of the house, total number of bathrooms, or overall condition and quality of the house. Once all this was taken care of, I was ready to analyze my data.
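The snippet below is a minimal sketch of this kind of preprocessing, not the exact code from the project. The column names, the tiny sample table, the helper `rank_by_mean_price`, the median fill, and the half-bath weighting are all assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample; the column names are illustrative assumptions.
train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C"],
    "SalePrice": [100_000, 120_000, 200_000, 220_000, 150_000],
    "1st Flr SF": [800, 900, 1200, 1300, 1000],
    "2nd Flr SF": [0, 400, 600, 700, 0],
    "Total Bsmt SF": [800, None, 1200, 1300, 500],
    "Full Bath": [1, 1, 2, 2, 1],
    "Half Bath": [0, 1, 1, 1, 0],
})


def rank_by_mean_price(df, column, target="SalePrice"):
    """Map each category to its rank by mean target (1 = lowest mean)."""
    means = df.groupby(column)[target].mean()
    return means.rank(method="dense").astype(int).to_dict()


# Learn the ranking from the training set only, so it can be reused on the test set.
neighborhood_rank = rank_by_mean_price(train, "Neighborhood")
train["Neighborhood Rank"] = train["Neighborhood"].map(neighborhood_rank)

# Fill missing numeric values with the column median (one simple choice among many).
train["Total Bsmt SF"] = train["Total Bsmt SF"].fillna(train["Total Bsmt SF"].median())

# Engineered features: total square footage and total bathrooms.
train["Total SF"] = train["1st Flr SF"] + train["2nd Flr SF"] + train["Total Bsmt SF"]
train["Total Baths"] = train["Full Bath"] + 0.5 * train["Half Bath"]

print(train[["Neighborhood", "Neighborhood Rank", "Total SF", "Total Baths"]])
```

Learning the ranking on the training set and then mapping it onto the test set keeps the test sale prices out of the features.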

Exploratory Data Analysis

There were a lot of features to go through, so I started big and worked my way down to a smaller set. Below you will see the correlation with sale price for each feature that I found useful to include in my model.

Variable correlations relative to Sale Price

After tinkering with which features to include in my model, I was able to narrow my variables down to the ones below. The features with the highest correlation were total sqft, overall quality, neighborhood rank, ground living area and total baths.

Top correlating features with Sale Price
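A ranking like this can be sketched with pandas. The data frame, the column names, and the 0.5 cutoff below are hypothetical, chosen only to show the idea; they are not the values used in the project.

```python
import pandas as pd

# Made-up numbers; column names are illustrative assumptions.
df = pd.DataFrame({
    "Total SF": [1600, 2300, 3000, 3300, 1500],
    "Overall Qual": [5, 6, 8, 9, 5],
    "Yr Sold": [2008, 2010, 2007, 2009, 2006],
    "SalePrice": [100_000, 120_000, 200_000, 220_000, 150_000],
})

# Pearson correlation of every feature with the target.
corr_with_price = df.corr()["SalePrice"].drop("SalePrice")

# Order by absolute strength so strong negative correlations are not missed.
order = corr_with_price.abs().sort_values(ascending=False).index
corr_with_price = corr_with_price[order]

# Keep only features above an arbitrary, hypothetical cutoff.
top_features = corr_with_price[corr_with_price.abs() > 0.5].index.tolist()

print(corr_with_price.round(3))
print(top_features)
```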

Here are some scatterplots to depict the relationship between sale price and some of the top variables in the data set:

I plotted neighborhood ranking against overall quality above because houses within each neighborhood tend to have their own intrinsic value; however, I felt that they also tend to share characteristics with each other, like quality.

Results of my Model

As mentioned above, I worked my way from a larger set of variables to a smaller one, fitting each set with different models until my R-squared scores got worse. I tested each set of variables with three different regression models and compared the scores on my training set and my test set to gauge the variance of my model.

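Below are the R-squared scores from my best set of features. Ridge scored highest on the training set, but on the test set the three models were nearly identical, with elastic net regression fitting the data marginally better than the others.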

Train Set

Ridge Regression: 0.87840
Lasso Regression: 0.86745
Elastic Net Regression: 0.86796

Test Set

Ridge Regression: 0.86914
Lasso Regression: 0.86912
Elastic Net Regression: 0.86923

To visualize this, I plotted the observed sale prices against the predicted sale prices from my elastic net regression model.

Elastic net regression model: observed vs predicted sale price

We can see that my predicted values closely match the observed sale prices, with the exception of the highest prices out on the right. Great! But what is driving these predictions? To get a better idea of which features are influencing the sale prices, I want to interpret the beta coefficients of each variable.

Highest Beta (β) Coefficients

Ground Living Area: 15,701
Overall Quality: 12,885
Basement Finish Sq Ft: 10,966
Neighborhood: 10,393

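A beta coefficient of almost 16,000 means that each one-unit increase in ground living area is associated with an increase of almost $16,000 in sale price, holding the other features constant. If the features were standardized before fitting, as is common with regularized regression, that unit is one standard deviation rather than one square foot.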
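How to read a coefficient depends on the units the feature had when the model was fit. The sketch below uses made-up numbers and a plain one-variable least-squares fit (not the regularized models above) to show that the same relationship produces a small coefficient per square foot and a much larger one per standard deviation.

```python
import numpy as np

# Made-up data: price rises by about $30 per square foot, plus noise.
rng = np.random.default_rng(0)
living_area = rng.normal(1500, 500, size=500)  # square feet
price = 50_000 + 30 * living_area + rng.normal(0, 10_000, size=500)

# Fit on the raw feature: the slope is in dollars per square foot.
beta_per_sqft = np.polyfit(living_area, price, 1)[0]

# Fit on the standardized feature: the slope is in dollars per standard deviation.
z = (living_area - living_area.mean()) / living_area.std()
beta_per_sd = np.polyfit(z, price, 1)[0]

print(f"per square foot:        {beta_per_sqft:,.2f}")
print(f"per standard deviation: {beta_per_sd:,.2f}")

# The two slopes differ exactly by the feature's standard deviation.
print(np.isclose(beta_per_sd, beta_per_sqft * living_area.std()))
```

Dividing a standardized coefficient by the feature's standard deviation converts it back to a per-unit effect, which makes coefficients easier to compare with everyday intuition about prices.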

Conclusion

To sum it up, even though certain features have a higher correlation with sale price, it doesn’t necessarily mean that they have as high an impact. The beta coefficients give us a better idea of which variables we should look at for future predictions of housing prices.

I built the model to the best of my ability, but I believe it can still be improved. One way to improve it would be to find more accurate data with fewer mistakes. I could also collect other features, such as the distance from each house to employment, shopping, retail, and school areas. These are a few of the factors we may want to consider for future modeling.