Saturday, November 19, 2022

Predicting the house sale price using Multiple Linear Regression

Home buyers often worry about overpaying for the house of their choice. It is natural for buyers to seek a second opinion on price, or to check that the asking price falls within a reasonable range for the locality. Potential sellers, on the other hand, want to understand which features (or amenities) influence the house price the most. This study takes historic house sale prices for King County from 2014 to 2015 and builds a regression model that helps users understand which features affect the price the most and that predicts the price with reasonable accuracy and confidence.

Our project will incorporate the following topics:

  • Variable manipulation
  • Multiple Linear regression
  • Residual diagnostics
  • Data analysis and interpretation
  • Model Building
  • Model Evaluation

Exploring the variation of house pricing in King County, Washington


Link to Full Code :


Description of Dataset Variables

This dataset has 21,613 observations and 21 variables. These variables consist of a numeric response variable of interest and other numeric, categorical, and continuous predictor variables. Some of the more important variables are the following:

Variable         Description
Price            Price of each house sold
Bedrooms         Number of bedrooms
Bathrooms        Number of bathrooms
Sqft_living      Square footage of the apartment's interior living space
Sqft_lot         Square footage of the land space
Floors           Number of floors
Waterfront       Whether the apartment overlooks the waterfront
Condition        Condition of the property
Building Grade   Rating of the quality of the construction and design
Sqft_above       Square footage of the interior housing space that is above ground level
Sqft_basement    Square footage of the interior housing space that is below ground level
Zipcode          Zipcode of the locality

Dataset Source

The data was provided by King County, Washington. It includes details of houses sold in King County from May 2014 to May 2015.

As stated above, our goal is to build a model that predicts house prices with reasonable accuracy. Along the way, we discuss the models chosen as initial contenders based on different metrics, and explain why we believe the final model is best suited to the problem.


Data Preparation

Read the file.
The following derived features are added to the data:
- age : Age of the house, derived from the year built.
- renovated : Derived from the year renovated.
- distance : Distance from the house to the city center.
- cluster : There are over 100 zipcodes. Converted them to 20 clusters so that they can be treated as factors.

Dummy variables: waterfront and renovated.
Convert the following fields to factors: condition, waterfront, view, cluster, renovated.
Split into test and train sets.

### Data Preparation and Cleaning ##
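# Load packages: readr provides read_csv(); geosphere provides distm() and distHaversine()
library(readr)
library(geosphere)

# Read the File
orig_house = read_csv("kc_house_data.csv")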
house = orig_house 

## Additional Features
house$age = abs(2015 - house$yr_built)  ## Derive Age from year built
house$renovated = ifelse(house$yr_renovated > 0, 1, 0) ### Renovation

### Derive distance to City Center from Lat and Long
city_cent_lat = 47.6062
city_cent_lon = -122.3321
get_dist = function(lat,lon)
   distm (c(city_cent_lon, city_cent_lat), c(lon,lat), fun = distHaversine)
house$distance = mapply(get_dist,house$lat,house$long)

## Zip Code Cluster Based on Latitude Longitude
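## k-means on scaled columns 18, 19, 17 (lat, long, zipcode): 18 centers, up to 100 iterations
cluster20 <- kmeans(scale(house[,c(18,19,17)]), centers = 18, iter.max = 100)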
house$cluster = cluster20$cluster

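## Convert into Factors (fields as listed in the preparation notes above)
factor_cols = c("condition", "waterfront", "view", "cluster", "renovated")
house[factor_cols] = lapply(house[factor_cols], factor)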

########### Remove redundant variables
house = subset(house, select = -c(id, date,yr_built,yr_renovated ))
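
The distance feature is a great-circle (Haversine) distance from each house to the city center. As a rough illustration of what that calculation does, here is a minimal base-R sketch. It assumes a spherical Earth with radius 6378137 m, the default radius used by geosphere's distHaversine. The helper name haversine_m and the second coordinate pair are hypothetical and are not part of the original code.

```r
# Great-circle distance in metres between (lat1, lon1) and (lat2, lon2), all in degrees.
# Assumption: spherical Earth with radius r = 6378137 m.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

city_cent_lat <- 47.6062
city_cent_lon <- -122.3321

# Hypothetical house coordinates; the first one is the city center itself
lat <- c(47.6062, 47.5112)
lon <- c(-122.3321, -122.2570)

# The function is vectorised, so no mapply() is needed
dist_m <- haversine_m(city_cent_lat, city_cent_lon, lat, lon)
print(dist_m)
```

Because the helper is vectorised over its arguments, it could be applied to whole latitude and longitude columns in one call.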

Split to Test and Train

The data is split into train and test sets using a 70:30 ratio.

## [1] "Train Size :  15117"
## [1] "Test Size :  6480"
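
The code for the split is not shown here, so the following is only a sketch of one common way to do a 70:30 random split in base R. The seed, the synthetic house_demo data frame, and the variable names train_idx, train and test are assumptions made for illustration.

```r
set.seed(42)  # assumed seed, only so the example is reproducible

# Small synthetic stand-in for the prepared house data
house_demo <- data.frame(
  price       = runif(1000, 1e5, 1e6),
  sqft_living = runif(1000, 500, 5000)
)

n <- nrow(house_demo)

# Draw 70% of the row indices at random, without replacement
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train <- house_demo[train_idx, ]
test  <- house_demo[-train_idx, ]   # the remaining 30%

print(paste("Train Size : ", nrow(train)))
print(paste("Test Size : ", nrow(test)))
```

Sampling indices rather than rows keeps the two sets disjoint, so every observation lands in exactly one of them.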

Data Exploration

Pair plots to understand numeric dependency

We performed a quick visual check of correlation between the predictors using the pairs() function, which plots scatterplots for every pair of variables in the dataset. Several predictors are correlated with one another: sqft_living vs bedrooms, sqft_living vs bathrooms, sqft_living vs sqft_above, and sqft_living vs grade.
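
A visual check like this can be backed by a numeric one. The sketch below uses synthetic data, not the King County file, and pairs the scatterplot matrix with a correlation matrix, then lists the variable pairs above a cutoff. The seed, the demo data frame, and the 0.7 cutoff are all assumptions chosen for illustration.

```r
set.seed(1)  # assumed seed for the synthetic data

# Synthetic stand-in: living area drives bathrooms and above-ground area,
# while floors is generated independently
n <- 500
sqft_living <- runif(n, 500, 5000)
demo <- data.frame(
  sqft_living = sqft_living,
  bathrooms   = sqft_living / 1000 + rnorm(n, sd = 0.5),
  sqft_above  = 0.8 * sqft_living + rnorm(n, sd = 300),
  floors      = sample(1:3, n, replace = TRUE)
)

# Pairwise Pearson correlations
corr <- cor(demo)
print(round(corr, 2))

# Keep each pair once (upper triangle) when |r| exceeds the assumed cutoff of 0.7
high <- which(abs(corr) > 0.7 & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(high_pairs)

# Visual counterpart: scatterplots for every pair of variables
pairs(demo)
```

Strongly correlated predictors matter for multiple linear regression because they inflate the variance of the coefficient estimates, which is why a check of this kind precedes model building.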