Ames Housing Data and House Price Prediction

Hi, and welcome to my blog. This is my first real entry, and it walks through a project I did in my data science bootcamp at General Assembly. This post is heavy on code and light on writing, but it will give you a glimpse into how I was thinking early on in the program. So, without further ado…

First we start off by importing anything and everything that might be helpful here.

import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
import warnings
warnings.simplefilter("ignore")

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

Next, we import our data files, first saving the paths as variables.

train_csv = '/Users/evanjacobs/dsi/DSI-US-4/project-2/train.csv'
test_csv = '/Users/evanjacobs/dsi/DSI-US-4/project-2/test.csv'

For now, we’ll read in both files but keep the testing data in its own dataframe, finaltest, so we don’t accidentally alter it.

df = pd.read_csv(train_csv)
finaltest = pd.read_csv(test_csv)

Next, we’re going to do our train-test split, and here we’ll set our y to be our target, ‘SalePrice’.

X = df.drop(['SalePrice'], axis=1)
y = df.SalePrice.values
X_full = df.drop(['SalePrice'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y)
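
One thing the call above leaves open is reproducibility: without a seed, every run shuffles the rows differently, so the scores later in the post will wiggle from run to run. Here is a minimal sketch on a made-up dataframe. The toy values and the seed 42 are my own assumptions, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing data; the values are illustrative only.
toy = pd.DataFrame({
    "Gr Liv Area": np.arange(100, 120),
    "Overall Qual": np.arange(20) % 10 + 1,
    "SalePrice": np.arange(20) * 1000 + 100000,
})

X_toy = toy.drop(columns=["SalePrice"])
y_toy = toy["SalePrice"].values

# random_state pins the shuffle; the default test_size holds out 25% of the rows.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X_toy, y_toy, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr_a.shape, X_te_a.shape)
print(X_tr_a.index.equals(X_tr_b.index))
```

Both calls return the same rows in the same order, which makes it much easier to compare models fairly.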

Let’s have a look, shall we?

X_train.head()
        Id        PID  MS SubClass  MS Zoning  Lot Frontage  Lot Area  Street  Alley  Lot Shape  Land Contour  ...  3Ssn Porch  Screen Porch  Pool Area  Pool QC  Fence  Misc Feature  Misc Val  Mo Sold  Yr Sold  Sale Type
 484  1875  534201040           20         RL          70.0      8050    Pave    NaN        Reg           Lvl  ...           0             0          0      NaN    NaN           NaN         0        3     2007         WD
1234   178  902206040           50         RM          50.0      5500    Pave    NaN        Reg           Lvl  ...           0             0          0      NaN    NaN           NaN         0        4     2010         WD
1917    20  527302110           20         RL          85.0     13175    Pave    NaN        Reg           Lvl  ...           0             0          0      NaN  MnPrv           NaN         0        2     2010         WD
 640  2420  528228280          120         RL          43.0      3087    Pave    NaN        Reg           Lvl  ...           0             0          0      NaN    NaN           NaN         0       11     2006        New
 811  1448  907202160           80         RL           NaN     10970    Pave    NaN        IR1           Low  ...           0             0          0      NaN  MnPrv           NaN         0       10     2008         WD

5 rows × 80 columns

X_train.describe()
                Id           PID  MS SubClass  Lot Frontage       Lot Area  Overall Qual  Overall Cond   Year Built  Year Remod/Add  Mas Vnr Area  ...  Garage Area  Wood Deck SF  Open Porch SF  Enclosed Porch   3Ssn Porch  Screen Porch    Pool Area      Misc Val      Mo Sold      Yr Sold
count  1538.000000  1.538000e+03  1538.000000   1294.000000    1538.000000   1538.000000   1538.000000  1538.000000     1538.000000   1521.000000  ...  1538.000000   1538.000000    1538.000000     1538.000000  1538.000000   1538.000000  1538.000000   1538.000000  1538.000000  1538.000000
mean   1469.118336  7.148299e+08    57.542263     69.540958   10179.084525      6.109883      5.571521  1971.674252     1984.081274     99.113083  ...   471.424577     95.207412      49.256177       23.018856     2.914174     17.200260     3.197659     55.282835     6.195709  2007.784785
std     844.226713  1.887552e+08    43.351837     22.987056    7353.026485      1.405082      1.110848    30.258868       21.200024    174.156041  ...   216.396308    132.411630      69.244398       60.037423    27.776465     59.571394    43.605315    617.362905     2.753136     1.313997
min       1.000000  5.263011e+08    20.000000     21.000000    1300.000000      1.000000      1.000000  1879.000000     1950.000000      0.000000  ...     0.000000      0.000000       0.000000        0.000000     0.000000      0.000000     0.000000      0.000000     1.000000  2006.000000
25%     746.500000  5.284567e+08    20.000000     59.000000    7455.500000      5.000000      5.000000  1953.000000     1964.000000      0.000000  ...   316.250000      0.000000       0.000000        0.000000     0.000000      0.000000     0.000000      0.000000     4.000000  2007.000000
50%    1496.500000  5.354546e+08    50.000000     69.000000    9465.000000      6.000000      5.000000  1975.000000     1993.000000      0.000000  ...   480.000000      0.000000      28.000000        0.000000     0.000000      0.000000     0.000000      0.000000     6.000000  2008.000000
75%    2174.750000  9.071855e+08    70.000000     80.000000   11635.500000      7.000000      6.000000  2001.000000     2004.000000    162.000000  ...   576.000000    168.000000      72.000000        0.000000     0.000000      0.000000     0.000000      0.000000     8.000000  2009.000000
max    2930.000000  9.241520e+08   190.000000    313.000000  159000.000000     10.000000      9.000000  2010.000000     2010.000000   1600.000000  ...  1418.000000   1424.000000     547.000000      432.000000   508.000000    490.000000   800.000000  17000.000000    12.000000  2010.000000

8 rows × 38 columns

X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1538 entries, 484 to 1169
Data columns (total 80 columns):
Id                 1538 non-null int64
PID                1538 non-null int64
MS SubClass        1538 non-null int64
MS Zoning          1538 non-null object
Lot Frontage       1294 non-null float64
Lot Area           1538 non-null int64
Street             1538 non-null object
Alley              110 non-null object
Lot Shape          1538 non-null object
Land Contour       1538 non-null object
Utilities          1538 non-null object
Lot Config         1538 non-null object
Land Slope         1538 non-null object
Neighborhood       1538 non-null object
Condition 1        1538 non-null object
Condition 2        1538 non-null object
Bldg Type          1538 non-null object
House Style        1538 non-null object
Overall Qual       1538 non-null int64
Overall Cond       1538 non-null int64
Year Built         1538 non-null int64
Year Remod/Add     1538 non-null int64
Roof Style         1538 non-null object
Roof Matl          1538 non-null object
Exterior 1st       1538 non-null object
Exterior 2nd       1538 non-null object
Mas Vnr Type       1521 non-null object
Mas Vnr Area       1521 non-null float64
Exter Qual         1538 non-null object
Exter Cond         1538 non-null object
Foundation         1538 non-null object
Bsmt Qual          1501 non-null object
Bsmt Cond          1501 non-null object
Bsmt Exposure      1498 non-null object
BsmtFin Type 1     1501 non-null object
BsmtFin SF 1       1538 non-null float64
BsmtFin Type 2     1500 non-null object
BsmtFin SF 2       1538 non-null float64
Bsmt Unf SF        1538 non-null float64
Total Bsmt SF      1538 non-null float64
Heating            1538 non-null object
Heating QC         1538 non-null object
Central Air        1538 non-null object
Electrical         1538 non-null object
1st Flr SF         1538 non-null int64
2nd Flr SF         1538 non-null int64
Low Qual Fin SF    1538 non-null int64
Gr Liv Area        1538 non-null int64
Bsmt Full Bath     1537 non-null float64
Bsmt Half Bath     1537 non-null float64
Full Bath          1538 non-null int64
Half Bath          1538 non-null int64
Bedroom AbvGr      1538 non-null int64
Kitchen AbvGr      1538 non-null int64
Kitchen Qual       1538 non-null object
TotRms AbvGrd      1538 non-null int64
Functional         1538 non-null object
Fireplaces         1538 non-null int64
Fireplace Qu       798 non-null object
Garage Type        1449 non-null object
Garage Yr Blt      1449 non-null float64
Garage Finish      1449 non-null object
Garage Cars        1538 non-null float64
Garage Area        1538 non-null float64
Garage Qual        1449 non-null object
Garage Cond        1449 non-null object
Paved Drive        1538 non-null object
Wood Deck SF       1538 non-null int64
Open Porch SF      1538 non-null int64
Enclosed Porch     1538 non-null int64
3Ssn Porch         1538 non-null int64
Screen Porch       1538 non-null int64
Pool Area          1538 non-null int64
Pool QC            9 non-null object
Fence              286 non-null object
Misc Feature       50 non-null object
Misc Val           1538 non-null int64
Mo Sold            1538 non-null int64
Yr Sold            1538 non-null int64
Sale Type          1538 non-null object
dtypes: float64(11), int64(27), object(42)
memory usage: 973.3+ KB
X_train.columns
Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual',
       'Garage Cond', 'Paved Drive', 'Wood Deck SF', 'Open Porch SF',
       'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Pool QC',
       'Fence', 'Misc Feature', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type'],
      dtype='object')

Closing up the column names so I can use dot notation.

def colclean(column_list): 
    columns=[]
    for n in column_list:
        n = n.lower().replace(' ','')
        columns.append(n)
    return columns
colclean(df.columns)
X_train.columns = colclean(X_train.columns)
X_test.columns = colclean(X_test.columns)
X_train.columns
X_full.columns = colclean(X_full.columns)
finaltest.columns = colclean(finaltest.columns)
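
For what it’s worth, the same clean-up can probably be done without a helper function, using the string methods that pandas exposes on the column index. This is a sketch on a made-up, empty dataframe rather than the approach I actually used.

```python
import pandas as pd

# Empty toy frame; only the column labels matter here.
toy = pd.DataFrame(columns=["MS SubClass", "Lot Area", "Year Remod/Add", "1st Flr SF"])

# Lower-case every label and strip the spaces in one vectorized pass.
toy.columns = toy.columns.str.lower().str.replace(" ", "", regex=False)

print(list(toy.columns))
```

Note that labels such as yearremod/add and 1stflrsf are still not valid Python identifiers, so they need bracket notation (toy['1stflrsf']) even after cleaning.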

Checking for duplicate PIDs.

X_train.duplicated(subset='pid', keep='first').sum()
0

Got any nulls lying around?

X_train.isnull().sum()
id                  0
pid                 0
mssubclass          0
mszoning            0
lotfrontage       244
lotarea             0
street              0
alley            1428
lotshape            0
landcontour         0
utilities           0
lotconfig           0
landslope           0
neighborhood        0
condition1          0
condition2          0
bldgtype            0
housestyle          0
overallqual         0
overallcond         0
yearbuilt           0
yearremod/add       0
roofstyle           0
roofmatl            0
exterior1st         0
exterior2nd         0
masvnrtype         17
masvnrarea         17
exterqual           0
extercond           0
                 ... 
fullbath            0
halfbath            0
bedroomabvgr        0
kitchenabvgr        0
kitchenqual         0
totrmsabvgrd        0
functional          0
fireplaces          0
fireplacequ       740
garagetype         89
garageyrblt        89
garagefinish       89
garagecars          0
garagearea          0
garagequal         89
garagecond         89
paveddrive          0
wooddecksf          0
openporchsf         0
enclosedporch       0
3ssnporch           0
screenporch         0
poolarea            0
poolqc           1529
fence            1252
miscfeature      1488
miscval             0
mosold              0
yrsold              0
saletype            0
Length: 80, dtype: int64

Here’s a trick I learned: filter the null counts so that only the columns that actually contain nulls show up.

X_train.isna().sum()[X_train.isna().sum() !=0]
lotfrontage      244
alley           1428
masvnrtype        17
masvnrarea        17
bsmtqual          37
bsmtcond          37
bsmtexposure      40
bsmtfintype1      37
bsmtfintype2      38
bsmtfullbath       1
bsmthalfbath       1
fireplacequ      740
garagetype        89
garageyrblt       89
garagefinish      89
garagequal        89
garagecond        89
poolqc          1529
fence           1252
miscfeature     1488
dtype: int64
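
The one-liner above computes the null counts twice. A slightly tidier variant stores them once and then filters. The toy dataframe below is made up purely for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "lotfrontage": [70.0, np.nan, 85.0, np.nan],
    "lotarea": [8050, 5500, 13175, 3087],
    "alley": [np.nan, np.nan, np.nan, "Pave"],
})

# Count the nulls once, then keep only the columns that have any.
null_counts = toy.isna().sum()
has_nulls = null_counts[null_counts > 0].sort_values(ascending=False)

print(has_nulls)
```

Sorting in descending order puts the worst offenders at the top, which is handy with 80 columns.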

Having a look at what one of the object columns with null values looks like.

X_train.poolqc.unique()
array([nan, 'TA', 'Gd', 'Ex', 'Fa'], dtype=object)

Just for the sake of time, I’m going to fill the nulls in each numerical column with that column’s mean. The object columns keep their nulls for now.

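# numeric_only=True limits the means to the numerical columns; recent pandas
# versions raise a TypeError when string columns are included.
X_train = X_train.fillna(X_train.mean(numeric_only=True))
X_test = X_test.fillna(X_test.mean(numeric_only=True))
X_full = X_full.fillna(X_full.mean(numeric_only=True))
finaltest = finaltest.fillna(finaltest.mean(numeric_only=True))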
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1538 entries, 484 to 1169
Data columns (total 80 columns):
id               1538 non-null int64
pid              1538 non-null int64
mssubclass       1538 non-null int64
mszoning         1538 non-null object
lotfrontage      1538 non-null float64
lotarea          1538 non-null int64
street           1538 non-null object
alley            110 non-null object
lotshape         1538 non-null object
landcontour      1538 non-null object
utilities        1538 non-null object
lotconfig        1538 non-null object
landslope        1538 non-null object
neighborhood     1538 non-null object
condition1       1538 non-null object
condition2       1538 non-null object
bldgtype         1538 non-null object
housestyle       1538 non-null object
overallqual      1538 non-null int64
overallcond      1538 non-null int64
yearbuilt        1538 non-null int64
yearremod/add    1538 non-null int64
roofstyle        1538 non-null object
roofmatl         1538 non-null object
exterior1st      1538 non-null object
exterior2nd      1538 non-null object
masvnrtype       1521 non-null object
masvnrarea       1538 non-null float64
exterqual        1538 non-null object
extercond        1538 non-null object
foundation       1538 non-null object
bsmtqual         1501 non-null object
bsmtcond         1501 non-null object
bsmtexposure     1498 non-null object
bsmtfintype1     1501 non-null object
bsmtfinsf1       1538 non-null float64
bsmtfintype2     1500 non-null object
bsmtfinsf2       1538 non-null float64
bsmtunfsf        1538 non-null float64
totalbsmtsf      1538 non-null float64
heating          1538 non-null object
heatingqc        1538 non-null object
centralair       1538 non-null object
electrical       1538 non-null object
1stflrsf         1538 non-null int64
2ndflrsf         1538 non-null int64
lowqualfinsf     1538 non-null int64
grlivarea        1538 non-null int64
bsmtfullbath     1538 non-null float64
bsmthalfbath     1538 non-null float64
fullbath         1538 non-null int64
halfbath         1538 non-null int64
bedroomabvgr     1538 non-null int64
kitchenabvgr     1538 non-null int64
kitchenqual      1538 non-null object
totrmsabvgrd     1538 non-null int64
functional       1538 non-null object
fireplaces       1538 non-null int64
fireplacequ      798 non-null object
garagetype       1449 non-null object
garageyrblt      1538 non-null float64
garagefinish     1449 non-null object
garagecars       1538 non-null float64
garagearea       1538 non-null float64
garagequal       1449 non-null object
garagecond       1449 non-null object
paveddrive       1538 non-null object
wooddecksf       1538 non-null int64
openporchsf      1538 non-null int64
enclosedporch    1538 non-null int64
3ssnporch        1538 non-null int64
screenporch      1538 non-null int64
poolarea         1538 non-null int64
poolqc           9 non-null object
fence            286 non-null object
miscfeature      50 non-null object
miscval          1538 non-null int64
mosold           1538 non-null int64
yrsold           1538 non-null int64
saletype         1538 non-null object
dtypes: float64(11), int64(27), object(42)
memory usage: 973.3+ KB
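
A variation worth considering: above, each dataframe is filled with its own means. A common alternative is to compute the means on the training rows only and reuse them everywhere, so the held-out rows never influence the fill values. Here is a minimal sketch on made-up data (this is not what the project code does):

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({
    "lotfrontage": [60.0, np.nan, 80.0],
    "poolqc": [np.nan, "Gd", np.nan],
})
test_toy = pd.DataFrame({
    "lotfrontage": [np.nan, 100.0],
    "poolqc": ["Ex", np.nan],
})

# Means of the numerical columns, taken from the training rows only.
train_means = train_toy.mean(numeric_only=True)

# fillna matches the Series index against column names,
# so only lotfrontage is filled and poolqc keeps its nulls.
train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)

print(test_filled)
```

As in the info() output above, the object column is left untouched, since there is no mean to fill it with.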

Breaking dataframes into two where one is all num values and one is all obj values so I can look at them more easily.

X_tr_obj = X_train.select_dtypes(exclude=[np.number])
X_tr_num = X_train.select_dtypes(include=[np.number])
X_ts_obj = X_test.select_dtypes(exclude=[np.number])
X_ts_num = X_test.select_dtypes(include=[np.number])
X_full_obj = X_full.select_dtypes(exclude=[np.number])
X_full_num = X_full.select_dtypes(include=[np.number])
finaltest_obj = finaltest.select_dtypes(exclude=[np.number])
finaltest_num = finaltest.select_dtypes(include=[np.number])
X_ts_num.head()
        id        pid  mssubclass  lotfrontage  lotarea  overallqual  overallcond  yearbuilt  yearremod/add  masvnrarea  ...  wooddecksf  openporchsf  enclosedporch  3ssnporch  screenporch  poolarea  poolqc  miscval  mosold  yrsold
 908  2559  534455080          20         80.0     9600            5            6       1961           1990         0.0  ...         144            0            205          0            0         0     NaN        0       6    2006
1619  1947  535375130          50         60.0    10134            5            6       1940           1950         0.0  ...           0           39              0          0            0         0     NaN        0       7    2007
 391    81  531453010          20         81.0     9672            6            5       1984           1985         0.0  ...           0            0              0          0            0         0     NaN        0       5    2010
 861  2573  535151130          90         70.0     7728            5            6       1962           1962       120.0  ...           0           18              0          0            0         0     NaN        0       5    2006
1270  1569  914476080          90         76.0    10260            5            4       1976           1976         0.0  ...           0            0              0          0            0         0     NaN        0      11    2008

5 rows × 39 columns

Checking to make sure it worked.

print(X_train.shape)
print(X_tr_obj.shape)
print(X_tr_num.shape)
(1538, 80)
(1538, 42)
(1538, 38)

Let’s check for potential outliers.

X_tr_num.describe().T
                count          mean           std          min           25%           50%           75%          max
id             1538.0  1.469118e+03  8.442267e+02          1.0  7.465000e+02  1.496500e+03  2.174750e+03       2930.0
pid            1538.0  7.148299e+08  1.887552e+08  526301100.0  5.284567e+08  5.354546e+08  9.071855e+08  924152030.0
mssubclass     1538.0  5.754226e+01  4.335184e+01         20.0  2.000000e+01  5.000000e+01  7.000000e+01        190.0
lotfrontage    1538.0  6.954096e+01  2.108364e+01         21.0  6.000000e+01  6.954096e+01  7.900000e+01        313.0
lotarea        1538.0  1.017908e+04  7.353026e+03       1300.0  7.455500e+03  9.465000e+03  1.163550e+04     159000.0
overallqual    1538.0  6.109883e+00  1.405082e+00          1.0  5.000000e+00  6.000000e+00  7.000000e+00         10.0
overallcond    1538.0  5.571521e+00  1.110848e+00          1.0  5.000000e+00  5.000000e+00  6.000000e+00          9.0
yearbuilt      1538.0  1.971674e+03  3.025887e+01       1879.0  1.953000e+03  1.975000e+03  2.001000e+03       2010.0
yearremod/add  1538.0  1.984081e+03  2.120002e+01       1950.0  1.964000e+03  1.993000e+03  2.004000e+03       2010.0
masvnrarea     1538.0  9.911308e+01  1.731902e+02          0.0  0.000000e+00  0.000000e+00  1.600000e+02       1600.0
bsmtfinsf1     1538.0  4.402523e+02  4.704420e+02          0.0  0.000000e+00  3.610000e+02  7.287500e+02       5644.0
bsmtfinsf2     1538.0  4.791873e+01  1.636097e+02          0.0  0.000000e+00  0.000000e+00  0.000000e+00       1474.0
bsmtunfsf      1538.0  5.709402e+02  4.445010e+02          0.0  2.222500e+02  4.800000e+02  8.150000e+02       2336.0
totalbsmtsf    1538.0  1.059111e+03  4.524990e+02          0.0  7.895000e+02  9.945000e+02  1.324000e+03       6110.0
1stflrsf       1538.0  1.165298e+03  4.039276e+02        334.0  8.782500e+02  1.092500e+03  1.408500e+03       5095.0
2ndflrsf       1538.0  3.328362e+02  4.238517e+02          0.0  0.000000e+00  0.000000e+00  6.995000e+02       1836.0
lowqualfinsf   1538.0  5.667100e+00  5.337538e+01          0.0  0.000000e+00  0.000000e+00  0.000000e+00       1064.0
grlivarea      1538.0  1.503801e+03  5.047836e+02        334.0  1.143000e+03  1.452000e+03  1.724000e+03       5642.0
bsmtfullbath   1538.0  4.307092e-01  5.182866e-01          0.0  0.000000e+00  0.000000e+00  1.000000e+00          2.0
bsmthalfbath   1538.0  6.115810e-02  2.476318e-01          0.0  0.000000e+00  0.000000e+00  0.000000e+00          2.0
fullbath       1538.0  1.583225e+00  5.445938e-01          0.0  1.000000e+00  2.000000e+00  2.000000e+00          4.0
halfbath       1538.0  3.719116e-01  4.993598e-01          0.0  0.000000e+00  0.000000e+00  1.000000e+00          2.0
bedroomabvgr   1538.0  2.843953e+00  8.124554e-01          0.0  2.000000e+00  3.000000e+00  3.000000e+00          8.0
kitchenabvgr   1538.0  1.041612e+00  2.093097e-01          0.0  1.000000e+00  1.000000e+00  1.000000e+00          3.0
totrmsabvgrd   1538.0  6.445384e+00  1.545643e+00          2.0  5.000000e+00  6.000000e+00  7.000000e+00         15.0
fireplaces     1538.0  6.046814e-01  6.481274e-01          0.0  0.000000e+00  1.000000e+00  1.000000e+00          4.0
garageyrblt    1538.0  1.978795e+03  2.501178e+01       1895.0  1.962000e+03  1.978795e+03  2.001000e+03       2207.0
garagecars     1538.0  1.774382e+00  7.672165e-01          0.0  1.000000e+00  2.000000e+00  2.000000e+00          4.0
garagearea     1538.0  4.714246e+02  2.163963e+02          0.0  3.162500e+02  4.800000e+02  5.760000e+02       1418.0
wooddecksf     1538.0  9.520741e+01  1.324116e+02          0.0  0.000000e+00  0.000000e+00  1.680000e+02       1424.0
openporchsf    1538.0  4.925618e+01  6.924440e+01          0.0  0.000000e+00  2.800000e+01  7.200000e+01        547.0
enclosedporch  1538.0  2.301886e+01  6.003742e+01          0.0  0.000000e+00  0.000000e+00  0.000000e+00        432.0
3ssnporch      1538.0  2.914174e+00  2.777646e+01          0.0  0.000000e+00  0.000000e+00  0.000000e+00        508.0
screenporch    1538.0  1.720026e+01  5.957139e+01          0.0  0.000000e+00  0.000000e+00  0.000000e+00        490.0
poolarea       1538.0  3.197659e+00  4.360531e+01          0.0  0.000000e+00  0.000000e+00  0.000000e+00        800.0
miscval        1538.0  5.528283e+01  6.173629e+02          0.0  0.000000e+00  0.000000e+00  0.000000e+00      17000.0
mosold         1538.0  6.195709e+00  2.753136e+00          1.0  4.000000e+00  6.000000e+00  8.000000e+00         12.0
yrsold         1538.0  2.007785e+03  1.313997e+00       2006.0  2.007000e+03  2.008000e+03  2.009000e+03       2010.0

Obviously, some of the maxes are large, but that’s the way real estate works. The one value that does pop out as wrong is the garageyrblt max of 2207, which can’t be a real year; I left it alone for this first model.
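
Impossible years are easy to flag programmatically. Here is a minimal sketch on made-up rows. The rule that a garage can’t be built after the sale year is my own assumption, and the toy values only mimic the garageyrblt max in the table above.

```python
import pandas as pd

toy = pd.DataFrame({
    "garageyrblt": [1961.0, 2207.0, 1984.0, 2006.0],
    "yrsold": [2006, 2007, 2010, 2008],
})

# Assumed sanity rule: a garage cannot be built after the house was sold.
impossible = toy["garageyrblt"] > toy["yrsold"]

print(toy[impossible])
```

What to do with a flagged row (drop it, cap it, or correct it by hand) is a judgment call that depends on the dataset.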

Okay, now that we have separated our dataframe into a numerical one and a categorical one, let’s take a look-see at the numerical correlations.

X_tr_num['sp']=y_train
abs(X_tr_num.corr().sp)
id               0.046963
pid              0.243312
mssubclass       0.104183
lotfrontage      0.320966
lotarea          0.301233
overallqual      0.787963
overallcond      0.094205
yearbuilt        0.560274
yearremod/add    0.531071
masvnrarea       0.503817
bsmtfinsf1       0.423380
bsmtfinsf2       0.010237
bsmtunfsf        0.180959
totalbsmtsf      0.621630
1stflrsf         0.616752
2ndflrsf         0.244477
lowqualfinsf     0.031802
grlivarea        0.695442
bsmtfullbath     0.279659
bsmthalfbath     0.036105
fullbath         0.541253
halfbath         0.296527
bedroomabvgr     0.150723
kitchenabvgr     0.131280
totrmsabvgrd     0.495609
fireplaces       0.467371
garageyrblt      0.499246
garagecars       0.646358
garagearea       0.654564
wooddecksf       0.322195
openporchsf      0.326194
enclosedporch    0.113088
3ssnporch        0.062105
screenporch      0.128890
poolarea         0.026498
miscval          0.013014
mosold           0.026836
yrsold           0.014487
sp               1.000000
Name: sp, dtype: float64

For this first model, we’re going to choose all the columns where the absolute value of the correlation coefficient with SalePrice is greater than or equal to a threshold I pick (0.3 in the code below).

vals = abs(X_tr_num.corr().sp).drop('sp').sort_values(ascending=False)
corr_cols = list(vals[vals >= 0.3].index)

X_tr_mod1 = X_tr_num[corr_cols]
X_ts_mod1 = X_ts_num[corr_cols]
X_full_mod1 = X_full_num[corr_cols]
finaltest_num = finaltest_num[corr_cols]

corr_cols
['overallqual',
 'grlivarea',
 'garagearea',
 'garagecars',
 'totalbsmtsf',
 '1stflrsf',
 'yearbuilt',
 'fullbath',
 'yearremod/add',
 'masvnrarea',
 'garageyrblt',
 'totrmsabvgrd',
 'fireplaces',
 'bsmtfinsf1',
 'openporchsf',
 'wooddecksf',
 'lotfrontage',
 'lotarea']
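
To see the thresholding in isolation, here is a small sketch on synthetic data. The column names, the random seed, and the way the fake target is built are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)

toy = pd.DataFrame({
    "strong": strong,                    # drives the target
    "noise": rng.normal(size=n),         # unrelated to the target
    "sp": 3.0 * strong + rng.normal(scale=0.5, size=n),
})

# Absolute correlation of every feature with the target, target itself dropped.
corr_with_target = toy.corr()["sp"].drop("sp").abs()

# Keep the features at or above the cutoff.
keep = list(corr_with_target[corr_with_target >= 0.3].index)

print(corr_with_target.round(3))
print(keep)
```

Only the column that actually feeds the target survives the cutoff. Keep in mind that this filter looks at each column on its own, so it says nothing about how the surviving columns relate to each other.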

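First, let’s notice that several of these columns should be correlated with each other: garageyrblt with yearbuilt, garagearea with garagecars, and totalbsmtsf, masvnrarea, grlivarea, and 1stflrsf with one another. So let’s make some interaction variables. With degree=2 and interaction_only=False, PolynomialFeatures adds the square of each column as well as every pairwise product.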

from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2, interaction_only=False, 
                         include_bias=True)
pf.fit(X_tr_mod1)
X_tr_mod1 = pf.transform(X_tr_mod1)
X_ts_mod1 = pf.transform(X_ts_mod1)
X_full_mod1 = pf.transform(X_full_mod1)
finaltest_num = pf.transform(finaltest_num)

Okay, now let’s use a standard scaler to put every feature on the same scale.

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_tr_mod1 = ss.fit_transform(X_tr_mod1)
X_ts_mod1 = ss.transform(X_ts_mod1)
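# fit_transform refits the scaler here, so from this line on ss holds
# the mean and variance of the full training file.
X_full_mod1 = ss.fit_transform(X_full_mod1)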
finaltest_num = ss.transform(finaltest_num)
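
The fit/transform distinction is easier to see on a toy example. The numbers below are made up; the point is only that the held-out row is scaled with the statistics learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[1.0], [2.0], [3.0]])
test_toy = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std from these rows
test_scaled = scaler.transform(test_toy)        # reuses them, learns nothing new

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

The training column ends up with mean 0, while the held-out value lands wherever the training mean and standard deviation put it.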

So, for lasso, ridge, and enet, I played with different alpha ranges, different numbers of iterations, and also different correlation thresholds.

Let’s try a lasso!

l_alphas = np.arange(.001, .15, .0025)
lasso_model = LassoCV(alphas=l_alphas, max_iter=2000, cv=5)
# lasso_model = LassoCV(max_iter=10000, cv=5)

model_1 = lasso_model.fit(X_tr_mod1, y_train)

Great, let’s score the lasso.

print(model_1.score(X_ts_mod1, y_test))
0.8775294669615423
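
For context, the number that .score() returns for a scikit-learn regressor is R-squared on the data you pass in. Here is a small sketch on synthetic data (the data, seed, and split are my own assumptions) showing that it matches r2_score:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only the first two columns drive the target.
y_toy = 4.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

lasso_toy = LassoCV(alphas=np.arange(0.001, 0.15, 0.0025), max_iter=2000, cv=5)
lasso_toy.fit(X_toy[:150], y_toy[:150])

# Two routes to the same held-out R-squared.
r2_via_score = lasso_toy.score(X_toy[150:], y_toy[150:])
r2_via_metric = r2_score(y_toy[150:], lasso_toy.predict(X_toy[150:]))

print(lasso_toy.alpha_)
print(r2_via_score, r2_via_metric)
```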

Hey, that’s not a bad score at all! What if we tried ridge?

ridge_alphas = np.logspace(0, 5, 200)

ridge_model = RidgeCV(alphas=ridge_alphas, cv=10)
# ridge_model = RidgeCV(cv=10)
ridge_model.fit(X_tr_mod1, y_train)
RidgeCV(alphas=array([1.00000e+00, 1.05956e+00, ..., 9.43788e+04, 1.00000e+05]),
    cv=10, fit_intercept=True, gcv_mode=None, normalize=False,
    scoring=None, store_cv_values=False)
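# Rebuild a plain Ridge with the alpha that RidgeCV selected.
ridge = Ridge(alpha=ridge_model.alpha_)

# Note: cross_val_score refits the model on folds of the held-out split,
# so these scores come from models trained on X_ts_mod1, not X_tr_mod1.
ridge_scores = cross_val_score(ridge, X_ts_mod1, y_test, cv=15)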

print(ridge_scores)
print(np.mean(ridge_scores))
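
There are two different questions you can ask of the held-out rows, and they give different numbers. The sketch below, on synthetic data with an assumed seed and a shorter alpha grid, contrasts cross-validating inside the held-out split with fitting once on the training rows and scoring once on the held-out rows.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy @ np.array([3.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=300)
X_tr, X_te, y_tr, y_te = X_toy[:225], X_toy[225:], y_toy[:225], y_toy[225:]

# Pick alpha using the training rows only.
picker = RidgeCV(alphas=np.logspace(0, 5, 20), cv=10).fit(X_tr, y_tr)

# Option 1: refit and score within the held-out rows (what cross_val_score does).
cv_r2 = cross_val_score(Ridge(alpha=picker.alpha_), X_te, y_te, cv=15).mean()

# Option 2: fit once on the training rows, score once on the held-out rows.
holdout_r2 = Ridge(alpha=picker.alpha_).fit(X_tr, y_tr).score(X_te, y_te)

print(cv_r2, holdout_r2)
```

Option 2 is the one that tells you how the model trained on the training split generalizes; option 1 measures how well the model family does when trained on the smaller held-out set.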

Hmmm. And last, everyone’s favorite, the elastic net.

l1_ratios = np.linspace(0.01, 1.0, 25)

enet = ElasticNetCV(l1_ratio=l1_ratios, n_alphas=100, cv=10,
                            verbose=0)
# enet = ElasticNetCV(cv=10, verbose=0)
enet.fit(X_tr_mod1, y_train)

print(enet.alpha_)
print(enet.l1_ratio_)

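# Rebuild a plain ElasticNet with the alpha and l1_ratio that ElasticNetCV selected.
enet = ElasticNet(alpha=enet.alpha_, l1_ratio=enet.l1_ratio_)

# Note: this cross-validates within the held-out split, refitting on its folds.
enet_scores = cross_val_score(enet, X_ts_mod1, y_test, cv=10)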

print(enet_scores)
print(np.mean(enet_scores))

It’s basically the same. But I discovered that as I lowered my correlation cutoff, my lasso score stayed largely the same while my ridge and enet scores went up a tiny bit, culminating in an R-squared of 0.89 from elastic net.
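
Since the same three steps (polynomial features, scaling, elastic net) get applied to every dataframe, one option is to bundle them into a scikit-learn pipeline so they can’t get out of sync. This is a sketch on synthetic data, not the code I ran for the project; the data, the seed, and the short l1_ratio list are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 3))
# Target with one linear term and one interaction term.
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 0] * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000),
)

# Every step is fit on the training rows and only applied to the held-out rows.
pipe.fit(X_toy[:150], y_toy[:150])
holdout_r2 = pipe.score(X_toy[150:], y_toy[150:])

best = pipe.named_steps["elasticnetcv"]
print(best.alpha_, best.l1_ratio_, holdout_r2)
```

Raising max_iter, as done here, is also the usual first thing to try when coordinate descent complains about not converging.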

# l_alphas = np.arange(.001, .15, .0025)
# lasso_model_final = LassoCV(alphas=l_alphas, cv=5)
# model_1_final = lasso_model.fit(X_full_mod1, y)

enet = ElasticNetCV(l1_ratio=l1_ratios, n_alphas=100, cv=10,
                            verbose=0)
model_1_final = enet.fit(X_full_mod1, y)

/anaconda3/envs/dsi/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)

This is how I send my model’s predictions to a file.

evansubmission1 = pd.DataFrame(data = model_1_final.predict(finaltest_num), columns = ['SalePrice'], index=finaltest['id'])
evansubmission1.to_csv('./evansubmission1.csv')
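
If you want to check what ends up in the file before writing it to disk, you can send the CSV to an in-memory buffer first. The ids and prices below are made up for illustration.

```python
import io
import pandas as pd

ids = pd.Index([2658, 2718, 2414], name="id")   # made-up ids
preds = [181000.5, 155250.0, 219900.0]          # made-up predictions

submission = pd.DataFrame({"SalePrice": preds}, index=ids)

# Write to memory instead of disk so the output can be inspected.
buffer = io.StringIO()
submission.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines[0])
print(csv_lines[1])
```

The index name becomes the first header column, so it’s worth comparing that header against the sample submission file before uploading.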

Plotting my test predictions vs. my test y for a nice visualization of the efficacy of my model.

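# model_1_final was fit on the full training file, which includes these held-out rows.
predictions = model_1_final.predict(X_ts_mod1)
# Note: this rebinds y, which until now held every SalePrice, to the held-out targets.
y = y_test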

# Plot the model
plt.figure(figsize=(8,8))
plt.scatter(predictions, y, s=30, c='g', marker='*', zorder=10)
plt.xlabel("Predicted Values of Price From My Horrible Model")
plt.ylabel("Actual Values of Price")

plt.plot([0, np.max(y)], [0, np.max(y)], c = 'k')

plt.show()
score = cross_val_score(model_1_final, X_ts_mod1, y_test, cv=10)
print("score: ", score.mean())

[Figure: scatter plot of predicted vs. actual sale prices, with a y = x reference line]

score:  0.8773479828283166
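
R-squared is unitless, so it can help to look at the error in dollars as well. Here is a minimal sketch with made-up prices, using the mean_squared_error import from the top of the post:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted sale prices.
actual = np.array([200000.0, 150000.0, 320000.0, 98000.0])
predicted = np.array([210000.0, 140000.0, 300000.0, 100000.0])

# Root mean squared error is in the same units as the target (dollars).
rmse = np.sqrt(mean_squared_error(actual, predicted))
r2 = r2_score(actual, predicted)

print(round(rmse, 2), round(r2, 4))
```

An RMSE of roughly 12,000 dollars is a lot easier to explain to a home buyer than an R-squared in the high 0.9s.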

And that’s it for the first model! Don’t forget to look at the other two model files by checking out the GitHub repository. I’ve included the data dictionary below, and finally, the presentation I made at the very end.

# evansubmission1 = pd.DataFrame(data = model_1.predict(X_ts_mod1), columns = ['SalePrice'], index=y_test['Id'])
# evansubmission1.to_csv('./evansubmission1.csv')
# There are three files:

# train.csv -- this data contains all of the training data for your model.
# test.csv -- this data contains the test data for your model. You will feed this data into your regression model to make predictions.
# The target variable (SalePrice) is removed from the test set!
# sample_sub_reg.csv -- An example of a correctly formatted submission for this challenge (with a random number provided as predictions for SalePrice). Please ensure that your submission to Kaggle matches this format.
# Codebook / Data Dictionary:

# SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict for this challenge.
# MSSubClass: The building class
#     20 1-STORY 1946 & NEWER ALL STYLES
#     30 1-STORY 1945 & OLDER
#     40 1-STORY W/FINISHED ATTIC ALL AGES
#     45 1-1/2 STORY - UNFINISHED ALL AGES
#     50 1-1/2 STORY FINISHED ALL AGES
#     60 2-STORY 1946 & NEWER
#     70 2-STORY 1945 & OLDER
#     75 2-1/2 STORY ALL AGES
#     80 SPLIT OR MULTI-LEVEL
#     85 SPLIT FOYER
#     90 DUPLEX - ALL STYLES AND AGES
#     120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
#     150 1-1/2 STORY PUD - ALL AGES
#     160 2-STORY PUD - 1946 & NEWER
#     180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
#     190 2 FAMILY CONVERSION - ALL STYLES AND AGES
# MSZoning: Identifies the general zoning classification of the sale.
#     A Agriculture
#     C Commercial
#     FV Floating Village Residential
#     I Industrial
#     RH Residential High Density
#     RL Residential Low Density
#     RP Residential Low Density Park
#     RM Residential Medium Density
# LotFrontage: Linear feet of street connected to property
# LotArea: Lot size in square feet
# Street: Type of road access to property
#     Grvl Gravel
#     Pave Paved
# Alley: Type of alley access to property
#     Grvl Gravel
#     Pave Paved
#     NA No alley access
# LotShape: General shape of property
#     Reg Regular
#     IR1 Slightly irregular
#     IR2 Moderately Irregular
#     IR3 Irregular
# LandContour: Flatness of the property
#     Lvl Near Flat/Level
#     Bnk Banked - Quick and significant rise from street grade to building
#     HLS Hillside - Significant slope from side to side
#     Low Depression
# Utilities: Type of utilities available
#     AllPub All public Utilities (E,G,W,& S)
#     NoSewr Electricity, Gas, and Water (Septic Tank)
#     NoSeWa Electricity and Gas Only
#     ELO Electricity only
# LotConfig: Lot configuration
#     Inside Inside lot
#     Corner Corner lot
#     CulDSac Cul-de-sac
#     FR2 Frontage on 2 sides of property
#     FR3 Frontage on 3 sides of property
# LandSlope: Slope of property
#     Gtl Gentle slope
#     Mod Moderate Slope
#     Sev Severe Slope
# Neighborhood: Physical locations within Ames city limits
#     Blmngtn Bloomington Heights
#     Blueste Bluestem
#     BrDale Briardale
#     BrkSide Brookside
#     ClearCr Clear Creek
#     CollgCr College Creek
#     Crawfor Crawford
#     Edwards Edwards
#     Gilbert Gilbert
#     IDOTRR Iowa DOT and Rail Road
#     MeadowV Meadow Village
#     Mitchel Mitchell
#     NAmes North Ames
#     NoRidge Northridge
#     NPkVill Northpark Villa
#     NridgHt Northridge Heights
#     NWAmes Northwest Ames
#     OldTown Old Town
#     SWISU South & West of Iowa State University
#     Sawyer Sawyer
#     SawyerW Sawyer West
#     Somerst Somerset
#     StoneBr Stone Brook
#     Timber Timberland
#     Veenker Veenker
# Condition1: Proximity to main road or railroad
#     Artery Adjacent to arterial street
#     Feedr Adjacent to feeder street
#     Norm Normal
#     RRNn Within 200' of North-South Railroad
#     RRAn Adjacent to North-South Railroad
#     PosN Near positive off-site feature--park, greenbelt, etc.
#     PosA Adjacent to positive off-site feature
#     RRNe Within 200' of East-West Railroad
#     RRAe Adjacent to East-West Railroad
# Condition2: Proximity to main road or railroad (if a second is present)
#     Artery Adjacent to arterial street
#     Feedr Adjacent to feeder street
#     Norm Normal
#     RRNn Within 200' of North-South Railroad
#     RRAn Adjacent to North-South Railroad
#     PosN Near positive off-site feature--park, greenbelt, etc.
#     PosA Adjacent to positive off-site feature
#     RRNe Within 200' of East-West Railroad
#     RRAe Adjacent to East-West Railroad
# BldgType: Type of dwelling
#     1Fam Single-family Detached
#     2FmCon Two-family Conversion; originally built as one-family dwelling
#     Duplx Duplex
#     TwnhsE Townhouse End Unit
#     TwnhsI Townhouse Inside Unit
# HouseStyle: Style of dwelling
#     1Story One story
#     1.5Fin One and one-half story: 2nd level finished
#     1.5Unf One and one-half story: 2nd level unfinished
#     2Story Two story
#     2.5Fin Two and one-half story: 2nd level finished
#     2.5Unf Two and one-half story: 2nd level unfinished
#     SFoyer Split Foyer
#     SLvl Split Level
# OverallQual: Overall material and finish quality
#     10 Very Excellent
#     9 Excellent
#     8 Very Good
#     7 Good
#     6 Above Average
#     5 Average
#     4 Below Average
#     3 Fair
#     2 Poor
#     1 Very Poor
# OverallCond: Overall condition rating
#     10 Very Excellent
#     9 Excellent
#     8 Very Good
#     7 Good
#     6 Above Average
#     5 Average
#     4 Below Average
#     3 Fair
#     2 Poor
#     1 Very Poor
# YearBuilt: Original construction date
# YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
# RoofStyle: Type of roof
#     Flat Flat
#     Gable Gable
#     Gambrel Gambrel (Barn)
#     Hip Hip
#     Mansard Mansard
#     Shed Shed
# RoofMatl: Roof material
#     ClyTile Clay or Tile
#     CompShg Standard (Composite) Shingle
#     Membran Membrane
#     Metal Metal
#     Roll Roll
#     Tar&Grv Gravel & Tar
#     WdShake Wood Shakes
#     WdShngl Wood Shingles
# Exterior1st: Exterior covering on house
#     AsbShng Asbestos Shingles
#     AsphShn Asphalt Shingles
#     BrkComm Brick Common
#     BrkFace Brick Face
#     CBlock Cinder Block
#     CemntBd Cement Board
#     HdBoard Hard Board
#     ImStucc Imitation Stucco
#     MetalSd Metal Siding
#     Other Other
#     Plywood Plywood
#     PreCast PreCast
#     Stone Stone
#     Stucco Stucco
#     VinylSd Vinyl Siding
#     Wd Sdng Wood Siding
#     WdShing Wood Shingles
# Exterior2nd: Exterior covering on house (if more than one material)
#     AsbShng Asbestos Shingles
#     AsphShn Asphalt Shingles
#     BrkComm Brick Common
#     BrkFace Brick Face
#     CBlock Cinder Block
#     CemntBd Cement Board
#     HdBoard Hard Board
#     ImStucc Imitation Stucco
#     MetalSd Metal Siding
#     Other Other
#     Plywood Plywood
#     PreCast PreCast
#     Stone Stone
#     Stucco Stucco
#     VinylSd Vinyl Siding
#     Wd Sdng Wood Siding
#     WdShing Wood Shingles
# MasVnrType: Masonry veneer type
#     BrkCmn Brick Common
#     BrkFace Brick Face
#     CBlock Cinder Block
#     None None
#     Stone Stone
# MasVnrArea: Masonry veneer area in square feet
# ExterQual: Exterior material quality
#     Ex Excellent
#     Gd Good
#     TA Average/Typical
#     Fa Fair
#     Po Poor
# ExterCond: Present condition of the material on the exterior
#     Ex Excellent
#     Gd Good
#     TA Average/Typical
#     Fa Fair
#     Po Poor
# Foundation: Type of foundation
#     BrkTil Brick & Tile
#     CBlock Cinder Block
#     PConc Poured Concrete
#     Slab Slab
#     Stone Stone
#     Wood Wood
# BsmtQual: Height of the basement
#     Ex Excellent (100+ inches)
#     Gd Good (90-99 inches)
#     TA Typical (80-89 inches)
#     Fa Fair (70-79 inches)
#     Po Poor (<70 inches)
#     NA No Basement
# BsmtCond: General condition of the basement
#     Ex Excellent
#     Gd Good
#     TA Typical - slight dampness allowed
#     Fa Fair - dampness or some cracking or settling
#     Po Poor - Severe cracking, settling, or wetness
#     NA No Basement
# BsmtExposure: Walkout or garden level basement walls
#     Gd Good Exposure
#     Av Average Exposure (split levels or foyers typically score average or above)
#     Mn Minimum Exposure
#     No No Exposure
#     NA No Basement
# BsmtFinType1: Quality of basement finished area
#     GLQ Good Living Quarters
#     ALQ Average Living Quarters
#     BLQ Below Average Living Quarters
#     Rec Average Rec Room
#     LwQ Low Quality
#     Unf Unfinished
#     NA No Basement
# BsmtFinSF1: Type 1 finished square feet
# BsmtFinType2: Quality of second finished area (if present)
#     GLQ Good Living Quarters
#     ALQ Average Living Quarters
#     BLQ Below Average Living Quarters
#     Rec Average Rec Room
#     LwQ Low Quality
#     Unf Unfinished
#     NA No Basement
# BsmtFinSF2: Type 2 finished square feet
# BsmtUnfSF: Unfinished square feet of basement area
# TotalBsmtSF: Total square feet of basement area
# Heating: Type of heating
#     Floor Floor Furnace
#     GasA Gas forced warm air furnace
#     GasW Gas hot water or steam heat
#     Grav Gravity furnace
#     OthW Hot water or steam heat other than gas
#     Wall Wall furnace
# HeatingQC: Heating quality and condition
#     Ex Excellent
#     Gd Good
#     TA Average/Typical
#     Fa Fair
#     Po Poor
# CentralAir: Central air conditioning
#     N No
#     Y Yes
# Electrical: Electrical system
#     SBrkr Standard Circuit Breakers & Romex
#     FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
#     FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
#     FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
#     Mix Mixed
# 1stFlrSF: First Floor square feet
# 2ndFlrSF: Second floor square feet
# LowQualFinSF: Low quality finished square feet (all floors)
# GrLivArea: Above grade (ground) living area square feet
# BsmtFullBath: Basement full bathrooms
# BsmtHalfBath: Basement half bathrooms
# FullBath: Full bathrooms above grade
# HalfBath: Half baths above grade
# Bedroom: Number of bedrooms above basement level
# Kitchen: Number of kitchens
# KitchenQual: Kitchen quality
#     Ex Excellent
#     Gd Good
#     TA Typical/Average
#     Fa Fair
#     Po Poor
# TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
# Functional: Home functionality rating
#     Typ Typical Functionality
#     Min1 Minor Deductions 1
#     Min2 Minor Deductions 2
#     Mod Moderate Deductions
#     Maj1 Major Deductions 1
#     Maj2 Major Deductions 2
#     Sev Severely Damaged
#     Sal Salvage only
# Fireplaces: Number of fireplaces
# FireplaceQu: Fireplace quality
#     Ex Excellent - Exceptional Masonry Fireplace
#     Gd Good - Masonry Fireplace in main level
#     TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
#     Fa Fair - Prefabricated Fireplace in basement
#     Po Poor - Ben Franklin Stove
#     NA No Fireplace
# GarageType: Garage location
#     2Types More than one type of garage
#     Attchd Attached to home
#     Basment Basement Garage
#     BuiltIn Built-In (Garage part of house - typically has room above garage)
#     CarPort Car Port
#     Detchd Detached from home
#     NA No Garage
# GarageYrBlt: Year garage was built
# GarageFinish: Interior finish of the garage
#     Fin Finished
#     RFn Rough Finished
#     Unf Unfinished
#     NA No Garage
# GarageCars: Size of garage in car capacity
# GarageArea: Size of garage in square feet
# GarageQual: Garage quality
#     Ex Excellent
#     Gd Good
#     TA Typical/Average
#     Fa Fair
#     Po Poor
#     NA No Garage
# GarageCond: Garage condition
#     Ex Excellent
#     Gd Good
#     TA Typical/Average
#     Fa Fair
#     Po Poor
#     NA No Garage
# PavedDrive: Paved driveway
#     Y Paved
#     P Partial Pavement
#     N Dirt/Gravel
# WoodDeckSF: Wood deck area in square feet
# OpenPorchSF: Open porch area in square feet
# EnclosedPorch: Enclosed porch area in square feet
# 3SsnPorch: Three season porch area in square feet
# ScreenPorch: Screen porch area in square feet
# PoolArea: Pool area in square feet
# PoolQC: Pool quality
#     Ex Excellent
#     Gd Good
#     TA Average/Typical
#     Fa Fair
#     NA No Pool
# Fence: Fence quality
#     GdPrv Good Privacy
#     MnPrv Minimum Privacy
#     GdWo Good Wood
#     MnWw Minimum Wood/Wire
#     NA No Fence
# MiscFeature: Miscellaneous feature not covered in other categories
#     Elev Elevator
#     Gar2 2nd Garage (if not described in garage section)
#     Othr Other
#     Shed Shed (over 100 SF)
#     TenC Tennis Court
#     NA None
# MiscVal: $Value of miscellaneous feature
# MoSold: Month Sold
# YrSold: Year Sold
# SaleType: Type of sale
#     WD Warranty Deed - Conventional
#     CWD Warranty Deed - Cash
#     VWD Warranty Deed - VA Loan
#     New Home just constructed and sold
#     COD Court Officer Deed/Estate
#     Con Contract 15% Down payment regular terms
#     ConLw Contract Low Down payment and low interest
#     ConLI Contract Low Interest
#     ConLD Contract Low Down
#     Oth Other
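
One thing the codebook makes easy to miss: for columns like Alley, PoolQC, or GarageQual, "NA" is a real level meaning the house has no such feature, yet pandas reads the literal string "NA" as NaN by default. Below is a minimal sketch of how the shared quality scale (Ex, Gd, TA, Fa, Po, NA) could be turned into an ordered number. The names `QUALITY_SCALE` and `encode_quality` are made up for illustration and are not part of my original notebook, and the sample values are invented rather than taken from the data.

```python
import math

# Ordered quality scale shared by ExterQual, KitchenQual, GarageQual, etc.
# in the codebook above; 0 is reserved for "NA" (feature not present).
QUALITY_SCALE = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_quality(code):
    """Map a codebook quality code to an integer rank.

    A missing value (None or float NaN) is treated as "NA", i.e. "no such
    feature". This is an assumption: it is only safe for columns where the
    codebook lists NA as a legitimate level.
    """
    if code is None or (isinstance(code, float) and math.isnan(code)):
        code = "NA"
    return QUALITY_SCALE[code]


# Hypothetical sample values, not rows from the actual data set.
sample = ["Gd", "TA", float("nan"), "Ex", "Po"]
encoded = [encode_quality(c) for c in sample]
print(encoded)
```

On a DataFrame, the same function could be applied with something like `df['Pool QC'].map(encode_quality)`. Note that the column headers in the loaded data contain spaces ('Pool QC', 'Misc Feature'), so the names in the codebook may need adjusting before they match.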

Finally, here’s my presentation. I put it at the end because I delivered it with an eye for light-heartedness and comedy, since I was presenting it only to my fellow students. I was originally going to change it for the blog to make it more professional, but I enjoyed too many of the jokes I had written, and it’s somewhat indicative of my personality. However, as you can see from my other projects, I am quite capable of giving a fully professional presentation. Anyway, enjoy!

Written on May 19, 2017