Ames Housing Data and House Price Prediction

Hi and welcome to my blog. This is my first real entry, and it covers a project I did in my data science bootcamp at General Assembly: predicting house prices from the Ames housing data. This post will be heavy on code and light on writing, but it will give you a glimpse into how I was thinking early on in the program. So, without further ado…

First we start off by importing anything and everything that might be helpful here.

import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
import warnings


%config InlineBackend.figure_format = 'retina'
%matplotlib inline

Next, we import our data files, first saving the file paths as variables.

train_csv = '/Users/evanjacobs/dsi/DSI-US-4/project-2/train.csv'
test_csv = '/Users/evanjacobs/dsi/DSI-US-4/project-2/test.csv'

We’ll load both files here, but for now we’ll only work with the training data, so we don’t accidentally alter the precious testing data.

df = pd.read_csv(train_csv)
finaltest = pd.read_csv(test_csv)

First, we’re going to do our train/test split, and here we’ll set our y to be our target, ‘SalePrice’.

X = df.drop(['SalePrice'], axis=1)
y = df.SalePrice.values
X_full = df.drop(['SalePrice'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y)
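
One thing worth knowing about the split above: called with no extra arguments, train_test_split holds out 25% of the rows and shuffles differently on every run. Here is a minimal sketch on a made-up toy frame showing how to make the split repeatable. The toy data and the random_state value are my own assumptions for illustration, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing data (hypothetical values).
toy = pd.DataFrame({
    "Lot Area": np.arange(100, 120),
    "SalePrice": np.arange(20) * 1000,
})
X_toy = toy.drop(["SalePrice"], axis=1)
y_toy = toy["SalePrice"].values

# random_state pins the shuffle, so reruns select the same rows.
split_a = train_test_split(X_toy, y_toy, random_state=42)
split_b = train_test_split(X_toy, y_toy, random_state=42)

X_tr_toy, X_ts_toy, y_tr_toy, y_ts_toy = split_a
print(X_tr_toy.shape, X_ts_toy.shape)  # default test_size is 0.25
```

Without a fixed random_state, every score further down would shift a little from run to run.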

Let’s have a look, shall we?

Id PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... 3Ssn Porch Screen Porch Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type
484 1875 534201040 20 RL 70.0 8050 Pave NaN Reg Lvl ... 0 0 0 NaN NaN NaN 0 3 2007 WD
1234 178 902206040 50 RM 50.0 5500 Pave NaN Reg Lvl ... 0 0 0 NaN NaN NaN 0 4 2010 WD
1917 20 527302110 20 RL 85.0 13175 Pave NaN Reg Lvl ... 0 0 0 NaN MnPrv NaN 0 2 2010 WD
640 2420 528228280 120 RL 43.0 3087 Pave NaN Reg Lvl ... 0 0 0 NaN NaN NaN 0 11 2006 New
811 1448 907202160 80 RL NaN 10970 Pave NaN IR1 Low ... 0 0 0 NaN MnPrv NaN 0 10 2008 WD

5 rows × 80 columns

Id PID MS SubClass Lot Frontage Lot Area Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area ... Garage Area Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold
count 1538.000000 1.538000e+03 1538.000000 1294.000000 1538.000000 1538.000000 1538.000000 1538.000000 1538.000000 1521.000000 ... 1538.000000 1538.000000 1538.000000 1538.000000 1538.000000 1538.000000 1538.000000 1538.000000 1538.000000 1538.000000
mean 1469.118336 7.148299e+08 57.542263 69.540958 10179.084525 6.109883 5.571521 1971.674252 1984.081274 99.113083 ... 471.424577 95.207412 49.256177 23.018856 2.914174 17.200260 3.197659 55.282835 6.195709 2007.784785
std 844.226713 1.887552e+08 43.351837 22.987056 7353.026485 1.405082 1.110848 30.258868 21.200024 174.156041 ... 216.396308 132.411630 69.244398 60.037423 27.776465 59.571394 43.605315 617.362905 2.753136 1.313997
min 1.000000 5.263011e+08 20.000000 21.000000 1300.000000 1.000000 1.000000 1879.000000 1950.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000
25% 746.500000 5.284567e+08 20.000000 59.000000 7455.500000 5.000000 5.000000 1953.000000 1964.000000 0.000000 ... 316.250000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 2007.000000
50% 1496.500000 5.354546e+08 50.000000 69.000000 9465.000000 6.000000 5.000000 1975.000000 1993.000000 0.000000 ... 480.000000 0.000000 28.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000
75% 2174.750000 9.071855e+08 70.000000 80.000000 11635.500000 7.000000 6.000000 2001.000000 2004.000000 162.000000 ... 576.000000 168.000000 72.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000
max 2930.000000 9.241520e+08 190.000000 313.000000 159000.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 ... 1418.000000 1424.000000 547.000000 432.000000 508.000000 490.000000 800.000000 17000.000000 12.000000 2010.000000

8 rows × 38 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1538 entries, 484 to 1169
Data columns (total 80 columns):
Id                 1538 non-null int64
PID                1538 non-null int64
MS SubClass        1538 non-null int64
MS Zoning          1538 non-null object
Lot Frontage       1294 non-null float64
Lot Area           1538 non-null int64
Street             1538 non-null object
Alley              110 non-null object
Lot Shape          1538 non-null object
Land Contour       1538 non-null object
Utilities          1538 non-null object
Lot Config         1538 non-null object
Land Slope         1538 non-null object
Neighborhood       1538 non-null object
Condition 1        1538 non-null object
Condition 2        1538 non-null object
Bldg Type          1538 non-null object
House Style        1538 non-null object
Overall Qual       1538 non-null int64
Overall Cond       1538 non-null int64
Year Built         1538 non-null int64
Year Remod/Add     1538 non-null int64
Roof Style         1538 non-null object
Roof Matl          1538 non-null object
Exterior 1st       1538 non-null object
Exterior 2nd       1538 non-null object
Mas Vnr Type       1521 non-null object
Mas Vnr Area       1521 non-null float64
Exter Qual         1538 non-null object
Exter Cond         1538 non-null object
Foundation         1538 non-null object
Bsmt Qual          1501 non-null object
Bsmt Cond          1501 non-null object
Bsmt Exposure      1498 non-null object
BsmtFin Type 1     1501 non-null object
BsmtFin SF 1       1538 non-null float64
BsmtFin Type 2     1500 non-null object
BsmtFin SF 2       1538 non-null float64
Bsmt Unf SF        1538 non-null float64
Total Bsmt SF      1538 non-null float64
Heating            1538 non-null object
Heating QC         1538 non-null object
Central Air        1538 non-null object
Electrical         1538 non-null object
1st Flr SF         1538 non-null int64
2nd Flr SF         1538 non-null int64
Low Qual Fin SF    1538 non-null int64
Gr Liv Area        1538 non-null int64
Bsmt Full Bath     1537 non-null float64
Bsmt Half Bath     1537 non-null float64
Full Bath          1538 non-null int64
Half Bath          1538 non-null int64
Bedroom AbvGr      1538 non-null int64
Kitchen AbvGr      1538 non-null int64
Kitchen Qual       1538 non-null object
TotRms AbvGrd      1538 non-null int64
Functional         1538 non-null object
Fireplaces         1538 non-null int64
Fireplace Qu       798 non-null object
Garage Type        1449 non-null object
Garage Yr Blt      1449 non-null float64
Garage Finish      1449 non-null object
Garage Cars        1538 non-null float64
Garage Area        1538 non-null float64
Garage Qual        1449 non-null object
Garage Cond        1449 non-null object
Paved Drive        1538 non-null object
Wood Deck SF       1538 non-null int64
Open Porch SF      1538 non-null int64
Enclosed Porch     1538 non-null int64
3Ssn Porch         1538 non-null int64
Screen Porch       1538 non-null int64
Pool Area          1538 non-null int64
Pool QC            9 non-null object
Fence              286 non-null object
Misc Feature       50 non-null object
Misc Val           1538 non-null int64
Mo Sold            1538 non-null int64
Yr Sold            1538 non-null int64
Sale Type          1538 non-null object
dtypes: float64(11), int64(27), object(42)
memory usage: 973.3+ KB
Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual',
       'Garage Cond', 'Paved Drive', 'Wood Deck SF', 'Open Porch SF',
       'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Pool QC',
       'Fence', 'Misc Feature', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type'],

Closing up the column names so I can use dot notation.

def colclean(column_list):
    # Lowercase every column name and remove the spaces.
    return [n.lower().replace(' ', '') for n in column_list]
X_train.columns = colclean(X_train.columns)
X_test.columns = colclean(X_test.columns)
X_full.columns = colclean(X_full.columns)
finaltest.columns = colclean(finaltest.columns)
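
A caveat on dot notation: it only works when the cleaned name is a valid Python identifier, so names like 1stflrsf or yearremod/add still need brackets. Here is a small sketch on a toy frame. The helper name clean_names and the values are made up for illustration.

```python
import pandas as pd

def clean_names(columns):
    # Lowercase and strip spaces, mirroring the cleanup described above.
    return [name.lower().replace(" ", "") for name in columns]

toy = pd.DataFrame({"Lot Area": [8050, 5500],
                    "1st Flr SF": [900, 1200],
                    "Year Remod/Add": [1990, 2005]})
toy.columns = clean_names(toy.columns)

# Dot notation works when the name is a valid Python identifier...
lot_area = toy.lotarea

# ...but names that start with a digit or contain "/" need brackets.
needs_brackets = [c for c in toy.columns if not c.isidentifier()]
first_floor = toy["1stflrsf"]
print(needs_brackets)
```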

Checking for duplicate PIDs.

X_train.duplicated(subset='pid', keep='first').sum()

Got any nulls lying around?

id                  0
pid                 0
mssubclass          0
mszoning            0
lotfrontage       244
lotarea             0
street              0
alley            1428
lotshape            0
landcontour         0
utilities           0
lotconfig           0
landslope           0
neighborhood        0
condition1          0
condition2          0
bldgtype            0
housestyle          0
overallqual         0
overallcond         0
yearbuilt           0
yearremod/add       0
roofstyle           0
roofmatl            0
exterior1st         0
exterior2nd         0
masvnrtype         17
masvnrarea         17
exterqual           0
extercond           0
fullbath            0
halfbath            0
bedroomabvgr        0
kitchenabvgr        0
kitchenqual         0
totrmsabvgrd        0
functional          0
fireplaces          0
fireplacequ       740
garagetype         89
garageyrblt        89
garagefinish       89
garagecars          0
garagearea          0
garagequal         89
garagecond         89
paveddrive          0
wooddecksf          0
openporchsf         0
enclosedporch       0
3ssnporch           0
screenporch         0
poolarea            0
poolqc           1529
fence            1252
miscfeature      1488
miscval             0
mosold              0
yrsold              0
saletype            0
Length: 80, dtype: int64

Here’s a trick I learned.

X_train.isna().sum()[X_train.isna().sum() !=0]
lotfrontage      244
alley           1428
masvnrtype        17
masvnrarea        17
bsmtqual          37
bsmtcond          37
bsmtexposure      40
bsmtfintype1      37
bsmtfintype2      38
bsmtfullbath       1
bsmthalfbath       1
fireplacequ      740
garagetype        89
garageyrblt       89
garagefinish      89
garagequal        89
garagecond        89
poolqc          1529
fence           1252
miscfeature     1488
dtype: int64
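
The one-liner above computes the null counts twice. A slightly tidier variant stores the counts once and then filters them. This is just a sketch on a toy frame with invented values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "lotfrontage": [70.0, np.nan, 85.0, np.nan],
    "lotarea": [8050, 5500, 13175, 3087],
    "alley": [np.nan, np.nan, np.nan, "Pave"],
})

null_counts = toy.isna().sum()               # count the nulls once
with_nulls = null_counts[null_counts != 0]   # keep only the nonzero counts
print(with_nulls)
```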

Having a look at what the object column with null values looks like.

array([nan, 'TA', 'Gd', 'Ex', 'Fa'], dtype=object)

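Just for the sake of time, I’m going to fill the nulls in the numerical columns with each column’s mean. The object columns keep their nulls for now.

# numeric_only=True skips the object columns; recent pandas versions
# raise an error if asked for the mean of a text column.
X_train = X_train.fillna(X_train.mean(numeric_only=True))
X_test = X_test.fillna(X_test.mean(numeric_only=True))
X_full = X_full.fillna(X_full.mean(numeric_only=True))
finaltest = finaltest.fillna(finaltest.mean(numeric_only=True))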
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1538 entries, 484 to 1169
Data columns (total 80 columns):
id               1538 non-null int64
pid              1538 non-null int64
mssubclass       1538 non-null int64
mszoning         1538 non-null object
lotfrontage      1538 non-null float64
lotarea          1538 non-null int64
street           1538 non-null object
alley            110 non-null object
lotshape         1538 non-null object
landcontour      1538 non-null object
utilities        1538 non-null object
lotconfig        1538 non-null object
landslope        1538 non-null object
neighborhood     1538 non-null object
condition1       1538 non-null object
condition2       1538 non-null object
bldgtype         1538 non-null object
housestyle       1538 non-null object
overallqual      1538 non-null int64
overallcond      1538 non-null int64
yearbuilt        1538 non-null int64
yearremod/add    1538 non-null int64
roofstyle        1538 non-null object
roofmatl         1538 non-null object
exterior1st      1538 non-null object
exterior2nd      1538 non-null object
masvnrtype       1521 non-null object
masvnrarea       1538 non-null float64
exterqual        1538 non-null object
extercond        1538 non-null object
foundation       1538 non-null object
bsmtqual         1501 non-null object
bsmtcond         1501 non-null object
bsmtexposure     1498 non-null object
bsmtfintype1     1501 non-null object
bsmtfinsf1       1538 non-null float64
bsmtfintype2     1500 non-null object
bsmtfinsf2       1538 non-null float64
bsmtunfsf        1538 non-null float64
totalbsmtsf      1538 non-null float64
heating          1538 non-null object
heatingqc        1538 non-null object
centralair       1538 non-null object
electrical       1538 non-null object
1stflrsf         1538 non-null int64
2ndflrsf         1538 non-null int64
lowqualfinsf     1538 non-null int64
grlivarea        1538 non-null int64
bsmtfullbath     1538 non-null float64
bsmthalfbath     1538 non-null float64
fullbath         1538 non-null int64
halfbath         1538 non-null int64
bedroomabvgr     1538 non-null int64
kitchenabvgr     1538 non-null int64
kitchenqual      1538 non-null object
totrmsabvgrd     1538 non-null int64
functional       1538 non-null object
fireplaces       1538 non-null int64
fireplacequ      798 non-null object
garagetype       1449 non-null object
garageyrblt      1538 non-null float64
garagefinish     1449 non-null object
garagecars       1538 non-null float64
garagearea       1538 non-null float64
garagequal       1449 non-null object
garagecond       1449 non-null object
paveddrive       1538 non-null object
wooddecksf       1538 non-null int64
openporchsf      1538 non-null int64
enclosedporch    1538 non-null int64
3ssnporch        1538 non-null int64
screenporch      1538 non-null int64
poolarea         1538 non-null int64
poolqc           9 non-null object
fence            286 non-null object
miscfeature      50 non-null object
miscval          1538 non-null int64
mosold           1538 non-null int64
yrsold           1538 non-null int64
saletype         1538 non-null object
dtypes: float64(11), int64(27), object(42)
memory usage: 973.3+ KB
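
The mean fill above uses each dataframe’s own means. A common alternative is to learn the means from the training rows only and reuse them on the test rows, so nothing from the test set leaks into preprocessing. This is not what I did in the project; it is a sketch on toy data with made-up values.

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({"lotfrontage": [60.0, np.nan, 80.0],
                          "alley": [np.nan, "Pave", np.nan]})
test_toy = pd.DataFrame({"lotfrontage": [np.nan, 100.0],
                         "alley": ["Grvl", np.nan]})

# Means of the numeric columns only, learned from the training rows.
train_means = train_toy.mean(numeric_only=True)

train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)  # reuse the training means
print(test_filled)
```

Note that the object column is left alone, which matches the info output above, where alley still has only 110 non-null values after the fill.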

Splitting each dataframe in two, one with all the numeric columns and one with all the object columns, so I can look at them more easily.

X_tr_obj = X_train.select_dtypes(exclude=[np.number])
X_tr_num = X_train.select_dtypes(include=[np.number])
X_ts_obj = X_test.select_dtypes(exclude=[np.number])
X_ts_num = X_test.select_dtypes(include=[np.number])
X_full_obj = X_full.select_dtypes(exclude=[np.number])
X_full_num = X_full.select_dtypes(include=[np.number])
finaltest_obj = finaltest.select_dtypes(exclude=[np.number])
finaltest_num = finaltest.select_dtypes(include=[np.number])
id pid mssubclass lotfrontage lotarea overallqual overallcond yearbuilt yearremod/add masvnrarea ... wooddecksf openporchsf enclosedporch 3ssnporch screenporch poolarea poolqc miscval mosold yrsold
908 2559 534455080 20 80.0 9600 5 6 1961 1990 0.0 ... 144 0 205 0 0 0 NaN 0 6 2006
1619 1947 535375130 50 60.0 10134 5 6 1940 1950 0.0 ... 0 39 0 0 0 0 NaN 0 7 2007
391 81 531453010 20 81.0 9672 6 5 1984 1985 0.0 ... 0 0 0 0 0 0 NaN 0 5 2010
861 2573 535151130 90 70.0 7728 5 6 1962 1962 120.0 ... 0 18 0 0 0 0 NaN 0 5 2006
1270 1569 914476080 90 76.0 10260 5 4 1976 1976 0.0 ... 0 0 0 0 0 0 NaN 0 11 2008

5 rows × 39 columns

Checking to make sure it worked.

(1538, 80)
(1538, 42)
(1538, 38)

Let’s check for potential outliers.

count mean std min 25% 50% 75% max
id 1538.0 1.469118e+03 8.442267e+02 1.0 7.465000e+02 1.496500e+03 2.174750e+03 2930.0
pid 1538.0 7.148299e+08 1.887552e+08 526301100.0 5.284567e+08 5.354546e+08 9.071855e+08 924152030.0
mssubclass 1538.0 5.754226e+01 4.335184e+01 20.0 2.000000e+01 5.000000e+01 7.000000e+01 190.0
lotfrontage 1538.0 6.954096e+01 2.108364e+01 21.0 6.000000e+01 6.954096e+01 7.900000e+01 313.0
lotarea 1538.0 1.017908e+04 7.353026e+03 1300.0 7.455500e+03 9.465000e+03 1.163550e+04 159000.0
overallqual 1538.0 6.109883e+00 1.405082e+00 1.0 5.000000e+00 6.000000e+00 7.000000e+00 10.0
overallcond 1538.0 5.571521e+00 1.110848e+00 1.0 5.000000e+00 5.000000e+00 6.000000e+00 9.0
yearbuilt 1538.0 1.971674e+03 3.025887e+01 1879.0 1.953000e+03 1.975000e+03 2.001000e+03 2010.0
yearremod/add 1538.0 1.984081e+03 2.120002e+01 1950.0 1.964000e+03 1.993000e+03 2.004000e+03 2010.0
masvnrarea 1538.0 9.911308e+01 1.731902e+02 0.0 0.000000e+00 0.000000e+00 1.600000e+02 1600.0
bsmtfinsf1 1538.0 4.402523e+02 4.704420e+02 0.0 0.000000e+00 3.610000e+02 7.287500e+02 5644.0
bsmtfinsf2 1538.0 4.791873e+01 1.636097e+02 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1474.0
bsmtunfsf 1538.0 5.709402e+02 4.445010e+02 0.0 2.222500e+02 4.800000e+02 8.150000e+02 2336.0
totalbsmtsf 1538.0 1.059111e+03 4.524990e+02 0.0 7.895000e+02 9.945000e+02 1.324000e+03 6110.0
1stflrsf 1538.0 1.165298e+03 4.039276e+02 334.0 8.782500e+02 1.092500e+03 1.408500e+03 5095.0
2ndflrsf 1538.0 3.328362e+02 4.238517e+02 0.0 0.000000e+00 0.000000e+00 6.995000e+02 1836.0
lowqualfinsf 1538.0 5.667100e+00 5.337538e+01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1064.0
grlivarea 1538.0 1.503801e+03 5.047836e+02 334.0 1.143000e+03 1.452000e+03 1.724000e+03 5642.0
bsmtfullbath 1538.0 4.307092e-01 5.182866e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 2.0
bsmthalfbath 1538.0 6.115810e-02 2.476318e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 2.0
fullbath 1538.0 1.583225e+00 5.445938e-01 0.0 1.000000e+00 2.000000e+00 2.000000e+00 4.0
halfbath 1538.0 3.719116e-01 4.993598e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 2.0
bedroomabvgr 1538.0 2.843953e+00 8.124554e-01 0.0 2.000000e+00 3.000000e+00 3.000000e+00 8.0
kitchenabvgr 1538.0 1.041612e+00 2.093097e-01 0.0 1.000000e+00 1.000000e+00 1.000000e+00 3.0
totrmsabvgrd 1538.0 6.445384e+00 1.545643e+00 2.0 5.000000e+00 6.000000e+00 7.000000e+00 15.0
fireplaces 1538.0 6.046814e-01 6.481274e-01 0.0 0.000000e+00 1.000000e+00 1.000000e+00 4.0
garageyrblt 1538.0 1.978795e+03 2.501178e+01 1895.0 1.962000e+03 1.978795e+03 2.001000e+03 2207.0
garagecars 1538.0 1.774382e+00 7.672165e-01 0.0 1.000000e+00 2.000000e+00 2.000000e+00 4.0
garagearea 1538.0 4.714246e+02 2.163963e+02 0.0 3.162500e+02 4.800000e+02 5.760000e+02 1418.0
wooddecksf 1538.0 9.520741e+01 1.324116e+02 0.0 0.000000e+00 0.000000e+00 1.680000e+02 1424.0
openporchsf 1538.0 4.925618e+01 6.924440e+01 0.0 0.000000e+00 2.800000e+01 7.200000e+01 547.0
enclosedporch 1538.0 2.301886e+01 6.003742e+01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 432.0
3ssnporch 1538.0 2.914174e+00 2.777646e+01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 508.0
screenporch 1538.0 1.720026e+01 5.957139e+01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 490.0
poolarea 1538.0 3.197659e+00 4.360531e+01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 800.0
miscval 1538.0 5.528283e+01 6.173629e+02 0.0 0.000000e+00 0.000000e+00 0.000000e+00 17000.0
mosold 1538.0 6.195709e+00 2.753136e+00 1.0 4.000000e+00 6.000000e+00 8.000000e+00 12.0
yrsold 1538.0 2.007785e+03 1.313997e+00 2006.0 2.007000e+03 2.008000e+03 2.009000e+03 2010.0

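Obviously, some of the maxes are large, but that’s the way real estate works. The one exception is the garageyrblt max of 2207, which can’t be a real year and is presumably a data-entry error. Otherwise, nothing here pops out as being wrong.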

Okay, now that we have separated our dataframe into a numerical one and a categorical one, let’s take a look-see at the numerical correlations.

id               0.046963
pid              0.243312
mssubclass       0.104183
lotfrontage      0.320966
lotarea          0.301233
overallqual      0.787963
overallcond      0.094205
yearbuilt        0.560274
yearremod/add    0.531071
masvnrarea       0.503817
bsmtfinsf1       0.423380
bsmtfinsf2       0.010237
bsmtunfsf        0.180959
totalbsmtsf      0.621630
1stflrsf         0.616752
2ndflrsf         0.244477
lowqualfinsf     0.031802
grlivarea        0.695442
bsmtfullbath     0.279659
bsmthalfbath     0.036105
fullbath         0.541253
halfbath         0.296527
bedroomabvgr     0.150723
kitchenabvgr     0.131280
totrmsabvgrd     0.495609
fireplaces       0.467371
garageyrblt      0.499246
garagecars       0.646358
garagearea       0.654564
wooddecksf       0.322195
openporchsf      0.326194
enclosedporch    0.113088
3ssnporch        0.062105
screenporch      0.128890
poolarea         0.026498
miscval          0.013014
mosold           0.026836
yrsold           0.014487
sp               1.000000
Name: sp, dtype: float64

For this first model, we’re going to choose all the columns where the correlation coefficient with SalePrice is greater than or equal to some value I determine.

vals = abs(X_tr_num.corr().sp).drop('sp').sort_values(ascending=False)
corr_cols = list(vals[vals >= 0.3].index)

X_tr_mod1 = X_tr_num[corr_cols]
X_ts_mod1 = X_ts_num[corr_cols]
X_full_mod1 = X_full_num[corr_cols]
finaltest_num = finaltest_num[corr_cols]
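
The sp column used above isn’t created in any of the code shown here; presumably the target was attached to the numeric training frame under that name. Here is a sketch of the whole selection step under that assumption, using invented toy numbers.

```python
import pandas as pd

num_toy = pd.DataFrame({
    "overallqual": [5, 6, 7, 8, 9, 4],
    "grlivarea": [1000, 1500, 1400, 2000, 2600, 900],
    "mosold": [6, 1, 1, 1, 6, 6],
})
price_toy = [120000, 150000, 170000, 230000, 310000, 100000]

# Assumed step: attach the target as 'sp' so that corr() includes it.
num_toy = num_toy.assign(sp=price_toy)

# Absolute correlation of every feature with the target, strongest first.
vals = num_toy.corr()["sp"].abs().drop("sp").sort_values(ascending=False)

# Keep the features at or above the cutoff.
corr_cols = list(vals[vals >= 0.3].index)
print(corr_cols)
```

Taking the absolute value matters: a strongly negative correlation is just as useful to a linear model as a strongly positive one.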


First, let’s notice that some of these features should be correlated with each other: garageyrblt with yearbuilt, garagearea with garagecars, and totalbsmtsf, masvnrarea, grlivarea, and 1stflrsf with one another. So let’s make some interaction variables.

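from sklearn.preprocessing import PolynomialFeatures
# The original call is cut off after interaction_only; any further
# arguments are unknown, so the defaults are used here.
pf = PolynomialFeatures(degree=2, interaction_only=False)
pf.fit(X_tr_mod1)  # learn the feature layout from the training columns
X_tr_mod1 = pf.transform(X_tr_mod1)
X_ts_mod1 = pf.transform(X_ts_mod1)
X_full_mod1 = pf.transform(X_full_mod1)
finaltest_num = pf.transform(finaltest_num)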

Okay, now let’s use a standard scaler to put every feature on the same scale.

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_tr_mod1 = ss.fit_transform(X_tr_mod1)
X_ts_mod1 = ss.transform(X_ts_mod1)
X_full_mod1 = ss.fit_transform(X_full_mod1)
finaltest_num = ss.transform(finaltest_num)
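
The key idea in the scaling step is that the scaler learns its means and standard deviations from one dataset (fit_transform) and then reuses them on another (transform). Here is a sketch on toy arrays with invented numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]])
test_toy = np.array([[2000.0, 4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learn mean/std on train
test_scaled = scaler.transform(test_toy)        # reuse them on test

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1, which keeps the lasso, ridge, and elastic net penalties from favoring features just because of their units.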

So, for lasso, ridge, and enet, I played with different alpha ranges, different numbers of iterations, and also different correlation thresholds.

Let’s try a lasso!

l_alphas = np.arange(.001, .15, .0025)
lasso_model = LassoCV(alphas=l_alphas, max_iter=2000, cv=5)
# lasso_model = LassoCV(max_iter=10000, cv=5)

model_1 = lasso_model.fit(X_tr_mod1, y_train)  # fit on the training split

Great, let’s score the lasso.

print(model_1.score(X_ts_mod1, y_test))
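
For anyone who wants to try the lasso step without the housing files, here is a self-contained sketch of the same pattern on synthetic data. The make_regression data and the variable names are my own stand-ins, so the numbers it prints say nothing about the housing model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the housing data).
X_syn, y_syn = make_regression(n_samples=200, n_features=10,
                               n_informative=5, noise=5.0, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, random_state=0)

scaler = StandardScaler()
X_a = scaler.fit_transform(X_a)
X_b = scaler.transform(X_b)

# LassoCV tries every alpha with 5-fold CV and keeps the best one.
alphas = np.arange(.001, .15, .0025)
lasso_toy = LassoCV(alphas=alphas, max_iter=2000, cv=5).fit(X_a, y_a)

print("chosen alpha:", lasso_toy.alpha_)
print("held-out R^2:", lasso_toy.score(X_b, y_b))
```

The score method returns R-squared, so 1.0 is a perfect fit and 0 is no better than predicting the mean.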

Hey, that’s not a bad score at all! What if we tried ridge?

ridge_alphas = np.logspace(0, 5, 200)

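ridge_model = RidgeCV(alphas=ridge_alphas, cv=10)
# ridge_model = RidgeCV(cv=10)
# Assumed fit on the training split; the original call is truncated.
ridge_model = ridge_model.fit(X_tr_mod1, y_train)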
RidgeCV(alphas=array([1.00000e+00, 1.05956e+00, ..., 9.43788e+04, 1.00000e+05]),
    cv=10, fit_intercept=True, gcv_mode=None, normalize=False,
    scoring=None, store_cv_values=False)
ridge = Ridge(alpha=ridge_model.alpha_)

ridge_scores = cross_val_score(ridge, X_ts_mod1, y_test, cv=15)
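
A note on what cross_val_score does here: it refits a fresh copy of the model on folds of whatever data it is given, so passing it the test split measures how well ridge can be fit to that split, not how well the model trained on the training split generalizes. Here is a sketch contrasting the two on synthetic data. All names and data are stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score, train_test_split

X_syn, y_syn = make_regression(n_samples=300, n_features=8,
                               noise=10.0, random_state=1)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, random_state=1)

# Pick alpha by cross-validation on the training split.
ridge_cv = RidgeCV(alphas=np.logspace(0, 5, 200), cv=10).fit(X_a, y_a)
ridge_toy = Ridge(alpha=ridge_cv.alpha_)

# Option 1: refit on folds of the held-out data (what cross_val_score does).
fold_scores = cross_val_score(ridge_toy, X_b, y_b, cv=5)

# Option 2: fit once on train, score once on the held-out data.
holdout_score = ridge_toy.fit(X_a, y_a).score(X_b, y_b)

print(fold_scores.mean(), holdout_score)
```

On well-behaved data the two numbers tend to be close, but only the second one describes the model that was actually trained.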


Hmmm. And last, everyone’s favorite, the elastic net.

l1_ratios = np.linspace(0.01, 1.0, 25)

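# The original call is cut off after cv=10; any further arguments are
# unknown, so the defaults are used here.
enet = ElasticNetCV(l1_ratio=l1_ratios, n_alphas=100, cv=10)
# enet = ElasticNetCV(cv=10, verbose=0)
enet.fit(X_tr_mod1, y_train)  # assumed fit on the training split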


enet = ElasticNet(alpha=enet.alpha_, l1_ratio=enet.l1_ratio_)

enet_scores = cross_val_score(enet, X_ts_mod1, y_test, cv=10)


It’s basically the same. But I found that as I lowered my correlation cutoff, my lasso score stayed largely the same while my ridge and enet scores went up a tiny bit, culminating in an R-squared of 0.89 from the elastic net.

# l_alphas = np.arange(.001, .15, .0025)
# lasso_model_final = LassoCV(alphas=l_alphas, cv=5)
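# model_1_final = lasso_model_final.fit(X_full_mod1, y)

# The original calls are truncated. Fitting on the full training data
# (X_full_mod1, y) is an assumption based on the variables defined above.
enet = ElasticNetCV(l1_ratio=l1_ratios, n_alphas=100, cv=10)
model_1_final = enet.fit(X_full_mod1, y)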

/anaconda3/envs/dsi/lib/python3.6/site-packages/sklearn/linear_model/ ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
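
The warning suggests increasing the number of iterations. A minimal sketch of that fix on synthetic data follows. The data is a stand-in and max_iter=10000 is an arbitrary choice I made for illustration, not a tuned value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_regression(n_samples=200, n_features=10,
                               noise=5.0, random_state=2)
X_syn = StandardScaler().fit_transform(X_syn)

l1_ratios = np.linspace(0.01, 1.0, 25)

# max_iter defaults to 1000; raising it gives coordinate descent more
# passes to converge. Loosening tol is the other knob.
enet_toy = ElasticNetCV(l1_ratio=l1_ratios, cv=5, max_iter=10000)
enet_toy.fit(X_syn, y_syn)

print(enet_toy.alpha_, enet_toy.l1_ratio_, enet_toy.n_iter_)
```

If the warning persists even with more iterations, the smallest alphas on the search path are usually the culprit, as the message itself hints.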

This is how I send my model’s predictions to a file.

evansubmission1 = pd.DataFrame(data = model_1_final.predict(finaltest_num), columns = ['SalePrice'], index=finaltest['id'])
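
Building the dataframe is only half of it; to_csv does the actual writing. Here is a sketch of the full round trip using made-up ids and predictions, writing to an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

ids = pd.Series([2658, 2718, 2414], name="id")   # made-up ids
preds = [134000.0, 152500.0, 210250.0]           # made-up predictions

submission = pd.DataFrame(data=preds, columns=["SalePrice"], index=ids)

# Writing to a buffer here; a path such as './evansubmission1.csv'
# works the same way.
buffer = io.StringIO()
submission.to_csv(buffer)

# Read it back to confirm the file holds what we expect.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), index_col=0)
print(round_trip)
```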

Plotting my test predictions vs. my test y for a nice visualization of the efficacy of my model.

predictions = model_1_final.predict(X_ts_mod1)
y = y_test

# Plot the model
plt.scatter(predictions, y, s=30, c='g', marker='*', zorder=10)
plt.xlabel("Predicted Values of Price From My Horrible Model")
plt.ylabel("Actual Values of Price")

plt.plot([0, np.max(y)], [0, np.max(y)], c = 'k')
score = cross_val_score(model_1_final, X_ts_mod1, y_test, cv=10)
print("score: ", score.mean())


/anaconda3/envs/dsi/lib/python3.6/site-packages/sklearn/linear_model/ ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.

score:  0.8773479828283166

And that’s it for the first model! Don’t forget to look at the other two model files by checking out the GitHub repository. I’ve included the data dictionary below, and finally, the presentation I made at the very end.

# evansubmission1 = pd.DataFrame(data = model_1.predict(X_ts_mod1), columns = ['SalePrice'], index=y_test['Id'])
# evansubmission1.to_csv('./evansubmission1.csv')

Finally, here’s my presentation. I put this at the end because I delivered it with an eye for light-heartedness and comedy, as I was presenting it just to my fellow students. I was originally going to change it for the blog to make it more professional, but I enjoyed too many of the jokes I had written, and it’s somewhat indicative of my personality. However, as you can see from my other projects, I am quite capable of giving a fully professional presentation. Anyway, enjoy!

Written on May 19, 2017