Modeling, The Movie
(See the README of this repository for the entire write-up.)
For modeling, I took the approach of throwing everything at the wall and seeing what stuck. I imported many different models: regressors including linear regression, LASSO, SGD regressor, bagging regressor, random forest regressor, SVR, and AdaBoost regressor, as well as classifiers including logistic regression, random forest classifier, AdaBoost classifier, k-nearest neighbors classifier, decision tree classifier, and even a neural network.
import imdb
import warnings
warnings.simplefilter("ignore")
import re
import pandas as pd
import numpy as np
import ast
from datetime import datetime, timedelta
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, SGDRegressor
from sklearn.feature_selection import RFE
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error, f1_score
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
I brought in my six DataFrames:
- 1 df 2 = directors weighted, deleted columns with 1 or fewer terms
- 2 df 2 = directors and actors weighted, deleted columns with 1 or fewer terms
- 3 df 2 = directors, actors, and writers weighted, deleted columns with 1 or fewer terms
- 1 df 3 = directors weighted, deleted columns with 2 or fewer terms
- 2 df 3 = directors and actors weighted, deleted columns with 2 or fewer terms
- 3 df 3 = directors, actors, and writers weighted, deleted columns with 2 or fewer terms
# Pre-made train/test DataFrames with director, actor, and writer weights
# X_train = pd.read_csv('train_everything_director_weights_df2.csv') # 1
# X_test = pd.read_csv('test_everything_director_weights_df2.csv') # 1
# X_train = pd.read_csv('train_everything_director_actor_weights_df2.csv') # 2
# X_test = pd.read_csv('test_everything_director_actor_weights_df2.csv') # 2
X_train = pd.read_csv('train_everything_director_actor_writer_weights_df2.csv') # 3
X_test = pd.read_csv('test_everything_director_actor_writer_weights_df2.csv') # 3
# X_train = pd.read_csv('train_everything_director_weights_df3.csv') # 4
# X_test = pd.read_csv('test_everything_director_weights_df3.csv') # 4
# X_train = pd.read_csv('train_everything_director_actor_weights_df3.csv') # 5
# X_test = pd.read_csv('test_everything_director_actor_weights_df3.csv') # 5
# X_train = pd.read_csv('train_everything_director_actor_writer_weights_df3.csv') # 6
# X_test = pd.read_csv('test_everything_director_actor_writer_weights_df3.csv') # 6
I then fed the DataFrames through the following cell, which gave me three regressor scores, then transformed my y variable for classification (binarized around the median Metacritic score) and fed it through three classifiers. Throughout this process many models were tried and thrown out, and DataFrames were changed, re-saved, and reloaded. At the end of the day I decided on the following models:
- Regression
  - Bagging Regressor
  - Random Forest Regressor
  - LASSO
- Classification
  - Logistic Regression
  - Bagging Classifier
  - Random Forest Classifier
Except for LASSO and logistic regression, there wasn't much rhyme or reason to these choices: they simply gave me the best relative scores of the models I tried, and they didn't take a huge amount of time to run. The bagging regressor and classifier, which never seemed to score as well as the other models, still ran quickly and served as a veritable canary in a coal mine, warning me if something had gone wrong with the pipeline.
y_train = X_train.Metascore
y_test = X_test.Metascore
X_train.drop(['Metascore'], axis=1, inplace=True)
X_test.drop(['Metascore'], axis=1, inplace=True)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
br = BaggingRegressor()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()
lasso = Lasso(alpha=.15)
lasso.fit(X_train, y_train)
# print('lasso train score: ')
# print(lasso.score(X_train, y_train))
print('lasso test score: ')
print(lasso.score(X_test, y_test))
print()
# Binarize Metascores around the training median for classification
median = np.median(y_train)
new_y = []
for n in y_train:
    if n > median:
        new_y.append(1)
    else:
        new_y.append(0)
y_train = new_y
new_y = []
for n in y_test:
    if n > median:
        new_y.append(1)
    else:
        new_y.append(0)
y_test = new_y
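# Aside: since y_train and y_test start out as pandas Series, each loop
# above could be replaced by a vectorized one-liner (an alternative
# idiom, not what was run here), e.g.:
#   y_train = (y_train > median).astype(int)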
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# print('logreg train score: ')
# print(logreg.score(X_train, y_train))
print('logreg test score: ')
print(logreg.score(X_test, y_test))
print()
br = BaggingClassifier()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()
# 1 reg
# br test score:
# 0.038342711196289625
# rf test score:
# 0.11832620794676674
# lasso test score:
# 0.19244316790430385
# 1 class
# logreg test score:
# 0.6736577181208053
# br test score:
# 0.662751677852349
# rf test score:
# 0.6753355704697986
# 2 reg
# br test score:
# 0.006896130293002622
# rf test score:
# 0.07139091002869702
# lasso test score:
# 0.1924431679043039
# 2 class
# logreg test score:
# 0.6736577181208053
# br test score:
# 0.6375838926174496
# rf test score:
# 0.6585570469798657
# 3 reg
# br test score:
# 0.05994540328342234
# rf test score:
# -0.03186605837286138
# lasso test score:
# 0.1924431679043039
# 3 class
# logreg test score:
# 0.6736577181208053
# br test score:
# 0.6384228187919463
# rf test score:
# 0.6719798657718121
# 4 reg
# br test score:
# 0.023266042810753954
# rf test score:
# 0.07619378931494514
# lasso test score:
# 0.21460320119560472
# 4 class
# logreg test score:
# 0.6854026845637584
# br test score:
# 0.6434563758389261
# rf test score:
# 0.6375838926174496
# 5 reg
# br test score:
# 0.005276011558945859
# rf test score:
# 0.03497975713168888
# lasso test score:
# 0.21460320119560497
# 5 class
# logreg test score:
# 0.6854026845637584
# br test score:
# 0.6518456375838926
# rf test score:
# 0.6610738255033557
# 6 reg
# br test score:
# 0.03739589734130877
# rf test score:
# -0.02765041735558893
# lasso test score:
# 0.21460320119560503
# 6 class
# logreg test score:
# 0.6854026845637584
# br test score:
# 0.6593959731543624
# rf test score:
# 0.6753355704697986
br test score:
0.03739589734130877
rf test score:
-0.02765041735558893
lasso test score:
0.21460320119560503
logreg test score:
0.6854026845637584
br test score:
0.6593959731543624
rf test score:
0.6753355704697986
cap_mods = pd.read_csv('capstone_models_1.csv')
cap_mods.columns = ['', '1 df 3', '2 df 3', '3 df 3', '1 df 2', '2 df 2',
                    '3 df 2']
cap_mods = cap_mods.set_index('')
cap_mods_class = cap_mods.iloc[3:,:].copy()
cap_mods_reg = cap_mods.iloc[:3,:].copy()
sns.set_style("darkgrid",{"xtick.color":"black", "ytick.color":"black"})
plt.figure(figsize=(10,5))
sns.heatmap(cap_mods_reg, annot = True, cmap="Greens")
# plt.tick_params(color='white', labelcolor='white');
# sns.set_style("dark",{"xtick.color":"white", "ytick.color":"white"})
plt.figure(figsize=(10,5))
sns.heatmap(cap_mods_class, annot = True, cmap = "Blues")
# plt.tick_params(color='white', labelcolor='white');
After analyzing the output from my models, I decided to tune hyperparameters on the 3 df 2 DataFrame, aka # 3. Similarly, I tuned only random forest, logistic regression, and LASSO, omitting the other models for time. Frankly, the differences in performance are largely negligible, but I might as well take the .02 bump provided by my best models.
cap_mods
| | 1 df 3 | 2 df 3 | 3 df 3 | 1 df 2 | 2 df 2 | 3 df 2 |
|---|---|---|---|---|---|---|
| br reg | 0.038343 | 0.006896 | 0.059945 | 0.023266 | 0.005276 | 0.037396 |
| rf reg | 0.118326 | 0.071391 | -0.031866 | 0.076194 | 0.034980 | -0.027650 |
| lasso reg | 0.192443 | 0.192443 | 0.192443 | 0.214603 | 0.214603 | 0.214603 |
| logreg class | 0.673658 | 0.673658 | 0.673658 | 0.685403 | 0.685403 | 0.685403 |
| br class | 0.662752 | 0.637584 | 0.638423 | 0.643456 | 0.651846 | 0.659396 |
| rf class | 0.675336 | 0.658557 | 0.671980 | 0.637584 | 0.661074 | 0.675336 |
# y_train = X_train.Metascore
# y_test = X_test.Metascore
# X_train.drop(['Metascore'], axis=1, inplace=True)
# X_test.drop(['Metascore'], axis=1, inplace=True)
# ss = StandardScaler()
# X_train = ss.fit_transform(X_train)
# X_test = ss.transform(X_test)
# median = np.median(y_train)
# new_y = []
# for n in y_train:
#     if n > median:
#         new_y.append(1)
#     else:
#         new_y.append(0)
# y_train = new_y
# new_y = []
# for n in y_test:
#     if n > median:
#         new_y.append(1)
#     else:
#         new_y.append(0)
# y_test = new_y
rf_params = {
    'max_depth': [None],
    'n_estimators': [200],
    'max_features': [2, 10],
}
gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)
# 0.7030201342281879
# 0.667
# {'max_depth': None, 'max_features': 10, 'n_estimators': 200}
# 0.697986577181208
# 0.6808888888888889
# {'max_depth': 1000, 'max_features': 2, 'n_estimators': 200}
0.697986577181208
0.6808888888888889
{'max_depth': 1000, 'max_features': 2, 'n_estimators': 200}
lasso = LassoCV()
lasso.fit(X_train, y_train)
lasso.score(X_test, y_test)
0.21850824761335874
lasso = Lasso()
lasso_params = {
    'alpha': [.15, 1.0],  # Lasso's parameter is 'alpha' (singular); 1.0 is the default
}
gs = GridSearchCV(lasso, param_grid=lasso_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)
logreg_params = {
    'penalty': ['l1'],
    'C': [10, 100],
}
gs = GridSearchCV(logreg, param_grid=logreg_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)
# 0.6971476510067114
# 0.6945555555555556
# {'C': 10, 'penalty': 'l1'}
# 0.7055369127516778
# 0.6921111111111111
# {'C': 10, 'penalty': 'l1'}
My best classifier (logistic regression) reached an accuracy of 0.6946, using C = 10 with an L1 penalty. My best regression R$^2$ score was 0.2146, from LASSO with $\alpha$ = .15.
There is no reason I shouldn't be able to do better than this given more time. Future recommendations are numerous; there are many ways to improve the score, the only real constraint being time.
In terms of data collection, there are several other large databases to access, including IMDb's own as well as Metacritic's. It is entirely possible I already have all the Metacritic scores, but I could always use more. Plus, Metacritic tracks statistics such as whether a movie is part of a franchise and how well the previous film did. I could, of course, engineer that data myself, but again, time is a factor here.
I would also like access to more of the crew (producers, cinematographers, composers, editors) as well as more of the cast. After all, the theory underlying this entire endeavor is that people make movies, and people are consistent in their product.
I could also impute null values, especially for fields like box office revenue, opening weekend revenue, and Rotten Tomatoes scores, any of which could replace Metacritic scores as the target variable. It would then be a simple mapping from one to the other, and there could easily be more Rotten Tomatoes scores available than Metacritic scores.
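A minimal sketch of what that imputation might look like, assuming hypothetical column names and scikit-learn's newer sklearn.impute API (the notebook itself ran on an older version):
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
# Hypothetical frame with missing revenue and review figures
movies = pd.DataFrame({'BoxOffice': [1.2e8, np.nan, 3.4e7],
                       'RottenTomatoes': [91, 78, np.nan]})
# Fill nulls with each column's median so models can train on every row
imputer = SimpleImputer(strategy='median')
movies[['BoxOffice', 'RottenTomatoes']] = imputer.fit_transform(
    movies[['BoxOffice', 'RottenTomatoes']])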
In terms of feature engineering, there are always more columns to make. I could use polynomial features on my numerical data (see the sketch below). I could try using just directors and writers. I could run more n-grams on the titles. I could change my min_df per column. I could sift down our list of actor weights. I could go back and try to get the actors' averages like before.
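For instance, the polynomial-features idea could look like this (a rough sketch; X_num and its columns are hypothetical stand-ins for my numerical data):
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Hypothetical numeric features: runtime (min), year, budget ($)
X_num = np.array([[112., 2015., 3.0e7],
                  [95., 2011., 1.2e7]])
# Degree-2 expansion adds squares and pairwise interactions; the column
# count grows quickly, which is why a regularized model like LASSO pairs well
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_num)
print(X_poly.shape)  # (2, 9): 3 originals + 3 squares + 3 interactions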
Finally, there are more models to use. Several would let me tune hyperparameters to eke out better scores. There are models that work better with NLP. I can try a neural network for both classification and regression. I can try a passive aggressive classifier (sketched below). And I'll do all that, and I'll predict movie scores, and eventually they'll make a movie about me.
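That passive aggressive classifier would drop straight into the classification loop above; here is a sketch, assuming the same scaled X_train/X_test and binarized y labels from earlier:
from sklearn.linear_model import PassiveAggressiveClassifier
# Same fit/score interface as the other classifiers used above
pac = PassiveAggressiveClassifier(max_iter=1000)
pac.fit(X_train, y_train)
print('pac test score: ')
print(pac.score(X_test, y_test))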
And that’s my capstone! Wasn’t it great?