Apply Logistic Regression To Amazon Reviews Data Set (M)
title: Apply Logistic regression to Amazon reviews data set. [M].ipynb, id: 1Es1wP2edJ0vrKasA5wY
title: Apply Naive Bayes to Amazon reviews [M].ipynb, id: 1qPxAZeYQUM-eqaKnOSM5ubK2IPIVmdyo
title: clean_final.sqlite, id: 1T0HyUqaVFyD8HfIQEM6WN8jF8SpEOsAo
title: KNN on Credit Card fraud detection.ipynb, id: 1CkA-RBfXqvubKkQrpnjbYUKVsC7VHlTl
title: creditcard.csv, id: 1VpeqlS0lPVrlzlMIqvQTzc3Pno_Cj4SV
title: creditcard.csv, id: 1bnZktEq3N_5wjoCH85oIXHxNwXUW_jx-
title: Untitled, id: 1K0wwkizWx3WO8d-zw-YewWIUrPdINYmp
title: final.sqlite, id: 1OzLc3k6-T55I-XRMq47ERyCbQbVw4caF
title: HeavyComputations.ipynb, id: 1aBORe3gqeFY-iNhzMtr-TIkzEyEvFxcG
title: LogisticRegression.ipynb, id: 1WcVTklMZBMu9VTCIWeupOK0r2aYbHk8p
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlite3
from datetime import datetime
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix
In [0]: train_vectors = vectorizer.transform(X_train)
In [44]: train_vectors.get_shape()
In [46]: X_train_resampled.shape
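The cells above use a vectorizer, a resampled training set, and scaled matrices that are not defined in the excerpt. A minimal sketch of how they might have been produced, assuming the TF-IDF vectorizer is fit on the training text only, the minority class is oversampled with imblearn's RandomOverSampler, and the sparse matrices are scaled without centering (the resampling method and scaler are assumptions):
# Assumed setup: fit TF-IDF on train only, oversample the minority class,
# then scale; RandomOverSampler and StandardScaler are assumptions.
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler

vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)                          # learn vocabulary and IDF on the training text only
train_vectors = vectorizer.transform(X_train)    # sparse TF-IDF matrix
test_vectors = vectorizer.transform(X_test)

ros = RandomOverSampler(random_state=0)
X_train_resampled, y_train_resampled = ros.fit_resample(train_vectors, y_train)

scaler = StandardScaler(with_mean=False)         # with_mean=False keeps the matrices sparse
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(test_vectors)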
#Using GridSearchCV
gscv = GridSearchCV(model, tuned_parameters, scoring = 'accuracy', cv=5)
t0 = datetime.now()
print(gscv.fit(X_train_scaled, y_train_resampled))
t1=datetime.now()
In [50]: gscv.best_estimator_
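The base model and the tuned_parameters grid passed to GridSearchCV above are not shown in the excerpt. A plausible sketch, assuming an L2-penalised logistic regression and a coarse grid over C (the exact grid is an assumption):
model = LogisticRegression(penalty='l2', solver='liblinear')   # assumed base estimator
tuned_parameters = {'C': [10 ** i for i in range(-4, 5)]}      # assumed grid over the inverse regularisation strength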
In [0]: predictions = gscv.best_estimator_.predict(X_test_scaled)
[[ 465 207]
[ 268 3851]]
In [57]: print("TPR = {}\n TNR = {}\n FPR = {}\n FNR = {}".format(tp/(fn+tp), tn/(tn+fp), fp/(tn+fp), fn/(fn+tp)))
TPR = 0.948989650073928
TNR = 0.6343792633015006
FPR = 0.3656207366984993
FNR = 0.05101034992607196
t0=datetime.now()
print(rscv.fit(X_train_scaled, y_train_resampled))
t1=datetime.now()
RandomizedSearchCV(cv=5, error_score='raise',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
fit_params={}, iid=True, n_iter=100, n_jobs=1,
param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd
pre_dispatch='2*n_jobs', random_state=None, refit=True,
scoring='accuracy', verbose=0)
Execution time = 0:14:46.027157
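The construction of rscv is not shown; the repr above indicates a frozen scipy distribution over C with n_iter=100, cv=5 and accuracy scoring. A minimal sketch, assuming a uniform distribution over C (the actual distribution and its range are assumptions):
from scipy.stats import uniform

rscv = RandomizedSearchCV(
    LogisticRegression(penalty='l2', solver='liblinear'),
    param_distributions={'C': uniform(loc=0.0001, scale=4)},   # assumed range for C
    n_iter=100, scoring='accuracy', cv=5)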
In [61]: rscv.best_estimator_
[[ 464 214]
[ 269 3844]]
In [63]: print("TPR = {}\n TNR = {}\n FPR = {}\n FNR = {}".format(tp/(fn+tp), tn/(tn+fp), fp/(tn+fp), fn/(fn+tp)))
TPR = 0.9472646623952686
TNR = 0.6330150068212824
FPR = 0.3669849931787176
FNR = 0.0527353376047314
0.1 Remarks
Huge improvement in performance over Naive Bayes (note the improved TNR), and there is not much difference between GridSearch and RandomSearch, although the latter is somewhat faster (note the execution times).
print(confusion_matrix(y_test, predictions).T)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("\n")
print("TPR = {}\n TNR = {}\n FPR = {}\n FNR = {}".format(tp/(fn+tp), tn/(tn+fp), fp/(tn
[[ 522 220]
[ 211 3838]]
TPR = 0.9457861015278463
TNR = 0.7121418826739427
FPR = 0.2878581173260573
FNR = 0.05421389847215377
The L1 regulariser outperforms the L2 regulariser (higher TNR) with only 50 iterations of random search; with 100 iterations, this result would likely improve further. A sketch of the L1 search follows.
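The random search over the L1-penalised model is not shown in the excerpt. A minimal sketch, assuming the same search space as above but with penalty='l1' and 50 iterations (the distribution and its range are assumptions):
rscv_l1 = RandomizedSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear'),       # liblinear supports the L1 penalty
    param_distributions={'C': uniform(loc=0.0001, scale=4)},    # assumed range, as above
    n_iter=50, scoring='accuracy', cv=5)
rscv_l1.fit(X_train_scaled, y_train_resampled)
predictions = rscv_l1.best_estimator_.predict(X_test_scaled)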
print("TPR = {}\n TNR = {}\n FPR = {}\n FNR = {}".format(tp/(fn+tp), tn/(tn+fp), fp/(
[[ 527 221]
[ 206 3837]]
TPR = 0.9455396747166092
TNR = 0.7189631650750341
FPR = 0.2810368349249659
FNR = 0.054460325283390836
number of nonzero components = 6181
sparsity = 0.7694258962211362
************************************************************************************************
[[ 526 220]
[ 207 3838]]
TPR = 0.9457861015278463
TNR = 0.7175989085948158
FPR = 0.28240109140518416
FNR = 0.05421389847215377
number of nonzero components = 6032
sparsity = 0.7749841459320327
************************************************************************************************
[[ 524 225]
[ 209 3833]]
TPR = 0.9445539674716609
TNR = 0.7148703956343793
FPR = 0.28512960436562074
FNR = 0.05544603252833908
number of nonzero components = 5900
sparsity = 0.7799082329242362
************************************************************************************************
[[ 527 224]
[ 206 3834]]
TPR = 0.944800394282898
TNR = 0.7189631650750341
FPR = 0.2810368349249659
FNR = 0.05519960571710202
number of nonzero components = 5743
sparsity = 0.7857649121498116
************************************************************************************************
[[ 529 223]
[ 204 3835]]
TPR = 0.9450468210941351
TNR = 0.7216916780354706
FPR = 0.2783083219645293
FNR = 0.05495317890586496
number of nonzero components = 5711
sparsity = 0.7869586302085276
************************************************************************************************
[[ 530 222]
[ 203 3836]]
TPR = 0.9452932479053721
TNR = 0.723055934515689
FPR = 0.27694406548431105
FNR = 0.05470675209462789
number of nonzero components = 5624
sparsity = 0.7902040511806617
************************************************************************************************
results for 4.5 times best lambda
[[ 528 225]
[ 205 3833]]
TPR = 0.9445539674716609
TNR = 0.7203274215552524
FPR = 0.27967257844474763
FNR = 0.05544603252833908
number of nonzero components = 5555
sparsity = 0.7927780057447682
************************************************************************************************
0.4 Remarks
With an increase in lambda (i.e., a decrease in C), the number of nonzero components decreases; however, there is no appreciable change in performance.
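The loop that produced the blocks above (confusion matrix, rates, nonzero count, and sparsity for increasing lambda) is not shown. A minimal sketch, assuming each multiple of the best lambda is applied by dividing the best C found by the search (the set of multipliers is an assumption):
# Assumed sweep: larger lambda means smaller C, so divide the best C by each multiplier.
best_C = rscv_l1.best_estimator_.C
for multiplier in [1.5, 2, 2.5, 3, 3.5, 4, 4.5]:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=best_C / multiplier)
    clf.fit(X_train_scaled, y_train_resampled)
    preds = clf.predict(X_test_scaled)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(confusion_matrix(y_test, preds).T)
    print("TPR = {}\n TNR = {}\n FPR = {}\n FNR = {}".format(
        tp/(fn+tp), tn/(tn+fp), fp/(tn+fp), fn/(fn+tp)))
    nonzero = np.count_nonzero(clf.coef_)
    print("number of nonzero components =", nonzero)
    print("sparsity =", 1 - nonzero / clf.coef_.size)
    print("*" * 96)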
model = LogisticRegression(C=0.0168)
model.fit(X_train_scaled, y_train_resampled)
w_non_noisy = model.coef_
#model.fit(X_noisy, y_train_resampled)
#w_noisy = model.coef_
In [26]: np.linalg.norm(w_non_noisy)
Out[26]: 4.6978174895120945
In [28]: np.linalg.norm(w_noisy)
Out[28]: 4.698050250116066
In [29]: diff
Out[29]: 0.04061884783793817
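The noisy fit is commented out above, and the construction of X_noisy and diff is not shown. A minimal sketch of the perturbation test, assuming a small amount of Gaussian noise is added to the scaled features (the noise scale and densification are assumptions):
# Assumed perturbation: add small Gaussian noise, refit, and compare the weight
# vectors; a small difference relative to ||w_non_noisy|| suggests low multicollinearity.
noise = 0.001 * np.random.randn(*X_train_scaled.shape)
X_noisy = X_train_scaled.toarray() + noise    # densify the sparse matrix; memory-heavy for large vocabularies

model_noisy = LogisticRegression(C=0.0168)
model_noisy.fit(X_noisy, y_train_resampled)
w_noisy = model_noisy.coef_

diff = np.linalg.norm(w_non_noisy - w_noisy)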
0.6 Remarks
Since the difference vector has a very small magnitude compared to w_non_noisy, we can conclude that there is very low multicollinearity between the features.
Feature Importance
In [82]: imp
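The computation of imp is not shown in the excerpt. A minimal sketch, assuming feature importance is taken as the magnitude of the logistic-regression coefficients paired with the TF-IDF feature names (the variable names here are hypothetical):
# Assumed: rank TF-IDF features by the absolute coefficient weight of the fitted model.
feature_names = np.array(vectorizer.get_feature_names())
weights = model.coef_.ravel()
order = np.argsort(np.abs(weights))[::-1]     # indices sorted by importance, descending
imp = pd.DataFrame({'feature': feature_names[order], 'weight': weights[order]})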