#!/usr/bin/env python
# coding: utf-8
# # HW 3: Spam/Ham Classification
# ## Due Date: 5/31 (Wed), 11:59 PM
#
# **Collaboration Policy**
#
# Data science is a collaborative activity. While you may talk with others about
# the project, we ask that you **write your solutions individually**. If you do
# discuss the assignments with others please **include their names** at the top
# of your notebook.
# **Collaborators**: *list collaborators here*
# ## This Assignment
# In this homework, you will use what you've learned in class to create a classifier that can distinguish spam (junk or commercial or bulk) emails from ham (non-spam) emails. In addition to providing some skeleton code to fill in, we will evaluate your work based on your model's accuracy and your written responses in this notebook.
#
# After this homework, you should feel comfortable with the following:
#
# - Part 1: Feature engineering with text data
# - Part 2: Using sklearn libraries to process data and fit models
# - Part 3: Validating the performance of your model and minimizing overfitting
# - Part 3: Generating and analyzing precision-recall curves
#
# ## <span style="color:red">Warning!</span>
# We've tried our best to filter the data for anything blatantly offensive, but unfortunately there may still be some examples you may find in poor taste. If you encounter these examples and believe they are inappropriate for students, please let a TA know and we will try to remove them for future semesters. Thanks for your understanding!
# ## Score Breakdown
# Question | Points
# --- | ---
# 1a | 2
# 1b | 2
# 1c | 2
# 2 | 3
# 3 | 3
# 4 | 3
# 5 | 3
# 6a | 3
# 6b | 3
# 6c | 3
# 6d | 3
# 7 | 4
# 8 | 4
# Total | 38
# # Part I - Initial Analysis
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns
sns.set(style="whitegrid", color_codes=True, font_scale=1.5)

class bcolor:
    BLACK = '\033[40m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    END = '\033[0m'

def print_passed(str_in):
    print(bcolor.BLACK + bcolor.YELLOW + bcolor.BOLD + str_in + bcolor.END)
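
# Quick illustration of the helper above (not part of the assignment):
# print_passed('Example: passed!')  # would print the message in bold yellow on a black background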
# ## Mount your Google Drive
# When you run a code cell, Colab executes it on a temporary cloud instance. Every time you open the notebook, you will be assigned a different machine. All compute state and files saved on the previous machine will be lost. Therefore, you may need to re-download datasets or rerun code after a reset. Here, you can mount your Google drive to the temporary cloud instance's local filesystem using the following code snippet and save files under the specified directory (note that you will have to provide permission every time you run this).
# In[2]:
# mount Google drive
from google.colab import drive
drive.mount('/content/drive')
# now you can see files
get_ipython().system('echo -e "\\nNumber of Google drive files in /content/drive/My Drive/:"')
get_ipython().system('ls -l "/content/drive/My Drive/" | wc -l')
# by the way, you can run any linux command by putting a ! at the start of the line
# by default everything gets executed and saved in /content/
get_ipython().system('echo -e "\\nCurrent directory:"')
get_ipython().system('pwd')
# In[3]:
workspace_path = '/content/drive/MyDrive/COSE471' # Change this path!
print(f'Current Workspace: {workspace_path}')
# ### Loading in the Data
#
# Our goal is to classify emails as spam or not spam (referred to as "ham") using features generated from the text in the email.
#
# The dataset consists of email messages and their labels (0 for ham, 1 for spam). Your labeled training dataset contains 8348 labeled examples, and the test set contains 1000 unlabeled examples.
#
# Run the following cells to load in the data into DataFrames.
#
# The `train` DataFrame contains labeled data that you will use to train your model. It contains four columns:
#
# 1. `id`: An identifier for the training example
# 1. `subject`: The subject of the email
# 1. `email`: The text of the email
# 1. `spam`: 1 if the email is spam, 0 if the email is ham (not spam)
#
# The `test` DataFrame contains 1000 unlabeled emails. You will predict labels for these emails and submit your predictions to Kaggle for evaluation.
# In[4]:
original_training_data = pd.read_csv(f'{workspace_path}/train.csv')
test = pd.read_csv(f'{workspace_path}/test.csv')
# Convert the emails to lower case as a first step to processing the text
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()
original_training_data.head()
# ### Question 1a
# First, let's check if our data contains any missing values.
#
# - Step 1: Fill in the cell below to print the number of NaN values in each column. **Hint**: [pandas.isnull](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html)
# - Step 2: If there are NaN values, replace them with appropriate filler values (i.e., NaN values in the `subject` or `email` columns should be replaced with empty strings).
# - Step 3: Print the number of NaN values in each column after this modification to verify that there are no NaN values left.
#
# <!--
# BEGIN QUESTION
# name: q1a
# points: 1
# -->
# In[5]:
# BEGIN YOUR CODE
# -----------------------
print('Before imputation:')
print(original_training_data.isnull().sum())
original_training_data=original_training_data.fillna("")
print('------------')
print('After imputation:')
print(original_training_data.isnull().sum())
# -----------------------
# END YOUR CODE
# In[6]:
assert original_training_data.isnull().sum().sum() == 0
print_passed('Q1a: Passed all unit tests!')
# ### Question 1b
#
# In the cell below, print the text of the first ham (i.e. 1st row) and the first spam email in the original training set.
#
# <!--
# BEGIN QUESTION
# name: q1b
# points: 1
# -->
# In[7]:
# BEGIN YOUR CODE
# -----------------------
first_ham = original_training_data[original_training_data["spam"]==0].iloc[0]["email"]
first_spam = original_training_data[original_training_data["spam"]==1].iloc[0]["email"]
# -----------------------
# END YOUR CODE
print('The text of the first Ham:')
print('------------')
print(first_ham)
print('The text of the first Spam:')
print('------------')
print(first_spam)
# In[8]:
assert len(first_ham) == 359 and len(first_spam) == 444
print_passed('Q1b: Passed all unit tests!')
# ### Question 1c
#
# Discuss one thing you notice that is different between the two emails that might relate to the identification of spam.
#
# <!--
# BEGIN QUESTION
# name: q1c
# manual: True
# points: 2
# -->
# <!-- EXPORT TO PDF -->
# Answer: `In the first ham email, the URL appears as plain text in the body, whereas in the first spam email the URL is embedded inside an HTML tag. The presence of HTML markup may therefore be a useful signal for identifying spam.`
# ## Training Validation Split
# The training data is all the data we have available for both training models and **validating** the models that we train. We therefore need to split the training data into separate training and validation datasets. You will need this **validation data** to assess the performance of your classifier once you are finished training.
#
# Note that we set the seed (random_state) to 42. This will produce a pseudo-random sequence of random numbers that is the same for every student. **Do not modify this in the following questions, as our tests depend on this random seed.**
# In[9]:
from sklearn.model_selection import train_test_split
train, val = train_test_split(
original_training_data, test_size=0.1, random_state=42)
# In[10]:
print(train.shape, val.shape) # the two splits sum to the original 8,348 rows
# # Basic Feature Engineering
#
# We would like to take the text of an email and predict whether the email is **ham** or **spam**. This is a *classification* problem, and here we use logistic regression to train a classifier.
#
# Recall that to train a logistic regression model we need:
# - a numeric feature matrix $X$
# - a vector of corresponding binary labels $y$.
#
# Unfortunately, our data are text, not numbers. To address this, we can create numeric features derived from the email text and use those features for logistic regression:
# - Each row of $X$ is an email.
# - Each column of $X$ contains one feature for all the emails.
#
# We'll guide you through creating a simple feature, and you'll create more interesting ones when you are trying to increase your accuracy.
# ### Question 2
#
# Create a function called `words_in_texts` that takes in a list of `words` and a pandas Series of email `texts`. It should output a 2-dimensional NumPy array containing one row for each email text. The row should contain either a 0 or a 1 for each word in the list: 0 if the word doesn't appear in the text and 1 if the word does. For example:
#
# ```
# >>> words_in_texts(['hello', 'bye', 'world'],
# pd.Series(['hello', 'hello worldhello']))
#
# array([[1, 0, 0],
# [1, 0, 1]])
# ```
#
# **Hint**: [pandas.Series.str.contains](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html)
#
# *The provided tests make sure that your function works correctly, so that you can use it for future questions.*
#
# <!--
# BEGIN QUESTION
# name: q2
# points: 3
# -->
# In[11]:
def words_in_texts(words, texts):
    '''
    Args:
        words (list-like): words to find
        texts (Series): strings to search in
    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    # BEGIN YOUR CODE
    # -----------------------
    indicator_array = []
    for text in texts:
        row = [int(word in text) for word in words]
        indicator_array.append(row)
    indicator_array = np.array(indicator_array)
    # -----------------------
    # END YOUR CODE
    return indicator_array
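
# A vectorized alternative sketch (illustrative only, not used below) based on the
# hinted pandas.Series.str.contains; regex=False is assumed so that words are
# matched as plain substrings. The function name here is hypothetical.
def words_in_texts_vectorized(words, texts):
    return np.array([texts.str.contains(word, regex=False).astype(int) for word in words]).T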
# In[12]:
assert np.allclose(
words_in_texts(
['hello', 'bye', 'world'],
pd.Series(['hello', 'hello worldhello'])),
np.array([[1, 0, 0], [1, 0, 1]]))
assert np.allclose(
words_in_texts(
['a', 'b', 'c', 'd', 'e', 'f', 'g'],
pd.Series(['a b c d ef g', 'a', 'b', 'c', 'd e f g', 'h', 'a h'])),
np.array(
[[1,1,1,1,1,1,1],
[1,0,0,0,0,0,0],
[0,1,0,0,0,0,0],
[0,0,1,0,0,0,0],
[0,0,0,1,1,1,1],
[0,0,0,0,0,0,0],
[1,0,0,0,0,0,0]]))
print_passed('Q2: Passed all unit tests!')
# # Basic EDA
#
# We need to identify some features that allow us to distinguish spam emails from ham emails. One idea is to compare the distribution of a single feature in spam emails to the distribution of the same feature in ham emails.
#
# If the feature is itself a binary indicator (such as whether a certain word occurs in the text), this amounts to comparing the proportion of spam emails with the word to the proportion of ham emails with the word.
# In[13]:
from IPython.display import display, Markdown
df = pd.DataFrame({
'word_1': [1, 0, 1, 0],
'word_2': [0, 1, 0, 1],
'type': ['spam', 'ham', 'ham', 'ham']
})
display(Markdown("> Our Original DataFrame has some words column and a type column. You can think of each row is a sentence, and the value of 1 or 0 indicates the number of occurances of the word in this sentence."))
display(df);
display(Markdown("> `melt` will turn columns into variale, notice how `word_1` and `word_2` become `variable`, their values are stoed in the value column"))
display(df.melt("type"))
# We can create a bar chart like the one above comparing the proportion of spam and ham emails containing certain words. Choose a set of words that are different from the ones above, but also have different proportions for the two classes. Make sure that we only consider emails from `train`.
#
# <!--
# BEGIN QUESTION
# name: q3a
# manual: True
# format: image
# points: 2
# -->
# <!-- EXPORT TO PDF format:image -->
# In[14]:
# We must do this in order to preserve the ordering of emails to labels for words_in_texts
train=train.reset_index(drop=True)
some_words = ['body', 'html', 'please', 'money', 'business', 'offer']
Phi_train = words_in_texts(some_words, train['email'])
df = pd.DataFrame(data = Phi_train, columns = some_words)
df['label'] = train['spam']
plt.figure(figsize=(8,8))
sns.barplot(x = "variable",
y = "value",
hue = "label",
data = (df
.replace({'label':
{0 : 'Ham',
1 : 'Spam'}})
.melt('label')
.groupby(['label', 'variable'])
.mean()
.reset_index()))
plt.ylim([0, 1])
plt.xlabel('Words')
plt.ylabel('Proportion of Emails')
plt.legend(title = "")
plt.title("Frequency of Words in Spam/Ham Emails")
plt.tight_layout()
plt.show()
# ### Question 3
#
# When the feature is binary, it makes sense to compare its proportions across classes (as in the previous question). Otherwise, if the feature can take on numeric values, we can compare the distributions of these values for different classes.
#
# Create a *class conditional density plot* like the one above (using `sns.distplot`), comparing the distribution of the length of spam emails to the distribution of the length of ham emails in the training set. Set the x-axis limit from 0 to 50000.
#
# <!--
# BEGIN QUESTION
# name: q3b
# manual: True
# format: image
# points: 2
# -->
# <!-- EXPORT TO PDF format:image -->
# In[15]:
# BEGIN SOLUTION
ham_length = train[train["spam"]==0]["email"].str.len()
spam_length = train[train["spam"]==1]["email"].str.len()
x_axis_limit = (0, 50000)
plt.figure(figsize=(6, 8))
sns.distplot(ham_length, hist=False, kde=True, label="Ham")
sns.distplot(spam_length, hist=False, kde=True, label="Spam")
plt.xlim(x_axis_limit)
plt.xlabel("Length of email body")
plt.ylabel("Distribution")
plt.legend()
plt.tight_layout()
plt.show()
# END SOLUTION
# ## Basic Classification
#
# Notice that the output of `words_in_texts(words, train['email'])` is a numeric matrix containing features for each email. This means we can use it directly to train a classifier!
# ### Question 4
#
# We've given you 5 words that might be useful as features to distinguish spam/ham emails. Use these words as well as the `train` DataFrame to create two NumPy arrays: `X_train` and `Y_train`.
#
# - `X_train` should be a matrix of 0s and 1s created by using your `words_in_texts` function on all the emails in the training set.
#
# - `Y_train` should be a vector of the correct labels for each email in the training set.
#
# *The provided tests check that the dimensions of your feature matrix (X) are correct, and that your features and labels are binary (i.e. consists of 0 and 1, no other values). It does not check that your function is correct; that was verified in a previous question.*
#
# <!--
# BEGIN QUESTION
# name: q4
# points: 2
# -->
# In[16]:
some_words = ['drug', 'bank', 'prescription', 'memo', 'private']
# BEGIN YOUR CODE
# -----------------------
Phi_train = words_in_texts(some_words, train['email'])
X_train = np.array(pd.DataFrame(data = Phi_train, columns = some_words))
Y_train = np.array(train["spam"])
# -----------------------
# END YOUR CODE
X_train[:5], Y_train[:5]
# In[17]:
assert X_train.shape == (7513, 5)
assert len(np.unique(X_train)) == 2
assert len(np.unique(Y_train)) == 2
print_passed('Q4: Passed all unit tests!')
# ### Question 5
#
# Now we have matrices we can give to scikit-learn!
#
# - Using the [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier, train a logistic regression model using `X_train` and `Y_train`.
# - Then, output the accuracy of the model (on the training data) in the cell below. You should get an accuracy around 75\%.
#
# <!--
# BEGIN QUESTION
# name: q5
# points: 2
# -->
# In[18]:
from sklearn.linear_model import LogisticRegression
# BEGIN YOUR CODE
# -----------------------
model = LogisticRegression()
model.fit(X_train, Y_train)
Y_train_prediction = model.predict(X_train)
training_accuracy = sum(Y_train == Y_train_prediction)/len(train)
# -----------------------
# END YOUR CODE
("Training Accuracy: ", training_accuracy)
# In[19]:
assert training_accuracy > 0.72
print_passed('Q5: Passed all unit tests!')
# ## Evaluating Classifiers
# That doesn't seem too shabby! But the classifier you made above isn't as good as this might lead us to believe. First, we are evaluating accuracy on the training set, which may lead to a misleading accuracy measure, especially if we used the training set to identify discriminative features. In future parts of this analysis, it will be safer to hold out some of our data for model validation and comparison.
#
# Presumably, our classifier will be used for **filtering**, i.e. preventing messages labeled `spam` from reaching someone's inbox. There are two kinds of errors we can make:
# - False positive (FP): a ham email gets flagged as spam and filtered out of the inbox.
# - False negative (FN): a spam email gets mislabeled as ham and ends up in the inbox.
#
# These definitions depend both on the true labels and the predicted labels. False positives and false negatives may be of differing importance, leading us to consider more ways of evaluating a classifier, in addition to overall accuracy:
#
# **Precision** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FP}}$ of emails flagged as spam that are actually spam.
#
# **Recall** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FN}}$ of spam emails that were correctly flagged as spam.
#
# **False-alarm rate** measures the proportion $\frac{\text{FP}}{\text{FP} + \text{TN}}$ of ham emails that were incorrectly flagged as spam.
#
# The following image might help:
#
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/700px-Precisionrecall.svg.png" width="500px">
#
# Note that a true positive (TP) is a spam email that is classified as spam, and a true negative (TN) is a ham email that is classified as ham.
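
# Illustrative sketch (not part of the graded solution): computing precision, recall,
# and the false-alarm rate directly from 0/1 label arrays, following the definitions above.
def classification_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp) if (tp + fp) > 0 else float('nan')
    recall = tp / (tp + fn) if (tp + fn) > 0 else float('nan')
    false_alarm_rate = fp / (fp + tn) if (fp + tn) > 0 else float('nan')
    return precision, recall, false_alarm_rate

# e.g. classification_metrics([1, 0, 1, 0], [1, 1, 0, 0]) returns (0.5, 0.5, 0.5)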
# ### Question 6a
#
# Suppose we have a classifier `zero_predictor` that always predicts 0 (never predicts positive). How many false positives and false negatives would this classifier have if it were evaluated on the training set and its results were compared to `Y_train`? Fill in the variables below (answers can be hard-coded):
#
# <!--
# BEGIN QUESTION
# name: q6a
# points: 1
# -->
# In[20]:
# BEGIN YOUR CODE
# -----------------------
zero_predictor = np.zeros(len(train))
zero_predictor_fp = np.sum(zero_predictor == 1)
zero_predictor_fn = len(Y_train[Y_train==1])
# -----------------------
# END YOUR CODE
# In[21]:
assert zero_predictor_fp + zero_predictor_fn == 1918
print_passed('Q6a: Passed all unit tests!')
# ### Question 6b
#
# What are the accuracy and recall of `zero_predictor` (classifies every email as ham) on the training set? Do NOT use any `sklearn` functions.
#
# <!--
# BEGIN QUESTION
# name: q6b
# points: 1
# -->
# In[22]:
# BEGIN YOUR CODE
# -----------------------
zero_predictor_acc = sum(Y_train == zero_predictor)/len(zero_predictor)
zero_predictor_recall = 0
# -----------------------
# END YOUR CODE
# In[23]:
assert np.isclose(zero_predictor_acc + zero_predictor_recall, 0.7447091707706642)
print_passed('Q6b: Passed all unit tests!')
# ### Question 6c
#
# Compute the precision, recall, and false-alarm rate of the `LogisticRegression` classifier created and trained in Question 5. **Note: Do NOT use any `sklearn` built-in functions.**
#
# <!--
# BEGIN QUESTION
# name: q6d
# points: 2
# -->
# In[24]:
# BEGIN YOUR CODE
# -----------------------
tp = sum((Y_train == 1) & (Y_train_prediction == 1))
fp = sum((Y_train == 0) & (Y_train_prediction == 1))
fn = sum((Y_train == 1) & (Y_train_prediction == 0))
tn = len(Y_train) - (tp + fp + fn)
logistic_predictor_precision = tp / (tp + fp)
logistic_predictor_recall = tp / (tp + fn)
logistic_predictor_far = fp / (fp + tn)
print(tp, fp, fn, tn)
# -----------------------
# END YOUR CODE
# In[25]:
assert np.isclose(logistic_predictor_precision, 0.6422287390029325)
assert np.isclose(logistic_predictor_recall, 0.11418143899895725)
assert np.isclose(logistic_predictor_far, 0.021805183199285077)
print_passed('Q6c: Passed all unit tests!')
# ### Question 6d
#
# 1. Our logistic regression classifier got 75.6% prediction accuracy (number of correct predictions / total). How does this compare with predicting 0 for every email?
# 1. Given the word features we gave you above, name one reason this classifier is performing poorly. Hint: Think about how prevalent these words are in the email set.
# 1. Which of these two classifiers would you prefer for a spam filter and why? Describe your reasoning and relate it to at least one of the evaluation metrics you have computed so far.
#
# <!--
# BEGIN QUESTION
# name: q6f
# manual: True
# points: 3
# -->
# <!-- EXPORT TO PDF -->
# Answer:
# 1. `The logistic regression classifier's accuracy (75.6%) is only slightly higher than that of zero_predictor (74.5%), but its recall (11.4%) is higher than zero_predictor's recall of 0.`
#
# 2. `One reason the classifier performs poorly is that the given word features rarely appear in the email set. If the selected words hardly occur in either spam or ham emails, they cannot provide patterns that distinguish the two classes. For example, the word 'prescription' appears in only 9 ham emails and 46 spam emails, a very small fraction of the 7,513 training examples, so the classifier struggles to separate the classes from these features alone.`
#
# 3. `Overall, the logistic regression classifier is the better classifier: even though the accuracies are similar, its precision and recall are higher than those of zero_predictor (whose precision is undefined and whose recall is 0). zero_predictor is meaningless as a spam classifier because it labels every email as ham and can never catch a spam email. That said, for my own inbox I would still prefer zero_predictor: misclassifying a ham email as spam (a false positive) is the most costly error for me, and zero_predictor can never make that mistake, even though it also never catches spam.`
# # Part II - Moving Forward
#
# With this in mind, it is now your task to make the spam filter more accurate. In order to get full credit on the accuracy part of this assignment, you must get at least **88%** accuracy on the `validation` set.
#
# Here are some ideas for improving your model:
#
# 1. Finding better features based on the email text. Some example features are:
# 1. Number of characters in the subject / body
# 1. Number of words in the subject / body
# 1. Use of punctuation (e.g., how many '!' were there?)
# 1. Number / percentage of capital letters
# 1. Whether the email is a reply to an earlier email or a forwarded email
# 1. Finding better words to use as features. Which words are the best at distinguishing emails? This requires digging into the email text itself.
# 1. Better data processing. For example, many emails contain HTML as well as text. You can consider extracting out the text from the HTML to help you find better words. Or, you can match HTML tags themselves, or even some combination of the two.
# 1. Model selection. You can adjust parameters of your model (e.g. the regularization parameter) to achieve higher accuracy. Recall that you should use cross-validation to do feature and model selection properly! Otherwise, you will likely overfit to your training data.
#
# You may use whatever method you prefer in order to create features, but **you are not allowed to import any external feature extraction libraries**. In addition, **you are only allowed to train logistic regression models**. No random forests, k-nearest-neighbors, neural nets, etc.
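
# Illustrative sketch only (not the approach used below): a few of the numeric features
# suggested in the list above, plus cross-validated logistic regression for model selection.
# The helper name and feature choices here are hypothetical examples.
from sklearn.model_selection import cross_val_score

def simple_numeric_features(df):
    return pd.DataFrame({
        'body_length': df['email'].str.len(),                # number of characters in the body
        'num_words': df['email'].str.split().str.len(),      # number of words in the body
        'num_exclamations': df['email'].str.count('!'),      # punctuation usage
        'is_reply': df['subject'].str.startswith('Subject: Re: ').astype(int),
    })

# Example usage (kept commented out; LogisticRegression is imported in an earlier cell):
# X_demo = simple_numeric_features(train)
# scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, train['spam'], cv=5)
# print(scores.mean())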
# ## Code for building the model (preprocessing & feature engineering)
#
# In[26]:
#train = pd.DataFrame(train).dropna()
train = train.reset_index(drop = True)
train
# In[27]:
from bs4 import BeautifulSoup

def remove_html_tags(text):
    # Create a BeautifulSoup object with the mail text
    soup = BeautifulSoup(text, 'html.parser')
    # Remove HTML tags from the mail text
    cleaned_text = soup.get_text()
    return cleaned_text
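
# Quick illustration of the helper above on a hypothetical input:
# remove_html_tags('<html><body><p>free offer!</p></body></html>')  ->  'free offer!'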
# In[28]:
### Removing html tags from the given email texts
hammails = []
for email_text in train[train["spam"]==0]["email"]:
    removed = remove_html_tags(email_text)
    hammails.append(removed)

spammails = []
for email_text in train[train["spam"]==1]["email"]:
    removed = remove_html_tags(email_text)
    spammails.append(removed)

emails = []
for email_text in train["email"]:
    removed = remove_html_tags(email_text)
    emails.append(removed)
# In[29]:
### for each email, count how many words of each POS (part-of-speech) class appear in the text
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
tokenized_emails = [word_tokenize(email.lower()) for email in emails]
tagged_emails = [pos_tag(tokens) for tokens in tokenized_emails]
word_classes = [':', 'VBD', 'JJ', 'NN', "''", 'CD', 'PRP', 'VBP', 'DT', 'WDT', 'VBZ', '$', 'MD', 'VB', 'TO', 'IN', 'PRP$', 'RB', 'NNS', 'CC', 'VBG', 'VBN', 'JJS', 'RBS', '#', '.', 'NNP', 'RP', ',', 'WRB', 'POS', 'EX', 'RBR', '``', 'WP', 'JJR', '(', ')', 'FW', 'NNPS', 'WP$']
class_freq = []
for tagged_email in tagged_emails:
    freq = nltk.FreqDist(tag for word, tag in tagged_email if tag in word_classes)
    class_freq.append(freq)
df = pd.DataFrame(class_freq)
df = df.fillna(0)
# In[30]:
# add a column indicating whether the email is a reply to an earlier email (=1) or not (=0)
subject = [1 if subj.startswith("Subject: Re: ") else 0 for subj in train["subject"]]
df["subject"] = subject
# add a column indicating whether the email text contains the "url: " string
url_presence = [1 if "url: " in email else 0 for email in train["email"]]
df["url_presence"] = url_presence
df["label"] = train["spam"]
df = df.astype(int)
df
# In[31]:
# figuring out which words "ham" mail contains a lot
import re
import nltk
nltk.download("punkt")
nltk.download('averaged_perceptron_tagger')

# Creating a bag-of-words for feature engineering
words = []
tagging = {}
word_counts = {}
for text in hammails:
    text = text.replace("\t", " ").replace("\n", " ")
    text_words = nltk.word_tokenize(text)
    words.extend(text_words)
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1
pos_tags = nltk.pos_tag(list(word_counts.keys()))  # POS-tag each unique ham word
for word, pos in pos_tags:
    if pos in tagging:            # tag already seen
        tagging[pos].add(word)    # set.add is a no-op if the word is already present
    else:                         # new tag
        tagging[pos] = {word}
pos_tags_dictionary = {}
for word, pos in pos_tags:
    if word not in pos_tags_dictionary:  # keep the first tag seen for each word
        pos_tags_dictionary[word] = pos
sort_word_counts = dict(sorted(word_counts.items(), key=lambda x: x[1], reverse=True))
# In[32]:
# figuring out which words "spam" mail contains a lot
words_spam = []
tagging_spam = {}
word_counts_spam = {}
for text in spammails:
    text = text.replace("\t", " ").replace("\n", " ")
    text_words = nltk.word_tokenize(text)
    words_spam.extend(text_words)
for word in words_spam:
    word_counts_spam[word] = word_counts_spam.get(word, 0) + 1
pos_tags = nltk.pos_tag(list(word_counts_spam.keys()))  # POS-tag each unique spam word
for word, pos in pos_tags:
    if pos in tagging_spam:            # tag already seen
        tagging_spam[pos].add(word)
    else:                              # new tag
        tagging_spam[pos] = {word}
pos_tags_dictionary_spam = {}
for word, pos in pos_tags:
    if word not in pos_tags_dictionary_spam:  # keep the first tag seen for each word
        pos_tags_dictionary_spam[word] = pos
sort_word_counts_spam = dict(sorted(word_counts_spam.items(), key=lambda x: x[1], reverse=True))
# In[33]:
## figuring out the words that one class contains a lot, but the other class doesn't
## these words can serve as model features
distinguishing_word = []
for word in sort_word_counts_spam.keys():
    if word in sort_word_counts.keys():
        if sort_word_counts_spam[word] >= 1200 and sort_word_counts[word] <= 300:
            distinguishing_word.append(word)
        if sort_word_counts_spam[word] <= 300 and sort_word_counts[word] >= 1200:
            distinguishing_word.append(word)
    if word not in sort_word_counts.keys() and sort_word_counts_spam[word] > 1200:
        distinguishing_word.append(word)
for word in sort_word_counts.keys():
    if word not in sort_word_counts_spam.keys() and sort_word_counts[word] > 1200:
        distinguishing_word.append(word)
distinguishing_word
# In[34]:
## training the model
# -----------------------
Phi_train = words_in_texts(distinguishing_word, train['email'])
Phi_test = words_in_texts(distinguishing_word, val['email'])
X_train = pd.DataFrame(data = Phi_train, columns = distinguishing_word)
X_train["subject"] = subject
X_train["url_presence"] = url_presence
X_train = np.array(X_train)
X_test = pd.DataFrame(data = Phi_test, columns = distinguishing_word)
X_test["subject"] = [1 if subj.startswith("Subject: Re: ") else 0 for subj in val["subject"]]
X_test["url_presence"] = [1 if "url: " in email else 0 for email in val["email"]]
X_test = np.array(X_test)
Y_train = np.array(train["spam"])
Y_test = np.array(val["spam"])
# -----------------------
X_train[:5], Y_train[:5]
# In[35]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, Y_train)
Y_train_prediction = model.predict(X_train)
Y_test_prediction = model.predict(X_test)
training_accuracy = sum(Y_train == Y_train_prediction)/len(train)
testing_accuracy = sum(Y_test == Y_test_prediction)/len(val)
# -----------------------
tp = sum((Y_test == 1) & (Y_test_prediction == 1))
fp = sum((Y_test == 0) & (Y_test_prediction == 1))
fn = sum((Y_test == 1) & (Y_test_prediction == 0))
tn = len(Y_test) - (tp + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("Testing Accuracy: ")
print(testing_accuracy)
print("Testing precision:")
print(precision)
print("Testing recall:")
print(recall)
# ### Question 7: EDA
#
# In the cell below, show a visualization that you used to select features for your model. Include both
#
# 1. A plot showing something meaningful about the data that helped you during feature / model selection.
# 2. 2-3 sentences describing what you plotted and what its implications are for your features.
#
# Feel free to create as many plots as you want in your process of feature selection, but select one for the response cell below.
#
# **You should not just produce an identical visualization to question 3.** Specifically, don't show us a bar chart of proportions, or a one-dimensional class-conditional density plot. Any other plot is acceptable, as long as it comes with thoughtful commentary. Here are some ideas:
#
# 1. Consider the correlation between multiple features (look up correlation plots and `sns.heatmap`).
# 1. Try to show redundancy in a group of features (e.g. `body` and `html` might co-occur relatively frequently, or you might be able to design a feature that captures all html tags and compare it to these).
# 1. Visualize which words have high or low values for some useful statistic.
# 1. Visually depict whether spam emails tend to be wordier (in some sense) than ham emails.
# Generate your visualization in the cell below and provide your description in a comment.
#
# <!--
# BEGIN QUESTION
# name: q8
# manual: True
# format: image
# points: 6
# -->
# <!-- EXPORT TO PDF format:image -->
# In[36]:
# Write your description (2-3 sentences) as a comment here:
# After removing the HTML tags from the email text, I extracted words that appear very often in spam mail but rarely in ham mail, or very often in ham mail but rarely in spam mail.
# The heat map plotted below shows, for each extracted word, how many spam emails and how many ham emails contain it.
# Some words appear in many ham emails but in almost no spam emails (and vice versa), so indicator features for whether these words appear in the email text were used to build the spam/ham classifier.
# Write the code to generate your visualization here:
X_train = pd.DataFrame(X_train)
X_train.columns = distinguishing_word+["subject", "url_presence"]
X_train["label"] = train["spam"]
data = X_train.replace({'label': {0 : 'Ham', 1 : 'Spam'}}).groupby(['label']).sum()
plt.figure(figsize = (20,4))
sns.heatmap(data)
# ### Question 8: Precision-Recall Curve
#
# We can trade off between precision and recall. In most cases we won't be able to get both perfect precision (i.e. no false positives) and recall (i.e. no false negatives), so we have to compromise.
#
# Recall that logistic regression calculates the probability that an example belongs to a certain class.
# * Then, to classify an example we say that an email is spam if our classifier gives it $\ge 0.5$ probability of being spam.
# * However, *we can adjust that cutoff*: we can say that an email is spam only if our classifier gives it $\ge 0.7$ probability of being spam.
#
# This is how we can trade off false positives and false negatives. The precision-recall curve shows this trade off for each possible cutoff probability. In the cell below, [plot a precision-recall curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#plot-the-precision-recall-curve) for your final classifier.
#
# <!--
# BEGIN QUESTION
# name: q9
# manual: True
# points: 3
# -->
# <!-- EXPORT TO PDF -->
# In[38]:
from sklearn.metrics import precision_recall_curve
# Note that you'll want to use the .predict_proba(...) method for your classifier
# instead of .predict(...) so you get probabilities, not classes
# BEGIN YOUR CODE
# -----------------------
y_true = Y_test
y_scores = model.predict_proba(X_test)[:, 1]  # probability of the spam class, not hard 0/1 predictions
# Compute precision and recall values
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# Plot the PR curve
plt.plot(recall, precision)
# Set plot title and axis labels
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
# Set plot axis limits
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
# Show the plot
plt.show()
# -----------------------
# END YOUR CODE
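
# Illustrative follow-up sketch: raising the decision cutoff from the default 0.5 to 0.7,
# as described above, trades recall for precision. This reuses the fitted model and the
# validation arrays from the cells above.
probs = model.predict_proba(X_test)[:, 1]      # estimated probability of spam for each validation email
preds_at_070 = (probs >= 0.7).astype(int)      # stricter cutoff than the default 0.5
print("Emails flagged as spam at cutoff 0.5:", int(np.sum(probs >= 0.5)))
print("Emails flagged as spam at cutoff 0.7:", int(np.sum(preds_at_070)))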
# ### Congratulations! You have completed HW 3.