Using the Naive Bayes Algorithm to classify data into Spam/Not Spam
Here, we take an SMS dataset of 5572 messages and use the Naive Bayes algorithm to classify each SMS as Spam/Not Spam. Let's get started!
Importing the libraries
Apart from NumPy and pandas, we need to import the Naive Bayes module and the CountVectorizer for the Bag of Words implementation.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
Loading the dataset
The dataset has 2 columns: the first indicates whether the message is spam or "ham" (meaning not spam), and the second contains the SMS content.
data = pd.read_table('./smsspamcollection/SMSSpamCollection',header = None)
data.head()
Naming the columns
We name the columns Spam and Content so they are easier to refer to.
data.columns = ['Spam','Content']
data.head()
print("Total Number of data Points = ", data.shape[0])
Total Number of data Points = 5572
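The loading and naming steps can also be combined into one call. The sketch below assumes a tab-separated file with no header row, like the one read above; the two sample messages are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same tab-separated layout (label, then message).
sample = "ham\tSee you at lunch?\nspam\tWIN a free prize now, reply YES\n"

# read_csv with sep="\t" behaves like read_table; names= assigns the column labels on load.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=["Spam", "Content"])

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['Spam', 'Content']
```

For the real file, the same `header=None, names=[...]` arguments could be passed along with the file path instead of the in-memory sample.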
Here we define a function called partition to label the spam as 1 and the non-spam as 0
def partition(x):
    # 'ham' (not spam) becomes 0; anything else is treated as spam (1)
    if x == 'ham':
        return 0
    else:
        return 1
Encoding the Spam column (spam = 1, not spam = 0)
onezero = data['Spam'].map(partition)
data['Spam']=onezero
data.head()
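As an alternative sketch, the same encoding can be written with a dictionary passed to `map`. The sample labels below are hypothetical. One difference worth noting: with a dictionary, any label that is not a key becomes NaN, whereas the partition function above maps everything other than 'ham' to 1.

```python
import pandas as pd

# Hypothetical sample of labels, in the same 'ham'/'spam' vocabulary as the dataset.
labels = pd.Series(["ham", "spam", "ham"])

# Dictionary-based encoding: ham -> 0, spam -> 1.
label_map = {"ham": 0, "spam": 1}
encoded = labels.map(label_map)

print(encoded.tolist())  # [0, 1, 0]
```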
Splitting the data into train and test
We split the data into train and test sets so that we can fit the Naive Bayes model on one and evaluate its predictions on the other.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['Content'], data['Spam'], random_state=42)
print("The number of training data is = " , X_train.shape[0])
print("The number of testing data is = " , X_test.shape[0])
The number of training data is = 4179
The number of testing data is = 1393
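The 4179/1393 split follows from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set. A minimal sketch on placeholder data of the same length (the array here is just a stand-in, not the SMS data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in array with as many entries as the dataset has rows.
X = np.arange(5572)

# No test_size given, so the default of 0.25 applies.
X_tr, X_te = train_test_split(X, random_state=42)

print(len(X_tr), len(X_te))  # 4179 1393
```

If the class ratio should be preserved in both parts, `train_test_split` also accepts a `stratify` argument; the split above does not use it.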
Creating a Bag of words representation (for both train and test)
The Bag of Words representation creates a matrix from the content column, where each row corresponds to one SMS and each column denotes a word. Each cell denotes the frequency of that word in that particular SMS.
count_vect = CountVectorizer()
data_train = count_vect.fit_transform(X_train.values)
data_test = count_vect.transform(X_test.values)  # Only transform the test data: the vocabulary is learned from the training set
print("So the Bag of Words matrix for train has {a} rows and {b} columns".format(a = data_train.shape[0], b = data_train.shape[1]))
So the Bag of Words matrix for train has 4179 rows and 7490 columns
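To make the matrix concrete, here is a small sketch on three made-up messages (they are illustrative, not from the dataset). It shows that repeated words are counted, and that a word seen only at test time ("today") is dropped because it is not in the vocabulary learned from the training messages.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages for illustration only.
train_msgs = ["free prize free entry", "see you at lunch"]
test_msgs = ["free lunch today"]

vec = CountVectorizer()
train_bow = vec.fit_transform(train_msgs)  # learns the vocabulary and counts words
test_bow = vec.transform(test_msgs)        # reuses the learned vocabulary

# Columns are ordered alphabetically by word.
print(sorted(vec.vocabulary_))
# ['at', 'entry', 'free', 'lunch', 'prize', 'see', 'you']
print(train_bow.toarray())
# [[0 1 2 0 1 0 0]
#  [1 0 0 1 0 1 1]]
print(test_bow.toarray())
# [[0 0 1 1 0 0 0]]
```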
Using Naive Bayes for training the classification model
mnb = MultinomialNB()
mnb.fit(data_train, y_train)
MultinomialNB()
predicted_labels = mnb.predict(data_test)
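For intuition about what the fit and predict steps compute, the sketch below reproduces the multinomial Naive Bayes score by hand on a tiny made-up count matrix and compares it with scikit-learn. The data and variable names are hypothetical. It assumes the default `alpha=1.0` (Laplace smoothing), which is what `MultinomialNB()` uses when called without arguments.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 messages x 3 vocabulary words.
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

clf = MultinomialNB()  # default alpha=1.0
clf.fit(X, y)

# Manual version: log prior per class plus smoothed log word probabilities.
alpha = 1.0
log_prior = np.log(np.array([np.mean(y == c) for c in (0, 1)]))
counts = np.array([X[y == c].sum(axis=0) for c in (0, 1)]) + alpha
log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))

# Score a new message: each word count multiplies that word's log probability.
new_msg = np.array([[1, 0, 1]])
scores = log_prior + new_msg @ log_likelihood.T  # shape (1, 2)
manual_pred = scores.argmax(axis=1)

print(manual_pred, clf.predict(new_msg))  # [1] [1]
```

The "naive" part is the sum over words: each word contributes independently to the class score, regardless of the other words in the message.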
Classification Report
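from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
# Label 0 is ham (Not Spam) and label 1 is Spam, so the names are listed in that order.
# Note: scikit-learn expects (y_true, y_pred). The predictions are passed first here,
# so the precision and recall columns are swapped relative to that convention and
# support counts predicted labels. Accuracy is unaffected by the argument order.
print(classification_report(predicted_labels, y_test, target_names = ['Not Spam', 'Spam'], digits = 4))
print("======================================================\n")
print("The Accuracy score for Naive Bayes is = ",accuracy_score(predicted_labels, y_test))
              precision    recall  f1-score   support
    Not Spam     0.9967    0.9901    0.9934      1215
        Spam     0.9355    0.9775    0.9560       178
    accuracy                         0.9885      1393
   macro avg     0.9661    0.9838    0.9747      1393
weighted avg     0.9889    0.9885    0.9886      1393
======================================================
The Accuracy score for Naive Bayes is = 0.9885139985642498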
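To show how the precision and recall columns relate to raw counts, here is a small sketch on made-up labels (not the model's actual predictions). scikit-learn's metric functions take the true labels first and the predictions second, and the last line shows that swapping the two arguments turns precision into recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2 / 4
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 2 / 3

print(tn, fp, fn, tp)  # 3 2 1 2
print(precision)       # 0.5

# Swapping the arguments exchanges false positives and false negatives.
swapped_precision = precision_score(y_pred, y_true)
print(swapped_precision == recall)  # True
```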
Conclusion
Here, we used Naive Bayes with each word in the Bag of Words matrix as a separate feature and obtained an accuracy of 98.85% on the test set.