Using the Naive Bayes Algorithm to classify data into Spam/Not Spam

Swayatta Daw
3 min read · Sep 13, 2020


Here, we take an SMS dataset of 5,572 messages and use the Naive Bayes algorithm to classify each SMS as Spam/Not Spam. Let's get started!

Importing the libraries

Apart from NumPy and Pandas, we need to import the Naive Bayes module and the CountVectorizer for the Bag of Words implementation.

import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

Loading the dataset

The dataset has two columns: the first indicates whether a message is spam or "ham" (meaning not spam), and the second contains the SMS content.

data = pd.read_table('./smsspamcollection/SMSSpamCollection',header = None)
data.head()

Naming the columns

We name the columns Spam and Content for easier reference.

data.columns = ['Spam','Content']
data.head()
print("Total Number of data Points = ", data.shape[0])Total Number of data Points =  5572

Here we define a function called partition that labels spam as 1 and non-spam as 0.

def partition(x):
    if x == 'ham':
        return 0
    else:
        return 1

Mapping the Spam column to 1 (spam) and 0 (not spam)

onezero = data['Spam'].map(partition)
data['Spam']=onezero
data.head()
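
As a side note, the same 0/1 encoding could be done in a single line by passing a dictionary to Series.map instead of writing the partition function. This is just an equivalent sketch, not an extra step to run on top of the mapping above:

# Equivalent one-liner (0 = ham / not spam, 1 = spam),
# an alternative to partition() + map above
data['Spam'] = data['Spam'].map({'ham': 0, 'spam': 1})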

Splitting the data into train and test

We split the data into a training set, used to fit the model, and a test set, kept aside for evaluating the Naive Bayes predictions later. By default, train_test_split holds out 25% of the data for testing, which gives the 4,179/1,393 split below.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['Content'], data['Spam'], random_state=42)
print("The number of training data is = " , X_train.shape[0])
print("The number of testing data is = " , X_test.shape[0])
The number of training data is = 4179
The number of testing data is = 1393
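
Since spam is the minority class, one optional tweak (not used in this post) is to pass stratify to train_test_split so the train and test sets keep the same spam/ham ratio:

# Optional variant: stratified split (variable names here are just illustrative)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    data['Content'], data['Spam'],
    random_state=42, stratify=data['Spam'])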

Creating a Bag of words representation (for both train and test)

The Bag of Words representation creates a matrix from the content column, where each row corresponds to one SMS and each column denotes a word. Each cell denotes the frequency of that word in that particular SMS.

count_vect = CountVectorizer()
data_train = count_vect.fit_transform(X_train.values)
data_test = count_vect.transform(X_test.values) #Here we are just transforming the data, not fitting it
print("So the Bag of Words matrix for train has {a} rows and {b} columns".format(a = data_train.shape[0], b = data_train.shape[1]))
So the Bag of Words matrix for train has 4179 rows and 7490 columns
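
To make the Bag of Words idea concrete, here is a tiny sketch on two made-up sentences (the sentences and variable names are mine, not from the dataset):

# Tiny illustration of CountVectorizer on two made-up sentences
toy_vect = CountVectorizer()
toy_matrix = toy_vect.fit_transform(["free prize call now", "call me later"])
print(toy_vect.get_feature_names_out())  # vocabulary: one column per word
                                         # (get_feature_names() on scikit-learn < 1.0)
print(toy_matrix.toarray())              # word counts: one row per sentence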

Using Naive Bayes for training the classification model

mnb = MultinomialNB()
mnb.fit(data_train, y_train)
predicted_labels = mnb.predict(data_test)
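
If you prefer a single object that goes from raw text straight to a prediction, scikit-learn's Pipeline can bundle the vectorizer and the classifier. This is just an optional sketch of the same model; the example message below is made up:

from sklearn.pipeline import make_pipeline

# Same model as above, packaged so raw text can be passed in directly
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())
spam_clf.fit(X_train, y_train)
print(spam_clf.predict(["Congratulations! You have won a free prize, call now"]))  # 1 = spam, 0 = not spam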

Classification Report

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(classification_report(predicted_labels, y_test, target_names = ['Not Spam', 'Spam'], digits = 4))  # label 0 = ham (not spam), label 1 = spam
print("======================================================\n")
print("The Accuracy score for Naive Bayes is = ",accuracy_score(predicted_labels, y_test))
              precision    recall  f1-score   support

    Not Spam     0.9967    0.9901    0.9934      1215
        Spam     0.9355    0.9775    0.9560       178

    accuracy                         0.9885      1393
   macro avg     0.9661    0.9838    0.9747      1393
weighted avg     0.9889    0.9885    0.9886      1393

======================================================

The Accuracy score for Naive Bayes is = 0.9885139985642498
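
Accuracy alone can hide how the errors are split between the two classes, so a confusion matrix is a useful complement. A minimal sketch (not part of the original output):

from sklearn.metrics import confusion_matrix

# Rows = actual class, columns = predicted class (0 = not spam, 1 = spam)
print(confusion_matrix(y_test, predicted_labels))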

Conclusion :

Here, we used Multinomial Naive Bayes, treating each word in the Bag of Words matrix as a separate feature, and obtained an accuracy of 98.85%.
