Using the Naive Bayes Algorithm to classify data into Spam/Not Spam

Sep 13, 2020

Here, we take an SMS dataset of 5572 messages and use the Naive Bayes algorithm to classify each SMS as Spam/Not Spam. Let's get started!

Importing the libraries

Apart from NumPy and Pandas, we need to import the Naive Bayes module and the CountVectorizer for the Bag of Words implementation.

import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

Loading the dataset

The dataset has two columns: the first indicates whether the message is spam or “ham” (not spam), and the second contains the SMS content.

data = pd.read_table('./smsspamcollection/SMSSpamCollection',header = None)
data.head()
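The raw file is tab-separated with no header row and one message per line. As a minimal sketch of how read_table parses that layout, the snippet below uses two made-up lines held in memory in place of the real file. The names argument is an optional shortcut, not used in the original, that sets the column names at load time.

```python
import io
import pandas as pd

# Two made-up lines in the same "label<TAB>message" layout as the real file.
raw = "ham\tSee you at lunch\nspam\tWIN a free prize now\n"

# header=None: the file has no header row; names= assigns column names directly.
sample = pd.read_table(io.StringIO(raw), header=None, names=["Spam", "Content"])
print(sample)
```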

Naming the columns

We name the columns as Spam and Content for better representation.

data.columns = ['Spam','Content']
data.head()

print("Total Number of data Points = ", data.shape[0])

Total Number of data Points =  5572

Here we define a function called partition to label the spam as 1 and the non-spam as 0

def partition(x):
    if x == 'ham':
        return 0
    else:
        return 1

Encoding the Spam column (spam = 1, not spam = 0)

onezero = data['Spam'].map(partition)
data['Spam']=onezero
data.head()
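An equivalent encoding, sketched here on a small made-up Series, passes a dictionary to map instead of a function. One difference worth knowing: a label missing from the dictionary becomes NaN, whereas the partition function sends anything other than 'ham' to 1.

```python
import pandas as pd

labels = pd.Series(["ham", "spam", "ham", "spam"])  # made-up labels

# Dictionary-based mapping: ham -> 0, spam -> 1
encoded = labels.map({"ham": 0, "spam": 1})
print(encoded.tolist())  # [0, 1, 0, 1]
```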

Splitting the data into train and test

We split the data into a training set, used to fit the model, and a test set, used later to evaluate the Naive Bayes predictions.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['Content'], data['Spam'], random_state=42)
print("The number of training data is = " , X_train.shape[0])
print("The number of testing data is = " , X_test.shape[0])

The number of training data is =  4179
The number of testing data is =  1393
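The 4179/1393 sizes follow from train_test_split's default test_size of 0.25. The sketch below checks this on placeholder data of the same length. It also shows the optional stratify argument, which the original does not use and which keeps the spam/ham ratio roughly equal in both splits. The placeholder labels are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 5572
X = np.arange(n)                          # placeholder for the messages
y = (np.arange(n) % 8 == 0).astype(int)   # made-up labels: 1 in every 8 is "spam"

# Default test_size=0.25: a quarter of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

print(len(X_tr), len(X_te))                          # split sizes
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # class ratio in each split
```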

Creating a Bag of words representation (for both train and test)

The Bag of Words representation creates a matrix from the content column, where each row corresponds to one SMS and each column denotes a word. Each cell denotes the frequency of that word in that particular SMS.

count_vect = CountVectorizer()
data_train = count_vect.fit_transform(X_train.values)
data_test  = count_vect.transform(X_test.values)  # Only transform the test data: reuse the vocabulary learned from the training set, do not refit
print("So the Bag of Words matrix for train has {a} rows and {b} columns".format(a = data_train.shape[0], b = data_train.shape[1]))

So the Bag of Words matrix for train has 4179 rows and 7490 columns
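To make the matrix concrete, here is a minimal sketch on three made-up messages. It shows the learned vocabulary, the count matrix, and what happens when transform meets words that were not seen during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "call me now", "free free call"]  # made-up messages

vect = CountVectorizer()
bow_train = vect.fit_transform(train_docs)   # learns the vocabulary, then counts

# Column order follows the alphabetically sorted vocabulary.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)                 # ['call', 'free', 'me', 'now', 'prize']
print(bow_train.toarray())   # rows: [0 1 0 1 1], [1 0 1 1 0], [1 2 0 0 0]

# transform() reuses that vocabulary; unseen words ("win", "cash") are dropped.
bow_new = vect.transform(["win free cash"])
print(bow_new.toarray())     # [[0 1 0 0 0]]
```

This is why the test set is only transformed: its columns must line up with the vocabulary the model was trained on.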

Using Naive Bayes for training the classification model

mnb = MultinomialNB()
mnb.fit(data_train,y_train)

MultinomialNB()

predicted_labels = mnb.predict(data_test)
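For intuition about what the fitted model computes, the following sketch reproduces a multinomial Naive Bayes prediction by hand on a tiny made-up count matrix and compares it with scikit-learn. It assumes the default Laplace smoothing (alpha = 1); the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count matrix: 4 messages x 3 words, labels 0 = ham, 1 = spam.
X = np.array([[2, 0, 1],
              [1, 0, 2],
              [0, 3, 0],
              [0, 2, 1]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

# log P(class): fraction of training messages in each class
log_prior = np.log(np.bincount(y) / len(y))

# log P(word | class): smoothed share of each word among the class's word counts
counts = np.array([X[y == c].sum(axis=0) for c in (0, 1)]) + alpha
log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))

# Score a new message: prior plus count-weighted word log-probabilities.
x_new = np.array([[0, 1, 1]])
manual_pred = np.argmax(x_new @ log_lik.T + log_prior, axis=1)

mnb_toy = MultinomialNB(alpha=alpha).fit(X, y)
print(manual_pred, mnb_toy.predict(x_new))  # both pick class 1 (spam)
```

The "naive" part is the sum over words: each word contributes independently to the class score, given the class.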

Classification Report

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print(classification_report(predicted_labels, y_test, target_names = ['Spam', 'Not Spam'], digits = 4))
print("======================================================\n")
print("The Accuracy score for Naive Bayes is = ",accuracy_score(predicted_labels, y_test))

              precision    recall  f1-score   support

        Spam     0.9967    0.9901    0.9934      1215
    Not Spam     0.9355    0.9775    0.9560       178

    accuracy                         0.9885      1393
   macro avg     0.9661    0.9838    0.9747      1393
weighted avg     0.9889    0.9885    0.9886      1393

======================================================

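The Accuracy score for Naive Bayes is =  0.9885139985642498

A note on reading this report: classification_report expects the true labels first and the predictions second, and target_names follow the sorted label order, in which 0 (ham) comes before 1 (spam). Because the call above passes the predictions first and lists 'Spam' before 'Not Spam', the row labelled "Spam" actually describes the ham class (label 0), the row labelled "Not Spam" describes the spam class (label 1), the precision and recall columns are swapped, and the support column counts predicted rather than true labels. The accuracy is unaffected by the argument order.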
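As a small illustration of the conventional call order (true labels first, label 0 named first), here is a sketch on made-up labels. It is not a rerun of the experiment above; it only shows how swapping the two arguments swaps precision and recall while leaving accuracy unchanged.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 0]   # made-up ground truth, 1 = spam
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]   # made-up predictions (one spam missed)

# Conventional order: true labels first, predictions second;
# target_names listed in label order (0 = Not Spam, 1 = Spam).
print(classification_report(y_true, y_pred,
                            target_names=['Not Spam', 'Spam'], digits=4))

# Precision and recall for the spam class in the conventional order...
p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
# ...and with the arguments swapped: the two metrics trade places.
p_sw, r_sw = precision_score(y_pred, y_true), recall_score(y_pred, y_true)
print(p, r, p_sw, r_sw)

# Accuracy is symmetric in its arguments.
acc = accuracy_score(y_true, y_pred)
acc_sw = accuracy_score(y_pred, y_true)
print(acc, acc_sw)
```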

Conclusion

Here, we used Naive Bayes with each word in the Bag of Words matrix treated as a separate feature, and obtained an accuracy of 98.85% on the test set.