The Ironic Sophistication of Naive Bayes Classifiers
Filtering spam with Multinomial Naive Bayes (From Scratch)
In the first half of 2020, more than 50% of all email traffic on the planet was spam. Spammers typically receive about 1 reply for every 12,500,000 emails sent, which doesn’t sound like much until you realize that more than 15 billion spam emails are sent every single day. Spam costs businesses an estimated 20–200 billion dollars per year, and that number is only expected to grow.
What can we do to save ourselves from spam?
Naive Bayes Classifiers
In probability theory and statistics, Bayes’ theorem (alternatively Bayes’s theorem, Bayes’s law or Bayes’s rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
For example, if the risk of developing health problems is known to increase with age, Bayes’s theorem allows the risk to an individual of a known age to be assessed more accurately than simply assuming that the individual is typical of the population as a whole.
Bayes Theorem Explained
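In symbols, Bayes’ theorem for two events A and B (with P(B) ≠ 0) reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```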
A Naive Bayes Classifier is a probabilistic classifier that applies Bayes’ theorem with strong (naive) independence assumptions between features.
Probabilistic classifier: a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to.
Independence: Two events are independent if the occurrence of one does not affect the probability of the other occurring (equivalently, does not affect the odds). That assumption of independence between features is what makes Naive Bayes naive! In the real world the independence assumption is often violated, but naive Bayes classifiers still tend to perform very well.
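Concretely, the naive assumption lets the classifier factor the likelihood of an entire message into a product of per-word probabilities:

```latex
P(w_1, w_2, \dots, w_n \mid \mathrm{Spam}) = \prod_{i=1}^{n} P(w_i \mid \mathrm{Spam})
```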
Stopping the Spam
For this example we’ll be using the SMSSpamCollection dataset, consisting of 5,572 messages, 86.6% of which are real messages and 13.4% of which are spam/scams. We are going to build a Naive Bayes Classifier to predict whether a message is spam or legitimate.
Above we can see our dataset has a Class feature (column) and a Message feature. Our 2 classes are ham (a legitimate message) and spam. We’ll load and split the data set into a training set and a testing set so later we can test our classifier on data it’s never seen.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Import the spam data set (tab-separated: Class, Message)
email_data = pd.read_csv('/email-dataset.csv', sep='\t', names=['Class', 'Message'])

# train_emails = 70% of the data
# test_emails  = 30% of the data
train_emails, test_emails = train_test_split(email_data, test_size=0.3, random_state=42)
```
How It Works
When we get a new message and want to determine whether it is spam, the classifier evaluates a spam equation and a ham equation over the words of the message, where w1 is the first word and (w1, w2, …, wn) is the entire message. Whichever equation yields the greater value determines how the new message is classified.
Spam Probability Equation
Ham (Authentic) Probability Equation
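Written out, these two equations take the standard multinomial naive Bayes form (the ∝ sign reads “proportional to”):

```latex
P(\mathrm{Spam} \mid w_1, \dots, w_n) \propto P(\mathrm{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \mathrm{Spam})

P(\mathrm{Ham} \mid w_1, \dots, w_n) \propto P(\mathrm{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \mathrm{Ham})
```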
To calculate P(wi|Spam) and P(wi|Ham) — the last portion of the equations above — we need to use separate equations:
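With Laplace smoothing (the α term set below), these take the standard form, consistent with the term definitions that follow:

```latex
P(w_i \mid \mathrm{Spam}) = \frac{N_{w_i \mid \mathrm{Spam}} + \alpha}{N_{\mathrm{Spam}} + \alpha \cdot N_{\mathrm{Vocabulary}}}

P(w_i \mid \mathrm{Ham}) = \frac{N_{w_i \mid \mathrm{Ham}} + \alpha}{N_{\mathrm{Ham}} + \alpha \cdot N_{\mathrm{Vocabulary}}}
```

Here N_{wi|Spam} is the number of times the word wi occurs across all spam messages, and likewise for ham.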
To clarify the terms above:
Creating the Vocabulary
We need to create a vocabulary of words to feed the classifier. After iterating through the dataset and counting all instances of different words, we’re left with a vocabulary of 7,301 unique words. This is what our new dataframe will look like after this step:
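One way to build the vocabulary and the word-count table can be sketched as follows. The two messages here are invented stand-ins for the real training data; the column names follow the loading code above.

```python
import pandas as pd

# Toy stand-in for the training DataFrame (column names from the article;
# the two messages are invented for illustration)
train_emails = pd.DataFrame({
    'Class': ['ham', 'spam'],
    'Message': ['go home now', 'win free prize now'],
})

# Tokenize: strip punctuation, lowercase, split into word lists
train_emails['Message'] = (train_emails['Message']
                           .str.replace(r'\W', ' ', regex=True)
                           .str.lower()
                           .str.split())

# The vocabulary is the set of unique words across all training messages
vocabulary = sorted({word for message in train_emails['Message'] for word in message})

# One count column per vocabulary word, giving the word-count table
word_counts = pd.DataFrame(
    {word: [message.count(word) for message in train_emails['Message']]
     for word in vocabulary})
train_counts = pd.concat([train_emails.reset_index(drop=True), word_counts], axis=1)

print(len(vocabulary))  # 6 unique words in this toy corpus
```

On the real training set, the same steps produce the 7,301-word vocabulary mentioned above.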
Calculating Constants
Some of the terms in the equations above will remain the same for every new message. Let’s calculate these constants now to save ourselves from having to make these calculations many times. We’ll start with:
P(Spam) and P(Ham)
NSpam, NHam, NVocabulary
It’s important to note that:
NSpam is equal to the number of words in all the spam messages — it’s not equal to the number of spam messages, and it’s not equal to the total number of unique words in spam messages.
NHam is equal to the number of words in all the non-spam messages — it’s not equal to the number of non-spam messages, and it’s not equal to the total number of unique words in non-spam messages.
NVocabulary is equal to the number of unique words in all the messages in our training set. (7,301 as referenced above)
We’ll also use Laplace smoothing and set α = 1.
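These constants can be computed directly from the tokenized training table. The three-message DataFrame below is an invented stand-in for the real data, used only to make the sketch self-contained:

```python
import pandas as pd

# Toy stand-in for the tokenized training DataFrame (invented for illustration)
train_emails = pd.DataFrame({
    'Class': ['ham', 'spam', 'ham'],
    'Message': [['go', 'home', 'now'],
                ['win', 'free', 'prize'],
                ['see', 'you', 'now']],
})

spam_messages = train_emails[train_emails['Class'] == 'spam']
ham_messages = train_emails[train_emails['Class'] == 'ham']

# P(Spam) and P(Ham): the fraction of training messages in each class
p_spam = len(spam_messages) / len(train_emails)
p_ham = len(ham_messages) / len(train_emails)

# N_Spam and N_Ham: total word counts per class
# (not unique words, and not the number of messages)
n_spam = spam_messages['Message'].apply(len).sum()
n_ham = ham_messages['Message'].apply(len).sum()

# N_Vocabulary: number of unique words across all training messages
n_vocabulary = len({word for message in train_emails['Message'] for word in message})

# Laplace smoothing constant
alpha = 1
```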
Calculating Parameters
Now that we have the constant terms calculated above, we can move on with calculating the parameters P(wi|Spam) and P(wi|Ham).
P(wi|Spam) and P(wi|Ham) will vary depending on the individual words. For instance, P(“secret”|Spam) will have a certain probability value, while P(“cousin”|Spam) or P(“lovely”|Spam) will most likely have other values.
Therefore, each parameter will be a conditional probability value associated with each word in the vocabulary.
The parameters are calculated using these two equations:
We can also calculate this with the following Python code:
```python
# Initialize parameters
spam_parameters = {unique_word: 0 for unique_word in vocabulary}
ham_parameters = {unique_word: 0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    # Number of times this word occurs across all spam messages
    n_word_given_spam = spam_messages[word].sum()
    # P(word|Spam) with Laplace smoothing
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    spam_parameters[word] = p_word_given_spam

    # Number of times this word occurs across all non-spam messages
    n_word_given_ham = ham_messages[word].sum()
    # P(word|Ham) with Laplace smoothing
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    ham_parameters[word] = p_word_given_ham
```
Classifying A Message: Spam or Not?
Now that we have all our parameters calculated, we can start building the spam filter, which can be understood as a function that:
Takes in as input a new message (word 1, word 2, …, word n).
Calculates P(Spam|w1, w2, …, wn) and P(Ham|w1, w2, …, wn).
Compares the values of P(Spam|w1, w2, …, wn) and P(Ham|w1, w2, …, wn), and:
If P(Ham|w1, w2, …, wn) > P(Spam|w1, w2, …, wn), then the message is classified as ham.
If P(Ham|w1, w2, …, wn) < P(Spam|w1, w2, …, wn), then the message is classified as spam.
If P(Ham|w1, w2, …, wn) = P(Spam|w1, w2, …, wn), then the algorithm may request human help.
Note that some new messages will contain words that are not part of the vocabulary. We will simply ignore these words when we’re calculating the probabilities.
We can also calculate those values with the following Python code:
```python
import re

def NaiveBayesClassifier(message):
    # Remove punctuation with re.sub(), lowercase, and split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    # Start from the priors P(Spam) and P(Ham)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Multiply in P(word|class) for each word in the vocabulary;
    # words outside the vocabulary are simply ignored
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    print('Spam Probability:', p_spam_given_message)
    print('Ham Probability:', p_ham_given_message)

    # Compare p_spam_given_message with p_ham_given_message,
    # then print a classification label
    if p_ham_given_message > p_spam_given_message:
        print('Email Type: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Email Type: Spam')
    else:
        print('Email Type: Equal probabilities of spam or ham')
```
We’ve successfully built a naive Bayes classifier! Let’s test it!
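To see the whole pipeline working end to end, here is a miniature self-contained run on an invented four-message corpus (the messages are made up for illustration; the real article uses the SMS dataset and the functions above):

```python
from collections import Counter

# Invented toy corpus: two spam and two ham messages
train = [
    ("spam", "win a free prize now"),
    ("spam", "free cash prize claim now"),
    ("ham",  "are we still meeting later"),
    ("ham",  "see you at lunch later"),
]

alpha = 1
spam_words = [w for c, m in train if c == "spam" for w in m.split()]
ham_words = [w for c, m in train if c == "ham" for w in m.split()]
vocabulary = set(spam_words) | set(ham_words)

p_spam = sum(c == "spam" for c, _ in train) / len(train)
p_ham = 1 - p_spam
spam_counts, ham_counts = Counter(spam_words), Counter(ham_words)
n_spam, n_ham, n_vocab = len(spam_words), len(ham_words), len(vocabulary)

def classify(message):
    p_s, p_h = p_spam, p_ham
    for word in message.lower().split():
        if word not in vocabulary:   # ignore out-of-vocabulary words
            continue
        p_s *= (spam_counts[word] + alpha) / (n_spam + alpha * n_vocab)
        p_h *= (ham_counts[word] + alpha) / (n_ham + alpha * n_vocab)
    return "spam" if p_s > p_h else "ham"

print(classify("claim your free prize"))  # spam
print(classify("meeting at lunch"))       # ham
```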
Congratulations! It looks like we’ve classified both of our examples correctly!
But how accurate is our classifier when we try it out on the entire test set?
Measuring Accuracy
Remember when we split the data into training and testing sets? Here’s where our testing set comes into play. We’ll classify every message in the test set (1,672 messages) without showing our classifier whether they are spam. We can accomplish this by editing our classify function to return classification labels instead of printing them.
```python
def NB_test(message):
    # Remove punctuation with re.sub(), lowercase, and split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    # Calculate p_spam_given_message and p_ham_given_message
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]

    # Compare the two values, then RETURN a label instead of printing it
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'equal'
```
Now we can use that function to create a new column/feature in our dataset called “predicted” which is where our classifier will store its predictions for each message (row in the dataframe).
```python
test_emails['predicted'] = test_emails['Message'].apply(NB_test)
test_emails.head()
```
Now we’re going to test the accuracy of our model by dividing the number of correctly classified emails by the total number of emails in our testing set.
```python
correctly_predicted_messages = 0
total_messages = test_emails.shape[0]

# For each row in our test set
for _, row in test_emails.iterrows():
    # If the true class label is equal to the label predicted,
    # increment the number of correctly predicted messages by 1
    if row['Class'] == row['predicted']:
        correctly_predicted_messages += 1

print('Correctly Classified Emails:', correctly_predicted_messages)
print('Incorrectly Classified Emails:', total_messages - correctly_predicted_messages, '\n')
print('Accuracy:', correctly_predicted_messages / total_messages)
```
Correct: 1655
Incorrect: 17
Accuracy: 0.9898 = 98.98%
We’ve successfully built a spam filter with naive Bayes! It’s crazy to think that a statistical model with such naive assumptions can perform a classification task like this with nearly 99% accuracy.
Code, Contact Info, and Sources
GitHub Repo: GitHub.com/JackRossProjects/Naive-Bayes-From-Scratch
LinkedIn: LinkedIn.com/in/JackCalvinRoss
Portfolio Website: JackRossProjects.com
Stanford CS124 Text Classification and Naive Bayes
StatQuest: Naive Bayes, Clearly Explained
UCI ML Repository: SMS Spam Collection Data Set