Building a Spam Filter with Python and Scikit-Learn

Introduction
Prerequisites
Setup
Data Preprocessing
Text Classification
Evaluation and Testing
Conclusion

Introduction

Spam emails are a nuisance and can even pose security risks. In this tutorial, we will build a spam filter using Python and Scikit-Learn, a widely used machine learning library. By the end of this tutorial, you will have a working spam filter that classifies emails as spam or non-spam (ham), along with a way to measure how accurate it is on held-out emails.

Prerequisites

To follow along with this tutorial, you should have a basic understanding of Python programming. Familiarity with Scikit-Learn and machine learning concepts will also be helpful, but not necessary.

Setup

Before we start building our spam filter, we need to set up our development environment. Follow these steps to get started:

Install Python: If you don’t already have Python installed, download and install the latest version from the official Python website (https://www.python.org/downloads/). Make sure to select the option to add Python to your system’s PATH.
Install Scikit-Learn: Open a terminal or command prompt and run the following command to install Scikit-Learn:
```
pip install scikit-learn
```
Download the Dataset: We will be using the Enron-Spam email dataset for training our spam filter. Download the dataset and extract the archive to a suitable location on your computer. The loading code in this tutorial assumes the extracted enron1 folder, which contains a ham subfolder and a spam subfolder.

Import Necessary Libraries: Open your preferred Python IDE or text editor and create a new Python file. Import the required libraries as shown below:

import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
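
As a quick sanity check that the installation succeeded, you can print the installed Scikit-Learn version. This is only a minimal sketch; the version it prints depends on your environment.

```python
import sklearn

# If this import fails, Scikit-Learn is not installed in the active environment
print(f"scikit-learn version: {sklearn.__version__}")
```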

Now that our development environment is set up, let’s move on to data preprocessing.

Data Preprocessing

To train our spam filter, we need a labeled dataset of spam and non-spam emails. Follow these steps to preprocess the Enron-Spam dataset:

Load the Data: Use the following code to read every email into a Python list, with a parallel list holding each email’s label (spam or ham):

# Set the path to the extracted dataset folder
dataset_path = '/path/to/enron1'

# Create empty lists to store email text and labels
emails = []
labels = []

# Iterate over the directories in the dataset folder
for folder_name in os.listdir(dataset_path):
    if folder_name in ['spam', 'ham']:
        folder_path = os.path.join(dataset_path, folder_name)
        for file_name in os.listdir(folder_path):
            file_path = os.path.join(folder_path, file_name)
            with open(file_path, encoding='latin-1') as file:
                emails.append(file.read())
                labels.append(folder_name)
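
Before splitting the data, it can help to check how many emails of each class were loaded, since a heavily imbalanced dataset affects how the accuracy should be read. The sketch below uses a small hypothetical `labels` list as a stand-in for the one built by the loading loop.

```python
from collections import Counter

# Hypothetical stand-in for the `labels` list built by the loading loop
labels = ["ham", "ham", "spam", "ham", "spam"]

# Count how many emails fall into each class
label_counts = Counter(labels)
print(label_counts)

# Fraction of all emails that are labeled as spam
spam_fraction = label_counts["spam"] / len(labels)
print(f"Spam fraction: {spam_fraction:.2f}")
```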

Create Training and Testing Sets: Split the data into training and testing sets using Scikit-Learn’s train_test_split function:

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(emails, labels, test_size=0.2, random_state=42)
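
If the classes are imbalanced, one option is to pass `stratify=labels` so that both sets keep roughly the same spam-to-ham ratio. This is an optional variation rather than part of the original split, and the toy data below is hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 ham messages and 5 spam messages
emails = [f"ham message {i}" for i in range(15)] + [f"spam message {i}" for i in range(5)]
labels = ["ham"] * 15 + ["spam"] * 5

# stratify=labels keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.2, random_state=42, stratify=labels
)

train_spam = sum(1 for label in y_train if label == "spam")
test_spam = sum(1 for label in y_test if label == "spam")
print(f"Train: {len(X_train)} emails, {train_spam} spam")
print(f"Test: {len(X_test)} emails, {test_spam} spam")
```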

Tokenization: Convert each email into a vector of word counts using Scikit-Learn’s CountVectorizer, which tokenizes the text and builds its vocabulary from the training data:

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the training data
X_train_counts = vectorizer.fit_transform(X_train)
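
To see what CountVectorizer produces, the following sketch runs it on three short, made-up documents and prints the learned vocabulary next to the count matrix. The documents are hypothetical; the real emails yield a much larger, sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the training emails
docs = ["free money now", "meeting at noon", "free free offer"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# List the vocabulary terms in the order of their column index
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)

# Each row is a document, each column a vocabulary term
dense_counts = counts.toarray()
print(dense_counts)
```

Each row of the matrix corresponds to one document and each column to one vocabulary term, so the third document has a count of 2 in the column for "free".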

Feature Extraction: Reweight the word counts into TF-IDF features using Scikit-Learn’s TfidfTransformer, which down-weights words that appear in many emails:
```
# Create an instance of TfidfTransformer
transformer = TfidfTransformer()

# Fit the transformer on the training data
X_train_tfidf = transformer.fit_transform(X_train_counts)
```
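
To make the TF-IDF weighting less of a black box, the sketch below checks the inverse document frequency that TfidfTransformer learns with its default settings (`smooth_idf=True`, `norm="l2"`) against a manual calculation, idf = ln((1 + n) / (1 + df)) + 1. The tiny count matrix is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 2 documents (rows) and 2 terms (columns)
counts = np.array([[1, 0],
                   [1, 1]])

transformer = TfidfTransformer()  # defaults: smooth_idf=True, norm="l2"
tfidf = transformer.fit_transform(counts)

# Manual IDF: n is the number of documents, df the document frequency per term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print("Learned IDF: ", transformer.idf_)
print("Expected IDF:", expected_idf)

# With norm="l2", every non-empty row is scaled to unit length
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print("Row norms:", row_norms)
```

The term that occurs in both documents receives the lower IDF, which is why very common words contribute less to the final features.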
Now that we have preprocessed the data, let’s move on to text classification.

Text Classification

In this step, we will train a Naive Bayes classifier to classify emails as spam or non-spam based on the extracted features. Follow these steps:

Train the Classifier: Use the following code to train a Naive Bayes classifier:

# Create an instance of MultinomialNB
classifier = MultinomialNB()

# Train the classifier on the training data
classifier.fit(X_train_tfidf, y_train)

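Prepare Testing Data: Transform the testing data with the same fitted vectorizer and transformer as the training data. Call transform rather than fit_transform here, so that the vocabulary and IDF weights learned from the training data are reused: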

# Transform the testing data
X_test_counts = vectorizer.transform(X_test)
X_test_tfidf = transformer.transform(X_test_counts)
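
An alternative way to guarantee that training and testing data go through identical steps is to chain everything in a Scikit-Learn Pipeline. This is a sketch of that option, not the approach used in the rest of this tutorial; the training texts and the name `spam_pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
train_texts = [
    "win free money now",
    "claim your free prize now",
    "meeting agenda for monday",
    "lunch with the team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# fit() fits every step in order; predict() only transforms, then classifies
spam_pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# Raw text goes in directly; the pipeline applies the fitted steps itself
predictions = spam_pipeline.predict(["free money prize", "team meeting monday"])
print(predictions)
```

Because the pipeline accepts raw text, there is no separate vectorizer or transformer to forget when new data arrives.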

Make Predictions: Use the trained classifier to make predictions on the testing data:
```
# Make predictions on the testing data
y_pred = classifier.predict(X_test_tfidf)
```
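
Besides hard labels, MultinomialNB can report class probabilities through `predict_proba`, whose columns follow the order of `classes_`. The sketch below shows this on hypothetical toy data; the probabilities from such a tiny training set are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data
train_texts = [
    "win free money now",
    "claim your free prize now",
    "meeting agenda for monday",
    "lunch with the team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Same three steps as in the tutorial: counts, TF-IDF, Naive Bayes
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
classifier = MultinomialNB()
features = transformer.fit_transform(vectorizer.fit_transform(train_texts))
classifier.fit(features, train_labels)

# Score a new message with the already fitted vectorizer and transformer
new_features = transformer.transform(vectorizer.transform(["free prize now"]))
probabilities = classifier.predict_proba(new_features)[0]

# Pair each class name with its probability
class_probabilities = dict(zip(classifier.classes_, probabilities))
print(class_probabilities)
```
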
Evaluation and Testing

Now that we have made predictions on the testing data, let’s evaluate the performance of our spam filter. Follow these steps:

Accuracy Calculation: Calculate the accuracy of the spam filter using Scikit-Learn’s accuracy_score function:
```
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
```
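
Accuracy alone can be misleading when one class is much more common than the other, so it is often worth reporting precision and recall for the spam class as well. The sketch below uses small hypothetical label lists in place of `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels standing in for y_test and y_pred
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_hat = ["ham", "spam", "ham", "spam", "ham", "ham"]

accuracy = accuracy_score(y_true, y_hat)

# Precision: of the emails flagged as spam, how many really are spam
precision = precision_score(y_true, y_hat, pos_label="spam")

# Recall: of the real spam emails, how many were flagged
recall = recall_score(y_true, y_hat, pos_label="spam")

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```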

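Confusion Matrix: Generate a confusion matrix to see how many emails of each class were classified correctly and incorrectly. Rows correspond to the true labels and columns to the predicted labels, in sorted label order (ham, then spam):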

# Generate the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{confusion}")
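
To read the matrix without guessing the label order, you can pass `labels` explicitly and unpack the four cells. The sketch below treats spam as the positive class and uses hypothetical label lists in place of `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels standing in for y_test and y_pred
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_hat = ["ham", "spam", "ham", "spam", "ham", "ham"]

# Fix the order explicitly: row/column 0 is ham, row/column 1 is spam
cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])

# With spam as the positive class, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print(f"Ham kept as ham: {tn}")
print(f"Ham flagged as spam: {fp}")
print(f"Spam missed: {fn}")
print(f"Spam caught: {tp}")
```

For a spam filter, the false positives (legitimate emails flagged as spam) are usually the costliest error, so that cell deserves particular attention.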

Conclusion

In this tutorial, we have built a spam filter using Python and Scikit-Learn. We learned how to preprocess the data, extract features, train a Naive Bayes classifier, and evaluate the performance of our spam filter. You can now use this spam filter to classify your own emails and improve its accuracy by using a larger and more diverse dataset.

In addition to spam filtering, the techniques used in this tutorial can be applied to many other text classification tasks, such as sentiment analysis or topic classification. Experiment with different classifiers and preprocessing techniques to further improve the performance of your text classification models.
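
One way to compare classifiers is to wrap each candidate in the same pipeline and score it with cross-validation. The sketch below compares Naive Bayes with logistic regression; the toy messages and the choice of candidates are assumptions for illustration, and scores on such a small sample say nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 6 spam and 6 ham messages
texts = [
    "win free money now", "claim your free prize", "cheap pills online",
    "free offer click here", "you won a cash prize", "limited offer buy now",
    "meeting agenda for monday", "lunch with the team", "project status report",
    "see you at the meeting", "notes from the call", "draft report attached",
]
labels = ["spam"] * 6 + ["ham"] * 6

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, classifier in candidates.items():
    # Same preprocessing for every candidate, so only the classifier differs
    pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), classifier)
    scores = cross_val_score(pipeline, texts, labels, cv=3)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```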

Remember to keep updating your spam filter with new examples to make it more accurate over time. Happy coding!

Published: 9 January 2022