Identifying spam emails is an important task for individuals and organizations, as it helps them protect their privacy and security. There are a number of characteristics that can be used to identify spam emails, such as the presence of certain words or phrases, the sender’s email address, or the format of the email.
In this blog post, we will develop an advanced machine learning model to identify spam emails using Python. We will start by collecting and preprocessing the data, and then we will train and evaluate a model using a number of different techniques.
Data collection and preprocessing:
To create our machine learning model, we will need a dataset of emails that we can use to train the model. There are a number of publicly available datasets that we can use for this purpose, such as the SMS Spam Collection dataset available on Kaggle, which is what the code below uses.
To begin, we will import pandas and load the dataset. Note that `pd.read_csv` cannot fetch a dataset directly from a Kaggle download URL (the download requires authentication and returns a zip archive), so the code below assumes the file has already been downloaded and extracted locally.

```python
import pandas as pd

# Assumes the raw, tab-separated SMS Spam Collection file has been downloaded
# and extracted from https://www.kaggle.com/uciml/sms-spam-collection-dataset
# (the local filename is an assumption; adjust the path to match your copy)
data = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])
```
Next, we will split the data into features and the target variable. The features will be the text of the email, and the target variable will be whether or not the email is spam.
X = data["text"] y = data["label"]
To prepare the data for training, we will need to preprocess the text of the emails. This can be done using techniques such as stemming, lemmatization, and stop word removal; here we will apply stemming and stop word removal.
We can perform these steps using the `nltk` library.
```python
import string

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required nltk resources (only needed once)
nltk.download("punkt")
nltk.download("stopwords")

# Initialize stemmer and stop words
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

# Preprocess text
X_processed = []
for text in X:
    # Remove punctuation and lowercase
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # Tokenize, drop stop words, and stem the remaining words
    tokens = [stemmer.stem(word) for word in word_tokenize(text) if word not in stop_words]
    # Rejoin stemmed words
    X_processed.append(" ".join(tokens))
```
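Before the text can go into a neural network, the processed strings must be turned into fixed-length integer sequences, the labels encoded as integers, and the data split into training and test sets. Here is a minimal sketch of that step, assuming the dataset uses "ham"/"spam" as its label values:

```python
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Convert the processed text to integer sequences and pad to a fixed length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_processed)
X_sequences = tokenizer.texts_to_sequences(X_processed)
X_padded = pad_sequences(X_sequences, maxlen=500)

# Vocabulary size for the embedding layer (+1 for the reserved padding index 0)
vocab_size = len(tokenizer.word_index) + 1

# Encode labels as integers (assumes the labels are "ham"/"spam"): ham -> 0, spam -> 1
y_encoded = (y == "spam").astype(int)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_padded, y_encoded, test_size=0.2)
```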
Model development:
Now that our data is prepared, we can start developing our machine learning model. There are a number of different algorithms that we could use for this task, such as naive Bayes or support vector machines (SVMs). For this example, we will use a deep learning model, specifically a long short-term memory (LSTM) neural network.
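Before committing to a neural network, it is often worth checking how far a simpler baseline gets. Here is a minimal sketch (our own aside, not part of the main pipeline) of a multinomial naive Bayes classifier over simple word counts, using the `X_processed` and `y` variables from above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Bag-of-words features over the preprocessed text
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(X_processed)

X_tr, X_te, y_tr, y_te = train_test_split(X_counts, y, test_size=0.2)

# Train and score the naive Bayes baseline
nb = MultinomialNB()
nb.fit(X_tr, y_tr)
print("Naive Bayes accuracy:", nb.score(X_te, y_te))
```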
LSTM networks are particularly well-suited for natural language processing tasks, as they are able to capture long-term dependencies in the data. To create the LSTM model, we will use the `Sequential` and `LSTM` classes from the `keras` library.
```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Create LSTM model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128))
model.add(LSTM(units=128, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(units=1, activation="sigmoid"))

# Compile the model for binary classification
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```
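Note that the model is compiled before training: binary cross-entropy is the natural loss for a single sigmoid output, and tracking accuracy as a metric gives quick feedback during training.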
Model training and evaluation:
Now that our model is created, we can train it on the data using the `fit` method.
```python
# Train model
model.fit(X_train, y_train, batch_size=32, epochs=10)
```
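If you want to monitor generalization while training, Keras can hold out a fraction of the training data for validation after each epoch (a suggestion on our part, not part of the original recipe):

```python
# Hold out 10% of the training data for validation after each epoch
model.fit(X_train, y_train, batch_size=32, epochs=10, validation_split=0.1)
```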
After the model is trained, we can evaluate its performance on the test data. We will start by making predictions on the test data using the `predict` method of the `Sequential` object. Because the sigmoid output is a probability, we threshold the predictions at 0.5 to obtain class labels.
```python
# Predict probabilities for the test data and threshold at 0.5 to get labels
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
```
To evaluate the model’s performance, we can use a number of different metrics. One common metric is the accuracy, which measures the proportion of correct predictions made by the model.
We can calculate the accuracy using the `accuracy_score` function from Scikit-learn.
```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
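One caveat worth keeping in mind: spam datasets are typically imbalanced, with far more legitimate messages than spam, so a high accuracy can mask poor performance on the spam class. That is why we also look at the confusion matrix and per-class metrics below.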
Another metric we can use is the confusion matrix, which shows the number of true positive, true negative, false positive, and false negative predictions made by the model.
We can create the confusion matrix using the `confusion_matrix` function from Scikit-learn.
```python
from sklearn.metrics import confusion_matrix

# Store the result under a different name so the imported function is not shadowed
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)
```
To better understand the model’s performance, we can also create a classification report, which includes precision, recall, and f1-score metrics.
We can create the classification report using the `classification_report` function from Scikit-learn.
```python
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print("Classification report:\n", report)
```
Here is the complete code for creating an advanced machine learning model to identify spam emails using Python:
```python
import string

import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Download the required nltk resources (only needed once)
nltk.download("punkt")
nltk.download("stopwords")

# Load the dataset. pandas cannot download directly from Kaggle, so this assumes
# the tab-separated SMS Spam Collection file has already been downloaded and
# extracted locally (adjust the path to match your copy).
data = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])

# Split data into features and target variable
X = data["text"]
y = data["label"]

# Initialize stemmer and stop words
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

# Preprocess text
X_processed = []
for text in X:
    # Remove punctuation and lowercase
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # Tokenize, drop stop words, and stem the remaining words
    tokens = [stemmer.stem(word) for word in word_tokenize(text) if word not in stop_words]
    # Rejoin stemmed words
    X_processed.append(" ".join(tokens))

# Convert text to integer sequences and pad them to a fixed length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_processed)
X_sequences = tokenizer.texts_to_sequences(X_processed)
X_padded = pad_sequences(X_sequences, maxlen=500)

# Vocabulary size for the embedding layer (+1 for the reserved padding index 0)
vocab_size = len(tokenizer.word_index) + 1

# Encode labels as integers: ham -> 0, spam -> 1
y_encoded = (y == "spam").astype(int)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_padded, y_encoded, test_size=0.2)

# Create LSTM model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128))
model.add(LSTM(units=128, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(units=1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train model
model.fit(X_train, y_train, batch_size=32, epochs=10)

# Predict probabilities for the test data and threshold at 0.5 to get labels
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Create confusion matrix (stored as cm to avoid shadowing the imported function)
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)

# Create classification report
report = classification_report(y_test, y_pred)
print("Classification report:\n", report)
```