Spam Email Classifier

This project provides a Spam Email Classifier using Logistic Regression and Random Forest models. The classifier leverages TF-IDF vectorization and feature engineering based on common spam words to detect potentially malicious or spam emails. Additionally, it includes a web application built with Flask that detects if the input email is spam and, if it is not, uses a self-hosted Llama 3.2 model with Ollama to generate a suggested response.

Project Overview

The project consists of:

Data Preprocessing: Loads and preprocesses data, with TF-IDF vectorization and additional feature engineering.
Model Training: Trains both Logistic Regression and Random Forest models on the preprocessed data.
Model Persistence: Saves the trained models and TF-IDF vectorizer for future predictions.
Evaluation: Provides accuracy, precision, recall, and F1-score metrics.
Web Application: A Flask web app that allows users to input an email, detect if it is spam, and generate a suggested response if it is not spam.

Features

Data Loading and Preprocessing: Loads multiple CSV files, combines them, and extracts features relevant to spam classification.
Feature Engineering: Counts occurrences of common spam words and performs TF-IDF vectorization.
Model Training: Uses Logistic Regression and Random Forest models to classify emails.
Model Evaluation: Calculates and displays performance metrics like accuracy, precision, recall, and F1-score.
Model Saving: Saves the trained models and vectorizer for deployment in a web app.
Web Application: Provides a user interface to input emails, select a model, and get spam detection results along with suggested responses for non-spam emails.

Dataset

The dataset used is a combination of various sources of labeled emails¹. It is stored in five separate CSV files in ./src/Dataset/. These files are concatenated into a single DataFrame for processing. Each file should contain the following columns:

text_combined: The email text to classify.
label: The label for each email (e.g., 1: spam or 0: not spam).

Training the Models

The script src/model.py:

Loads and concatenates the dataset files using the get_dataset function.
Encodes labels using LabelEncoder.
Splits the dataset into training and testing sets.
Applies TF-IDF vectorization with unigram and bigram features.
Combines TF-IDF features with additional features (like spam word counts).
Trains and saves both the Logistic Regression and Random Forest models.

Evaluation Metrics

Each model is evaluated using the following metrics:

Accuracy: Measures the overall correctness.
Precision: Indicates how many identified spam emails were actually spam.
Recall: Shows how well the model identifies actual spam emails.
F1 Score: Provides a balance between precision and recall.

The metrics are displayed in the console after training.

Saving the Models

The trained models and TF-IDF vectorizer are saved to ./src/trained_models/. The vectorizer file (tfidf_model.pkl) and each model file (e.g., LogisticRegression_model.pkl, RandomForestClassifier_model.pkl) are stored in this directory.

Web Application

The web application is built using Flask and provides the following features:

Email Input: Users can input the email text they want to classify.
Model Selection: Users can select between Logistic Regression and Random Forest models.
Spam Detection: The app detects if the input email is spam.
Suggested Response: If the email is not spam, the app uses a self-hosted Llama 3.2 model with Ollama to generate a suggested response.

Setup Instructions

Clone the repository:

git clone <repository-url>
cd <repository-directory>

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

Install the required packages:
```
pip install -r requirements.txt
```
Download the dataset: Place the dataset CSV files in the ./src/Dataset/ directory.
Train the models: Run the src/model.py script to train and save the models.
```
python src/model.py
```

Set up the Flask web application:

export FLASK_APP=src/app.py
export FLASK_ENV=development  # For development environment
flask run

Set up the self-hosted Llama 3.2 with Ollama: Follow the instructions provided by Ollama to set up the self-hosted Llama 3.2 model.
Access the web application: Open your web browser and go to http://localhost:5000.

Future Improvements

Add More Features: Experiment with additional feature engineering to capture more nuanced aspects of spam emails.
Add More Models: Test additional classifiers like SVM or Neural Networks.
Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV for improved model performance.
Optimize Web App: Develop a user-friendly and responsive front end for the web app, and consider deploying it to a cloud platform like AWS or Heroku.

Al-Subaiey, A., Al-Thani, M., Alam, N. A., Antora, K. F., Khandakar, A., & Zaman, S. A. U. (2024, May 19). Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection. ArXiv.org. https://arxiv.org/abs/2405.11619 ↩

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
src		src
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spam Email Classifier

Table of Contents

Project Overview

Features

Dataset

Training the Models

Evaluation Metrics

Saving the Models

Web Application

Setup Instructions

Future Improvements

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Spam Email Classifier

Table of Contents

Project Overview

Features

Dataset

Training the Models

Evaluation Metrics

Saving the Models

Web Application

Setup Instructions

Future Improvements

Footnotes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages