Machine Learning-Powered Phishing and Malware URL Classifier
A machine learning-based cybersecurity system that detects and classifies malicious URLs by analyzing structural and statistical features — without inspecting webpage content.
Key properties:
- Real-time URL analysis; XGBoost reaches 96.5% accuracy and the ensemble 97.1% in the model comparison table
- 30+ engineered URL features across length, entropy, domain, and security dimensions
- No page content inspection — privacy-preserving by design
- Supports single URL lookup, batch processing, and REST API integration
| Layer | Technology |
|---|---|
| Machine Learning | Scikit-learn, XGBoost |
| Data Processing | Pandas, NumPy |
| Web Interface | Streamlit |
| API | Flask |
| Visualization | Matplotlib, Seaborn |

Quick start:
git clone https://github.com/ares-coding/malicious-url-detection-using-ml.git
cd malicious-url-detection-using-ml
pip install -r requirements.txt
streamlit run app.py
Docker:
docker build -t url-detector .
docker run -p 8501:8501 url-detector
Web interface:
streamlit run app.py
# Access at http://localhost:8501
Python usage:
from url_detector import URLDetector
detector = URLDetector(model='xgboost')
# Single URL
result = detector.predict('https://suspicious-site.com')
print(f"Malicious: {result['is_malicious']}")
print(f"Confidence: {result['confidence']:.2%}")
# Batch
urls = ['url1.com', 'url2.com', 'url3.com']
results = detector.predict_batch(urls)
REST API:
python api.py
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
Feature extraction:
def extract_url_features(url):
    # check_ip_address and extract_domain are helper functions not shown here
    features = {
        'url_length': len(url),
        'num_dots': url.count('.'),
        'num_hyphens': url.count('-'),
        'num_underscores': url.count('_'),
        'num_slashes': url.count('/'),
        'num_questionmarks': url.count('?'),
        'num_equals': url.count('='),
        'num_ats': url.count('@'),
        'num_digits': sum(c.isdigit() for c in url),
        'has_ip': check_ip_address(url),
        # Match the full scheme so a host such as "httpsfoo.com" is not counted
        'has_https': url.startswith('https://'),
        'domain_length': len(extract_domain(url)),
        # 20+ additional features
    }
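    return features
Models:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(max_depth=6),
    'SVM': SVC(kernel='rbf', probability=True),
}
# model: a fitted entry of models; features: a 2D array with one row per URL
proba = model.predict_proba(features)[0]  # class probabilities for the first URL
prediction = proba.argmax()               # index of the most likely class
confidence = proba.max()
Features are grouped into the following categories: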
| Category | Features |
|---|---|
| Length-based | URL length, domain length, path length |
| Character-based | Dots, hyphens, slashes, special characters |
| Domain | IP address presence, subdomain count, TLD type |
| Path | Directory depth, file extension |
| Query | Parameter count, suspicious patterns |
| Security | HTTPS, certificate validity |
| Entropy | Character distribution randomness |
| Reputation | Domain age, blacklist scores |
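The helpers `check_ip_address` and `extract_domain` called in `extract_url_features` are not shown in this README. The following is a minimal standard-library sketch of how they, and two features from the table (subdomain count and entropy), might be computed. The function bodies, `count_subdomains`, and `shannon_entropy` are assumptions for illustration and may differ from the project's actual code.

```python
import ipaddress
import math
import re
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lowercase hostname of a URL, or '' if none can be parsed."""
    # urlparse only recognizes the host when a scheme or leading '//' is present.
    if not re.match(r'^([a-zA-Z][a-zA-Z0-9+.-]*:)?//', url):
        url = '//' + url
    try:
        return urlparse(url).hostname or ''
    except ValueError:  # raised for malformed bracketed IPv6 hosts
        return ''


def check_ip_address(url):
    """Return True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(extract_domain(url))
        return True
    except ValueError:
        return False


def count_subdomains(url):
    """Count host labels left of the last two (naive: ignores suffixes like co.uk)."""
    host = extract_domain(url)
    if not host or check_ip_address(url):
        return 0
    return max(len(host.split('.')) - 2, 0)


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

For example, `shannon_entropy('aabb')` is 1.0 bit per character, while long random-looking hostnames score higher, which is the signal the entropy feature is meant to capture.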

Top 10 features by importance:

| Feature | Importance |
|---|---|
| url_length | 0.142 |
| has_ip_address | 0.128 |
| num_subdomains | 0.095 |
| domain_length | 0.087 |
| num_dots | 0.076 |
| has_https | 0.068 |
| entropy | 0.062 |
| num_hyphens | 0.055 |
| path_depth | 0.051 |
| num_digits | 0.048 |

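As a quick arithmetic check, the ten importances listed above can be summed to see how much of the total they cover. This sketch assumes the scores are normalized to sum to 1 across all features, as scikit-learn's tree ensembles report them; the README does not state which model or normalization produced these numbers.

```python
# Importance scores copied from the top-10 list above.
top_features = {
    'url_length': 0.142,
    'has_ip_address': 0.128,
    'num_subdomains': 0.095,
    'domain_length': 0.087,
    'num_dots': 0.076,
    'has_https': 0.068,
    'entropy': 0.062,
    'num_hyphens': 0.055,
    'path_depth': 0.051,
    'num_digits': 0.048,
}

# Share of total importance covered by the ten listed features.
top10_share = sum(top_features.values())

# Assumption: scores sum to 1 over all features, so the remainder is what
# the other 20+ features contribute together.
remaining_share = 1.0 - top10_share

print(f"Top 10 share: {top10_share:.3f}")      # Top 10 share: 0.812
print(f"Remaining:    {remaining_share:.3f}")  # Remaining:    0.188
```

Under that assumption, the ten listed features account for about 81% of the total importance.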
| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Random Forest | 94.2% | 93.8% | 94.6% | 94.2% | 0.97 |
| XGBoost | 96.5% | 96.2% | 96.8% | 96.5% | 0.98 |
| SVM (RBF) | 92.8% | 92.3% | 93.2% | 92.7% | 0.96 |
| Ensemble | 97.1% | 96.9% | 97.3% | 97.1% | 0.99 |
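The README does not say how the Ensemble row is produced. One common way to combine classifiers like these is soft voting, where each model's predicted probability of the malicious class is averaged and then thresholded. The sketch below illustrates that idea only; `soft_vote`, the 0.5 threshold, and the example probabilities are hypothetical and are not taken from the project.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average per-model malicious probabilities and apply a threshold.

    probabilities: dict mapping model name -> P(malicious) for one URL.
    Returns (is_malicious, mean_probability).
    """
    if not probabilities:
        raise ValueError("need at least one model probability")
    mean_prob = sum(probabilities.values()) / len(probabilities)
    return mean_prob >= threshold, mean_prob


# Hypothetical per-model outputs for a single URL (illustrative values only).
model_probs = {'Random Forest': 0.80, 'XGBoost': 0.90, 'SVM': 0.40}
is_malicious, score = soft_vote(model_probs)
print(is_malicious, round(score, 2))  # True 0.7
```

Averaging lets two confident models outvote one uncertain model, which is one plausible reason an ensemble can edge out its best single member.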

Confusion Matrix (XGBoost):

| | Predicted Benign | Predicted Malicious |
|---|---|---|
| Actual Benign | 4,823 | 152 |
| Actual Malicious | 118 | 4,907 |

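The standard metrics can be derived directly from the four counts in the confusion matrix above. The sketch below treats "malicious" as the positive class, which is an assumption since the README does not state it explicitly.

```python
# Counts from the XGBoost confusion matrix above (malicious = positive class).
tn, fp = 4823, 152   # actual benign: predicted benign, predicted malicious
fn, tp = 118, 4907   # actual malicious: predicted benign, predicted malicious

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # of URLs flagged malicious, how many were
recall = tp / (tp + fn)               # of malicious URLs, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # benign URLs wrongly flagged

for name, value in [
    ('Accuracy', accuracy),
    ('Precision', precision),
    ('Recall', recall),
    ('F1-score', f1),
    ('False positive rate', false_positive_rate),
]:
    print(f"{name:<20}{value:.2%}")
```

With the counts as printed, accuracy works out to 97.30% over 10,000 URLs, which is higher than the 96.5% listed for XGBoost in the model table, so the matrix and the table may come from different evaluation runs.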
Analyze a single URL.
Request:
{
  "url": "https://example.com/path?param=value"
}
Response:
{
  "url": "https://example.com/path?param=value",
  "is_malicious": false,
  "confidence": 0.923,
  "risk_score": "low",
  "features": {
    "url_length": 36,
    "has_https": true,
    "num_dots": 1
  },
  "timestamp": "2025-02-13T10:30:00Z"
}
Analyze multiple URLs.
Request:
{
  "urls": [
    "https://google.com",
    "http://suspicious-site.tk"
  ]
}
Project structure:
malicious-url-detection/
├── data/
│   ├── raw/                  # Original datasets
│   ├── processed/            # Cleaned data
│   └── models/               # Trained model files
├── src/
│   ├── feature_extraction.py
│   ├── model_training.py
│   ├── prediction.py
│   └── utils.py
├── notebooks/
│   ├── 01_data_analysis.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_evaluation.ipynb
├── api/
│   ├── app.py
│   └── schemas.py
├── app.py                    # Streamlit interface
├── train.py                  # Training script
├── requirements.txt
└── README.md
Licensed under the Apache License 2.0.
Au Amores — AI/ML Engineer