Bachelor's Thesis (TFG) — Data Engineering · Universidad Carlos III de Madrid (UC3M) · 2025
Problem. The XMM-Newton EPIC-pn X-ray telescope produces thousands of multi-band light curves per observation. Identifying astrophysically interesting transients (stellar flares, eclipses, X-ray bursts) in these archives is a labour-intensive expert task with no clean labels — a classic unsupervised anomaly-detection challenge on multivariate time series.
Solution. An end-to-end deep-learning pipeline built on reconstruction-based anomaly scoring. The centrepiece is a Transformer Autoencoder (TAE) trained with a novel Masked Denoising objective: 30 % of input timesteps are randomly zeroed, forcing the model to learn robust temporal representations rather than exploiting the identity shortcut. A complementary validity masking mechanism excludes instrumental background contamination from both training loss and anomaly scoring.
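The training objective can be sketched in a few lines; this is a minimal NumPy illustration of the idea, not the repo's `transformer_AE.py` implementation (function and argument names are assumptions):

```python
import numpy as np

def masked_denoising_loss(window, validity, reconstruct, p=0.30, rng=None):
    """Sketch of the masked-denoising objective with validity masking.

    window      : (T, bands) clean light-curve window
    validity    : (T,) boolean, False where telescope artefacts occur
    reconstruct : callable mapping a corrupted window to a reconstruction
    """
    rng = rng or np.random.default_rng(0)
    corrupted = window.copy()
    corrupted[rng.random(window.shape[0]) < p] = 0.0  # zero ~30 % of timesteps
    recon = reconstruct(corrupted)
    err = (recon - window) ** 2       # error against the CLEAN target
    return err[validity].mean()       # artefact timesteps excluded from the loss
```

Because the error is computed against the clean window, simply copying the (corrupted) input is penalised on every masked timestep, which is what removes the identity shortcut.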
Key Results — blind test set, N = 56 signals, stratified 70/30 holdout:
| Model | AUC-ROC · Scenario 1 | AUC-ROC · Scenario 2 | Notes |
|---|---|---|---|
| TAE + Masked Denoising (best run) | 0.918 | 0.915 | Final blind test |
| TAE + Masked Denoising (5-seed mean) | 0.843 ± 0.019 | 0.856 ± 0.010 | Bootstrap 95 % CI [0.820, 0.974] |
| LSTM-AE + Masked Denoising | 0.407 ± 0.022 | — | 5-seed mean; masking alone is insufficient |
| LSTM-AE (no masking) | 0.267 | — | Identity-shortcut failure |
| Isolation Forest | ~0.862 | — | TAE advantage +5.6 pp (DeLong p = 0.192) |
| Anomaly Transformer | < TAE | < TAE | Underperforms in all three scenarios (p < 0.05) |
Scenario 1: all "Interesting" signals · Scenario 2: Interesting excluding background-type events
Headline findings:
- Masked denoising is the single largest design improvement: +8.0 pp AUC-ROC in a 43-experiment ablation.
- TAE advantage over IF is not significant under broad detection (+5.6 pp, p = 0.192) but becomes formally confirmed for genuine astrophysical events (+14.3 pp, p = 0.004).
- The Anomaly Transformer consistently underperforms, attributed to its Gaussian temporal prior mismatching Poisson-noise statistics of X-ray light curves.
- Post-hoc attention analysis reveals anomaly detection emerges as a by-product of reconstruction failure, not learned anomaly spotting (attention entropy ≈ 4.0 bits uniformly, indistinguishable between anomalous and normal signals).
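The bootstrap intervals quoted above follow the standard percentile recipe over resampled signal sets; a minimal sketch (the `auc` and `bootstrap_auc_ci` helpers are illustrative, not the repo's `compute_confidence_intervals.py`):

```python
import numpy as np

def auc(scores, labels):
    """AUC-ROC via the Mann-Whitney U statistic."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for AUC-ROC over resampled signals."""
    rng = np.random.default_rng(seed)
    n, aucs = len(scores), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample signals with replacement
        if 0 < labels[idx].sum() < n:          # need both classes in the resample
            aucs.append(auc(scores[idx], labels[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```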
```
Raw XMM-Newton EPIC-pn light curves (.parquet)
                       │
                       ▼
┌─────────────────────────────────────────────┐
│          Sliding Window Extraction          │
│ L=128, stride=64 · 5 energy bands (RATE1–5) │
└─────────────────────────────────────────────┘
                       │
                       ▼  optional
┌─────────────────────────────────────────────┐
│             Feature Engineering             │
│     Hardness Ratios · Total Count Rate      │
└─────────────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│  Masked Denoising (30 % timestep masking)   │
│  + Validity Masking (telescope artefacts)   │
│     Loss computed on clean targets only     │
└─────────────────────────────────────────────┘
        │
        ├──────────────────┬──────────────────┐
        ▼                  ▼                  ▼
 Transformer AE   Anomaly Transformer      LSTM AE
  (main model)     (attention-based)     (baseline)
 6L · 4H · d=128    3L · 4H · d=128    2-layer BiLSTM
        │                  │                  │
        └──────────────────┴──────────────────┘
                           │
                           ▼
        ┌─────────────────────────────────────┐
        │    Reconstruction Error Scoring     │
        │ + Association Discrepancy (AT only) │
        └─────────────────────────────────────┘
                           │
                           ▼
        ┌─────────────────────────────────────┐
        │  Hold-Out Evaluation (no leakage)   │
        │ Val N=128 → threshold + model sel.  │
        │  Test N=56 → final blind metrics    │
        │  DeLong tests · Bootstrap 95 % CIs  │
        └─────────────────────────────────────┘
```
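The first pipeline stage above is plain sliding-window extraction with the L = 128, stride = 64 defaults; an illustrative sketch (not the repo's `datasets.py`):

```python
import numpy as np

def sliding_windows(series, L=128, stride=64):
    """Cut a (T, bands) light curve into overlapping (L, bands) windows."""
    starts = range(0, series.shape[0] - L + 1, stride)
    return np.stack([series[s:s + L] for s in starts])
```

With stride = L/2 every timestep is covered by two windows, so a short transient cannot fall entirely on a window boundary.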
| Model | Architecture | Key Innovation |
|---|---|---|
| TransformerAE (TAE) | 6-layer Transformer encoder–decoder | Masked Denoising (30 % masking) + Validity Masking |
| AnomalyTransformer (AT) | Transformer + Gaussian prior attention | Association Discrepancy loss |
| LSTM Autoencoder | Bidirectional LSTM seq2seq | Masked Denoising variant (ablation baseline) |
| Isolation Forest | Ensemble on per-window statistical features | Classical ML baseline |
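For the Isolation Forest baseline, each window is first reduced to summary statistics; a scikit-learn sketch where the exact feature set is an assumption (see `train_isolation_forest.py` for the real one):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
windows = rng.poisson(5.0, size=(1000, 128, 5)).astype(float)  # toy windows

# Per-window, per-band summary statistics (illustrative choice)
feats = np.concatenate(
    [windows.mean(axis=1), windows.std(axis=1), windows.max(axis=1)], axis=1)

forest = IsolationForest(n_estimators=200, random_state=42).fit(feats)
scores = -forest.score_samples(feats)   # higher = more anomalous
```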
```
.
├── src/tfg/                              # Core installable package
│   ├── models/
│   │   ├── transformer_AE.py             # TransformerAE architecture + training utilities
│   │   └── anomaly_transformer.py        # AnomalyTransformer architecture + loss
│   ├── data/
│   │   └── datasets.py                   # Dataset classes, sliding windows, feature engineering
│   └── inference/
│       └── anomaly_detection.py          # Scoring, thresholding, plotting
│
├── scripts/
│   ├── train_transformer_AE.py           # Train TransformerAE
│   ├── train_anomaly_transformer.py      # Train AnomalyTransformer
│   ├── train_lstm_ae.py                  # Train LSTM Autoencoder baseline
│   ├── train_isolation_forest.py         # Train Isolation Forest baseline
│   ├── detect_anomalies.py               # Unified anomaly scoring (TAE / AT)
│   ├── eval_roc.py                       # ROC / PR evaluation at window & signal level
│   ├── generate_val_test_split.py        # Reproducible stratified holdout split
│   ├── compute_delong_tests.py           # Paired DeLong (1988) AUC-ROC comparison
│   ├── compute_confidence_intervals.py   # Bootstrap 95 % CIs
│   ├── compute_cost_table.py             # Cost-sensitive threshold analysis
│   ├── multires_scoring.py               # Multi-resolution window scoring
│   ├── extract_attention_analysis.py     # Attention pattern visualisation
│   ├── extract_signal_embeddings.py      # Latent space extraction
│   ├── plot_qualitative_analysis.py      # Qualitative anomaly case studies
│   ├── plot_signal_graph.py              # Signal similarity graph
│   ├── regenerate_figures_local.py       # Reproduce all thesis figures (local)
│   └── regenerate_figures_server.py      # Reproduce all thesis figures (server)
│
├── notebooks/
│   ├── 01_data_exploration.ipynb         # Signal statistics, band distributions, correlations
│   └── 02_mask_inspection.ipynb          # Telescope mask format inspection
│
├── experiments/                          # Bash orchestration for full experiment matrices
│   ├── run_tfg_experiment.sh             # Main TFG ablation matrix (4 groups, 12 configs)
│   ├── run_tfg_experiment2.sh            # Extended experiment batch
│   ├── run_tfg_experiment3.sh            # Final experiment batch
│   ├── run_holdout_evaluation.sh         # End-to-end holdout val → test pipeline
│   ├── run_confidence_intervals.sh       # Bootstrap CI generation
│   ├── run_b3_groupC_masked.sh           # Group C masked denoising ablation
│   └── run_lstm_ae_masked.sh             # LSTM-AE masked denoising ablation
│
├── data/
│   ├── raw/                              # XMM-Newton .parquet signals [NOT committed — see Data section]
│   ├── masks/                            # Telescope mask files [NOT committed]
│   ├── processed/                        # Intermediate features [NOT committed]
│   ├── labels/
│   │   ├── signals_to_assess.xlsx        # Expert anomaly annotations (ground truth)
│   │   └── windows.xlsx                  # Window-level label metadata
│   └── splits/
│       ├── data_split.json               # Reproducible train/val/test assignment (seed 42)
│       └── optimal_thresholds.json       # Val-tuned thresholds per model × scenario
│
├── requirements.txt
├── LICENSE
└── README.md
```
```bash
git clone https://github.com/mikelballay/signals_anomaly_detection.git
cd signals_anomaly_detection
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Raw XMM-Newton light curves are not included due to size. Place `.parquet` files as:

```
data/raw/full_signals/<obsid>.parquet   # multiband signals (MultiIndex columns: signal_id × band)
data/masks/<obsid>.parquet              # telescope validity masks (optional)
```

The ground-truth labels (`data/labels/`) and the train/val/test split (`data/splits/`) are committed and ready to use.
```bash
# Transformer Autoencoder + Masked Denoising
python scripts/train_transformer_AE.py \
    --data_dir data/raw/full_signals \
    --out_dir results/runs/tae_masked \
    --signals data/labels/signals_to_assess.xlsx \
    --windows data/labels/windows.xlsx \
    --L 128 --stride 64 --epochs 50 --batch 64 \
    --lr 5e-4 --d_model 128 --nhead 4 --num_layers 3 \
    --loss poisson --extra_features all \
    --denoise_mode mask --denoise_p 0.30 \
    --mask_dir data/masks

# LSTM Autoencoder + Masked Denoising
python scripts/train_lstm_ae.py \
    --data_dir data/raw/full_signals --out_dir results/baselines/lstm_ae \
    --signals data/labels/signals_to_assess.xlsx --windows data/labels/windows.xlsx \
    --L 128 --stride 64 --hidden 128 --denoise_p 0.30
```
```bash
# Isolation Forest
python scripts/train_isolation_forest.py \
    --data_dir data/raw/full_signals --out_dir results/baselines/isolation_forest \
    --signals data/labels/signals_to_assess.xlsx --windows data/labels/windows.xlsx \
    --L 128 --stride 64
```

```bash
# Score windows
python scripts/detect_anomalies.py \
    --model_type tae \
    --data_dir data/raw/full_signals \
    --run_dir results/runs/tae_masked/<run_id> \
    --thr_mode percentile --thr_value 99.0
```
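`--thr_mode percentile --thr_value 99.0` amounts to flagging the top 1 % of window reconstruction errors; in sketch form (names are illustrative, not the script's internals):

```python
import numpy as np

def flag_windows(errors, q=99.0):
    """Flag windows whose reconstruction error exceeds the q-th percentile."""
    thr = np.percentile(errors, q)
    return errors > thr, thr

# Toy error distribution: with q=99.0, roughly 1 % of windows are flagged
errors = np.abs(np.random.default_rng(1).normal(size=10_000))
flags, thr = flag_windows(errors)
```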
```bash
# ROC / PR curves against ground truth
python scripts/eval_roc.py \
    --signals data/labels/signals_to_assess.xlsx \
    --windows data/labels/windows.xlsx \
    --scores results/runs/tae_masked/<run_id>/anomaly_scores.csv \
    --output-dir results/eval/tae_masked_simple \
    --label-mode simple
```

```bash
# End-to-end: val threshold tuning → blind test → DeLong + bootstrap CIs
bash experiments/run_holdout_evaluation.sh

# Standalone DeLong test (TAE vs IF, TAE vs AT)
python scripts/compute_delong_tests.py \
    --tae_scores results/runs/tae_masked/ \
    --if_scores results/baselines/isolation_forest/anomaly_scores.csv \
    --at_scores results/runs/at_baseline/ \
    --split test --split-file data/splits/data_split.json
```

| Argument | Values | Notes |
|---|---|---|
| `--loss` | `mse`, `huber`, `poisson` | Poisson-weighted MSE best matches count-rate statistics |
| `--extra_features` | `none`, `total`, `hr`, `all` | Hardness ratios improve band-relative anomaly scoring |
| `--denoise_mode` | `none`, `mask`, `gaussian` | `mask` with p = 0.30 is the validated best configuration |
| `--denoise_p` | 0.0–1.0 | Masking fraction; 0.30 confirmed by ablation |
| `--mask_dir` | path | Telescope validity masks; enables validity masking in the loss |
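One plausible reading of the `poisson` loss option is an MSE weighted by the inverse target rate, since a Poisson process has variance equal to its mean; this sketch is an assumption, not the exact formula in `transformer_AE.py`:

```python
import numpy as np

def poisson_weighted_mse(pred, target, eps=1e-8):
    """MSE divided by the target rate: residuals get roughly unit variance
    under Poisson counting noise (variance = mean)."""
    return (((pred - target) ** 2) / (target + eps)).mean()
```

The effect is that a fixed absolute residual counts for more on faint bins than on bright ones, matching the heteroscedastic noise of count data.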
- Signals (`data/raw/full_signals/<obsid>.parquet`): `pandas.DataFrame` with MultiIndex columns `(signal_id, band)`, band ∈ {RATE1, …, RATE5}. Rows are time steps; values are photon count rates.
- Masks (`data/masks/<obsid>.parquet`): same naming, columns `(signal_id, "OK")`. Boolean: `True` = valid, `False` = artefact/background.
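Hardness ratios (the `hr` extra feature) are conventionally defined between a hard and a soft band's count rates; a sketch in the usual X-ray-astronomy convention (which RATE bands the repo pairs up is not specified here):

```python
import numpy as np

def hardness_ratio(hard, soft, eps=1e-8):
    """HR = (H - S) / (H + S): band-relative spectral shape in [-1, 1],
    insensitive to overall source brightness."""
    return (hard - soft) / (hard + soft + eps)
```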
```bibtex
@thesis{ballay2025anomaly,
  author = {Mikel Ballay},
  title  = {Unsupervised Anomaly Detection in X-Ray Time Series with Transformers},
  school = {Universidad Carlos III de Madrid},
  year   = {2025},
  type   = {Bachelor's Thesis (TFG), Data Engineering}
}
```

Full thesis PDF: [link to be added after defence]
Mikel Ballay — mikel.ballay@gmail.com
Data Engineering · UC3M · 2025