Production-grade Russian scientific-text classifier over 28 top-level GRNTI (State Rubricator of Scientific and Technical Information) classes. Main model XLM-RoBERTa-base (multilingual transformer, fine-tuned on Russian abstracts); baseline ruBERT-base-cased (single-language BERT). Both are Hydra-configured, Optuna-tuned, evaluated with top-1 / top-5 accuracy and macro / weighted F1, and served by FastAPI as /classify.
Part of the kiselyovd ML portfolio — production-grade ML projects sharing one cookiecutter template.
📖 English docs • 🇷🇺 Русский README • 🤗 HF Hub model
ai-forever/ru-scibench-grnti-classification — Russian scientific abstracts labelled with 28 GRNTI top-level sections. Split statistics:
| Split | Rows | Classes |
|---|---|---|
| Train | 28 476 | 28 (balanced) |
| Test | 2 772 | 28 (balanced) |
Median sequence length ~120 tokens under the XLM-RoBERTa tokenizer (xlm-roberta-base, max_length=256). Fetched by scripts/sync_data.sh via HF snapshot_download.
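The ~120-token median can be checked with a short helper. A minimal sketch, using a whitespace split as a dependency-free stand-in for the real xlm-roberta-base tokenizer (an assumption; the actual pipeline would use `AutoTokenizer.from_pretrained("xlm-roberta-base")`):

```python
from statistics import median

def token_length_stats(texts, tokenize=str.split, max_length=256):
    """Median token count and the share of texts exceeding max_length.

    `tokenize` defaults to a whitespace split as a dependency-free
    stand-in for the XLM-RoBERTa subword tokenizer.
    """
    lengths = [len(tokenize(t)) for t in texts]
    med = median(lengths)
    truncated = sum(n > max_length for n in lengths) / len(lengths)
    return med, truncated
```

With the subword tokenizer substituted in, the same function reports how many abstracts are actually clipped at max_length=256.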
Test set n = 2 772 abstracts across 28 GRNTI sections.
| Model | Top-1 accuracy | Top-5 accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| XLM-RoBERTa-base (main) | 72.4% | 96.8% | 72.3% | 72.3% |
| ruBERT-base-cased (baseline) | 72.9% | 95.9% | 72.8% | 72.8% |
Best Optuna trial (20 trials, val macro-F1): lr=3.1e-5, weight_decay=0.012, warmup_ratio=0.147 → val macro-F1 = 73.1%.
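The search above is driven by Optuna in the real pipeline; as an illustration of sampling over (lr, weight_decay, warmup_ratio), here is a dependency-free random-search sketch. The ranges below are assumptions for illustration only; the actual search space is presumably defined in the project's Hydra/Optuna config.

```python
import math
import random

def sample_trial(rng):
    # Hypothetical search space chosen to bracket the reported best trial;
    # the real ranges live in the project's tuning config.
    return {
        "lr": 10 ** rng.uniform(math.log10(1e-5), math.log10(5e-5)),  # log-uniform
        "weight_decay": rng.uniform(0.0, 0.1),
        "warmup_ratio": rng.uniform(0.0, 0.2),
    }

def random_search(objective, n_trials=20, seed=42):
    """Keep the params that maximize `objective` (e.g. validation macro-F1)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_trial(rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Optuna's TPE sampler replaces the uniform draws with a guided search, but the trial loop has the same shape.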
The baseline is marginally ahead on top-1 (+0.5 pp), while the main model leads on top-5 (+0.9 pp): XLM-R's multilingual pre-training yields a better top-k ranking, whereas the Russian-only ruBERT is slightly sharper on the argmax. Both checkpoints ship with model cards on the HF Hub.
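The top-1 / top-5 numbers above come down to one small computation over the model's logits. A minimal dependency-free sketch (the evaluation code in the repo may differ):

```python
def topk_accuracy(logit_rows, labels, k=1):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    hits = 0
    for logits, label in zip(logit_rows, labels):
        topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

With k=1 this is plain accuracy; with k=5 a prediction counts as correct if the true GRNTI section appears anywhere in the five highest logits.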
uv sync --all-groups
bash scripts/sync_data.sh
uv run python -m grnti_text_classifier.data.prepare --raw data/raw --out data/processed
uv run python scripts/train_all.py
Serve the API:
uvicorn grnti_text_classifier.serving.main:app --reload
Classify a text:
curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"text":"Исследование квантовой электродинамики в кристаллах."}'
See docs/serving.md for the full endpoint contracts, request/response schemas, and environment-variable reference.
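The same call from Python, using only the standard library. The response shape below is an assumption for illustration (docs/serving.md defines the actual schema):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/classify"  # default local uvicorn address

def build_classify_request(text: str) -> request.Request:
    """POST request with a JSON body matching the curl example."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def top_label(response: dict) -> str:
    # Assumed shape: {"predictions": [{"label": ..., "score": ...}, ...]},
    # sorted by score; see docs/serving.md for the real schema.
    return response["predictions"][0]["label"]
```

Send it with `urllib.request.urlopen(build_classify_request(text))` and feed the decoded JSON to `top_label`.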
| Variable | Purpose |
|---|---|
| GRNTI_MAIN_DIR | Directory containing the save_pretrained snapshot of the XLM-RoBERTa main model. |
| GRNTI_BASELINE_DIR | Directory containing the save_pretrained snapshot of the ruBERT baseline. |
| GRNTI_LABEL_ENCODER | Path to label_encoder.json mapping int indices to GRNTI class codes. |
| GRNTI_MODEL_VERSION | Reported in the /health response and classification output (e.g. v0.1.0). |
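These variables can be gathered into one settings object at startup. A minimal sketch, assuming label_encoder.json is a flat JSON object of index → class code (the `Settings` class and `load_settings` helper are hypothetical names, not the project's actual config code):

```python
import json
import os
from dataclasses import dataclass

@dataclass
class Settings:
    main_dir: str
    baseline_dir: str
    label_encoder: dict  # int index -> GRNTI class code
    model_version: str

def load_settings(environ=os.environ) -> Settings:
    with open(environ["GRNTI_LABEL_ENCODER"], encoding="utf-8") as f:
        raw = json.load(f)  # JSON object keys are strings; cast back to int
    return Settings(
        main_dir=environ["GRNTI_MAIN_DIR"],
        baseline_dir=environ["GRNTI_BASELINE_DIR"],
        label_encoder={int(k): v for k, v in raw.items()},
        model_version=environ.get("GRNTI_MODEL_VERSION", "dev"),
    )
```

Passing `environ` explicitly keeps the loader testable without touching the process environment.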
Full documentation (architecture, training runbook, serving guide, API reference) is published at https://kiselyovd.github.io/grnti-text-classifier/.
MIT — see LICENSE.