
grnti-text-classifier

CI Docs Coverage License: MIT Python 3.12+ HF Hub

Production-grade Russian scientific-text classifier over the 28 top-level GRNTI (State Rubricator of Scientific and Technical Information) classes. The main model is XLM-RoBERTa-base (a multilingual transformer fine-tuned on Russian abstracts); the baseline is ruBERT-base-cased (a Russian-only BERT). Both are Hydra-configured, Optuna-tuned, evaluated with top-1 / top-5 accuracy and macro / weighted F1, and served by FastAPI at /classify.

Part of the kiselyovd ML portfolio — production-grade ML projects sharing one cookiecutter template.

📖 English docs • 🇷🇺 Русский README • 🤗 HF Hub model

Dataset

ai-forever/ru-scibench-grnti-classification — Russian scientific abstracts labelled with 28 GRNTI top-level sections. Split statistics:

| Split | Rows   | Classes       |
|-------|--------|---------------|
| Train | 28 476 | 28 (balanced) |
| Test  | 2 772  | 28 (balanced) |

Median sequence length ~120 tokens under the XLM-RoBERTa tokenizer (xlm-roberta-base, max_length=256). Fetched by scripts/sync_data.sh via HF snapshot_download.

Results

Test set n = 2 772 abstracts across 28 GRNTI sections.

| Model                        | Top-1 accuracy | Top-5 accuracy | Macro F1 | Weighted F1 |
|------------------------------|----------------|----------------|----------|-------------|
| XLM-RoBERTa-base (main)      | 72.4%          | 96.8%          | 72.3%    | 72.3%       |
| ruBERT-base-cased (baseline) | 72.9%          | 95.9%          | 72.8%    | 72.8%       |

Best Optuna trial (20 trials, val macro-F1): lr=3.1e-5, weight_decay=0.012, warmup_ratio=0.147 → val macro-F1 = 73.1%.

The baseline is 0.5 pp ahead on top-1, while the main model leads by 0.9 pp on top-5 — XLM-R's multilingual pre-training gives better top-k ranking, while the Russian-only ruBERT is marginally sharper on the argmax. Both checkpoints ship with a model card on the HF Hub.
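For reference, the top-k accuracy used in the table above can be computed in a few lines of pure Python; `top_k_accuracy` below is an illustrative helper, not part of the package:

```python
def top_k_accuracy(logits, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, y in zip(logits, labels):
        # Indices of the k largest scores, highest first.
        top_k = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        hits += y in top_k
    return hits / len(labels)
```

With k=1 this reduces to ordinary accuracy; the gap between top-1 and top-5 in the table shows how often the true class sits just below the argmax.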

Quick Start

uv sync --all-groups
bash scripts/sync_data.sh
uv run python -m grnti_text_classifier.data.prepare --raw data/raw --out data/processed
uv run python scripts/train_all.py

Serving

uvicorn grnti_text_classifier.serving.main:app --reload

Classify a text:

curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"text":"Исследование квантовой электродинамики в кристаллах."}'
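The same request from Python, using only the standard library — a minimal sketch, with the endpoint URL and request shape assumed from the curl example above:

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8000/classify"  # default uvicorn host/port

def build_payload(text: str) -> bytes:
    """Serialize the JSON body expected by /classify."""
    return json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")

def classify(text: str, url: str = ENDPOINT) -> dict:
    """POST a text to the running service and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

See docs/serving.md for the exact response schema.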

See docs/serving.md for full endpoint contracts, all request/response schemas, and environment variable reference.

Environment Variables

| Variable              | Purpose                                                                    |
|-----------------------|----------------------------------------------------------------------------|
| GRNTI_MAIN_DIR        | Directory containing the save_pretrained snapshot for the XLM-RoBERTa main model. |
| GRNTI_BASELINE_DIR    | Directory containing the save_pretrained snapshot for the ruBERT baseline. |
| GRNTI_LABEL_ENCODER   | Path to label_encoder.json mapping int indices to GRNTI class codes.       |
| GRNTI_MODEL_VERSION   | Reported in the /health response and classification output (e.g. v0.1.0).  |
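One way the serving layer could resolve these variables at startup — a hypothetical sketch; the actual configuration code (and any defaults) lives in grnti_text_classifier.serving:

```python
import os

def load_serving_config(env=None) -> dict:
    """Collect the GRNTI_* serving settings from the environment.

    Model/encoder paths are treated as required here, so a missing one
    raises KeyError and misconfiguration fails fast at startup. The
    "dev" fallback for the version is illustrative, not the package's
    actual behaviour.
    """
    env = os.environ if env is None else env
    return {
        "main_dir": env["GRNTI_MAIN_DIR"],
        "baseline_dir": env["GRNTI_BASELINE_DIR"],
        "label_encoder": env["GRNTI_LABEL_ENCODER"],
        "model_version": env.get("GRNTI_MODEL_VERSION", "dev"),
    }
```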

Docs

Full documentation (architecture, training runbook, serving guide, API reference) is published at https://kiselyovd.github.io/grnti-text-classifier/.

License

MIT — see LICENSE.
