Production-grade Russian scientific-text classifier over 28 top-level GRNTI (State Rubricator of Scientific and Technical Information) classes. Main model XLM-RoBERTa-base (multilingual transformer, fine-tuned on Russian abstracts); baseline ruBERT-base-cased (single-language BERT). Both are Hydra-configured, Optuna-tuned, evaluated with top-1 / top-5 accuracy and macro / weighted F1, and served by FastAPI as /classify.
Part of the kiselyovd ML portfolio — production-grade ML projects sharing one cookiecutter template.
📖 English docs • 🇷🇺 Русский README • 🤗 HF Hub model
ai-forever/ru-scibench-grnti-classification — Russian scientific abstracts labelled with 28 GRNTI top-level sections. Split statistics:
| Split | Rows | Classes |
|---|---|---|
| Train | 28 476 | 28 (balanced) |
| Test | 2 772 | 28 (balanced) |
Median sequence length ~120 tokens under the XLM-RoBERTa tokenizer (xlm-roberta-base, max_length=256). Fetched by scripts/sync_data.sh via HF snapshot_download.
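The ~120-token median can be checked with a short helper. A minimal sketch, using a whitespace split as a dependency-free stand-in for the real xlm-roberta-base tokenizer (an assumption; the actual pipeline would use `AutoTokenizer.from_pretrained("xlm-roberta-base")`):

```python
from statistics import median

def token_length_stats(texts, tokenize=str.split, max_length=256):
    """Median token count and the share of texts exceeding max_length.

    `tokenize` defaults to a whitespace split as a dependency-free
    stand-in for the XLM-RoBERTa subword tokenizer.
    """
    lengths = [len(tokenize(t)) for t in texts]
    med = median(lengths)
    truncated = sum(n > max_length for n in lengths) / len(lengths)
    return med, truncated
```

With the subword tokenizer substituted in, the same function reports how many abstracts are actually clipped at max_length=256.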
Test set n = 2 772 abstracts across 28 GRNTI sections.
| Model | Top-1 accuracy | Top-5 accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| XLM-RoBERTa-base (main) | 72.4% | 96.8% | 72.3% | 72.3% |
| ruBERT-base-cased (baseline) | 72.9% | 95.9% | 72.8% | 72.8% |
Best Optuna trial (20 trials, val macro-F1): lr=3.1e-5, weight_decay=0.012, warmup_ratio=0.147 → val macro-F1 = 73.1%.
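The search above is driven by Optuna in the real pipeline; as an illustration of sampling over (lr, weight_decay, warmup_ratio), here is a dependency-free random-search sketch. The ranges below are assumptions for illustration only; the actual search space is presumably defined in the project's Hydra/Optuna config.

```python
import math
import random

def sample_trial(rng):
    # Hypothetical search space chosen to bracket the reported best trial;
    # the real ranges live in the project's tuning config.
    return {
        "lr": 10 ** rng.uniform(math.log10(1e-5), math.log10(5e-5)),  # log-uniform
        "weight_decay": rng.uniform(0.0, 0.1),
        "warmup_ratio": rng.uniform(0.0, 0.2),
    }

def random_search(objective, n_trials=20, seed=42):
    """Keep the params that maximize `objective` (e.g. validation macro-F1)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_trial(rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Optuna's TPE sampler replaces the uniform draws with a guided search, but the trial loop has the same shape.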
The baseline is marginally ahead on top-1 (+0.5 pp), while the main model leads on top-5 (+0.9 pp): XLM-R's multilingual pre-training yields a better top-k ranking, whereas the Russian-only ruBERT is slightly sharper on the argmax. Both checkpoints ship with model cards on the HF Hub.
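The top-1 / top-5 numbers above come down to one small computation over the model's logits. A minimal dependency-free sketch (the evaluation code in the repo may differ):

```python
def topk_accuracy(logit_rows, labels, k=1):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    hits = 0
    for logits, label in zip(logit_rows, labels):
        topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

With k=1 this is plain accuracy; with k=5 a prediction counts as correct if the true GRNTI section appears anywhere in the five highest logits.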
uv sync --all-groups
bash scripts/sync_data.sh
uv run python -m grnti_text_classifier.data.prepare --raw data/raw --out data/processed
uv run python scripts/train_all.py
Serve the API:
uvicorn grnti_text_classifier.serving.main:app --reload
Classify a text:
curl -X POST http://localhost:8000/classify \
-H "Content-Type: application/json" \
-d '{"text":"Исследование квантовой электродинамики в кристаллах."}'
See docs/serving.md for the full endpoint contracts, request/response schemas, and environment-variable reference.
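The same call from Python, using only the standard library. The response shape below is an assumption for illustration (docs/serving.md defines the actual schema):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/classify"  # default local uvicorn address

def build_classify_request(text: str) -> request.Request:
    """POST request with a JSON body matching the curl example."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def top_label(response: dict) -> str:
    # Assumed shape: {"predictions": [{"label": ..., "score": ...}, ...]},
    # sorted by score; see docs/serving.md for the real schema.
    return response["predictions"][0]["label"]
```

Send it with `urllib.request.urlopen(build_classify_request(text))` and feed the decoded JSON to `top_label`.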
| Variable | Purpose |
|---|---|
| GRNTI_MAIN_DIR | Directory containing the save_pretrained snapshot of the XLM-RoBERTa main model. |
| GRNTI_BASELINE_DIR | Directory containing the save_pretrained snapshot of the ruBERT baseline. |
| GRNTI_LABEL_ENCODER | Path to label_encoder.json mapping int indices to GRNTI class codes. |
| GRNTI_MODEL_VERSION | Reported in the /health response and classification output (e.g. v0.1.0). |
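These variables can be gathered into one settings object at startup. A minimal sketch, assuming label_encoder.json is a flat JSON object of index → class code (the `Settings` class and `load_settings` helper are hypothetical names, not the project's actual config code):

```python
import json
import os
from dataclasses import dataclass

@dataclass
class Settings:
    main_dir: str
    baseline_dir: str
    label_encoder: dict  # int index -> GRNTI class code
    model_version: str

def load_settings(environ=os.environ) -> Settings:
    with open(environ["GRNTI_LABEL_ENCODER"], encoding="utf-8") as f:
        raw = json.load(f)  # JSON object keys are strings; cast back to int
    return Settings(
        main_dir=environ["GRNTI_MAIN_DIR"],
        baseline_dir=environ["GRNTI_BASELINE_DIR"],
        label_encoder={int(k): v for k, v in raw.items()},
        model_version=environ.get("GRNTI_MODEL_VERSION", "dev"),
    )
```

Passing `environ` explicitly keeps the loader testable without touching the process environment.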
Full documentation (architecture, training runbook, serving guide, API reference) is published at https://kiselyovd.github.io/grnti-text-classifier/.
MIT — see LICENSE.