Dilemma is a diachronic Greek lemmatizer spanning Ancient Greek (Classical, Homeric, Hellenistic), Medieval/Byzantine Greek (both vernacular and literary), and Modern Greek (Demotic and Katharevousa). It combines multiple strategies into a unified pipeline:
- A 12.5M-form lookup table built from Wiktionary inflection tables, Wiktionary's Lua morphological modules applied to LSJ and Sophocles lexicon headwords, gold-standard treebanks (Perseus, PROIEL, Gorman, DiGreC), and annotated corpora (GLAUx, Diorisis, HNC)
- Dialect normalization for Ionic, Doric, Aeolic, and Koine orthographic variants, mapping dialectal forms to their Attic equivalents for lookup
- Surgical rule-based morphological analysis including augment stripping, reduplication removal, particle suffix resolution, elision expansion, and crasis decomposition - these handle systematic transformations that the lookup table cannot enumerate exhaustively (every word x every enclitic particle) and that a transformer might not generalize to for rare forms it has never trained on
- A small supervised character-level transformer (~4M parameters) trained on 3.5M explicit form-lemma pairs, used only for the ~5% of words not resolved by lookup or rules
- Convention remapping to match output lemmas to target dictionaries (LSJ, Cunliffe, Triantafyllidis, Wiktionary)
Most Greek words resolve instantly via the lookup table. For unseen forms, Dilemma falls back through rule-based morphological analysis and dialect normalization before reaching the transformer. The transformer is a character-level encoder-decoder, the standard architecture from SIGMORPHON shared tasks, and it learns morphological patterns directly from character sequences. At 4M parameters it trains from scratch in minutes, whereas fine-tuning approaches such as ByT5-small (300M params) take hours. Greek lemmatization is highly pattern-based, so a small specialized model can match a large general-purpose one, and the 12.5M-form lookup table handles the rest.
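The fallback order described above can be pictured as a chain of layers, each returning a lemma or nothing. The sketch below is illustrative only: the layer functions, the toy `LOOKUP` table, and the single elision rule are assumptions made for the example, not Dilemma's internals.

```python
import unicodedata
from typing import Callable, List, Optional


def nfc(text: str) -> str:
    """Normalize to NFC so visually identical Greek strings compare equal."""
    return unicodedata.normalize("NFC", text)


# Toy lookup table (hypothetical contents).
LOOKUP = {nfc(k): nfc(v) for k, v in {"θεούς": "θεός", "ἀλλά": "ἀλλά"}.items()}


def lookup_layer(form: str) -> Optional[str]:
    return LOOKUP.get(nfc(form))


def rule_layer(form: str) -> Optional[str]:
    # One illustrative rule: expand a trailing apostrophe with an accented vowel.
    if form.endswith("'"):
        for vowel in ("ά", "έ", "ί", "ό"):
            hit = LOOKUP.get(nfc(form[:-1] + vowel))
            if hit is not None:
                return hit
    return None


def model_layer(form: str) -> Optional[str]:
    # Placeholder for the character-level transformer; always gives up here.
    return None


LAYERS: List[Callable[[str], Optional[str]]] = [lookup_layer, rule_layer, model_layer]


def lemmatize(form: str) -> str:
    """Return the first layer's answer, or the input unchanged if all fail."""
    for layer in LAYERS:
        lemma = layer(form)
        if lemma is not None:
            return lemma
    return form
```

Cheap layers come first, so the expensive model only runs for forms nothing else can resolve.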
Most individual components of Dilemma are established techniques. What's novel is the combination and the scale:
- Multi-period Greek in one tool. No other lemmatizer covers Ancient, Byzantine, and Modern Greek in a single system. Tools like Morpheus handle only classical AG. Stanza handles AG or MG but not both. Dilemma resolves Katharevousa, vernacular medieval, and regional MG varieties (Cypriot, Cretan) alongside Homer and Herodotus.
- 12.5M-form lookup table. The largest compiled for Greek, built by applying Wiktionary's Lua inflection modules to LSJ and Sophocles headwords, then merging with the gold-standard treebanks (Perseus, PROIEL, Gorman, DiGreC) and annotated corpora (GLAUx, Diorisis, HNC) listed above. This is a data engineering contribution, not an algorithmic one.
- Dialect normalization. Systematic Ionic, Doric, Aeolic, and Koine orthographic mapping to Attic equivalents. No other Greek lemmatizer handles dialectal variation this way.
- Elision with consonant de-assimilation and frequency ranking. Recovers prepositions like κατά from assimilated forms like καθ' by reversing the aspiration rule, then ranks candidates by corpus frequency.
- Character-level transformer for Greek lemmatization. The SIGMORPHON encoder-decoder architecture is well established for morphological inflection in other languages, but applying it to Greek lemmatization (rather than inflection) appears to be new.
The remaining components are established techniques:
- The character-level encoder-decoder is the standard SIGMORPHON architecture for morphological inflection/reinflection tasks across many languages.
- The SQLite lookup with monotonic/stripped fallback keys is a straightforward hash table approach.
- Edit-distance spelling correction uses a BK-tree, a well-known data structure for metric-space nearest-neighbor search.
- Wiktionary as a data source for morphological data is widely used (kaikki.org, Lexonomy, etc.).
- Treebank integration (Perseus, PROIEL, Gorman) follows standard practice in computational linguistics.
Note on methodology: Dilemma is a supervised system. The transformer trains on 3.5M explicit form-to-lemma pairs from Wiktionary inflection tables, and the lookup table (which handles 95%+ of words) is literally a dictionary of correct answers. This is not unsupervised learning (pattern discovery from raw text with no labels).
SQLite backend: The lookup table loads from a pre-built SQLite database (instant startup, ~0.3s) instead of parsing 600MB of JSON (~11s). Falls back to JSON if the database isn't present.
ONNX support: For inference, ONNX Runtime (~50 MB) and PyTorch (~2 GB) produce identical results. If you already have PyTorch installed, it works fine. If you're starting fresh, ONNX is the lighter option. PyTorch is only required for training. The lookup table (which handles 95%+ of words) needs neither.
- Quick Start
- Evaluation
- How It Works
- API Reference
- Greek Coverage
- Development
- Data
- Architecture
- Projects using Dilemma
- Credits
- License
pip install "dilemma[onnx] @ git+https://github.com/ciscoriordan/dilemma.git"
python -m dilemma download
The first line installs the package plus ONNX Runtime (~50 MB, for unseen-form
inference). If you already have PyTorch installed, use dilemma[torch]
instead, or just plain dilemma to skip the model backend and rely on the
lookup table alone. The second line downloads the lookup tables and ONNX
model files from HuggingFace into ~/.cache/dilemma/ (~1.6 GB).
Dilemma looks for data in this order: $DILEMMA_DATA_DIR,
~/.cache/dilemma/data/, <repo-root>/data/ (when running from a clone),
and <package>/data/. Set DILEMMA_DATA_DIR to point at an existing copy
if you have one.
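The search order above amounts to returning the first existing directory from a fixed candidate list. A minimal sketch, assuming a hypothetical helper `find_data_dir` (the name and its parameters are illustrative, not part of Dilemma's API):

```python
import os
from pathlib import Path
from typing import List, Optional


def find_data_dir(repo_root: Optional[Path] = None,
                  package_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing data directory, following the documented order."""
    candidates: List[Path] = []
    env = os.environ.get("DILEMMA_DATA_DIR")
    if env:
        candidates.append(Path(env))                      # 1. explicit override
    candidates.append(Path.home() / ".cache" / "dilemma" / "data")  # 2. user cache
    if repo_root is not None:
        candidates.append(repo_root / "data")             # 3. git checkout
    if package_dir is not None:
        candidates.append(package_dir / "data")           # 4. installed package
    for path in candidates:
        if path.is_dir():
            return path
    return None
```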
To work from a git checkout (for development or to rebuild the data):
git clone https://github.com/ciscoriordan/dilemma.git && cd dilemma
pip install -e ".[onnx]"
python -m dilemma download # or use the build pipeline below
To build the data from scratch instead of downloading:
python build_data.py --download # downloads Wiktionary dumps, builds lookup tables
python build_lookup_db.py # builds SQLite DB for instant startup (optional)
python fix_selfmaps.py # fixes inflected forms that self-map (optional)
from dilemma import Dilemma
d = Dilemma() # all periods (default)
d.lemmatize("εσκότωσε") # "σκοτώνω"
d.lemmatize("πάθης") # "παθαίνω"
d.lemmatize_batch(["δώση", "σκότωσε"]) # ["δίνω", "σκοτώνω"]
# Elision expansion (AG elided forms resolved via Wiktionary lookup)
d.lemmatize("ἀλλ̓") # "ἀλλά"
d.lemmatize("ἔφατ̓") # "φημί"
d.lemmatize("δ̓") # "δέ"
d.lemmatize("ἐπ̓") # "ἐπί"
# Single period
d_mg = Dilemma(lang="el") # MG only (falls back to combined model if no el-specific model exists)
d_grc = Dilemma(lang="grc") # AG only
# Specific model scale
d = Dilemma(scale="test") # use test-scale model
# Treebank evaluation mode: resolve articles to ὁ, pronouns to ἐγώ/σύ
d_eval = Dilemma(resolve_articles=True)
d_eval.lemmatize("τῆς") # "ὁ" (not "τῆς")
d_eval.lemmatize("μοι") # "ἐγώ" (not "μοι")
# Byzantine text with orthographic normalization
d_byz = Dilemma(normalize=True, period="byzantine")
d_byz.lemmatize("θεω") # "θεός" (restores iota subscriptum)
By default, articles and pronoun clitics self-map (e.g. τῆς returns
τῆς). This is better for alignment pipelines where you want
surface-form matching. Set resolve_articles=True to resolve them
to canonical lemmas (ὁ, ἐγώ, σύ), matching treebank conventions
(AGDT, DiGreC, PROIEL). The triantafyllidis convention auto-enables
article resolution (articles to ο, skipping AG pronoun resolution
for forms like σε/με that are MG prepositions).
Different dictionaries and treebanks use different citation forms for
the same word. The convention parameter remaps Dilemma's output to
match a specific standard. This matters for benchmarking: a tool that
outputs εἰμί and a gold standard that expects είμαι will show
as an error even though both are correct for their respective
conventions.
| Convention | Target | Example mappings |
|---|---|---|
| None (default) | Wiktionary headwords | εἶπον→εἶπον, θεούς→θεός, σπήλαια→σπήλαιον |
| lsj | LSJ dictionary | εἶπον→λέγω, αἰνῶς→αἰνός, σπήλαιο→σπήλαιον |
| cunliffe | Cunliffe Homeric Lexicon | γίνεται→γίγνομαι, θέλει→ἐθέλω, νοῦν→νόος |
| triantafyllidis | Triantafyllidis MG dictionary | ὁ→ο, εἰμί→είμαι, σπήλαιον→σπήλαιο, εἷς→ένας |
The mapping is built automatically from data/lemma_equivalences.json
cross-referenced against the convention's headword list, with explicit
overrides in data/convention_{name}.json. Lemma equivalences also
group valid alternative lemmatizations (comparative/positive adjective
forms, active/deponent pairs, spelling variants) so that benchmarks
score them as correct rather than penalizing convention disagreements.
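As a rough illustration of how an equivalence list can be cross-referenced against a headword list, the sketch below picks, for each equivalence group, the one member that is a headword in the target dictionary. The function names and data layout are assumptions for the example; the actual file formats are not shown in this document.

```python
import unicodedata
from typing import Dict, Iterable, Optional, Set


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def build_remap(groups: Iterable[Set[str]],
                headwords: Set[str],
                overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Map every group member to the single member that is a target headword."""
    headwords = {nfc(h) for h in headwords}
    remap: Dict[str, str] = {}
    for group in groups:
        members = {nfc(m) for m in group}
        targets = members & headwords
        if len(targets) != 1:
            continue  # no headword, or ambiguous: leave for explicit overrides
        (target,) = targets
        for member in members - targets:
            remap[member] = target
    for source, target in (overrides or {}).items():
        remap[nfc(source)] = nfc(target)  # explicit overrides win
    return remap


def apply_convention(lemma: str, remap: Dict[str, str]) -> str:
    lemma = nfc(lemma)
    return remap.get(lemma, lemma)
```

Groups with zero or several matching headwords are skipped on purpose, which is where explicit per-convention overrides would come in.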
Note: data/lemma_equivalences.json covers Ancient Greek equivalences.
For Modern Greek, the lookup merge in build_data.py prefers EL
(Greek) Wiktionary lemma forms over EN (English) Wiktionary, since
EL uses the modern contracted forms that reflect actual usage
(τρώω, λέω) while EN tends toward fuller morphological stems
(τρώγω, λέγω). For the remaining cases where the two sources
use genuinely different lemmas, downstream consumers like
Lemma use a separate MG
equivalences file generated by cross-referencing
mg_lookup_scored.json against Wiktionary headwords, with corpus
frequency as a tiebreaker for the canonical form.
For individual form-to-lemma corrections where the lookup table returns
the wrong lemma due to ambiguity (e.g., proper nouns beating common
verbs), build_lookup_db.py has a _LOOKUP_OVERRIDES dict that
hard-corrects specific entries in the database.
Other tools (stanza, spaCy, CLTK) have fixed output conventions matching their training treebanks and cannot be remapped.
# LSJ lemma convention: remap output to LSJ dictionary headwords
d_lsj = Dilemma(convention="lsj")
d_lsj.lemmatize("αἰνῶς") # "αἰνός" (adverb -> adjective)
d_lsj.lemmatize("εἶπον") # "λέγω" (aorist -> present stem)
# Cunliffe convention: remap to Cunliffe Homeric Lexicon headwords
d_cun = Dilemma(convention="cunliffe")
d_cun.lemmatize("γίνεται") # "γίγνομαι" (Homeric form)
d_cun.lemmatize("θέλει") # "ἐθέλω" (Homeric form)
d_cun.lemmatize("νοῦν") # "νόος" (uncontracted Homeric form)
# Triantafyllidis convention: remap to Modern Greek monotonic forms
d_mg = Dilemma(convention="triantafyllidis")
d_mg.lemmatize("σπήλαια") # "σπήλαιο" (not σπήλαιον)
d_mg.lemmatize("Είναι") # "είμαι" (not εἰμί)
d_mg.lemmatize("εργαλεία") # "εργαλείο" (not ἐργαλεῖον)
d_mg.lemmatize("τα") # "ο" (not ὁ)
In the benchmark table, the first two Dilemma rows use the Wiktionary
convention. The convention="triantafyllidis" row auto-enables article
resolution (articles to ο, demonstratives to αυτός) and outputs
monotonic MG lemma forms. This is the recommended setting for Modern
Greek text.
Equiv-adjusted accuracy across four periods of Greek. All tools
evaluated with the same normalization (case-folded, accent-stripped)
and lemma equivalence groups (see data/benchmarks/bench_all.py).
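For readers who want to see what "equiv-adjusted" scoring means in practice, here is a minimal sketch: lemmas are case-folded and accent-stripped, and a prediction counts as correct if it equals the gold lemma or shares an equivalence group with it. This is an independent illustration under those stated assumptions, not the code in bench_all.py.

```python
import unicodedata
from typing import Dict, Iterable, List, Set


def normalize(lemma: str) -> str:
    """Case-fold and drop all combining marks (accents, breathings, subscripts)."""
    decomposed = unicodedata.normalize("NFD", lemma.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equiv_accuracy(predicted: List[str], gold: List[str],
                   groups: Iterable[Set[str]]) -> float:
    """Share of tokens whose lemma matches gold exactly or via an equivalence group."""
    group_of: Dict[str, int] = {}
    for i, group in enumerate(groups):
        for lemma in group:
            group_of[normalize(lemma)] = i
    if not gold:
        return 0.0
    correct = 0
    for p, g in zip(predicted, gold):
        p_n, g_n = normalize(p), normalize(g)
        same_group = p_n in group_of and group_of.get(p_n) == group_of.get(g_n)
        if p_n == g_n or same_group:
            correct += 1
    return correct / len(gold)
```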
Test sets:
- AG Classical: Sextus Empiricus, Pyrrhoniae Hypotyposes 1.1-1.8 (357 tokens, First1KGreek, CC BY-SA 4.0). Not in any UD treebank or Gorman.
- Byzantine: Swaelens et al. (2024) DBBE gold standard (8,342 tokens of unedited Byzantine epigrams, CC BY 4.0). Not in the training data of any tool in the table below.
- Katharevousa: Konstantinos Sathas, Neoelliniki Filologia (1868), biography of Bessarion (318 tokens, el.wikisource.org, public domain). No Katharevousa treebank exists.
- Demotic MG: Greek Wikipedia articles "Σπήλαιο Πετραλώνων" and "Ελαιόλαδο" (400 tokens, CC BY-SA 4.0). Not in any MG treebank. A separate dev set (251 tokens from "Μέλισσα" and "Σαμοθράκη") is also included at
data/benchmarks/demotic_dev.txt.
| Tool | AG Classical | Byzantine (literary) | Katharevousa | Demotic MG |
|---|---|---|---|---|
| spaCy el | -- | 31.7% | 44.6% | 79.9% |
| stanza el | -- | 37.4% | 48.4% | 87.0% |
| Swaelens et al. (2024) | -- | 65.8% | -- | -- |
| CLTK | 81.2% | 66.6% | 74.8% | -- |
| Morpheus (oracle) | -- | 71.1% | -- | -- |
| stanza grc | 92.2% | 71.3% | 85.2% | -- |
| Swaelens et al. (2025) | -- | ~74-75% | -- | -- |
| Dilemma (best convention per period) | 99.7% | 92.7% | 95.6% | 96.0%† |
†lang="el" with triantafyllidis scores 95.8%, nearly matching lang="all" (96.0%). For MG-only workloads, lang="el" with triantafyllidis is recommended since it avoids AG lemmas (e.g. σπήλαιον) being returned for MG words that have an AG lookalike.
Cells marked -- indicate the tool doesn't support that period or
wasn't tested. Morpheus "oracle" picks the best candidate from all
its analyses, representing the ceiling for rule-based morphology.
Dilemma detail by convention:
| Lang | Convention | POS | AG Classical | Byzantine (literary) | Katharevousa | Demotic MG |
|---|---|---|---|---|---|---|
| all | wiktionary (default) | -- | 99.7% | 92.7% | 95.6% | 79.0%* |
| all | wiktionary (default) | gold | -- | 92.6% | -- | -- |
| all | triantafyllidis | -- | 85.4% | 83.4% | 90.9% | 96.0%† |
| grc | wiktionary (default) | -- | 99.7% | 92.3% | 94.3% | 79.0%* |
| grc | triantafyllidis | -- | 87.4% | 86.9% | 89.9% | 90.0% |
| el | wiktionary (default) | -- | 93.3% | 86.7% | 92.5% | 73.0%* |
| el | triantafyllidis | -- | 85.4% | 82.6% | 89.9% | 95.8% |
*Demotic MG scores with wiktionary convention are convention mismatches, not real accuracy gaps: AG citation forms like σπήλαιον don't match the MG gold standard σπήλαιο. Using convention="triantafyllidis" fixes this.
lang="all" searches both AG and MG lookup tables for every token.
Other tools in the comparison table are locked to a single language.
The wiktionary convention outputs polytonic AG citation forms. The
triantafyllidis convention outputs monotonic MG lemma forms and is
the recommended setting for Modern Greek text (see
Conventions).
POS column: -- means Dilemma disambiguates on its own (default).
gold means gold-standard POS tags from the dataset are fed in. Only
DBBE provides gold POS; the negligible difference (92.7% vs 92.6%)
confirms POS ambiguity is not a significant error source.
The eval scripts (eval/eval_dbbe.py, eval/eval_digrec.py,
eval/eval_hnc.py, eval/bench_dbbe.py) provide per-POS breakdowns
and error categorization.
Following SIGMORPHON shared task methodology for out-of-vocabulary evaluation, we exclude the 3,000 most frequent Greek forms and capitalized words, then check whether the output lemma is a valid LSJ/Wiktionary headword. This tests the hard tail that matters for real texts.
| Text | Period | Morpheus | Stanza | Dilemma |
|---|---|---|---|---|
| Xenophon, Cyropaedia | Attic | 99.5% | 84% | 99.6% |
| Kresadlo, Astronautilia 13 | Epic | 74% | 74% | 84% |
| Herodotus, Histories | Ionic | 99.5% | 88% | 99.9% |
On Cyropaedia, gold accuracy vs Gorman treebank annotations is 93.2%. The remaining gap is convention differences (e.g. κτάομαι vs κτέομαι, ᾄδω vs ἀείδω), not missing forms. Herodotus gold-match accuracy (vs PROIEL annotations) is 95.3%, where the gap is almost entirely convention differences (Ionic vs Attic spelling, plural ethnonym lemmas, voice conventions), not missing forms or disambiguation failures.
On the DiGreC treebank (119K tokens,
Homer through 15th century Byzantine Greek), Dilemma reaches 93.7%
equiv-adjusted (90.3% strict). The gap accounts for convention
differences between annotation schemes (e.g. εἶπον/λέγω,
ἐγώ/ἡμεῖς).
eval_hnc.py evaluates against the
HNC Golden Corpus (88K tokens
of gold-standard Modern Greek from CLARIN:EL).
| Layer | Speed | Coverage | Source |
|---|---|---|---|
| Lookup table | hash lookup O(1) | 12.5M known forms | Wiktionary + LSJ + Sophocles + GLAUx + treebanks |
| Normalizer | k candidates O(k) | Byzantine orthographic variants | Rule-based candidate generation |
| Elision expansion | v=7 vowels O(v) | AG elided forms | Vowel expansion against lookup |
| Crasis table | hash lookup O(1) | ~50 common crasis forms | Hand-curated |
| Particle suffix stripping | suffix check O(1) | AG enclitic forms (-per, -ge, -de, deictic -i) | Strip suffix, re-lookup base form |
| Verb morphology stripping | prefix check O(1) | Unseen augmented/reduplicated verb forms | Strip augment/reduplication, re-lookup |
| Dialect normalization | k candidates O(k) | Ionic, Doric, Aeolic, Koine dialect forms | Map dialect forms to Attic equivalents |
| Compound decomposition | n=word length O(n) | Byzantine compound words | Split at linking vowel, look up base |
| Spelling correction | BK-tree O(d·m) | ED0-2 suggestions for unknown words | Accent-stripped edit distance |
| Transformer | beam search O(b·n²) | Generalizes to unseen forms | Trained on Wiktionary pairs |
The lookup table combines forms from multiple sources:
| Source | Forms | Notes |
|---|---|---|
| Wiktionary (EN + EL, all periods) | 5.2M | Baseline from kaikki.org dumps |
| LSJ (Liddell-Scott-Jones) | 4.2M | 32K nouns, 22K verbs, 14K adjectives expanded via Wiktionary Lua modules |
| Sophocles Lexicon (Byzantine/Patristic) | 1.0M | 13.5K nouns, 4.6K verbs, 1.5K adverbs from OCR'd TEI data |
| GLAUx (Keersmaekers, 2021) | 557K | 17M-token corpus, 8th c. BC - 4th c. AD, 98.8% lemma accuracy |
| Diorisis (Vatri & McGillivray, 2018) | 76K new | 10M-token corpus, Homer - 5th c. AD, 91.4% lemma accuracy. Low-priority pairs (only added when no conflict with existing sources). Also provides frequency data (27M combined tokens with GLAUx). |
| HNC Golden Corpus (CLARIN:EL) | 1K new | 88K-token gold-standard MG corpus, 11K unique form-lemma pairs. Low priority (only added when not in Wiktionary). Also used for MG evaluation. |
| PROIEL (UD treebank) | 33K | Herodotus gold-standard form-lemma pairs (expert-verified) |
| Perseus (UD treebank) | 42K | 178K tokens: Sophocles, Aeschylus, Homer, Hesiod, Herodotus, Thucydides, Plutarch, Polybius, Athenaeus |
| Gorman Treebanks (Gorman) | 79K | 687K-token corpus across Herodotus, Thucydides, Xenophon, Demosthenes, Lysias, Polybius, etc. Gold-standard single annotator. |
| DGE (Diccionario Griego-Espanol) | 52K | Headword filter coverage for spell-check |
| LGPN (Lexicon of Greek Personal Names) | 44K | Proper noun headword coverage |
| Perseus Digital Library (L&S, Pape, Bailly, etc.) | 176K | Headword filter from multiple classical lexica |
| Closed-class fixes | ~500 | Articles, pronouns, prepositions mapped to canonical lemmas |
The LSJ and Sophocles expansions use Wiktionary's own grc-decl and grc-conj Lua modules (via wikitextprocessor) to generate inflection paradigms from headwords with grammatical metadata. Cunliffe's Homeric Lexicon (~12K headwords) is not expanded this way because its headwords are a subset of LSJ and already covered by the LSJ expansion plus GLAUx Homeric corpus data (557K pairs).
The lookup table is built from Wiktionary kaikki dumps
(EN and EL editions for MG and AG, plus EL Medieval Greek), expanded with
inflected forms from LSJ (via Wiktionary Lua modules) and the Sophocles
lexicon of Roman and Byzantine Greek, then augmented with form-lemma pairs
from gold-standard treebanks (PROIEL, Gorman, AGDT). Each form is indexed under
its original, monotonic, and accent-stripped variants, so θεοὶ (polytonic
with grave), θεοί (monotonic with acute), and θεοι (stripped) all
resolve to θεός. Input can be polytonic, monotonic, or unaccented. AG
forms take priority over MG, ensuring classical lemma forms (βιβλίον,
φύσις, θεῖος) are preferred over their MG equivalents (βιβλίο, φύση,
θείο). Medieval Wiktionary entries are merged into the MG table at
build time. When lang="el" is used, 150K MG-specific entries
override the AG-first defaults with MG lemma forms (ο instead of ὁ,
είμαι instead of εἰμί). For polytonic input (breathings/circumflex),
an additional AG-only lookup pass runs first.
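The three-key indexing described above can be sketched with Unicode decomposition. This is a simplified illustration: the helper names are hypothetical, and the monotonic conversion here only drops breathings and iota subscript and turns grave/circumflex into acute, which is cruder than real monotonic orthography (monosyllables, for instance, are not handled).

```python
import unicodedata
from typing import Dict, Optional

DROP = {"\u0313", "\u0314", "\u0345"}   # smooth/rough breathing, iota subscript
TO_ACUTE = {"\u0300", "\u0342"}         # grave, circumflex (perispomeni)


def monotonic_key(form: str) -> str:
    """Polytonic -> rough monotonic: one acute-style accent, no breathings."""
    out = []
    for ch in unicodedata.normalize("NFD", form):
        if ch in DROP:
            continue
        out.append("\u0301" if ch in TO_ACUTE else ch)
    return unicodedata.normalize("NFC", "".join(out))


def stripped_key(form: str) -> str:
    """Remove every combining mark."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keys_for(form: str):
    return (unicodedata.normalize("NFC", form), monotonic_key(form), stripped_key(form))


def index_form(index: Dict[str, str], form: str, lemma: str) -> None:
    for key in keys_for(form):
        index.setdefault(key, lemma)   # earlier (higher-priority) entries win


def lookup(index: Dict[str, str], form: str) -> Optional[str]:
    for key in keys_for(form):
        if key in index:
            return index[key]
    return None
```

Using `setdefault` means whichever source is indexed first keeps the key, which is one simple way to express a priority order such as AG before MG.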
When the transformer handles an unseen form, beam search generates multiple candidates and picks the first that matches a known headword from the combined filter (headwords from Wiktionary self-maps, LSJ9 (119K entries + variants), Cunliffe's Homeric Lexicon (12K entries), DGE (52K entries), LGPN proper names (44K entries), and Perseus Digital Library headwords (176K from L&S, Pape, Bailly, etc.)). If nothing matches, the input is returned unchanged.
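The headword filter on beam output reduces to "first candidate that is a known headword, else the input". A minimal sketch, with a hypothetical function name and made-up beam candidates:

```python
from typing import Iterable, Set


def pick_lemma(form: str, beam_candidates: Iterable[str], headwords: Set[str]) -> str:
    """Return the highest-ranked beam candidate that is a known headword.

    beam_candidates is assumed to be ordered best-first. If none is a
    headword, the input form is returned unchanged.
    """
    for candidate in beam_candidates:
        if candidate in headwords:
            return candidate
    return form


# Illustrative values only; these are not real model outputs.
FORM = "ἐμόρμυρσεν"
BEAMS = ["μορμύρσω", "μορμύρω"]
HEADWORDS = {"μορμύρω", "λύω"}
print(pick_lemma(FORM, BEAMS, HEADWORDS))
```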
Wiktionary as upstream: Because Dilemma's lookup tables are built
directly from Wiktionary, any missing or incorrect lemmatization can
often be fixed by editing the Wiktionary entry itself. When the kaikki
dumps are next regenerated and build_data.py re-run, the fix flows
into Dilemma automatically. This means the coverage and accuracy of
Dilemma improve over time as Wiktionary's Greek coverage improves,
without any changes to Dilemma's code.
The rule-based morphological analysis fills the gap between the lookup table and the transformer:
Particle stripping (-περ, -γε, -δε, -ι): These are appended particles
that create forms the lookup table may never have seen. ὅσπερ is not a
separate word in Wiktionary - it is ὅς + -περ. The lookup table would
need to store every word x every enclitic particle combination. Stripping
is simpler.
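A stripped-down version of particle stripping might look like the following. It is a sketch under simplifying assumptions: it ignores accent shifts on the base word, only repairs medial/final sigma, and uses a toy lookup table; the function names are hypothetical.

```python
import unicodedata
from typing import Dict, Optional, Tuple

PARTICLES = ("περ", "γε", "δε", "ι")   # suffixes named in the text above


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def fix_final_sigma(stem: str) -> str:
    """A medial sigma exposed at the end of the stem becomes a final sigma."""
    return stem[:-1] + "ς" if stem.endswith("σ") else stem


def strip_particle(form: str, lookup: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (lemma, particle) if removing a particle yields a known form."""
    form = nfc(form)
    for particle in PARTICLES:
        if form.endswith(particle) and len(form) > len(particle):
            base = fix_final_sigma(form[: -len(particle)])
            if base in lookup:
                return lookup[base], particle
    return None
```

For ὅσπερ, removing -περ leaves ὅσ, which becomes ὅς after the sigma fix and can then be looked up normally.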
Augment/reduplication stripping: A rare verb's aorist (e.g.,
ἐμόρμυρσεν) might not appear in any corpus or Wiktionary table, but the
present stem μορμύρω is there. Stripping the augment and tense markers
recovers the connection. The transformer might learn this pattern, but an
explicit rule is more reliable for rare verbs it has never trained on.
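To make the augment rule concrete, the sketch below works on accent-stripped forms: it removes a leading syllabic augment ε-, swaps a sigmatic aorist ending for -ω, and checks the result against known lemmas. The ending table is a deliberately tiny assumption for illustration; real verb morphology (temporal augment, reduplication, contract verbs) needs far more cases.

```python
import unicodedata
from typing import Dict, Iterable, Optional

# (inflected ending, present 1sg ending): hypothetical, minimal rule set
ENDINGS = [("σεν", "ω"), ("σε", "ω"), ("σα", "ω")]


def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_lemma_table(lemmas: Iterable[str]) -> Dict[str, str]:
    """Index lemmas by their accent-stripped spelling."""
    return {strip_accents(lemma): lemma for lemma in lemmas}


def unaugment(form: str, lemma_table: Dict[str, str]) -> Optional[str]:
    """Strip a syllabic augment and an aorist ending, then re-lookup."""
    bare = strip_accents(form)
    if not bare.startswith("ε"):
        return None
    stem = bare[1:]
    for ending, replacement in ENDINGS:
        if stem.endswith(ending):
            candidate = stem[: -len(ending)] + replacement
            if candidate in lemma_table:
                return lemma_table[candidate]
    return None
```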
Elision/crasis: δ' needs to expand to δέ, κἀγώ needs to
decompose to καὶ ἐγώ. These are mechanical text artifacts, not
morphology - the transformer would waste capacity learning them.
In short: the rules handle systematic, predictable transformations that the lookup table cannot enumerate exhaustively and the transformer might not generalize to for rare forms. They are the cheap, reliable middle layer between brute-force lookup and expensive ML inference.
For Ancient Greek dialect texts (Herodotus, Pindar, Sappho, etc.), the normalizer maps dialect-specific forms to their Attic equivalents so the Attic-heavy lookup table can match them.
Ionic (highest coverage): η/ᾱ alternation after ε, ι, ρ (ἱστορίης → ἱστορίας), uncontracted vowels (ποιέειν → ποιεῖν, τιμέω → τιμῶ), κ/π interrogative interchange (κῶς → πῶς, ὅκου → ὅπου), σσ/ττ alternation (θάλασσα → θάλαττα), ρσ/ρρ alternation (θάρσος → θάρρος), and common word mappings (μοῦνος → μόνος, ξεῖνος → ξένος, κεῖνος → ἐκεῖνος).
Doric: ᾱ/η alternation (Ἀθάνα → Ἀθήνη), word mappings (ποτί → πρός, τύ → σύ), Doric futures (-σέω → -σω).
Aeolic: smooth breathing mapped back to Attic rough breathing (Aeolic systematically drops the rough breathing, a feature known as psilosis).
Koine: σσ/ττ alternation (overlaps with the Ionic rule above and with the period-specific normalization rules in the orthographic normalizer).
d = Dilemma(dialect="ionic") # Ionic texts
d = Dilemma(dialect="doric") # Doric texts
d = Dilemma(dialect="auto") # try all dialects
d = Dilemma(dialect="ionic", period="hellenistic") # combined
Dialects can be combined with period profiles. Setting dialect
implicitly enables the normalizer (no need for normalize=True).
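A few of the Ionic rules listed above can be expressed as whole-word mappings plus substring rewrites that generate Attic candidates for lookup. The sketch covers only rules named in this section, and the function and table names are assumptions for the example.

```python
import unicodedata
from typing import List


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Whole-word Ionic -> Attic mappings mentioned above.
IONIC_WORDS = {nfc(k): nfc(v) for k, v in {
    "μοῦνος": "μόνος",
    "ξεῖνος": "ξένος",
    "κεῖνος": "ἐκεῖνος",
}.items()}

# Consonant-cluster alternations (Ionic spelling, Attic spelling).
IONIC_REWRITES = [("σσ", "ττ"), ("ρσ", "ρρ")]


def ionic_candidates(form: str) -> List[str]:
    """Generate Attic candidate spellings for an Ionic form."""
    form = nfc(form)
    candidates: List[str] = []
    if form in IONIC_WORDS:
        candidates.append(IONIC_WORDS[form])
    for ionic, attic in IONIC_REWRITES:
        if ionic in form:
            candidates.append(form.replace(ionic, attic))
    return candidates
```

Each candidate would then be checked against the lookup table; candidates that match nothing are simply discarded.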
For texts with non-standard spelling, Dilemma includes an optional orthographic normalizer that generates candidate normalized forms before lookup. This handles:
- Itacism: η/ει/οι/υ all pronounced [i] and interchanged by scribes
- αι/ε merger: αι pronounced [e] and confused with ε
- ο/ω confusion: loss of vowel length distinction
- Missing iota subscripta: ᾳ/ῃ/ῳ written as α/η/ω
- Spirantization: β/υ interchange, φ/π, θ/τ, χ/κ confusion
- Geminate simplification: λλ→λ, νν→ν, etc.
Period-specific profiles (hellenistic, late_antique, byzantine) weight rules by historical probability.
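Candidate generation for scribal confusions can be sketched as single substitutions within confusion groups, filtered against known forms. The example below operates on unaccented forms, covers three of the confusions listed above, and does not model the period-specific weighting; all names are illustrative.

```python
from typing import List, Set

# Spellings that scribes interchanged (subset of the list above).
CONFUSION_GROUPS = [
    ("η", "ει", "οι", "υ"),   # itacism
    ("αι", "ε"),              # αι/ε merger
    ("ο", "ω"),               # loss of vowel length
]


def variants(form: str) -> Set[str]:
    """All forms reachable by one substitution inside a confusion group."""
    out: Set[str] = set()
    for group in CONFUSION_GROUPS:
        for src in group:
            start = form.find(src)
            while start != -1:
                for dst in group:
                    if dst != src:
                        out.add(form[:start] + dst + form[start + len(src):])
                start = form.find(src, start + 1)
    return out


def normalize_candidates(form: str, known_forms: Set[str]) -> List[str]:
    """Keep only the variants that exist in the lookup table."""
    return sorted(variants(form) & known_forms)
```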
d = Dilemma(normalize=True, period="byzantine")
The transformer is a small (~4M param) character-level encoder-decoder,
the standard architecture from
SIGMORPHON morphological inflection
shared tasks. It learns character-level patterns and generalizes to forms
not in Wiktionary. Training on MG + AG + Medieval data means the model
sees AG augment patterns (ἔλυσε → λύω) alongside MG stem
transformations (σκότωσε → σκοτώνω). For Katharevousa forms like
εσκότωσε, it has both signals to draw from.
from dilemma import Dilemma
d = Dilemma() # all periods (default)
d_mg = Dilemma(lang="el") # MG only
d_grc = Dilemma(lang="grc") # AG only
# LSJ lemma convention
d_lsj = Dilemma(convention="lsj")
# Cunliffe convention
d_cun = Dilemma(convention="cunliffe")
# Triantafyllidis convention (recommended for MG)
d_mg = Dilemma(convention="triantafyllidis")
For ambiguous forms, lemmatize_verbose returns all candidates with
metadata so downstream tools can disambiguate using context:
from dilemma import Dilemma
d = Dilemma()
# Proper noun vs common noun: Ἔρις (goddess) vs ἔρις (strife)
candidates = d.lemmatize_verbose("ἔριδι")
for c in candidates:
    print(f"{c.lemma:10s} lang={c.lang} proper={c.proper} via={c.via}")
# Ἔρις lang=grc proper=True via=exact
# Multiple language matches
candidates = d.lemmatize_verbose("πόλεμο")
# -> [LemmaCandidate(lemma="πόλεμος", lang="el", ...),
# LemmaCandidate(lemma="πόλεμος", lang="grc", ...)]
# Elision with multiple valid expansions
candidates = d.lemmatize_verbose("δ̓")
# -> [LemmaCandidate(lemma="δέ", source="elision", via="elision:ε"),
# LemmaCandidate(lemma="δή", source="elision", via="elision:η"), ...]
Article-agreement disambiguation: When multiple candidates exist, pass the preceding word to rank by gender/number agreement with a Greek article:
# Prefer candidates matching the masculine article τοῦ
candidates = d.lemmatize_verbose("λόγου", prev_word="τοῦ")
# -> masculine λόγος ranked before proper Λόγος
This only re-ranks candidates, never excludes them. If the preceding word is not a recognized article form, it has no effect.
Each LemmaCandidate has:
- lemma: the lemma string
- lang: "el" (MG, including medieval), "grc" (AG), "med" (medieval provenance label in output)
- proper: True if lemma is a proper noun (capitalized headword)
- source: "lookup", "elision", "crasis", "particle_strip", "verb_morphology", "compound", "model", "identity"
- via: how it matched: "exact", "lower", "elision:ε", "suffix_strip", "augment_strip", "θεο+φθόγγος", "+case_alt", etc.
- score: 1.0 for lookup, 0.5 for model, 0.0 for identity fallback
When processing a large corpus (thousands of words), call preload() to
enable query-level caching on the SQLite lookup tables. This avoids
repeated SQLite round trips for forms that appear multiple times:
d = Dilemma()
d.preload() # enable query cache - ~40x faster for repeated lookups
for word in corpus:
    d.lemmatize_verbose(word) # second lookup of same form is instant
preload() is safe to call multiple times (idempotent) and does not
change output - it only affects performance. It caches query results
on demand rather than loading the full 12.5M-entry table into memory.
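On-demand query caching of this kind can be illustrated with `functools.lru_cache` wrapped around a SQLite query. The table name, schema, and function below are assumptions made for the sketch; they are not Dilemma's actual database layout.

```python
import sqlite3
from functools import lru_cache
from typing import Optional

# In-memory stand-in for the lookup database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (form TEXT PRIMARY KEY, lemma TEXT)")
conn.execute("INSERT INTO lookup VALUES (?, ?)", ("θεούς", "θεός"))
conn.commit()


@lru_cache(maxsize=None)
def cached_lemma(form: str) -> Optional[str]:
    """Query SQLite once per distinct form; repeats are served from memory."""
    row = conn.execute(
        "SELECT lemma FROM lookup WHERE form = ?", (form,)
    ).fetchone()
    return row[0] if row else None
```

Only forms that actually occur in the corpus end up in memory, so the cache stays proportional to the corpus vocabulary rather than to the full table.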
When a POS tagger (e.g. Opla)
provides UPOS tags, lemmatize_pos uses POS to disambiguate between
multiple candidates from the regular lookup:
d = Dilemma()
d.lemmatize_pos("αὐτοῦ", "ADV") # "αὐτοῦ" (adverb: here/there)
d.lemmatize_pos("αὐτοῦ", "PRON") # "αὐτός" (pronoun: genitive)
d.lemmatize_pos("ἄκρα", "NOUN") # "ἄκρον" (noun: summit)
d.lemmatize_pos("ἄκρα", "ADJ") # "ἄκρος" (adjective: outermost)
POS disambiguates rather than overrides: the regular lookup runs first to produce all valid candidates, and POS selects among them only when there are multiple options. When a form has just one candidate, POS is ignored, ensuring POS-aware lemmatization never produces worse results than the baseline.
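The "disambiguate, don't override" policy can be stated in a few lines. In this sketch the candidate list is a plain list of (lemma, POS) pairs in baseline order; that representation and the function name are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def pick_by_pos(candidates: List[Tuple[str, str]], upos: str) -> Optional[str]:
    """Use POS only to choose among several candidates from the lookup.

    - no candidates: nothing to return
    - one candidate: return it, ignoring POS entirely
    - several: first one whose POS matches, else the baseline first choice
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][0]
    for lemma, pos in candidates:
        if pos == upos:
            return lemma
    return candidates[0][0]
```

Because the fallback in every branch is the baseline's own first choice, a wrong or unknown POS tag can never produce a lemma the baseline would not have considered.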
With convention="triantafyllidis" or lang="el", POS tags also fix MG
self-map issues for adjective and verb inflections. MG lookup tables
sometimes return self-maps for inflected forms (e.g. ανθρώπινα maps to
itself instead of ανθρώπινος). When POS is ADJ, the masculine nominative
citation form (-ος, -ής, -ύς) is preferred. When POS is VERB, the
infinitive/1sg form (-ω, -ώ, -μαι) is preferred. Adverbs and nouns keep
their MG self-maps unchanged.
The POS lookup tables (435K AG-only entries, 482K combined) are built from six sources in priority order: UD treebanks (gold), LSJ9 indeclinables (2.2K adverbs, prepositions, conjunctions, particles, interjections with unambiguous POS), GLAUx corpus (8.7K entries), MG Wiktionary, AG Wiktionary, LSJ9 grammar. For polytonic input (breathing marks, circumflex), the AG-only POS entries are checked first to avoid MG lemma overrides on Ancient Greek text, mirroring the main lookup's AG-first logic.
For unknown or misspelled words, suggest_spelling returns candidate
corrections from the lookup table ranked by edit distance:
d = Dilemma()
d.suggest_spelling("θεός") # [("θεός", 0), ...] (exact match)
d.suggest_spelling("θεος") # [("θεός", 0), ...] (diacritic error = free)
d.suggest_spelling("θδός") # [("θεός", 1), ...] (letter-level ED1)
The approach works in two layers. First, diacritics are stripped from both the input and the dictionary, collapsing the 12.5M-entry lookup into ~1-3M unique base forms. ED0/ED1/ED2 matches are found on these stripped forms, then expanded back to their original polytonic variants and ranked by true Levenshtein distance. This means accent and breathing errors (wrong accent, missing breathing mark) are corrected for free, while letter-level errors (θ/δ, ρ/ν) use standard edit distance. The spell index is built lazily on first call.
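The two-layer idea (match on stripped forms, then expand back to the original spellings) can be sketched as follows. For brevity this example scans the index linearly instead of using a BK-tree and reports the distance between stripped forms; the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def build_spell_index(forms: Iterable[str]) -> Dict[str, List[str]]:
    """Group original forms under their accent-stripped base form."""
    index: Dict[str, List[str]] = defaultdict(list)
    for form in forms:
        index[strip_accents(form)].append(form)
    return index


def suggest(word: str, index: Dict[str, List[str]],
            max_distance: int = 2) -> List[Tuple[str, int]]:
    bare = strip_accents(word)
    hits: List[Tuple[str, int]] = []
    for base, originals in index.items():
        distance = levenshtein(bare, base)
        if distance <= max_distance:
            hits.extend((original, distance) for original in originals)
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```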
By default, suggestions include all forms in the lookup table (inflected forms and lemmata from all sources). Two filtering options reduce false positives when resolving to a specific dictionary:
# Only return known LSJ headwords (strictest - 152K entries)
d.suggest_spelling("ἀγωνιστήριον", max_distance=1, headwords_only="lsj")
# Only return lemmata/citation forms (less strict - ~700K entries)
d.suggest_spelling("ἀγωνιστήριον", max_distance=1, lemmata_only=True)
You can also check headword membership directly:
d.is_headword("θεός") # True (LSJ headword)
d.is_headword("θεοί") # False (inflected form, not a headword)
d.is_headword("θεός", "cunliffe") # check against Cunliffe headwords
Ancient Greek texts frequently elide final vowels before a following vowel, marking the elision with an apostrophe (U+0313 in polytonic encoding, U+02B9/U+02BC/U+1FBF/U+2019 in other encodings). Dilemma resolves these by stripping the elision mark and trying each Greek vowel against the lookup table:
| Elided | Expanded | Lemma |
|---|---|---|
| ἀλλ̓ | ἀλλά | ἀλλά |
| δ̓ | δέ | δέ |
| τ̓ | τε | τε |
| ἐπ̓ | ἐπί | ἐπί |
| ἔφατ̓ | ἔφατο | φημί |
| κατ̓ | κατά | κατά |
| καθ᾿ | κατά | κατά |
| ἀφ᾿ | ἀπό | ἀπό |
| βάλλ̓ | βάλλε | βάλλω |
Consonant de-assimilation: Before rough breathing, Greek assimilates
voiceless stops to aspirates (τ->θ, π->φ, κ->χ). The elision expander
reverses this: καθ᾿ tries both καθ- and κατ-, ἀφ᾿ tries both
ἀφ- and ἀπ-, recovering prepositions like κατά and ἀπό.
Frequency ranking: When multiple expansions match the lookup table, candidates are ranked by corpus frequency (from GLAUx), so common prepositions like κατά always beat obscure verbs like κάθω. Function words are further prioritized when the stem matches a known elision pattern, and proper nouns are deprioritized.
Polytonic input automatically restricts expansion to the AG lookup table, avoiding false matches from MG monotonic forms.
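Putting the three ideas together (vowel expansion, consonant de-assimilation, frequency ranking), a compact sketch could look like this. It matches on accent-stripped forms, and the frequencies in `ENTRIES` are invented placeholders, not GLAUx counts; the function names are hypothetical.

```python
import unicodedata
from typing import Dict, Iterable, List, Tuple

ELISION_MARKS = "'\u0313\u02b9\u02bc\u1fbf\u2019"
VOWELS = "αεηιουω"                            # the seven Greek vowels
DEASPIRATE = {"θ": "τ", "φ": "π", "χ": "κ"}   # undo assimilation before rough breathing


def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_table(entries: Iterable[Tuple[str, str, int]]) -> Dict[str, List[Tuple[str, int]]]:
    """Map accent-stripped form -> [(lemma, corpus frequency), ...]."""
    table: Dict[str, List[Tuple[str, int]]] = {}
    for form, lemma, freq in entries:
        table.setdefault(strip_accents(form), []).append((lemma, freq))
    return table


def expand_elision(form: str, table: Dict[str, List[Tuple[str, int]]]) -> List[str]:
    """Return candidate lemmas for an elided form, most frequent first."""
    stem = form.rstrip(ELISION_MARKS)
    if stem == form or not stem:
        return []                              # no elision mark present
    stem = strip_accents(stem)
    stems = [stem]
    if stem[-1] in DEASPIRATE:                 # καθ- also tries κατ-
        stems.append(stem[:-1] + DEASPIRATE[stem[-1]])
    hits: List[Tuple[str, int]] = []
    for candidate_stem in stems:
        for vowel in VOWELS:
            hits.extend(table.get(candidate_stem + vowel, []))
    hits.sort(key=lambda hit: -hit[1])
    return [lemma for lemma, _ in hits]


# (form, lemma, frequency): frequencies are made up for the example.
ENTRIES = [("κατά", "κατά", 1000), ("κάθω", "κάθω", 1),
           ("δέ", "δέ", 5000), ("δή", "δή", 800)]
```

With this toy table, καθ᾿ yields both κατά and κάθω, and the frequency sort puts the preposition first.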
| Code | Period | ISO standard |
|---|---|---|
| el | Modern Greek (including vernacular medieval, Katharevousa, regional) | ISO 639-1 |
| grc | Ancient Greek (Homer through Byzantine literary Greek) | ISO 639-2 |
Code and API calls use ISO 639 language codes: el for Modern Greek
and grc for Ancient Greek. In English text we often use the
shorthands MG (Modern Greek) and AG (Ancient Greek).
For Dilemma's purposes, MG (el) includes Katharevousa, even though
Katharevousa often benefits from AG lemmatization due to its archaizing
vocabulary and morphology. Medieval/Byzantine Greek has two components:
vernacular medieval Greek (ancestor of Modern Greek, merged into el)
and literary Byzantine Greek (classicizing, Atticist-influenced, resolved
via the AG lookup under grc).
For lemmatization, the two-way split works because Byzantine literary
Greek is classicizing (handled by grc), while vernacular medieval
Greek is the ancestor of Modern Greek (handled by el). The med
label still appears in LemmaCandidate.lang for forms from the
medieval Wiktionary dump, but these are merged into the el lookup
at build time.
Note: Opla (POS tagging +
dependency parsing) uses lang="grc" for Byzantine text. Byzantine
literary syntax (polytonic, full case system, optative mood) is closer
to Ancient Greek, so the AG-trained POS tagger handles it well.
| Variety | Wiktionary-tagged headwords |
|---|---|
| Standard Modern Greek (SMG/Demotic) | 877K entries (core) |
| Katharevousa | 283+ tagged, hundreds more formal/place terms |
| Cretan | 273 |
| Cypriot | 199 |
| Heptanesian (Ionian) | 18 |
| Maniot | 3 |
| Medieval/Byzantine (vernacular) | 3K (merged into MG - vernacular medieval is the ancestor of MG; literary Byzantine is Atticist-influenced and resolves via the AG lookup, not this table) |
| Variety | Wiktionary-tagged headwords |
|---|---|
| Epic/Homeric | 3,755 |
| Ionic | 1,638 |
| Attic | 1,279 |
| Koine | 1,209 |
| Byzantine (literary) | 496 |
| Doric | 456 |
| Aeolic | 163 |
| Laconian | 52 |
| Boeotian | 15 |
| Arcadocypriot | 11 |
The counts above are Wiktionary headwords explicitly labeled with a dialect tag. Each headword generates a full inflection paradigm (10-40 forms for verbs, 4-8 for nouns), so Wiktionary-derived form coverage is much larger than the headword count suggests.
However, Wiktionary tags are only a fraction of Dilemma's actual dialect coverage. Corpus-derived form-lemma pairs add substantially more: GLAUx contributes 76K Ionic pairs from Herodotus and the Hippocratic corpus, PROIEL adds 33K gold-standard Herodotus pairs, and Gorman adds 79K pairs across Herodotus, Thucydides, Xenophon, Demosthenes, and others. The dialect normalization layer (Ionic, Doric, Aeolic, Koine) then maps remaining dialectal forms to their Attic equivalents for lookup, catching forms that no corpus or dictionary has catalogued.
Katharevousa forms are the primary non-SMG target for Modern Greek - they mix AG morphology (augments, 3rd declension genitives) with MG vocabulary. The strong Epic/Homeric coverage (3,755 tagged headwords plus extensive GLAUx corpus data) is directly relevant for literary texts based on Homer.
Medieval/Byzantine Greek has two distinct registers that Dilemma handles
differently. Vernacular medieval forms are merged into Modern Greek
(el) since they are the direct ancestor of MG. Literary Byzantine
forms are classicizing and resolve via the AG (grc) lookup.
EL Wiktionary's "Medieval Greek"
category (6,735 entries, 2,685 headwords) is roughly 71% vernacular
and 29% literary Byzantine, based on presence of polytonic diacritics:
- Vernacular (~71%): δέρνω, θυμώνω, χτενίζω, βρίσκω, γούνα, ναράντζι, βουρκόλακας, ξεχαρβαλώνω - early MG vocabulary
- Literary Byzantine (~29%): ἀποφθέγγομαι, αἰθεροπόρος, περικαλλής, κριθάλευρον - Atticist-influenced forms
- Medieval-specific: μαξιλάριν, ἀδελφάτον, κασσίδιον, ἴνδικτος, γαστάλδος - neither pure AG nor modern MG
Merging all into el works because the AG lookup runs first. The 29%
literary forms typically already exist in the AG table and resolve
there; only the vernacular and medieval-specific forms actually fall
through to the MG lookup. On the DBBE benchmark, only 2 of 8,342
tokens resolved via the medieval table, while 92.8% came from the AG
lookup.
git clone https://github.com/ciscoriordan/dilemma.git && cd dilemma
pip install -r requirements.txt
python build_data.py --download
python build_lookup_db.py # SQLite for instant startup
python fix_selfmaps.py # fixes inflected forms that self-map
python train.py # full scale (~45 min on RTX 2080)
python export_onnx.py # optional: enable PyTorch-free inference
Downloads all 5 kaikki dumps and extracts every form-lemma pair from inflection tables. Non-Greek characters are filtered out.
pip install -r requirements.txt
python build_data.py --download # downloads + extracts (~1.5GB total)
Trains the character-level transformer on the extracted pairs. Use
--scale to control the training size.
python train.py --scale test # quick sanity check (20K pairs, ~15 sec)
python train.py --scale full # all data (~45 min on RTX 2080, default)
python train.py # same as --scale full
Legacy --scale 1/2/3 flags are still accepted for compatibility.
Every scale includes 100% of non-standard varieties (Medieval, Katharevousa, Cypriot, Cretan, Maniot, Heptanesian, archaic, dialectal). The remaining budget is split 50/50 between Ancient Greek and standard MG. Underrepresented tense categories are oversampled to compensate for their rarity in Wiktionary's paradigm tables, following Swaelens et al. (2025)'s finding that perfects are underrepresented in training data relative to Byzantine text. Aorist forms (3x, critical for stem-changing 2nd aorist), perfect (3x), future (3x), imperfect (2x), and pluperfect (5x, rarest at 0.15% of pool) are oversampled proportionally to their rarity and the degree of stem change from the present form.
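As a rough illustration of the tense oversampling described above, the sketch below repeats each training pair by the factor quoted for its tense. The `oversample` helper and the `(form, lemma, tense)` tuple layout are hypothetical; only the multipliers come from the text.

```python
# Repetition factors quoted in the paragraph above; other tenses keep 1x.
OVERSAMPLE = {
    "aorist": 3,
    "perfect": 3,
    "future": 3,
    "imperfect": 2,
    "pluperfect": 5,
}


def oversample(pairs):
    """Repeat each (form, lemma) pair according to its tense factor."""
    out = []
    for form, lemma, tense in pairs:
        out.extend([(form, lemma)] * OVERSAMPLE.get(tense, 1))
    return out


pairs = [
    ("ἔλαβον", "λαμβάνω", "aorist"),
    ("ἐλελύκει", "λύω", "pluperfect"),
    ("λύει", "λύω", "present"),
]
print(len(oversample(pairs)))
```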
| Scale | Training pairs | Varieties | AG | SMG | Time (RTX 2080) |
|---|---|---|---|---|---|
| test | 20K | 9K (100%) | 5.5K | 5.5K | ~15 sec |
| full | 3.5M (all) | 9K (100%) | 1.5M (100%) | 1.7M (100%) | ~95 min |
Models save to model/{lang}-test/ (test scale) or model/{lang}/
(full scale).
Eval accuracy is the model's score on held-out pairs without the lookup table. In practice, the lookup resolves most forms instantly and the model only handles truly novel words. When the model is used, beam search generates 4 candidates and the first one that matches a known headword in the lookup wins. If none match, the input is returned unchanged (safe fallback).
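The headword filter over beam candidates amounts to a first-match scan with a safe fallback. A minimal sketch, assuming the beam output is an ordered list of strings and the headwords are available as a set (`pick_lemma` and both argument names are hypothetical):

```python
def pick_lemma(form, beam_candidates, headwords):
    """Return the first beam candidate that is a known headword.

    Falls back to the unchanged input form when no candidate matches.
    """
    for candidate in beam_candidates:
        if candidate in headwords:
            return candidate
    return form


headwords = {"λύω", "λόγος"}
print(pick_lemma("ἔλυσα", ["λύσω", "λύω", "λύσα", "ἔλυσα"], headwords))
print(pick_lemma("ξψζ", ["ξψζω", "ξψζος", "ξψζα", "ξψζη"], headwords))
```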
When training pairs include POS tags (from Wiktionary) and morphological features (from GLAUx), the model jointly predicts POS, nominal morphology (gender/number/case, 45 labels), and verbal morphology (tense/mood/voice, 69 labels) alongside the lemma via auxiliary classification heads on the encoder output. This follows Swaelens et al. (2025)'s finding that multi-task learning (joint POS + morphology + lemma) improved Byzantine Greek lemmatization by ~9 percentage points. Each auxiliary loss is weighted at 0.1x relative to the lemmatization loss. At full scale, the heads reach 90.4% POS, 81.5% nominal, and 91.2% verbal accuracy on the held-out set.
Training uses a linear warmup LR scheduler (500 steps warmup, then linear decay) and gradient clipping (max norm 1.0) for stable convergence.
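For reference, a linear warmup followed by linear decay can be written as a pure function of the step number. This is a generic sketch of that schedule, not a copy of train.py; `lr_scale` and `total_steps` are hypothetical names, and only the 500 warmup steps come from the text.

```python
def lr_scale(step, total_steps, warmup_steps=500):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Ramp linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 at total_steps.
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


for step in (0, 250, 500, 5250, 10000):
    print(step, lr_scale(step, total_steps=10000))
```

In PyTorch a function like this can be passed to `torch.optim.lr_scheduler.LambdaLR`, and `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` provides the gradient clipping mentioned above.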
To regenerate the expanded lookup table from LSJ and Sophocles sources:
pip install --force-reinstall --no-deps git+https://github.com/tatuylonen/wikitextprocessor.git
python build/expand_lsj.py --setup # build Wiktionary Lua module database
python build/expand_lsj.py --expand # expand LSJ nouns
python build/expand_lsj.py --expand-verbs # expand LSJ verbs
python build/expand_sophocles.py --expand # expand Sophocles nouns
python build/expand_sophocles.py --expand-verbs # expand Sophocles verbs
This requires LSJ9 data from lsj9
(included in data/lsjgr_bridges.json and data/lsj9_frequency.json) and
the Sophocles TEI data (included in data/sophocles/).
Generates ONNX model files so inference works without PyTorch.
python export_onnx.py # exports encoder.onnx + decoder_step.onnx
Tests run automatically via GitHub Actions on push and pull request to
main, using a self-hosted runner with GPU access. CI downloads data
files from HuggingFace (lookup.db, spell_index.db, model weights).
python -m pytest tests/ -v # run all tests via pytest (recommended)
python tests/test_integrity.py # data integrity + model inference checks
python tests/test_dilemma.py # lookup table + end-to-end lemmatization tests
python tests/test_dilemma.py --lookup-only # skip model tests
tests/test_comprehensive.py is the main pytest test suite (263 tests)
covering core lemmatization, particle suffix stripping, verb morphology
stripping, article-agreement disambiguation, crasis resolution, elision
handling, orthographic normalization, dialect normalization (Ionic, Doric,
Aeolic, Koine), convention switching, language filtering, spelling
suggestions, batch operations, PROIEL/Gorman treebank pairs, and edge
cases.
tests/test_integrity.py runs 7 structural checks: ONNX/vocab dimension
match, DB table presence, model load, inference, and ONNX/PyTorch
parity. tests/test_dilemma.py validates lookup correctness and known
form-lemma pairs across Greek varieties.
rank_forms.py produces per-lemma ranked form lists, sorted by corpus
frequency. This is useful for downstream consumers (e.g.
Lemma) that need to know which
inflections of a word are most common.
python rank_forms.py --lang el # Modern Greek (default)
python rank_forms.py --lang grc # Ancient Greek
python rank_forms.py --lang mgr # Medieval/Byzantine Greek
python rank_forms.py --lang all # All three
For each language, the script produces:
- {prefix}_ranked_forms.json - lemma to list of forms sorted by frequency
- {prefix}_form_freq.json - form to raw frequency count
Frequency sources (used for primary ranking):
- MG: FrequencyWords/OpenSubtitles (1.49M forms)
- AG: GLAUx + Diorisis corpus token frequencies (27M combined tokens)
- Medieval: AG corpus frequencies as proxy
Additional corpora available in --verbose per-form breakdowns:
- Patrologia Graeca (freq_pg): 3.14M tokens of Byzantine/Patristic Greek from PG corpus (Church Fathers, PG071-PG158)
- Byzantine vernacular (freq_byz_vern): 191K tokens from the Byzantine Vernacular Corpus (Digenes Akritas, Chronicle of Moreas, Erotokritos, etc.)
By default, rank_forms.py downloads pre-built files from the
ciscoriordan/dilemma-data
HuggingFace dataset. Use --rebuild to regenerate locally from the
lookup and frequency source files:
python rank_forms.py --lang el --rebuild # regenerate MG locally
python rank_forms.py --lang all --verbose --rebuild # all languages with per-corpus breakdown
python rank_forms.py --lang el --polytonic --rebuild # MG with polytonic variant ranking
The --polytonic flag generates mg_polytonic_ranked.json, which maps
each monotonic MG form to its attested polytonic variants ranked by
corpus frequency. This is built from mg_polytonic_freq.json (see
build/build_polytonic_freq.py), which extracts polytonic word
frequencies from the glossAPI/Wikisource Greek texts
dataset (~38M tokens, 5,394 texts). Forms appearing fewer than 3 times
are filtered out.
The --verbose flag adds a {prefix}_ranked_forms_verbose.json file
with per-corpus frequency breakdowns for each form. Each entry includes
frequencies from all available corpora (OpenSubtitles, GLAUx, Diorisis,
Patrologia Graeca, Byzantine vernacular), letting consumers re-rank for
mixed-period use cases (e.g., a Modern Greek dictionary for a book about
ancient topics could boost forms with high freq_glaux).
export_hunspell.py produces compact Hunspell .dic + .aff pairs from
lookup.db, aimed at mobile consumers (primarily the
Tonos iOS polytonic keyboard)
where the full 993 MB lookup.db and 482 MB spell_index.db do not
fit inside the ~48 MB memory ceiling of a keyboard extension. Affix
compression collapses each inflection class to a single SFX rule
group, so ~12.5M forms compress to ~2M dictionary entries while
preserving exact-match acceptance.
Default output is the grc variant (Ancient + Medieval polytonic),
which is what Tonos ships. An optional el variant (Modern Greek
monotonic) is retained for other downstream consumers via
--variant el. Output layout under build/hunspell/:
| Variant | Script name | Lang tag | Contents |
|---|---|---|---|
| grc_polytonic.{dic,aff,version} | grc | grc | Ancient + Medieval polytonic forms (breathings, circumflex, iota subscript, grave). Acute-only fallback keys are dropped unless corpus-attested. AG function words (definite article, 1st/2nd person pronouns) are injected because dilemma.py resolves those via hardcoded rules rather than the lookup table. |
| el_GR_monotonic.{dic,aff,version} | el | el_GR | Modern Greek monotonic forms, including MG-relevant vocabulary drawn from the AG side of lookup.db (articles, common verbs, proper names). Not shipped in Tonos. |
Each dictionary entry carries a morphological field fr:<bucket> where
the bucket is one of C (common), M (medium), R (rare), or X
(unseen in corpus). This lets consumers rank spelling candidates
without shipping the full frequency table.
The bucket is chosen from four signals, in priority order:
- Canonical seed. data/canonical_ag_forms.json pins ~194 iconic polytonic surface forms to C (Iliad/Odyssey/Herodotus incipits, Olympians, Homeric heroes). These are low-token-count forms whose cultural weight exceeds their corpus frequency - ἄειδε appears only 71 times in corpus but stands in the opening line of the Iliad.
- Canonical lemmas. ~168 famous lemmas (ἀείδω, μῆνις, Πλάτων, etc.) promote any polytonic-marked form of theirs to C. This catches canonical inflections beyond what the seed enumerates.
- Lemma aggregate. For polytonic forms, the lemma's total corpus count promotes the form: >= 20K -> C, >= 5K -> M. Highly-inflected lemmas (ἀείδω has 1289 surface forms) would otherwise dilute frequency across the paradigm so no single form crosses the per-form threshold.
- Per-form count fallback. count >= 1000 -> C, >= 100 -> M, >= 1 -> R, else X.
Only polytonic-marked forms receive the canonical promotions, so
monotonic leaks like άειδε (acute-only, no breathing) stay at R
and downstream rankers can pick the polytonic variant when the two
otherwise tie.
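The bucket rules can be read as a small decision function. The sketch below is one plausible reading, not the code in export_hunspell.py: it assumes the first signal that fires wins, detects polytonic marking from combining marks after NFD normalization, and uses hypothetical names throughout.

```python
import unicodedata

# Combining marks that only occur in polytonic spelling:
# grave, psili, dasia, perispomeni, ypogegrammeni.
POLYTONIC_MARKS = {"\u0300", "\u0313", "\u0314", "\u0342", "\u0345"}


def is_polytonic(form):
    """True if the form carries a breathing, circumflex, grave or iota subscript."""
    decomposed = unicodedata.normalize("NFD", form)
    return any(ch in POLYTONIC_MARKS for ch in decomposed)


def frequency_bucket(form, lemma, form_count, lemma_count,
                     seed_forms, canonical_lemmas):
    """Return 'C', 'M', 'R' or 'X' for one dictionary entry."""
    if is_polytonic(form):
        # Canonical seed and canonical lemmas pin the form to C.
        if form in seed_forms or lemma in canonical_lemmas:
            return "C"
        # Lemma aggregate thresholds.
        if lemma_count >= 20_000:
            return "C"
        if lemma_count >= 5_000:
            return "M"
    # Per-form count fallback.
    if form_count >= 1000:
        return "C"
    if form_count >= 100:
        return "M"
    if form_count >= 1:
        return "R"
    return "X"


seed = {"ἄειδε"}
print(frequency_bucket("ἄειδε", "ἀείδω", 71, 3000, seed, set()))
print(frequency_bucket("άειδε", "ἀείδω", 71, 3000, seed, set()))
```

The second call shows the monotonic leak case: without a breathing the form skips every promotion and lands in R on its per-form count alone.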
python export_hunspell.py # grc polytonic (default)
python export_hunspell.py --variant both # grc + el
python export_hunspell.py --variant el # el monotonic only
python export_hunspell.py --sanity 10000 # 10K-lemma sanity pass
Output layout, with one sidecar .version file per variant so the
consumer can detect updates:
build/hunspell/
el_GR_monotonic.dic
el_GR_monotonic.aff
el_GR_monotonic.version # semver + dilemma commit hash + entry count
grc_polytonic.dic
grc_polytonic.aff
grc_polytonic.version
eval_results.txt # from eval_hunspell.py
eval_hunspell.py is the quality gate. It samples mid-frequency real
Greek words (ranks 100..10K in the corpus), generates synthetic typos
at edit distance 1 and 2, and measures top-1/top-5 correction accuracy
against the compact artifact. Add --compare-full to also benchmark
Dilemma's own suggest_spelling() on the full lookup.db.
python eval_hunspell.py # default 500 targets per variant
python eval_hunspell.py --n 100 # quick sanity eval
python eval_hunspell.py --variant el # only MG
python eval_hunspell.py --compare-full # compare vs full Dilemma (slower)
Requires pip install spylls for the Python-side Hunspell consumer.
To regenerate both artifacts and the eval report end-to-end:
python export_hunspell.py
python eval_hunspell.py --n 200
train_lm.py + export_lm.py produce a compact next-word prediction
language model, build/lm/grc_ngram.bin, aimed at the Tonos iOS
keyboard extension for a QuickType-style suggestion strip over
polytonic Greek.
This is a classical stupid-backoff trigram over GLAUx + Diorisis (~29.6M polytonic tokens, 1.48M sentences after a deterministic 2% dev split). Classical n-gram is a deliberate choice: inference is a couple of binary searches on mmap'd bytes (no ML runtime, no matmul), trivially under 1 ms per keystroke, and the artifact can be used directly from a keyboard extension without adding a Core ML dependency. A small neural LM would improve perplexity but cannot beat this on the cost side that keyboards are constrained by: cold-start memory and per-keystroke CPU.
Build
python train_lm.py --sanity # few files per corpus, <1 min
python train_lm.py # full GLAUx + Diorisis, ~5 min
python train_lm.py --no-diorisis # GLAUx-only baseline
python export_lm.py # writes grc_ngram.bin + .version, ~30 s
python eval_lm.py # writes eval_results.txt, ~90 s
Corpus loaders live in train_lm.py (GLAUx, inline) and
extract_diorisis_lm.py (Diorisis, beta-code to NFC). To add
another Ancient Greek corpus, write a loader that yields
(sentence_id, [<s>, ...NFC tokens..., </s>]) and append one
entry to build_corpus_sources in train_lm.py; the counting,
vocab, split, and eval stages do not need any changes.
Output layout:
build/lm/
grc_ngram.bin mmap-friendly binary, ~55 MB (v2)
grc_ngram.version semver + dilemma commit + vocab/context counts
eval_results.txt from eval_lm.py (combined dev split)
eval_results_glaux.txt eval restricted to GLAUx dev sentences
eval_results_diorisis.txt eval restricted to Diorisis dev sentences
vocab.json intermediate from train_lm.py
unigrams.json intermediate
bigrams.tsv.gz intermediate
trigrams.tsv.gz intermediate
dev_sentences.txt held-out dev set, deterministic split (seed 4242)
dev_sentences_glaux.txt same split, restricted to GLAUx sentences
dev_sentences_diorisis.txt same split, restricted to Diorisis sentences
stats.json training corpus statistics
Artifact layout at a glance (full spec in the docstring of
export_lm.py): a 128-byte little-endian header, a sorted UTF-8
vocab (binary-searchable by the Swift reader), a per-vocab unigram
count column, and three flat sorted tables: unigram top-K,
bigram top-K-per-w1, and trigram top-K-per-w1,w2. Each suggestion
entry is 6 bytes: a u32 word id and an i16 fixed-point log
probability (scale 1024). Lookup is a binary search on the trigram
context (w1, w2); on miss, fall back to bigram (w1); on miss, fall
back to the global unigram top-K. All tables are mmap'd and accessed
by offset; nothing needs to be read into RAM.
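To make the 6-byte entry and the fallback order concrete, here is a small Python sketch. It is not the export_lm.py layout or the Swift reader: the field order inside an entry is taken from the sentence above, plain dictionaries stand in for the binary-searched mmap tables, and all function names are hypothetical.

```python
import struct

# u32 word id + i16 fixed-point log probability, little-endian, no padding.
ENTRY = struct.Struct("<Ih")
SCALE = 1024


def pack_entry(word_id, logprob):
    """Encode one suggestion entry into 6 bytes."""
    fixed = max(-32768, min(32767, round(logprob * SCALE)))
    return ENTRY.pack(word_id, fixed)


def unpack_entry(buf, offset=0):
    """Decode one suggestion entry back to (word_id, logprob)."""
    word_id, fixed = ENTRY.unpack_from(buf, offset)
    return word_id, fixed / SCALE


def suggest(prev2, prev1, trigram_topk, bigram_topk, unigram_topk):
    """Trigram context first, then bigram on the last word, then global top-K."""
    hit = trigram_topk.get((prev2, prev1))
    if hit:
        return hit
    hit = bigram_topk.get(prev1)
    if hit:
        return hit
    return unigram_topk


blob = pack_entry(7, -2.5)
print(len(blob), unpack_entry(blob))
```

With scale 1024 an i16 covers log probabilities from -32 up to just under 32 in steps of about 0.001, which is why the sketch clamps before packing.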
Typeahead v2 (format_version = 2) adds two things versus v1:
- Independent top-K per table. Bigram contexts now store the top 30 continuations and trigram contexts the top 15, so the keyboard's mid-word prefix filter has enough candidates to work with even after a user has typed a character or two. The global unigram fallback stays at 10. The v2 reader refuses any file with format_version below 2, so old v1 binaries must be rebuilt.
- A per-vocab unigram count column, one u32 per vocab entry, indexed by sorted-vocab id. The Swift reader uses this to rank global prefix completions by corpus frequency when the current bigram/trigram top-K doesn't cover the user's stem, so the bar is never starved of useful entries even for rare context / common stem combinations.
Default knobs (defined in export_lm.py):
| Knob | Value | Effect |
|---|---|---|
| vocab size | 80,000 | Reduces UNK target rate to 7.6% on the dev split |
| top-K uni | 10 | Global fallback depth |
| top-K bi | 30 | Per-bigram context depth (mid-word prefix filter wants this deep) |
| top-K tri | 15 | Per-trigram context depth |
| min bigram count | 1 | Keep every observed bigram |
| min trigram count | 1 | Keep every observed trigram... |
| min bigram count for trigram | 3 | ...but only when the (w1, w2) bigram context is well attested. Rare contexts fall back to the bigram table at inference. This is the dominant size/quality knob. |
Current build: ~55 MB, ~80K vocab, ~1.1M trigram contexts, ~80K bigram contexts. Within the 60 MB budget Tonos has for a single artifact inside a keyboard extension.
Held-out evaluation, keyboard-realistic regime (exclude </s> and
UNK targets):
| Dev split | training data | sentences | preds scored | top-1 | top-3 | top-5 | top-6 |
|---|---|---|---|---|---|---|---|
| GLAUx only | GLAUx | 19,578 | 315,385 | 13.56% | 23.08% | 27.75% | 29.59% |
| GLAUx only | GLAUx+Diorisis | 19,578 | 315,278 | 14.86% | 28.14% | 34.52% | 36.56% |
| Diorisis only | GLAUx+Diorisis | 10,672 | 188,199 | 15.17% | 30.46% | 37.63% | 39.88% |
| Combined | GLAUx+Diorisis | 30,250 | 503,477 | 14.97% | 29.00% | 35.68% | 37.77% |
| Combined | v2 (bi 30/tri 15) | 32,724 | 534,541 | 14.69% | 28.30% | 34.83% | 37.12% |
The v2 row reports on the newer combined corpus (GLAUx + Diorisis + Katharevousa Wikisource + Byzantine vernacular). Top-1 / top-3 / top-5 stay within a point of the v1 combined row (as expected: widening top-K per context does not move top-N for N smaller than K); the new number to look at is top-6 at 37.1%.
Adding Diorisis lifts top-3 accuracy on the identical held-out GLAUx sentences by +5.1 points and top-5 by +6.8 points; top-1 moves by +1.3 points. Dev sentences with identical sentence-id hashes across runs make this an apples-to-apples comparison.
Backoff level breakdown: ~74% of predictions hit the trigram table,
~26% fall back to bigrams, <0.01% fall to unigram. Perplexity (stupid
backoff, α=0.4) is reported in eval_results.txt but is not a
clean perplexity: because the binary only stores the top-K
continuations per context, off-list targets are assigned a floor
probability, which drives PPL up. The top-K accuracies are the real
quality metric for this artifact; PPL is a sanity signal, not a
headline number.
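For readers unfamiliar with stupid backoff, the score it assigns can be sketched over raw counts as follows. This is the textbook formulation with α=0.4 on toy dictionaries, not eval_lm.py: the shipped binary stores only top-K log probabilities, and every name here is hypothetical.

```python
def stupid_backoff(word, prev2, prev1, tri, bi, uni, alpha=0.4):
    """Relative-frequency score with a fixed penalty per backoff step.

    tri, bi and uni map n-gram tuples (or single words) to counts.
    The result is a score, not a normalized probability.
    """
    tri_count = tri.get((prev2, prev1, word), 0)
    if tri_count > 0:
        return tri_count / bi[(prev2, prev1)]
    bi_count = bi.get((prev1, word), 0)
    if bi_count > 0:
        return alpha * bi_count / uni[prev1]
    return alpha * alpha * uni.get(word, 0) / sum(uni.values())


tri = {("a", "b", "c"): 1}
bi = {("a", "b"): 1, ("b", "c"): 1}
uni = {"a": 2, "b": 1, "c": 2}
print(stupid_backoff("c", "a", "b", tri, bi, uni))
print(stupid_backoff("c", "x", "b", tri, bi, uni))
print(stupid_backoff("a", "x", "y", tri, bi, uni))
```

Because the scores are not normalized, any perplexity computed from them is only indicative, which is consistent with treating PPL as a sanity signal.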
Expectations, honest:
- Top-1 ~14% is the expected order of magnitude for polytonic Ancient Greek. AG has highly variable constituent order, rich inflectional morphology, and a training corpus orders of magnitude smaller than English keyboard LMs. English QuickType-style predictors typically hit 15-30% top-1 on in-domain text; for AG, top-5 around 25-30% is a respectable baseline.
- UNK rate is 7.9% on the combined dev split even at 80K vocab. Polytonic inflection multiplies forms (a typical verb has 300+ forms), so long-tail coverage is inherently hard. Tonos can let the user type any form; the LM just won't propose it.
- Homographs are preserved. ἦ (past indicative) and ἤ (disjunction) remain distinct. No accent stripping is done at any stage.
- Diorisis beta code is converted to NFC polytonic via the betacode PyPI package. A small cleanup handles elision-after-consonant encoded as ) rather than ' (e.g. par) for παρ’). Elision-after-vowel (ou) = οὐ with smooth breathing) stays untouched.
Reader contract for Tonos: the binary format is versioned in the
header (format_version = 2) and in grc_ngram.version. Any on-disk
layout change will bump format_version. The sidecar version file
also carries the dilemma commit hash, vocab size, and trigram /
bigram context counts so a Tonos build can detect staleness and
refuse mismatched artifacts. The Swift reader rejects any file whose
format_version doesn't match the version it was compiled against;
the v2 reader does not read v1 files (and vice versa).
| Source | Forms | Notes |
|---|---|---|
| EN + EL Wiktionary (MG) | 2.8M | From kaikki.org dumps |
| EN + EL Wiktionary (AG) | 2.4M | From kaikki.org dumps |
| EL Wiktionary (Medieval) | 6.9K | From kaikki.org dumps |
| LSJ noun/verb/adj expansion | 4.2M | Via Wiktionary Lua modules |
| Sophocles lexicon expansion | 1.0M | Byzantine/Patristic vocabulary |
| UD Treebanks (DiGreC) | 27K | Gold annotations from DiGreC treebank |
| PROIEL (gold) | 33K | Herodotus gold-standard form-lemma pairs (expert-verified) |
| Perseus (gold) | 42K | 178K tokens: Sophocles, Aeschylus, Homer, Hesiod, Herodotus, Thucydides, Plutarch, Polybius, Athenaeus |
| Gorman Treebanks | 79K | 687K tokens across Herodotus, Thucydides, Xenophon, Demosthenes, Lysias, Polybius, etc. |
| GLAUx corpus | 557K | 17M tokens, 98.8% accuracy (Keersmaekers 2021) |
| Diorisis corpus | 76K new | 10M tokens, 91.4% accuracy (Vatri & McGillivray 2018) |
| HNC Golden Corpus | 1K new | 88K-token gold MG corpus (CLARIN:EL, openUnderPSI) |
| DGE headwords | 52K | Headword filter coverage from Diccionario Griego-Espanol |
| LGPN names | 44K | Proper noun coverage from Lexicon of Greek Personal Names |
| Perseus Digital Library headwords | 176K | Headword filter from L&S, Pape, Bailly, etc. |
| Total lookup | 12.5M | |
All Wiktionary data is extracted automatically from kaikki.org JSONL dumps. LSJ and Sophocles expansions use wikitextprocessor to run Wiktionary's grc-decl and grc-conj Lua modules on headwords extracted from lexicon XML/TEI files.
The GLAUx corpus provides the largest single source of new form-lemma pairs outside Wiktionary. GLAUx is the primary corpus source due to its 98.8% lemma accuracy. The Diorisis corpus (Vatri & McGillivray, 2018; 10M tokens, Homer - 5th c. AD) is used as a secondary source: its 456K form-lemma pairs add 76K new entries not found in GLAUx, and its token frequencies are merged with GLAUx for 27M combined tokens. Because Diorisis has lower lemma accuracy (91.4%), its pairs are only added when they don't conflict with existing entries from Wiktionary, LSJ, or GLAUx.
We chose not to integrate two other large corpora:
- Opera Graeca Adnotata (OGA, 40M tokens): standoff PAULA XML format requires complex alignment code, and at 91.4% accuracy with 4x the size of Diorisis, the noise-to-signal ratio is worse for lookup purposes.
- Pedalion (5.8M tokens): smaller than GLAUx with similar classical-period coverage. Would add few forms not already covered by GLAUx + Wiktionary + LSJ, since the remaining lookup gaps are mostly Byzantine compounds not found in any classical corpus.
All three are CC BY-SA 4.0.
Compound decomposition (added in v1.5) reduced the no-lookup-hit rate on DBBE from 4.4% to 2.5% by splitting compound words at linking vowels (ο/ι/υ), stripping known prefixes, and applying Byzantine-specific normalizations. The remaining 2.5% are forms where neither lookup, compound decomposition, nor the seq2seq model can recover the correct lemma.
Each form is indexed under its original, monotonic, and accent-stripped variants for fuzzy matching.
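One way to derive the three index keys is through Unicode decomposition. The sketch below is an assumption about how such keys could be built, not Dilemma's indexing code: `to_monotonic` maps grave and circumflex to acute and drops breathings and iota subscript, and it ignores the monotonic rule that monosyllables are unaccented.

```python
import unicodedata

ACUTE = "\u0301"
DROP = {"\u0313", "\u0314", "\u0345"}   # psili, dasia, iota subscript
TO_ACUTE = {"\u0300", "\u0342"}         # grave, circumflex


def to_monotonic(form):
    """Approximate monotonic spelling of a polytonic form."""
    out = []
    for ch in unicodedata.normalize("NFD", form):
        if ch in DROP:
            continue
        out.append(ACUTE if ch in TO_ACUTE else ch)
    return unicodedata.normalize("NFC", "".join(out))


def strip_accents(form):
    """Remove every combining mark (accents, breathings, diaeresis)."""
    decomposed = unicodedata.normalize("NFD", form)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


def index_keys(form):
    """Original, monotonic and accent-stripped keys for one form."""
    original = unicodedata.normalize("NFC", form)
    return {original, to_monotonic(original), strip_accents(original)}


print(sorted(index_keys("μῆνις")))
```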
Form-lemma pairs come from three sources per Wiktionary entry:
- Inflection tables (primary). Every cell in a verb conjugation or noun declension table becomes a form-lemma pair. Covers all tenses, moods, cases, numbers. Multi-form cells (e.g. Πηλείδᾱο / Πηλείδεω) are split into separate pairs.
- form_of references. When a page says "form of X", that gives us an additional pair. Adds ~44K MG and ~6K AG pairs not found in inflection tables.
- alt_of references. Alternative/variant spellings. Adds ~1K pairs.
Generated data files (ranked form lists, form frequencies, scored
lookups) are hosted at
ciscoriordan/dilemma-data
on HuggingFace Hub. rank_forms.py downloads these by default instead
of regenerating locally.
Available files:
| File | Description |
|---|---|
| mg_ranked_forms.json | MG lemma to frequency-ranked form list |
| mg_form_freq.json | MG form to frequency count |
| mg_ranked_forms_verbose.json | MG ranked forms with per-corpus breakdown |
| mg_lookup_scored.json | MG scored lookup table |
| ag_ranked_forms.json | AG lemma to frequency-ranked form list |
| ag_form_freq.json | AG form to frequency count |
| ag_ranked_forms_verbose.json | AG ranked forms with per-corpus breakdown |
| med_ranked_forms.json | Medieval lemma to frequency-ranked form list |
| med_form_freq.json | Medieval form to frequency count |
| mg_polytonic_freq.json | MG polytonic form frequencies from Wikisource (38M tokens) |
| mg_polytonic_ranked.json | Monotonic MG form to ranked polytonic variants |
This is separate from the main
ciscoriordan/dilemma
model repo, which hosts lookup.db, model weights, and other core
data files.
Not all lookup entries are equally trustworthy. Forms from inflection tables are template-generated and may be wrong for irregular words. Each entry is scored on a 5-point scale:
| Tier | Condition | MG count | AG count |
|---|---|---|---|
| 5 | Both EN + EL Wiktionary have a page for this form | 63K | 14K |
| 4 | EN Wiktionary has a page (no EL page) | 22K | 50K |
| 3 | EL Wiktionary has a page (no EN page) | 1.05M | 131K |
| 2 | Both EN + EL tables agree on the lemma | 199K | 49K |
| 1 | Single source, table-only | 1.49M | 2.12M |
Higher confidence wins when two sources map the same form to different lemmas.
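The conflict rule can be illustrated with a tiny merge helper. This is a sketch under the assumption that each lookup entry stores its lemma together with its tier; `merge_entry` and the table layout are hypothetical.

```python
def merge_entry(table, form, lemma, tier):
    """Insert (lemma, tier) for form, keeping the higher-tier lemma on conflict."""
    current = table.get(form)
    if current is None or tier > current[1]:
        table[form] = (lemma, tier)
    return table


table = {}
merge_entry(table, "form_x", "lemma_a", 1)   # table-only, single source
merge_entry(table, "form_x", "lemma_b", 4)   # EN Wiktionary page wins
merge_entry(table, "form_x", "lemma_c", 2)   # lower tier is ignored
print(table["form_x"])
```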
Ancient Greek forms from EN Wiktionary carry dialect tags extracted from inflection table headers (e.g. "Epic declension-1", "Attic contracted present"). These are propagated to every form in that table section:
| Dialect | Tagged forms |
|---|---|
| Attic | 245K |
| Epic | 92K |
| Ionic | 14K |
| Doric | 9K |
| Koine | 9K |
| Aeolic | 3K |
| Laconian | 672 |
| Boeotian | 555 |
| Arcadocypriot | 407 |
- Greek-only filter. All forms must contain only Greek Unicode characters (U+0370-03FF, U+1F00-1FFF, U+0300-036F). Removes Latin letters, digits, template artifacts.
- Chain-breaking. If form A maps to lemma B, and B maps to C, the chain is resolved to the real headword at build time. Fixes ~65K entries caused by accent-stripped key collisions and treebank convention differences.
- Pronoun cross-contamination. Greek Wiktionary dumps the entire pronoun paradigm table into each pronoun entry (e.g. εσύ lists εγώ as a "form"). Articles and determiners are restricted to headword-only. Pronoun forms that are headwords of other closed-class entries are skipped.
- Proper noun plural filter. EL Wiktionary generates plural forms for proper nouns via templates (413K junk entries like Αχιλλείς). These are skipped unless EN Wiktionary also lists them (which indicates a human editor intentionally added them, e.g. Έλληνες).
- Training pair validation. Every training pair's lemma must be a headword (maps to itself in the lookup). Pairs with non-headword lemmas are resolved to the real headword or dropped.
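Two of the cleaning steps above, the Greek-only filter and chain-breaking, are simple enough to sketch. The code below is illustrative only: the character ranges are the ones listed in the filter description, while `is_greek`, `resolve_chains` and the cycle guard are assumptions about one reasonable implementation.

```python
import re

# Greek and Coptic, Greek Extended, Combining Diacritical Marks.
GREEK_ONLY = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F]+")


def is_greek(form):
    """True if every character of the form is in the allowed ranges."""
    return GREEK_ONLY.fullmatch(form) is not None


def resolve_chains(lookup):
    """Follow form -> lemma -> lemma links until a real headword is reached."""
    resolved = {}
    for form, lemma in lookup.items():
        seen = {form}
        # Keep following while the lemma itself points somewhere else.
        while (lemma in lookup and lookup[lemma] != lemma
               and lookup[lemma] not in seen):
            seen.add(lemma)
            lemma = lookup[lemma]
        resolved[form] = lemma
    return resolved


print(is_greek("λόγος"), is_greek("λόγος1"))
print(resolve_chains({"A": "B", "B": "C", "C": "C"}))
```

The `seen` set stops the walk if two entries point at each other, so malformed input cannot cause an infinite loop.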
Vatri & McGillivray (2020) assessed the state of the art in Ancient Greek lemmatization via a blinded evaluation by expert readers. They found that methods using large lexica combined with POS tagging (CLTK backoff lemmatizer, Diorisis corpus) consistently outperformed pure ML approaches with smaller lexica. Dilemma follows the same principle: a large lookup table (12.5M forms) handles the vast majority of words, with a small model as fallback.
Celano (2025) presented state-of-the-art morphosyntactic parsing and lemmatization for Ancient Greek using GreTa and PhilTa models trained on the AGDT and OGA corpora. Best lemmatization F1 was 95.6% on classical text. These models require POS context; Dilemma operates on isolated words but benefits from a much larger form inventory.
Swaelens et al. (2024) tested lemmatization on unedited Byzantine Greek epigrams and found that classical accuracy (~95%) dropped 30+ points on Byzantine text due to itacism, crasis, and non-standard orthography. Their best hybrid method (transformer embeddings + dictionary lookup) reached 65.8%. Dilemma achieves 92.7% on the same dataset (equiv-adjusted).
Swaelens et al. (2025) showed that multi-task learning (joint POS + morphology + lemma prediction) improved Byzantine lemmatization by ~9pp, reaching ~74-75%. They also demonstrated that subword-tokenizing transformers plateau on Byzantine Greek due to orthographic inconsistency, and called for character-level models as the next step. Dilemma's character-level encoder-decoder is this architecture, and its perfect tense oversampling and multi-task POS head are directly informed by their findings.
These are inherent limitations or Wiktionary coverage gaps, not code bugs. Most can be fixed by editing the relevant Wiktionary entry, which will propagate into Dilemma via kaikki dumps.
| Issue | Tokens | Notes |
|---|---|---|
| αὐτοῦ ambiguity | ~200 | Genuine lexical ambiguity: both an adverb ("here/there") and genitive of αὐτός. Resolved when POS context is available via lemmatize_pos(). |
| μιν → ὅς | ~340 | Convention difference. Wiktionary maps μιν to the 3rd person pronoun. Perseus treebank uses μιν as its own lemma. |
| Lemma convention differences | ~400 | αὐτάρ vs ἀτάρ, κε vs ἄν - Wiktionary and Perseus use different citation forms for some Homeric particles. Handled by lemma equivalence groups for evaluation. |
Small character-level encoder-decoder transformer (~4M parameters), trained from scratch on Greek lemmatization pairs. This is the standard architecture from SIGMORPHON morphological inflection shared tasks.
| Component | Config |
|---|---|
| Encoder | 3 transformer layers, 256 hidden, 4 heads |
| Decoder | 3 transformer layers, 256 hidden, 4 heads |
| POS head | Linear (256 -> 10 tags), auxiliary task |
| Nominal head | Linear (256 -> 45 labels), gender/number/case |
| Verbal head | Linear (256 -> 69 labels), tense/mood/voice |
| FFN | 512 dim |
| Vocabulary | ~381 Greek characters + special tokens |
| Parameters | ~4.2M |
| Inference | ONNX or PyTorch, beam search with headword filter |
No pretrained weights - the model is small enough to train from scratch on 500K+ pairs in minutes. The character vocabulary covers all Greek Unicode ranges (monotonic, polytonic, extended). Three auxiliary classification heads (POS, nominal morphology, verbal morphology) share the encoder and improve representations via multi-task learning.
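
The parameter count can be sanity-checked with a back-of-the-envelope calculation from the configuration above. The sketch assumes conventional Transformer layers (biased linear projections, two LayerNorms per encoder layer and three per decoder layer), a single shared character embedding, an untied output projection, non-learned positional encodings, and a vocabulary of 385 entries (381 characters plus 4 special tokens). None of these details are confirmed by the tables, so the result is an estimate rather than the model's exact size.

```python
def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def attention(d):
    """Q, K, V and output projections of one multi-head attention block."""
    return 4 * linear(d, d)


def ffn(d, ff):
    """Two-layer position-wise feed-forward network."""
    return linear(d, ff) + linear(ff, d)


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


def encoder_layer(d, ff):
    return attention(d) + ffn(d, ff) + 2 * layer_norm(d)


def decoder_layer(d, ff):
    # Self-attention plus cross-attention, hence two attention blocks.
    return 2 * attention(d) + ffn(d, ff) + 3 * layer_norm(d)


def count_parameters(d=256, ff=512, n_enc=3, n_dec=3, vocab=385,
                     head_sizes=(10, 45, 69)):
    """Per-component estimate; vocab=385 is an assumed value."""
    return {
        "embedding": vocab * d,
        "encoder": n_enc * encoder_layer(d, ff),
        "decoder": n_dec * decoder_layer(d, ff),
        "output_projection": linear(d, vocab),
        "aux_heads": sum(linear(d, n) for n in head_sizes),
    }


if __name__ == "__main__":
    parts = count_parameters()
    for name, n in parts.items():
        print(f"{name:>18}: {n:>9,}")
    print(f"{'total':>18}: {sum(parts.values()):>9,}")
```

Under these assumptions the total lands close to the ~4.2M in the table, with the six Transformer layers accounting for nearly all of it; the embeddings and the three auxiliary heads are comparatively negligible.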
An earlier version of Dilemma fine-tuned Google's ByT5-small (300M params). ByT5 processes raw UTF-8 bytes, and every Greek letter takes two or three bytes, so a 10-character Greek word becomes 20 or more encoder steps. The custom transformer uses a Greek character vocabulary (~381 characters), so the same word is ~10 steps. Combined with 75x fewer parameters:
| | ByT5-small | Dilemma |
|---|---|---|
| Approach | Byte-level input (raw UTF-8 bytes) | Character vocabulary (~381 Greek chars) |
| Parameters | 300M | 4M |
| Training (3.5M pairs, 3 epochs) | ~20 hours | ~95 min |
| Dependencies | torch + transformers | torch only (or ONNX only) |
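
The sequence-length difference between byte-level and character-level input is easy to check with the standard library. The helper below is an illustrative sketch (`encoder_steps` is a hypothetical name); it normalizes to NFC first so that precomposed polytonic characters count as one character each.

```python
import unicodedata


def encoder_steps(word):
    """Return (character_steps, byte_steps) for a Greek word.

    character_steps: length seen by a character-level model.
    byte_steps: length seen by a byte-level model such as ByT5.
    """
    normalized = unicodedata.normalize("NFC", word)
    return len(normalized), len(normalized.encode("utf-8"))


if __name__ == "__main__":
    for word in ["λόγος", "ἄνθρωπος"]:
        chars, nbytes = encoder_steps(word)
        print(f"{word}: {chars} characters, {nbytes} bytes")
```

Letters in the basic Greek block encode to two bytes and precomposed polytonic letters in the Greek Extended block to three, so heavily accented Ancient Greek text costs a byte-level model somewhat more than twice the steps.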
- Lemma - Diachronic Greek dictionary app. Uses Dilemma's frequency-ranked inflection lists (`rank_forms.py`) and the MG equivalences derived from `mg_lookup_scored.json` to resolve looked-up words to their canonical headwords across Ancient, Byzantine, and Modern Greek.
- Tonos - iOS polytonic Greek keyboard. Ships Dilemma's compact Hunspell exports (`grc_polytonic.{dic,aff}`) for spell-check inside the keyboard extension's tight memory budget, plus the trigram language model (`grc_ngram.bin`) for QuickType-style next-word prediction over polytonic Ancient Greek.
- Training data from English Wiktionary and Greek Wiktionary via kaikki.org JSONL dumps
- LSJ headwords, forms, and POS data from LSJ9 processed exports (`lsj9_headwords_flat.json`, `lsj9_headword_pos.json`, `lsj9_frequency.json`, `lsj9_indeclinables.json`)
- Sophocles lexicon TEI from Ionian University / Internet Archive
- GLAUx corpus data (Keersmaekers, 2021) (CC BY-SA 4.0)
- Diorisis corpus data (Vatri & McGillivray, 2018) (CC BY-SA 3.0)
- PROIEL Treebank gold-standard annotations (CC BY-NC-SA 3.0)
- Perseus Treebank (AGDT) gold-standard annotations (CC BY-NC-SA 3.0)
- Gorman Treebanks (Gorman) (CC BY-NC-SA 4.0)
- HNC Golden Corpus from CLARIN:EL (openUnderPSI)
- DGE headwords from the Diccionario Griego-Español (CSIC)
- LGPN proper names from the Lexicon of Greek Personal Names (Oxford)
- Perseus Digital Library headwords (L&S, Pape, Bailly) from the Perseus project
- MG polytonic frequencies from glossAPI/Wikisource Greek texts on HuggingFace
- DBBE evaluation data from Swaelens et al. (CC BY 4.0)
- Flag icons by svg-flags
MIT. Copyright Francisco Riordan.
