
Improving stemmer - milestone 2 #1

@assem-ch

Description

  • Clear prefixes first, clear suffixes second
    • al, kal, fal, bal, bb should be marked first, and should set is_noun
    • aa, ww, ff should be marked first
  • Be greedy when choosing between noun suffixes and verb suffixes: طالبات
  • الزمان
  • والشمس
  • لمعالجة
  • أفنلزمكموها
  • The prefix س attaches only to present-tense (مضارع) verbs
  • Detect the است prefix, use it to decide whether the word is a noun or a verb, and raise the size condition by 3: نسنعين
  • In suffixes: the sound masculine plural (جمع المذكر السالم) rarely leaves a stem shorter than 4 letters
  • Present-tense verb suffixes must likewise leave a length of 4, because the present tense has a one-letter prefix
  • study the case of والأمر
  • Make suffixes set/unset is_noun and is_verb
  • Don't stem if the word contains a number (Arabic or English digits) or if size = odd
  • Define regions before starting to stem: test everything first, then perform stemming
  • Blacklist: ignore some predefined words, or is it not worth it?
  • remove feminine marks and study feminine patterns
  • Remove broken plural infixes: أطفال، كواسر، نُمور
  • Consider vocalization when it exists:
    • tanween means a noun
    • detect and process vocalized texts
  • Study patterns and guess the pattern before stemming
  • Verb conjugation prefixes (a, t, y, n): if the word has a suffix, remove the prefix along with it
  • Rename routines to more descriptive names
  • study Alef-tanween
  • study idgham
  • Calculate probability of being noun or being verb
  • Prefix confusion
  • 2 letters words
  • improve from ISRI ideas
  • improve from Khoja ideas
  • improve from Tashaphyne ideas
  • optimize performance
  • filter stop words
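
A few of the rules above (skip words that contain digits, treat tanween as a noun marker, mark the al/kal/fal/bal prefixes first and set is_noun) could be sketched roughly as follows. This is only an illustration, not the project's code: the names `analyze`, `Analysis`, `NOUN_PREFIXES` and the `min_stem` threshold of 3 are assumptions, the Arabic spellings ال, كال, فال, بال are my reading of "al kal fal bal", and "bb" is left out because its spelling is not given here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Tanween marks: fathatan, dammatan, kasratan (U+064B..U+064D).
TANWEEN = "\u064B\u064C\u064D"
# Tashkeel marks from fathatan through sukun (U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
# ASCII digits plus Arabic-Indic digits (U+0660..U+0669).
DIGITS = re.compile("[0-9\u0660-\u0669]")

# Assumed spellings of "kal, fal, bal, al"; longer prefixes are tried first.
NOUN_PREFIXES = ("كال", "فال", "بال", "ال")


@dataclass
class Analysis:
    word: str                        # input with diacritics removed
    stem: str                        # word after the prefix pass
    is_noun: Optional[bool] = None   # None means "not decided yet"
    skipped: bool = False            # True when the word is left untouched


def analyze(word: str, min_stem: int = 3) -> Analysis:
    """Prefix pass only; min_stem is a hypothetical size condition."""
    # Rule: don't stem a word that contains a digit.
    if DIGITS.search(word):
        return Analysis(word=word, stem=word, skipped=True)

    # Rule: tanween means a noun. Check before the diacritics are dropped.
    is_noun = True if any(ch in TANWEEN for ch in word) else None
    bare = DIACRITICS.sub("", word)

    # Rule: mark noun prefixes first and set is_noun.
    stem = bare
    for prefix in NOUN_PREFIXES:
        if bare.startswith(prefix) and len(bare) - len(prefix) >= min_stem:
            stem = bare[len(prefix):]
            is_noun = True
            break
    return Analysis(word=bare, stem=stem, is_noun=is_noun)
```

Suffix handling, the است and س rules, and conjunction prefixes (as in والشمس) are deliberately not covered, so a word like طالبات comes back unchanged with is_noun still undecided.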
