grilme99/unicode-segmentation

✂️ Unicode Segmentation

A lightweight Luau implementation of Unicode Text Segmentation (UAX #29).

  • Spec-compliant: Up-to-date Unicode data, verified against the official Unicode test suites and fuzzed against Rust's own Unicode segmentation.

  • Excellent compatibility: It works well in all Luau environments, on or off Roblox.

  • Zero dependencies: It doesn't bloat your packages and is very easy to review.

  • Small bundle size: It compresses the Unicode data and maintains a very small memory footprint.

  • Extremely efficient: It's carefully optimized for runtime performance.

  • Modern Luau: It's fully type-checked, runs on the new solver, and takes advantage of modern Luau features.

Quick Start

local segmenter = require("@pkg/segmenter")

-- Grapheme clusters
segmenter.splitGraphemes("a̐éö̲\r\n")
-- { "a̐", "é", "ö̲", "\r\n" }

-- Word segments (with isWordLike flag)
segmenter.words("Hello, world!")
-- {
--   { segment = "Hello", index = 1, isWordLike = true },
--   { segment = ",", index = 6, isWordLike = false },
--   { segment = " ", index = 7, isWordLike = false },
--   { segment = "world", index = 8, isWordLike = true },
--   { segment = "!", index = 13, isWordLike = false },
-- }

-- Sentence boundaries
segmenter.splitSentences("Hello!  Next")
-- { "Hello!  ", "Next" }

-- Category lookups
segmenter.wordCategory(string.byte("A"))
-- => segmenter.WordCategory.ALetter

Installation (Wally)

Add the dependency to your wally.toml:

[dependencies]
UnicodeSegmentation = "grilme99/unicode-segmentation@1.0.0"

API

All segment indices are 1-based byte offsets in the original string.

Segment Types

  • Segment: { segment: string, index: number }
  • WordSegment: { segment: string, index: number, isWordLike: boolean }
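
Because indices are byte offsets rather than character counts, a segment that follows a multi-byte character starts later than its visual position suggests. The sketch below does not call the library; it uses a hand-written list shaped like the Segment type to show how `index` and `#segment` map back onto the original string. The helper names `byteEnd` and `slice` are hypothetical.

```luau
type Segment = { segment: string, index: number }

-- Exclusive end offset (in bytes) of a segment within the original string.
local function byteEnd(seg: Segment): number
	return seg.index + #seg.segment
end

-- Recovers the segment text from the original string using only its offsets.
local function slice(input: string, seg: Segment): string
	return string.sub(input, seg.index, byteEnd(seg) - 1)
end

-- "\u{E9}" is a precomposed e-acute, which takes two bytes in UTF-8.
local input = "a\u{E9}!"

-- Hand-written stand-in for a Segment list; not produced by the library.
local handWritten: { Segment } = {
	{ segment = "a", index = 1 },
	{ segment = "\u{E9}", index = 2 },
	{ segment = "!", index = 4 }, -- byte 4, although it is the third character
}

for _, seg in handWritten do
	assert(slice(input, seg) == seg.segment)
end
```

Storing byte offsets means `string.sub` can be used directly on the original string without re-decoding it.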

Grapheme Segmentation

  • segmenter.graphemes(input: string): { Segment }
  • segmenter.splitGraphemes(input: string): { string }
  • segmenter.countGraphemes(input: string): number

Word Segmentation

  • segmenter.words(input: string): { WordSegment }
  • segmenter.splitWords(input: string): { string }
  • segmenter.countWords(input: string): number
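
The `isWordLike` flag makes it easy to drop punctuation and whitespace from a word segmentation result. A minimal sketch, assuming the WordSegment shape documented above: the sample data is copied from the Quick Start output rather than produced by calling the library, and `wordLikeOnly` is a hypothetical helper, not part of the API.

```luau
type WordSegment = { segment: string, index: number, isWordLike: boolean }

-- Keeps only the text of segments flagged as word-like.
local function wordLikeOnly(segments: { WordSegment }): { string }
	local out: { string } = {}
	for _, seg in segments do
		if seg.isWordLike then
			table.insert(out, seg.segment)
		end
	end
	return out
end

-- Copied from the Quick Start output for "Hello, world!".
local sample: { WordSegment } = {
	{ segment = "Hello", index = 1, isWordLike = true },
	{ segment = ",", index = 6, isWordLike = false },
	{ segment = " ", index = 7, isWordLike = false },
	{ segment = "world", index = 8, isWordLike = true },
	{ segment = "!", index = 13, isWordLike = false },
}

local words = wordLikeOnly(sample)
-- words now holds "Hello" and "world"
```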

Sentence Segmentation

  • segmenter.sentences(input: string): { Segment }
  • segmenter.splitSentences(input: string): { string }
  • segmenter.countSentences(input: string): number

Category Lookups

Each category lookup returns a numeric enum value; use the matching enum table to interpret it:

  • segmenter.graphemeCategory(codepoint: number): number
  • segmenter.wordCategory(codepoint: number): number
  • segmenter.sentenceCategory(codepoint: number): number
  • segmenter.GraphemeCategory
  • segmenter.WordCategory
  • segmenter.SentenceCategory
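
Two details are worth illustrating here. First, the lookups take a codepoint, and `string.byte` only yields a codepoint for ASCII input; for other characters, the standard `utf8.codepoint` decodes the full value. Second, turning a numeric result back into a readable name needs a reverse mapping. The sketch below assumes the enum tables are plain name-to-number maps (as `segmenter.WordCategory.ALetter` suggests); `FakeCategory` and `invert` are hypothetical stand-ins, and the numeric values are made up.

```luau
-- string.byte returns the first byte; utf8.codepoint decodes the whole character.
local eAcute = "\u{E9}"
assert(string.byte(eAcute) == 0xC3) -- first UTF-8 byte, not the codepoint
assert(utf8.codepoint(eAcute, 1) == 0xE9) -- the value a category lookup expects

-- Hypothetical stand-in for an enum table; the real values may differ.
local FakeCategory = { Any = 0, ALetter = 1, Numeric = 2 }

-- Builds a value-to-name table so numeric results can be printed readably.
local function invert(enum: { [string]: number }): { [number]: string }
	local names: { [number]: string } = {}
	for name, value in enum do
		names[value] = name
	end
	return names
end

local names = invert(FakeCategory)
-- names[1] is "ALetter"
```

With the real library, the same `invert` helper could be applied to `segmenter.WordCategory`, provided the assumption about its shape holds.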

Testing

Run the full test suite:

lute test tests

Unicode® Version

Unicode® 17.0.0

Unicode® Standard Annex #29 - Revision 47 (2025-08-17)

Runtime Compatibility

This library runs in any Luau runtime that supports buffer, utf8, bit32, and the standard string/table APIs. It is designed to run both in Lute and on Roblox.
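
When targeting an unfamiliar runtime, a quick check for the libraries listed above can surface incompatibilities before the segmenter is loaded. This is a sketch only; `missingLibraries` is a hypothetical helper and is not provided by this package.

```luau
-- Returns the names of required standard libraries that are absent.
local function missingLibraries(): { string }
	local candidates: { { name: string, lib: any } } = {
		{ name = "buffer", lib = buffer },
		{ name = "utf8", lib = utf8 },
		{ name = "bit32", lib = bit32 },
		{ name = "string", lib = string },
		{ name = "table", lib = table },
	}
	local missing: { string } = {}
	for _, candidate in candidates do
		-- A library that is not provided evaluates to nil here.
		if type(candidate.lib) ~= "table" then
			table.insert(missing, candidate.name)
		end
	end
	return missing
end

local missing = missingLibraries()
if #missing > 0 then
	error("Unsupported runtime, missing: " .. table.concat(missing, ", "))
end
```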

Acknowledgments
