HybridChunker.chunk() takes ~2 hours on a large HTML document #3222

@mwcz

Description

We're using Docling for a chunking pipeline associated with the Red Hat Offline Knowledge Portal, and the quality of the results is excellent, much better than the tools we were using before. We have encountered some performance problems, however.

HybridChunker.chunk() hangs for approximately 2 hours when processing large HTML files, of which we have many. In this reproducer I'll share a ~14 MB, ~14 million character file. Most documents convert and chunk in seconds, and even some large documents chunk reasonably quickly, so I suspect certain documents have qualities, perhaps deeply nested headings, that make chunking performance scale superlinearly.
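One way to test the superlinearity suspicion would be to time the chunker on progressively larger prefixes of the document and estimate the growth exponent from a log-log slope. A generic sketch (not docling-specific; the function under test and the input sizes are illustrative):

```python
import math
import time


def scaling_exponent(fn, sizes, make_input):
    """Estimate k in runtime ~ n**k by timing fn on inputs of
    increasing size and taking the log-log slope between the
    smallest and largest measurements."""
    times = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    return math.log(times[-1] / times[0]) / math.log(sizes[-1] / sizes[0])
```

Timing convert-plus-chunk on, say, 1, 2, 4, and 8 MB prefixes of the HTML and checking whether the exponent stays near 1 would confirm or rule out superlinear scaling.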

Steps to reproduce

  1. Save the script below as slowchunk.py
  2. Save large.html (~14 MB)
  3. Run: uv run slowchunk.py large.html
# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "docling>=2.0.0",
#   "docling-core[chunking]>=2.0.0",
#   "transformers>=4.40.0",
# ]
# ///
"""
Minimal HTML chunking script using docling's HybridChunker.

Usage:
  python slowchunk.py /path/to/file.html
"""

import sys

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

EMBED_MODEL_ID = "ibm-granite/granite-embedding-30m-english"
MAX_TOKENS = 512


def main():
    if len(sys.argv) < 2:
        print("Usage: python slowchunk.py <html_file>", file=sys.stderr)
        sys.exit(1)

    file_path = sys.argv[1]

    print(f"Reading {file_path}...", file=sys.stderr)
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    print(f"Read {len(content)} chars.", file=sys.stderr)

    print("Initializing tokenizer...", file=sys.stderr)
    tokenizer = HuggingFaceTokenizer(
        tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
        max_tokens=MAX_TOKENS,
    )

    print("Initializing document converter...", file=sys.stderr)
    converter = DocumentConverter()

    print("Converting HTML to document...", file=sys.stderr)
    result = converter.convert_string(content=content, format=InputFormat.HTML)
    doc = result.document

    print("Initializing chunker...", file=sys.stderr)
    chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)

    print("Chunking...", file=sys.stderr)
    chunks = list(chunker.chunk(doc))
    print(f"Produced {len(chunks)} chunks.", file=sys.stderr)

    print("Contextualizing and counting tokens...", file=sys.stderr)
    total_tokens = 0
    oversized = 0
    for chunk in chunks:
        text = chunker.contextualize(chunk)
        n = tokenizer.count_tokens(text)
        total_tokens += n
        if n > MAX_TOKENS:
            oversized += 1

    print(
        f"Done. {len(chunks)} chunks, {total_tokens} total tokens, "
        f"avg {total_tokens // max(len(chunks), 1)} tokens/chunk, "
        f"{oversized} oversized.",
        file=sys.stderr,
    )


if __name__ == "__main__":
    main()

Observed output

I used ts -i to prefix each log line with the time elapsed since the previous line; more precise instrumentation could be added by other means.

    $ uv run slowchunk.py large_file.html 2>&1 | ts -i
    00:00:03 Reading red_hat_fuse_7.13_html-single_apache_camel_component_reference_index_index.html...
    00:00:00 Read 14172244 chars.
    00:00:00 Initializing tokenizer...
    00:00:00 Initializing document converter...
    00:00:00 Converting HTML to document...
    00:00:16 Initializing chunker...
    00:00:00 Chunking...
    00:02:29 Token indices sequence length is longer than the specified maximum sequence length for this model (33318 > 512). Running this sequence through the model will result in indexing errors
--> 01:55:02 Produced 6287 chunks.
    00:00:00 Contextualizing and counting tokens...
    00:00:02 Done. 6287 chunks, 1761710 total tokens, avg 280 tokens/chunk, 1282 oversized.

The third line from the bottom, "Produced 6287 chunks.", took 1h55m to appear: the chunks = list(chunker.chunk(doc)) call accounts for essentially the entire delay.
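For finer-grained timing than ts provides, each stage of slowchunk.py could be wrapped in a small context manager (a sketch; the stage labels are illustrative):

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print a stage's wall-clock duration to stderr when the block exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s", file=sys.stderr)


# e.g. in slowchunk.py:
# with timed("chunking"):
#     chunks = list(chunker.chunk(doc))
```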

Docling version

docling_version='2.83.0'
docling_core_version='2.71.0'
docling_ibm_models_version='3.13.0'
docling_parse_version='5.7.0'
platform_str='Linux-6.19.9-200.fc43.x86_64-x86_64-with-glibc2.42'
py_impl_version='cpython-314'
py_lang_version='3.14.3'

Python version

3.14.3

Alternative approaches

To unblock us, I've made a Rust port of the chunking algorithm; it chunks large.html in 12 seconds. If there's any interest in collaborating on that, let me know. I'm running it against docling's test suite (using PyO3 and pytest monkey patching) to verify acceptably similar results, and it could potentially be plugged into docling chunking pipelines using the same technique.
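The monkey-patching approach can be sketched as follows. Everything here is hypothetical except the patching mechanism itself: rust_chunker would be the PyO3 module, and a stub class stands in for docling's HybridChunker so the sketch runs standalone (pytest's monkeypatch fixture works the same way as unittest.mock here):

```python
from unittest.mock import patch


class HybridChunkerStub:
    """Stand-in for docling's HybridChunker, so this sketch runs
    without docling installed."""

    def chunk(self, doc):
        return iter([f"python-chunked:{doc}"])


def chunk_via_rust(self, doc):
    # Would call the PyO3 binding, e.g. rust_chunker.chunk_doc(doc).
    return iter([f"rust-chunked:{doc}"])


# While the patch is active, every call to .chunk(), including calls
# made deep inside a test suite, hits the Rust-backed function.
with patch.object(HybridChunkerStub, "chunk", chunk_via_rust):
    chunks = list(HybridChunkerStub().chunk("large.html"))
```

Running docling's own chunking tests under such a patch checks that the port's output is acceptably close to the Python implementation's.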
