HybridChunker.chunk() takes ~2 hours on a large HTML document #3222

@mwcz

Description

We're using Docling for a chunking pipeline associated with the Red Hat Offline Knowledge Portal, and the quality of the results is excellent, much better than the tools we were using before. We have encountered some performance problems, however.

HybridChunker.chunk() hangs for approximately 2 hours when processing large HTML files, of which we have many. In this reproducer I'll share a ~14 MB, ~14 million character file. Most documents convert and chunk in seconds, and even some large documents chunk reasonably quickly, so I suspect certain documents have qualities, perhaps deeply nested headings, that make chunking performance scale superlinearly.
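One way to test the superlinearity suspicion would be to time the chunker on progressively larger prefixes of the document and estimate the growth exponent from a log-log slope. A generic sketch (not docling-specific; the function under test and the input sizes are illustrative):

```python
import math
import time


def scaling_exponent(fn, sizes, make_input):
    """Estimate k in runtime ~ n**k by timing fn on inputs of
    increasing size and taking the log-log slope between the
    smallest and largest measurements."""
    times = []
    for n in sizes:
        data = make_input(n)
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    return math.log(times[-1] / times[0]) / math.log(sizes[-1] / sizes[0])
```

Timing convert-plus-chunk on, say, 1, 2, 4, and 8 MB prefixes of the HTML and checking whether the exponent stays near 1 would confirm or rule out superlinear scaling.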

Steps to reproduce

  1. Save the script below as slowchunk.py
  2. Save large.html (~14 MB)
  3. Run: uv run slowchunk.py large.html
# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "docling>=2.0.0",
#   "docling-core[chunking]>=2.0.0",
#   "transformers>=4.40.0",
# ]
# ///
"""
Minimal HTML chunking script using docling's HybridChunker.

Usage:
  python slowchunk.py /path/to/file.html
"""

import sys

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

EMBED_MODEL_ID = "ibm-granite/granite-embedding-30m-english"
MAX_TOKENS = 512


def main():
    if len(sys.argv) < 2:
        print("Usage: python slowchunk.py <html_file>", file=sys.stderr)
        sys.exit(1)

    file_path = sys.argv[1]

    print(f"Reading {file_path}...", file=sys.stderr)
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    print(f"Read {len(content)} chars.", file=sys.stderr)

    print("Initializing tokenizer...", file=sys.stderr)
    tokenizer = HuggingFaceTokenizer(
        tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
        max_tokens=MAX_TOKENS,
    )

    print("Initializing document converter...", file=sys.stderr)
    converter = DocumentConverter()

    print("Converting HTML to document...", file=sys.stderr)
    result = converter.convert_string(content=content, format=InputFormat.HTML)
    doc = result.document

    print("Initializing chunker...", file=sys.stderr)
    chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)

    print("Chunking...", file=sys.stderr)
    chunks = list(chunker.chunk(doc))
    print(f"Produced {len(chunks)} chunks.", file=sys.stderr)

    print("Contextualizing and counting tokens...", file=sys.stderr)
    total_tokens = 0
    oversized = 0
    for chunk in chunks:
        text = chunker.contextualize(chunk)
        n = tokenizer.count_tokens(text)
        total_tokens += n
        if n > MAX_TOKENS:
            oversized += 1

    print(
        f"Done. {len(chunks)} chunks, {total_tokens} total tokens, "
        f"avg {total_tokens // max(len(chunks), 1)} tokens/chunk, "
        f"{oversized} oversized.",
        file=sys.stderr,
    )


if __name__ == "__main__":
    main()

Observed output

I used ts -i to prefix each log line with the time elapsed since the previous line; more precise instrumentation could be added by other means.

    $ uv run slowchunk.py large_file.html 2>&1 | ts -i
    00:00:03 Reading red_hat_fuse_7.13_html-single_apache_camel_component_reference_index_index.html...
    00:00:00 Read 14172244 chars.
    00:00:00 Initializing tokenizer...
    00:00:00 Initializing document converter...
    00:00:00 Converting HTML to document...
    00:00:16 Initializing chunker...
    00:00:00 Chunking...
    00:02:29 Token indices sequence length is longer than the specified maximum sequence length for this model (33318 > 512). Running this sequence through the model will result in indexing errors
--> 01:55:02 Produced 6287 chunks.
    00:00:00 Contextualizing and counting tokens...
    00:00:02 Done. 6287 chunks, 1761710 total tokens, avg 280 tokens/chunk, 1282 oversized.

The third line from the bottom, "Produced 6287 chunks.", took 1h55m to appear: the chunks = list(chunker.chunk(doc)) call accounts for essentially the entire delay.
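For finer-grained timing than ts provides, each stage of slowchunk.py could be wrapped in a small context manager (a sketch; the stage labels are illustrative):

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print a stage's wall-clock duration to stderr when the block exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s", file=sys.stderr)


# e.g. in slowchunk.py:
# with timed("chunking"):
#     chunks = list(chunker.chunk(doc))
```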

Docling version

docling_version='2.83.0'
docling_core_version='2.71.0'
docling_ibm_models_version='3.13.0'
docling_parse_version='5.7.0'
platform_str='Linux-6.19.9-200.fc43.x86_64-x86_64-with-glibc2.42'
py_impl_version='cpython-314'
py_lang_version='3.14.3'

Python version

3.14.3

Alternative approaches

To unblock us, I've made a Rust port of the chunking algorithm; it chunks large.html in 12 seconds. If there's any interest in collaborating on that, let me know. I'm running it against docling's test suite (using PyO3 and pytest monkey patching) to verify acceptably similar results, and it could potentially be plugged into docling chunking pipelines using the same technique.
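The monkey-patching approach can be sketched as follows. Everything here is hypothetical except the patching mechanism itself: rust_chunker would be the PyO3 module, and a stub class stands in for docling's HybridChunker so the sketch runs standalone (pytest's monkeypatch fixture works the same way as unittest.mock here):

```python
from unittest.mock import patch


class HybridChunkerStub:
    """Stand-in for docling's HybridChunker, so this sketch runs
    without docling installed."""

    def chunk(self, doc):
        return iter([f"python-chunked:{doc}"])


def chunk_via_rust(self, doc):
    # Would call the PyO3 binding, e.g. rust_chunker.chunk_doc(doc).
    return iter([f"rust-chunked:{doc}"])


# While the patch is active, every call to .chunk(), including calls
# made deep inside a test suite, hits the Rust-backed function.
with patch.object(HybridChunkerStub, "chunk", chunk_via_rust):
    chunks = list(HybridChunkerStub().chunk("large.html"))
```

Running docling's own chunking tests under such a patch checks that the port's output is acceptably close to the Python implementation's.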
