[Bug]: Node.hash uses MetadataMode.ALL, causing unnecessary re-embeds when volatile file-stat metadata changes #21461

@stirelli

Bug Description

Since PR #18303 (merged 2025-03-30), Node.hash includes metadata via MetadataMode.ALL. That mode ignores excluded_embed_metadata_keys, so every metadata field ends up in the hash. This includes file-stat-derived fields that SimpleDirectoryReader populates via default_file_metadata_func (last_modified_date, creation_date, last_accessed_date, file_size).

Under IngestionPipeline._handle_upserts, a hash mismatch triggers vector_store.delete() followed by a full re-embed. Combined with MetadataMode.ALL, this means any modification to volatile filesystem metadata will trigger a full re-embedding of the document's chunks on the next ingestion run, even when the text content is byte-identical.
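To make the upsert consequence concrete, here is a minimal sketch of hash-based upsert handling. ToyPipeline and its fields are hypothetical stand-ins, not the library's code; the sketch only models the behavior described above (same hash: skip, different hash: delete then re-embed every chunk).

```python
from dataclasses import dataclass, field


@dataclass
class ToyPipeline:
    """Simplified stand-in for hash-based upsert handling (hypothetical)."""

    doc_hashes: dict = field(default_factory=dict)  # doc_id -> last seen hash
    embed_calls: int = 0
    deletes: int = 0

    def upsert(self, doc_id: str, doc_hash: str, num_chunks: int) -> str:
        previous = self.doc_hashes.get(doc_id)
        if previous == doc_hash:
            return "skipped"  # unchanged hash: nothing is re-embedded
        if previous is not None:
            self.deletes += 1  # stale vectors are removed first
        self.doc_hashes[doc_id] = doc_hash
        self.embed_calls += num_chunks  # every chunk is embedded again
        return "embedded"


pipeline = ToyPipeline()
first = pipeline.upsert("doc.md", "hash-v1", num_chunks=4)
second = pipeline.upsert("doc.md", "hash-v1", num_chunks=4)
# Same text, but the hash differs because a metadata field changed.
third = pipeline.upsert("doc.md", "hash-v2", num_chunks=4)
print(first, second, third, pipeline.embed_calls, pipeline.deletes)
```

In this model the third call pays for all four chunks again even though the text is identical, which is the behavior the issue describes.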

The trigger is silent: no warning, no log, no indication that the pipeline is re-embedding unchanged content. The cost scales linearly with corpus size, modification rate, and re-indexing frequency.
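As a rough illustration of the linear scaling, the following sketch multiplies the three factors together. The function name and all numbers are hypothetical, chosen only to show the shape of the estimate; they are not measurements.

```python
def wasted_embed_calls(total_chunks: int, metadata_only_fraction: float, runs: int) -> int:
    """Chunks re-embedded with no text change, under a simple linear model.

    metadata_only_fraction is the share of chunks whose source file had only
    its volatile metadata (for example mtime) change between two runs.
    """
    return round(total_chunks * metadata_only_fraction * runs)


# Hypothetical corpus: 100,000 chunks, 5% touched per run, 30 runs.
estimate = wasted_embed_calls(total_chunks=100_000, metadata_only_fraction=0.05, runs=30)
doubled_runs = wasted_embed_calls(total_chunks=100_000, metadata_only_fraction=0.05, runs=60)
print(estimate, doubled_runs)
```

Doubling any one factor doubles the wasted calls, which is why frequent re-indexing of large corpora is the worst case.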

Root cause

PR #18303 correctly addressed #17871 (metadata changes not being detected by IngestionPipeline) by adding metadata to the hash. The fix used MetadataMode.ALL rather than MetadataMode.EMBED. Since excluded_embed_metadata_keys exists precisely to mark fields as volatile, operational, or not content-relevant, including them in the hash defeats the purpose of that mechanism.
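The difference between the two modes can be sketched with a toy hash function. toy_hash below is a hypothetical, simplified model (text plus a metadata string, hashed with SHA-256); it is not the actual Node.hash implementation, and the "all" / "embed" strings only stand in for MetadataMode.ALL and MetadataMode.EMBED.

```python
import hashlib


def toy_hash(text: str, metadata: dict, excluded_embed_keys: list, mode: str) -> str:
    """Toy model of a text-plus-metadata hash; not the library's code."""
    if mode == "all":
        keys = sorted(metadata)  # every field participates
    elif mode == "embed":
        keys = sorted(k for k in metadata if k not in excluded_embed_keys)
    else:
        raise ValueError(f"unknown mode: {mode}")
    metadata_str = "\n".join(f"{k}: {metadata[k]}" for k in keys)
    return hashlib.sha256(f"{text}{metadata_str}".encode("utf-8")).hexdigest()


text = "Hello, world."
excluded = ["last_modified_date", "file_size"]
before = {"file_path": "/tmp/doc.md", "last_modified_date": "2025-01-01", "file_size": 13}
after = {**before, "last_modified_date": "2025-01-02"}  # file touched, text unchanged

all_changed = toy_hash(text, before, excluded, "all") != toy_hash(text, after, excluded, "all")
embed_changed = toy_hash(text, before, excluded, "embed") != toy_hash(text, after, excluded, "embed")
print(all_changed, embed_changed)
```

Under these assumptions, only the "all" variant changes when the excluded timestamp changes.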

Scope

The bug manifests under these conditions:

  • SimpleDirectoryReader over a local filesystem. last_modified_date is formatted as "%Y-%m-%d" (date only), so the trigger is cross-day modifications. Any file modified between scheduled ingestion runs re-embeds on the next run.
  • Any reader that populates temporal metadata with sub-day precision (via a custom file_metadata callable, or via get_resource_info() code paths). Every modification triggers re-embedding, at the format's precision.
  • Manually-constructed Documents with volatile metadata. Same as above.
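The effect of timestamp precision can be checked in isolation. The sketch below formats three timestamps with the date-only pattern quoted above and with a hypothetical sub-day pattern; the use of UTC here is an assumption made to keep the example deterministic.

```python
from datetime import datetime, timezone

DATE_ONLY = "%Y-%m-%d"          # pattern quoted in the first bullet
SUB_DAY = "%Y-%m-%dT%H:%M:%S"   # hypothetical custom format with second precision


def fmt(timestamp: float, pattern: str) -> str:
    """Format a POSIX timestamp in UTC (assumption: UTC, for determinism)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime(pattern)


morning = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc).timestamp()
evening = datetime(2025, 1, 1, 21, 0, tzinfo=timezone.utc).timestamp()
next_day = datetime(2025, 1, 2, 9, 0, tzinfo=timezone.utc).timestamp()

# Does the formatted metadata value change between two modifications?
same_day_changes = fmt(morning, DATE_ONLY) != fmt(evening, DATE_ONLY)
cross_day_changes = fmt(morning, DATE_ONLY) != fmt(next_day, DATE_ONLY)
sub_day_changes = fmt(morning, SUB_DAY) != fmt(evening, SUB_DAY)
print(same_day_changes, cross_day_changes, sub_day_changes)
```

With date-only formatting, a same-day modification leaves the metadata string unchanged, while the finer-grained format changes on every modification.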

The bug does not manifest for SimpleDirectoryReader over fsspec cloud backends (s3fs, gcsfs, adlfs) in the current default load_data() path. This is because default_file_metadata_func (readers/file/base.py:164-168) queries POSIX-style stat keys (mtime, atime, created) that fsspec backends don't emit. The fix for that separate gap (using fs.modified(path) instead) would simultaneously activate this bug for every fsspec-backed reader.

Reproducers

Five progressive reproducers in this repo: https://github.com/stirelli/llamaindex-embedding-churn

Minimal reproduction (no API calls, no cost):

import os
from datetime import datetime, timedelta
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core.embeddings import BaseEmbedding
from pydantic import PrivateAttr
from pathlib import Path

# Fake embedder: counts embedding calls and returns a constant vector (no API calls).
class CountingEmbedding(BaseEmbedding):
    _call_count: int = PrivateAttr(default=0)
    @classmethod
    def class_name(cls): return "Counting"
    def _get_text_embedding(self, text):
        self._call_count += 1
        return [0.0] * 1536
    def _get_query_embedding(self, q): return self._get_text_embedding(q)
    async def _aget_text_embedding(self, t): return self._get_text_embedding(t)
    async def _aget_query_embedding(self, q): return self._get_text_embedding(q)
    def _get_text_embeddings(self, ts): return [self._get_text_embedding(t) for t in ts]
    async def _aget_text_embeddings(self, ts): return self._get_text_embeddings(ts)

docs_dir = Path("/tmp/churn_repro")
docs_dir.mkdir(exist_ok=True)
(docs_dir / "doc.md").write_text("Hello, world." * 100)
# Backdate atime and mtime by one day so the later touch() crosses a calendar day.
os.utime(docs_dir / "doc.md", ((datetime.now() - timedelta(days=1)).timestamp(),) * 2)

embedder = CountingEmbedding()
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=256), embedder],
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
)

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 1 (initial):  embed_calls={embedder._call_count}")

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 2 (no change): embed_calls={embedder._call_count}  # cache works")

(docs_dir / "doc.md").touch()  # bump mtime to today
pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 3 (post-touch): embed_calls={embedder._call_count}  # bug fires")

Observed output:

  • Phase 1 (initial): embed_calls is some positive number N (depends on chunking).
  • Phase 2 (no change): embed_calls stays at N. Cache hit, no new calls.
  • Phase 3 (post-touch): embed_calls has doubled to 2N. This is an unnecessary re-embed of byte-identical content, triggered by a single touch that moved mtime across a calendar day.

Expected behavior: Phase 3 also stays at N, because the text content has not changed.

Proposed fix

Three options to consider. Each preserves the intent of #18303 (detect meaningful metadata changes) to varying degrees.

Option A (recommended): Use MetadataMode.EMBED instead of MetadataMode.ALL

One-line change in Node.hash:

# llama-index-core/llama_index/core/schema.py, Node.hash @property
-        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)

This respects excluded_embed_metadata_keys, the mechanism already present in the codebase for exactly this purpose. SimpleDirectoryReader already adds the volatile file-stat fields to excluded_embed_metadata_keys by default:

# llama-index-core/llama_index/core/readers/file/base.py
excluded_embed_metadata_keys=[
    "file_name", "file_type", "file_size",
    "creation_date", "last_modified_date", "last_accessed_date",
],

so the fix correctly excludes volatile metadata from hash computation without requiring any reader-side changes.

Meaningful metadata changes (user-added tags, custom enrichments) remain tracked because excluded_embed_metadata_keys contains only the fields the reader explicitly marked as excluded from embedding. That matches the intent of #17871.

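Pros:

  • One-line change.
  • Preserves the #17871 fix: metadata that is sent to the embedder still participates in the hash.
  • Uses existing infrastructure (excluded_embed_metadata_keys); no reader-side changes needed.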

Cons:

  • Conflates two concerns in excluded_embed_metadata_keys: "don't send to embedder" and "don't invalidate cached embedding." In practice these align (if a field isn't sent to the embedder, the embedding output won't change when the field changes, so re-embedding is waste), but a maintainer may prefer a dedicated mechanism.

Option B: Remove metadata from the hash entirely (content-only hash)

Revert the metadata inclusion added by #18303. The hash becomes a pure function of the text content.

Pros:

  • Cleanest semantics. No ambiguity about what triggers re-embed.
  • Smallest possible surface for the hash.

Cons:

  • Reintroduces #17871: metadata changes would again go undetected by IngestionPipeline.

Option C: Add a dedicated excluded_hash_metadata_keys field

Introduce a new attribute on Node / Document that controls which metadata fields participate in the hash, independent of excluded_embed_metadata_keys.

Pros:

  • Cleanest separation of concerns. Embedding exclusion and hash exclusion are different user-facing levers.
  • Supports edge cases (e.g., a field excluded from embedding but wanted in the hash, or vice versa).

Cons:

  • API surface growth. Every reader that populates volatile metadata would need to populate the new field too.
  • More breaking for downstream users who have already built around the current two-field model.
  • Defaults have to be chosen carefully (should it default to excluded_embed_metadata_keys? to empty? something else?).
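Option C could look roughly like the sketch below. ToyNode is a hypothetical stand-in for Node, and excluded_hash_metadata_keys is the proposed attribute, which does not exist in the library today; the hashing scheme is a simplified model rather than the real implementation.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical node with separate embed-exclusion and hash-exclusion lists."""

    text: str
    metadata: dict = field(default_factory=dict)
    excluded_embed_metadata_keys: list = field(default_factory=list)
    excluded_hash_metadata_keys: list = field(default_factory=list)  # proposed field

    @property
    def hash(self) -> str:
        # Only metadata not listed in excluded_hash_metadata_keys affects the hash.
        kept = {k: v for k, v in self.metadata.items()
                if k not in self.excluded_hash_metadata_keys}
        metadata_str = "\n".join(f"{k}: {kept[k]}" for k in sorted(kept))
        return hashlib.sha256(f"{self.text}{metadata_str}".encode("utf-8")).hexdigest()


volatile = ["last_modified_date"]
base = ToyNode("Hello", {"last_modified_date": "2025-01-01", "tag": "draft"},
               excluded_hash_metadata_keys=volatile)
touched = ToyNode("Hello", {"last_modified_date": "2025-01-02", "tag": "draft"},
                  excluded_hash_metadata_keys=volatile)
retagged = ToyNode("Hello", {"last_modified_date": "2025-01-01", "tag": "final"},
                   excluded_hash_metadata_keys=volatile)

touch_changes_hash = base.hash != touched.hash
retag_changes_hash = base.hash != retagged.hash
print(touch_changes_hash, retag_changes_hash)
```

In this sketch a timestamp change leaves the hash alone while a user tag change still invalidates it, independent of what is excluded from embedding. The default for the new list remains the open question noted above.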

My recommendation

Option A. It fixes the bug with a one-line change, preserves the #17871 fix, and uses existing infrastructure. Options B and C are legitimate but come with larger trade-offs. Happy to be overruled by whichever option the maintainers prefer.

Impact

  • Production deployments that run scheduled SimpleDirectoryReader ingestion over local corpora silently re-embed every file whose mtime crossed a calendar day since the previous run, even when its text is unchanged.
  • Users adding custom file_metadata callables that include timestamps (a natural pattern) experience re-embedding on every source modification at the format's precision.
  • get_resource_info() code paths in S3Reader, GCSReader, and similar readers populate sub-second datetime fields and would fire this bug on every source update.
  • Users who manually construct Documents with any volatile metadata field experience the same behavior.

Version

llama-index-core==0.14.21 (current at time of writing). Bug introduced in #18303, merged into llama-index-core prior to v0.12.20 release.

Steps to reproduce

See reproducer above, or run the repo at https://github.com/stirelli/llamaindex-embedding-churn. It has five progressive levels of evidence, including end-to-end verification against a real OpenAI API key for billed-cost confirmation.

I'd be happy to

Open a PR with the one-line fix plus a unit test demonstrating the regression is prevented, if the maintainers agree with the approach. Alternative fixes, such as adding a hash_mode parameter to give users control, are also possible. Happy to discuss the trade-off here first.
