Bug Description
Since PR #18303 (merged 2025-03-30), `Node.hash` includes metadata via `MetadataMode.ALL`. That mode ignores `excluded_embed_metadata_keys`, so every metadata field ends up in the hash, including the file-stat-derived fields that `SimpleDirectoryReader` populates via `default_file_metadata_func` (`last_modified_date`, `creation_date`, `last_accessed_date`, `file_size`).
Under `IngestionPipeline._handle_upserts`, a hash mismatch triggers `vector_store.delete()` followed by a full re-embed. Combined with `MetadataMode.ALL`, this means any modification to volatile filesystem metadata triggers a full re-embedding of the document's chunks on the next ingestion run, even when the text content is byte-identical.
The trigger is silent: no warning, no log, no indication that the pipeline is re-embedding unchanged content. The cost scales linearly with corpus size, modification rate, and re-indexing frequency.
Root cause
PR #18303 correctly addressed #17871 (metadata changes not being detected by `IngestionPipeline`) by adding metadata to the hash. The fix used `MetadataMode.ALL` rather than `MetadataMode.EMBED`. Since `excluded_embed_metadata_keys` exists precisely to mark fields as volatile, operational, or not content-relevant, including them in the hash defeats the purpose of that mechanism.
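To make the distinction concrete, here is a simplified pure-Python stand-in for the hashing behavior (not the actual `Node.hash` implementation; `node_hash` and its parameters are illustrative only). Under the `MetadataMode.ALL` analogue, an excluded field still perturbs the hash; under the `MetadataMode.EMBED` analogue, it is ignored:

```python
import hashlib

def node_hash(text: str, metadata: dict, excluded_embed_keys: set, respect_exclusions: bool) -> str:
    """Simplified sketch of content hashing; not the real llama-index code."""
    if respect_exclusions:  # analogue of MetadataMode.EMBED
        keys = [k for k in sorted(metadata) if k not in excluded_embed_keys]
    else:                   # analogue of MetadataMode.ALL (current behavior)
        keys = sorted(metadata)
    metadata_str = "\n".join(f"{k}: {metadata[k]}" for k in keys)
    return hashlib.sha256((text + metadata_str).encode("utf-8")).hexdigest()

text = "Hello, world."
before = {"file_name": "doc.md", "last_modified_date": "2025-03-29"}
after = {"file_name": "doc.md", "last_modified_date": "2025-03-30"}
excluded = {"last_modified_date"}

# ALL analogue: the mtime bump alone changes the hash -> spurious re-embed
print(node_hash(text, before, excluded, False) == node_hash(text, after, excluded, False))  # False
# EMBED analogue: the excluded field is ignored -> hash stays stable
print(node_hash(text, before, excluded, True) == node_hash(text, after, excluded, True))    # True
```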
Scope
The bug manifests under these conditions:
- `SimpleDirectoryReader` over a local filesystem. `last_modified_date` is formatted as `"%Y-%m-%d"` (date only), so the trigger is cross-day modifications. Any file modified between scheduled ingestion runs re-embeds on the next run.
- Any reader that populates temporal metadata with sub-day precision (via a custom `file_metadata` callable, or via `get_resource_info()` code paths). Every modification triggers re-embedding, at the format's precision.
- Manually constructed `Document`s with volatile metadata. Same as above.
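The date-only precision is easy to verify directly (self-contained sketch; `FMT` mirrors the `"%Y-%m-%d"` format used for `last_modified_date`):

```python
from datetime import datetime

FMT = "%Y-%m-%d"  # date-only format used for last_modified_date

same_day_a = datetime(2025, 3, 30, 9, 0).strftime(FMT)
same_day_b = datetime(2025, 3, 30, 17, 0).strftime(FMT)
cross_day = datetime(2025, 3, 31, 9, 0).strftime(FMT)

print(same_day_a == same_day_b)  # True: same-day edits leave the metadata string unchanged
print(same_day_a == cross_day)   # False: cross-day edits change it, so the hash changes
```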
The bug does not manifest for `SimpleDirectoryReader` over fsspec cloud backends (s3fs, gcsfs, adlfs) in the current default `load_data()` path. This is because `default_file_metadata_func` (readers/file/base.py:164-168) queries POSIX-style stat keys (`mtime`, `atime`, `created`) that fsspec backends don't emit. The fix for that separate gap (using `fs.modified(path)` instead) would simultaneously activate this bug for every fsspec-backed reader.
Reproducers
Five progressive reproducers in this repo: https://github.com/stirelli/llamaindex-embedding-churn
Minimal reproduction (no API calls, no cost):
```python
import os
from datetime import datetime, timedelta
from pathlib import Path

from pydantic import PrivateAttr

from llama_index.core import SimpleDirectoryReader
from llama_index.core.embeddings import BaseEmbedding
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore


class CountingEmbedding(BaseEmbedding):
    """Dummy embedder that counts calls instead of hitting an API."""

    _call_count: int = PrivateAttr(default=0)

    @classmethod
    def class_name(cls):
        return "Counting"

    def _get_text_embedding(self, text):
        self._call_count += 1
        return [0.0] * 1536

    def _get_query_embedding(self, q):
        return self._get_text_embedding(q)

    async def _aget_text_embedding(self, t):
        return self._get_text_embedding(t)

    async def _aget_query_embedding(self, q):
        return self._get_text_embedding(q)

    def _get_text_embeddings(self, ts):
        return [self._get_text_embedding(t) for t in ts]

    async def _aget_text_embeddings(self, ts):
        return self._get_text_embeddings(ts)


docs_dir = Path("/tmp/churn_repro")
docs_dir.mkdir(exist_ok=True)
(docs_dir / "doc.md").write_text("Hello, world." * 100)
# Backdate mtime so the first two runs see the same last_modified_date
os.utime(docs_dir / "doc.md", ((datetime.now() - timedelta(days=1)).timestamp(),) * 2)

embedder = CountingEmbedding()
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=256), embedder],
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
)

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 1 (initial): embed_calls={embedder._call_count}")

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 2 (no change): embed_calls={embedder._call_count} # cache works")

(docs_dir / "doc.md").touch()  # bump mtime to today
pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 3 (post-touch): embed_calls={embedder._call_count} # bug fires")
```
Expected behavior:
- Phase 1 (initial): some positive number N of embed calls (depends on chunking).
- Phase 2 (no change): same as Phase 1. Cache hit, no new calls.
- Phase 3 (post-touch): embed count has doubled to 2N. This is an unnecessary re-embed of byte-identical content, triggered by a single `touch` that moved `mtime` across a calendar day.
Proposed fix
Three options to consider. Each preserves the intent of #18303 (detect meaningful metadata changes) to varying degrees.
Option A (recommended): Use `MetadataMode.EMBED` instead of `MetadataMode.ALL`
One-line change in the `Node.hash` property:

```diff
# llama-index-core/llama_index/core/schema.py, Node.hash @property
- metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+ metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
```
This respects `excluded_embed_metadata_keys`, the mechanism already present in the codebase for exactly this purpose. `SimpleDirectoryReader` already adds the volatile file-stat fields to `excluded_embed_metadata_keys` by default:

```python
# llama-index-core/llama_index/core/readers/file/base.py
excluded_embed_metadata_keys=[
    "file_name", "file_type", "file_size",
    "creation_date", "last_modified_date", "last_accessed_date",
],
```

so the fix correctly excludes volatile metadata from hash computation without requiring any reader-side changes.
Meaningful metadata changes (user-added tags, custom enrichments) remain tracked, because `excluded_embed_metadata_keys` contains only the fields the reader explicitly marked as excluded from embedding. That matches the intent of #17871.
Pros:
- One-line change; smallest possible diff.
- Uses `excluded_embed_metadata_keys`, a mechanism that already exists for exactly this purpose; no reader-side changes needed.
- Preserves the #17871 fix: non-excluded metadata changes still invalidate the hash.
Cons:
- Conflates two concerns in `excluded_embed_metadata_keys`: "don't send to embedder" and "don't invalidate cached embedding." In practice these align (if a field isn't sent to the embedder, the embedding output won't change when the field changes, so re-embedding is waste), but a maintainer may prefer a dedicated mechanism.
Option B: Remove metadata from the hash entirely (content-only hash)
Revert the metadata inclusion added by #18303. The hash becomes a pure function of the text content.
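A content-only hash can be sketched in a few lines (pure-Python illustration, not the actual llama-index code). Note that it also demonstrates the trade-off: metadata changes of any kind, meaningful or volatile, go undetected:

```python
import hashlib

def content_only_hash(text: str, metadata: dict) -> str:
    # metadata is deliberately ignored: the hash is a pure function of the text
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

doc = "Hello, world."
h1 = content_only_hash(doc, {"last_modified_date": "2025-03-29"})
h2 = content_only_hash(doc, {"last_modified_date": "2025-03-30", "tag": "important"})
print(h1 == h2)  # True: no re-embed on metadata churn, but the new tag also goes undetected
```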
Pros:
- Cleanest semantics. No ambiguity about what triggers re-embed.
- Smallest possible surface for the hash.
Cons:
- Reintroduces #17871: meaningful metadata changes (user-added tags, custom enrichments) would no longer be detected, so their updates would never propagate to stored nodes.
Option C: Add a dedicated `excluded_hash_metadata_keys` field
Introduce a new attribute on `Node` / `Document` that controls which metadata fields participate in the hash, independent of `excluded_embed_metadata_keys`.
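A rough sketch of what such a field could look like. Everything here is hypothetical: `NodeSketch` and the `excluded_hash_metadata_keys` attribute are illustrative only, not existing or proposed llama-index API:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class NodeSketch:
    """Hypothetical node shape with a dedicated hash-exclusion lever."""
    text: str
    metadata: dict = field(default_factory=dict)
    excluded_embed_metadata_keys: list = field(default_factory=list)
    excluded_hash_metadata_keys: list = field(default_factory=list)  # the new lever

    @property
    def hash(self) -> str:
        keys = [k for k in sorted(self.metadata) if k not in self.excluded_hash_metadata_keys]
        metadata_str = "\n".join(f"{k}: {self.metadata[k]}" for k in keys)
        return hashlib.sha256((self.text + metadata_str).encode("utf-8")).hexdigest()

a = NodeSketch("same text", {"mtime": "2025-03-29", "tag": "x"},
               excluded_hash_metadata_keys=["mtime"])
b = NodeSketch("same text", {"mtime": "2025-03-30", "tag": "x"},
               excluded_hash_metadata_keys=["mtime"])
print(a.hash == b.hash)  # True: mtime is excluded from the hash, independently of embed exclusions
```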
Pros:
- Cleanest separation of concerns. Embedding exclusion and hash exclusion are different user-facing levers.
- Supports edge cases (e.g., a field excluded from embedding but wanted in the hash, or vice versa).
Cons:
- API surface growth. Every reader that populates volatile metadata would need to populate the new field too.
- More breaking for downstream users who have already built around the current two-field model.
- Defaults have to be chosen carefully (should it default to `excluded_embed_metadata_keys`? To empty? Something else?).
My recommendation
Option A. It fixes the bug with a one-line change, preserves the #17871 fix, and uses existing infrastructure. Options B and C are legitimate but come with larger trade-offs. Happy to be overruled by whichever option the maintainers prefer.
Impact
- Production deployments using scheduled `SimpleDirectoryReader` ingestion over local corpora that change between runs re-embed modified files on every cross-day ingestion cycle, silently.
- Users adding custom `file_metadata` callables that include timestamps (a natural pattern) experience re-embedding on every source modification, at the format's precision.
- `get_resource_info()` code paths in `S3Reader`, `GCSReader`, and similar readers populate sub-second datetime fields and would fire this bug on every source update.
- Users who manually construct `Document`s with any volatile metadata field experience the same behavior.
Version
`llama-index-core==0.14.21` (current at time of writing). Bug introduced in #18303, merged into `llama-index-core` prior to the v0.12.20 release.
Steps to reproduce
See reproducer above, or run the repo at https://github.com/stirelli/llamaindex-embedding-churn. It has five progressive levels of evidence, including end-to-end verification against a real OpenAI API key for billed-cost confirmation.
I'd be happy to
Open a PR with the one-line fix plus a unit test demonstrating the regression is prevented, if the maintainers agree with the approach. Alternative fixes, such as adding a `hash_mode` parameter to give users control, are also possible. Happy to discuss the trade-offs here first.