[Bug]: Node.hash uses MetadataMode.ALL, causing unnecessary re-embeds when volatile file-stat metadata changes #21461


### Bug Description

Since PR #18303 (merged 2025-03-30), `Node.hash` includes metadata via `MetadataMode.ALL`. That mode ignores `excluded_embed_metadata_keys`, so every metadata field ends up in the hash. This includes file-stat-derived fields that `SimpleDirectoryReader` populates via `default_file_metadata_func` (`last_modified_date`, `creation_date`, `last_accessed_date`, `file_size`).

Under `IngestionPipeline._handle_upserts`, a hash mismatch triggers `vector_store.delete()` followed by a full re-embed. Combined with `MetadataMode.ALL`, this means any modification to volatile filesystem metadata will trigger a full re-embedding of the document's chunks on the next ingestion run, even when the text content is byte-identical.

The trigger is silent: no warning, no log, no indication that the pipeline is re-embedding unchanged content. The cost scales linearly with corpus size, modification rate, and re-indexing frequency.
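
The upsert decision described above can be modeled in a few lines. This is a simplified sketch, not the actual `_handle_upserts` implementation: `node_hash`, `upsert`, and the dict-based docstore are hypothetical stand-ins, and the hash here deliberately covers every metadata field.

```python
from hashlib import sha256


def node_hash(text: str, metadata: dict) -> str:
    """Toy stand-in for a hash that covers text plus every metadata field."""
    payload = text + "".join(f"{k}: {v}\n" for k, v in sorted(metadata.items()))
    return sha256(payload.encode("utf-8")).hexdigest()


def upsert(docstore: dict, doc_id: str, text: str, metadata: dict) -> str:
    """Return the action a hash-based upsert would take for one document."""
    new_hash = node_hash(text, metadata)
    old_hash = docstore.get(doc_id)
    docstore[doc_id] = new_hash
    if old_hash is None:
        return "embed"  # first time this document is seen
    if old_hash == new_hash:
        return "skip"  # unchanged: stored embeddings are reused
    return "delete+re-embed"  # hash mismatch: drop old vectors, embed again


docstore = {}
text = "Hello, world."  # identical text in all three runs
actions = [
    upsert(docstore, "doc.md", text, {"last_modified_date": "2025-01-01"}),
    upsert(docstore, "doc.md", text, {"last_modified_date": "2025-01-01"}),
    upsert(docstore, "doc.md", text, {"last_modified_date": "2025-01-02"}),
]
print(actions)  # ['embed', 'skip', 'delete+re-embed']
```

The third run re-embeds even though only a date string changed, and nothing in the return path would tell the caller why.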

### Root cause

[PR #18303](https://github.com/run-llama/llama_index/pull/18303) correctly addressed #17871 (metadata changes not being detected by `IngestionPipeline`) by adding metadata to the hash. The fix used `MetadataMode.ALL` rather than `MetadataMode.EMBED`. Since `excluded_embed_metadata_keys` exists precisely to mark fields as volatile, operational, or not content-relevant, including them in the hash defeats the purpose of that mechanism.
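
To illustrate the difference between the two modes, here is a toy model of the hash input. It is a sketch under stated assumptions: `metadata_str` and `toy_hash` are hypothetical helpers that imitate "keep every key" versus "drop the embed-excluded keys", and the `team` field stands in for any user-added metadata.

```python
from hashlib import sha256


def metadata_str(metadata: dict, excluded_embed_keys: list, mode: str) -> str:
    """Toy model: "all" keeps every key, "embed" drops the excluded ones."""
    skip = set(excluded_embed_keys) if mode == "embed" else set()
    return "\n".join(f"{k}: {v}" for k, v in metadata.items() if k not in skip)


def toy_hash(text: str, metadata: dict, excluded_embed_keys: list, mode: str) -> str:
    payload = metadata_str(metadata, excluded_embed_keys, mode) + text
    return sha256(payload.encode("utf-8")).hexdigest()


excluded = ["last_modified_date", "file_size"]
before = {"file_path": "doc.md", "last_modified_date": "2025-01-01", "team": "docs"}
touched = {**before, "last_modified_date": "2025-01-02"}  # volatile field only
retagged = {**before, "team": "platform"}  # user-added field

text = "Hello, world."
all_changes_on_touch = (
    toy_hash(text, before, excluded, "all") != toy_hash(text, touched, excluded, "all")
)
embed_changes_on_touch = (
    toy_hash(text, before, excluded, "embed") != toy_hash(text, touched, excluded, "embed")
)
embed_changes_on_retag = (
    toy_hash(text, before, excluded, "embed") != toy_hash(text, retagged, excluded, "embed")
)
print(all_changes_on_touch, embed_changes_on_touch, embed_changes_on_retag)
# True False True
```

In this model the "all" hash changes on a timestamp-only update, while the "embed" hash ignores it and still reacts to the user-added field.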

### Scope

The bug manifests under these conditions:

- `SimpleDirectoryReader` over a local filesystem. `last_modified_date` is formatted as `"%Y-%m-%d"` (date only), so the trigger is cross-day modifications. Any file modified between scheduled ingestion runs re-embeds on the next run.
- Any reader that populates temporal metadata with sub-day precision (via a custom `file_metadata` callable, or via `get_resource_info()` code paths). Every modification that changes the formatted value triggers re-embedding, so the trigger granularity is the format's precision.
- Manually-constructed `Document`s with volatile metadata. Same as above.
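
The precision point in the list above can be checked with plain `strftime`. A small sketch: `DATE_ONLY` is the format cited above, while `SUB_DAY` is a hypothetical higher-precision format of the kind a custom metadata callable might use.

```python
from datetime import datetime

DATE_ONLY = "%Y-%m-%d"  # format cited above for last_modified_date
SUB_DAY = "%Y-%m-%dT%H:%M:%S"  # hypothetical higher-precision format

morning = datetime(2025, 1, 1, 9, 0, 0)
evening = datetime(2025, 1, 1, 21, 30, 0)
next_day = datetime(2025, 1, 2, 0, 5, 0)

# Same-day edits collapse to the same string under the date-only format.
same_day_date_only = morning.strftime(DATE_ONLY) == evening.strftime(DATE_ONLY)
# Crossing midnight changes the string, which changes the hash.
cross_day_date_only = evening.strftime(DATE_ONLY) == next_day.strftime(DATE_ONLY)
# With sub-day precision, every edit produces a new string.
same_day_sub_day = morning.strftime(SUB_DAY) == evening.strftime(SUB_DAY)

print(same_day_date_only, cross_day_date_only, same_day_sub_day)
# True False False
```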

The bug does *not* manifest for `SimpleDirectoryReader` over fsspec cloud backends (s3fs, gcsfs, adlfs) in the current default `load_data()` path. This is because `default_file_metadata_func` (`readers/file/base.py:164-168`) queries POSIX-style stat keys (`mtime`, `atime`, `created`) that fsspec backends don't emit. The fix for that separate gap (using `fs.modified(path)` instead) would simultaneously activate this bug for every fsspec-backed reader.
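
A toy model of why the key mismatch hides the bug. This is a sketch, not the library's code: `toy_file_dates` is a hypothetical function that reads the POSIX-style keys named above, and the keys in `cloud_stat` are an assumed example of an object-store style info dict.

```python
from datetime import datetime, timezone


def toy_file_dates(stat_result: dict) -> dict:
    """Toy metadata function that only understands POSIX-style stat keys."""

    def fmt(ts):
        if ts is None:
            return None
        return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

    fields = {
        "creation_date": fmt(stat_result.get("created")),
        "last_modified_date": fmt(stat_result.get("mtime")),
        "last_accessed_date": fmt(stat_result.get("atime")),
    }
    # Fields whose source key is missing are dropped entirely.
    return {k: v for k, v in fields.items() if v is not None}


local_stat = {"created": 1735689600.0, "mtime": 1735776000.0, "atime": 1735776000.0}
# Assumed object-store style info: the timestamp lives under a different key.
cloud_stat = {"name": "bucket/doc.md", "size": 1300, "LastModified": "2025-01-02T00:00:00Z"}

local_meta = toy_file_dates(local_stat)
cloud_meta = toy_file_dates(cloud_stat)
print(local_meta)
print(cloud_meta)  # {}
```

With no date fields in the metadata, nothing volatile reaches the hash. Once a reader starts mapping the backend's own timestamp into `last_modified_date`, the same churn would appear.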

### Reproducers

Five progressive reproducers in this repo: https://github.com/stirelli/llamaindex-embedding-churn

Minimal reproduction (no API calls, no cost):

```python
import os
from datetime import datetime, timedelta
from pathlib import Path

from llama_index.core import SimpleDirectoryReader
from llama_index.core.embeddings import BaseEmbedding
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore
from pydantic import PrivateAttr

class CountingEmbedding(BaseEmbedding):
    _call_count: int = PrivateAttr(default=0)
    @classmethod
    def class_name(cls): return "Counting"
    def _get_text_embedding(self, text):
        self._call_count += 1
        return [0.0] * 1536
    def _get_query_embedding(self, q): return self._get_text_embedding(q)
    async def _aget_text_embedding(self, t): return self._get_text_embedding(t)
    async def _aget_query_embedding(self, q): return self._get_text_embedding(q)
    def _get_text_embeddings(self, ts): return [self._get_text_embedding(t) for t in ts]
    async def _aget_text_embeddings(self, ts): return self._get_text_embeddings(ts)

# Create a one-file corpus and backdate its atime/mtime to yesterday.
docs_dir = Path("/tmp/churn_repro")
docs_dir.mkdir(exist_ok=True)
(docs_dir / "doc.md").write_text("Hello, world." * 100)
os.utime(docs_dir / "doc.md", ((datetime.now() - timedelta(days=1)).timestamp(),) * 2)

def load_docs():
    # filename_as_id=True keeps document IDs stable across runs, so the
    # docstore compares hashes for the same document on each run.
    return SimpleDirectoryReader(input_dir=str(docs_dir), filename_as_id=True).load_data()

embedder = CountingEmbedding()
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=256), embedder],
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
)

pipeline.run(documents=load_docs())
print(f"Phase 1 (initial):  embed_calls={embedder._call_count}")

pipeline.run(documents=load_docs())
print(f"Phase 2 (no change): embed_calls={embedder._call_count}  # cache works")

(docs_dir / "doc.md").touch()  # bump mtime to today; file content is unchanged
pipeline.run(documents=load_docs())
print(f"Phase 3 (post-touch): embed_calls={embedder._call_count}  # bug fires")
```

Expected output (with the bug present):

- Phase 1 (`initial`): some positive number N of embed calls (depends on chunking).
- Phase 2 (`no change`): same as Phase 1. Cache hit, no new calls.
- Phase 3 (`post-touch`): embed count has **doubled** to 2N. This is an unnecessary re-embed of byte-identical content, triggered by a single `touch` that moved `mtime` across a calendar day.

### Proposed fix

Three options to consider. Each preserves the intent of #18303 (detect meaningful metadata changes) to varying degrees.

#### Option A (recommended): Use `MetadataMode.EMBED` instead of `MetadataMode.ALL`

One line change in `Node.hash`:

```diff
# llama-index-core/llama_index/core/schema.py, Node.hash @property
-        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
```

This respects `excluded_embed_metadata_keys`, the mechanism already present in the codebase for exactly this purpose. `SimpleDirectoryReader` already adds the volatile file-stat fields to `excluded_embed_metadata_keys` by default:

```python
# llama-index-core/llama_index/core/readers/file/base.py
excluded_embed_metadata_keys=[
    "file_name", "file_type", "file_size",
    "creation_date", "last_modified_date", "last_accessed_date",
],
```

so the fix correctly excludes volatile metadata from hash computation without requiring any reader-side changes.

Meaningful metadata changes (user-added tags, custom enrichments) remain tracked because `excluded_embed_metadata_keys` contains only the fields the reader explicitly marked as excluded from embedding. That matches the intent of #17871.

Pros:

- One-line diff.
- Reuses existing infrastructure (no new API surface).
- Preserves the #17871 fix for content-relevant metadata changes.

Cons:

- Conflates two concerns in `excluded_embed_metadata_keys`: "don't send to embedder" and "don't invalidate cached embedding." In practice these align (if a field isn't sent to the embedder, the embedding output won't change when the field changes, so re-embedding is waste), but a maintainer may prefer a dedicated mechanism.

#### Option B: Remove metadata from the hash entirely (content-only hash)

Revert the metadata inclusion added by #18303. The hash becomes a pure function of the text content.

Pros:

- Cleanest semantics. No ambiguity about what triggers re-embed.
- Smallest possible surface for the hash.

Cons:

- Regresses #17871. Users who rely on metadata updates triggering a re-embed (the behavior #18303 added to fix #17871) lose that behavior.
- Shifts the burden back to users to explicitly invalidate documents when metadata changes.

#### Option C: Add a dedicated `excluded_hash_metadata_keys` field

Introduce a new attribute on `Node` / `Document` that controls which metadata fields participate in the hash, independent of `excluded_embed_metadata_keys`.

Pros:

- Cleanest separation of concerns. Embedding exclusion and hash exclusion are different user-facing levers.
- Supports edge cases (e.g., a field excluded from embedding but wanted in the hash, or vice versa).

Cons:

- API surface growth. Every reader that populates volatile metadata would need to populate the new field too.
- More disruptive for downstream users who have already built around the current two-field model.
- Defaults have to be chosen carefully (should it default to `excluded_embed_metadata_keys`? to empty? something else?).
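
For discussion, a toy sketch of what Option C might look like. Everything here is hypothetical: `ToyNode` is not the library's `Node`, and the fallback to `excluded_embed_metadata_keys` when the new field is unset is only one of the default choices raised above.

```python
from dataclasses import dataclass, field
from hashlib import sha256
from typing import Optional


@dataclass
class ToyNode:
    text: str
    metadata: dict = field(default_factory=dict)
    excluded_embed_metadata_keys: list = field(default_factory=list)
    # Hypothetical new field; None means "fall back to the embed exclusions".
    excluded_hash_metadata_keys: Optional[list] = None

    @property
    def hash(self) -> str:
        excluded = self.excluded_hash_metadata_keys
        if excluded is None:
            excluded = self.excluded_embed_metadata_keys
        kept = {k: v for k, v in self.metadata.items() if k not in excluded}
        payload = self.text + "".join(f"{k}: {v}\n" for k, v in sorted(kept.items()))
        return sha256(payload.encode("utf-8")).hexdigest()


def make(last_modified: str, reviewed_by: str) -> ToyNode:
    # reviewed_by is kept out of the embedding but still participates in the hash.
    return ToyNode(
        text="Hello, world.",
        metadata={"last_modified_date": last_modified, "reviewed_by": reviewed_by},
        excluded_embed_metadata_keys=["last_modified_date", "reviewed_by"],
        excluded_hash_metadata_keys=["last_modified_date"],
    )


base = make("2025-01-01", "alice")
touched_changes_hash = base.hash != make("2025-01-02", "alice").hash
reviewer_changes_hash = base.hash != make("2025-01-01", "bob").hash
print(touched_changes_hash, reviewer_changes_hash)  # False True
```

The sketch shows the edge case from the pros list: a field excluded from embedding can still invalidate the stored hash, at the cost of a second list that readers would have to populate.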

#### My recommendation

Option A. It fixes the bug with a one-line change, preserves the #17871 fix, and uses existing infrastructure. Options B and C are legitimate but come with larger trade-offs. Happy to be overruled by whichever option the maintainers prefer.

### Impact

- Production deployments using scheduled `SimpleDirectoryReader` ingestion over local corpora that change between runs re-embed modified files on every cross-day ingestion cycle, silently.
- Users adding custom `file_metadata` callables that include timestamps (a natural pattern) experience re-embedding on every source modification at the format's precision.
- `get_resource_info()` code paths in `S3Reader`, `GCSReader`, and similar readers populate sub-second datetime fields and would trigger this bug on every source update.
- Users who manually construct `Document`s with any volatile metadata field experience the same behavior.

### Version

`llama-index-core==0.14.21` (current at time of writing). Bug introduced in #18303, merged into `llama-index-core` prior to v0.12.20 release.

### Steps to reproduce

See reproducer above, or run the repo at https://github.com/stirelli/llamaindex-embedding-churn. It has five progressive levels of evidence, including end-to-end verification against a real OpenAI API key for billed-cost confirmation.

### I'd be happy to

Open a PR with the one-line fix plus a unit test demonstrating the regression is prevented, if the maintainers agree with the approach. Alternative fixes, such as adding a `hash_mode` parameter to give users control, are also possible. Happy to discuss the trade-off here first.

