[Bug]: TreeSelectLeafRetriever query path returns empty source_nodes #21441

@thomascolden585-svg

Bug Description

Summary

TreeSelectLeafRetriever builds a Response with source_nodes=[] in its query path, which drops retrieval provenance and breaks source/citation visibility for consumers that rely on response.source_nodes.

Why This Is a Bug

The retriever traverses real leaf nodes, but the final response explicitly discards all sources. This creates an answer with no traceable origin, even when the underlying index has valid source nodes.

Affected Area

  • Package: llama-index-core
  • File: llama_index/core/indices/tree/select_leaf_retriever.py
  • Relevant line behavior: _query() returns Response(response_str, source_nodes=[]) with an inline TODO: fix source nodes.

Steps to Reproduce

  1. Set up a clean env at repo root:
    • uv sync
  2. Run this minimal script:
    from llama_index.core import Document, TreeIndex
    from llama_index.core.indices.tree.select_leaf_retriever import TreeSelectLeafRetriever
    from llama_index.core.schema import QueryBundle
    
    docs = [
        Document(text="Paris is the capital of France."),
        Document(text="Berlin is the capital of Germany."),
    ]
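    # Note: building a TreeIndex calls the configured LLM, so an LLM
    # (and any API key it needs) must be set up before running this.
    index = TreeIndex.from_documents(docs)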
    retriever = TreeSelectLeafRetriever(index=index, child_branch_factor=1)
    
    # Direct query path in retriever implementation
    resp = retriever._query(QueryBundle("What is the capital of France?"))
    print("response:", str(resp))
    print("source_nodes:", len(resp.source_nodes))
  3. Observe that source_nodes is empty.

Relevant Logs/Tracebacks

Expected Behavior

Response.source_nodes should include the selected leaf node(s) used to synthesize the final answer.

Actual Behavior

Response.source_nodes is always empty for this query path.

User Impact

  • Citation features cannot show where answers came from.
  • Evaluation/debug workflows lose retriever provenance.
  • Downstream integrations expecting source nodes can misbehave or display incomplete output.
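To make the impact concrete, here is a minimal sketch of a consumer that renders citations from `source_nodes`. `StubNode`, `StubResponse`, and `format_citations` are hypothetical stand-ins written for this illustration, not llama-index classes; the only assumption carried over from the report is that consumers read a response string and a `source_nodes` list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubNode:
    """Hypothetical stand-in for a retrieved source node."""
    text: str


@dataclass
class StubResponse:
    """Hypothetical stand-in exposing the two fields a consumer reads."""
    response: str
    source_nodes: List[StubNode] = field(default_factory=list)


def format_citations(resp: StubResponse) -> str:
    """Render the answer followed by numbered sources, or a notice if there are none."""
    lines = [resp.response]
    if not resp.source_nodes:
        lines.append("(no sources available)")
    for i, node in enumerate(resp.source_nodes, start=1):
        lines.append(f"[{i}] {node.text}")
    return "\n".join(lines)


with_sources = StubResponse("Paris.", [StubNode("Paris is the capital of France.")])
without_sources = StubResponse("Paris.", [])  # mirrors the source_nodes=[] case

print(format_citations(with_sources))
print(format_citations(without_sources))
```

With an empty list the consumer has nothing to cite, even though the answer itself was derived from a real leaf node.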

Proposed Fix

  • Track selected leaf nodes during traversal in _query_level() / _query_with_selected_node().
  • Return Response(response_str, source_nodes=[...]) instead of an empty list.
  • Add regression tests asserting non-empty source_nodes for successful tree leaf selection.
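The idea behind the proposed fix can be sketched on a toy tree, independent of llama-index. Everything below is hypothetical: `ToyNode`, `ToyResponse`, `select_leaves`, and `query` are illustration-only names, and the `choose` callback stands in for the LLM-driven child selection. It is not the actual code in select_leaf_retriever.py, only a sketch of collecting leaves during traversal and passing them into the response.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyNode:
    """Hypothetical tree node; a node with no children is a leaf."""
    text: str
    children: List["ToyNode"] = field(default_factory=list)


@dataclass
class ToyResponse:
    """Hypothetical response carrying the answer and its source leaves."""
    response: str
    source_nodes: List[ToyNode] = field(default_factory=list)


Chooser = Callable[[List[ToyNode]], List[ToyNode]]


def select_leaves(node: ToyNode, choose: Chooser, selected: List[ToyNode]) -> None:
    """Descend through chosen children and record every leaf that is reached."""
    if not node.children:
        selected.append(node)
        return
    for child in choose(node.children):
        select_leaves(child, choose, selected)


def query(root: ToyNode, choose: Chooser) -> ToyResponse:
    """Answer from the selected leaves and return them as source_nodes."""
    selected: List[ToyNode] = []
    select_leaves(root, choose, selected)
    answer = " ".join(leaf.text for leaf in selected)
    # The key change: pass the tracked leaves instead of an empty list.
    return ToyResponse(answer, source_nodes=selected)


def pick_france(children: List[ToyNode]) -> List[ToyNode]:
    """Stand-in for LLM selection: keep one child mentioning France, else the first."""
    matches = [c for c in children if "France" in c.text]
    return (matches or children)[:1]


root = ToyNode(
    "summary of both documents",
    [
        ToyNode("Paris is the capital of France."),
        ToyNode("Berlin is the capital of Germany."),
    ],
)
resp = query(root, pick_france)
print("response:", resp.response)
print("source_nodes:", len(resp.source_nodes))
```

A regression test for the real retriever could follow the same shape: run a query that succeeds and assert that `source_nodes` is non-empty and contains the selected leaf.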

Validation Checklist

  • Reproduction script confirms empty source_nodes before fix.
  • Fix returns selected leaf nodes in Response.source_nodes.
  • Regression test added in llama-index-core/tests.
  • uv run make lint passes.
  • uv run -- pytest passes for modified package(s).
  • No unrelated refactors included.

Definition of Done

  • Query path no longer drops source nodes.
  • Tests fail before fix and pass after fix.
  • CI passes for the PR.

Metadata

    Labels

    bug (Something isn't working), triage (Issue needs to be triaged/prioritized)