notes/ELIXIR_CODEBASE_CONTEXT_SUBSET_SYSTEM_gemini.md at main · nshkrdotcom/notes

Of course. This is a fantastic and highly relevant problem in modern software development, especially when interacting with LLMs. Using Elixir's AST capabilities is a brilliant and very idiomatic approach.

Here is a brainstorm of possibilities for a system to dynamically list a subset of files for a particular feature, as you requested.

Project Codename: "Focus" - A Context-Aware File Subsetting System

The core problem is that a "feature" isn't just one file; it's a web of interconnected modules, structs, and behaviours. We need a system that can trace these connections automatically.

Core Idea: The Feature Context Graph

The central concept is to model the codebase as a directed graph.

Nodes: Each Elixir module (.ex file) is a node in the graph.
Edges: A directed edge from Module A to Module B exists if A depends on B.

The "subset of files for a feature" is then found by starting at a "seed" module for that feature and performing a graph traversal to find all connected, relevant nodes.

Brainstorming the Possibilities

1. How to Build the Graph? -> AST Analysis (as you suggested)

This is the most powerful approach. Instead of simple text-based grep, we parse the code into its Abstract Syntax Tree (AST). This gives us a structured, semantic understanding of the code.

Process:
1. For each .ex file in the codebase...
2. Use Code.string_to_quoted/1 to get the AST.
3. Use Macro.traverse/2 to walk the AST of each module.
What to look for (these become the graph edges):
- Direct Dependencies: alias, require, import, use Module. These are strong, explicit dependencies.
- Function Calls: MyModule.my_function(...). This creates a dependency from the current module to MyModule.
- Struct Usage: %MyStruct{} or defstruct [...]. This creates a dependency on the module that defines the struct.
- Behaviour Implementations: @behaviour MyBehaviour and @impl MyBehaviour. This creates a critical link between an implementation and its contract.
- Protocol Implementations: defimpl MyProtocol, for: MyType. Links the protocol, the implementation, and the data type.

This graph would be pre-calculated and stored (e.g., in an ETS table for fast in-memory access) so that querying for a feature's file set is instantaneous.

2. How to Define a "Feature"? -> The Seed

The system can't magically know what "BEACON" is. The user needs to provide a starting point, or a "seed."

Seed Types:
- A Module: e.g., DSPEx.Teleprompter.BEACON. This is the clearest seed.
- A Public Function: e.g., DSPEx.Teleprompter.BEACON.compile/5. The system would first find the module for this function.
- A File Path: e.g., dspex/teleprompter/beacon.ex. The system would parse the file to find the module(s) defined within.

3. How to Get the File Subset? -> Graph Traversal & Pruning

Once we have the seed module, we traverse the graph.

Traversal Direction:
- Downstream: Find everything the seed module uses (its dependencies). This is the most common need. E.g., BEACON -> BayesianOptimizer -> ConfigManager.
- Upstream: Find every module that uses the seed module (its dependents). This is useful for understanding the impact of a change. E.g., what would break if I changed DSPEx.Program?
- Bidirectional: For full context, a combination of both is often needed, but with a limited depth to prevent pulling in the entire codebase.
The Innovation: Context-Aware Pruning via Contracts (your second point)

This is the most exciting part. A simple traversal is good, but it's not smart. We can use contracts to prune irrelevant branches of the graph.
- Identifying Contracts:
  - Elixir Behaviours (@behaviour) are the primary mechanism.
  - Protocols (defprotocol).
  - In your codebase, a use DSPEx.Program or use DSPEx.Signature also acts as a powerful, domain-specific contract.
- Pruning Logic: When the traversal hits a function call that resolves to a behaviour callback (e.g., DSPEx.Program.forward(program, ...)), the system needs to get smarter.
  1. It identifies that forward is part of the DSPEx.Program contract.
  2. It inspects the arguments of the call to find the concrete implementation. In your code, the program variable holds the struct of the implementation (e.g., %DSPEx.Predict{} or %DSPEx.Teleprompter.BEACON{}).
  3. The traversal now follows the path to that specific implementation (predict.ex or beacon.ex) and discards all other known implementations of that behaviour.
- Example from your codebase:
  - A feature uses DSPEx.Client.request(messages, provider: :gemini).
  - The traversal hits DSPEx.Client.
  - The AST analysis is smart enough to see the provider: :gemini option.
  - Inside DSPEx.Client, the code might branch to use a specific adapter.
  - The system knows the "context" is :gemini. It will therefore include dspex/adapters/instructor_lite_gemini.ex but it will actively exclude a hypothetical dspex/adapters/openai.ex because it's not part of the current feature's execution path.

Putting it all Together: A Proposed System

Name: Focus

Phase 1: Indexing (Run once or on file changes)

Fire up a process (e.g., a GenServer) that watches the lib directory for changes.
On start/change, it scans all .ex files.
For each file, it parses the AST and populates an in-memory graph (ETS) with nodes (modules) and edges (dependencies of all types listed in point #1).

Phase 2: Querying (The user-facing part)

A user calls Focus.get_context("DSPEx.Teleprompter.BEACON").
Focus takes the seed "DSPEx.Teleprompter.BEACON" and finds its node in the graph.
It begins a bidirectional traversal (e.g., 2 levels up, unlimited levels down).
During traversal:
- If it encounters a link to DSPEx.Program, it checks the type of the program being used to select the correct implementation path.
- If it encounters a call like Client.request(..., provider: :gemini), it sets a "context hint" for the rest of the traversal, ensuring only :gemini-related adapters are included.
The result of the traversal is a MapSet of module names.
These module names are mapped back to their file paths.
The final output is a simple list of file paths, ready to be fed to the LLM.

This system would be a massive leap forward. It's not just listing files; it's constructing a "minimum viable context" for a specific feature, using the code's own semantic structure and contractual boundaries to include what's necessary and, just as importantly, exclude what is not.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project Codename: "Focus" - A Context-Aware File Subsetting System

Core Idea: The Feature Context Graph

Brainstorming the Possibilities

1. How to Build the Graph? -> AST Analysis (as you suggested)

2. How to Define a "Feature"? -> The Seed

3. How to Get the File Subset? -> Graph Traversal & Pruning

Putting it all Together: A Proposed System

FilesExpand file tree

ELIXIR_CODEBASE_CONTEXT_SUBSET_SYSTEM_gemini.md

Latest commit

History

ELIXIR_CODEBASE_CONTEXT_SUBSET_SYSTEM_gemini.md

File metadata and controls

Project Codename: "Focus" - A Context-Aware File Subsetting System

Core Idea: The Feature Context Graph

Brainstorming the Possibilities

1. How to Build the Graph? -> AST Analysis (as you suggested)

2. How to Define a "Feature"? -> The Seed

3. How to Get the File Subset? -> Graph Traversal & Pruning

Putting it all Together: A Proposed System