Of course. This is a fantastic and highly relevant problem in modern software development, especially when interacting with LLMs. Using Elixir's AST capabilities is a brilliant and very idiomatic approach.
Here is a brainstorm of possibilities for a system to dynamically list a subset of files for a particular feature, as you requested.
The core problem is that a "feature" isn't just one file; it's a web of interconnected modules, structs, and behaviours. We need a system that can trace these connections automatically.
The central concept is to model the codebase as a directed graph.
- Nodes: Each Elixir module (
.exfile) is a node in the graph. - Edges: A directed edge from Module A to Module B exists if A depends on B.
The "subset of files for a feature" is then found by starting at a "seed" module for that feature and performing a graph traversal to find all connected, relevant nodes.
This is the most powerful approach. Instead of simple text-based grep, we parse the code into its Abstract Syntax Tree (AST). This gives us a structured, semantic understanding of the code.
- Process:
- For each
.exfile in the codebase... - Use
Code.string_to_quoted/1to get the AST. - Use
Macro.traverse/2to walk the AST of each module.
- For each
- What to look for (these become the graph edges):
- Direct Dependencies:
alias,require,import,use Module. These are strong, explicit dependencies. - Function Calls:
MyModule.my_function(...). This creates a dependency from the current module toMyModule. - Struct Usage:
%MyStruct{}ordefstruct [...]. This creates a dependency on the module that defines the struct. - Behaviour Implementations:
@behaviour MyBehaviourand@impl MyBehaviour. This creates a critical link between an implementation and its contract. - Protocol Implementations:
defimpl MyProtocol, for: MyType. Links the protocol, the implementation, and the data type.
- Direct Dependencies:
This graph would be pre-calculated and stored (e.g., in an ETS table for fast in-memory access) so that querying for a feature's file set is instantaneous.
The system can't magically know what "BEACON" is. The user needs to provide a starting point, or a "seed."
- Seed Types:
- A Module: e.g.,
DSPEx.Teleprompter.BEACON. This is the clearest seed. - A Public Function: e.g.,
DSPEx.Teleprompter.BEACON.compile/5. The system would first find the module for this function. - A File Path: e.g.,
dspex/teleprompter/beacon.ex. The system would parse the file to find the module(s) defined within.
- A Module: e.g.,
Once we have the seed module, we traverse the graph.
-
Traversal Direction:
- Downstream: Find everything the seed module uses (its dependencies). This is the most common need. E.g.,
BEACON->BayesianOptimizer->ConfigManager. - Upstream: Find every module that uses the seed module (its dependents). This is useful for understanding the impact of a change. E.g., what would break if I changed
DSPEx.Program? - Bidirectional: For full context, a combination of both is often needed, but with a limited depth to prevent pulling in the entire codebase.
- Downstream: Find everything the seed module uses (its dependencies). This is the most common need. E.g.,
-
The Innovation: Context-Aware Pruning via Contracts (your second point)
This is the most exciting part. A simple traversal is good, but it's not smart. We can use contracts to prune irrelevant branches of the graph.
-
Identifying Contracts:
- Elixir Behaviours (
@behaviour) are the primary mechanism. - Protocols (
defprotocol). - In your codebase, a
use DSPEx.Programoruse DSPEx.Signaturealso acts as a powerful, domain-specific contract.
- Elixir Behaviours (
-
Pruning Logic: When the traversal hits a function call that resolves to a behaviour callback (e.g.,
DSPEx.Program.forward(program, ...)), the system needs to get smarter.- It identifies that
forwardis part of theDSPEx.Programcontract. - It inspects the arguments of the call to find the concrete implementation. In your code, the
programvariable holds the struct of the implementation (e.g.,%DSPEx.Predict{}or%DSPEx.Teleprompter.BEACON{}). - The traversal now follows the path to that specific implementation (
predict.exorbeacon.ex) and discards all other known implementations of that behaviour.
- It identifies that
-
Example from your codebase:
- A feature uses
DSPEx.Client.request(messages, provider: :gemini). - The traversal hits
DSPEx.Client. - The AST analysis is smart enough to see the
provider: :geminioption. - Inside
DSPEx.Client, the code might branch to use a specific adapter. - The system knows the "context" is
:gemini. It will therefore includedspex/adapters/instructor_lite_gemini.exbut it will actively exclude a hypotheticaldspex/adapters/openai.exbecause it's not part of the current feature's execution path.
- A feature uses
-
Name: Focus
Phase 1: Indexing (Run once or on file changes)
- Fire up a process (e.g., a GenServer) that watches the
libdirectory for changes. - On start/change, it scans all
.exfiles. - For each file, it parses the AST and populates an in-memory graph (ETS) with nodes (modules) and edges (dependencies of all types listed in point #1).
Phase 2: Querying (The user-facing part)
- A user calls
Focus.get_context("DSPEx.Teleprompter.BEACON"). Focustakes the seed"DSPEx.Teleprompter.BEACON"and finds its node in the graph.- It begins a bidirectional traversal (e.g., 2 levels up, unlimited levels down).
- During traversal:
- If it encounters a link to
DSPEx.Program, it checks the type of the program being used to select the correct implementation path. - If it encounters a call like
Client.request(..., provider: :gemini), it sets a "context hint" for the rest of the traversal, ensuring only:gemini-related adapters are included.
- If it encounters a link to
- The result of the traversal is a
MapSetof module names. - These module names are mapped back to their file paths.
- The final output is a simple list of file paths, ready to be fed to the LLM.
This system would be a massive leap forward. It's not just listing files; it's constructing a "minimum viable context" for a specific feature, using the code's own semantic structure and contractual boundaries to include what's necessary and, just as importantly, exclude what is not.