Conversation
Pull request overview
This PR enhances CortexaDB’s querying and persistence behavior by adding intent-based query tuning, improving score-weight normalization, expanding chunking behavior/tests, and adding broader integration coverage. It also modernizes operational output by switching from `eprintln!` to `log` across the core and adds a `CortexaDBBuilder` to make configuration explicit.
Changes:
- Add integration tests covering WAL recovery, checkpointing, deletion, graph edges, namespaces, metadata persistence, and capacity eviction.
- Introduce intent anchors plumbing (`IntentAnchors` + `QueryOptions.intent_anchors`) and relax score-weight normalization tolerance.
- Add a `CortexaDBBuilder`, make `vector_dimension` explicit in `open()`, and update namespace-scoped querying behavior.
Reviewed changes
Copilot reviewed 9 out of 12 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| crates/cortexadb-core/tests/integration.rs | Adds end-to-end persistence + recovery integration tests. |
| crates/cortexadb-core/src/store.rs | Switches to log and adds intent-anchor-based option adjustment in query_with_snapshot. |
| crates/cortexadb-core/src/query/mod.rs | Re-exports new IntentAnchors type. |
| crates/cortexadb-core/src/query/hybrid.rs | Adds IntentAnchors, adds intent_anchors to QueryOptions, loosens score weight normalization tolerance + tests. |
| crates/cortexadb-core/src/index/vector.rs | Switches HNSW fallback output to log. |
| crates/cortexadb-core/src/facade.rs | Adds CortexaDBBuilder, changes open() signature, adjusts ask_in_namespace over-fetching, expands tests. |
| crates/cortexadb-core/src/chunker.rs | Reworks fixed chunking (word-safe), fixes recursive fallback behavior, adds extensive tests. |
| crates/cortexadb-core/src/bin/monkey_writer.rs | Switches progress output to log and reduces logging frequency. |
| crates/cortexadb-core/Cargo.toml | Adds log dependency and serial_test dev-dependency; bumps crate version. |
| Cargo.lock | Updates lockfile for new dependencies + version bumps. |
| .gitignore | Fixes .venv/ formatting and adds skills/. |
```rust
fn apply_intent_adjustments(
    options: &mut QueryOptions,
    anchors: &IntentAnchors,
    query_text: &str,
    embedder: &dyn QueryEmbedder,
) {
    if let Ok(query_emb) = embedder.embed(query_text) {
        if let Some(adj) = QueryPlanner::infer_intent_adjustments(
            &query_emb,
            &anchors.semantic,
            &anchors.recency,
            &anchors.graph,
            anchors.graph_hops_2_threshold,
            anchors.graph_hops_3_threshold,
            anchors.importance_pct,
        ) {
            options.score_weights = adj.score_weights;
            if let Some(exp) = options.graph_expansion.as_mut() {
                exp.hops = adj.graph_hops;
            }
        }
    }
}
```
`query_with_snapshot` applies intent adjustments by calling `embedder.embed(query_text)` before building the plan, but `QueryExecutor::execute` will embed the same query again. For real embedders this doubles latency/cost and can introduce inconsistency for non-deterministic embedders. Consider refactoring so embedding happens once (e.g., move intent adjustment logic into the executor after embedding, or allow passing the precomputed query embedding through).
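One way to structure the "embed once" refactor, as a minimal sketch with illustrative stand-in types (not CortexaDB's real API): the caller embeds a single time and threads the vector through both the intent adjustment and the executor.

```rust
// Hypothetical stand-ins; the real QueryOptions/executor differ.
struct Options { graph_hops: u8 }

// Stand-in for a real embedder; deterministic for the example.
fn embed(text: &str) -> Vec<f32> {
    text.bytes().map(|b| b as f32).collect()
}

// Takes the precomputed embedding; no second embed() call.
fn apply_intent_adjustments(options: &mut Options, query_emb: &[f32]) {
    if !query_emb.is_empty() {
        options.graph_hops = 2;
    }
}

fn query_with_snapshot(text: &str, mut options: Options) -> (usize, u8) {
    let query_emb = embed(text); // embed exactly once
    apply_intent_adjustments(&mut options, &query_emb);
    execute(&query_emb, &options) // executor reuses the same vector
}

fn execute(query_emb: &[f32], options: &Options) -> (usize, u8) {
    (query_emb.len(), options.graph_hops)
}
```

This keeps the embedding deterministic per query even for non-deterministic embedders, since every stage sees the same vector.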
```rust
pub fn open(path: &str, vector_dimension: usize) -> Result<Self> {
    Self::builder(path, vector_dimension).build()
}

/// Create a [`CortexaDBBuilder`] to configure advanced database options.
pub fn builder(path: &str, vector_dimension: usize) -> CortexaDBBuilder {
    CortexaDBBuilder::new(path, vector_dimension)
}
```
`CortexaDB::open` is now a breaking API change (it requires `vector_dimension` and no longer accepts a config). There are still in-repo call sites that use the old signature (e.g. examples/rust/basic_usage.rs), which will fail to compile until updated. Please update the examples (and any other public docs) to use `open(path, dimension)` or `open_with_config`/`builder()` as appropriate.
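The open/builder relationship in the PR can be illustrated with toy stand-ins (the real types carry far more configuration; names here mirror the PR but the bodies are assumptions):

```rust
// Toy stand-ins to illustrate the new open(path, dimension) shape.
struct CortexaDB { dim: usize }
struct CortexaDBBuilder { path: String, dim: usize }

impl CortexaDB {
    // `open` is now just the builder with defaults.
    fn open(path: &str, vector_dimension: usize) -> Result<Self, String> {
        Self::builder(path, vector_dimension).build()
    }
    fn builder(path: &str, vector_dimension: usize) -> CortexaDBBuilder {
        CortexaDBBuilder { path: path.to_string(), dim: vector_dimension }
    }
}

impl CortexaDBBuilder {
    fn build(self) -> Result<CortexaDB, String> {
        // Making the dimension explicit lets open() fail fast on bad input.
        if self.dim == 0 {
            return Err("vector_dimension must be non-zero".into());
        }
        Ok(CortexaDB { dim: self.dim })
    }
}
```

Migrating a call site is then a one-line change from `open(path, config)` to `open(path, dimension)` or `builder(path, dimension).build()`.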
```rust
    capacity_policy: crate::engine::CapacityPolicy::new(None, None),
    index_mode: crate::index::IndexMode::Exact,
};
config.checkpoint_policy = crate::store::CheckpointPolicy::Disabled;
```
This test builds a `CortexaDBConfig` with `checkpoint_policy: Disabled` and then immediately re-assigns the same value. The extra assignment is redundant and makes the intent harder to follow; it can be removed.
```rust
// Compute how many words from the tail of this chunk should be repeated
// at the start of the next chunk (overlap, measured in chars).
let overlap_words = {
    let mut acc = 0usize;
    let mut count = 0usize;
    for word in chunk_words.iter().rev() {
        let cost = if count == 0 { word.len() } else { word.len() + 1 };
        if acc + cost <= overlap {
```
`overlap` has different units depending on the strategy: here in `chunk_fixed` it is treated as a character budget (see "overlap, measured in chars"), while `apply_overlap()` interprets `overlap` as a number of words for recursive/semantic/markdown. Because the same `ChunkingStrategy::{..., overlap}` field is exposed publicly, this is easy for callers to misinterpret. Consider documenting the units on the `ChunkingStrategy` variants (or splitting into `overlap_chars` vs `overlap_words`) to avoid accidental misuse.
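The split-field idea could look like the sketch below (hypothetical enum shape, not the crate's real one), alongside a standalone version of the character-budget loop from the diff:

```rust
// Hypothetical variant split that makes the units unambiguous at the type level.
#[allow(dead_code)]
enum ChunkingStrategy {
    /// `overlap_chars` is a character budget taken from the chunk tail.
    Fixed { size: usize, overlap_chars: usize },
    /// `overlap_words` is a count of whole words to repeat.
    Recursive { overlap_words: usize },
}

/// How many tail words fit in a character budget (the chunk_fixed logic,
/// extracted for illustration): each word after the first also pays for
/// the separating space.
fn tail_words_within(words: &[&str], budget_chars: usize) -> usize {
    let mut acc = 0usize;
    let mut count = 0usize;
    for word in words.iter().rev() {
        let cost = if count == 0 { word.len() } else { word.len() + 1 };
        if acc + cost <= budget_chars {
            acc += cost;
            count += 1;
        } else {
            break;
        }
    }
    count
}
```

With `["a", "bb", "ccc"]` and a 6-character budget, "ccc" (3) and " bb" (3) fit but " a" does not, so two words overlap.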
```diff
 // Use HNSW if available (approximate search)
 if let Some(ref hnsw) = self.hnsw_backend {
     match hnsw.search(query, top_k, None) {
         Ok(results) => return Ok(results),
         Err(e) => {
             // Fall back to exact search if HNSW fails
-            eprintln!("HNSW search failed, falling back to exact: {:?}", e);
+            log::warn!("HNSW search failed, falling back to exact: {:?}", e);
         }
```
`search_scoped` takes a namespace, but the HNSW fast path ignores it (`hnsw.search(..., None)`) and returns results unscoped. This can leak cross-namespace hits whenever HNSW is enabled, and callers like `ask_in_namespace` then have to over-fetch and filter to compensate. Consider either (a) disabling the HNSW path when `namespace.is_some()`, or (b) post-filtering HNSW results by `id_to_namespace` and re-querying with a larger `top_k` until enough in-namespace results are collected.
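Option (b) can be sketched as a retry loop over a search callback (the closure stands in for `hnsw.search`; all names are illustrative, not the crate's API):

```rust
use std::collections::HashMap;

// Post-filter approximate-search hits by namespace, doubling the fetch
// size until enough in-namespace results are collected or max_k is hit.
fn search_scoped<F>(
    search: F,                            // stands in for hnsw.search
    id_to_namespace: &HashMap<u64, String>,
    namespace: &str,
    top_k: usize,
    max_k: usize,
) -> Vec<u64>
where
    F: Fn(usize) -> Vec<u64>,
{
    let mut k = top_k;
    loop {
        let hits: Vec<u64> = search(k)
            .into_iter()
            .filter(|id| id_to_namespace.get(id).map(String::as_str) == Some(namespace))
            .collect();
        if hits.len() >= top_k || k >= max_k {
            return hits.into_iter().take(top_k).collect();
        }
        k = (k * 2).min(max_k); // over-fetch and retry with a larger top_k
    }
}
```

The `max_k` cap bounds the cost when a namespace simply has fewer than `top_k` memories.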
```rust
if let Some(exp) = options.graph_expansion.as_mut() {
    exp.hops = adj.graph_hops;
}
```
`apply_intent_adjustments` updates `options.graph_expansion.hops`, but the executor currently also auto-adjusts `graph_expansion` based on the global intent policy (and it does so even when `score_weights` are non-default). As a result, caller-provided `intent_anchors` graph-hop adjustments are likely overwritten during execution. To make `intent_anchors` effective, consider disabling executor-side intent auto-adjustment when custom anchors were provided, or only letting the executor override graph hops when the hops are still at the default value.
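The "only override when still default" guard could look like this sketch (names and the default value are hypothetical):

```rust
// Hypothetical default hop count for graph expansion.
const DEFAULT_HOPS: u8 = 1;

// The executor leaves hops alone if the caller supplied intent anchors
// or has already moved hops off the default; otherwise it may apply the
// global policy's value.
fn maybe_auto_adjust_hops(current_hops: u8, policy_hops: u8, anchors_provided: bool) -> u8 {
    if anchors_provided || current_hops != DEFAULT_HOPS {
        current_hops
    } else {
        policy_hops
    }
}
```

This makes caller intent win over the global policy without removing the auto-adjustment for callers who never touched the options.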
```diff
     metadata_filter: Option<HashMap<String, String>>,
 ) -> Result<Vec<Hit>> {
     let embedder = StaticEmbedder { embedding: query_embedding };
-    let mut options = QueryOptions::with_top_k(top_k);
+    let mut options = QueryOptions::with_top_k(top_k.saturating_mul(4).max(top_k));
     options.namespace = Some(namespace.to_string());
     options.metadata_filter = metadata_filter;
     let execution = self.inner.query("", options, &embedder)?;
```
This namespace over-fetch change is coupled with later logic that builds an `ns_map` by calling `all_memories()` and sorting the entire DB on every `ask_in_namespace()` call. That makes `ask_in_namespace` O(N log N) per query and can dominate latency for large datasets. Consider filtering namespaces per hit via `state_machine.get_memory(hit.id)` (O(k)), or exposing a lightweight id -> namespace lookup to avoid scanning all memories.
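The O(k) alternative amounts to a per-hit lookup rather than a full scan; a minimal sketch, with a `HashMap` standing in for `state_machine.get_memory` (all names illustrative):

```rust
use std::collections::HashMap;

// Illustrative memory record; the real type carries much more.
struct Memory { namespace: String }

// Keep only hits whose memory belongs to `namespace`, via direct id
// lookups: O(k) in the number of hits, not O(N log N) in the DB size.
fn filter_hits_by_namespace(
    hits: &[u64],
    memories: &HashMap<u64, Memory>, // stands in for state_machine.get_memory
    namespace: &str,
    top_k: usize,
) -> Vec<u64> {
    hits.iter()
        .copied()
        .filter(|id| memories.get(id).map_or(false, |m| m.namespace == namespace))
        .take(top_k)
        .collect()
}
```

Combined with the 4x over-fetch above, this keeps per-query cost proportional to the fetched hit count rather than the whole store.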
```diff
 fn normalized(self) -> Result<(f32, f32, f32)> {
     let total =
         self.similarity_pct as u16 + self.importance_pct as u16 + self.recency_pct as u16;
-    if total != 100 {
+    if !(98..=102).contains(&total) {
         return Err(HybridQueryError::InvalidScoreWeights {
             similarity_pct: self.similarity_pct,
             importance_pct: self.importance_pct,
             recency_pct: self.recency_pct,
         });
     }
+    // Divide by actual total so the three floats always sum to exactly 1.0.
+    let t = total as f32;
     Ok((
-        self.similarity_pct as f32 / 100.0,
-        self.importance_pct as f32 / 100.0,
-        self.recency_pct as f32 / 100.0,
+        self.similarity_pct as f32 / t,
+        self.importance_pct as f32 / t,
+        self.recency_pct as f32 / t,
     ))
```
`ScoreWeights::normalized()` now allows totals in `98..=102`, but the store/query path (via `QueryExecutor`) still appears to require an exact 100% total (it re-validates the weights independently). That means weights accepted by `HybridQueryEngine` (and these new tests) may still error in the main `CortexaDBStore` query path. It would be better to centralize weight validation/normalization (e.g., have the executor call `ScoreWeights::normalized()`) so behavior is consistent across query entry points.
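The centralization can be sketched as follows (simplified types with a `String` error; the real executor's validation is more involved): both entry points delegate to one `normalized()`, so the 98..=102 tolerance cannot drift out of sync.

```rust
#[derive(Clone, Copy)]
struct ScoreWeights { similarity_pct: u8, importance_pct: u8, recency_pct: u8 }

impl ScoreWeights {
    // Single source of truth for tolerance and normalization.
    fn normalized(self) -> Result<(f32, f32, f32), String> {
        let total =
            self.similarity_pct as u16 + self.importance_pct as u16 + self.recency_pct as u16;
        if !(98..=102).contains(&total) {
            return Err(format!("weights must sum to ~100, got {total}"));
        }
        // Divide by the actual total so the floats sum to exactly 1.0.
        let t = total as f32;
        Ok((
            self.similarity_pct as f32 / t,
            self.importance_pct as f32 / t,
            self.recency_pct as f32 / t,
        ))
    }
}

// The executor validates by delegating instead of re-checking `total == 100`.
fn executor_validate(weights: ScoreWeights) -> Result<(f32, f32, f32), String> {
    weights.normalized()
}
```

With this shape, `33/33/33` (total 99) passes everywhere and normalizes to weights summing to 1.0, while `50/50/50` is rejected everywhere.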