Conversation
…eight normalization
Contributor
There was a problem hiding this comment.
Pull request overview
This PR focuses on bringing the workspace into strict clippy -D warnings compliance and consistent formatting, while also making a few behavior/API refinements (query intent anchors, chunking edge-cases, and combined index call ergonomics).
Changes:
- Replace ad-hoc stderr printing with
login core runtime paths and adjust dependencies accordingly. - Improve/expand chunking logic and test coverage (Rust + Python wrapper edge cases).
- Refactor/extend query/index APIs (intent anchors in
QueryOptions, new builder for DB config, and grouping combined-index parameters).
Reviewed changes
Copilot reviewed 15 out of 19 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| crates/cortexadb-py/cortexadb/chunker.py | Adds legacy chunk_text() guards for empty/short inputs. |
| crates/cortexadb-core/tests/integration.rs | New end-to-end integration tests covering WAL recovery, checkpointing, deletion, graph edges, namespaces, metadata, and capacity eviction. |
| crates/cortexadb-core/src/store.rs | Switches diagnostics to log and adds intent-anchor-based option adjustments during query planning. |
| crates/cortexadb-core/src/query/mod.rs | Re-exports updated query types, including IntentAnchors. |
| crates/cortexadb-core/src/query/hybrid.rs | Adds IntentAnchors, relaxes ScoreWeights normalization, and expands tests. |
| crates/cortexadb-core/src/lib.rs | Reorders re-exports (rustfmt/clippy cleanup). |
| crates/cortexadb-core/src/index/vector.rs | Adds Default derive, minor API cleanups, and switches to log for HNSW fallback warnings. |
| crates/cortexadb-core/src/index/temporal.rs | Rustfmt-only formatting change. |
| crates/cortexadb-core/src/index/hnsw.rs | Derive defaults + minor type cleanup in result mapping. |
| crates/cortexadb-core/src/index/graph.rs | Small BFS optimization and assertion cleanup. |
| crates/cortexadb-core/src/index/combined.rs | Introduces TimeRange/GraphScope to reduce argument count in combined-index APIs. |
| crates/cortexadb-core/src/facade.rs | Removes config Default, adds CortexaDBBuilder, changes open() signature, and tweaks namespace query over-fetching. |
| crates/cortexadb-core/src/core/state_machine.rs | Minor clippy-driven simplifications (or_default). |
| crates/cortexadb-core/src/chunker.rs | Substantial fixed chunking refactor + significantly expanded tests. |
| crates/cortexadb-core/src/bin/sync_bench.rs | Formatting / minor clippy cleanups. |
| crates/cortexadb-core/src/bin/startup_bench.rs | Import ordering (rustfmt). |
| crates/cortexadb-core/src/bin/monkey_writer.rs | Switches progress output to log. |
| crates/cortexadb-core/src/bin/monkey_verify.rs | Formatting / minor clippy cleanups. |
| crates/cortexadb-core/src/bin/manual_store.rs | Formatting / minor clippy cleanups. |
| crates/cortexadb-core/benches/storage_bench.rs | Import ordering (rustfmt). |
| crates/cortexadb-core/Cargo.toml | Adds log dependency and serial_test dev-dependency. |
| clippy.toml | Removes unsupported clippy config key. |
| Cargo.lock | Updates lockfile for new deps and version bumps. |
| .gitignore | Ignores skills/ directory. |
Comments suppressed due to low confidence (7)
crates/cortexadb-core/src/store.rs:586
query_with_snapshotapplies intent adjustments by callingembedder.embed(query_text)whenoptions.intent_anchorsis set, but the query execution path embeds the query again later. For expensive embedders this adds avoidable latency. Consider moving theintent_anchorsadjustment to the executor (after the embedding is already computed) or passing a precomputed embedding through, so the query is embedded only once.
let snapshot = self.snapshot();
if let Some(anchors) = options.intent_anchors.take() {
Self::apply_intent_adjustments(&mut options, &anchors, query_text, embedder);
}
crates/cortexadb-core/src/store.rs:619
apply_intent_adjustmentsunconditionally overwritesoptions.score_weightswhen anchors are present. This differs from the existing intent auto-adjust behavior in the executor (which only adjusts when weights are still default) and can unexpectedly override caller-specified weights. Consider applying the same guard (only auto-adjust when weights are still default), or clearly documenting that providingintent_anchorsimplies weights will be overridden.
) {
options.score_weights = adj.score_weights;
if let Some(exp) = options.graph_expansion.as_mut() {
exp.hops = adj.graph_hops;
}
}
crates/cortexadb-core/src/facade.rs:550
- This test config sets
checkpoint_policy: Disabledin the struct literal and then setsconfig.checkpoint_policy = Disabledagain immediately after. The second assignment is redundant and can be removed to keep the test setup minimal.
let mut config = CortexaDBConfig {
vector_dimension: 3,
sync_policy: crate::engine::SyncPolicy::Strict,
checkpoint_policy: crate::store::CheckpointPolicy::Disabled,
capacity_policy: crate::engine::CapacityPolicy::new(None, None),
index_mode: crate::index::IndexMode::Exact,
};
config.checkpoint_policy = crate::store::CheckpointPolicy::Disabled;
crates/cortexadb-core/src/facade.rs:366
- This over-fetch implementation increases
QueryOptions.top_k(4×) to get more candidates, which also increases the amount of ranking/sorting work inside the query engine (since results are truncated tooptions.top_k). If the goal is strictly to increase recall before namespace filtering while still returningtop_k, consider over-fetching viacandidate_multiplier/ a dedicated “candidate_k” knob instead of inflatingtop_k, so downstream ranking work stays proportional to the requested result count.
let mut options = QueryOptions::with_top_k(top_k.saturating_mul(4).max(top_k));
options.namespace = Some(namespace.to_string());
options.metadata_filter = metadata_filter;
let execution = self.inner.query("", options, &embedder)?;
crates/cortexadb-core/src/chunker.rs:55
chunk_fixedtreatsoverlapas a character budget (see comment “overlap, measured in chars”), butapply_overlap(used by recursive/semantic/markdown) interpretsoverlapas a word count. This makesoverlapsemantics inconsistent across strategies and also affectschunk_recursive’s fixed fallback (it passes the sameoverlapvalue intochunk_fixed). Consider standardizing on one unit (words or chars) across all strategies, and updating the calculation accordingly to avoid surprising behavior for callers.
fn chunk_fixed(text: &str, chunk_size: usize, overlap: usize) -> Vec<ChunkResult> {
let text = text.trim();
if text.is_empty() || chunk_size == 0 {
return vec![];
}
// overlap >= chunk_size → step would be 0 → infinite loop guard
if overlap >= chunk_size {
return vec![];
}
crates/cortexadb-core/src/chunker.rs:216
- In the
!split_donefallback,chunk_recursivecallschunk_fixed(text, chunk_size, overlap)and then later callsapply_overlap(chunks, overlap)on the whole result. This effectively applies overlap twice (once insidechunk_fixed, once inapply_overlap) and with potentially different units. To keep overlap behavior predictable, consider generating non-overlapping fixed chunks here (e.g., passoverlap=0) and relying onapply_overlap, or alternatively skipapply_overlapwhen the fallback path is used.
if !split_done {
// Fall back to fixed chunking when no separator works.
let fixed_chunks = chunk_fixed(text, chunk_size, overlap);
for chunk in fixed_chunks {
chunks.push(ChunkResult { text: chunk.text, index: *index, metadata: None });
*index += 1;
}
}
crates/cortexadb-core/src/query/hybrid.rs:82
ScoreWeights::normalized()now allows totals in 98..=102 and normalizes by the actual total, but other execution paths still enforcetotal == 100(e.g.query/executor.rsre-derives weights and errors when total != 100). This makes the publicScoreWeightsbehavior inconsistent depending on which query engine is used. Consider centralizing weight validation/normalization (reuseScoreWeights::normalized()everywhere) so the tolerance change applies consistently and doesn’t reintroduce errors in the planner/executor path.
fn normalized(self) -> Result<(f32, f32, f32)> {
let total =
self.similarity_pct as u16 + self.importance_pct as u16 + self.recency_pct as u16;
if !(98..=102).contains(&total) {
return Err(HybridQueryError::InvalidScoreWeights {
similarity_pct: self.similarity_pct,
importance_pct: self.importance_pct,
recency_pct: self.recency_pct,
});
}
// Divide by actual total so the three floats always sum to exactly 1.0.
let t = total as f32;
Ok((
self.similarity_pct as f32 / t,
self.importance_pct as f32 / t,
self.recency_pct as f32 / t,
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Validation
Notes