Skip to content

fix(msfs): optimistic-lock copy to remove warm-read serialization#69

Open
ankitmaster08 wants to merge 1 commit into
mainfrom
aistore-test
Open

fix(msfs): optimistic-lock copy to remove warm-read serialization#69
ankitmaster08 wants to merge 1 commit into
mainfrom
aistore-test

Conversation

@ankitmaster08

@ankitmaster08 ankitmaster08 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Removes the warm-read lock serialization in the MSFS data cache by moving the cache-line → FUSE reply copy out of the global lock, using the data cache's existing contentGeneration counter:

  1. Under the lock: find the cache line, touch() the LRU, clamp the range, and snapshot contentGeneration (+ source offset/length, reply length).
  2. Release the lock.
  3. Copy outside the lock, straight from dataCacheLinesContent into the reply (safe — that buffer/tracker are allocated once for the mount's life and never moved).
  4. Re-acquire + re-check contentGeneration; every free()/fetch() bumps it, so on a mid-copy evict/refetch the generation mismatches and we roll back the appended bytes and retry. Common path matches and commits.

Only the cheap lookup/touch stays under the lock; byte movement now runs concurrently.

Results (OCI ws-fits, P4.8 warm — all cache-resident, 0 backend fetches)

metric baseline (copy under lock) + optimistic-copy
warm MiB/s 1,861 2,866 (+54%)
mean warm-read latency 209.8 µs ~117 µs (−44%)
cold MiB/s 1,563 1,533 (fetch-bound, unchanged)

Win concentrated on the 64 KiB families (P2/P4/P6) at 4–8 threads; flat on the FUSE-op-rate-bound 4 KiB families.

Files

  • cache.go, fission.go — optimistic copy-outside-lock serve path
  • backend_aistore.go, config.go, globals.go — supporting plumbing
  • tests: config_test.go, fission_test.go

Test plan

  • go build ./..., go test . green
  • gofmt / go vet clean
  • Generation re-check correctly retries on mid-copy eviction (covered by fission_test.go)
  • OCI ws-fits warm matrix reproduces the table above

Note: the per-inode disk cache backend (#68) builds on this; that PR currently includes this change squashed in, so whichever lands first the other should be rebased.

Summary by CodeRabbit

  • New Features

    • Optional manifest_gen_backend setting to route LIST/STAT-DIR to an alternate backend during manifest generation.
  • Bug Fixes

    • More robust read behavior under concurrent cache eviction: failed backend fetches now surface errors, evict bad cache entries, and prevent delivering corrupted data.
  • Documentation

    • Added guidance on manifest ingest throughput tuning and manifest_gen_backend usage to avoid competing manifest pipelines.
  • Tests

    • Added config validation and regression tests for manifest_gen_backend resolution and fetch-failure handling.

@ankitmaster08 ankitmaster08 requested a review from a team June 9, 2026 21:14
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9295117d-b4d5-46c5-90a5-b291326035c7

📥 Commits

Reviewing files that changed from the base of the PR and between e0f8e71 and f3f507e.

📒 Files selected for processing (10)
  • agent-learnings.md
  • multi-storage-file-system/README.md
  • multi-storage-file-system/backend_aistore.go
  • multi-storage-file-system/cache.go
  • multi-storage-file-system/config.go
  • multi-storage-file-system/config_test.go
  • multi-storage-file-system/fission.go
  • multi-storage-file-system/fission_test.go
  • multi-storage-file-system/fs.go
  • multi-storage-file-system/globals.go
✅ Files skipped from review due to trivial changes (2)
  • multi-storage-file-system/README.md
  • agent-learnings.md
🚧 Files skipped from review as they are similar to previous changes (7)
  • multi-storage-file-system/fs.go
  • multi-storage-file-system/fission_test.go
  • multi-storage-file-system/cache.go
  • multi-storage-file-system/config.go
  • multi-storage-file-system/config_test.go
  • multi-storage-file-system/backend_aistore.go
  • multi-storage-file-system/fission.go

📝 Walkthrough

Walkthrough

Adds optional AIStore manifest_gen_backend routing for LIST/STAT-DIR plus runtime wiring and docs, and introduces per-cache-line fetch-failure tracking with DoRead eviction and optimistic-copy revalidation; includes unit tests for config resolution and the DoRead failure path.

Changes

Multi-Storage File System: Manifest Routing & Cache Resilience

Layer / File(s) Summary
Manifest generation backend configuration and tests
multi-storage-file-system/globals.go, multi-storage-file-system/config.go, multi-storage-file-system/config_test.go, multi-storage-file-system/README.md, agent-learnings.md
Adds manifest_gen_backend parsing, validation, and post-parse resolution into AIStore backend config; README and agent-learnings updated; tests verify successful resolution and rejection of a missing target.
AIStore context wiring and setup guard
multi-storage-file-system/backend_aistore.go, multi-storage-file-system/fs.go
AIStore context gains optional listBackend pointer; setupAIStoreContext() refactored to initialize and wire the alternate backend context when present; processToMountList skips redundant setup when context exists.
LIST/STAT-DIR delegation to manifest backend
multi-storage-file-system/backend_aistore.go
listDirectory(), listObjects(), and statDirectory() early-delegate to the configured listBackend.context when present, otherwise use AIS implementations.
Cache line failure flag and reset
multi-storage-file-system/globals.go, multi-storage-file-system/cache.go
Adds fetchFailed boolean to cache line tracker and clears it when a line is reset/freed so readers can detect prior fetch failures.
Cache fetch failure handling
multi-storage-file-system/cache.go
fetch() marks failures (fetchFailed=true), clears contentLength/eTag, logs a warning; on success clears the flag and sets eTag; line still pushed to Clean LRU and waiters notified.
DoRead: eviction-on-failure and optimistic copy with revalidation
multi-storage-file-system/fission.go
DoRead() evicts poisoned cache lines and returns syscall.EIO when fetchFailed==true; implements optimistic memcpy without holding the global lock and revalidates via contentGeneration, truncating and retrying on mismatch to handle concurrent eviction/refetch.
Regression test for DoRead fetch-failure path
multi-storage-file-system/fission_test.go
New test injects a poisoned cache line and asserts DoRead() returns syscall.EIO and that the poisoned line is evicted from inode.cacheMap.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant DoRead
  participant CacheLine
  participant InodeCache
  Client->>DoRead: read request (offset N)
  DoRead->>CacheLine: snapshot contentGeneration
  DoRead->>InodeCache: unlock global lock
  DoRead->>CacheLine: memcpy bytes into reply buffer
  DoRead->>InodeCache: re-acquire global lock
  DoRead->>CacheLine: compare generation -> unchanged or changed
  alt unchanged
    DoRead->>Client: return bytes
  else changed
    DoRead->>DoRead: truncate reply, retry same offset
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • edmc-ss

Poem

🐰 A rabbit hops through caches deep,

Where failed fetches wake from sleep,
Manifests routed, lists now clear,
Optimistic copies persevere,
Poisoned lines shown out the keep.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: moving cache-line copy out of the global lock using optimistic locking to remove warm-read serialization. It directly matches the PR's primary objective and implementation focus.
Description check ✅ Passed The PR description provides comprehensive coverage of the change with detailed explanation of the implementation approach, performance results, affected files, and test plan. However, it does not include the required template checklist items for notifying release notes or version bumping.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch aistore-test

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread multi-storage-file-system/config.go Fixed
Comment thread multi-storage-file-system/config.go Fixed
Comment thread multi-storage-file-system/config.go Fixed
Comment thread multi-storage-file-system/fs.go Fixed
Comment thread multi-storage-file-system/fs.go Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
multi-storage-file-system/config_test.go (1)

708-809: ⚡ Quick win

Add reload-path coverage for manifest_gen_backend.

These tests cover first-load behavior only. Please add a SIGHUP-style second checkConfigFile() scenario that mutates manifest_gen_backend and asserts the intended behavior (reject or remount/apply), so routing drift can’t regress silently.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multi-storage-file-system/config_test.go` around lines 708 - 809, Add a
second "reload" step in both TestManifestGenBackendReference and
TestManifestGenBackendMissing that simulates a SIGHUP-style config reload: after
the first successful checkConfigFile(), write a modified config to
globals.configFilePath changing the AIStore.manifest_gen_backend value (for
TestManifestGenBackendReference change it to another existing backend or remove
it) and call checkConfigFile() again; then assert the new expected behavior by
re-reading globals.backendsToMount["ais"], checking
aisCfg.manifestGenBackendName and aisCfg.manifestGenBackend (for successful
remount/update) or asserting checkConfigFile() returns an error when the new
manifest_gen_backend is missing (for TestManifestGenBackendMissing). Ensure you
use the existing symbols checkConfigFile, globals.backendsToMount,
aisCfg.manifestGenBackendName and aisCfg.manifestGenBackend to locate and
validate the updated state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@multi-storage-file-system/config.go`:
- Around line 1799-1830: During SIGHUP reload immutability checks, ensure
AIStore backends' manifest_gen_backend is handled: compare the new
aisCfg.manifestGenBackendName against the old config's manifestGenBackendName
for the same backend (aisBackend.dirName) and reject the reload with an error if
it changed; alternatively, if you choose to allow changes, re-resolve
aisCfg.manifestGenBackend from config.backends (the same resolution logic that
sets aisCfg.manifestGenBackend) so the runtime pointer matches the new name.
Update the immutability check block that inspects AIStore backends to reference
manifestGenBackendName and either fail the reload on mismatch or refresh the
resolved manifestGenBackend pointer accordingly.

In `@multi-storage-file-system/README.md`:
- Line 149: Update the README entry for manifest_gen_backend to soften the
requirement from "must" to "should" to match runtime behavior: change the phrase
"must match bucket/prefix" to "should match bucket/prefix" (or equivalent
wording), and optionally note that checkConfigFile() currently only warns on
mismatches rather than failing; reference the manifest_gen_backend field in the
README and the checkConfigFile() function so reviewers can see the doc now
reflects the code behavior (or choose to instead make checkConfigFile() enforce
an error if you prefer strict enforcement).

---

Nitpick comments:
In `@multi-storage-file-system/config_test.go`:
- Around line 708-809: Add a second "reload" step in both
TestManifestGenBackendReference and TestManifestGenBackendMissing that simulates
a SIGHUP-style config reload: after the first successful checkConfigFile(),
write a modified config to globals.configFilePath changing the
AIStore.manifest_gen_backend value (for TestManifestGenBackendReference change
it to another existing backend or remove it) and call checkConfigFile() again;
then assert the new expected behavior by re-reading
globals.backendsToMount["ais"], checking aisCfg.manifestGenBackendName and
aisCfg.manifestGenBackend (for successful remount/update) or asserting
checkConfigFile() returns an error when the new manifest_gen_backend is missing
(for TestManifestGenBackendMissing). Ensure you use the existing symbols
checkConfigFile, globals.backendsToMount, aisCfg.manifestGenBackendName and
aisCfg.manifestGenBackend to locate and validate the updated state.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4137d3b2-a9f3-4360-ae4e-ad86939eeeb8

📥 Commits

Reviewing files that changed from the base of the PR and between eec2f3b and c85ce61.

📒 Files selected for processing (10)
  • agent-learnings.md
  • multi-storage-file-system/README.md
  • multi-storage-file-system/backend_aistore.go
  • multi-storage-file-system/cache.go
  • multi-storage-file-system/config.go
  • multi-storage-file-system/config_test.go
  • multi-storage-file-system/fission.go
  • multi-storage-file-system/fission_test.go
  • multi-storage-file-system/fs.go
  • multi-storage-file-system/globals.go

Comment on lines +1799 to +1830
// Resolve AIStore `manifest_gen_backend` references now that all backends are
// parsed. Each reference must name another, non-AIStore backend in this config;
// that backend serves LIST/STAT-DIR while reads continue via AIS.
for _, aisBackend := range config.backends {
if aisBackend.backendType != "AIStore" {
continue
}
aisCfg := aisBackend.backendTypeSpecifics.(*backendConfigAIStoreStruct)
if aisCfg.manifestGenBackendName == "" {
continue
}
if aisCfg.manifestGenBackendName == aisBackend.dirName {
err = fmt.Errorf("AIStore backend %q manifest_gen_backend cannot reference itself", aisBackend.dirName)
return
}
srcBackend, found := config.backends[aisCfg.manifestGenBackendName]
if !found {
err = fmt.Errorf("AIStore backend %q manifest_gen_backend %q not found among configured backends", aisBackend.dirName, aisCfg.manifestGenBackendName)
return
}
if srcBackend.backendType == "AIStore" {
err = fmt.Errorf("AIStore backend %q manifest_gen_backend %q must not be an AIStore backend", aisBackend.dirName, aisCfg.manifestGenBackendName)
return
}
if srcBackend.bucketContainerName != aisBackend.bucketContainerName || srcBackend.prefix != aisBackend.prefix {
globals.logger.Printf("[WARN] AIStore backend %q manifest_gen_backend %q targets bucket/prefix %q/%q which differs from %q/%q; listing and reads may not align",
aisBackend.dirName, aisCfg.manifestGenBackendName,
srcBackend.bucketContainerName, srcBackend.prefix,
aisBackend.bucketContainerName, aisBackend.prefix)
}
aisCfg.manifestGenBackend = srcBackend
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle manifest_gen_backend changes on reload (or reject them explicitly).

Line 1799 introduces a new resolved routing field, but the SIGHUP immutability checks for AIStore backends (around Line 2173+) do not compare manifestGenBackendName. A reload that changes manifest_gen_backend can succeed yet keep old routing/context behavior, causing silent config drift.

🔧 Suggested fix
case "AIStore":
+   if backendAsStructOld.backendTypeSpecifics.(*backendConfigAIStoreStruct).manifestGenBackendName !=
+      backendAsStructNew.backendTypeSpecifics.(*backendConfigAIStoreStruct).manifestGenBackendName {
+       err = fmt.Errorf("cannot change AIStore.manifest_gen_backend in backends[\"%s\"]", dirName)
+       return
+   }
+
    if backendAsStructOld.backendTypeSpecifics.(*backendConfigAIStoreStruct).endpoint != backendAsStructNew.backendTypeSpecifics.(*backendConfigAIStoreStruct).endpoint {
        err = fmt.Errorf("cannot change AIStore.endpoint in backends[\"%s\"]", dirName)
        return
    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multi-storage-file-system/config.go` around lines 1799 - 1830, During SIGHUP
reload immutability checks, ensure AIStore backends' manifest_gen_backend is
handled: compare the new aisCfg.manifestGenBackendName against the old config's
manifestGenBackendName for the same backend (aisBackend.dirName) and reject the
reload with an error if it changed; alternatively, if you choose to allow
changes, re-resolve aisCfg.manifestGenBackend from config.backends (the same
resolution logic that sets aisCfg.manifestGenBackend) so the runtime pointer
matches the new name. Update the immutability check block that inspects AIStore
backends to reference manifestGenBackendName and either fail the reload on
mismatch or refresh the resolved manifestGenBackend pointer accordingly.

| authnTokenFile | string | "${AIS_AUTHN_TOKEN_FILE:=~/.config/ais/cli/auth.token}" | If != "", specifies location of AUTHN Token file |
| provider | string | "s3" | IF != "ais", specifies the backend of which bucket contents are cached |
| timeout | decimal milliseconds | 30000 | Limit on allowed duration of requests (including retries) |
| manifest_gen_backend | string | "" | IF != "", `dir_name` of another (non-AIStore) backend used for LIST/STAT-DIR (manifest generation, readdir); object reads still go via AIS. Lets listing hit the underlying store (e.g. S3) directly while reads benefit from AIS caching. The referenced backend must target the same bucket/prefix. |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Documentation overstates bucket/prefix enforcement.

Line 149 says the referenced backend must match bucket/prefix, but checkConfigFile() currently only warns on mismatch and still proceeds. Please change wording to “should” (or enforce as an error in code).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multi-storage-file-system/README.md` at line 149, Update the README entry for
manifest_gen_backend to soften the requirement from "must" to "should" to match
runtime behavior: change the phrase "must match bucket/prefix" to "should match
bucket/prefix" (or equivalent wording), and optionally note that
checkConfigFile() currently only warns on mismatches rather than failing;
reference the manifest_gen_backend field in the README and the checkConfigFile()
function so reviewers can see the doc now reflects the code behavior (or choose
to instead make checkConfigFile() enforce an error if you prefer strict
enforcement).

Comment thread multi-storage-file-system/config.go Fixed
Comment thread multi-storage-file-system/fs.go Fixed
Moves the data-cache warm-read copy (cache line -> FUSE reply) out of the
global lock using the data cache's contentGeneration counter: snapshot the
generation + range under the lock, copy unlocked, then re-acquire and re-check
the generation, rolling back the appended bytes and retrying on mismatch. This
removes the single-lock serialization that capped warm-read throughput as
application threads scaled.

Result (OCI ws-fits, P4.8 warm): 1,861 -> 2,866 MiB/s (+54%); mean warm-read
latency ~210 -> ~117 us. Cold path and default behavior unchanged; warm
backend fetches remain 0.

- cache.go/fission.go: optimistic copy-outside-lock serve path
- backend_aistore.go/config.go/globals.go: supporting plumbing
- Do not log backend-derived values in the backend setup-failure and
  manifest_gen mismatch warnings (CodeQL go/clear-text-logging): the secret
  taints those structures, so the warnings log a fixed message only.
- tests: config_test.go, fission_test.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants