KV Cache Contamination in Parallel Mode with State Loading #258

@Prasad-178

Description

Hi,
When using parallel completions with load_state_path for prompt caching, the system prompts from concurrent completions contaminate each other, producing both incorrect output formats and incorrect content.
This is most likely caused by the KV cache being shared between slots, and I'm not sure whether there is a workaround for this.

Environment

  • llama.rn version: 0.8.3
  • Model: Qwen3-0.6B-Q4_K_M.gguf
  • enableOpenCL: false in app.config.ts
  • n_parallel: 4
  • kv_unified and flash_attn_type not set during initLlama
  • n_ctx: 8192

Reproduction Steps

1. Warm System Prompts with Different Formats

// Warm state for JSON format completion
const jsonSystemPrompt = {
  role: 'system',
  content: 'You are a helpful assistant. Always respond in valid JSON format with a "result" field.'
};

await context.completion({
  messages: [jsonSystemPrompt],
  save_state_path: '/path/to/json_state.bin',
  n_predict: 1, // minimal tokens
});

// Warm state for plain text completion
const textSystemPrompt = {
  role: 'system',
  content: 'You are a helpful assistant. Always respond in plain text, never use JSON or code blocks.'
};

await context.completion({
  messages: [textSystemPrompt],
  save_state_path: '/path/to/text_state.bin',
  n_predict: 1,
});

2. Run Parallel Completions with Loaded States

// Enable parallel mode
await context.parallel.enable({ n_parallel: 4 });

// Request A: Should return JSON
const jsonPromise = context.parallel.completion({
  messages: [
    jsonSystemPrompt,
    { role: 'user', content: 'Analyze this: "The project deadline is next week"' }
  ],
  load_state_path: '/path/to/json_state.bin',
  temperature: 0.7,
  n_predict: 256,
});

// Request B: Should return plain text (run concurrently)
const textPromise = context.parallel.completion({
  messages: [
    textSystemPrompt,
    { role: 'user', content: 'Summarize this: "The project deadline is next week"' }
  ],
  load_state_path: '/path/to/text_state.bin',
  temperature: 0.7,
  n_predict: 256,
});

const [jsonResult, textResult] = await Promise.all([jsonPromise, textPromise]);

Observed Contamination

Expected:

// jsonResult.text
{ "result": "Analysis: The statement indicates..." }

// textResult.text
"The project has a deadline scheduled for next week..."

Actual:

// jsonResult.text - sometimes gets plain text instead of JSON!
"Analysis: The statement indicates a time constraint..."

// textResult.text - sometimes gets JSON instead of plain text!
// The answer itself is also completely wrong, probably due to the shared KV cache
{ "result": "Analysis: The statement indicates a time constraint...." }

Root Cause Analysis

These are the resources I've looked at: the docs, #230, #221, and a ggerganov comment.
So far I've arrived at the following understanding:

  • State Loading: load_state_path loads a pre-computed KV cache into the shared buffer
  • Slot Allocation: Multiple parallel requests are assigned to different slots (0, 1, 2, 3)
  • Shared KV Cache: All slots read/write to the same unified KV cache buffer
  • Contamination: When slot 0 loads a state and slot 1 loads a different state concurrently, the instructions cross-contaminate because they're operating on overlapping KV cache regions

As I understand it, the kv_unified parameter is supposed to control whether slots share a unified KV cache buffer, but setting it didn't solve the issue.

Overall: is there a way to combine parallel decoding with state loading while keeping a separate KV cache per slot? Is this even possible, or is there a workaround that can enable this functionality?
Thanks
