KV Cache Contamination in Parallel Mode with State Loading #258

@Prasad-178

Description

Hi,
When using parallel completions with load_state_path for prompt caching, the system prompts from concurrent completions contaminate each other, producing both incorrect output formats and incorrect content.
This is most likely caused by the KV cache being shared between slots, and I'm not sure whether there is a workaround for this.

Environment

  • llama.rn version: 0.8.3
  • Model: Qwen3-0.6B-Q4_K_M.gguf
  • enableOpenCL: false in app.config.ts
  • n_parallel: 4
  • kv_unified and flash_attn_type not set during initLlama
  • n_ctx: 8192

Reproduction Steps

1. Warm System Prompts with Different Formats

// Warm state for JSON format completion
const jsonSystemPrompt = {
  role: 'system',
  content: 'You are a helpful assistant. Always respond in valid JSON format with a "result" field.'
};

await context.completion({
  messages: [jsonSystemPrompt],
  save_state_path: '/path/to/json_state.bin',
  n_predict: 1, // minimal tokens
});

// Warm state for plain text completion
const textSystemPrompt = {
  role: 'system',
  content: 'You are a helpful assistant. Always respond in plain text, never use JSON or code blocks.'
};

await context.completion({
  messages: [textSystemPrompt],
  save_state_path: '/path/to/text_state.bin',
  n_predict: 1,
});

2. Run Parallel Completions with Loaded States

// Enable parallel mode
await context.parallel.enable({ n_parallel: 4 });

// Request A: Should return JSON
const jsonPromise = context.parallel.completion({
  messages: [
    jsonSystemPrompt,
    { role: 'user', content: 'Analyze this: "The project deadline is next week"' }
  ],
  load_state_path: '/path/to/json_state.bin',
  temperature: 0.7,
  n_predict: 256,
});

// Request B: Should return plain text (run concurrently)
const textPromise = context.parallel.completion({
  messages: [
    textSystemPrompt,
    { role: 'user', content: 'Summarize this: "The project deadline is next week"' }
  ],
  load_state_path: '/path/to/text_state.bin',
  temperature: 0.7,
  n_predict: 256,
});

const [jsonResult, textResult] = await Promise.all([jsonPromise, textPromise]);

Observed Contamination

Expected:

// jsonResult.text
{ "result": "Analysis: The statement indicates..." }

// textResult.text
"The project has a deadline scheduled for next week..."

Actual:

// jsonResult.text - sometimes gets plain text instead of JSON!
"Analysis: The statement indicates a time constraint..."

// textResult.text - sometimes gets JSON instead of plain text!
// The answer itself is also completely wrong, probably due to the shared KV cache
{ "result": "Analysis: The statement indicates a time constraint...." }

Root Cause Analysis

These are the resources I've looked at: the docs, #230, #221, and a ggerganov comment.
So far I've arrived at the following understanding:

  • State Loading: load_state_path loads a pre-computed KV cache into the shared buffer
  • Slot Allocation: Multiple parallel requests are assigned to different slots (0, 1, 2, 3)
  • Shared KV Cache: All slots read/write to the same unified KV cache buffer
  • Contamination: When slot 0 loads a state and slot 1 loads a different state concurrently, the instructions cross-contaminate because they're operating on overlapping KV cache regions

As I understand it, the kv_unified parameter is supposed to control whether slots share a unified KV cache buffer, but setting it didn't solve the issue.

Overall: is there a way to combine parallel decoding with state loading while keeping a separate KV cache per slot? Is this even possible, or is there a workaround that can enable this functionality?
Thanks
