Performance: prompt building duplicates full few-shot preamble per prompt, causing O(batch_length × preamble_size) peak memory #446

@dan504512

Description

QAPromptGenerator.render() concatenates description + examples + chunk_text into a single string per prompt. When few-shot examples are large, this duplicates the shared preamble (description + formatted examples) into every prompt string independently:

# prompting.py:115-138
def render(self, question, additional_context=None):
    prompt_lines = [f"{self.template.description}\n"]
    if self.template.examples:
        prompt_lines.append(self.examples_heading)
        for ex in self.template.examples:
            prompt_lines.append(self.format_example_as_text(ex))
    prompt_lines.append(f"{self.question_prefix}{question}")
    prompt_lines.append(self.answer_prefix)
    return "\n".join(prompt_lines)

Then in _annotate_documents_single_pass (annotation.py:370-375), all prompts for a batch are materialized simultaneously:

prompts = [
    prompt_builder.build_prompt(chunk.chunk_text, chunk.document_id, chunk.additional_context)
    for chunk in batch
]

For large example sets, this creates extreme memory pressure:

  • Example preamble: ~300,000 characters formatted (text + extraction outputs)
  • Python string encoding: if the examples contain any character above U+00FF (CJK text, curly quotes, emoji), CPython stores the entire string at 2 bytes/char (UCS-2) or 4 bytes/char (UCS-4), multiplying memory 2-4× (see the snippet after this list)
  • Observed per-prompt size: ~640 KB
  • At batch_length=10000: 10,000 × 640 KB = 6.4 GB per batch just for prompt strings
  • Due to delayed cyclic GC between batches: two batches' prompts can overlap at peak, reaching 12.8 GB
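The 2-4× multiplier comes from CPython's flexible string representation: the widest code point in a string determines the per-character width for the whole string. A standalone illustration (sizes are approximate and vary slightly across CPython versions; this is not code from this repository):

import sys

ascii_text  = "a" * 100_000   # all code points ≤ U+007F → 1 byte/char
bmp_text    = "中" * 100_000   # code points up to U+FFFF → 2 bytes/char (UCS-2)
astral_text = "𝄞" * 100_000   # code points above U+FFFF → 4 bytes/char (UCS-4)

print(sys.getsizeof(ascii_text))   # ~100 KB
print(sys.getsizeof(bmp_text))     # ~200 KB
print(sys.getsizeof(astral_text))  # ~400 KB

A single emoji or CJK character anywhere in the formatted examples is enough to double or quadruple the footprint of every prompt string that embeds the preamble.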

Profiling data

memray flamegraph from a 1000-document extraction run (batch_length=10000, ~170 KB raw example text):

Call site                    Peak memory   Allocations
render (prompt strings)      12,820 MB     20,000
Tokenization                 509 MB        915
Batch/chunk infrastructure   512 MB        922
Total peak                   14,400 MB

Prompt string allocation accounts for 89% of peak memory usage.

Proposed optimization: use multi-part contents in batch requests

The Gemini API's contents[0].parts field accepts multiple parts. Instead of concatenating the preamble into each prompt string, the shared preamble can be stored once and referenced by all requests:

# Instead of:
"contents": [{"role": "user", "parts": [{"text": "preamble...chunk_text"}]}]

# Use:
"contents": [{"role": "user", "parts": [
    {"text": preamble_shared},   # one instance, shared by reference across all requests
    {"text": chunk_text},         # unique per prompt
]}]

This is semantically equivalent from Gemini's perspective (parts within a single turn are concatenated). Python's reference semantics mean all 10,000 requests share the same preamble_shared string object in memory.
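A minimal sketch of how the batch request bodies could be assembled under this scheme (the function name and exact dict layout are illustrative, not the current gemini_batch.py code):

def build_batch_requests(preamble_shared: str, chunks) -> list[dict]:
    # Every request dict holds a reference to the same preamble string
    # object; only chunk_text differs per request.
    return [
        {
            "contents": [{
                "role": "user",
                "parts": [
                    {"text": preamble_shared},   # shared by reference, allocated once
                    {"text": chunk.chunk_text},  # unique per chunk
                ],
            }]
        }
        for chunk in chunks
    ]

Note that the saving applies to the in-memory request list: if the requests are later serialized to a JSONL batch file, the preamble bytes are still written once per line, but the file can be written incrementally.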

Expected impact:

  • Per-batch prompt memory: 10,000 × 640 KB → 1 × 640 KB + 10,000 × ~1 KB = ~10 MB (640× reduction)
  • No change to model behavior (same content presented to the model)
  • No change to Vertex billing (same token count per request)

Implementation scope

  1. PromptBuilder.build_prompt returns a structured value (e.g., PromptParts(preamble, chunk_text)) instead of a concatenated string (see the sketch after this list)
  2. _build_request in gemini_batch.py constructs multi-part contents from the structured value
  3. Non-batch providers (_process_single_prompt) can still concatenate for the existing generate_content call, or also use multi-part contents
  4. Cache key computation needs to hash the parts equivalently to the old concatenated form (or include a migration/versioning scheme)
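A rough sketch of items 1, 2, and 4 combined (PromptParts, its fields, and the separator used for the cache key are assumptions; the real PromptBuilder and _build_request signatures may differ):

from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class PromptParts:
    preamble: str     # description + formatted examples, shared across a batch
    chunk_text: str   # per-chunk content

    def as_text(self) -> str:
        # Concatenated form for non-batch providers and for cache keys;
        # must reproduce exactly what render() produced before the change.
        return f"{self.preamble}\n{self.chunk_text}"

def _build_request(parts: PromptParts) -> dict:
    # Multi-part contents: the preamble is shared by reference across requests.
    return {
        "contents": [{
            "role": "user",
            "parts": [{"text": parts.preamble}, {"text": parts.chunk_text}],
        }]
    }

def cache_key(parts: PromptParts) -> str:
    # Hashing the concatenated form keeps keys compatible with entries
    # written before the change, provided the separator matches render().
    return hashlib.sha256(parts.as_text().encode("utf-8")).hexdigest()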

Alternatives considered

  • Context caching (Vertex AI cached content): also eliminates per-prompt duplication AND reduces token billing. Larger implementation surface (cache lifecycle management, TTL, minimum size thresholds). Could be a follow-up.
  • systemInstruction for examples: moves examples to a different API field. Changes prompt semantics — model may respond differently to examples in system vs user turn.
  • Streamed prompt construction: build and serialize prompts one-at-a-time into the JSONL file without materializing the full list. Reduces peak memory but doesn't reduce total allocation volume. More invasive refactor across annotation.py → model.infer → gemini_batch.
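For the streamed alternative, a minimal sketch of the idea (file layout and request schema are illustrative, not the actual batch JSONL format):

import json

def write_requests_jsonl(path: str, prompt_builder, batch) -> None:
    # Build and serialize one request per line; at no point is the full
    # list of prompt strings held in memory.
    with open(path, "w", encoding="utf-8") as f:
        for chunk in batch:
            prompt = prompt_builder.build_prompt(
                chunk.chunk_text, chunk.document_id, chunk.additional_context
            )
            request = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
            f.write(json.dumps(request, ensure_ascii=False) + "\n")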

Environment
