Description
`QAPromptGenerator.render()` concatenates `description + examples + chunk_text` into a single string per prompt. When few-shot examples are large, the shared preamble (description + formatted examples) is duplicated into every prompt string independently:
```python
# prompting.py:115-138
def render(self, question, additional_context=None):
    prompt_lines = [f"{self.template.description}\n"]
    if self.template.examples:
        prompt_lines.append(self.examples_heading)
        for ex in self.template.examples:
            prompt_lines.append(self.format_example_as_text(ex))
    prompt_lines.append(f"{self.question_prefix}{question}")
    prompt_lines.append(self.answer_prefix)
    return "\n".join(prompt_lines)
```
Then in `_annotate_documents_single_pass` (annotation.py:370-375), all prompts for a batch are materialized simultaneously:
```python
prompts = [
    prompt_builder.build_prompt(chunk.chunk_text, chunk.document_id, chunk.additional_context)
    for chunk in batch
]
```
For large example sets, this creates extreme memory pressure:
- Example preamble: ~300,000 characters once formatted (text + extraction outputs)
- Python string encoding: if the examples contain any character above U+00FF, CPython stores the whole string as UCS-2 (2 bytes/char) or, above U+FFFF, UCS-4 (4 bytes/char), multiplying memory 2-4×
- Observed per-prompt size: ~640 KB
- At `batch_length=10000`: 10,000 × 640 KB = 6.4 GB per batch just for prompt strings
- Due to delayed cyclic GC between batches, two batches' prompts can overlap at peak, reaching 12.8 GB
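The 2-4× width multiplier comes from CPython's flexible string representation (PEP 393) and is easy to verify in isolation. A standalone illustration, not code from the repo:

```python
import sys

ascii_text = "a" * 100_000
ucs2_text = "a" * 99_999 + "\u20ac"      # one char above U+00FF widens the whole string to 2 bytes/char
ucs4_text = "a" * 99_999 + "\U0001F600"  # one char above U+FFFF forces 4 bytes/char

print(sys.getsizeof(ascii_text))  # ~100,049 bytes (1 byte/char + header)
print(sys.getsizeof(ucs2_text))   # ~200,074 bytes
print(sys.getsizeof(ucs4_text))   # ~400,076 bytes
```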
Profiling data
memray flamegraph from a 1000-document extraction run (`batch_length=10000`, ~170 KB raw example text):
| Call site | Peak memory | Allocations |
|---|---|---|
| `render` (prompt strings) | 12,820 MB | 20,000 |
| Tokenization | 509 MB | 915 |
| Batch/chunk infrastructure | 512 MB | 922 |
| **Total peak** | 14,400 MB | |
Prompt string allocation accounts for 89% of peak memory usage.
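If needed, the profile can be re-captured with memray's tracker API; the entry point and output path below are placeholders, not names from the repo:

```python
from memray import Tracker

def run_extraction(batch_length: int) -> None:
    """Placeholder for the actual pipeline entry point."""
    ...

# Record every allocation in the run into a memray capture file.
with Tracker("extraction_run.bin"):
    run_extraction(batch_length=10_000)

# Then render the flamegraph:
#   memray flamegraph extraction_run.bin
```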
Proposed optimization: use multi-part `contents` in batch requests
The Gemini API's `contents[0].parts` field accepts multiple parts. Instead of concatenating the preamble into each prompt string, the shared preamble can be stored once and referenced by all requests:
```python
# Instead of:
"contents": [{"role": "user", "parts": [{"text": "preamble...chunk_text"}]}]

# Use:
"contents": [{"role": "user", "parts": [
    {"text": preamble_shared},  # one instance, shared by reference across all requests
    {"text": chunk_text},       # unique per prompt
]}]
```
This is semantically equivalent from Gemini's perspective (parts within a single turn are concatenated). Python's reference semantics mean all 10,000 requests share the same `preamble_shared` string object in memory.
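A minimal, standalone demonstration of the sharing with stand-in strings (sizes chosen to mimic the observed ~640 KB preamble):

```python
import sys

preamble_shared = "description + formatted examples " * 20_000  # ~650 KB stand-in preamble
chunk_texts = [f"chunk text {i}" for i in range(10_000)]         # stand-in per-chunk text

requests = [
    {"contents": [{"role": "user", "parts": [
        {"text": preamble_shared},  # the same str object in every request, never copied
        {"text": text},             # unique per prompt
    ]}]}
    for text in chunk_texts
]

# Every request's first part references one and the same string object.
assert all(r["contents"][0]["parts"][0]["text"] is preamble_shared for r in requests)
print(f"preamble stored once: {sys.getsizeof(preamble_shared) / 1024:.0f} KB")
```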
Expected impact:
- Per-batch prompt memory: 10,000 × 640 KB → 1 × 640 KB + 10,000 × ~1 KB = ~10 MB (640× reduction)
- No change to model behavior (same content presented to the model)
- No change to Vertex billing (same token count per request)
Implementation scope
- `PromptBuilder.build_prompt` returns a structured value (e.g., `PromptParts(preamble, chunk_text)`) instead of a concatenated string
- `_build_request` in `gemini_batch.py` constructs multi-part `contents` from the structured value
- Non-batch providers (`_process_single_prompt`) can still concatenate for the existing `generate_content` call, or also use multi-part `contents`
- Cache key computation needs to hash the parts equivalently to the old concatenated form (or include a migration/versioning scheme); see the sketch after this list
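A sketch of the structured value and request construction under these assumptions. `PromptParts` is the hypothetical name from the first bullet, and `cache_key` assumes the parts join back to exactly what `render()` used to emit:

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class PromptParts:
    preamble: str    # shared per batch: description + formatted examples
    chunk_text: str  # unique per prompt

def _build_request(parts: PromptParts) -> dict:
    # Multi-part contents: the preamble string is referenced, not copied.
    return {"contents": [{"role": "user", "parts": [
        {"text": parts.preamble},
        {"text": parts.chunk_text},
    ]}]}

def cache_key(parts: PromptParts) -> str:
    # Hash the logical concatenation so keys match the old single-string form.
    return hashlib.sha256(
        (parts.preamble + parts.chunk_text).encode("utf-8")
    ).hexdigest()
```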
Alternatives considered
- Context caching (Vertex AI cached content): also eliminates per-prompt duplication and reduces token billing. Larger implementation surface (cache lifecycle management, TTL, minimum size thresholds). Could be a follow-up.
- `systemInstruction` for examples: moves the examples to a different API field. Changes prompt semantics: the model may respond differently to examples in the system turn vs. the user turn.
- Streamed prompt construction: build and serialize prompts one at a time into the JSONL file without materializing the full list (sketched below). Reduces peak memory but doesn't reduce total allocation volume. More invasive refactor across `annotation.py` → `model.infer` → `gemini_batch`.
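For comparison, the streamed variant would look roughly like this (hypothetical helper names); it bounds residency at one prompt but still allocates every string once:

```python
import json
from typing import Callable, Iterable

def write_requests_jsonl(path: str, chunks: Iterable, build_request: Callable[..., dict]) -> None:
    # Serialize one request at a time so only a single prompt is alive at once.
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(build_request(chunk)) + "\n")
```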
Environment
- `main` at a281692 (post-1.2.1, includes the #442 LCS alignment change: "perf: replace difflib fuzzy aligner with O(n*m^2) LCS DP")
- `batch_length=10000`