Speech segments bypass cache — VEED videos re-render every run #161

@SecurityQQ

Description

Problem

Running speech-segments.tsx or speech-segments-voiceover.tsx multiple times with the same text/voice/model always re-generates VEED videos from scratch. The cache never hits for the Video elements that depend on speech segments.

Root Cause

Three levels of non-determinism prevent cache hits:

1. Speech generation itself is NOT cached (root cause)

experimental_generateSpeech from the Vercel ai SDK does not accept a cacheKey parameter — it's silently ignored. Every run calls ElevenLabs from scratch. Neural TTS is inherently non-deterministic: same text/voice/model produces different audio bytes each time. This means different alignment timings → different segment boundaries → different sliced audio bytes on every run.

2. Segment audio bytes are embedded in the Video cache key

computeCacheKey for a Video element serializes the entire prompt prop via serializeValue(). When prompt.audio is a ResolvedElement (segment), serializeValue recursively walks Object.entries() and reaches meta.file._data — the raw Uint8Array — which gets base64-encoded into the cache key string. Any byte-level difference in the audio produces a different key.
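A minimal illustration of the failure mode. This `serializeValue` is a simplified stand-in for the real implementation, not the project's code, but it shows how a recursive `Object.entries()` walk that base64-encodes any `Uint8Array` it reaches makes the key sensitive to every byte of the audio:

```typescript
// Simplified stand-in for serializeValue: recursively walks objects and
// base64-encodes any Uint8Array it reaches, mirroring how the real
// implementation ends up embedding meta.file._data in the cache key.
function serializeValue(value: unknown): string {
  if (value instanceof Uint8Array) {
    return Buffer.from(value).toString("base64");
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value as Record<string, unknown>)
      .map(([k, v]) => `${k}:${serializeValue(v)}`)
      .join(",");
  }
  return JSON.stringify(value);
}

// Two TTS runs of the same text: a single differing byte changes the key.
const runA = { prompt: { audio: { meta: { file: { _data: new Uint8Array([1, 2, 3]) } } } } };
const runB = { prompt: { audio: { meta: { file: { _data: new Uint8Array([1, 2, 4]) } } } } };

console.log(serializeValue(runA) === serializeValue(runB)); // false -> cache miss
```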

3. Non-deterministic floats in the cache key

The serialized segment also includes meta.duration (ffprobe float), meta.words (ElevenLabs timing floats), and start/end properties — all of which change per API call.

Suggested Fix

Fix A — Cache speech generation. Wrap the ElevenLabs /with-timestamps fetch with withCache, keyed on text + voice + model. This prevents redundant API calls and produces stable audio bytes on cache hit.
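A sketch of the Fix A shape. The names here (`generateSpeechCached`, the in-memory `Map`, the `generate` callback standing in for the ElevenLabs `/with-timestamps` fetch) are hypothetical, not the project's actual `withCache` API:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: key speech generation on the semantic inputs
// (text + voice + model), never on the resulting audio bytes.
type SpeechResult = { audio: Uint8Array; alignment: unknown };

const speechCache = new Map<string, SpeechResult>(); // stand-in for the real disk cache

function speechCacheKey(text: string, voice: string, model: string): string {
  return createHash("sha256").update(JSON.stringify([text, voice, model])).digest("hex");
}

async function generateSpeechCached(
  text: string,
  voice: string,
  model: string,
  generate: () => Promise<SpeechResult>, // wraps the ElevenLabs /with-timestamps fetch
): Promise<SpeechResult> {
  const key = speechCacheKey(text, voice, model);
  const hit = speechCache.get(key);
  if (hit) return hit; // cache hit: bit-identical bytes, stable downstream keys
  const result = await generate();
  speechCache.set(key, result);
  return result;
}
```

On a hit this returns the exact bytes of the first run, so every downstream value derived from them (alignment timings, segment boundaries, sliced audio) is stable as well.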

Fix B — Exclude audio bytes from Video cache key. When serializeValue encounters a ResolvedElement inside a prop (e.g., prompt.audio), it should use the element's semantic identity (type + props + children text, via computeCacheKey) rather than serializing the physical bytes from meta.file._data. This makes the Video cache key depend on what was said, not the exact bytes of the audio.
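A sketch of the Fix B idea. The `ResolvedElement` shape, the `__resolved` marker, and this `computeCacheKey` are illustrative stand-ins for the real types, but the short-circuit is the point: detect a resolved element before recursing and key on its semantic identity instead of its payload:

```typescript
// Illustrative stand-in: a resolved element carries both semantic identity
// (type/props) and a physical payload (meta.file._data plus timing floats).
type ResolvedElement = {
  __resolved: true;
  type: string;
  props: Record<string, unknown>;
  meta: { file: { _data: Uint8Array }; duration: number };
};

function isResolvedElement(v: unknown): v is ResolvedElement {
  return typeof v === "object" && v !== null && (v as { __resolved?: unknown }).__resolved === true;
}

// Semantic identity: type + props, ignoring bytes and per-run floats.
function computeCacheKey(el: ResolvedElement): string {
  return `${el.type}:${JSON.stringify(el.props)}`;
}

function serializeValue(value: unknown): string {
  // Key change: short-circuit on resolved elements rather than walking into
  // meta.file._data and base64-encoding non-deterministic audio bytes.
  if (isResolvedElement(value)) return computeCacheKey(value);
  if (value instanceof Uint8Array) return Buffer.from(value).toString("base64");
  if (value !== null && typeof value === "object") {
    return Object.entries(value as Record<string, unknown>)
      .map(([k, v]) => `${k}:${serializeValue(v)}`)
      .join(",");
  }
  return JSON.stringify(value);
}
```

With this, two segments that say the same thing with the same voice serialize identically even when their audio bytes and `meta.duration` differ between runs.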

Fix A alone is sufficient if the cache produces bit-identical results (which it should — returning the same cached bytes). Fix B is the robust solution for cases where a resolved element is embedded in another element's props.

Affected Files

  • src/react/resolve.ts — resolveSpeechElement() needs to wrap the speech generation in withCache
  • src/react/renderers/utils.ts — serializeValue() should detect ResolvedElement/VargElement and use computeCacheKey instead of the recursive Object.entries() walk
  • src/ai-sdk/cache.ts — may need a speech-specific cache wrapper
  • src/react/renderers/cache.test.ts — missing test for props containing resolved elements

Repro

# Run once — generates speech + 2 VEED videos (~$0.50 + wait time)
bun run src/react/examples/async/speech-segments.tsx

# Run again — speech is re-generated, VEED videos are re-rendered (cache miss)
bun run src/react/examples/async/speech-segments.tsx

Expected: second run should hit cache for both Speech and Video elements.
