Problem
Running speech-segments.tsx or speech-segments-voiceover.tsx multiple times with the same text/voice/model always re-generates VEED videos from scratch. The cache never hits for the Video elements that depend on speech segments.
Root Cause
Three levels of non-determinism prevent cache hits:
1. Speech generation itself is NOT cached (root cause)
experimental_generateSpeech from the Vercel AI SDK (the ai package) does not accept a cacheKey parameter; if one is passed it is silently ignored, so every run calls ElevenLabs from scratch. Neural TTS is inherently non-deterministic: the same text/voice/model produces different audio bytes each time. This means different alignment timings → different segment boundaries → different sliced audio bytes on every run.
2. Segment audio bytes are embedded in the Video cache key
computeCacheKey for a Video element serializes the entire prompt prop via serializeValue(). When prompt.audio is a ResolvedElement (segment), serializeValue recursively walks Object.entries() and reaches meta.file._data — the raw Uint8Array — which gets base64-encoded into the cache key string. Any byte-level difference in the audio produces a different key.
3. Non-deterministic floats in the cache key
The serialized segment also includes meta.duration (ffprobe float), meta.words (ElevenLabs timing floats), and start/end properties — all of which change per API call.
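A minimal sketch of the failure mode described above. serializeValue and cacheKey here are simplified stand-ins for the real implementations in src/react/renderers/utils.ts, not the actual code:

```typescript
import { createHash } from "node:crypto";

// Simplified stand-in for serializeValue(): recursively walks objects,
// base64-encodes raw bytes, and stringifies numbers at full precision.
function serializeValue(value: unknown): string {
  if (value instanceof Uint8Array) {
    return Buffer.from(value).toString("base64"); // raw audio bytes enter the key
  }
  if (value !== null && typeof value === "object") {
    return Object.entries(value as Record<string, unknown>)
      .map(([k, v]) => `${k}:${serializeValue(v)}`)
      .join(",");
  }
  return String(value); // timing floats stringified at full precision
}

function cacheKey(props: unknown): string {
  return createHash("sha256").update(serializeValue(props)).digest("hex");
}

// Two "runs" of the same speech: identical text, but slightly different
// audio bytes and timing floats, as neural TTS produces.
const runA = { audio: new Uint8Array([1, 2, 3]), duration: 4.021337 };
const runB = { audio: new Uint8Array([1, 2, 4]), duration: 4.021952 };

console.log(cacheKey(runA) === cacheKey(runB)); // false: guaranteed cache miss
```

Any single byte or float that differs between runs flips the hash, so the Video element can never hit its cache.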
Suggested Fix
Fix A — Cache speech generation. Wrap the ElevenLabs /with-timestamps fetch with withCache, keyed on text + voice + model. This prevents redundant API calls and produces stable audio bytes on cache hit.
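Fix A could look roughly like the following. withCache, its in-memory store, and generateSpeechRaw are illustrative stand-ins for the real helper in src/ai-sdk/cache.ts and the actual ElevenLabs fetch, not the project's API:

```typescript
import { createHash } from "node:crypto";

const store = new Map<string, Uint8Array>(); // stand-in for the real cache backend

async function withCache(key: string, fn: () => Promise<Uint8Array>): Promise<Uint8Array> {
  const hit = store.get(key);
  if (hit) return hit; // cache hit: same bytes as last run, no API call
  const bytes = await fn();
  store.set(key, bytes);
  return bytes;
}

// Placeholder for the real ElevenLabs /with-timestamps fetch.
let apiCalls = 0;
async function generateSpeechRaw(text: string, _voice: string, _model: string): Promise<Uint8Array> {
  apiCalls++;
  return new TextEncoder().encode(text + Math.random()); // non-deterministic bytes
}

// Keyed only on the semantic inputs: text + voice + model.
function generateSpeechCached(text: string, voice: string, model: string): Promise<Uint8Array> {
  const key = createHash("sha256").update(JSON.stringify({ text, voice, model })).digest("hex");
  return withCache(key, () => generateSpeechRaw(text, voice, model));
}

(async () => {
  const first = await generateSpeechCached("hello", "voice-1", "model-1");
  const second = await generateSpeechCached("hello", "voice-1", "model-1");
  console.log(apiCalls, first === second); // 1 true: one API call, identical bytes
})();
```

On a cache hit the returned bytes are bit-identical to the first run, which is exactly the stability the downstream Video cache key needs.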
Fix B — Exclude audio bytes from Video cache key. When serializeValue encounters a ResolvedElement inside a prop (e.g., prompt.audio), it should use the element's semantic identity (type + props + children text, via computeCacheKey) rather than serializing the physical bytes from meta.file._data. This makes the Video cache key depend on what was said, not the exact bytes of the audio.
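A sketch of Fix B. The ResolvedElement shape and computeCacheKey are simplified stand-ins for the real types (the real computeCacheKey would also fold in children text, per the description above):

```typescript
interface ResolvedElement {
  type: string;
  props: Record<string, unknown>;
  meta: { file: { _data: Uint8Array }; duration: number };
}

function isResolvedElement(v: unknown): v is ResolvedElement {
  return v !== null && typeof v === "object" && "type" in v && "meta" in v;
}

// Semantic identity: what was said, not the physical audio bytes.
function computeCacheKey(el: ResolvedElement): string {
  return `${el.type}(${JSON.stringify(el.props)})`;
}

function serializeValue(value: unknown): string {
  if (isResolvedElement(value)) return computeCacheKey(value); // never descend into meta
  if (value instanceof Uint8Array) return Buffer.from(value).toString("base64");
  if (value !== null && typeof value === "object") {
    return Object.entries(value as Record<string, unknown>)
      .map(([k, v]) => `${k}:${serializeValue(v)}`)
      .join(",");
  }
  return String(value);
}

// Same segment, two runs: different bytes and duration, same semantic identity.
const runA = { type: "Speech", props: { text: "hello" }, meta: { file: { _data: new Uint8Array([1]) }, duration: 4.02 } };
const runB = { type: "Speech", props: { text: "hello" }, meta: { file: { _data: new Uint8Array([2]) }, duration: 4.05 } };

console.log(serializeValue({ prompt: { audio: runA } }) === serializeValue({ prompt: { audio: runB } })); // true
```

The key design choice is the early return: the moment the walker recognizes a resolved element, it substitutes the element's semantic key and never reaches meta.file._data or the timing floats.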
Fix A alone is sufficient if the cache produces bit-identical results (which it should — returning the same cached bytes). Fix B is the robust solution for cases where a resolved element is embedded in another element's props.
Affected Files
src/react/resolve.ts — resolveSpeechElement() needs to wrap the speech generation in withCache
src/react/renderers/utils.ts — serializeValue() should detect ResolvedElement/VargElement and use computeCacheKey instead of recursive Object.entries() walk
src/ai-sdk/cache.ts — may need a speech-specific cache wrapper
src/react/renderers/cache.test.ts — missing test for props containing resolved elements
Repro
# Run once — generates speech + 2 VEED videos (~$0.50 + wait time)
bun run src/react/examples/async/speech-segments.tsx
# Run again — speech is re-generated, VEED videos are re-rendered (cache miss)
bun run src/react/examples/async/speech-segments.tsx
Expected: second run should hit cache for both Speech and Video elements.