
Add maxSpeechMs option#256

Open
alielbekov wants to merge 3 commits into master from max-segment-duration

Conversation

@alielbekov
Collaborator

Description of changes

Adds a new maxSpeechMs parameter to control the maximum duration of speech segments. When a speech segment exceeds this duration, it is automatically force-cut and emitted, and a new segment starts if speech is still detected.
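The force-cut behavior described above can be sketched as a simple simulation of the segmentation logic. This is illustrative only, not the actual implementation; `segmentFrames`, `frameMs`, and `SegmenterOptions` are hypothetical names:

```typescript
// Illustrative sketch of maxSpeechMs: frames accumulate until the buffered
// speech duration reaches the cap, at which point the segment is emitted and
// a new one begins. All names here are hypothetical.

interface SegmenterOptions {
  frameMs: number;     // duration each audio frame represents
  maxSpeechMs: number; // force-cut threshold proposed in this PR
}

function segmentFrames(
  frames: Float32Array[],
  { frameMs, maxSpeechMs }: SegmenterOptions,
): Float32Array[][] {
  const segments: Float32Array[][] = [];
  let current: Float32Array[] = [];
  let bufferedMs = 0;

  for (const frame of frames) {
    current.push(frame);
    bufferedMs += frameMs;
    // Force-cut: emit the segment once it reaches maxSpeechMs.
    if (bufferedMs >= maxSpeechMs) {
      segments.push(current);
      current = [];
      bufferedMs = 0;
    }
  }
  // If speech is still ongoing, the remainder becomes the start of a new segment.
  if (current.length > 0) segments.push(current);
  return segments;
}
```

For example, ten 100 ms frames with maxSpeechMs of 300 would be cut into three full segments plus one trailing partial segment.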

Checklist

  • Verified that the typechecking & formatting GitHub actions pass successfully
  • Verified that changes work on the test site, adding changes to the test site if necessary to try out your changes

@vercel

vercel Bot commented Jan 3, 2026

The latest updates on your projects.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| vad_test_site | Ready | Ready (Preview, Comment) | Jan 3, 2026 2:38pm |

@ricky0123
Owner

ricky0123 commented Jan 11, 2026

Nice! This is awesome. I tried it and it works in the test site. At the same time, I think we should think carefully before we commit to this approach. Here are a couple of my concerns

  • When maxSpeechMs is reached and speech end/speech start is called, it's not transparent in the callbacks that this is all part of the same utterance. People may want to combine these audio segments in their backend, and now they have to find some way to figure out whether two invocations of onSpeechEnd were part of the same larger utterance.
  • I'm not sure about resetting redemptionCounter when maxSpeechMs is reached. I think that reaching maxSpeechMs is more about "flushing" the buildup of audio rather than updating the state of the VAD algorithm.

I'm tempted to go more the route of, instead of adding a maxSpeechMs parameter, adding a method that allows the user to "flush" audio so that they can do something with it. The user could then manually set an interval timer to call it as often as they want. One of the benefits is that they could also get fancy with it, like calling it when speech probability is relatively low instead of just at regular intervals.

I'm not committed to any one approach but I want to talk about the pros and cons of each before we commit.

Feel free to tag additional people who you think may have an opinion.

@ricky0123
Owner

@pepe95270 feel free also to share your thoughts as a user

@pepe95270

Thank you both, this is a great addition to the project!
As pointed out by ricky, I agree that a method allowing the user to "flush" audio adds even more value: it gives the developer far more control and enables them to:

  • create a frontend button for flushing
  • do fancy stuff to decide when to flush
  • simulate the same behavior as maxSpeechMs, e.g. setInterval(MicVAD.newFlushMethod, 60000);
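The setInterval idea above could be wrapped in a small helper. Everything here is hypothetical (`Flushable`, `flush`, and `makePeriodicFlusher` are not real APIs of this library); exposing the tick separately would also cover the frontend-button use case:

```typescript
// Sketch of the periodic-flush idea: wire a hypothetical flush method to a
// timer. All names here are illustrative, not part of the library.

type Flushable = { flush(): Float32Array };

function makePeriodicFlusher(
  vad: Flushable,
  onAudio: (audio: Float32Array) => void,
) {
  // One flush step, exposed separately so it can also be triggered manually,
  // e.g. from a frontend button.
  const tick = () => onAudio(vad.flush());
  return {
    tick,
    // Start flushing every intervalMs; returns a function that stops it.
    start(intervalMs: number): () => void {
      const id = setInterval(tick, intervalMs);
      return () => clearInterval(id);
    },
  };
}
```

With something like this, the one-liner above becomes `makePeriodicFlusher(vad, handleAudio).start(60000)`, and the same tick can back a UI button or a probability-based trigger.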

@alielbekov
Collaborator Author

alielbekov commented Jan 12, 2026

Thank you for your comments and reviews! Yes, I also think adding flushAudio: (audio: Float32Array) => {} could be a lot more useful.

Currently, we emit audio frame by frame via onFrameProcessed: ({...}, frame: Float32Array). We could potentially work around this by collecting the processed frames ourselves instead of calling "flushAudio".
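That frame-collecting workaround might look roughly like this; the collector and its method names are illustrative, and the callback shape is only assumed from the comment above, not taken from the library:

```typescript
// Illustrative workaround: collect the frames handed to onFrameProcessed and
// concatenate them on demand, instead of a dedicated flush method.

function makeFrameCollector() {
  let frames: Float32Array[] = [];
  return {
    // Wire as: onFrameProcessed: (probs, frame) => collector.onFrame(frame)
    onFrame(frame: Float32Array): void {
      frames.push(frame);
    },
    // Concatenate everything collected so far and reset the buffer.
    drain(): Float32Array {
      const total = frames.reduce((n, f) => n + f.length, 0);
      const out = new Float32Array(total);
      let offset = 0;
      for (const f of frames) {
        out.set(f, offset);
        offset += f.length;
      }
      frames = [];
      return out;
    },
  };
}
```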

But yes, I think being able to get the current audio: Float32Array at any time is a lot nicer.

Also, how would this look for non-real-time VAD?

Tagging @AmgadHasan if they have any input on this. #79

@ricky0123
Owner

> Thank you for your comments and reviews! Yes, I also think adding flushAudio: (audio: Float32Array) => {} could be a lot more useful.
>
> Currently, we emit audio frame by frame via onFrameProcessed: ({...}, frame: Float32Array). We could potentially work around this by collecting the processed frames ourselves instead of calling "flushAudio".
>
> But yes, I think being able to get the current audio: Float32Array at any time is a lot nicer.
>
> Also, how would this look for non-real-time VAD?
>
> Tagging @AmgadHasan if they have any input on this. #79

I think we can ignore this feature for non-real-time VAD. It can be a real-time-only feature. And yeah, I think we can try going with adding a flushAudio method as you mention. It should have the property that if you call flushAudio, the audio segment that you get does not appear in the audio segment in onSpeechEnd, i.e. it truly flushes the audio instead of just giving you access to it. Does that sound good to everyone?
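The proposed contract ("flushed audio does not reappear in onSpeechEnd") can be sketched with a toy buffer. `SpeechBuffer` and its methods are hypothetical, not the library's implementation:

```typescript
// Sketch of the proposed flush semantics: flush() hands back the buffered
// audio AND removes it, so the segment later delivered to onSpeechEnd
// contains only audio recorded after the flush.

class SpeechBuffer {
  private samples: number[] = [];

  addFrame(frame: Float32Array): void {
    for (let i = 0; i < frame.length; i++) this.samples.push(frame[i]);
  }

  // "Truly flushes": the returned audio will not reappear in endSpeech().
  flush(): Float32Array {
    const out = Float32Array.from(this.samples);
    this.samples = [];
    return out;
  }

  // Simulates what onSpeechEnd would receive after a flush.
  endSpeech(): Float32Array {
    return this.flush();
  }
}
```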

@alielbekov
Collaborator Author

> it truly flushes the audio instead of just giving you access to it.

Sounds good
