How VAD Works
VAD assigns a probability score to each audio chunk, and a threshold on that score determines whether the chunk is treated as speech or noise.
- If audio is missing expected speech segments, lowering the threshold may help include more speech but increases the risk of false positives (hallucinations).
- If hallucinations occur, the filtering is likely too lenient, and raising the threshold may help.
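The trade-off can be illustrated with a minimal sketch. The scores below are made up and the function name `speech_chunks` is hypothetical; this is not how any particular VAD model is implemented, only a picture of what a threshold does to per-chunk probabilities.

```python
def speech_chunks(probs, threshold):
    """Return the indices of chunks whose speech probability meets the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]


# Made-up per-chunk speech probabilities, for illustration only.
probs = [0.05, 0.35, 0.80, 0.45, 0.10]

strict = speech_chunks(probs, 0.5)   # only the clearest chunk is kept
lenient = speech_chunks(probs, 0.3)  # quieter chunks are kept too

print(strict)   # [2]
print(lenient)  # [1, 2, 3]
```

With the lower threshold, two more chunks are kept. If those chunks are real speech, the transcript gains missing words; if they are noise, they may be transcribed as hallucinated text.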
VAD Onset and Offset
- VAD Onset: The probability threshold that determines the start of detected speech.
- VAD Offset: The probability threshold that determines the end of detected speech.
Impact of Onset and Offset Settings
- If the start of speech is missing, the onset threshold may be too high.
- If the end of speech is missing, the offset threshold may be too high.
- The model processes audio in chunks, so determining exact start and end points can be complex.
- Adjusting onset often requires adjusting offset as well, since both affect how the model processes speech segments.
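How onset and offset interact can be sketched as a simple two-threshold rule: a segment starts when the probability rises to the onset value and ends when it falls below the offset value. This is a simplified illustration under that assumption, not the actual implementation of any VAD model; the function name `detect_segments` and the sample scores are hypothetical.

```python
def detect_segments(probs, onset=0.5, offset=0.3):
    """Group per-chunk speech probabilities into (start, end) chunk ranges.

    A segment opens at the first chunk with probability >= onset and
    closes at the first later chunk with probability < offset.
    The end index is exclusive.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None:
            if p >= onset:
                start = i  # speech begins here
        elif p < offset:
            segments.append((start, i))  # speech ended before chunk i
            start = None
    if start is not None:
        # Audio ended while still inside a speech segment.
        segments.append((start, len(probs)))
    return segments


# Made-up scores: speech fades in and out gradually.
probs = [0.1, 0.4, 0.6, 0.8, 0.4, 0.2, 0.05]

print(detect_segments(probs, onset=0.5, offset=0.3))  # [(2, 5)]
print(detect_segments(probs, onset=0.3, offset=0.1))  # [(1, 6)]
print(detect_segments(probs, onset=0.7, offset=0.5))  # [(3, 4)]
```

In this sketch, lowering both thresholds widens the detected segment at both ends, while raising them trims it. This is why a missing start usually points to onset and a missing end to offset.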
Recommended VAD Ranges
Audio Type | Recommended VAD Offset | Recommended VAD Onset |
---|---|---|
Normal audio | 0.3 | 0.5 |
Noisy audio (with background noise) | 0.1 | 0.3 |
High-quality clear audio | 0.5 | 0.7 |
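For scripted workflows, the table can be kept as a small lookup of starting points. The preset names, the dictionary keys, and the helper `vad_settings` below are hypothetical and chosen for illustration; map them onto whatever parameter names your tool actually uses.

```python
# Starting values taken from the table above; the key names are illustrative.
VAD_PRESETS = {
    "normal": {"onset": 0.5, "offset": 0.3},
    "noisy": {"onset": 0.3, "offset": 0.1},
    "clear": {"onset": 0.7, "offset": 0.5},
}


def vad_settings(audio_type):
    """Return a copy of the recommended onset/offset pair for an audio type."""
    if audio_type not in VAD_PRESETS:
        known = ", ".join(sorted(VAD_PRESETS))
        raise ValueError(f"unknown audio type {audio_type!r}; expected one of: {known}")
    return dict(VAD_PRESETS[audio_type])


print(vad_settings("noisy"))  # {'onset': 0.3, 'offset': 0.1}
```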
VAD Limitations
- VAD only provides a probability score indicating whether a chunk is speech or noise—it does not classify or transcribe speech.
- The effectiveness of VAD also depends on which Whisper model version is used.
- VAD tuning is often a trial-and-error process to find the best configuration for specific audio.
Experimentation and Optimization
Since VAD performance varies based on audio characteristics, fine-tuning the onset and offset values requires testing different settings. Factors to consider include:
- The amount of background noise in the recording.
- The quality of speech recording (clarity, volume, and distortion).
- The model being used for processing.
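One way to organize this testing is a small sweep over candidate onset and offset pairs. The sketch below assumes the offset should stay below the onset, as it does in every row of the recommended ranges. The `evaluate` callable is a placeholder you would supply yourself, for example a function that runs a transcription with the given settings and returns a quality score; `candidate_pairs` and `sweep` are hypothetical names.

```python
from itertools import product


def candidate_pairs(onsets, offsets):
    """Return all (onset, offset) pairs where the offset is below the onset."""
    return [(on, off) for on, off in product(onsets, offsets) if off < on]


def sweep(evaluate, onsets, offsets):
    """Call evaluate(onset, offset) for each candidate pair and collect the results."""
    return [(on, off, evaluate(on, off)) for on, off in candidate_pairs(onsets, offsets)]


pairs = candidate_pairs([0.3, 0.5, 0.7], [0.1, 0.3, 0.5])
print(len(pairs))  # 6


# Stand-in for a real evaluation step; replace with your own scoring function.
def dummy_evaluate(onset, offset):
    return round(onset - offset, 2)


for onset, offset, score in sweep(dummy_evaluate, [0.3, 0.5, 0.7], [0.1, 0.3, 0.5]):
    print(onset, offset, score)
```

Comparing the results for the same recording across pairs makes it easier to see which setting recovers missing speech without introducing hallucinations.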