Whisper Troubleshooting
VAD Parameter Tuning
Guide to tuning Voice Activity Detection parameters
Voice Activity Detection (VAD), also known as Speech Activity Detection or Speech Detection, is the process of identifying the presence or absence of human speech in an audio signal. It is widely used in speech processing to filter out non-speech segments.
How VAD Works
VAD assigns each audio chunk a probability score indicating how likely it is to contain speech rather than noise. Chunks that score at or above a threshold are treated as speech; the rest are filtered out.
- If audio is missing expected speech segments, lowering the threshold may help include more speech, but it increases the risk of false positives (hallucinations).
- If hallucinations occur, the threshold is likely too low and non-speech audio is passing through the filter. Raising it filters more aggressively.
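The trade-off above can be sketched with a single threshold applied to per-chunk scores. This is a minimal illustration, not the actual VAD implementation: the function name `speech_chunks` and the probability values are made up for the example.

```python
def speech_chunks(probs, threshold=0.5):
    """Return the indices of chunks whose speech probability meets the threshold.

    Hypothetical helper for illustration only; a real VAD model produces
    the probabilities and applies its own decision logic.
    """
    return [i for i, p in enumerate(probs) if p >= threshold]


# Made-up per-chunk speech probabilities.
probs = [0.05, 0.42, 0.91, 0.88, 0.35, 0.10]

strict = speech_chunks(probs, threshold=0.5)   # keeps fewer chunks
lenient = speech_chunks(probs, threshold=0.3)  # keeps more chunks, including borderline ones

print(strict)
print(lenient)
```

Lowering the threshold from 0.5 to 0.3 recovers the two borderline chunks, which helps if they were quiet speech and hurts if they were noise.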
VAD Onset and Offset
- VAD Onset: The probability threshold a chunk must reach for a speech segment to start.
- VAD Offset: The probability threshold below which an active speech segment ends.
Impact of Onset and Offset Settings
- If the start of speech is missing, the onset threshold may be too high.
- If the end of speech is missing, the offset threshold may be too high.
- The model processes audio in chunks, so determining exact start and end points can be complex.
- Adjusting onset often requires adjusting offset as well, since both affect how the model processes speech segments.
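One common way to combine the two thresholds is hysteresis: a segment opens when the score reaches the onset and closes when it drops below the offset. The sketch below assumes that behavior; `segment_speech` and the sample scores are hypothetical, and the real model may handle chunk boundaries differently.

```python
def segment_speech(probs, onset=0.5, offset=0.3):
    """Group per-chunk probabilities into (start, end) speech segments.

    Assumed hysteresis rule: a segment starts at the first chunk with
    p >= onset and ends at the first later chunk with p < offset.
    Segments are half-open: `end` is the first chunk outside the segment.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None:
            if p >= onset:
                start = i
        elif p < offset:
            segments.append((start, i))
            start = None
    if start is not None:
        # Speech was still active when the audio ended.
        segments.append((start, len(probs)))
    return segments


# Made-up scores: speech fades in and out gradually.
probs = [0.1, 0.4, 0.6, 0.8, 0.45, 0.35, 0.2, 0.1]

print(segment_speech(probs, onset=0.5, offset=0.3))
print(segment_speech(probs, onset=0.3, offset=0.1))
print(segment_speech(probs, onset=0.7, offset=0.5))
```

With the higher pair (0.7 / 0.5), both the soft start and the trailing end of the utterance are cut; with the lower pair (0.3 / 0.1), the segment starts earlier and runs to the end of the audio. This is why the two values are usually adjusted together.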
Recommended VAD Ranges
| Audio Type | Recommended VAD Offset | Recommended VAD Onset |
|---|---|---|
| Normal audio | 0.3 | 0.5 |
| Noisy audio (with background noise) | 0.1 | 0.3 |
| High-quality clear audio | 0.5 | 0.7 |
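If you switch between audio types often, the table can be kept as a small lookup of starting points. The values below are copied from the table; the key names `vad_onset` and `vad_offset` and the helper `vad_settings` are assumptions for illustration, so map them to whatever parameter names your tooling actually uses.

```python
# Starting points taken from the table above. Key names are illustrative.
VAD_PRESETS = {
    "normal": {"vad_onset": 0.5, "vad_offset": 0.3},
    "noisy": {"vad_onset": 0.3, "vad_offset": 0.1},
    "clear": {"vad_onset": 0.7, "vad_offset": 0.5},
}


def vad_settings(audio_type):
    """Return a copy of the preset for the given audio type."""
    try:
        return dict(VAD_PRESETS[audio_type])
    except KeyError:
        known = ", ".join(sorted(VAD_PRESETS))
        raise ValueError(
            f"Unknown audio type {audio_type!r}; expected one of: {known}"
        ) from None


settings = vad_settings("noisy")
print(settings)
```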
VAD Limitations
- VAD only provides a probability score indicating whether a chunk is speech or noise; it does not classify or transcribe speech.
- The effectiveness of VAD also depends on which Whisper model version is used.
- VAD tuning is often a trial-and-error process to find the best configuration for specific audio.
Experimentation and Optimization
Since VAD performance varies based on audio characteristics, fine-tuning the onset and offset values requires testing different settings. Factors to consider include:
- The amount of background noise in the recording.
- The quality of speech recording (clarity, volume, and distortion).
- The model being used for processing.
By experimenting with different VAD thresholds and configurations, you can optimize speech detection for your specific use cases.
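When a short clip has been labeled by hand, the trial-and-error process can be partly automated by sweeping candidate onset/offset pairs and scoring each against the labels. The sketch below is one possible approach under stated assumptions: it uses the hysteresis rule (start at onset, stop below offset), chunk-level accuracy as the score, and made-up probabilities and labels. The names `detect` and `best_thresholds` are hypothetical.

```python
from itertools import product


def detect(probs, onset, offset):
    """Return per-chunk speech flags using an assumed onset/offset hysteresis."""
    flags = []
    active = False
    for p in probs:
        if not active and p >= onset:
            active = True
        elif active and p < offset:
            active = False
        flags.append(active)
    return flags


def best_thresholds(probs, reference, candidates):
    """Try every (onset, offset) pair with offset <= onset.

    Returns (accuracy, onset, offset) for the first pair with the highest
    fraction of chunks that agree with the hand-labeled reference.
    """
    best = None
    for onset, offset in product(candidates, repeat=2):
        if offset > onset:
            continue  # an offset above the onset is not a useful setting
        flags = detect(probs, onset, offset)
        matches = sum(f == r for f, r in zip(flags, reference))
        accuracy = matches / len(reference)
        if best is None or accuracy > best[0]:
            best = (accuracy, onset, offset)
    return best


# Made-up scores and hand labels for an 8-chunk clip.
probs = [0.1, 0.4, 0.6, 0.8, 0.45, 0.35, 0.2, 0.1]
reference = [False, True, True, True, True, True, False, False]

accuracy, onset, offset = best_thresholds(probs, reference, [0.1, 0.3, 0.5, 0.7])
print(f"onset={onset}, offset={offset}, accuracy={accuracy:.2f}")
```

A sweep like this only narrows the search. The final check should still be the transcript itself, since chunk-level accuracy does not capture hallucinations directly.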