VAD Parameter Tuning

Voice Activity Detection (VAD), also known as Speech Activity Detection or Speech Detection, is the process of identifying the presence or absence of human speech in an audio signal. It is widely used in speech processing to filter out non-speech segments.

How VAD Works

VAD assigns each audio chunk a probability score indicating how likely it is to contain speech rather than noise. The score is compared against a threshold to decide whether the chunk is treated as speech.

If audio is missing expected speech segments, lowering the threshold may help include more speech, but it increases the risk of false positives (hallucinations).
If hallucinations occur, the VAD is likely letting too much non-speech audio through; raising the threshold filters more aggressively.
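
The trade-off above can be illustrated with a minimal sketch. The function name `speech_mask` and the probability values are hypothetical and are not taken from any particular VAD library; the sketch only shows how a lower threshold keeps more chunks.

```python
def speech_mask(probs, threshold=0.5):
    """Mark each chunk as speech when its probability meets the threshold."""
    return [p >= threshold for p in probs]


# Hypothetical per-chunk speech probabilities, as a VAD model might emit them.
probs = [0.05, 0.42, 0.81, 0.93, 0.38, 0.07]

strict = speech_mask(probs, threshold=0.5)
lenient = speech_mask(probs, threshold=0.3)

# The lower threshold keeps the borderline chunks (0.42 and 0.38) as well,
# which recovers quiet speech but can also admit noise.
print(sum(strict), "chunks kept at threshold 0.5")
print(sum(lenient), "chunks kept at threshold 0.3")
```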

VAD Onset and Offset

VAD Onset: The probability threshold that determines the start of detected speech.
VAD Offset: The probability threshold that determines the end of detected speech.

Impact of Onset and Offset Settings

If the start of speech is missing, the onset threshold may be too high.
If the end of speech is missing, the offset threshold may be too high.
The model processes audio in chunks, so determining exact start and end points can be complex.
Adjusting onset often requires adjusting offset as well, since both affect how the model processes speech segments.
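
One common way to read the two thresholds is as a hysteresis rule: a segment opens when the score rises to the onset value and closes when it falls below the offset value. The sketch below assumes that interpretation; the function `detect_segments`, the chunk-index output, and the sample scores are illustrative and may not match how a given pipeline applies the thresholds.

```python
def detect_segments(probs, onset=0.5, offset=0.3):
    """Return (start, end) chunk-index pairs, with end exclusive.

    Assumed rule: a segment starts at the first chunk whose score is
    >= onset and ends at the first later chunk whose score is < offset.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None:
            if p >= onset:
                start = i
        elif p < offset:
            segments.append((start, i))
            start = None
    if start is not None:
        # Speech was still active when the audio ended.
        segments.append((start, len(probs)))
    return segments


# Hypothetical scores: speech ramps up, then fades out gradually.
probs = [0.1, 0.4, 0.6, 0.8, 0.45, 0.35, 0.2, 0.1]

print(detect_segments(probs, onset=0.5, offset=0.3))
print(detect_segments(probs, onset=0.7, offset=0.5))
print(detect_segments(probs, onset=0.3, offset=0.1))
```

With these sample scores, the higher thresholds start the segment later and end it earlier, while the lower pair starts earlier and never closes the segment before the audio ends. This is why the two values usually need to be adjusted together.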

Recommended VAD Ranges

Audio Type	Recommended VAD Offset	Recommended VAD Onset
Normal audio	0.3	0.5
Noisy audio (with background noise)	0.1	0.3
High-quality clear audio	0.5	0.7
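
The table can be kept as a small lookup of starting points. This is only a sketch: the preset names and the keys `vad_onset` and `vad_offset` are hypothetical and should be mapped to whatever parameter names your tooling actually uses.

```python
# Starting values copied from the table above; the keys are illustrative.
VAD_PRESETS = {
    "normal": {"vad_onset": 0.5, "vad_offset": 0.3},
    "noisy": {"vad_onset": 0.3, "vad_offset": 0.1},
    "clear": {"vad_onset": 0.7, "vad_offset": 0.5},
}


def vad_settings(audio_type, **overrides):
    """Return a copy of a preset, optionally overriding individual values."""
    settings = dict(VAD_PRESETS[audio_type])
    settings.update(overrides)
    for name, value in settings.items():
        # Thresholds are probabilities, so they must lie in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return settings


print(vad_settings("noisy"))
print(vad_settings("normal", vad_onset=0.45))
```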

VAD Limitations

VAD only provides a probability score indicating whether a chunk is speech or noise; it does not classify or transcribe speech.
The effectiveness of VAD also depends on which Whisper model version is used.
VAD tuning is often a trial-and-error process to find the best configuration for specific audio.

Experimentation and Optimization

Since VAD performance varies with audio characteristics, tuning the onset and offset values requires testing different settings. Factors to consider include:

The amount of background noise in the recording.
The quality of speech recording (clarity, volume, and distortion).
The model and model version being used for processing.

By experimenting with different VAD thresholds and configurations, you can optimize speech detection for your specific use cases.
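
The trial-and-error process can be made systematic with a small sweep. The sketch below assumes you have per-chunk speech probabilities and hand-labelled reference labels for a short sample; the helper names, the hysteresis rule, and the data are all hypothetical. It counts missed speech chunks and false alarms for each onset/offset pair.

```python
from itertools import product


def label_chunks(probs, onset, offset):
    """Per-chunk speech labels under an assumed onset/offset hysteresis rule."""
    labels, active = [], False
    for p in probs:
        if not active and p >= onset:
            active = True
        elif active and p < offset:
            active = False
        labels.append(active)
    return labels


def sweep(probs, reference, onsets, offsets):
    """Score every onset/offset pair; lowest total error sorts first."""
    results = []
    for onset, offset in product(onsets, offsets):
        if offset > onset:
            # Assumption: offset is kept at or below onset, as in the table.
            continue
        predicted = label_chunks(probs, onset, offset)
        missed = sum(r and not p for r, p in zip(reference, predicted))
        false_alarms = sum(p and not r for r, p in zip(reference, predicted))
        results.append((missed + false_alarms, onset, offset, missed, false_alarms))
    return sorted(results)


# Hypothetical scores and hand-labelled ground truth for eight chunks.
probs = [0.1, 0.4, 0.6, 0.8, 0.45, 0.35, 0.2, 0.1]
reference = [False, True, True, True, True, True, False, False]

for total, onset, offset, missed, false_alarms in sweep(
    probs, reference, onsets=(0.3, 0.5, 0.7), offsets=(0.1, 0.3, 0.5)
):
    print(f"onset={onset} offset={offset} missed={missed} false_alarms={false_alarms}")
```

Counting missed chunks and false alarms separately makes it easier to see which direction to move each threshold, rather than relying on a single combined score.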
