audio_file
(string): Audio input, either Base64-encoded audio file data or a publicly accessible audio URL.
language
(string): Source language code (e.g., "hi" for Hindi, "en" for English, or None for auto-detection).
task
(string): Processing task, either transcribe (speech-to-text in the same language) or translate (translate to English).
vad_model
(string): Voice activity detection (VAD) model to use - "silero" (recommended for accuracy and speed)
vad_onset
(float): Threshold for speech start detection (0.0-1.0, default: 0.5)
vad_offset
(float): Threshold for speech end detection (0.0-1.0, default: 0.3)
word_timestamps
(boolean): Enable word-level timestamps in output (true/false).
without_timestamps
(boolean): Exclude timestamps from the transcription text, so that only text tokens are sampled from the Whisper model (true/false).
diarization
(boolean): Enable speaker diarization to identify different speakers (true/false).
strict_hallucination_reduction
(boolean): Apply post-processing filters to remove repeated phrases in the transcription (true/false).
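
The parameters above could be assembled into a request body roughly as follows. This is a minimal sketch, not the service's client: the helper name build_payload is hypothetical, the boolean defaults (all false) and the "transcribe" default are assumptions, and only the vad_onset and vad_offset defaults (0.5 and 0.3) come from the list above. The endpoint and authentication are not specified here, so the sketch stops at producing the JSON body.

```python
import json

# Task values named in the parameter list.
VALID_TASKS = {"transcribe", "translate"}


def build_payload(
    audio_file,
    language=None,            # None requests auto-detection
    task="transcribe",        # assumed default, not stated in the list
    vad_model="silero",
    vad_onset=0.5,            # documented default
    vad_offset=0.3,           # documented default
    word_timestamps=False,    # boolean defaults below are assumptions
    without_timestamps=False,
    diarization=False,
    strict_hallucination_reduction=False,
):
    """Validate the documented parameters and return them as a dict."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    # Both VAD thresholds are documented as lying in the range 0.0-1.0.
    for name, value in (("vad_onset", vad_onset), ("vad_offset", vad_offset)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return {
        "audio_file": audio_file,
        "language": language,
        "task": task,
        "vad_model": vad_model,
        "vad_onset": vad_onset,
        "vad_offset": vad_offset,
        "word_timestamps": word_timestamps,
        "without_timestamps": without_timestamps,
        "diarization": diarization,
        "strict_hallucination_reduction": strict_hallucination_reduction,
    }


# Example: Hindi audio from a placeholder URL, with word-level timestamps.
payload = build_payload(
    "https://example.com/sample.wav",
    language="hi",
    word_timestamps=True,
)
print(json.dumps(payload, indent=2))
```

When serialized with json.dumps, a language of None becomes JSON null; whether the service expects null or an omitted key for auto-detection is not stated in the list above.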