Whisper V3 API
Process audio files for transcription or translation with enhanced language support. The endpoint accepts multiple audio formats and can return word-level timestamps and speaker diarization.
Authorizations
JWT token for authentication
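As a minimal sketch of attaching the token to a request: the reference only says a JWT is required, so the `Authorization: Bearer ...` scheme and the helper name `auth_headers` below are assumptions, not something this page confirms.

```python
def auth_headers(jwt_token: str) -> dict:
    """Build request headers carrying the JWT.

    Assumption: the API expects the common "Bearer" scheme.
    """
    if not jwt_token:
        raise ValueError("a JWT token is required")
    return {
        "Authorization": f"Bearer {jwt_token}",
        "Content-Type": "application/json",
    }
```

If the service expects a different header or scheme, only the dictionary above would need to change.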
Body
Audio content as a base64-encoded string. This is a text field, not a file upload: convert the audio file (.mp3, .wav, or .flac) to base64 first, then supply the resulting string here.
"base64_encoded_audio_content"
Language code (e.g. 'en' for English)
"en"
Task type: transcribe in the source language or translate to English
transcribe, translate
"transcribe"
Optional starting text prompt for context
"Meeting transcript between John and Sarah:"
Number of parallel sequences evaluated
1 <= x <= 5
Number of best sequences considered
1 <= x <= 5
Include word-level timestamps
Enable speaker diarization
Enable voice activity detection filter
Exclude timestamps from output
Enable streaming output
Minimum number of speakers to detect (0 for automatic)
x >= 0
Maximum number of speakers to detect (0 for automatic)
x >= 0
Number of audio samples processed in one batch
0 <= x <= 24
Penalty for longer sequences (1.0 means no penalty)
x >= 0
Beam search patience factor
0 <= x <= 1
Minimum duration of silence for a break (seconds)
x >= 0
Minimum duration for speech detection (seconds)
x >= 0
Voice activity detection onset threshold
0 <= x <= 1
Voice activity detection offset threshold
0 <= x <= 1
Additional padding at segment end (seconds)
x >= 0
Additional padding at segment start (seconds)
x >= 0
Maximum duration to process (seconds)
x >= 0
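The body fields above can be assembled roughly as follows. This is a sketch under stated assumptions: the page lists descriptions and ranges but not field names, so every key used here (`audio`, `language`, `task`, `beam_size`, `best_of`, `batch_size`, `patience`, `vad_onset`, `vad_offset`) is hypothetical. The code base64-encodes the audio and rejects values outside the documented ranges before anything is sent.

```python
import base64

# Hypothetical field names: the reference gives descriptions and ranges only.
RANGES = {
    "beam_size": (1, 5),    # number of parallel sequences evaluated
    "best_of": (1, 5),      # number of best sequences considered
    "batch_size": (0, 24),  # audio samples processed in one batch
    "patience": (0, 1),     # beam search patience factor
    "vad_onset": (0, 1),    # voice activity detection onset threshold
    "vad_offset": (0, 1),   # voice activity detection offset threshold
}


def build_body(audio: bytes, **options) -> dict:
    """Base64-encode the audio and check ranged options against RANGES."""
    for name, value in options.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    # The audio travels as a base64 string, not as a file upload.
    body = {"audio": base64.b64encode(audio).decode("ascii")}
    body.update(options)
    return body
```

For example, `build_body(open("meeting.wav", "rb").read(), language="en", task="transcribe", beam_size=5)` returns a dictionary ready to be serialized as JSON, while `beam_size=9` raises a `ValueError` locally instead of producing a rejected request.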
Response
Array of transcribed text segments
[
"Hello, this is a test.",
"The audio quality is good."
]
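A small sketch of turning the segment array into a single transcript, assuming the segments arrive as plain strings as in the example above. The helper name `join_segments` is illustrative only.

```python
def join_segments(segments):
    """Join transcribed segments into one transcript, skipping blank entries."""
    return " ".join(s.strip() for s in segments if s.strip())
```

Joining on a single space keeps sentence punctuation intact; a newline separator may read better when each segment corresponds to a speaker turn.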
Total processing time in seconds
2.5
Detected or specified language
"en"