POST /model/v2/infer/whisper
curl --request POST \
  --url https://http.whisper.proxy.prod.s9t.link/model/v2/infer/whisper \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "audio_data": "base64_encoded_audio_content",
  "language": "en",
  "task": "transcribe",
  "beam_size": 5,
  "best_of": 5,
  "word_timestamps": 1,
  "diarization": 0,
  "streaming": 0,
  "batch_size": 24,
  "length_penalty": 1,
  "patience": 1,
  "vad_onset": 0.5,
  "vad_offset": 0.363
}'
{
  "transcription": [
    "<string>"
  ],
  "segments": [
    {
      "start": 123,
      "end": 123,
      "text": "<string>",
      "words": [
        {
          "word": "<string>",
          "start": 123,
          "end": 123
        }
      ]
    }
  ],
  "request_time": 123,
  "language": "<string>"
}

Authorizations

Authorization
string
header
required

JWT token for authentication

Body

application/json
audio_data
string
required

TEXT FIELD: This is a string field (not a file upload). Provide the audio as a base64-encoded string. First convert your audio file (.mp3, .wav, .flac) to base64, then paste the resulting string here.
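A minimal Python sketch of building this request body: read the audio file, base64-encode it into the `audio_data` string field, and POST the JSON to the endpoint shown in the curl example. The file name and token value are placeholders you must supply yourself.

```python
import base64
import json
import urllib.request

# Endpoint from the curl example above.
URL = "https://http.whisper.proxy.prod.s9t.link/model/v2/infer/whisper"


def build_payload(audio_bytes: bytes, language: str = "en") -> dict:
    """Base64-encode raw audio bytes into the JSON body this endpoint expects."""
    return {
        "audio_data": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
        "task": "transcribe",
        "word_timestamps": 1,  # boolean flags are integers here: 1 = true, 0 = false
    }


def transcribe(audio_path: str, token: str) -> dict:
    """POST one audio file for transcription and return the parsed JSON response."""
    with open(audio_path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Note that `audio_data` must be the base64 *string*, not raw bytes and not a multipart file upload; `build_payload` handles the encode-and-decode-to-ASCII step.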

language
string
required

Language code (e.g. 'en' for English)

Example:

"en"

task
enum<string>
required

Task type: transcribe in the source language, or translate to English

Available options:
transcribe,
translate
Example:

"transcribe"

initial_prompt
string

Optional starting text prompt for context

beam_size
integer
default:5

Number of parallel sequences evaluated

Required range: 1 <= x <= 5
best_of
integer
default:5

Number of best sequences considered

Required range: 1 <= x <= 5
word_timestamps
enum<integer>
default:0

Include word-level timestamps (0=false, 1=true)

Available options:
0,
1
diarization
enum<integer>
default:0

Enable speaker diarization (0=false, 1=true)

Available options:
0,
1
vad_filter
enum<integer>
default:1

Enable voice activity detection filter (0=false, 1=true)

Available options:
0,
1
without_timestamps
enum<integer>
default:0

Exclude timestamps from output (0=false, 1=true)

Available options:
0,
1
streaming
enum<integer>
default:0

Enable streaming output (0=false, 1=true)

Available options:
0,
1
min_speakers
number
default:0

Minimum number of speakers to detect

max_speakers
number
default:0

Maximum number of speakers to detect

batch_size
integer
default:24

Number of audio samples processed in one batch

Required range: 0 <= x <= 24
length_penalty
number
default:1

Penalty for longer sequences

patience
number
default:1

Beam search patience factor

Required range: 0 <= x <= 1
min_duration_off
number
default:0

Minimum duration of silence for a break

min_duration_on
number
default:0

Minimum duration for speech detection

vad_onset
number
default:0.5

Voice activity detection onset threshold

Required range: 0 <= x <= 1
vad_offset
number
default:0.363

Voice activity detection offset threshold

Required range: 0 <= x <= 1
pad_offset
number
default:0

Additional padding at segment end

pad_onset
number
default:0

Additional padding at segment start

max_duration
number
default:30

Maximum audio duration to process, in seconds

Response

200
application/json
Successful transcription
transcription
string[]

Array of transcribed text segments

segments
object[]
request_time
number

Total processing time in seconds

language
string

Detected or specified language
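A short sketch of walking the response shape documented above, rendering each segment with its timestamps. The `sample` dict here is illustrative, not real model output.

```python
def format_segments(response: dict) -> list[str]:
    """Render each segment as '[start-end] text' using the documented fields."""
    return [
        f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text'].strip()}"
        for seg in response.get("segments", [])
    ]


# Hypothetical response matching the schema above.
sample = {
    "transcription": ["hello world"],
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " hello world",
            "words": [
                {"word": "hello", "start": 0.0, "end": 0.5},
                {"word": "world", "start": 0.6, "end": 1.2},
            ],
        }
    ],
    "request_time": 0.8,
    "language": "en",
}

print(format_segments(sample))  # -> ['[0.00-1.20] hello world']
```

The per-word `words` array is only populated when the request sets `word_timestamps` to 1.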