POST /model/infer/whisper
curl --request POST \
  --url https://http.whisper.proxy.prod.s9t.link/model/infer/whisper \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "audio_data": "base64_encoded_audio_content",
  "language": "en",
  "task": "transcribe",
  "word_timestamps": true,
  "diarization": false,
  "streaming": false,
  "batch_size": 24,
  "length_penalty": 1,
  "patience": 1,
  "vad_onset": 0.5,
  "vad_offset": 0.363
}'
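The same request can be issued from Python's standard library. This is a sketch mirroring the curl example above; the endpoint URL and field names come from this reference, while the token and audio payload are placeholders:

```python
import json
import urllib.request

API_URL = "https://http.whisper.proxy.prod.s9t.link/model/infer/whisper"

def build_request(token: str, payload: dict) -> urllib.request.Request:
    """Build a POST request matching the curl example above."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

payload = {
    "audio_data": "base64_encoded_audio_content",  # see the audio_data field below
    "language": "en",
    "task": "transcribe",
    "word_timestamps": True,
}
req = build_request("<token>", payload)
# resp = urllib.request.urlopen(req)   # uncomment to actually send
# result = json.load(resp)
```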
{
  "transcription": [
    "Hello, this is a test.",
    "The audio quality is good."
  ],
  "segments": [
    {
      "start": 0,
      "end": 2.5,
      "text": "Hello, this is a test.",
      "words": [
        {
          "word": "Hello",
          "start": 0,
          "end": 0.5
        }
      ]
    }
  ],
  "request_time": 2.5,
  "language": "en"
}
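When word_timestamps is enabled, the segments array nests per-word timings inside each segment. A minimal sketch that flattens the sample response above into (word, start, end) tuples:

```python
def word_timings(response: dict) -> list[tuple[str, float, float]]:
    """Collect (word, start, end) tuples from every segment of a response."""
    timings = []
    for segment in response.get("segments", []):
        for w in segment.get("words", []):
            timings.append((w["word"], w["start"], w["end"]))
    return timings

# The sample response shown above:
sample = {
    "transcription": ["Hello, this is a test.", "The audio quality is good."],
    "segments": [
        {
            "start": 0,
            "end": 2.5,
            "text": "Hello, this is a test.",
            "words": [{"word": "Hello", "start": 0, "end": 0.5}],
        }
    ],
    "request_time": 2.5,
    "language": "en",
}
print(word_timings(sample))  # [('Hello', 0, 0.5)]
```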

Authorizations

Authorization
string
header
required

JWT token for authentication

Body

application/json
audio_data
string
required

Note: this is a string field, not a file upload. Convert your audio file (.mp3, .wav, or .flac) to base64 first, then pass the resulting string here.

Example:

"base64_encoded_audio_content"
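Producing that base64 string from a local file takes one standard-library call (the file path is illustrative):

```python
import base64

def encode_audio(path: str) -> str:
    """Read an audio file and return its base64 string for the audio_data field."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# payload = {"audio_data": encode_audio("meeting.wav"), "language": "en", "task": "transcribe"}
```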

language
string
required

Language code (e.g. 'en' for English)

Example:

"en"

task
enum<string>
required

Task type: transcribe in the source language, or translate the audio to English

Available options:
transcribe,
translate
Example:

"transcribe"

initial_prompt
string

Optional starting text prompt for context

Example:

"Meeting transcript between John and Sarah:"

beam_size
integer
default:5

Number of parallel sequences evaluated

Required range: 1 <= x <= 5
best_of
integer
default:5

Number of best sequences considered

Required range: 1 <= x <= 5
word_timestamps
boolean
default:false

Include word-level timestamps

diarization
boolean
default:false

Enable speaker diarization

vad_filter
boolean
default:true

Enable voice activity detection filter

without_timestamps
boolean
default:false

Exclude timestamps from output

streaming
boolean
default:false

Enable streaming output

min_speakers
integer
default:0

Minimum number of speakers to detect (0 for automatic)

Required range: x >= 0
max_speakers
integer
default:0

Maximum number of speakers to detect (0 for automatic)

Required range: x >= 0
batch_size
integer
default:24

Number of audio samples processed in one batch

Required range: 0 <= x <= 24
length_penalty
number
default:1

Penalty for longer sequences (1.0 means no penalty)

Required range: x >= 0
patience
number
default:1

Beam search patience factor

Required range: 0 <= x <= 1
min_duration_off
number
default:0

Minimum duration of silence for a break (seconds)

Required range: x >= 0
min_duration_on
number
default:0

Minimum duration for speech detection (seconds)

Required range: x >= 0
vad_onset
number
default:0.5

Voice activity detection onset threshold

Required range: 0 <= x <= 1
vad_offset
number
default:0.363

Voice activity detection offset threshold

Required range: 0 <= x <= 1
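This reference does not spell out how the two thresholds interact. The usual reading (an assumption here, matching common VAD hysteresis schemes) is that a frame enters the speech state when the voice probability rises above vad_onset and leaves it only once the probability falls below vad_offset, which suppresses flicker between the two thresholds:

```python
def hysteresis_vad(probs: list[float],
                   onset: float = 0.5,
                   offset: float = 0.363) -> list[bool]:
    """Mark frames as speech using onset/offset hysteresis (assumed semantics)."""
    speaking = False
    out = []
    for p in probs:
        if not speaking and p >= onset:
            speaking = True        # probability crossed the onset threshold
        elif speaking and p < offset:
            speaking = False       # stay in speech until dropping below offset
        out.append(speaking)
    return out

# 0.4 sits between the thresholds, so the segment is held open until 0.2:
print(hysteresis_vad([0.1, 0.6, 0.4, 0.2]))  # [False, True, True, False]
```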
pad_offset
number
default:0

Additional padding at segment end (seconds)

Required range: x >= 0
pad_onset
number
default:0

Additional padding at segment start (seconds)

Required range: x >= 0
max_duration
number
default:30

Maximum duration to process (seconds)

Required range: x >= 0
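The numeric constraints above can be checked client-side before sending a request. A small sketch that validates a payload against the documented "Required range" values (only a subset of fields shown; the ranges are copied from this reference):

```python
# (min, max) pairs from the documented "Required range" constraints; None = unbounded.
RANGES = {
    "beam_size": (1, 5),
    "best_of": (1, 5),
    "batch_size": (0, 24),
    "patience": (0, 1),
    "vad_onset": (0, 1),
    "vad_offset": (0, 1),
    "length_penalty": (0, None),
    "max_duration": (0, None),
}

def validate(payload: dict) -> list[str]:
    """Return human-readable range violations (empty list when valid)."""
    errors = []
    for field, (lo, hi) in RANGES.items():
        if field not in payload:
            continue
        v = payload[field]
        if v < lo or (hi is not None and v > hi):
            errors.append(f"{field}={v} outside [{lo}, {hi}]")
    return errors

print(validate({"batch_size": 24, "patience": 1}))  # []
print(validate({"beam_size": 9}))  # ['beam_size=9 outside [1, 5]']
```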

Response

200
application/json
Successful transcription
transcription
string[]
required

Array of transcribed text segments

Example:
[
  "Hello, this is a test.",
  "The audio quality is good."
]
request_time
number
required

Total processing time in seconds

Example:

2.5

language
string
required

Detected or specified language

Example:

"en"

segments
object[]