Overview

This API provides speech-to-text transcription and translation services using OpenAI’s Whisper V3 model with advanced features like voice activity detection (VAD), speaker diarization, and hallucination reduction.

Endpoint

POST "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict"

Authentication

Include a Bearer token in the request headers:
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}

Request Parameters

Required Parameters

  • audio_file (string): Audio input - either Base64-encoded audio file data or a publicly accessible audio URL.

Core Processing Parameters

  • language (string): Source language code, e.g. “hi” for Hindi or “en” for English; pass None for auto-detection.
  • task (string): Processing task - transcribe (speech-to-text in same language) or translate (translate to English)
    For the full list of valid language codes, see the supported languages in the Whisper model.

Voice Activity Detection (VAD) Parameters

  • vad_model (string): VAD model to use - “silero” (recommended for accuracy and speed)
  • vad_onset (float): Threshold for speech start detection (0.0-1.0, default: 0.5)
  • vad_offset (float): Threshold for speech end detection (0.0-1.0, default: 0.3)

Need help with VAD parameter tuning or Whisper troubleshooting? Check our detailed guides on VAD tuning and Whisper troubleshooting.
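
If the defaults clip speech or admit noise, the thresholds can be tuned per request. A minimal sketch with illustrative values (starting points, not recommendations):

# Illustrative VAD overrides; tune against your own audio.
vad_overrides = {
    "vad_model": "silero",
    "vad_onset": 0.6,    # speech-start threshold (default: 0.5)
    "vad_offset": 0.35,  # speech-end threshold (default: 0.3)
}
# Merge into the request payload shown under Example Request:
# payload.update(vad_overrides)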

Timestamp Parameters

  • word_timestamps (boolean): Enable word-level timestamps in output (true/false).
  • without_timestamps (boolean): Exclude timestamps from the transcription text, so only text tokens are sampled from the Whisper model (true/false).

Speaker Identification

  • diarization (boolean): Enable speaker diarization to attribute segments to different speakers (true/false)

Hallucination Reduction

  • strict_hallucination_reduction (boolean): Apply post-processing filters to remove repeated phrases in the transcription (true/false)

Audio Format Support

  • MP3, WAV, M4A, AAC
  • URLs must be publicly accessible or pre-signed.
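
As a light client-side check, you can verify the file extension against the supported set before encoding or uploading; a minimal sketch (the helper is illustrative, not part of the API):

from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac"}

def is_supported_audio(path: str) -> bool:
    # Extension check only; the server still validates the actual encoding
    # and returns 422 for unsupported formats.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS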

Example Request

Option 1: Base64-encoded audio file

import requests
import base64

# Read and encode audio file
with open("audio_file.mp3", "rb") as f:
    audio_data = f.read()
    audio_base64 = base64.b64encode(audio_data).decode("utf-8")

# API request with base64
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}
payload = {
    "audio_file": audio_base64,                    # Base64-encoded audio data
    "language": "hi",                              # Hindi audio
    "task": "translate",                           # Translate to English
    "vad_model": "silero",                         # Use Silero VAD
    "word_timestamps": True,                       # Include word timestamps
    "without_timestamps": False,                   # Keep timestamps in text
    "diarization": True,                          # Identify speakers
    "vad_onset": 0.5,                             # Speech detection threshold
    "vad_offset": 0.3,                            # Speech end threshold
    "strict_hallucination_reduction": True         # Reduce false content
}

response = requests.post(
    "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict",
    json=payload,
    headers=headers
)

Option 2: Audio URL

import requests

# API request with URL
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}
payload = {
    "audio_file": "https://example.com/audio.mp3", # Publicly accessible audio URL
    "language": "en",                              # English audio
    "task": "transcribe",                          # Transcribe in same language
    "vad_model": "silero",                         # Use Silero VAD
    "word_timestamps": True,                       # Include word timestamps
    "without_timestamps": False,                   # Keep timestamps in text
    "diarization": False,                         # Single speaker
    "vad_onset": 0.5,                             # Speech detection threshold
    "vad_offset": 0.3,                            # Speech end threshold
    "strict_hallucination_reduction": True         # Reduce false content
}

response = requests.post(
    "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict",
    json=payload,
    headers=headers
)
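
With either option, check the status code before reading the body; a minimal sketch that parses the JSON described under Response Format:

# Raise on 4xx/5xx, then parse the JSON body.
response.raise_for_status()
result = response.json()
print(result["transcription"][0]["text"])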

Response Format

Successful Response (200)

{
    "duration": [
        3.02
    ],
    "transcription": [
        {
            "text": "Hello World",
            "start": 0.0,
            "end": 3.02
        }
    ],
    "word_timestamps": [
        {
            "word": " Hello",
            "start": 0.0,
            "end": 0.34,
            "probability": 0.92
        },
        {
            "word": " World",
            "start": 0.34,
            "end": 0.6,
            "probability": 0.92
        }
    ],
    "diarization": [
        {
            "start": 0.0,
            "end": 0.34,
            "text": "Hello World",
            "speaker": 1
        }
    ],
    "info": {
        "language": "en",
        "probability": 1
    },
    "metrics": {
        "audio_loading_preprocessing": 0.009385824203491211,
        "audio_chunk": 0.014404058456420898,
        "audio_prediction": 0.08741092681884766,
        "audio_word_processing": 0.07052230834960938,
        "audio_diarization": 0.07052850723266602,
        "hallucination_reduction": 0.0004951953887939453
    }
}
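
The fields above can be consumed directly. A minimal sketch that prints diarized segments and word-level detail, assuming result holds the parsed body from a successful call:

# result = response.json()  # parsed body from a successful (200) call

# Speaker-attributed segments (requires diarization=true in the request)
for segment in result["diarization"]:
    print(f"Speaker {segment['speaker']} "
          f"[{segment['start']:.2f}s-{segment['end']:.2f}s]: {segment['text']}")

# Word-level detail (requires word_timestamps=true in the request)
for w in result["word_timestamps"]:
    print(f"{w['word'].strip()}: {w['start']:.2f}s (p={w['probability']:.2f})")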

Error Response (4xx)

{
  "error": "Invalid or expired authentication token",
  "code": "401"
} 

Error Response (5xx)

{
  "error": "Internal server error",
  "code": "500"
}

Error Codes

  • 400: Invalid request format or parameters
  • 401: Authentication failed
  • 413: File too large
  • 422: Unsupported audio format
  • 500: Internal server error
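
A minimal sketch of handling these codes on the client; the retry-on-5xx behavior is an assumption about transient failures, not an API guarantee:

import time
import requests

def transcribe_with_retry(url, payload, headers, attempts=3):
    # Retry only on 5xx responses; a 4xx means the request itself needs fixing.
    for attempt in range(attempts):
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code < 500:
            break
        time.sleep(2 ** attempt)  # simple exponential backoff
    if response.status_code == 401:
        raise RuntimeError("Authentication failed: check your Bearer token")
    if response.status_code == 413:
        raise RuntimeError("File too large: try a shorter or compressed file")
    response.raise_for_status()
    return response.json()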