Overview

This API provides speech-to-text transcription and translation services using OpenAI’s Whisper V3 model with advanced features like voice activity detection (VAD), speaker diarization, and hallucination reduction.

Endpoint

POST "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict"

Authentication

Include a Bearer token in the request headers:
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}

Request Parameters

Required Parameters

  • audio_file (string): Audio input - either Base64-encoded audio file data or a publicly accessible audio URL.

Core Processing Parameters

  • language (string): Source language code, e.g. “hi” for Hindi or “en” for English; pass None for auto-detection.
  • task (string): Processing task - transcribe (speech-to-text in same language) or translate (translate to English)
    For the full list of valid language codes, see the supported languages in the Whisper model.

Voice Activity Detection (VAD) Parameters

  • vad_model (string): VAD model to use - “silero” (recommended for accuracy and speed)
  • vad_onset (float): Threshold for speech start detection (0.0-1.0, default: 0.5)
  • vad_offset (float): Threshold for speech end detection (0.0-1.0, default: 0.3)

Need help with VAD parameter tuning or Whisper troubleshooting? Check our detailed guides on VAD tuning and Whisper troubleshooting.
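
If the defaults clip speech or admit noise, the thresholds can be tuned per request. A minimal sketch with illustrative values (starting points, not recommendations):

# Illustrative VAD overrides; tune against your own audio.
vad_overrides = {
    "vad_model": "silero",
    "vad_onset": 0.6,    # speech-start threshold (default: 0.5)
    "vad_offset": 0.35,  # speech-end threshold (default: 0.3)
}
# Merge into the request payload shown under Example Request:
# payload.update(vad_overrides)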

Timestamp Parameters

  • word_timestamps (boolean): Enable word-level timestamps in output (true/false).
  • without_timestamps (boolean): Exclude timestamps from the transcription text, so only text tokens are sampled from the Whisper model (true/false).

Speaker Identification

  • diarization (boolean): Enable speaker diarization to attribute segments to different speakers (true/false)

Hallucination Reduction

  • strict_hallucination_reduction (boolean): Apply post-processing filters to remove repeated phrases in the transcription (true/false)

Audio Format Support

  • MP3, WAV, M4A, AAC
  • URLs must be publicly accessible or pre-signed.
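
As a light client-side check, you can verify the file extension against the supported set before encoding or uploading; a minimal sketch (the helper is illustrative, not part of the API):

from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac"}

def is_supported_audio(path: str) -> bool:
    # Extension check only; the server still validates the actual encoding
    # and returns 422 for unsupported formats.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS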

Example Request

Option 1: Base64-encoded audio file

import requests
import base64

# Read and encode audio file
with open("audio_file.mp3", "rb") as f:
    audio_data = f.read()
    audio_base64 = base64.b64encode(audio_data).decode("utf-8")

# API request with base64
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}
payload = {
    "audio_file": audio_base64,                    # Base64-encoded audio data
    "language": "hi",                              # Hindi audio
    "task": "translate",                           # Translate to English
    "vad_model": "silero",                         # Use Silero VAD
    "word_timestamps": True,                       # Include word timestamps
    "without_timestamps": False,                   # Keep timestamps in text
    "diarization": True,                          # Identify speakers
    "vad_onset": 0.5,                             # Speech detection threshold
    "vad_offset": 0.3,                            # Speech end threshold
    "strict_hallucination_reduction": True         # Reduce false content
}

response = requests.post(
    "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict",
    json=payload,
    headers=headers
)

Option 2: Audio URL

import requests

# API request with URL
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}
payload = {
    "audio_file": "https://example.com/audio.mp3", # Publicly accessible audio URL
    "language": "en",                              # English audio
    "task": "transcribe",                          # Transcribe in same language
    "vad_model": "silero",                         # Use Silero VAD
    "word_timestamps": True,                       # Include word timestamps
    "without_timestamps": False,                   # Keep timestamps in text
    "diarization": False,                         # Single speaker
    "vad_onset": 0.5,                             # Speech detection threshold
    "vad_offset": 0.3,                            # Speech end threshold
    "strict_hallucination_reduction": True         # Reduce false content
}

response = requests.post(
    "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict",
    json=payload,
    headers=headers
)
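
With either option, check the status code before reading the body; a minimal sketch that parses the JSON described under Response Format:

# Raise on 4xx/5xx, then parse the JSON body.
response.raise_for_status()
result = response.json()
print(result["transcription"][0]["text"])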

Response Format

Successful Response (200)

{
    "duration": [
        3.02
    ],
    "transcription": [
        {
            "text": "Hello World",
            "start": 0.0,
            "end": 3.02
        }
    ],
    "word_timestamps": [
        {
            "word": " Hello",
            "start": 0.0,
            "end": 0.34,
            "probability": 0.92
        },
        {
            "word": " World",
            "start": 0.34,
            "end": 0.6,
            "probability": 0.92
        }
    ],
    "diarization": [
        {
            "start": 0.0,
            "end": 0.34,
            "text": "Hello World",
            "speaker": 1
        }
    ],
    "info": {
        "language": "en",
        "probability": 1
    },
    "metrics": {
        "audio_loading_preprocessing": 0.009385824203491211,
        "audio_chunk": 0.014404058456420898,
        "audio_prediction": 0.08741092681884766,
        "audio_word_processing": 0.07052230834960938,
        "audio_diarization": 0.07052850723266602,
        "hallucination_reduction": 0.0004951953887939453
    }
}
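
The fields above can be consumed directly. A minimal sketch that prints diarized segments and word-level detail, assuming result holds the parsed body from a successful call:

# result = response.json()  # parsed body from a successful (200) call

# Speaker-attributed segments (requires diarization=true in the request)
for segment in result["diarization"]:
    print(f"Speaker {segment['speaker']} "
          f"[{segment['start']:.2f}s-{segment['end']:.2f}s]: {segment['text']}")

# Word-level detail (requires word_timestamps=true in the request)
for w in result["word_timestamps"]:
    print(f"{w['word'].strip()}: {w['start']:.2f}s (p={w['probability']:.2f})")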

Error Response (4xx)

{
  "error": "Invalid or expired authentication token",
  "code": "401"
} 

Error Response (5xx)

{
  "error": "Internal server error",
  "code": "500"
}

Error Codes

  • 400: Invalid request format or parameters
  • 401: Authentication failed
  • 413: File too large
  • 422: Unsupported audio format
  • 500: Internal server error
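
A minimal sketch of handling these codes on the client; the retry-on-5xx behavior is an assumption about transient failures, not an API guarantee:

import time
import requests

def transcribe_with_retry(url, payload, headers, attempts=3):
    # Retry only on 5xx responses; a 4xx means the request itself needs fixing.
    for attempt in range(attempts):
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code < 500:
            break
        time.sleep(2 ** attempt)  # simple exponential backoff
    if response.status_code == 401:
        raise RuntimeError("Authentication failed: check your Bearer token")
    if response.status_code == 413:
        raise RuntimeError("File too large: try a shorter or compressed file")
    response.raise_for_status()
    return response.json()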