Overview
This API provides speech-to-text transcription and translation using OpenAI’s Whisper V3 model, with support for voice activity detection (VAD), speaker diarization, and hallucination reduction.
Endpoint
POST "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict"
Authentication
Include a Bearer token in the request headers:
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}
Request Parameters
Required Parameters
audio_file (string): Audio input - either Base64-encoded audio file data or publicly accessible audio URL.
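Because audio_file accepts either Base64 data or a URL, a small helper can choose the right form for a given input. The following is a minimal sketch, not part of the API: the function name prepare_audio_input is hypothetical, and it assumes that any string starting with http:// or https:// is a URL and that anything else is a local file path.

```python
import base64
from pathlib import Path


def prepare_audio_input(source: str) -> str:
    """Return a value for the `audio_file` field (hypothetical helper).

    Assumption: strings starting with http:// or https:// are URLs and are
    passed through unchanged; everything else is treated as a local path.
    """
    if source.startswith(("http://", "https://")):
        return source
    # Read the raw bytes and encode them as Base64 text
    return base64.b64encode(Path(source).read_bytes()).decode("utf-8")
```

The returned string can be placed directly in the audio_file field of the request payload.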
Core Processing Parameters
- language (string): Source language code, e.g. "hi" for Hindi or "en" for English. Pass None to auto-detect the language.
- task (string): Processing task. Use transcribe for speech-to-text in the source language, or translate to translate the speech into English.
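The two core parameters can be checked on the client before a request is sent. This is a minimal sketch under stated assumptions: build_core_params is a hypothetical helper, and it assumes that a Python None (serialized as JSON null) is how auto-detection is requested, as the language description suggests.

```python
from typing import Optional

# The two task values described above
VALID_TASKS = {"transcribe", "translate"}


def build_core_params(task: str, language: Optional[str] = None) -> dict:
    """Build the core processing fields of the payload (hypothetical helper)."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    # Assumption: language=None asks the service to auto-detect the language
    return {"task": task, "language": language}
```

The resulting dictionary can be merged into the full payload, for example with payload.update(build_core_params("translate", "hi")).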
Voice Activity Detection (VAD) Parameters
- vad_model (string): VAD model to use. "silero" is recommended for accuracy and speed.
- vad_onset (float): Threshold for detecting the start of speech (0.0-1.0, default: 0.5).
- vad_offset (float): Threshold for detecting the end of speech (0.0-1.0, default: 0.3).
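Since both thresholds are limited to the 0.0-1.0 range, it can help to validate them locally before calling the endpoint. The sketch below uses the documented defaults; the helper name build_vad_params is hypothetical, and the API may perform its own validation as well.

```python
def build_vad_params(
    vad_onset: float = 0.5,
    vad_offset: float = 0.3,
    vad_model: str = "silero",
) -> dict:
    """Build the VAD fields of the payload (hypothetical helper)."""
    # Reject thresholds outside the documented 0.0-1.0 range
    for name, value in (("vad_onset", vad_onset), ("vad_offset", vad_offset)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return {"vad_model": vad_model, "vad_onset": vad_onset, "vad_offset": vad_offset}
```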
Timestamp Parameters
- word_timestamps (boolean): Enable word-level timestamps in the output (true/false).
- without_timestamps (boolean): Exclude timestamps from the transcription text, so that only text tokens are sampled from the Whisper model (true/false).
Speaker Identification
diarization (boolean): Enable speaker diarization to identify different speakers (true/false).
Hallucination Reduction
strict_hallucination_reduction (boolean): Apply post-processing filters that remove repeated phrases from the transcription (true/false).
Supported Audio Formats
- MP3, WAV, M4A, AAC
- URLs must be publicly accessible or pre-signed.
Example Request
Option 1: Base64-encoded audio file
import base64
import requests
# Read the audio file and encode it as Base64 text
with open("audio_file.mp3", "rb") as f:
    audio_data = f.read()
audio_base64 = base64.b64encode(audio_data).decode("utf-8")
# API request with Base64-encoded audio
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}
payload = {
    "audio_file": audio_base64,              # Base64-encoded audio data
    "language": "hi",                        # Hindi audio
    "task": "translate",                     # Translate to English
    "vad_model": "silero",                   # Use Silero VAD
    "word_timestamps": True,                 # Include word timestamps
    "without_timestamps": False,             # Keep timestamps in text
    "diarization": True,                     # Identify speakers
    "vad_onset": 0.5,                        # Speech start threshold
    "vad_offset": 0.3,                       # Speech end threshold
    "strict_hallucination_reduction": True,  # Filter repeated phrases
}
response = requests.post(
    "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict",
    json=payload,
    headers=headers,
)
print(response.json())  # Parsed JSON response body
Option 2: Audio URL
import requests
# API request with an audio URL
headers = {"Authorization": "Bearer YOUR_JWT_TOKEN"}
payload = {
    "audio_file": "https://example.com/audio.mp3",  # Publicly accessible or pre-signed URL
    "language": "en",                               # English audio
    "task": "transcribe",                           # Transcribe in the same language
    "vad_model": "silero",                          # Use Silero VAD
    "word_timestamps": True,                        # Include word timestamps
    "without_timestamps": False,                    # Keep timestamps in text
    "diarization": False,                           # Skip speaker diarization
    "vad_onset": 0.5,                               # Speech start threshold
    "vad_offset": 0.3,                              # Speech end threshold
    "strict_hallucination_reduction": True,         # Filter repeated phrases
}
response = requests.post(
    "<YOUR_PRIVATE_WHISPER_DEPLOYMENT>/predict",
    json=payload,
    headers=headers,
)
print(response.json())  # Parsed JSON response body
Successful Response (200)
{
  "duration": [
    3.02
  ],
  "transcription": [
    {
      "text": "Hello World",
      "start": 0.0,
      "end": 3.02
    }
  ],
  "word_timestamps": [
    {
      "word": " Hello",
      "start": 0.0,
      "end": 0.34,
      "probability": 0.92
    },
    {
      "word": " World",
      "start": 0.34,
      "end": 0.6,
      "probability": 0.92
    }
  ],
  "diarization": [
    {
      "start": 0.0,
      "end": 0.34,
      "text": "Hello World",
      "speaker": 1
    }
  ],
  "info": {
    "language": "en",
    "probability": 1
  },
  "metrics": {
    "audio_loading_preprocessing": 0.009385824203491211,
    "audio_chunk": 0.014404058456420898,
    "audio_prediction": 0.08741092681884766,
    "audio_word_processing": 0.07052230834960938,
    "audio_diarization": 0.07052850723266602,
    "hallucination_reduction": 0.0004951953887939453
  }
}
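A response of this shape can be turned into readable output with a few short helpers. The sketch below is illustrative only: the function names are hypothetical, and it assumes the field names shown in the sample response (transcription, word_timestamps, diarization) are present whenever the corresponding features are enabled.

```python
def format_segments(result: dict) -> list:
    """Render each transcription segment as '[start - end] text'."""
    return [
        f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}"
        for seg in result.get("transcription", [])
    ]


def low_confidence_words(result: dict, threshold: float = 0.5) -> list:
    """Return words whose probability is below the given threshold."""
    return [
        w["word"].strip()  # Words carry a leading space in the sample response
        for w in result.get("word_timestamps", [])
        if w["probability"] < threshold
    ]


def speaker_turns(result: dict) -> list:
    """Return (speaker, text) pairs from the diarization output."""
    return [(d["speaker"], d["text"]) for d in result.get("diarization", [])]
```

In practice, result would be the dictionary returned by response.json() for a 200 response.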
Error Response (4xx)
{
  "error": "Invalid or expired authentication token",
  "code": "401"
}
Error Response (5xx)
{
  "error": "Internal server error",
  "code": "500"
}
Error Codes
- 400: Invalid request format or parameters
- 401: Authentication failed
- 413: File too large
- 422: Unsupported audio format
- 500: Internal server error
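The status codes above can be mapped to readable messages on the client side. This is a minimal sketch, assuming error bodies follow the {"error": ..., "code": ...} shape shown in the error response examples; describe_error is a hypothetical helper, not part of the API.

```python
from typing import Optional

# Meanings taken from the error code list above
ERROR_MEANINGS = {
    400: "Invalid request format or parameters",
    401: "Authentication failed",
    413: "File too large",
    422: "Unsupported audio format",
    500: "Internal server error",
}


def describe_error(status_code: int, body: Optional[dict] = None) -> str:
    """Combine the documented meaning with the server's error message, if any."""
    meaning = ERROR_MEANINGS.get(status_code, "Unknown error")
    detail = (body or {}).get("error")
    if detail:
        return f"{status_code} {meaning}: {detail}"
    return f"{status_code} {meaning}"
```

With the requests library, this could be called as describe_error(response.status_code, response.json()) whenever response.ok is False, provided the error body is valid JSON.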