The Whisper playground is designed for audio processing and transcription. You can:

  • Upload Audio Files: Test the model by uploading audio files and receiving transcriptions.
  • Set Advanced Parameters: Configure settings such as initial prompts, the number of speakers, beam size, audio sample rate, and more to fine-tune transcription accuracy.
  • Process Audio in Real Time: Run real-time audio processing and transcription to evaluate performance in various scenarios.
  • Evaluate Results: Review and analyze transcriptions to ensure they meet the desired accuracy and quality.


Settings explained

language: Language spoken in the audio. Specify None to perform automatic language detection.

task: Determines whether the Whisper model transcribes the audio in its original language or translates it.

initial prompt: Optional text given to the model as starting context. Useful for supplying custom vocabulary or proper nouns so the model is more likely to predict those words correctly.

best of: Specifies how many candidate decodings to generate and choose the best from. Higher values can improve quality at the cost of more computation.

no of speakers: The number of speakers in the audio, used to separate the dialogue by speaker.

diarization: Assigns a speaker label to each part of the transcript.

word timestamps: Indicates whether word-level timestamps should be included in the output.

without timestamps: Option to exclude timestamps from the output.

beam size: Controls the breadth of the search in beam search decoding. Larger values can improve accuracy but increase computation.

length penalty: A factor that adjusts candidate scores according to output length during decoding, which helps control output length.

batch size: The number of audio samples processed together in one batch.

patience: Patience factor for beam search decoding. Values above 1.0 let the search run longer before it stops, and 1.0 corresponds to conventional beam search.

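minimum duration on: Minimum duration of speech for it to count as an active segment. Shorter speech regions are discarded.

minimum duration off: Minimum duration of silence for it to count as a break. Shorter gaps are treated as part of the surrounding speech.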

maximum duration: The longest duration of speech to process in one go, which prevents excessive processing time.

maximum speakers: The maximum number of speakers expected in the audio.

minimum speakers: The minimum number of speakers expected in the audio.

vad onset: Sensitivity for detecting the start of speech.

vad offset: Sensitivity for detecting the end of speech.

pad onset: Additional padding time added to the start of detected speech.

pad offset: Additional padding time added to the end of detected speech.
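
To make the interaction of the VAD-related settings concrete, the following is a minimal sketch of how vad onset, vad offset, pad onset, pad offset, minimum duration on, and minimum duration off could combine to turn per-frame speech probabilities into speech segments. It is an illustration under stated assumptions, not the playground's actual implementation: the function name speech_segments, the frame_s argument, the order of the steps, and all threshold values are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def speech_segments(
    probs: List[float],       # per-frame speech probabilities (hypothetical input)
    frame_s: float,           # duration of one frame in seconds (assumed)
    vad_onset: float,         # speech starts when probability rises to this value
    vad_offset: float,        # speech ends when probability drops below this value
    pad_onset: float = 0.0,
    pad_offset: float = 0.0,
    min_duration_on: float = 0.0,
    min_duration_off: float = 0.0,
) -> List[Segment]:
    total = len(probs) * frame_s

    # 1. Hysteresis thresholding: separate thresholds for start and end.
    raw: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        t = i * frame_s
        if start is None and p >= vad_onset:
            start = t
        elif start is not None and p < vad_offset:
            raw.append((start, t))
            start = None
    if start is not None:
        raw.append((start, total))

    # 2. Padding: extend each segment, clamped to the audio bounds.
    padded = [(max(0.0, s - pad_onset), min(total, e + pad_offset)) for s, e in raw]

    # 3. Merge segments separated by a gap no longer than min_duration_off
    #    (this also merges segments that touch or overlap after padding).
    merged: List[Segment] = []
    for s, e in padded:
        if merged and s - merged[-1][1] <= min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))

    # 4. Drop segments shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


if __name__ == "__main__":
    demo = [0.1, 0.9, 0.8, 0.4, 0.2, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.7, 0.7, 0.7, 0.1]
    print(speech_segments(demo, 0.25, vad_onset=0.6, vad_offset=0.3, min_duration_on=0.5))
```

In this sketch, lowering vad onset makes speech start more readily, lowering vad offset keeps a segment open through quieter frames, and the two minimum durations clean up short blips and short pauses after thresholding.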


See the Whisper model API documentation for endpoints, parameters, and code examples.

Need help with VAD parameter tuning or Whisper troubleshooting? Check our detailed guides on VAD tuning and Whisper troubleshooting.


Supported Languages with their Codes

LANGUAGES = {
    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "he": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",
    "yue": "cantonese",
}
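
As a usage sketch for the table above, the helper below turns user input (a code such as "en" or a name such as "english") into a language code, and returns None to request automatic language detection. The function name resolve_language and the excerpt dictionary are hypothetical and exist only to keep the example self-contained; in practice the full LANGUAGES table would be passed in.

```python
from typing import Dict, Optional

# Small excerpt of the table, used only to keep this example self-contained.
LANGUAGES_EXCERPT: Dict[str, str] = {
    "en": "english",
    "zh": "chinese",
    "ht": "haitian creole",
    "yue": "cantonese",
}


def resolve_language(value: Optional[str], languages: Dict[str, str]) -> Optional[str]:
    """Return a language code, or None to request automatic detection."""
    if value is None:
        return None  # None means "detect the language"

    key = value.strip().lower()

    # Accept a code directly, e.g. "en" or "yue".
    if key in languages:
        return key

    # Otherwise try to match a language name, e.g. "english".
    name_to_code = {name: code for code, name in languages.items()}
    if key in name_to_code:
        return name_to_code[key]

    raise ValueError(f"Unsupported language: {value!r}")


if __name__ == "__main__":
    print(resolve_language("Cantonese", LANGUAGES_EXCERPT))
    print(resolve_language(None, LANGUAGES_EXCERPT))
```

Codes are checked before names, so an input that is a valid code is always treated as a code.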