Frequently Asked Questions
Language Issues
Q: Why is the language detected incorrectly or the transcription quality poor for some languages?
A: This can happen in two scenarios:
- If you are specifying a language and it is incorrect, pass an empty string for the Language parameter to allow the model to auto-detect the language.
- If you are already passing an empty string and the issue persists, specify the correct language explicitly.
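As an illustration, here is a minimal sketch of both options. The endpoint URL, the audio_url field, and the overall payload shape are assumptions for this example; only the Language parameter comes from the API described above.

```python
import requests

# Hypothetical endpoint and payload shape; adjust to your deployment.
URL = "https://your-host/v1/transcribe"

payload = {
    "audio_url": "https://example.com/audio.wav",  # placeholder input
    "Language": "",  # empty string lets the model auto-detect the language
}
response = requests.post(URL, json=payload, timeout=300)
print(response.json())

# If auto-detection keeps picking the wrong language, pin it explicitly instead:
payload["Language"] = "hi"  # e.g., Hindi
```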
Q: Why is the transcription inconsistent when there are multiple languages spoken in the audio?
A: Enable the multilingual flag, but be aware that initial segments may have inconsistencies. Set the Language parameter to an empty string and include "Multilingual": True in the payload.
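A sketch of such a payload, reusing the hypothetical endpoint and audio_url field from the previous example; only Language and Multilingual come from the answer above:

```python
payload = {
    "audio_url": "https://example.com/mixed_language_call.wav",  # placeholder
    "Language": "",        # do not force a single language
    "Multilingual": True,  # allow the model to switch languages between segments
}
```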
Q: Why is the model not detecting the language correctly, resulting in transcription in the wrong language?
A: The model may sometimes struggle with automatic language detection. To improve recognition, add an initial prompt in the expected language, such as "यह बातचीत हिंदी में है" ("This conversation is in Hindi") for Hindi or "ಈ ಸಂಭಾಷಣೆ ಕನ್ನಡದಲ್ಲಿದೆ" ("This conversation is in Kannada") for Kannada.
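For example, assuming the prompt is passed as an initial_prompt field (the snake_case name is an assumption; check your API reference):

```python
payload = {
    "audio_url": "https://example.com/hindi_meeting.wav",  # placeholder
    "Language": "",  # auto-detect, nudged by the prompt below
    # A short prompt written in the expected language biases detection toward it:
    "initial_prompt": "यह बातचीत हिंदी में है",
}
```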
Diarization Issues
Q: Why is the diarization not matching with transcription timestamps or showing incorrect timestamps?
A: Diarization is handled by a separate model that relies on transcription results. This can lead to cases where two utterances from one speaker are combined into a single segment, affecting timestamps. Additionally, if the without_timestamps parameter is incorrectly set, it may cause misalignment. Ensure without_timestamps is set to False, resend the request, and check whether the diarization timestamps align correctly. Note that OpenAI models do not support diarization directly.
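A sketch of a diarization request with timestamps kept on; the diarization flag name is hypothetical, while without_timestamps is the parameter discussed above:

```python
payload = {
    "audio_url": "https://example.com/interview.wav",  # placeholder
    "without_timestamps": False,  # keep segment timestamps so diarization can align
    "diarization": True,          # hypothetical flag name for enabling diarization
}
```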
Q: Why are incorrect speaker IDs assigned during diarization?
A: Set the max speakers and min speakers parameters to the actual number of speakers in the audio. Providing arbitrary values can lead to incorrect speaker assignments.
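For instance, for a call known to have exactly two speakers (the snake_case min_speakers/max_speakers names are assumptions):

```python
payload = {
    "audio_url": "https://example.com/two_person_call.wav",  # placeholder
    "diarization": True,  # hypothetical flag name
    "min_speakers": 2,    # both bounds set to the known speaker count
    "max_speakers": 2,
}
```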
Hallucination Issues
Q: What causes repeated text or hallucinations in transcriptions?
A: Hallucinations can occur due to an initial prompt influencing the output, noisy audio, incorrect language selection, or Voice Activity Detection (VAD) settings. To reduce hallucinations, try:
- Removing or refining the initial prompt.
- Improving audio quality by reducing background noise.
- Ensuring the correct language is selected.
- Tuning VAD parameters for better segmentation.
For details on VAD parameters, refer to the VAD Parameter Tuning documentation.
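As a hypothetical before/after illustrating the first three points (field names as in the earlier sketches):

```python
# Before: auto-detection plus a long steering prompt was producing repeated text.
payload = {
    "audio_url": "https://example.com/noisy_recording.wav",  # placeholder
    "Language": "",
    "initial_prompt": "Transcript of a long technical support call about ...",
}

# After: pin the correct language and drop the prompt; tune VAD separately
# (see the VAD sketch under Missing Transcriptions below).
payload = {
    "audio_url": "https://example.com/noisy_recording.wav",
    "Language": "en",
}
```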
Incorrect Transcriptions
Q: Why is the model output different from the expected format (translation vs. transcription)?
A: Set the task to translate for English output, or to transcribe to get the text in the original language.
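For example (payload shape as in the earlier sketches; the task values come from the answer above):

```python
payload = {
    "audio_url": "https://example.com/french_podcast.wav",  # placeholder
    "task": "transcribe",  # French audio -> French text
}

# To get English text from the same French audio, switch the task:
payload["task"] = "translate"
```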
Q: Why are transcriptions incorrect on the Streaming Endpoint?
A: For streaming, this may happen due to a mismatched audio format, sampling rate, or chunk size. Ensure the source audio is raw PCM rather than a container format such as WAV or MP3, and that it matches the sampling rate and chunk size the endpoint expects.
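A minimal sketch of sending raw PCM in fixed-size chunks; the endpoint URL, header, and exact chunk size are assumptions, and the sample rate must match what your streaming endpoint expects:

```python
import requests

SAMPLE_RATE = 16000     # assumed expected rate; confirm with your endpoint
BYTES_PER_SAMPLE = 2    # 16-bit little-endian PCM
CHUNK_SECONDS = 0.5
CHUNK_SIZE = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)

def pcm_chunks(path):
    """Yield fixed-size chunks from a headerless raw PCM file."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield chunk

# A raw PCM file can be produced with, e.g.:
#   ffmpeg -i in.wav -f s16le -ar 16000 -ac 1 audio.pcm
# requests streams a generator as a chunked request body.
response = requests.post(
    "https://your-host/v1/stream",  # hypothetical streaming endpoint
    data=pcm_chunks("audio.pcm"),
    headers={"Content-Type": "application/octet-stream"},
)
print(response.json())
```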
Missing Transcriptions
Q: Why are some parts of the transcription missing?
A: The issue could be due to incorrect VAD settings. Additionally, if the audio has long pauses, adjust the VAD onset to compensate for initial silence and ensure the speech is transcribed.
Q: How do VAD settings affect transcription, and how can I improve missing transcriptions?
A: Improper VAD settings can degrade transcription quality. Adjust the onset and offset values and fine-tune them based on audio quality; different VAD settings may be needed depending on the type and noise level of the audio.
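A hedged sketch of passing VAD settings; the vad_parameters nesting is an assumption, and onset/offset are the settings named above (values are illustrative starting points, not recommendations):

```python
payload = {
    "audio_url": "https://example.com/long_pauses.wav",  # placeholder
    "vad_parameters": {   # hypothetical nesting; check your API reference
        "onset": 0.3,     # example value; adjust when speech after silence is missed
        "offset": 0.25,   # tune alongside onset when segments are cut off early
    },
}
```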