vllm.config.speech_to_text
SpeechToTextConfig
Configuration for speech-to-text models.
Source code in vllm/config/speech_to_text.py
max_audio_clip_s class-attribute instance-attribute
max_audio_clip_s: int = 30
Maximum duration, in seconds, of a single audio clip that can be processed without chunking. Audio longer than this is split into smaller chunks if allow_audio_chunking evaluates to True; otherwise it is rejected.
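The rule described above can be sketched as a small decision function. This is only an illustration: `plan_clip` is a hypothetical helper, not part of vLLM, and `allow_audio_chunking` is passed in here as a plain boolean rather than read from a config object.

```python
def plan_clip(
    duration_s: float,
    max_audio_clip_s: int = 30,
    allow_audio_chunking: bool = True,
) -> str:
    """Hypothetical helper: decide how a clip of `duration_s` seconds is handled.

    Returns "single" if the clip fits within the limit, "chunk" if it is too
    long but chunking is allowed, and raises ValueError otherwise.
    """
    # Clips at or below the limit are processed in one piece.
    if duration_s <= max_audio_clip_s:
        return "single"
    # Longer clips are split only when chunking is enabled.
    if allow_audio_chunking:
        return "chunk"
    # Otherwise the clip is rejected.
    raise ValueError(
        f"Audio is {duration_s:.1f}s long, which exceeds the "
        f"{max_audio_clip_s}s limit, and chunking is disabled."
    )
```

With the default limit of 30 seconds, a 10-second clip is handled as a single clip, while a 45-second clip is either chunked or rejected depending on the flag.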
min_energy_split_window_size class-attribute instance-attribute
Window size, in samples, used to find low-energy (quiet) regions at which to split audio into chunks. The algorithm looks for the quietest moment within this window to minimize cutting through speech. The default of 1600 samples corresponds to about 100 ms at 16 kHz. If None, no chunking is done.
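A minimal sketch of the idea, not the library's actual implementation: scan a search region in steps of the window size, measure the energy of each window, and split at the quietest one. The function name `find_quietest_split`, the use of RMS as the energy measure, and the non-overlapping step are all assumptions made for illustration.

```python
import math


def find_quietest_split(
    samples: list[float],
    search_start: int,
    search_end: int,
    window_size: int = 1600,  # 1600 samples / 16000 Hz = 0.1 s
) -> int:
    """Hypothetical helper: return the start index of the lowest-energy
    window inside samples[search_start:search_end]."""
    best_index = search_start
    best_energy = math.inf
    # Last position at which a full window still fits in the search region.
    last_start = max(search_start, search_end - window_size)
    for start in range(search_start, last_start + 1, window_size):
        window = samples[start:min(start + window_size, search_end)]
        if not window:
            break
        # Root-mean-square energy of the window (assumed energy measure).
        energy = math.sqrt(sum(x * x for x in window) / len(window))
        if energy < best_energy:
            best_energy = energy
            best_index = start
    return best_index


# Usage: loud audio with a 100 ms silent gap starting at sample 3200.
audio = [1.0] * 3200 + [0.0] * 1600 + [1.0] * 3200
split_at = find_quietest_split(audio, 0, len(audio))
```

In this example the silent gap is the lowest-energy window, so the split point falls at its start rather than in the middle of the louder audio.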