vllm.model_executor.models

Modules:

Name Description
adapters
aimv2
apertus

Inference-only Apertus model compatible with HuggingFace weights.

arcee
arctic

Inference-only Snowflake Arctic model.

aria
aya_vision
baichuan

Inference-only BaiChuan model compatible with HuggingFace weights.

bailing_moe

Inference-only BailingMoE model compatible with HuggingFace weights.

bamba

Inference-only Bamba model.

bert
bert_with_rope
blip

Minimal implementation of BlipVisionModel, intended to be used only within a vision-language model.

blip2
bloom

Inference-only BLOOM model compatible with HuggingFace weights.

chameleon
chatglm

Inference-only ChatGLM model compatible with THUDM weights.

clip

Minimal implementation of CLIPVisionModel, intended to be used only within a vision-language model.

cohere2_vision

Command-A-Vision (Cohere2Vision) multimodal model implementation for vLLM.

commandr

PyTorch Cohere model.

config
dbrx
deepseek

Inference-only Deepseek model.

deepseek_eagle
deepseek_mtp
deepseek_v2

Inference-only DeepseekV2/DeepseekV3 model.

deepseek_vl2

Inference-only Deepseek-VL2 model compatible with HuggingFace weights.

dots1

Inference-only dots1 model.

dots_ocr
ernie45

Inference-only Ernie model compatible with HuggingFace weights.

ernie45_moe

Inference-only ErnieMoE model compatible with HuggingFace weights.

ernie45_vl

Inference-only Ernie VL model compatible with HuggingFace weights.

ernie45_vl_moe

Inference-only Ernie VL model compatible with HuggingFace weights.

ernie_mtp

Inference-only Ernie-MTP model.

exaone

Inference-only Exaone model compatible with HuggingFace weights.

exaone4

Inference-only Exaone model compatible with HuggingFace weights.

fairseq2_llama

Llama model for fairseq2 weights.

falcon

PyTorch Falcon model.

falcon_h1

Inference-only FalconH1 model.

fuyu

PyTorch Fuyu model.

gemma

Inference-only Gemma model compatible with HuggingFace weights.

gemma2
gemma3
gemma3_mm
gemma3n
gemma3n_mm
glm

Inference-only HF format GLM-4 model compatible with THUDM weights.

glm4

Inference-only GLM-4-0414 model compatible with HuggingFace weights.

glm4_1v

Inference-only GLM-4V model compatible with HuggingFace weights.

glm4_moe

Inference-only GLM-4.5, GLM-4.6 model compatible with HuggingFace weights.

glm4_moe_mtp

Inference-only GLM-4.5 MTP model compatible with HuggingFace weights.

glm4v

Inference-only CogAgent model compatible with THUDM weights.

gpt2

Inference-only GPT-2 model compatible with HuggingFace weights.

gpt_bigcode

Inference-only GPTBigCode model compatible with HuggingFace weights.

gpt_j

Inference-only GPT-J model compatible with HuggingFace weights.

gpt_neox

Inference-only GPT-NeoX model compatible with HuggingFace weights.

gpt_oss
granite

Inference-only IBM Granite model compatible with HuggingFace weights.

granite_speech

Inference-only IBM Granite speech model.

granitemoe

Inference-only GraniteMoe model.

granitemoehybrid

Inference-only GraniteMoeHybrid model.

granitemoeshared

Inference-only GraniteMoeShared model.

gritlm
grok1

Inference-only Grok1 model.

h2ovl
hunyuan_v1

Inference-only HunYuan model compatible with HuggingFace weights.

hyperclovax_vision
idefics2_vision_model

PyTorch Idefics2 model.

idefics3

Inference-only Idefics3 model compatible with HuggingFace weights.

interfaces
interfaces_base
intern_vit
internlm2
internlm2_ve
interns1
interns1_vit
internvl
jais

Inference-only Jais model compatible with HuggingFace weights.

jamba

Inference-only Jamba model.

jina_vl
keye
keye_vl1_5
kimi_vl
lfm2
llama

Inference-only LLaMA model compatible with HuggingFace weights.

llama4

Inference-only LLaMA model compatible with HuggingFace weights.

llama4_eagle
llama_eagle
llama_eagle3
llava
llava_next
llava_next_video
llava_onevision
longcat_flash

Inference-only LongCat-Flash model compatible with HuggingFace weights.

longcat_flash_mtp
mamba

PyTorch MAMBA model.

mamba2

PyTorch MAMBA2 model.

medusa
midashenglm

Inference-only MiDashengLM model compatible with HuggingFace weights.

mimo

Inference-only MiMo model compatible with HuggingFace weights.

mimo_mtp

Inference-only MiMo-MTP model.

minicpm

Inference-only MiniCPM model compatible with HuggingFace weights.

minicpm3

Inference-only MiniCPM3 model compatible with HuggingFace weights.

minicpm_eagle

Inference-only EagleMiniCPM model compatible with HuggingFace weights.

minicpmo

Inference-only MiniCPM-O model compatible with HuggingFace weights.

minicpmv

Inference-only MiniCPM-V model compatible with HuggingFace weights.

minimax_text_01

Inference-only MiniMaxText01 model.

minimax_vl_01
mistral3
mixtral

Inference-only Mixtral model.

mllama4
mlp_speculator
modernbert
module_mapping
molmo
moonvit
mpt
nano_nemotron_vl
nemotron

Inference-only Nemotron model compatible with HuggingFace weights.

nemotron_h

Inference-only NemotronH model.

nemotron_nas

Inference-only Deci model compatible with HuggingFace weights.

nemotron_vl
nvlm_d
olmo

Inference-only OLMo model compatible with HuggingFace weights.

olmo2

Inference-only OLMo2 model compatible with HuggingFace weights.

olmoe

Inference-only OLMoE model compatible with HuggingFace weights.

opt

Inference-only OPT model compatible with HuggingFace weights.

orion

Inference-only Orion-14B model compatible with HuggingFace weights.

ovis

PyTorch Ovis model.

ovis2_5

PyTorch Ovis model.

paligemma
persimmon

Inference-only Persimmon model compatible with HuggingFace weights.

phi

Inference-only Phi-1.5 model compatible with HuggingFace weights.

phi3

Inference-only Phi-3 model; code inherits from llama.py.

phi3v
phi4_multimodal
phi4mm
phi4mm_audio
phi4mm_utils
phimoe

Inference-only PhiMoE model.

pixtral
plamo2

Inference-only PLaMo2 model.

qwen

Inference-only QWen model compatible with HuggingFace weights.

qwen2

Inference-only Qwen2 model compatible with HuggingFace weights.

qwen2_5_omni_thinker

Inference-only Qwen2.5-Omni model (thinker part).

qwen2_5_vl

Inference-only Qwen2.5-VL model compatible with HuggingFace weights.

qwen2_audio

Inference-only Qwen2-Audio model compatible with HuggingFace weights.

qwen2_moe

Inference-only Qwen2MoE model compatible with HuggingFace weights.

qwen2_rm

Inference-only Qwen2-RM model compatible with HuggingFace weights.

qwen2_vl

Inference-only Qwen2-VL model compatible with HuggingFace weights.

qwen3

Inference-only Qwen3 model compatible with HuggingFace weights.

qwen3_moe

Inference-only Qwen3MoE model compatible with HuggingFace weights.

qwen3_next

Inference-only Qwen3Next model.

qwen3_next_mtp

Inference-only Qwen3Next MTP model.

qwen3_vl

Inference-only Qwen3VL model compatible with HuggingFace weights.

qwen3_vl_moe

Inference-only Qwen3-VL-MoE model compatible with HuggingFace weights.

qwen_vl

Inference-only Qwen-VL model compatible with HuggingFace weights.

radio
registry

Whenever you add an architecture to this page, please also update tests/models/registry.py with example HuggingFace models for it.

roberta
rvl
seed_oss

Inference-only SeedOss model compatible with HuggingFace weights.

siglip

Implementation of SiglipVisionModel, intended to be used only within a vision-language model.

siglip2navit

Implementation of SiglipVisionModel, intended to be used only within a vision-language model.

skyworkr1v
smolvlm
solar

Inference-only Solar model compatible with HuggingFace weights.

stablelm

Inference-only StableLM (https://github.com/Stability-AI/StableLM) model compatible with HuggingFace weights.

starcoder2

PyTorch Starcoder2 model.

step3_text

Inference-only Step3 text model.

step3_vl
swin
tarsier
telechat2
teleflm
terratorch

Wrapper around Terratorch models

transformers

Wrapper around transformers models

ultravox

PyTorch Ultravox model.

utils
vision
voxtral
whisper
zamba2

PyTorch Zamba2 model implementation for vLLM.

ModelRegistry module-attribute

ModelRegistry = _ModelRegistry(
    {
        model_arch: (
            _LazyRegisteredModel(
                module_name=f"vllm.model_executor.models.{mod_relname}",
                class_name=cls_name,
            )
        )
        for model_arch, (mod_relname, cls_name) in (
            _VLLM_MODELS.items()
        )
    }
)
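
The registry maps architecture names (as they appear in a checkpoint's config.json) to lazily imported model classes. Out-of-tree models can be added at runtime with ModelRegistry.register_model; the sketch below assumes a hypothetical plugin module my_plugin.models and passes its location as a "module:ClassName" string so the import stays lazy:

from vllm import ModelRegistry

# Hypothetical out-of-tree architecture; the "module:ClassName" string form
# keeps the import lazy, mirroring the _LazyRegisteredModel entries above.
ModelRegistry.register_model(
    "MyLlamaForCausalLM",
    "my_plugin.models:MyLlamaForCausalLM",
)

# The architecture is now resolvable by name alongside the built-in models.
print("MyLlamaForCausalLM" in ModelRegistry.get_supported_archs())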

__all__ module-attribute

__all__ = [
    "ModelRegistry",
    "VllmModelForPooling",
    "is_pooling_model",
    "VllmModelForTextGeneration",
    "is_text_generation_model",
    "HasInnerState",
    "has_inner_state",
    "SupportsLoRA",
    "supports_lora",
    "SupportsMultiModal",
    "supports_multimodal",
    "SupportsMRoPE",
    "supports_mrope",
    "SupportsPP",
    "supports_pp",
    "SupportsTranscription",
    "supports_transcription",
    "SupportsV0Only",
    "supports_v0_only",
]

HasInnerState

Bases: Protocol

The interface required for all models that have inner state.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class HasInnerState(Protocol):
    """The interface required for all models that has inner state."""

    has_inner_state: ClassVar[Literal[True]] = True
    """
        A flag that indicates this model has inner state.
        Models that has inner state usually need access to the scheduler_config
        for max_num_seqs, etc. True for e.g. both Mamba and Jamba.
    """

has_inner_state class-attribute

has_inner_state: Literal[True] = True

A flag that indicates this model has inner state. Models that have inner state usually need access to the scheduler_config for max_num_seqs, etc. This is true for, e.g., both Mamba and Jamba.
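
As a quick illustration (not taken from the vLLM source), a model picks up the flag simply by listing the protocol among its bases, and the has_inner_state helper documented later on this page then recognizes both the class and its instances:

import torch.nn as nn

from vllm.model_executor.models.interfaces import HasInnerState, has_inner_state

class ToyStatefulModel(nn.Module, HasInnerState):
    """Toy model; inherits has_inner_state = True through the protocol's MRO."""

    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(8, 8)

assert has_inner_state(ToyStatefulModel)    # works on the class ...
assert has_inner_state(ToyStatefulModel())  # ... and on instances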

SupportsLoRA

Bases: Protocol

The interface required for all models that support LoRA.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsLoRA(Protocol):
    """The interface required for all models that support LoRA."""

    supports_lora: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports LoRA.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """
    # The `embedding_module` and `embedding_padding_modules`
    # are empty by default.
    embedding_modules: ClassVar[dict[str, str]] = {}
    embedding_padding_modules: ClassVar[list[str]] = []
    packed_modules_mapping: ClassVar[dict[str, list[str]]] = {}

embedding_modules class-attribute

embedding_modules: dict[str, str] = {}

embedding_padding_modules class-attribute

embedding_padding_modules: list[str] = []

packed_modules_mapping class-attribute

packed_modules_mapping: dict[str, list[str]] = {}

supports_lora class-attribute

supports_lora: Literal[True] = True

A flag that indicates this model supports LoRA.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.
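
For illustration only (none of this is vLLM source), a model opts in by listing SupportsLoRA among its bases and declaring its fused-projection mapping; the entries below are hypothetical and would normally mirror the model's own fused layers:

import torch.nn as nn

from vllm.model_executor.models.interfaces import SupportsLoRA, supports_lora

class ToyLoRAModel(nn.Module, SupportsLoRA):
    # Hypothetical mapping: LoRA weights targeting q/k/v and gate/up are
    # folded into the fused qkv_proj and gate_up_proj modules.
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
    # embedding_modules / embedding_padding_modules keep their empty defaults.

assert supports_lora(ToyLoRAModel)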

SupportsMRoPE

Bases: Protocol

The interface required for all models that support M-RoPE.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsMRoPE(Protocol):
    """The interface required for all models that support M-RoPE."""

    supports_mrope: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports M-RoPE.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def get_mrope_input_positions(
        self,
        input_tokens: list[int],
        hf_config: PretrainedConfig,
        image_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
        video_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
        second_per_grid_ts: Optional[list[float]] = None,
        context_len: int = 0,
        seq_len: Optional[int] = None,
        audio_feature_lengths: Optional[torch.Tensor] = None,
        use_audio_in_video: bool = False,
    ) -> tuple[torch.Tensor, int]:
        """
        Get M-RoPE input positions and delta value for this specific model.

        This method should be implemented by each model that supports M-RoPE
        to provide model-specific logic for computing input positions.

        Args:
            input_tokens: List of input token IDs
            hf_config: HuggingFace model configuration
            image_grid_thw: Image grid dimensions (t, h, w)
            video_grid_thw: Video grid dimensions (t, h, w)
            second_per_grid_ts: Seconds per grid timestep for videos
            context_len: Context length
            seq_len: Sequence length
            audio_feature_lengths: Audio feature lengths for multimodal models
            use_audio_in_video: Whether to use audio in video for interleaving

        Returns:
            Tuple of (llm_positions, mrope_position_delta)
            - llm_positions: Tensor of shape [3, num_tokens]
                with T/H/W positions
            - mrope_position_delta: Delta for position calculations
        """
        ...

supports_mrope class-attribute

supports_mrope: Literal[True] = True

A flag that indicates this model supports M-RoPE.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

get_mrope_input_positions

get_mrope_input_positions(
    input_tokens: list[int],
    hf_config: PretrainedConfig,
    image_grid_thw: Optional[
        Union[list[list[int]], Tensor]
    ],
    video_grid_thw: Optional[
        Union[list[list[int]], Tensor]
    ],
    second_per_grid_ts: Optional[list[float]] = None,
    context_len: int = 0,
    seq_len: Optional[int] = None,
    audio_feature_lengths: Optional[Tensor] = None,
    use_audio_in_video: bool = False,
) -> tuple[Tensor, int]

Get M-RoPE input positions and delta value for this specific model.

This method should be implemented by each model that supports M-RoPE to provide model-specific logic for computing input positions.

Parameters:

Name Type Description Default
input_tokens list[int]

List of input token IDs

required
hf_config PretrainedConfig

HuggingFace model configuration

required
image_grid_thw Optional[Union[list[list[int]], Tensor]]

Image grid dimensions (t, h, w)

required
video_grid_thw Optional[Union[list[list[int]], Tensor]]

Video grid dimensions (t, h, w)

required
second_per_grid_ts Optional[list[float]]

Seconds per grid timestep for videos

None
context_len int

Context length

0
seq_len Optional[int]

Sequence length

None
audio_feature_lengths Optional[Tensor]

Audio feature lengths for multimodal models

None
use_audio_in_video bool

Whether to use audio in video for interleaving

False

Returns:

Type Description
tuple[Tensor, int]

Tuple of (llm_positions, mrope_position_delta):
  • llm_positions: Tensor of shape [3, num_tokens] with T/H/W positions
  • mrope_position_delta: Delta for position calculations
Source code in vllm/model_executor/models/interfaces.py
def get_mrope_input_positions(
    self,
    input_tokens: list[int],
    hf_config: PretrainedConfig,
    image_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
    video_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
    second_per_grid_ts: Optional[list[float]] = None,
    context_len: int = 0,
    seq_len: Optional[int] = None,
    audio_feature_lengths: Optional[torch.Tensor] = None,
    use_audio_in_video: bool = False,
) -> tuple[torch.Tensor, int]:
    """
    Get M-RoPE input positions and delta value for this specific model.

    This method should be implemented by each model that supports M-RoPE
    to provide model-specific logic for computing input positions.

    Args:
        input_tokens: List of input token IDs
        hf_config: HuggingFace model configuration
        image_grid_thw: Image grid dimensions (t, h, w)
        video_grid_thw: Video grid dimensions (t, h, w)
        second_per_grid_ts: Seconds per grid timestep for videos
        context_len: Context length
        seq_len: Sequence length
        audio_feature_lengths: Audio feature lengths for multimodal models
        use_audio_in_video: Whether to use audio in video for interleaving

    Returns:
        Tuple of (llm_positions, mrope_position_delta)
        - llm_positions: Tensor of shape [3, num_tokens]
            with T/H/W positions
        - mrope_position_delta: Delta for position calculations
    """
    ...
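
A minimal sketch of this contract, assuming a text-only prompt with no image or video grids: all three T/H/W rows then equal the plain 1-D position and the position delta is zero. Real implementations (e.g. Qwen2-VL) instead derive per-axis offsets from the grid shapes.

from typing import Optional, Union

import torch
from transformers import PretrainedConfig

class ToyMRoPEModel:
    """Illustrative only; returns degenerate (text-only) M-RoPE positions."""

    supports_mrope = True

    def get_mrope_input_positions(
        self,
        input_tokens: list[int],
        hf_config: PretrainedConfig,
        image_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
        video_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
        second_per_grid_ts: Optional[list[float]] = None,
        context_len: int = 0,
        seq_len: Optional[int] = None,
        audio_feature_lengths: Optional[torch.Tensor] = None,
        use_audio_in_video: bool = False,
    ) -> tuple[torch.Tensor, int]:
        seq_len = len(input_tokens) if seq_len is None else seq_len
        positions = torch.arange(context_len, seq_len)
        # [3, num_tokens]: identical T/H/W rows for a text-only prompt.
        llm_positions = positions.unsqueeze(0).expand(3, -1)
        mrope_position_delta = 0
        return llm_positions, mrope_position_delta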

SupportsMultiModal

Bases: Protocol

The interface required for all multi-modal models.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsMultiModal(Protocol):
    """The interface required for all multi-modal models."""

    supports_multimodal: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports multi-modal inputs.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    supports_multimodal_raw_input_only: ClassVar[bool] = False
    """
    A flag that indicates this model supports multi-modal inputs and processes
    them in their raw form and not embeddings.
    """

    supports_encoder_tp_data: ClassVar[bool] = False
    """
    A flag that indicates whether this model supports
    `multimodal_config.mm_encoder_tp_mode="data"`.
    """

    merge_by_field_config: ClassVar[bool] = False
    """
    A flag that indicates which implementation of
    `vllm.multimodal.utils.group_mm_kwargs_by_modality` to use.
    """

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
        """
        Get the placeholder text for the `i`th `modality` item in the prompt.
        """
        ...

    def get_multimodal_embeddings(self,
                                  **kwargs: object) -> MultiModalEmbeddings:
        """
        Returns multimodal embeddings generated from multimodal kwargs 
        to be merged with text embeddings.

        Note:
            The returned multimodal embeddings must be in the same order as
            the appearances of their corresponding multimodal data item in the
            input prompt.
        """
        ...

    def get_language_model(self) -> VllmModel:
        """
        Returns the underlying language model used for text generation.

        This is typically the `torch.nn.Module` instance responsible for 
        processing the merged multimodal embeddings and producing hidden states

        Returns:
            torch.nn.Module: The core language model component.
        """
        ...

    @overload
    def get_input_embeddings(self, input_ids: Tensor) -> Tensor:
        ...

    @overload
    def get_input_embeddings(
        self,
        input_ids: Tensor,
        multimodal_embeddings: MultiModalEmbeddings,
        *,
        is_multimodal: torch.Tensor,
        handle_oov_mm_token: bool = False,
    ) -> Tensor:
        ...

    def _get_text_embeddings(
        self,
        input_ids: Tensor,
        get_input_embeddings: Callable[[Tensor], Tensor],
        *,
        is_multimodal: Optional[Tensor],
        handle_oov_mm_token: bool,
    ) -> Tensor:
        if handle_oov_mm_token and is_multimodal is not None:
            is_text = ~is_multimodal
            text_embeds = get_input_embeddings(input_ids[is_text])

            return torch.empty(
                (input_ids.shape[0], text_embeds.shape[1]),
                dtype=text_embeds.dtype,
                device=text_embeds.device,
            ).masked_scatter_(is_text.unsqueeze_(-1), text_embeds)

        return get_input_embeddings(input_ids)

    def get_input_embeddings(
        self,
        input_ids: Tensor,
        multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
        *,
        is_multimodal: Optional[Tensor] = None,
        handle_oov_mm_token: bool = False,
    ) -> Tensor:
        """
        Apply token embeddings to `input_ids`.

        If `multimodal_embeddings` is passed, scatter them into
        `input_ids` according to the mask `is_multimodal`.

        In case the multi-modal token IDs exceed the vocabulary size of
        the language model, you can set `handle_oov_mm_token=True`
        to avoid calling the language model's `get_input_embeddings` method
        on those tokens. Note however that doing so increases memory usage
        as an additional buffer is needed to hold the input embeddings.
        """
        from .utils import _merge_multimodal_embeddings

        inputs_embeds = self._get_text_embeddings(
            input_ids,
            self.get_language_model().get_input_embeddings,
            is_multimodal=is_multimodal,
            handle_oov_mm_token=handle_oov_mm_token,
        )

        if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
            return inputs_embeds

        if is_multimodal is None:
            raise ValueError(
                "`get_input_embeddings` now requires `is_multimodal` arg, "
                "please update your model runner according to "
                "https://github.com/vllm-project/vllm/pull/16229.")

        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=is_multimodal,
        )

merge_by_field_config class-attribute

merge_by_field_config: bool = False

A flag that indicates which implementation of vllm.multimodal.utils.group_mm_kwargs_by_modality to use.

supports_encoder_tp_data class-attribute

supports_encoder_tp_data: bool = False

A flag that indicates whether this model supports multimodal_config.mm_encoder_tp_mode="data".

supports_multimodal class-attribute

supports_multimodal: Literal[True] = True

A flag that indicates this model supports multi-modal inputs.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

supports_multimodal_raw_input_only class-attribute

supports_multimodal_raw_input_only: bool = False

A flag that indicates this model supports multi-modal inputs and processes them in their raw form and not embeddings.

_get_text_embeddings

_get_text_embeddings(
    input_ids: Tensor,
    get_input_embeddings: Callable[[Tensor], Tensor],
    *,
    is_multimodal: Optional[Tensor],
    handle_oov_mm_token: bool,
) -> Tensor
Source code in vllm/model_executor/models/interfaces.py
def _get_text_embeddings(
    self,
    input_ids: Tensor,
    get_input_embeddings: Callable[[Tensor], Tensor],
    *,
    is_multimodal: Optional[Tensor],
    handle_oov_mm_token: bool,
) -> Tensor:
    if handle_oov_mm_token and is_multimodal is not None:
        is_text = ~is_multimodal
        text_embeds = get_input_embeddings(input_ids[is_text])

        return torch.empty(
            (input_ids.shape[0], text_embeds.shape[1]),
            dtype=text_embeds.dtype,
            device=text_embeds.device,
        ).masked_scatter_(is_text.unsqueeze_(-1), text_embeds)

    return get_input_embeddings(input_ids)
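
To see what the handle_oov_mm_token path does, here is a standalone toy run (not vLLM code, shapes chosen arbitrarily): the embedding table only covers the text vocabulary, so the two out-of-vocabulary placeholder IDs are skipped during the lookup, and masked_scatter_ writes the text embeddings back into their original rows of a freshly allocated buffer.

import torch

# 5 tokens, hidden size 4; IDs 900/901 stand in for multimodal placeholder
# tokens that lie outside the text vocabulary of size 10.
input_ids = torch.tensor([1, 2, 900, 901, 3])
is_multimodal = torch.tensor([False, False, True, True, False])
embed = torch.nn.Embedding(10, 4)

is_text = ~is_multimodal
text_embeds = embed(input_ids[is_text])  # safe: only in-vocab IDs are looked up

buffer = torch.empty(input_ids.shape[0], text_embeds.shape[1],
                     dtype=text_embeds.dtype)
buffer.masked_scatter_(is_text.unsqueeze(-1), text_embeds)

# Rows 2 and 3 stay uninitialized here; get_input_embeddings later overwrites
# them with the projected multimodal embeddings.
print(buffer.shape)  # torch.Size([5, 4])

The extra buffer is also why the get_input_embeddings docstring notes the increased memory usage of this path.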

get_input_embeddings

get_input_embeddings(input_ids: Tensor) -> Tensor
get_input_embeddings(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings,
    *,
    is_multimodal: Tensor,
    handle_oov_mm_token: bool = False,
) -> Tensor
get_input_embeddings(
    input_ids: Tensor,
    multimodal_embeddings: Optional[
        MultiModalEmbeddings
    ] = None,
    *,
    is_multimodal: Optional[Tensor] = None,
    handle_oov_mm_token: bool = False,
) -> Tensor

Apply token embeddings to input_ids.

If multimodal_embeddings is passed, scatter them into input_ids according to the mask is_multimodal.

In case the multi-modal token IDs exceed the vocabulary size of the language model, you can set handle_oov_mm_token=True to avoid calling the language model's get_input_embeddings method on those tokens. Note, however, that doing so increases memory usage, as an additional buffer is needed to hold the input embeddings.

Source code in vllm/model_executor/models/interfaces.py
def get_input_embeddings(
    self,
    input_ids: Tensor,
    multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
    *,
    is_multimodal: Optional[Tensor] = None,
    handle_oov_mm_token: bool = False,
) -> Tensor:
    """
    Apply token embeddings to `input_ids`.

    If `multimodal_embeddings` is passed, scatter them into
    `input_ids` according to the mask `is_multimodal`.

    In case the multi-modal token IDs exceed the vocabulary size of
    the language model, you can set `handle_oov_mm_token=True`
    to avoid calling the language model's `get_input_embeddings` method
    on those tokens. Note however that doing so increases memory usage
    as an additional buffer is needed to hold the input embeddings.
    """
    from .utils import _merge_multimodal_embeddings

    inputs_embeds = self._get_text_embeddings(
        input_ids,
        self.get_language_model().get_input_embeddings,
        is_multimodal=is_multimodal,
        handle_oov_mm_token=handle_oov_mm_token,
    )

    if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
        return inputs_embeds

    if is_multimodal is None:
        raise ValueError(
            "`get_input_embeddings` now requires `is_multimodal` arg, "
            "please update your model runner according to "
            "https://github.com/vllm-project/vllm/pull/16229.")

    return _merge_multimodal_embeddings(
        inputs_embeds=inputs_embeds,
        multimodal_embeddings=multimodal_embeddings,
        is_multimodal=is_multimodal,
    )

get_language_model

get_language_model() -> VllmModel

Returns the underlying language model used for text generation.

This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.

Returns:

Type Description
VllmModel

torch.nn.Module: The core language model component.

Source code in vllm/model_executor/models/interfaces.py
def get_language_model(self) -> VllmModel:
    """
    Returns the underlying language model used for text generation.

    This is typically the `torch.nn.Module` instance responsible for 
    processing the merged multimodal embeddings and producing hidden states

    Returns:
        torch.nn.Module: The core language model component.
    """
    ...

get_multimodal_embeddings

get_multimodal_embeddings(
    **kwargs: object,
) -> MultiModalEmbeddings

Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.

Note

The returned multimodal embeddings must be in the same order as the appearances of their corresponding multimodal data item in the input prompt.

Source code in vllm/model_executor/models/interfaces.py
def get_multimodal_embeddings(self,
                              **kwargs: object) -> MultiModalEmbeddings:
    """
    Returns multimodal embeddings generated from multimodal kwargs 
    to be merged with text embeddings.

    Note:
        The returned multimodal embeddings must be in the same order as
        the appearances of their corresponding multimodal data item in the
        input prompt.
    """
    ...

get_placeholder_str classmethod

get_placeholder_str(modality: str, i: int) -> Optional[str]

Get the placeholder text for the i-th modality item in the prompt.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
    """
    Get the placeholder text for the `i`th `modality` item in the prompt.
    """
    ...
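
A hypothetical implementation simply maps each modality to the placeholder string its processor expects; the strings below are illustrative and not taken from any particular vLLM model:

from typing import Optional

class ToyVLModel:
    """Sketch of get_placeholder_str for an imaginary image/video model."""

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
        if modality.startswith("image"):
            return "<image>"
        if modality.startswith("video"):
            return "<video>"
        raise ValueError(f"Unsupported modality: {modality}")

# e.g. building a prompt that references two images:
prompt = " ".join(ToyVLModel.get_placeholder_str("image", i) for i in range(2))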

SupportsPP

Bases: Protocol

The interface required for all models that support pipeline parallel.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsPP(Protocol):
    """The interface required for all models that support pipeline parallel."""

    supports_pp: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports pipeline parallel.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def make_empty_intermediate_tensors(
        self,
        batch_size: int,
        dtype: torch.dtype,
        device: torch.device,
    ) -> "IntermediateTensors":
        """Called when PP rank > 0 for profiling purposes."""
        ...

    def forward(
        self,
        *,
        intermediate_tensors: Optional["IntermediateTensors"],
    ) -> Union[Tensor, "IntermediateTensors"]:
        """
        Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
        PP rank > 0.

        Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] only
        for the last PP rank.
        """
        ...

supports_pp class-attribute

supports_pp: Literal[True] = True

A flag that indicates this model supports pipeline parallel.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

forward

forward(
    *, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]

Accept IntermediateTensors when PP rank > 0.

Return IntermediateTensors only for the last PP rank.

Source code in vllm/model_executor/models/interfaces.py
def forward(
    self,
    *,
    intermediate_tensors: Optional["IntermediateTensors"],
) -> Union[Tensor, "IntermediateTensors"]:
    """
    Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
    PP rank > 0.

    Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] only
    for the last PP rank.
    """
    ...

make_empty_intermediate_tensors

make_empty_intermediate_tensors(
    batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors

Called when PP rank > 0 for profiling purposes.

Source code in vllm/model_executor/models/interfaces.py
def make_empty_intermediate_tensors(
    self,
    batch_size: int,
    dtype: torch.dtype,
    device: torch.device,
) -> "IntermediateTensors":
    """Called when PP rank > 0 for profiling purposes."""
    ...
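
Putting the two methods together, a minimal sketch of the pipeline-parallel contract (toy hidden size and a single linear layer per rank; the is_last_rank flag stands in for the real PP-group query):

from typing import Optional, Union

import torch
from torch import Tensor

from vllm.sequence import IntermediateTensors

HIDDEN_SIZE = 8  # toy size, purely illustrative

class ToyPPModel(torch.nn.Module):
    supports_pp = True

    def __init__(self, is_last_rank: bool) -> None:
        super().__init__()
        self.is_last_rank = is_last_rank
        self.embed = torch.nn.Embedding(100, HIDDEN_SIZE)
        self.layer = torch.nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE)

    def make_empty_intermediate_tensors(
        self,
        batch_size: int,
        dtype: torch.dtype,
        device: torch.device,
    ) -> IntermediateTensors:
        # Sizes the buffers a PP rank > 0 expects to receive during profiling.
        return IntermediateTensors({
            "hidden_states":
            torch.zeros(batch_size, HIDDEN_SIZE, dtype=dtype, device=device),
        })

    def forward(
        self,
        input_ids: Tensor,
        *,
        intermediate_tensors: Optional[IntermediateTensors],
    ) -> Union[Tensor, IntermediateTensors]:
        if intermediate_tensors is None:
            hidden = self.embed(input_ids)                    # first PP rank
        else:
            hidden = intermediate_tensors["hidden_states"]    # later ranks resume
        hidden = self.layer(hidden)
        if not self.is_last_rank:
            return IntermediateTensors({"hidden_states": hidden})
        return hidden  # only the last rank returns plain hidden states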

SupportsTranscription

Bases: Protocol

The interface required for all models that support transcription.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsTranscription(Protocol):
    """The interface required for all models that support transcription."""
    # Mapping from ISO639_1 language codes: language names
    supported_languages: ClassVar[Mapping[str, str]]

    supports_transcription: ClassVar[Literal[True]] = True

    supports_transcription_only: ClassVar[bool] = False
    """
    Transcription models can opt out of text generation by setting this to
    `True`.
    """

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # language codes in supported_languages
        # that don't exist in the full language map
        invalid = set(cls.supported_languages) - set(LANGUAGES.keys())
        if invalid:
            raise ValueError(
                f"{cls.__name__}.supported_languages contains invalid "
                f"language codes: {sorted(invalid)}\n. "
                f"Valid choices are: {sorted(LANGUAGES.keys())}")

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray,
                              stt_config: SpeechToTextConfig,
                              model_config: ModelConfig,
                              language: Optional[str],
                              task_type: Literal["transcribe", "translate"],
                              request_prompt: str,
                              to_language: Optional[str]) -> PromptType:
        """Get the prompt for the ASR model.
        The model has control over the construction, as long as it
        returns a valid PromptType."""
        ...

    @classmethod
    def get_other_languages(cls) -> Mapping[str, str]:
        # other possible language codes from the whisper map
        return {
            k: v
            for k, v in LANGUAGES.items() if k not in cls.supported_languages
        }

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        """
        Ensure the language specified in the transcription request 
        is a valid ISO 639-1 language code. If the request language is 
        valid, but not natively supported by the model, trigger a 
        warning (but not an exception).
        """
        if language is None or language in cls.supported_languages:
            return language
        elif language in cls.get_other_languages():
            logger.warning(
                "Language %r is not natively supported by %s; "
                "results may be less accurate. Supported languages: %r",
                language,
                cls.__name__,
                list(cls.supported_languages.keys()),
            )
            return language
        else:
            raise ValueError(
                f"Unsupported language: {language!r}.  Must be one of "
                f"{list(cls.supported_languages.keys())}.")

    @classmethod
    def get_speech_to_text_config(
            cls, model_config: ModelConfig,
            task_type: Literal["transcribe",
                               "translate"]) -> SpeechToTextConfig:
        """Get the speech to text config for the ASR model."""
        ...

    @classmethod
    def get_num_audio_tokens(cls, audio_duration_s: float,
                             stt_config: SpeechToTextConfig,
                             model_config: ModelConfig) -> Optional[int]:
        """
        Map from audio duration to number of audio tokens produced by the ASR 
        model, without running a forward pass.
        This is used for estimating the amount of processing for this audio.
        """
        return None

supported_languages class-attribute

supported_languages: Mapping[str, str]

supports_transcription class-attribute

supports_transcription: Literal[True] = True

supports_transcription_only class-attribute

supports_transcription_only: bool = False

Transcription models can opt out of text generation by setting this to True.

__init_subclass__

__init_subclass__(**kwargs)
Source code in vllm/model_executor/models/interfaces.py
def __init_subclass__(cls, **kwargs):
    super().__init_subclass__(**kwargs)
    # language codes in supported_languages
    # that don't exist in the full language map
    invalid = set(cls.supported_languages) - set(LANGUAGES.keys())
    if invalid:
        raise ValueError(
            f"{cls.__name__}.supported_languages contains invalid "
            f"language codes: {sorted(invalid)}\n. "
            f"Valid choices are: {sorted(LANGUAGES.keys())}")

get_generation_prompt classmethod

get_generation_prompt(
    audio: ndarray,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
    language: Optional[str],
    task_type: Literal["transcribe", "translate"],
    request_prompt: str,
    to_language: Optional[str],
) -> PromptType

Get the prompt for the ASR model. The model has control over the construction, as long as it returns a valid PromptType.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_generation_prompt(cls, audio: np.ndarray,
                          stt_config: SpeechToTextConfig,
                          model_config: ModelConfig,
                          language: Optional[str],
                          task_type: Literal["transcribe", "translate"],
                          request_prompt: str,
                          to_language: Optional[str]) -> PromptType:
    """Get the prompt for the ASR model.
    The model has control over the construction, as long as it
    returns a valid PromptType."""
    ...

get_num_audio_tokens classmethod

get_num_audio_tokens(
    audio_duration_s: float,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
) -> Optional[int]

Map from audio duration to number of audio tokens produced by the ASR model, without running a forward pass. This is used for estimating the amount of processing for this audio.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_num_audio_tokens(cls, audio_duration_s: float,
                         stt_config: SpeechToTextConfig,
                         model_config: ModelConfig) -> Optional[int]:
    """
    Map from audio duration to number of audio tokens produced by the ASR 
    model, without running a forward pass.
    This is used for estimating the amount of processing for this audio.
    """
    return None
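
As an illustration only, with hypothetical, loosely Whisper-like constants that are not taken from any vLLM model, an override might estimate the token count from fixed-length audio chunks:

import math
from typing import Optional

# Hypothetical constants: 30 s chunks, 1500 encoder tokens per chunk.
CHUNK_SECONDS = 30.0
TOKENS_PER_CHUNK = 1500

def estimate_num_audio_tokens(audio_duration_s: float) -> Optional[int]:
    """Toy stand-in for SupportsTranscription.get_num_audio_tokens."""
    if audio_duration_s <= 0:
        return None
    return math.ceil(audio_duration_s / CHUNK_SECONDS) * TOKENS_PER_CHUNK

print(estimate_num_audio_tokens(45.0))  # 3000: two 30 s chunks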

get_other_languages classmethod

get_other_languages() -> Mapping[str, str]
Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_other_languages(cls) -> Mapping[str, str]:
    # other possible language codes from the whisper map
    return {
        k: v
        for k, v in LANGUAGES.items() if k not in cls.supported_languages
    }

get_speech_to_text_config classmethod

get_speech_to_text_config(
    model_config: ModelConfig,
    task_type: Literal["transcribe", "translate"],
) -> SpeechToTextConfig

Get the speech to text config for the ASR model.

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def get_speech_to_text_config(
        cls, model_config: ModelConfig,
        task_type: Literal["transcribe",
                           "translate"]) -> SpeechToTextConfig:
    """Get the speech to text config for the ASR model."""
    ...

validate_language classmethod

validate_language(language: Optional[str]) -> Optional[str]

Ensure the language specified in the transcription request is a valid ISO 639-1 language code. If the request language is valid, but not natively supported by the model, trigger a warning (but not an exception).

Source code in vllm/model_executor/models/interfaces.py
@classmethod
def validate_language(cls, language: Optional[str]) -> Optional[str]:
    """
    Ensure the language specified in the transcription request 
    is a valid ISO 639-1 language code. If the request language is 
    valid, but not natively supported by the model, trigger a 
    warning (but not an exception).
    """
    if language is None or language in cls.supported_languages:
        return language
    elif language in cls.get_other_languages():
        logger.warning(
            "Language %r is not natively supported by %s; "
            "results may be less accurate. Supported languages: %r",
            language,
            cls.__name__,
            list(cls.supported_languages.keys()),
        )
        return language
    else:
        raise ValueError(
            f"Unsupported language: {language!r}.  Must be one of "
            f"{list(cls.supported_languages.keys())}.")

SupportsV0Only

Bases: Protocol

Models with this interface are not compatible with V1 vLLM.

Source code in vllm/model_executor/models/interfaces.py
@runtime_checkable
class SupportsV0Only(Protocol):
    """Models with this interface are not compatible with V1 vLLM."""

    supports_v0_only: ClassVar[Literal[True]] = True

supports_v0_only class-attribute

supports_v0_only: Literal[True] = True

VllmModelForPooling

Bases: VllmModel[T_co], Protocol[T_co]

The interface required for all pooling models in vLLM.

Source code in vllm/model_executor/models/interfaces_base.py
@runtime_checkable
class VllmModelForPooling(VllmModel[T_co], Protocol[T_co]):
    """The interface required for all pooling models in vLLM."""

    is_pooling_model: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports pooling.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    default_pooling_type: ClassVar[str] = "LAST"
    """
    Indicates the
    [vllm.model_executor.layers.pooler.PoolerConfig.pooling_type][]
    to use by default.

    You can use the
    [vllm.model_executor.models.interfaces_base.default_pooling_type][]
    decorator to conveniently set this field.
    """

    pooler: Pooler
    """The pooler is only called on TP rank 0."""

default_pooling_type class-attribute

default_pooling_type: str = 'LAST'

Indicates the vllm.model_executor.layers.pooler.PoolerConfig.pooling_type to use by default.

You can use the vllm.model_executor.models.interfaces_base.default_pooling_type decorator to conveniently set this field.

is_pooling_model class-attribute

is_pooling_model: Literal[True] = True

A flag that indicates this model supports pooling.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

pooler instance-attribute

pooler: Pooler

The pooler is only called on TP rank 0.
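
The default_pooling_type decorator mentioned above sets this attribute for you; a minimal sketch, assuming the decorator takes the pooling-type name as a string and noting that the model class itself is hypothetical:

import torch.nn as nn

from vllm.model_executor.models.interfaces_base import default_pooling_type

@default_pooling_type("CLS")
class ToyEmbeddingModel(nn.Module):
    """Hypothetical BERT-style embedder that pools the first ([CLS]) token."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 8)

print(ToyEmbeddingModel.default_pooling_type)  # "CLS"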

VllmModelForTextGeneration

Bases: VllmModel[T], Protocol[T]

The interface required for all generative models in vLLM.

Source code in vllm/model_executor/models/interfaces_base.py
@runtime_checkable
class VllmModelForTextGeneration(VllmModel[T], Protocol[T]):
    """The interface required for all generative models in vLLM."""

    def compute_logits(
        self,
        hidden_states: T,
    ) -> Optional[T]:
        """Return `None` if TP rank > 0."""
        ...

compute_logits

compute_logits(hidden_states: T) -> Optional[T]

Return None if TP rank > 0.

Source code in vllm/model_executor/models/interfaces_base.py
def compute_logits(
    self,
    hidden_states: T,
) -> Optional[T]:
    """Return `None` if TP rank > 0."""
    ...

has_inner_state

has_inner_state(model: object) -> TypeIs[HasInnerState]
has_inner_state(
    model: type[object],
) -> TypeIs[type[HasInnerState]]
has_inner_state(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[HasInnerState]], TypeIs[HasInnerState]
]
Source code in vllm/model_executor/models/interfaces.py
def has_inner_state(
    model: Union[type[object], object]
) -> Union[TypeIs[type[HasInnerState]], TypeIs[HasInnerState]]:
    return getattr(model, "has_inner_state", False)

is_pooling_model

is_pooling_model(
    model: type[object],
) -> TypeIs[type[VllmModelForPooling]]
is_pooling_model(
    model: object,
) -> TypeIs[VllmModelForPooling]
Source code in vllm/model_executor/models/interfaces_base.py
def is_pooling_model(
    model: Union[type[object], object],
) -> Union[TypeIs[type[VllmModelForPooling]], TypeIs[VllmModelForPooling]]:
    if not is_vllm_model(model):
        return False

    return getattr(model, "is_pooling_model", False)

is_text_generation_model

is_text_generation_model(
    model: type[object],
) -> TypeIs[type[VllmModelForTextGeneration]]
is_text_generation_model(
    model: object,
) -> TypeIs[VllmModelForTextGeneration]
Source code in vllm/model_executor/models/interfaces_base.py
def is_text_generation_model(
    model: Union[type[object], object],
) -> Union[TypeIs[type[VllmModelForTextGeneration]],
           TypeIs[VllmModelForTextGeneration]]:
    if not is_vllm_model(model):
        return False

    if isinstance(model, type):
        return isinstance(model, VllmModelForTextGeneration)

    return isinstance(model, VllmModelForTextGeneration)
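
These helpers are typically used to branch on what a resolved model class can do. A small check against a built-in architecture, assuming vLLM is installed so the model module can be imported:

from vllm.model_executor.models import is_pooling_model, is_text_generation_model
from vllm.model_executor.models.llama import LlamaForCausalLM

print(is_text_generation_model(LlamaForCausalLM))  # True: defines compute_logits
print(is_pooling_model(LlamaForCausalLM))          # False: does not set is_pooling_model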

supports_lora

supports_lora(
    model: type[object],
) -> TypeIs[type[SupportsLoRA]]
supports_lora(model: object) -> TypeIs[SupportsLoRA]
supports_lora(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]
]
Source code in vllm/model_executor/models/interfaces.py
def supports_lora(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]]:
    result = _supports_lora(model)

    if not result:
        lora_attrs = (
            "packed_modules_mapping",
            "embedding_modules",
            "embedding_padding_modules",
        )
        missing_attrs = tuple(attr for attr in lora_attrs
                              if not hasattr(model, attr))

        if getattr(model, "supports_lora", False):
            if missing_attrs:
                logger.warning(
                    "The model (%s) sets `supports_lora=True`, "
                    "but is missing LoRA-specific attributes: %s",
                    model,
                    missing_attrs,
                )
        else:
            if not missing_attrs:
                logger.warning(
                    "The model (%s) contains all LoRA-specific attributes, "
                    "but does not set `supports_lora=True`.", model)

    return result

supports_mrope

supports_mrope(
    model: type[object],
) -> TypeIs[type[SupportsMRoPE]]
supports_mrope(model: object) -> TypeIs[SupportsMRoPE]
supports_mrope(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsMRoPE]], TypeIs[SupportsMRoPE]
]
Source code in vllm/model_executor/models/interfaces.py
def supports_mrope(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsMRoPE]], TypeIs[SupportsMRoPE]]:
    return isinstance(model, SupportsMRoPE)

supports_multimodal

supports_multimodal(
    model: type[object],
) -> TypeIs[type[SupportsMultiModal]]
supports_multimodal(
    model: object,
) -> TypeIs[SupportsMultiModal]
supports_multimodal(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsMultiModal]],
    TypeIs[SupportsMultiModal],
]
Source code in vllm/model_executor/models/interfaces.py
def supports_multimodal(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsMultiModal]], TypeIs[SupportsMultiModal]]:
    return getattr(model, "supports_multimodal", False)

supports_pp

supports_pp(
    model: type[object],
) -> TypeIs[type[SupportsPP]]
supports_pp(model: object) -> TypeIs[SupportsPP]
supports_pp(
    model: Union[type[object], object],
) -> Union[
    bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]
]
Source code in vllm/model_executor/models/interfaces.py
def supports_pp(
    model: Union[type[object], object],
) -> Union[bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]]:
    supports_attributes = _supports_pp_attributes(model)
    supports_inspect = _supports_pp_inspect(model)

    if supports_attributes and not supports_inspect:
        logger.warning(
            "The model (%s) sets `supports_pp=True`, but does not accept "
            "`intermediate_tensors` in its `forward` method", model)

    if not supports_attributes:
        pp_attrs = ("make_empty_intermediate_tensors", )
        missing_attrs = tuple(attr for attr in pp_attrs
                              if not hasattr(model, attr))

        if getattr(model, "supports_pp", False):
            if missing_attrs:
                logger.warning(
                    "The model (%s) sets `supports_pp=True`, "
                    "but is missing PP-specific attributes: %s",
                    model,
                    missing_attrs,
                )
        else:
            if not missing_attrs:
                logger.warning(
                    "The model (%s) contains all PP-specific attributes, "
                    "but does not set `supports_pp=True`.", model)

    return supports_attributes and supports_inspect

supports_transcription

supports_transcription(
    model: type[object],
) -> TypeIs[type[SupportsTranscription]]
supports_transcription(
    model: object,
) -> TypeIs[SupportsTranscription]
supports_transcription(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsTranscription]],
    TypeIs[SupportsTranscription],
]
Source code in vllm/model_executor/models/interfaces.py
def supports_transcription(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsTranscription]], TypeIs[SupportsTranscription]]:
    return getattr(model, "supports_transcription", False)

supports_v0_only

supports_v0_only(
    model: type[object],
) -> TypeIs[type[SupportsV0Only]]
supports_v0_only(model: object) -> TypeIs[SupportsV0Only]
supports_v0_only(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]
]
Source code in vllm/model_executor/models/interfaces.py
def supports_v0_only(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]]:
    return getattr(model, "supports_v0_only", False)