vllm.model_executor.models.interfaces ¶

MultiModalEmbeddings `module-attribute` ¶

MultiModalEmbeddings = Union[
    list[Tensor], Tensor, tuple[Tensor, ...]
]

The output embeddings must be one of the following formats:

- A list or tuple of 2D tensors, where each tensor corresponds to one input multimodal data item (e.g., an image).
- A single 3D tensor, with the batch dimension grouping the 2D tensors.
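
As a rough illustration of the two accepted layouts, the sketch below uses plain shape tuples in place of `torch.Tensor` objects. The helper `count_items` and the example sizes are hypothetical and not part of vLLM.

```python
from typing import Sequence, Union

# Shapes stand in for tensors: (num_tokens, hidden_size) is a 2D tensor,
# (num_items, num_tokens, hidden_size) is a 3D tensor.
Shape = tuple[int, ...]


def count_items(embeddings: Union[Shape, Sequence[Shape]]) -> int:
    """Return how many multimodal items the embeddings describe."""
    if embeddings and isinstance(embeddings[0], int):
        # A single shape: must be 3D, the batch dimension groups the items.
        if len(embeddings) != 3:
            raise ValueError("a single tensor must be 3D")
        return embeddings[0]
    # A list or tuple of shapes: each must be 2D, one per item.
    for shape in embeddings:
        if len(shape) != 2:
            raise ValueError("each per-item tensor must be 2D")
    return len(embeddings)


# Two images with different token counts: only the list/tuple form fits.
per_item = [(576, 4096), (144, 4096)]
# Three images with the same token count: a single 3D tensor also works.
batched = (3, 576, 4096)
```

The list or tuple form is the more general one, since items with different token counts cannot be stacked into a single 3D tensor.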

logger `module-attribute` ¶

logger = init_logger(__name__)

HasInnerState ¶

Bases: Protocol

The interface required for all models that have inner state.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class HasInnerState(Protocol):
    """The interface required for all models that have inner state."""

    has_inner_state: ClassVar[Literal[True]] = True
    """
        A flag that indicates this model has inner state.
        Models that have inner state usually need access to the scheduler_config
        for max_num_seqs, etc. True for both Mamba and Jamba, for example.
    """

has_inner_state `class-attribute` ¶

has_inner_state: Literal[True] = True

A flag that indicates this model has inner state. Models that have inner state usually need access to the scheduler_config for max_num_seqs, etc. True for both Mamba and Jamba, for example.
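
The flag-style protocols on this page are all `@runtime_checkable`, so they can be tested with `isinstance`. A minimal sketch of that mechanism follows, using a local stand-in protocol (`HasInnerStateSketch`) and hypothetical model classes rather than vLLM's own; `needs_scheduler_config` is likewise an illustrative helper, not a vLLM function.

```python
from typing import ClassVar, Literal, Protocol, runtime_checkable


@runtime_checkable
class HasInnerStateSketch(Protocol):
    """Local stand-in mirroring the shape of `HasInnerState`."""

    has_inner_state: ClassVar[Literal[True]] = True


class StatefulModel(HasInnerStateSketch):
    """Inherits the flag because the protocol is in its MRO."""


class DuckTypedModel:
    """Satisfies the protocol structurally by defining the flag itself."""

    has_inner_state = True


class PlainModel:
    """Defines no flag, so it does not satisfy the protocol."""


def needs_scheduler_config(model: object) -> bool:
    # isinstance() on a runtime-checkable protocol only checks that the
    # attribute exists; it does not verify the value or its type.
    return isinstance(model, HasInnerStateSketch)
```

Because the check only looks for the attribute's presence, a model should not define the flag at all unless it really has the capability.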

HasNoOps ¶

Bases: Protocol

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class HasNoOps(Protocol):
    has_noops: ClassVar[Literal[True]] = True

has_noops `class-attribute` ¶

has_noops: Literal[True] = True

IsAttentionFree ¶

Bases: Protocol

The interface required for all models like Mamba that lack attention but do have state whose size is constant with respect to the number of tokens.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class IsAttentionFree(Protocol):
    """The interface required for all models like Mamba that lack attention
    but do have state whose size is constant with respect to the number of
    tokens."""

    is_attention_free: ClassVar[Literal[True]] = True
    """
        A flag that indicates this model has no attention.
        Used for block manager and attention backend selection.
        True for Mamba but not Jamba.
    """

is_attention_free `class-attribute` ¶

is_attention_free: Literal[True] = True

A flag that indicates this model has no attention. Used for block manager and attention backend selection. True for Mamba but not Jamba.

IsHybrid ¶

Bases: Protocol

The interface required for all models like Jamba that have both attention and Mamba blocks. It also indicates that hf_config has 'layers_block_type'.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class IsHybrid(Protocol):
    """The interface required for all models like Jamba that have both
    attention and Mamba blocks. It also indicates that hf_config has
    'layers_block_type'."""

    is_hybrid: ClassVar[Literal[True]] = True
    """
        A flag that indicates this model has both Mamba and attention
        blocks. It also indicates that the model's hf_config has
        'layers_block_type'.
    """

    @classmethod
    def get_mamba_state_shape_from_config(
        cls,
        vllm_config: "VllmConfig",
        use_v1: bool = True,
    ) -> tuple[tuple[int, int], tuple[int, int, int]]:
        """Calculate shapes for Mamba's convolutional and state caches.

        Args:
            vllm_config: vLLM config
            use_v1: Get shapes for V1 (or V0)

        Returns:
            Tuple containing:
            - conv_state_shape: Shape for convolutional state cache
            - temporal_state_shape: Shape for state space model cache
        """
        ...

is_hybrid `class-attribute` ¶

is_hybrid: Literal[True] = True

A flag that indicates this model has both Mamba and attention blocks. It also indicates that the model's hf_config has 'layers_block_type'.

get_mamba_state_shape_from_config `classmethod` ¶

get_mamba_state_shape_from_config(
    vllm_config: VllmConfig, use_v1: bool = True
) -> tuple[tuple[int, int], tuple[int, int, int]]

Calculate shapes for Mamba's convolutional and state caches.

Parameters:

Name	Type	Description	Default
`vllm_config`	`VllmConfig`	vLLM config	required
`use_v1`	`bool`	Get shapes for V1 (or V0)	`True`

Returns:

Type	Description
`tuple[tuple[int, int], tuple[int, int, int]]`	Tuple containing conv_state_shape (`tuple[int, int]`), the shape for the convolutional state cache, and temporal_state_shape (`tuple[int, int, int]`), the shape for the state space model cache.
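
To make the return structure concrete, here is a minimal sketch of how a hybrid model might implement this classmethod. `ToyMambaConfig`, its fields, and the formulas are hypothetical stand-ins for illustration only. A real model would read its values from `vllm_config` and also accept `use_v1`, both of which are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMambaConfig:
    """Hypothetical stand-in for fields a real model reads from VllmConfig."""

    intermediate_size: int
    conv_kernel: int
    num_heads: int
    head_dim: int
    state_size: int
    tensor_parallel_size: int = 1


class ToyHybridModel:
    is_hybrid = True

    @classmethod
    def get_mamba_state_shape_from_config(
        cls,
        config: ToyMambaConfig,
    ) -> tuple[tuple[int, int], tuple[int, int, int]]:
        tp = config.tensor_parallel_size
        # Illustrative formulas only: per-rank conv cache and SSM state cache.
        conv_state_shape = (config.conv_kernel - 1, config.intermediate_size // tp)
        temporal_state_shape = (
            config.num_heads // tp,
            config.head_dim,
            config.state_size,
        )
        return conv_state_shape, temporal_state_shape


conv_shape, ssm_shape = ToyHybridModel.get_mamba_state_shape_from_config(
    ToyMambaConfig(
        intermediate_size=8192,
        conv_kernel=4,
        num_heads=128,
        head_dim=64,
        state_size=128,
        tensor_parallel_size=2,
    )
)
```

The point of the sketch is the shape of the result: a 2-tuple for the convolutional cache followed by a 3-tuple for the state space cache, both computable from configuration alone without instantiating the model.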

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_mamba_state_shape_from_config(
    cls,
    vllm_config: "VllmConfig",
    use_v1: bool = True,
) -> tuple[tuple[int, int], tuple[int, int, int]]:
    """Calculate shapes for Mamba's convolutional and state caches.

    Args:
        vllm_config: vLLM config
        use_v1: Get shapes for V1 (or V0)

    Returns:
        Tuple containing:
        - conv_state_shape: Shape for convolutional state cache
        - temporal_state_shape: Shape for state space model cache
    """
    ...

MixtureOfExperts ¶

Bases: Protocol

Check if the model is a mixture of experts (MoE) model.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class MixtureOfExperts(Protocol):
    """
    Check if the model is a mixture of experts (MoE) model.
    """

    expert_weights: MutableSequence[Iterable[Tensor]]
    """
    Expert weights saved in this rank.

    The first dimension indexes the layer, and the second indexes the
    different parameters in the layer, e.g. up/down projection weights.
    """

    num_moe_layers: int
    """Number of MoE layers in this model."""

    num_expert_groups: int
    """Number of expert groups in this model."""

    num_logical_experts: int
    """Number of logical experts in this model."""

    num_physical_experts: int
    """Number of physical experts in this model."""

    num_local_physical_experts: int
    """Number of local physical experts in this model."""

    num_routed_experts: int
    """Number of routed experts in this model."""

    num_shared_experts: int
    """Number of shared experts in this model."""

    num_redundant_experts: int
    """Number of redundant experts in this model."""

    def set_eplb_state(
        self,
        expert_load_view: Tensor,
        logical_to_physical_map: Tensor,
        logical_replica_count: Tensor,
    ) -> None:
        """
        Register the EPLB state in the MoE model.

        Since these are views of the actual EPLB state, any changes made by
        the EPLB algorithm are automatically reflected in the model's behavior
        without requiring additional method calls to set new states.

        You should also collect the model's `expert_weights` here instead of in
        the weight loader, since after initial weight loading, further
        processing like quantization may be applied to the weights.

        Args:
            expert_load_view: A view of the expert load metrics tensor.
            logical_to_physical_map: Mapping from logical to physical experts.
            logical_replica_count: Count of replicas for each logical expert.
        """
        ...

    def update_physical_experts_metadata(
        self,
        num_physical_experts: int,
        num_local_physical_experts: int,
    ) -> None:
        ...

expert_weights `instance-attribute` ¶

expert_weights: MutableSequence[Iterable[Tensor]]

Expert weights saved in this rank.

The first dimension indexes the layer, and the second indexes the different parameters in the layer, e.g. up/down projection weights.

num_expert_groups `instance-attribute` ¶

num_expert_groups: int

Number of expert groups in this model.

num_local_physical_experts `instance-attribute` ¶

num_local_physical_experts: int

Number of local physical experts in this model.

num_logical_experts `instance-attribute` ¶

num_logical_experts: int

Number of logical experts in this model.

num_moe_layers `instance-attribute` ¶

num_moe_layers: int

Number of MoE layers in this model.

num_physical_experts `instance-attribute` ¶

num_physical_experts: int

Number of physical experts in this model.

num_redundant_experts `instance-attribute` ¶

num_redundant_experts: int

Number of redundant experts in this model.

num_routed_experts `instance-attribute` ¶

num_routed_experts: int

Number of routed experts in this model.

num_shared_experts `instance-attribute` ¶

num_shared_experts: int

Number of shared experts in this model.

set_eplb_state ¶

set_eplb_state(
    expert_load_view: Tensor,
    logical_to_physical_map: Tensor,
    logical_replica_count: Tensor,
) -> None

Register the EPLB state in the MoE model.

Since these are views of the actual EPLB state, any changes made by the EPLB algorithm are automatically reflected in the model's behavior without requiring additional method calls to set new states.

You should also collect the model's expert_weights here instead of in the weight loader, since after initial weight loading, further processing like quantization may be applied to the weights.

Parameters:

Name	Type	Description	Default
`expert_load_view`	`Tensor`	A view of the expert load metrics tensor.	required
`logical_to_physical_map`	`Tensor`	Mapping from logical to physical experts.	required
`logical_replica_count`	`Tensor`	Count of replicas for each logical expert.	required
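
The "views" behavior described above can be sketched with plain Python lists standing in for tensors. `ToyMoELayer`, `physical_experts_for`, and the use of `-1` for unused slots are all assumptions made for this illustration. The only point is that the model stores references, not copies.

```python
class ToyMoELayer:
    """Hypothetical MoE layer that keeps references to shared EPLB state."""

    def __init__(self) -> None:
        self.expert_load_view: list[int] = []
        self.logical_to_physical_map: list[list[int]] = []
        self.logical_replica_count: list[int] = []

    def set_eplb_state(
        self,
        expert_load_view: list[int],
        logical_to_physical_map: list[list[int]],
        logical_replica_count: list[int],
    ) -> None:
        # Store the objects themselves (no copy) so that later in-place
        # updates by the balancer are visible here without another call.
        self.expert_load_view = expert_load_view
        self.logical_to_physical_map = logical_to_physical_map
        self.logical_replica_count = logical_replica_count

    def physical_experts_for(self, logical_expert: int) -> list[int]:
        count = self.logical_replica_count[logical_expert]
        return self.logical_to_physical_map[logical_expert][:count]


# Shared state owned by a (hypothetical) balancer: 2 logical experts,
# up to 2 physical replicas each, with -1 marking an unused slot.
load = [0, 0, 0]
l2p = [[0, -1], [1, -1]]
replicas = [1, 1]

layer = ToyMoELayer()
layer.set_eplb_state(load, l2p, replicas)
before = layer.physical_experts_for(0)

# The balancer adds a redundant replica of logical expert 0 in place.
l2p[0][1] = 2
replicas[0] = 2
after = layer.physical_experts_for(0)
```

If `set_eplb_state` copied its arguments instead, the second lookup would still return the stale mapping, which is why the state is registered once and then mutated in place.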

Source code in vllm/model_executor/models/interfaces.py

def set_eplb_state(
    self,
    expert_load_view: Tensor,
    logical_to_physical_map: Tensor,
    logical_replica_count: Tensor,
) -> None:
    """
    Register the EPLB state in the MoE model.

    Since these are views of the actual EPLB state, any changes made by
    the EPLB algorithm are automatically reflected in the model's behavior
    without requiring additional method calls to set new states.

    You should also collect the model's `expert_weights` here instead of in
    the weight loader, since after initial weight loading, further
    processing like quantization may be applied to the weights.

    Args:
        expert_load_view: A view of the expert load metrics tensor.
        logical_to_physical_map: Mapping from logical to physical experts.
        logical_replica_count: Count of replicas for each logical expert.
    """
    ...

update_physical_experts_metadata ¶

update_physical_experts_metadata(
    num_physical_experts: int,
    num_local_physical_experts: int,
) -> None

Source code in vllm/model_executor/models/interfaces.py

def update_physical_experts_metadata(
    self,
    num_physical_experts: int,
    num_local_physical_experts: int,
) -> None:
    ...

SupportsCrossEncoding ¶

Bases: Protocol

The interface required for all models that support cross encoding.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsCrossEncoding(Protocol):
    """The interface required for all models that support cross encoding."""

    supports_cross_encoding: ClassVar[Literal[True]] = True

supports_cross_encoding `class-attribute` ¶

supports_cross_encoding: Literal[True] = True

SupportsEagle3 ¶

Bases: Protocol

The interface required for models that support EAGLE3 speculative decoding.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsEagle3(Protocol):
    """The interface required for models that support 
    EAGLE3 speculative decoding."""

    supports_eagle3: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports EAGLE3 
    speculative decoding.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        """
        Set which layers should output auxiliary
        hidden states for EAGLE3.

        Args:
            layers: Tuple of layer indices that should output auxiliary
                hidden states.
        """
        ...

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        """
        Get the layer indices that should output auxiliary hidden states
        for EAGLE3.

        Returns:
            Tuple of layer indices for auxiliary hidden state outputs.
        """
        ...

supports_eagle3 `class-attribute` ¶

supports_eagle3: Literal[True] = True

A flag that indicates this model supports EAGLE3 speculative decoding.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

get_eagle3_aux_hidden_state_layers ¶

get_eagle3_aux_hidden_state_layers() -> tuple[int, ...]

Get the layer indices that should output auxiliary hidden states for EAGLE3.

Returns:

Type	Description
`tuple[int, ...]`	Tuple of layer indices for auxiliary hidden state outputs.

Source code in vllm/model_executor/models/interfaces.py

def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
    """
    Get the layer indices that should output auxiliary hidden states
    for EAGLE3.

    Returns:
        Tuple of layer indices for auxiliary hidden state outputs.
    """
    ...

set_aux_hidden_state_layers ¶

set_aux_hidden_state_layers(
    layers: tuple[int, ...],
) -> None

Set which layers should output auxiliary hidden states for EAGLE3.

Parameters:

Name	Type	Description	Default
`layers`	`tuple[int, ...]`	Tuple of layer indices that should output auxiliary hidden states.	required
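
A minimal sketch of a model exposing the two EAGLE3 methods is shown below. `ToyEagle3Model`, the range check, and the default layer choice (an early, a middle, and a late layer) are assumptions for illustration and are not taken from any particular vLLM model.

```python
from typing import ClassVar, Literal


class ToyEagle3Model:
    """Hypothetical target model; only the two protocol methods are sketched."""

    supports_eagle3: ClassVar[Literal[True]] = True

    def __init__(self, num_layers: int) -> None:
        self.num_layers = num_layers
        self.aux_hidden_state_layers: tuple[int, ...] = ()

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Reject indices that do not name a real layer.
        for idx in layers:
            if not 0 <= idx < self.num_layers:
                raise ValueError(f"layer index {idx} out of range")
        self.aux_hidden_state_layers = layers

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Assumed default for illustration: an early, a middle, and a late layer.
        n = self.num_layers
        return (2, n // 2, n - 3)


model = ToyEagle3Model(num_layers=32)
model.set_aux_hidden_state_layers(model.get_eagle3_aux_hidden_state_layers())
```

In this sketch the getter supplies the model's preferred layers and the setter records which layers should actually emit auxiliary hidden states, so a caller can either accept the default or override it.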

Source code in vllm/model_executor/models/interfaces.py

def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
    """
    Set which layers should output auxiliary
    hidden states for EAGLE3.

    Args:
        layers: Tuple of layer indices that should output auxiliary
            hidden states.
    """
    ...

SupportsLoRA ¶

Bases: Protocol

The interface required for all models that support LoRA.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsLoRA(Protocol):
    """The interface required for all models that support LoRA."""

    supports_lora: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports LoRA.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """
    # The `embedding_modules` and `embedding_padding_modules`
    # are empty by default.
    embedding_modules: ClassVar[dict[str, str]] = {}
    embedding_padding_modules: ClassVar[list[str]] = []
    packed_modules_mapping: ClassVar[dict[str, list[str]]] = {}

embedding_modules `class-attribute` ¶

embedding_modules: dict[str, str] = {}

embedding_padding_modules `class-attribute` ¶

embedding_padding_modules: list[str] = []

packed_modules_mapping `class-attribute` ¶

packed_modules_mapping: dict[str, list[str]] = {}

supports_lora `class-attribute` ¶

supports_lora: Literal[True] = True

A flag that indicates this model supports LoRA.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.
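
The sketch below shows how a model class might fill in these class attributes. `SupportsLoRASketch` is a local stand-in for the protocol, and the mapping values and the `expand_packed` helper are assumed examples (a fused module name mapped to the sub-modules it replaces), not definitions from this page.

```python
from typing import ClassVar, Literal


class SupportsLoRASketch:
    """Local stand-in carrying the same class attributes as `SupportsLoRA`."""

    supports_lora: ClassVar[Literal[True]] = True
    embedding_modules: ClassVar[dict[str, str]] = {}
    embedding_padding_modules: ClassVar[list[str]] = []
    packed_modules_mapping: ClassVar[dict[str, list[str]]] = {}


class ToyLlamaLikeModel(SupportsLoRASketch):
    # `supports_lora` is inherited through the MRO and is not redefined.

    # Assumed example values: fused module name -> original sub-module names.
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }


def expand_packed(model_cls: type, module_name: str) -> list[str]:
    """Return the sub-modules a packed module stands for (or itself)."""
    mapping = getattr(model_cls, "packed_modules_mapping", {})
    return mapping.get(module_name, [module_name])
```

Overriding only `packed_modules_mapping` leaves the two embedding attributes at their empty defaults, which matches the comment in the source that they are empty unless a model sets them.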

SupportsMRoPE ¶

Bases: Protocol

The interface required for all models that support M-RoPE.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsMRoPE(Protocol):
    """The interface required for all models that support M-RoPE."""

    supports_mrope: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports M-RoPE.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def get_mrope_input_positions(
        self,
        input_tokens: list[int],
        hf_config: PretrainedConfig,
        image_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
        video_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
        second_per_grid_ts: Optional[list[float]] = None,
        context_len: int = 0,
        seq_len: Optional[int] = None,
        audio_feature_lengths: Optional[torch.Tensor] = None,
        use_audio_in_video: bool = False,
    ) -> tuple[torch.Tensor, int]:
        """
        Get M-RoPE input positions and delta value for this specific model.

        This method should be implemented by each model that supports M-RoPE
        to provide model-specific logic for computing input positions.

        Args:
            input_tokens: List of input token IDs
            hf_config: HuggingFace model configuration
            image_grid_thw: Image grid dimensions (t, h, w)
            video_grid_thw: Video grid dimensions (t, h, w)
            second_per_grid_ts: Seconds per grid timestep for videos
            context_len: Context length
            seq_len: Sequence length
            audio_feature_lengths: Audio feature lengths for multimodal models
            use_audio_in_video: Whether to use audio in video for interleaving

        Returns:
            Tuple of (llm_positions, mrope_position_delta)
            - llm_positions: Tensor of shape [3, num_tokens]
                with T/H/W positions
            - mrope_position_delta: Delta for position calculations
        """
        ...

supports_mrope `class-attribute` ¶

supports_mrope: Literal[True] = True

A flag that indicates this model supports M-RoPE.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

get_mrope_input_positions ¶

get_mrope_input_positions(
    input_tokens: list[int],
    hf_config: PretrainedConfig,
    image_grid_thw: Optional[
        Union[list[list[int]], Tensor]
    ],
    video_grid_thw: Optional[
        Union[list[list[int]], Tensor]
    ],
    second_per_grid_ts: Optional[list[float]] = None,
    context_len: int = 0,
    seq_len: Optional[int] = None,
    audio_feature_lengths: Optional[Tensor] = None,
    use_audio_in_video: bool = False,
) -> tuple[Tensor, int]

Get M-RoPE input positions and delta value for this specific model.

This method should be implemented by each model that supports M-RoPE to provide model-specific logic for computing input positions.

Parameters:

Name	Type	Description	Default
`input_tokens`	`list[int]`	List of input token IDs	required
`hf_config`	`PretrainedConfig`	HuggingFace model configuration	required
`image_grid_thw`	`Optional[Union[list[list[int]], Tensor]]`	Image grid dimensions (t, h, w)	required
`video_grid_thw`	`Optional[Union[list[list[int]], Tensor]]`	Video grid dimensions (t, h, w)	required
`second_per_grid_ts`	`Optional[list[float]]`	Seconds per grid timestep for videos	`None`
`context_len`	`int`	Context length	`0`
`seq_len`	`Optional[int]`	Sequence length	`None`
`audio_feature_lengths`	`Optional[Tensor]`	Audio feature lengths for multimodal models	`None`
`use_audio_in_video`	`bool`	Whether to use audio in video for interleaving	`False`

Returns:

Type	Description
`tuple[Tensor, int]`	Tuple of (llm_positions, mrope_position_delta): llm_positions is a `Tensor` of shape [3, num_tokens] with T/H/W positions, and mrope_position_delta is an `int` delta for position calculations.
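
For intuition about the return value, the sketch below covers only the simplest case, a prompt with no images, video, or audio, and uses nested lists in place of tensors. The function name, the delta formula, and the `[context_len:seq_len]` slicing are assumptions for illustration. Real models compute positions with model-specific logic.

```python
from typing import Optional


def text_only_mrope_positions(
    input_tokens: list[int],
    context_len: int = 0,
    seq_len: Optional[int] = None,
) -> tuple[list[list[int]], int]:
    """Text-only simplification: T, H and W rows all hold the token index."""
    num_tokens = len(input_tokens)
    row = list(range(num_tokens))
    # Shape [3, num_tokens]: one row each for the T, H and W axes.
    llm_positions = [row[:], row[:], row[:]]
    # Delta between the next position to assign and the token count;
    # it is 0 here because text positions advance one per token.
    max_position = row[-1] if row else -1
    mrope_position_delta = max_position + 1 - num_tokens
    # Keep only the window [context_len, seq_len) of each row.
    llm_positions = [r[context_len:seq_len] for r in llm_positions]
    return llm_positions, mrope_position_delta


positions, delta = text_only_mrope_positions([11, 12, 13, 14, 15], context_len=2)
```

With multimodal items in the prompt the three rows would diverge, since image and video tokens advance along the height and width axes, and the delta would generally be non-zero.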

Source code in vllm/model_executor/models/interfaces.py

def get_mrope_input_positions(
    self,
    input_tokens: list[int],
    hf_config: PretrainedConfig,
    image_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
    video_grid_thw: Optional[Union[list[list[int]], torch.Tensor]],
    second_per_grid_ts: Optional[list[float]] = None,
    context_len: int = 0,
    seq_len: Optional[int] = None,
    audio_feature_lengths: Optional[torch.Tensor] = None,
    use_audio_in_video: bool = False,
) -> tuple[torch.Tensor, int]:
    """
    Get M-RoPE input positions and delta value for this specific model.

    This method should be implemented by each model that supports M-RoPE
    to provide model-specific logic for computing input positions.

    Args:
        input_tokens: List of input token IDs
        hf_config: HuggingFace model configuration
        image_grid_thw: Image grid dimensions (t, h, w)
        video_grid_thw: Video grid dimensions (t, h, w)
        second_per_grid_ts: Seconds per grid timestep for videos
        context_len: Context length
        seq_len: Sequence length
        audio_feature_lengths: Audio feature lengths for multimodal models
        use_audio_in_video: Whether to use audio in video for interleaving

    Returns:
        Tuple of (llm_positions, mrope_position_delta)
        - llm_positions: Tensor of shape [3, num_tokens]
            with T/H/W positions
        - mrope_position_delta: Delta for position calculations
    """
    ...

SupportsMultiModal ¶

Bases: Protocol

The interface required for all multi-modal models.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsMultiModal(Protocol):
    """The interface required for all multi-modal models."""

    supports_multimodal: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports multi-modal inputs.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    supports_multimodal_raw_input_only: ClassVar[bool] = False
    """
    A flag that indicates this model supports multi-modal inputs and processes
    them in their raw form and not embeddings.
    """

    supports_encoder_tp_data: ClassVar[bool] = False
    """
    A flag that indicates whether this model supports
    `multimodal_config.mm_encoder_tp_mode="data"`.
    """

    merge_by_field_config: ClassVar[bool] = False
    """
    A flag that indicates which implementation of
    `vllm.multimodal.utils.group_mm_kwargs_by_modality` to use.
    """

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
        """
        Get the placeholder text for the `i`th `modality` item in the prompt.
        """
        ...

    def get_multimodal_embeddings(self,
                                  **kwargs: object) -> MultiModalEmbeddings:
        """
        Returns multimodal embeddings generated from multimodal kwargs 
        to be merged with text embeddings.

        Note:
            The returned multimodal embeddings must be in the same order as
            the appearances of their corresponding multimodal data item in the
            input prompt.
        """
        ...

    def get_language_model(self) -> VllmModel:
        """
        Returns the underlying language model used for text generation.

        This is typically the `torch.nn.Module` instance responsible for
        processing the merged multimodal embeddings and producing hidden states.

        Returns:
            torch.nn.Module: The core language model component.
        """
        ...

    @overload
    def get_input_embeddings(self, input_ids: Tensor) -> Tensor:
        ...

    @overload
    def get_input_embeddings(
        self,
        input_ids: Tensor,
        multimodal_embeddings: MultiModalEmbeddings,
        *,
        is_multimodal: torch.Tensor,
        handle_oov_mm_token: bool = False,
    ) -> Tensor:
        ...

    def _get_text_embeddings(
        self,
        input_ids: Tensor,
        get_input_embeddings: Callable[[Tensor], Tensor],
        *,
        is_multimodal: Optional[Tensor],
        handle_oov_mm_token: bool,
    ) -> Tensor:
        if handle_oov_mm_token and is_multimodal is not None:
            is_text = ~is_multimodal
            text_embeds = get_input_embeddings(input_ids[is_text])

            return torch.empty(
                (input_ids.shape[0], text_embeds.shape[1]),
                dtype=text_embeds.dtype,
                device=text_embeds.device,
            ).masked_scatter_(is_text.unsqueeze_(-1), text_embeds)

        return get_input_embeddings(input_ids)

    def get_input_embeddings(
        self,
        input_ids: Tensor,
        multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
        *,
        is_multimodal: Optional[Tensor] = None,
        handle_oov_mm_token: bool = False,
    ) -> Tensor:
        """
        Apply token embeddings to `input_ids`.

        If `multimodal_embeddings` is passed, scatter them into the
        resulting embeddings at the positions marked by the mask
        `is_multimodal`.

        In case the multi-modal token IDs exceed the vocabulary size of
        the language model, you can set `handle_oov_mm_token=True`
        to avoid calling the language model's `get_input_embeddings` method
        on those tokens. Note however that doing so increases memory usage
        as an additional buffer is needed to hold the input embeddings.
        """
        from .utils import _merge_multimodal_embeddings

        inputs_embeds = self._get_text_embeddings(
            input_ids,
            self.get_language_model().get_input_embeddings,
            is_multimodal=is_multimodal,
            handle_oov_mm_token=handle_oov_mm_token,
        )

        if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
            return inputs_embeds

        if is_multimodal is None:
            raise ValueError(
                "`get_input_embeddings` now requires `is_multimodal` arg, "
                "please update your model runner according to "
                "https://github.com/vllm-project/vllm/pull/16229.")

        return _merge_multimodal_embeddings(
            inputs_embeds=inputs_embeds,
            multimodal_embeddings=multimodal_embeddings,
            is_multimodal=is_multimodal,
        )

merge_by_field_config `class-attribute` ¶

merge_by_field_config: bool = False

A flag that indicates which implementation of vllm.multimodal.utils.group_mm_kwargs_by_modality to use.

supports_encoder_tp_data `class-attribute` ¶

supports_encoder_tp_data: bool = False

A flag that indicates whether this model supports multimodal_config.mm_encoder_tp_mode="data".

supports_multimodal `class-attribute` ¶

supports_multimodal: Literal[True] = True

A flag that indicates this model supports multi-modal inputs.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

supports_multimodal_raw_input_only `class-attribute` ¶

supports_multimodal_raw_input_only: bool = False

A flag that indicates this model supports multi-modal inputs and processes them in their raw form and not embeddings.

_get_text_embeddings ¶

_get_text_embeddings(
    input_ids: Tensor,
    get_input_embeddings: Callable[[Tensor], Tensor],
    *,
    is_multimodal: Optional[Tensor],
    handle_oov_mm_token: bool,
) -> Tensor

Source code in vllm/model_executor/models/interfaces.py

def _get_text_embeddings(
    self,
    input_ids: Tensor,
    get_input_embeddings: Callable[[Tensor], Tensor],
    *,
    is_multimodal: Optional[Tensor],
    handle_oov_mm_token: bool,
) -> Tensor:
    if handle_oov_mm_token and is_multimodal is not None:
        is_text = ~is_multimodal
        text_embeds = get_input_embeddings(input_ids[is_text])

        return torch.empty(
            (input_ids.shape[0], text_embeds.shape[1]),
            dtype=text_embeds.dtype,
            device=text_embeds.device,
        ).masked_scatter_(is_text.unsqueeze_(-1), text_embeds)

    return get_input_embeddings(input_ids)

get_input_embeddings ¶

get_input_embeddings(input_ids: Tensor) -> Tensor

get_input_embeddings(
    input_ids: Tensor,
    multimodal_embeddings: MultiModalEmbeddings,
    *,
    is_multimodal: Tensor,
    handle_oov_mm_token: bool = False,
) -> Tensor

get_input_embeddings(
    input_ids: Tensor,
    multimodal_embeddings: Optional[
        MultiModalEmbeddings
    ] = None,
    *,
    is_multimodal: Optional[Tensor] = None,
    handle_oov_mm_token: bool = False,
) -> Tensor

Apply token embeddings to input_ids.

If multimodal_embeddings is passed, scatter them into the resulting embeddings at the positions marked by the mask is_multimodal.

In case the multi-modal token IDs exceed the vocabulary size of the language model, you can set handle_oov_mm_token=True to avoid calling the language model's get_input_embeddings method on those tokens. Note however that doing so increases memory usage, as an additional buffer is needed to hold the input embeddings.
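
The control flow of the source listing below can be sketched without tensors, using lists of floats as embedding rows. Everything named here (`merge_embeddings`, `embed_text`, `VOCAB`, the ID `99`) is hypothetical, and `multimodal_embeddings` is assumed to be already flattened to one row per placeholder token.

```python
from typing import Callable, Optional

Vector = list[float]


def merge_embeddings(
    input_ids: list[int],
    embed_text: Callable[[list[int]], list[Vector]],
    multimodal_embeddings: Optional[list[Vector]] = None,
    *,
    is_multimodal: Optional[list[bool]] = None,
    handle_oov_mm_token: bool = False,
) -> list[Vector]:
    if handle_oov_mm_token and is_multimodal is not None:
        # Embed only the text tokens, then place them into a fresh buffer
        # so out-of-vocabulary placeholder IDs never reach `embed_text`.
        text_ids = [t for t, mm in zip(input_ids, is_multimodal) if not mm]
        text_iter = iter(embed_text(text_ids))
        embeds = [[] if mm else next(text_iter) for mm in is_multimodal]
    else:
        embeds = embed_text(input_ids)

    if not multimodal_embeddings:
        return embeds
    if is_multimodal is None:
        raise ValueError("is_multimodal is required when merging")

    # Overwrite the masked positions with the multimodal rows, in order.
    mm_iter = iter(multimodal_embeddings)
    return [next(mm_iter) if mm else e for e, mm in zip(embeds, is_multimodal)]


VOCAB = {0: [0.0, 0.0], 1: [1.0, 1.0]}


def embed_text(ids: list[int]) -> list[Vector]:
    # Raises KeyError for IDs outside the toy vocabulary.
    return [VOCAB[i] for i in ids]


merged = merge_embeddings(
    [0, 99, 1],  # 99 is an out-of-vocabulary placeholder ID
    embed_text,
    [[9.0, 9.0]],
    is_multimodal=[False, True, False],
    handle_oov_mm_token=True,
)
```

In this toy setup the same call with `handle_oov_mm_token` left at `False` passes the ID `99` to `embed_text` and fails, which is the situation the flag exists to avoid.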

Source code in vllm/model_executor/models/interfaces.py

def get_input_embeddings(
    self,
    input_ids: Tensor,
    multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
    *,
    is_multimodal: Optional[Tensor] = None,
    handle_oov_mm_token: bool = False,
) -> Tensor:
    """
    Apply token embeddings to `input_ids`.

    If `multimodal_embeddings` is passed, scatter them into the
    resulting embeddings at the positions marked by the mask
    `is_multimodal`.

    In case the multi-modal token IDs exceed the vocabulary size of
    the language model, you can set `handle_oov_mm_token=True`
    to avoid calling the language model's `get_input_embeddings` method
    on those tokens. Note however that doing so increases memory usage
    as an additional buffer is needed to hold the input embeddings.
    """
    from .utils import _merge_multimodal_embeddings

    inputs_embeds = self._get_text_embeddings(
        input_ids,
        self.get_language_model().get_input_embeddings,
        is_multimodal=is_multimodal,
        handle_oov_mm_token=handle_oov_mm_token,
    )

    if multimodal_embeddings is None or len(multimodal_embeddings) == 0:
        return inputs_embeds

    if is_multimodal is None:
        raise ValueError(
            "`get_input_embeddings` now requires `is_multimodal` arg, "
            "please update your model runner according to "
            "https://github.com/vllm-project/vllm/pull/16229.")

    return _merge_multimodal_embeddings(
        inputs_embeds=inputs_embeds,
        multimodal_embeddings=multimodal_embeddings,
        is_multimodal=is_multimodal,
    )

get_language_model ¶

get_language_model() -> VllmModel

Returns the underlying language model used for text generation.

This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.

Returns:

Type	Description
`VllmModel`	torch.nn.Module: The core language model component.

Source code in vllm/model_executor/models/interfaces.py

def get_language_model(self) -> VllmModel:
    """
    Returns the underlying language model used for text generation.

    This is typically the `torch.nn.Module` instance responsible for
    processing the merged multimodal embeddings and producing hidden states.

    Returns:
        torch.nn.Module: The core language model component.
    """
    ...

get_multimodal_embeddings ¶

get_multimodal_embeddings(
    **kwargs: object,
) -> MultiModalEmbeddings

Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.

Note

The returned multimodal embeddings must be in the same order as the appearances of their corresponding multimodal data item in the input prompt.

Source code in vllm/model_executor/models/interfaces.py

def get_multimodal_embeddings(self,
                              **kwargs: object) -> MultiModalEmbeddings:
    """
    Returns multimodal embeddings generated from multimodal kwargs 
    to be merged with text embeddings.

    Note:
        The returned multimodal embeddings must be in the same order as
        the appearances of their corresponding multimodal data item in the
        input prompt.
    """
    ...

get_placeholder_str `classmethod` ¶

get_placeholder_str(modality: str, i: int) -> Optional[str]

Get the placeholder text for the `i`-th `modality` item in the prompt.
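
A minimal sketch of an implementation for an image-only model follows. `ToyImageModel`, the `"<image>"` placeholder text, and the choice to return `None` for other modalities are assumptions for illustration.

```python
from typing import Optional


class ToyImageModel:
    """Hypothetical model that accepts images only."""

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
        # `i` is the index of the item within its modality; this toy
        # model uses the same placeholder text for every image.
        if modality.startswith("image"):
            return "<image>"
        # Returning None here is an assumption for "no placeholder text".
        return None
```

A model whose prompt format numbers its items could instead use `i` to build the text, for example by formatting the index into the placeholder.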

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
    """
    Get the placeholder text for the `i`th `modality` item in the prompt.
    """
    ...

SupportsMultiModalPruning ¶

Bases: Protocol

The interface required for models that support returning both input embeddings and positions. A model may require custom positions for dynamic pruning of multimodal embeddings.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsMultiModalPruning(Protocol):
    """The interface required for models that support returning both input
    embeddings and positions. Model may require custom positions for dynamic
    pruning of multimodal embeddings.
    """
    supports_multimodal_pruning: ClassVar[Literal[True]] = True

    def recompute_mrope_positions(
            self, input_ids: list[int],
            multimodal_embeddings: MultiModalEmbeddings,
            mrope_positions: torch.LongTensor, num_computed_tokens: int
    ) -> tuple[MultiModalEmbeddings, Tensor, int]:
        """
        Update the part of the input mrope positions starting at index
        num_computed_tokens. The original mrope_positions are computed
        for the unpruned sequence and become incorrect once pruning occurs,
        so after pruning media tokens the change must be reflected in
        mrope_positions before they are fed to the LLM.

        Args:
            input_ids: (N,) All input tokens of the prompt, covering the
                entire sequence.
            multimodal_embeddings: Tuple of multimodal embeddings that
                fit into the prefill chunk being processed.
            mrope_positions: Existing mrope positions (3, N) for the entire
                sequence.
            num_computed_tokens: The number of tokens computed so far.

        Returns:
            Tuple of (multimodal_embeddings, mrope_positions,
                mrope_position_delta).
        """
        ...

supports_multimodal_pruning `class-attribute` ¶

supports_multimodal_pruning: Literal[True] = True

recompute_mrope_positions ¶

recompute_mrope_positions(
    input_ids: list[int],
    multimodal_embeddings: MultiModalEmbeddings,
    mrope_positions: LongTensor,
    num_computed_tokens: int,
) -> tuple[MultiModalEmbeddings, Tensor, int]

Update the part of the input mrope positions starting at index num_computed_tokens. The original mrope_positions are computed for the unpruned sequence and become incorrect once pruning occurs, so after pruning media tokens the change must be reflected in mrope_positions before they are fed to the LLM.

Parameters:

Name	Type	Description	Default
`input_ids`	`list[int]`	(N,) All input tokens of the prompt, covering the entire sequence.	required
`multimodal_embeddings`	`MultiModalEmbeddings`	Tuple of multimodal embeddings that fit into the prefill chunk being processed.	required
`mrope_positions`	`LongTensor`	Existing mrope positions (3, N) for the entire sequence.	required
`num_computed_tokens`	`int`	The number of tokens computed so far.	required

Returns:

Type	Description
`tuple[MultiModalEmbeddings, Tensor, int]`	Tuple of (multimodal_embeddings, mrope_positions, mrope_position_delta).

Source code in vllm/model_executor/models/interfaces.py

def recompute_mrope_positions(
        self, input_ids: list[int],
        multimodal_embeddings: MultiModalEmbeddings,
        mrope_positions: torch.LongTensor, num_computed_tokens: int
) -> tuple[MultiModalEmbeddings, Tensor, int]:
    """
    Update the part of the input mrope positions starting at index
    num_computed_tokens. The original mrope_positions are computed
    for the unpruned sequence and become incorrect once pruning occurs,
    so after pruning media tokens the change must be reflected in
    mrope_positions before they are fed to the LLM.

    Args:
        input_ids: (N,) All input tokens of the prompt, covering the
            entire sequence.
        multimodal_embeddings: Tuple of multimodal embeddings that
            fit into the prefill chunk being processed.
        mrope_positions: Existing mrope positions (3, N) for the entire
            sequence.
        num_computed_tokens: The number of tokens computed so far.

    Returns:
        Tuple of (multimodal_embeddings, mrope_positions,
            mrope_position_delta).
    """
    ...

SupportsPP ¶

Bases: Protocol

The interface required for all models that support pipeline parallelism.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsPP(Protocol):
    """The interface required for all models that support pipeline parallel."""

    supports_pp: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports pipeline parallel.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    def make_empty_intermediate_tensors(
        self,
        batch_size: int,
        dtype: torch.dtype,
        device: torch.device,
    ) -> "IntermediateTensors":
        """Called when PP rank > 0 for profiling purposes."""
        ...

    def forward(
        self,
        *,
        intermediate_tensors: Optional["IntermediateTensors"],
    ) -> Union[Tensor, "IntermediateTensors"]:
        """
        Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
        PP rank > 0.

        Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] for
        every PP rank except the last, which returns the final `Tensor`.
        """
        ...

supports_pp `class-attribute` ¶

supports_pp: Literal[True] = True

A flag that indicates this model supports pipeline parallelism.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

forward ¶

forward(
    *, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]

Accept IntermediateTensors when PP rank > 0.

Return IntermediateTensors for every PP rank except the last, which returns the final Tensor.

Source code in vllm/model_executor/models/interfaces.py

def forward(
    self,
    *,
    intermediate_tensors: Optional["IntermediateTensors"],
) -> Union[Tensor, "IntermediateTensors"]:
    """
    Accept [`IntermediateTensors`][vllm.sequence.IntermediateTensors] when
    PP rank > 0.

    Return [`IntermediateTensors`][vllm.sequence.IntermediateTensors] for
    every PP rank except the last, which returns the final `Tensor`.
    """
    ...

make_empty_intermediate_tensors ¶

make_empty_intermediate_tensors(
    batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors

Called when PP rank > 0 for profiling purposes.

Source code in vllm/model_executor/models/interfaces.py

def make_empty_intermediate_tensors(
    self,
    batch_size: int,
    dtype: torch.dtype,
    device: torch.device,
) -> "IntermediateTensors":
    """Called when PP rank > 0 for profiling purposes."""
    ...

SupportsQuant ¶

The interface required for all models that support quantization.

Source code in vllm/model_executor/models/interfaces.py

class SupportsQuant:
    """The interface required for all models that support quantization."""

    hf_to_vllm_mapper: ClassVar[Optional["WeightsMapper"]] = None
    packed_modules_mapping: ClassVar[Optional[dict[str, list[str]]]] = None
    quant_config: Optional[QuantizationConfig] = None

    def __new__(cls, *args, **kwargs) -> Self:
        instance = super().__new__(cls)

        # find config passed in arguments
        quant_config = cls._find_quant_config(*args, **kwargs)
        if quant_config is not None:

            # attach config to model for general use
            instance.quant_config = quant_config

            # apply model mappings to config for proper config-model matching
            if (hf_to_vllm_mapper := instance.hf_to_vllm_mapper) is not None:
                instance.quant_config.apply_vllm_mapper(hf_to_vllm_mapper)
            if instance.packed_modules_mapping is not None:
                instance.quant_config.packed_modules_mapping.update(
                    instance.packed_modules_mapping)

        return instance

    @staticmethod
    def _find_quant_config(*args, **kwargs) -> Optional[QuantizationConfig]:
        """Find quant config passed through model constructor args"""
        from vllm.config import VllmConfig  # avoid circular import

        args_values = list(args) + list(kwargs.values())
        for arg in args_values:
            if isinstance(arg, VllmConfig):
                return arg.quant_config

            if isinstance(arg, QuantizationConfig):
                return arg

        return None

hf_to_vllm_mapper `class-attribute` ¶

hf_to_vllm_mapper: Optional[WeightsMapper] = None

packed_modules_mapping `class-attribute` ¶

packed_modules_mapping: Optional[dict[str, list[str]]] = (
    None
)

quant_config `class-attribute` `instance-attribute` ¶

quant_config: Optional[QuantizationConfig] = None

__new__ ¶

__new__(*args, **kwargs) -> Self

Source code in vllm/model_executor/models/interfaces.py

def __new__(cls, *args, **kwargs) -> Self:
    instance = super().__new__(cls)

    # find config passed in arguments
    quant_config = cls._find_quant_config(*args, **kwargs)
    if quant_config is not None:

        # attach config to model for general use
        instance.quant_config = quant_config

        # apply model mappings to config for proper config-model matching
        if (hf_to_vllm_mapper := instance.hf_to_vllm_mapper) is not None:
            instance.quant_config.apply_vllm_mapper(hf_to_vllm_mapper)
        if instance.packed_modules_mapping is not None:
            instance.quant_config.packed_modules_mapping.update(
                instance.packed_modules_mapping)

    return instance

_find_quant_config `staticmethod` ¶

_find_quant_config(
    *args, **kwargs
) -> Optional[QuantizationConfig]

Find the quant config passed through the model constructor arguments.

Source code in vllm/model_executor/models/interfaces.py

@staticmethod
def _find_quant_config(*args, **kwargs) -> Optional[QuantizationConfig]:
    """Find quant config passed through model constructor args"""
    from vllm.config import VllmConfig  # avoid circular import

    args_values = list(args) + list(kwargs.values())
    for arg in args_values:
        if isinstance(arg, VllmConfig):
            return arg.quant_config

        if isinstance(arg, QuantizationConfig):
            return arg

    return None
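
The `__new__` hook above attaches the quantization config before `__init__` runs, by scanning the constructor arguments. A minimal sketch of that pattern, using stand-in classes rather than the real vLLM types, might look as follows. `ToyQuantConfig`, `ToySupportsQuant`, and `ToyModel` are hypothetical names, and the mapper step is omitted.

```python
from typing import Optional


class ToyQuantConfig:
    """Hypothetical stand-in for QuantizationConfig."""

    def __init__(self) -> None:
        self.packed_modules_mapping: dict[str, list[str]] = {}


class ToySupportsQuant:
    packed_modules_mapping: Optional[dict[str, list[str]]] = None
    quant_config: Optional[ToyQuantConfig] = None

    def __new__(cls, *args, **kwargs):
        instance = super().__new__(cls)
        # Scan positional and keyword arguments for the first config object.
        for arg in list(args) + list(kwargs.values()):
            if isinstance(arg, ToyQuantConfig):
                instance.quant_config = arg
                # Merge the model's packed-module mapping into the config.
                if instance.packed_modules_mapping is not None:
                    arg.packed_modules_mapping.update(
                        instance.packed_modules_mapping)
                break
        return instance


class ToyModel(ToySupportsQuant):
    packed_modules_mapping = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}

    def __init__(self, *, quant_config: Optional[ToyQuantConfig] = None):
        # __new__ has already attached quant_config at this point.
        self.config_seen_in_init = self.quant_config


config = ToyQuantConfig()
model = ToyModel(quant_config=config)
```

Doing the lookup in `__new__` means a subclass does not have to remember to store the config itself; it is available as `self.quant_config` from the first line of `__init__`.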

SupportsScoreTemplate ¶

Bases: Protocol

The interface required for all models that support score templates.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsScoreTemplate(Protocol):
    """The interface required for all models that support score template."""

    supports_score_template: ClassVar[Literal[True]] = True
    """
    A flag that indicates this model supports score template.

    Note:
        There is no need to redefine this flag if this class is in the
        MRO of your model class.
    """

    @classmethod
    def get_score_template(cls, query: str, document: str) -> Optional[str]:
        """
        Generate a full prompt by populating the score template with query and document content.
        """ # noqa: E501
        ...

    @classmethod
    def post_process_tokens(cls, prompt: TokensPrompt) -> None:
        """
        Perform architecture-specific manipulations on the input tokens.
        """
        ...

supports_score_template `class-attribute` ¶

supports_score_template: Literal[True] = True

A flag that indicates this model supports score templates.

Note

There is no need to redefine this flag if this class is in the MRO of your model class.

get_score_template `classmethod` ¶

get_score_template(
    query: str, document: str
) -> Optional[str]

Generate a full prompt by populating the score template with query and document content.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_score_template(cls, query: str, document: str) -> Optional[str]:
    """
    Generate a full prompt by populating the score template with query and document content.
    """ # noqa: E501
    ...

post_process_tokens `classmethod` ¶

post_process_tokens(prompt: TokensPrompt) -> None

Perform architecture-specific manipulations on the input tokens.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def post_process_tokens(cls, prompt: TokensPrompt) -> None:
    """
    Perform architecture-specific manipulations on the input tokens.
    """
    ...
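
To make the two classmethods above more concrete, here is a small sketch of a class that fills a template from a query and a document and then adjusts the tokenized prompt in place. The class `ToyReranker`, the template wording, and the appended token id are all hypothetical. A plain `dict` with a `prompt_token_ids` key stands in for `TokensPrompt`.

```python
from typing import Optional


class ToyReranker:
    """Hypothetical cross-encoder; the template text is illustrative only."""

    supports_score_template = True

    @classmethod
    def get_score_template(cls, query: str, document: str) -> Optional[str]:
        # Populate a fixed template with the query and document content.
        return f"Query: {query}\nDocument: {document}\nRelevant:"

    @classmethod
    def post_process_tokens(cls, prompt: dict) -> None:
        # Mutates the prompt in place and returns nothing; the appended
        # token id is a made-up separator.
        prompt["prompt_token_ids"].append(2)


full_prompt = ToyReranker.get_score_template(
    "capital of France?", "Paris is the capital.")
tokens_prompt = {"prompt_token_ids": [101, 7592]}
ToyReranker.post_process_tokens(tokens_prompt)
```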

SupportsTranscription ¶

Bases: Protocol

The interface required for all models that support transcription.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsTranscription(Protocol):
    """The interface required for all models that support transcription."""
    # Mapping from ISO639_1 language codes: language names
    supported_languages: ClassVar[Mapping[str, str]]

    supports_transcription: ClassVar[Literal[True]] = True

    supports_transcription_only: ClassVar[bool] = False
    """
    Transcription models can opt out of text generation by setting this to
    `True`.
    """

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # language codes in supported_languages
        # that don't exist in the full language map
        invalid = set(cls.supported_languages) - set(LANGUAGES.keys())
        if invalid:
            raise ValueError(
                f"{cls.__name__}.supported_languages contains invalid "
                f"language codes: {sorted(invalid)}.\n"
                f"Valid choices are: {sorted(LANGUAGES.keys())}")

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray,
                              stt_config: SpeechToTextConfig,
                              model_config: ModelConfig,
                              language: Optional[str],
                              task_type: Literal["transcribe", "translate"],
                              request_prompt: str,
                              to_language: Optional[str]) -> PromptType:
        """Get the prompt for the ASR model.
        The model has control over the construction, as long as it
        returns a valid PromptType."""
        ...

    @classmethod
    def get_other_languages(cls) -> Mapping[str, str]:
        # other possible language codes from the whisper map
        return {
            k: v
            for k, v in LANGUAGES.items() if k not in cls.supported_languages
        }

    @classmethod
    def validate_language(cls, language: Optional[str]) -> Optional[str]:
        """
        Ensure the language specified in the transcription request 
        is a valid ISO 639-1 language code. If the request language is 
        valid, but not natively supported by the model, trigger a 
        warning (but not an exception).
        """
        if language is None or language in cls.supported_languages:
            return language
        elif language in cls.get_other_languages():
            logger.warning(
                "Language %r is not natively supported by %s; "
                "results may be less accurate. Supported languages: %r",
                language,
                cls.__name__,
                list(cls.supported_languages.keys()),
            )
            return language
        else:
            raise ValueError(
                f"Unsupported language: {language!r}.  Must be one of "
                f"{list(cls.supported_languages.keys())}.")

    @classmethod
    def get_speech_to_text_config(
            cls, model_config: ModelConfig,
            task_type: Literal["transcribe",
                               "translate"]) -> SpeechToTextConfig:
        """Get the speech to text config for the ASR model."""
        ...

    @classmethod
    def get_num_audio_tokens(cls, audio_duration_s: float,
                             stt_config: SpeechToTextConfig,
                             model_config: ModelConfig) -> Optional[int]:
        """
        Map from audio duration to number of audio tokens produced by the ASR 
        model, without running a forward pass.
        This is used for estimating the amount of processing for this audio.
        """
        return None

supported_languages `class-attribute` ¶

supported_languages: Mapping[str, str]

supports_transcription `class-attribute` ¶

supports_transcription: Literal[True] = True

supports_transcription_only `class-attribute` ¶

supports_transcription_only: bool = False

Transcription models can opt out of text generation by setting this to True.

__init_subclass__ ¶

__init_subclass__(**kwargs)

Source code in vllm/model_executor/models/interfaces.py

def __init_subclass__(cls, **kwargs):
    super().__init_subclass__(**kwargs)
    # language codes in supported_languages
    # that don't exist in the full language map
    invalid = set(cls.supported_languages) - set(LANGUAGES.keys())
    if invalid:
        raise ValueError(
            f"{cls.__name__}.supported_languages contains invalid "
            f"language codes: {sorted(invalid)}.\n"
            f"Valid choices are: {sorted(LANGUAGES.keys())}")
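
The hook above runs when a subclass is defined, so an invalid language code fails at class-definition time rather than at request time. A minimal sketch of that behavior, with a three-entry stand-in for `LANGUAGES` and hypothetical class names, could look like this:

```python
from collections.abc import Mapping
from typing import ClassVar

# Tiny stand-in for the full language map; the real map is much larger.
LANGUAGES = {"en": "english", "de": "german", "fr": "french"}


class ToyTranscriptionBase:
    supported_languages: ClassVar[Mapping[str, str]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Reject codes that do not appear in the full language map.
        invalid = set(cls.supported_languages) - set(LANGUAGES)
        if invalid:
            raise ValueError(
                f"{cls.__name__}.supported_languages contains invalid "
                f"language codes: {sorted(invalid)}")


class GoodModel(ToyTranscriptionBase):
    supported_languages = {"en": "english", "de": "german"}


try:
    # The class statement itself raises, so BadModel is never bound.
    class BadModel(ToyTranscriptionBase):
        supported_languages = {"en": "english", "xx": "not-a-language"}
except ValueError as exc:
    definition_error = str(exc)
else:
    definition_error = None
```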

get_generation_prompt `classmethod` ¶

get_generation_prompt(
    audio: ndarray,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
    language: Optional[str],
    task_type: Literal["transcribe", "translate"],
    request_prompt: str,
    to_language: Optional[str],
) -> PromptType

Get the prompt for the ASR model. The model has control over the construction, as long as it returns a valid PromptType.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_generation_prompt(cls, audio: np.ndarray,
                          stt_config: SpeechToTextConfig,
                          model_config: ModelConfig,
                          language: Optional[str],
                          task_type: Literal["transcribe", "translate"],
                          request_prompt: str,
                          to_language: Optional[str]) -> PromptType:
    """Get the prompt for the ASR model.
    The model has control over the construction, as long as it
    returns a valid PromptType."""
    ...

get_num_audio_tokens `classmethod` ¶

get_num_audio_tokens(
    audio_duration_s: float,
    stt_config: SpeechToTextConfig,
    model_config: ModelConfig,
) -> Optional[int]

Map from audio duration to number of audio tokens produced by the ASR model, without running a forward pass. This is used for estimating the amount of processing for this audio.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_num_audio_tokens(cls, audio_duration_s: float,
                         stt_config: SpeechToTextConfig,
                         model_config: ModelConfig) -> Optional[int]:
    """
    Map from audio duration to number of audio tokens produced by the ASR 
    model, without running a forward pass.
    This is used for estimating the amount of processing for this audio.
    """
    return None
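
The default implementation above returns `None`, meaning no estimate is available. A model that can estimate its token count might override it along the lines of the sketch below. The fixed rate of 50 tokens per second and the class name `ToyAsrModel` are assumptions made purely for illustration, and the two config arguments are accepted but unused here.

```python
import math
from typing import Optional


class ToyAsrModel:
    # Hypothetical: assume a fixed number of audio tokens per second.
    TOKENS_PER_SECOND = 50

    @classmethod
    def get_num_audio_tokens(cls, audio_duration_s: float,
                             stt_config: object,
                             model_config: object) -> Optional[int]:
        # Round up so a partial trailing frame still counts as one token.
        return math.ceil(audio_duration_s * cls.TOKENS_PER_SECOND)


estimated = ToyAsrModel.get_num_audio_tokens(2.5, None, None)
```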

get_other_languages `classmethod` ¶

get_other_languages() -> Mapping[str, str]

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_other_languages(cls) -> Mapping[str, str]:
    # other possible language codes from the whisper map
    return {
        k: v
        for k, v in LANGUAGES.items() if k not in cls.supported_languages
    }

get_speech_to_text_config `classmethod` ¶

get_speech_to_text_config(
    model_config: ModelConfig,
    task_type: Literal["transcribe", "translate"],
) -> SpeechToTextConfig

Get the speech to text config for the ASR model.

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def get_speech_to_text_config(
        cls, model_config: ModelConfig,
        task_type: Literal["transcribe",
                           "translate"]) -> SpeechToTextConfig:
    """Get the speech to text config for the ASR model."""
    ...

validate_language `classmethod` ¶

validate_language(language: Optional[str]) -> Optional[str]

Ensure the language specified in the transcription request is a valid ISO 639-1 language code. If the request language is valid, but not natively supported by the model, trigger a warning (but not an exception).
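
The three outcomes described above (accept, accept with a warning, reject) can be sketched as a standalone function. The two small dictionaries below are stand-ins for the full language map and for a model's `supported_languages`; they are not the real data.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Stand-ins: the full language map and one model's native languages.
LANGUAGES = {"en": "english", "de": "german", "fr": "french"}
SUPPORTED = {"en": "english"}


def validate_language(language: Optional[str]) -> Optional[str]:
    # No language requested, or natively supported: accept silently.
    if language is None or language in SUPPORTED:
        return language
    # Valid code but not native to the model: warn and still accept.
    if language in LANGUAGES:
        logger.warning("Language %r is not natively supported", language)
        return language
    # Not a known code at all: reject.
    raise ValueError(f"Unsupported language: {language!r}")


outcomes = {}
for code in (None, "en", "fr", "zz"):
    try:
        outcomes[code] = validate_language(code)
    except ValueError:
        outcomes[code] = "rejected"
```

Here `"fr"` is returned unchanged after a warning, while `"zz"` raises `ValueError`.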

Source code in vllm/model_executor/models/interfaces.py

@classmethod
def validate_language(cls, language: Optional[str]) -> Optional[str]:
    """
    Ensure the language specified in the transcription request 
    is a valid ISO 639-1 language code. If the request language is 
    valid, but not natively supported by the model, trigger a 
    warning (but not an exception).
    """
    if language is None or language in cls.supported_languages:
        return language
    elif language in cls.get_other_languages():
        logger.warning(
            "Language %r is not natively supported by %s; "
            "results may be less accurate. Supported languages: %r",
            language,
            cls.__name__,
            list(cls.supported_languages.keys()),
        )
        return language
    else:
        raise ValueError(
            f"Unsupported language: {language!r}.  Must be one of "
            f"{list(cls.supported_languages.keys())}.")

SupportsV0Only ¶

Bases: Protocol

Models with this interface are not compatible with V1 vLLM.

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class SupportsV0Only(Protocol):
    """Models with this interface are not compatible with V1 vLLM."""

    supports_v0_only: ClassVar[Literal[True]] = True

supports_v0_only `class-attribute` ¶

supports_v0_only: Literal[True] = True

_SupportsLoRAType ¶

Bases: Protocol

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class _SupportsLoRAType(Protocol):
    supports_lora: Literal[True]

    packed_modules_mapping: dict[str, list[str]]
    embedding_modules: dict[str, str]
    embedding_padding_modules: list[str]

embedding_modules `instance-attribute` ¶

embedding_modules: dict[str, str]

embedding_padding_modules `instance-attribute` ¶

embedding_padding_modules: list[str]

packed_modules_mapping `instance-attribute` ¶

packed_modules_mapping: dict[str, list[str]]

supports_lora `instance-attribute` ¶

supports_lora: Literal[True]

_SupportsPPType ¶

Bases: Protocol

Source code in vllm/model_executor/models/interfaces.py

@runtime_checkable
class _SupportsPPType(Protocol):
    supports_pp: Literal[True]

    def make_empty_intermediate_tensors(
        self,
        batch_size: int,
        dtype: torch.dtype,
        device: torch.device,
    ) -> "IntermediateTensors":
        ...

    def forward(
        self,
        *,
        intermediate_tensors: Optional["IntermediateTensors"],
    ) -> Union[Tensor, "IntermediateTensors"]:
        ...

supports_pp `instance-attribute` ¶

supports_pp: Literal[True]

forward ¶

forward(
    *, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]

Source code in vllm/model_executor/models/interfaces.py

def forward(
    self,
    *,
    intermediate_tensors: Optional["IntermediateTensors"],
) -> Union[Tensor, "IntermediateTensors"]:
    ...

make_empty_intermediate_tensors ¶

make_empty_intermediate_tensors(
    batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors

Source code in vllm/model_executor/models/interfaces.py

def make_empty_intermediate_tensors(
    self,
    batch_size: int,
    dtype: torch.dtype,
    device: torch.device,
) -> "IntermediateTensors":
    ...

_supports_cross_encoding ¶

_supports_cross_encoding(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsCrossEncoding]],
    TypeIs[SupportsCrossEncoding],
]

Source code in vllm/model_executor/models/interfaces.py

def _supports_cross_encoding(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsCrossEncoding]], TypeIs[SupportsCrossEncoding]]:
    return getattr(model, "supports_cross_encoding", False)

_supports_lora ¶

_supports_lora(model: Union[type[object], object]) -> bool

Source code in vllm/model_executor/models/interfaces.py

def _supports_lora(model: Union[type[object], object]) -> bool:
    if isinstance(model, type):
        return isinstance(model, _SupportsLoRAType)

    return isinstance(model, SupportsLoRA)
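
The helper above calls `isinstance` even when `model` is a class object. A plausible reason is that `runtime_checkable` protocols with data members do not support `issubclass`, whereas `isinstance` on the class object checks its class-level attributes. The sketch below demonstrates that standard-library behavior with a hypothetical protocol `HasLoRAAttrs`, which only mirrors the shape of `_SupportsLoRAType`.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasLoRAAttrs(Protocol):
    # Data members only, mirroring the shape of _SupportsLoRAType.
    supports_lora: bool
    embedding_modules: dict[str, str]


class ToyModel:
    supports_lora = True
    embedding_modules = {"embed_tokens": "input_embeddings"}


class PlainModel:
    pass


# A class object is itself an object with attributes, so isinstance()
# against a data-only protocol checks the class-level attributes.
class_matches = isinstance(ToyModel, HasLoRAAttrs)
instance_matches = isinstance(ToyModel(), HasLoRAAttrs)
plain_matches = isinstance(PlainModel, HasLoRAAttrs)

# issubclass() is rejected for protocols with non-method members.
try:
    issubclass(ToyModel, HasLoRAAttrs)
    issubclass_allowed = True
except TypeError:
    issubclass_allowed = False
```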

_supports_pp_attributes ¶

_supports_pp_attributes(
    model: Union[type[object], object],
) -> bool

Source code in vllm/model_executor/models/interfaces.py

def _supports_pp_attributes(model: Union[type[object], object]) -> bool:
    if isinstance(model, type):
        return isinstance(model, _SupportsPPType)

    return isinstance(model, SupportsPP)

_supports_pp_inspect ¶

_supports_pp_inspect(
    model: Union[type[object], object],
) -> bool

Source code in vllm/model_executor/models/interfaces.py

def _supports_pp_inspect(model: Union[type[object], object]) -> bool:
    model_forward = getattr(model, "forward", None)
    if not callable(model_forward):
        return False

    return supports_kw(model_forward, "intermediate_tensors")

has_inner_state ¶

has_inner_state(model: object) -> TypeIs[HasInnerState]

has_inner_state(
    model: type[object],
) -> TypeIs[type[HasInnerState]]

has_inner_state(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[HasInnerState]], TypeIs[HasInnerState]
]

Source code in vllm/model_executor/models/interfaces.py

def has_inner_state(
    model: Union[type[object], object]
) -> Union[TypeIs[type[HasInnerState]], TypeIs[HasInnerState]]:
    return getattr(model, "has_inner_state", False)
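
Because the flag is a class-level attribute, the `getattr` lookup above gives the same answer for a class object and for an instance of it. The sketch below illustrates this with two hypothetical classes; the helper is re-declared locally so the snippet runs on its own.

```python
from typing import ClassVar, Literal


class ToyStatefulModel:
    has_inner_state: ClassVar[Literal[True]] = True


class ToyStatelessModel:
    pass


def has_inner_state(model: object) -> bool:
    # Same getattr-with-default lookup as the helper above.
    return getattr(model, "has_inner_state", False)


results = (
    has_inner_state(ToyStatefulModel),    # class object
    has_inner_state(ToyStatefulModel()),  # instance
    has_inner_state(ToyStatelessModel),   # flag absent
)
```

Note that the helper returns whatever value the attribute holds, so the result is only a strict `bool` when the flag itself is one.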

has_noops ¶

has_noops(model: object) -> TypeIs[HasNoOps]

has_noops(model: type[object]) -> TypeIs[type[HasNoOps]]

has_noops(
    model: Union[type[object], object],
) -> Union[TypeIs[type[HasNoOps]], TypeIs[HasNoOps]]

Source code in vllm/model_executor/models/interfaces.py

def has_noops(
    model: Union[type[object], object]
) -> Union[TypeIs[type[HasNoOps]], TypeIs[HasNoOps]]:
    return getattr(model, "has_noops", False)

is_attention_free ¶

is_attention_free(model: object) -> TypeIs[IsAttentionFree]

is_attention_free(
    model: type[object],
) -> TypeIs[type[IsAttentionFree]]

is_attention_free(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[IsAttentionFree]], TypeIs[IsAttentionFree]
]

Source code in vllm/model_executor/models/interfaces.py

def is_attention_free(
    model: Union[type[object], object]
) -> Union[TypeIs[type[IsAttentionFree]], TypeIs[IsAttentionFree]]:
    return getattr(model, "is_attention_free", False)

is_hybrid ¶

is_hybrid(model: object) -> TypeIs[IsHybrid]

is_hybrid(model: type[object]) -> TypeIs[type[IsHybrid]]

is_hybrid(
    model: Union[type[object], object],
) -> Union[TypeIs[type[IsHybrid]], TypeIs[IsHybrid]]

Source code in vllm/model_executor/models/interfaces.py

def is_hybrid(
    model: Union[type[object], object]
) -> Union[TypeIs[type[IsHybrid]], TypeIs[IsHybrid]]:
    return getattr(model, "is_hybrid", False)

is_mixture_of_experts ¶

is_mixture_of_experts(
    model: object,
) -> TypeIs[MixtureOfExperts]

Source code in vllm/model_executor/models/interfaces.py

def is_mixture_of_experts(model: object) -> TypeIs[MixtureOfExperts]:
    return isinstance(model, MixtureOfExperts)

supports_cross_encoding ¶

supports_cross_encoding(
    model: type[object],
) -> TypeIs[type[SupportsCrossEncoding]]

supports_cross_encoding(
    model: object,
) -> TypeIs[SupportsCrossEncoding]

supports_cross_encoding(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsCrossEncoding]],
    TypeIs[SupportsCrossEncoding],
]

Source code in vllm/model_executor/models/interfaces.py

def supports_cross_encoding(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsCrossEncoding]], TypeIs[SupportsCrossEncoding]]:
    return is_pooling_model(model) and _supports_cross_encoding(model)

supports_eagle3 ¶

supports_eagle3(
    model: type[object],
) -> TypeIs[type[SupportsEagle3]]

supports_eagle3(model: object) -> TypeIs[SupportsEagle3]

supports_eagle3(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsEagle3]], TypeIs[SupportsEagle3]
]

Source code in vllm/model_executor/models/interfaces.py

def supports_eagle3(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsEagle3]], TypeIs[SupportsEagle3]]:
    return isinstance(model, SupportsEagle3)

supports_lora ¶

supports_lora(
    model: type[object],
) -> TypeIs[type[SupportsLoRA]]

supports_lora(model: object) -> TypeIs[SupportsLoRA]

supports_lora(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]
]

Source code in vllm/model_executor/models/interfaces.py

def supports_lora(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]]:
    result = _supports_lora(model)

    if not result:
        lora_attrs = (
            "packed_modules_mapping",
            "embedding_modules",
            "embedding_padding_modules",
        )
        missing_attrs = tuple(attr for attr in lora_attrs
                              if not hasattr(model, attr))

        if getattr(model, "supports_lora", False):
            if missing_attrs:
                logger.warning(
                    "The model (%s) sets `supports_lora=True`, "
                    "but is missing LoRA-specific attributes: %s",
                    model,
                    missing_attrs,
                )
        else:
            if not missing_attrs:
                logger.warning(
                    "The model (%s) contains all LoRA-specific attributes, "
                    "but does not set `supports_lora=True`.", model)

    return result

supports_mrope ¶

supports_mrope(
    model: type[object],
) -> TypeIs[type[SupportsMRoPE]]

supports_mrope(model: object) -> TypeIs[SupportsMRoPE]

supports_mrope(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsMRoPE]], TypeIs[SupportsMRoPE]
]

Source code in vllm/model_executor/models/interfaces.py

def supports_mrope(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsMRoPE]], TypeIs[SupportsMRoPE]]:
    return isinstance(model, SupportsMRoPE)

supports_multimodal ¶

supports_multimodal(
    model: type[object],
) -> TypeIs[type[SupportsMultiModal]]

supports_multimodal(
    model: object,
) -> TypeIs[SupportsMultiModal]

supports_multimodal(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsMultiModal]],
    TypeIs[SupportsMultiModal],
]

Source code in vllm/model_executor/models/interfaces.py

def supports_multimodal(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsMultiModal]], TypeIs[SupportsMultiModal]]:
    return getattr(model, "supports_multimodal", False)

supports_multimodal_encoder_tp_data ¶

supports_multimodal_encoder_tp_data(
    model: Union[type[object], object],
) -> bool

Source code in vllm/model_executor/models/interfaces.py

def supports_multimodal_encoder_tp_data(
        model: Union[type[object], object]) -> bool:
    return getattr(model, "supports_encoder_tp_data", False)

supports_multimodal_pruning ¶

supports_multimodal_pruning(
    model: type[object],
) -> TypeIs[type[SupportsMultiModalPruning]]

supports_multimodal_pruning(
    model: object,
) -> TypeIs[SupportsMultiModalPruning]

supports_multimodal_pruning(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsMultiModalPruning]],
    TypeIs[SupportsMultiModalPruning],
]

Source code in vllm/model_executor/models/interfaces.py

def supports_multimodal_pruning(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsMultiModalPruning]],
           TypeIs[SupportsMultiModalPruning]]:
    return getattr(model, "supports_multimodal_pruning", False)

supports_multimodal_raw_input_only ¶

supports_multimodal_raw_input_only(
    model: Union[type[object], object],
) -> bool

Source code in vllm/model_executor/models/interfaces.py

def supports_multimodal_raw_input_only(
        model: Union[type[object], object]) -> bool:
    return getattr(model, "supports_multimodal_raw_input_only", False)

supports_pp ¶

supports_pp(
    model: type[object],
) -> TypeIs[type[SupportsPP]]

supports_pp(model: object) -> TypeIs[SupportsPP]

supports_pp(
    model: Union[type[object], object],
) -> Union[
    bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]
]

Source code in vllm/model_executor/models/interfaces.py

def supports_pp(
    model: Union[type[object], object],
) -> Union[bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]]:
    supports_attributes = _supports_pp_attributes(model)
    supports_inspect = _supports_pp_inspect(model)

    # The attribute check passed, but `forward` does not accept
    # `intermediate_tensors`.
    if supports_attributes and not supports_inspect:
        logger.warning(
            "The model (%s) sets `supports_pp=True`, but does not accept "
            "`intermediate_tensors` in its `forward` method", model)

    # The attribute check failed: report any mismatch between the
    # `supports_pp` flag and the PP-specific attributes.
    if not supports_attributes:
        pp_attrs = ("make_empty_intermediate_tensors", )
        missing_attrs = tuple(attr for attr in pp_attrs
                              if not hasattr(model, attr))

        if getattr(model, "supports_pp", False):
            # The flag is set, but a required attribute is missing.
            if missing_attrs:
                logger.warning(
                    "The model (%s) sets `supports_pp=True`, "
                    "but is missing PP-specific attributes: %s",
                    model,
                    missing_attrs,
                )
        else:
            # All required attributes exist, but the flag is not set.
            if not missing_attrs:
                logger.warning(
                    "The model (%s) contains all PP-specific attributes, "
                    "but does not set `supports_pp=True`.", model)

    return supports_attributes and supports_inspect
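
Unlike the single-flag helpers, `supports_pp` returns `True` only when both an attribute check and a signature check succeed. The two private helpers are not shown in this section, so the following is a minimal sketch under assumptions: that the attribute check amounts to the `supports_pp` flag plus `make_empty_intermediate_tensors`, and that the inspection step looks for an `intermediate_tensors` parameter on `forward`, as the warning messages suggest. `accepts_param`, `check_pp`, `PPModel`, and `FlagOnlyModel` are hypothetical names, not vLLM APIs.

```python
import inspect


def accepts_param(func, name: str) -> bool:
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


def check_pp(model: object) -> bool:
    # Attribute side (assumed): the flag is set and the PP method exists.
    has_attrs = bool(getattr(model, "supports_pp", False)) and hasattr(
        model, "make_empty_intermediate_tensors")
    # Signature side (assumed): `forward` takes `intermediate_tensors`.
    forward = getattr(model, "forward", None)
    has_kw = callable(forward) and accepts_param(forward,
                                                 "intermediate_tensors")
    return has_attrs and has_kw


class PPModel:
    # Hypothetical model that satisfies both checks.
    supports_pp = True

    def make_empty_intermediate_tensors(self, batch_size):
        return {}

    def forward(self, input_ids, intermediate_tensors=None):
        return input_ids


class FlagOnlyModel:
    # Hypothetical model that sets the flag but whose `forward`
    # does not accept `intermediate_tensors`.
    supports_pp = True

    def make_empty_intermediate_tensors(self, batch_size):
        return {}

    def forward(self, input_ids):
        return input_ids


pp_ok = check_pp(PPModel)
pp_ok_instance = check_pp(PPModel())
pp_flag_only = check_pp(FlagOnlyModel)
print(pp_ok, pp_ok_instance, pp_flag_only)
```

`FlagOnlyModel` corresponds to the first warning in the source above: the flag is set, but `forward` lacks the `intermediate_tensors` argument, so the combined result is `False`.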

supports_score_template ¶

supports_score_template(
    model: type[object],
) -> TypeIs[type[SupportsScoreTemplate]]

supports_score_template(
    model: object,
) -> TypeIs[SupportsScoreTemplate]

supports_score_template(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsScoreTemplate]],
    TypeIs[SupportsScoreTemplate],
]

Source code in vllm/model_executor/models/interfaces.py

def supports_score_template(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsScoreTemplate]], TypeIs[SupportsScoreTemplate]]:
    return getattr(model, "supports_score_template", False)

supports_transcription ¶

supports_transcription(
    model: type[object],
) -> TypeIs[type[SupportsTranscription]]

supports_transcription(
    model: object,
) -> TypeIs[SupportsTranscription]

supports_transcription(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsTranscription]],
    TypeIs[SupportsTranscription],
]

Source code in vllm/model_executor/models/interfaces.py

def supports_transcription(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsTranscription]], TypeIs[SupportsTranscription]]:
    return getattr(model, "supports_transcription", False)

supports_v0_only ¶

supports_v0_only(
    model: type[object],
) -> TypeIs[type[SupportsV0Only]]

supports_v0_only(model: object) -> TypeIs[SupportsV0Only]

supports_v0_only(
    model: Union[type[object], object],
) -> Union[
    TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]
]

Source code in vllm/model_executor/models/interfaces.py

def supports_v0_only(
    model: Union[type[object], object],
) -> Union[TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]]:
    return getattr(model, "supports_v0_only", False)
