vllm.attention.backends.abstract

AttentionBackend

Bases: ABC

Abstract class for attention backends.
supports_quant_query_input class-attribute instance-attribute

supports_quant_query_input: bool = False
full_cls_name classmethod
get_builder_cls abstractmethod staticmethod
get_impl_cls abstractmethod staticmethod

get_impl_cls() -> Type[AttentionImpl]
get_kv_cache_shape abstractmethod staticmethod
get_kv_cache_stride_order staticmethod
get_metadata_cls abstractmethod staticmethod

get_metadata_cls() -> Type[AttentionMetadata]
make_metadata classmethod

make_metadata(*args, **kwargs) -> AttentionMetadata
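A minimal sketch (not taken from the vLLM source) of how a concrete backend can satisfy part of this interface: the My* names are hypothetical placeholders, and the remaining abstract methods (get_builder_cls, get_kv_cache_shape, ...) are omitted for brevity.

```python
# A minimal sketch, not taken from the vLLM source: MyAttentionBackend,
# MyAttentionImpl and MyAttentionMetadata are hypothetical names used only
# to illustrate the abstract interface documented above.
from typing import Type

from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl,
                                              AttentionMetadata)


class MyAttentionImpl(AttentionImpl):
    """Hypothetical implementation class (see the AttentionImpl sketch below)."""


class MyAttentionMetadata(AttentionMetadata):
    """Hypothetical per-batch metadata class for this backend."""


class MyAttentionBackend(AttentionBackend):
    # Leave False unless the kernel accepts an already-quantized query tensor.
    supports_quant_query_input: bool = False

    @staticmethod
    def get_impl_cls() -> Type[AttentionImpl]:
        return MyAttentionImpl

    @staticmethod
    def get_metadata_cls() -> Type[AttentionMetadata]:
        return MyAttentionMetadata
```

With such a backend in place, make_metadata(*args, **kwargs) constructs an instance of the class returned by get_metadata_cls, so callers do not need to import the concrete metadata type directly.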
AttentionImpl
can_return_lse_for_decode class-attribute instance-attribute

can_return_lse_for_decode: bool = False
need_to_return_lse_for_decode class-attribute instance-attribute

need_to_return_lse_for_decode: bool = False
__init__ abstractmethod
__init__(
    num_heads: int,
    head_size: int,
    scale: float,
    num_kv_heads: Optional[int] = None,
    alibi_slopes: Optional[List[float]] = None,
    sliding_window: Optional[int] = None,
    kv_cache_dtype: str = "auto",
    logits_soft_cap: Optional[float] = None,
    attn_type: str = DECODER,
    kv_sharing_target_layer_name: Optional[str] = None,
) -> None
__new__
forward abstractmethod
forward(
    layer: AttentionLayer,
    query: Tensor,
    key: Tensor,
    value: Tensor,
    kv_cache: Tensor,
    attn_metadata: T,
    output: Optional[Tensor] = None,
    output_scale: Optional[Tensor] = None,
    output_block_scale: Optional[Tensor] = None,
) -> Tensor
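A minimal sketch, assuming a hypothetical MyAttentionImpl subclass, of how the constructor arguments are typically stored and how forward is overridden; the forward body below is a placeholder rather than a real attention kernel.

```python
# Hypothetical AttentionImpl subclass; the forward body is a placeholder,
# not a real paged-attention kernel.
from typing import List, Optional

import torch

from vllm.attention.backends.abstract import (AttentionImpl, AttentionLayer,
                                              AttentionType)


class MyAttentionImpl(AttentionImpl):

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: Optional[int] = None,
        alibi_slopes: Optional[List[float]] = None,
        sliding_window: Optional[int] = None,
        kv_cache_dtype: str = "auto",
        logits_soft_cap: Optional[float] = None,
        attn_type: str = AttentionType.DECODER,
        kv_sharing_target_layer_name: Optional[str] = None,
    ) -> None:
        # Keep the shapes and scaling factors the kernel will need at
        # forward time.
        self.num_heads = num_heads
        self.head_size = head_size
        self.scale = scale
        self.num_kv_heads = num_kv_heads if num_kv_heads else num_heads
        self.kv_cache_dtype = kv_cache_dtype
        self.attn_type = attn_type

    def forward(
        self,
        layer: AttentionLayer,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        attn_metadata,
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
        output_block_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # A real backend would write key/value into kv_cache and launch an
        # optimized attention kernel, filling `output` in place when the
        # caller preallocates it.
        if output is None:
            output = torch.empty_like(query)
        return output
```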
fused_output_quant_supported

fused_output_quant_supported(quant_key: QuantKey)

Does this attention implementation support fused output quantization? This is used by the AttnFusionPass to fuse output quantization only onto implementations that support it.

:param quant_key: QuantKey object that describes the quantization op
:return: whether fusion is supported for this type of quantization
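A short sketch of how a concrete implementation might override this hook; the _supported_quant_keys attribute is hypothetical and used only for illustration.

```python
# Hypothetical override: report fused output quantization support only for
# the quantization schemes this kernel can actually fuse.
from vllm.attention.backends.abstract import AttentionImpl


class MyQuantAwareImpl(AttentionImpl):
    # Hypothetical set of QuantKey values the kernel implements; a real
    # backend would populate it based on its fused epilogues.
    _supported_quant_keys: frozenset = frozenset()

    def fused_output_quant_supported(self, quant_key) -> bool:
        # AttnFusionPass fuses output quantization only when this returns True.
        return quant_key in self._supported_quant_keys
```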
AttentionLayer

Bases: Protocol
AttentionMetadata
AttentionType

Attention type. Use string to be compatible with torch.compile.
DECODER class-attribute instance-attribute

Decoder attention between previous layer Q/K/V.
ENCODER class-attribute instance-attribute

Encoder attention between previous layer Q/K/V for encoder-decoder.
ENCODER_DECODER class-attribute instance-attribute

Attention between decoder Q and encoder K/V for encoder-decoder.
ENCODER_ONLY class-attribute instance-attribute

Encoder attention between previous layer Q/K/V.
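Because the members are plain strings, an attn_type value can be compared directly, which keeps the checks friendly to torch.compile; a short usage sketch:

```python
# AttentionType members are plain str constants, so comparisons need no
# enum handling inside torch.compile-traced code.
from vllm.attention.backends.abstract import AttentionType

attn_type = AttentionType.DECODER
assert isinstance(attn_type, str)

if attn_type in (AttentionType.DECODER, AttentionType.ENCODER_ONLY):
    # self-attention over this layer's own Q/K/V
    pass
elif attn_type == AttentionType.ENCODER_DECODER:
    # cross-attention: decoder Q against encoder K/V
    pass
```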
MLAAttentionImpl

Bases: AttentionImpl[T], Generic[T]
forward abstractmethod
forward(
    layer: AttentionLayer,
    hidden_states_or_cq: Tensor,
    kv_c_normed: Tensor,
    k_pe: Tensor,
    kv_cache: Tensor,
    attn_metadata: T,
    output: Optional[Tensor] = None,
    output_scale: Optional[Tensor] = None,
    output_block_scale: Optional[Tensor] = None,
) -> Tensor
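A skeleton sketch of the corresponding override in a hypothetical MLA implementation; the comments on the arguments reflect the usual multi-head latent attention layout (a normalized compressed KV tensor plus a decoupled rotary key) and are assumptions, not taken from this module. The constructor is omitted; it follows the AttentionImpl __init__ signature above.

```python
# Hypothetical MLAAttentionImpl subclass skeleton; the argument comments are
# assumptions about the usual MLA layout, and the body is a placeholder.
from typing import Optional

import torch

from vllm.attention.backends.abstract import AttentionLayer, MLAAttentionImpl


class MyMLAImpl(MLAAttentionImpl):

    def forward(
        self,
        layer: AttentionLayer,
        hidden_states_or_cq: torch.Tensor,  # hidden states or compressed Q
        kv_c_normed: torch.Tensor,          # normalized compressed (latent) KV
        k_pe: torch.Tensor,                 # decoupled rotary-embedded key part
        kv_cache: torch.Tensor,
        attn_metadata,
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
        output_block_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # A real MLA backend would cache kv_c_normed and k_pe in kv_cache and
        # run an MLA-specific kernel; this placeholder only returns a buffer.
        if output is None:
            output = torch.empty_like(hidden_states_or_cq)
        return output
```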