vllm.v1.attention.backends.mla.flashinfer_mla
FLASHINFER_MLA_WORKSPACE_BUFFER_SIZE module-attribute

g_fi_workspace module-attribute

# Persistent, module-level workspace buffer handed to the FlashInfer MLA
# kernels so repeated calls reuse the same GPU scratch memory.
g_fi_workspace = torch.zeros(
    FLASHINFER_MLA_WORKSPACE_BUFFER_SIZE,
    dtype=torch.uint8,
    device="cuda",
)
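For context, FlashInfer wrappers take a caller-provided scratch tensor so that repeated decode calls reuse the same GPU memory rather than reallocating it. The snippet below is a minimal sketch of that pattern; the BatchMLAPagedAttentionWrapper import path, its constructor signature, and the 128 MiB size are assumptions about the FlashInfer API, not values taken from this module.

import torch
from flashinfer.mla import BatchMLAPagedAttentionWrapper  # assumed import path

# Assumed workspace size (128 MiB); the real value is the module attribute
# FLASHINFER_MLA_WORKSPACE_BUFFER_SIZE documented above.
workspace = torch.zeros(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Assumed constructor: a preallocated byte buffer passed once and then reused
# across subsequent planning/run calls.
wrapper = BatchMLAPagedAttentionWrapper(workspace)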
FlashInferMLABackend

Bases: MLACommonBackend
get_impl_cls staticmethod

get_impl_cls() -> type[FlashInferMLAImpl]
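A small usage sketch of how the backend class resolves its implementation; the import path is the module documented here, and the assertion is illustrative only.

from vllm.v1.attention.backends.mla.flashinfer_mla import (
    FlashInferMLABackend,
    FlashInferMLAImpl,
)

# The backend advertises which impl class the attention layer should build.
impl_cls = FlashInferMLABackend.get_impl_cls()
assert impl_cls is FlashInferMLAImpl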
FlashInferMLAImpl

Bases: MLACommonImpl[MLACommonMetadata]
__init__

__init__(
    num_heads: int,
    head_size: int,
    scale: float,
    num_kv_heads: int,
    alibi_slopes: Optional[list[float]],
    sliding_window: Optional[int],
    kv_cache_dtype: str,
    logits_soft_cap: Optional[float],
    attn_type: str,
    kv_sharing_target_layer_name: Optional[str],
    **mla_args,
) -> None
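For orientation, the sketch below groups the constructor arguments with representative values. Every concrete number is an assumption modeled on a DeepSeek-style MLA configuration (latent rank 512, rotary head dim 64), not a value taken from vLLM, and the MLA-specific **mla_args are omitted.

# Hypothetical argument set for FlashInferMLAImpl.__init__ (values assumed).
impl_kwargs = dict(
    num_heads=128,                 # query heads
    head_size=576,                 # latent rank (512) + rotary head dim (64)
    scale=1.0 / (192 ** 0.5),      # 1/sqrt(qk head dim), assuming 192
    num_kv_heads=1,                # MLA keeps a single latent KV "head"
    alibi_slopes=None,
    sliding_window=None,
    kv_cache_dtype="auto",
    logits_soft_cap=None,
    attn_type="decoder",
    kv_sharing_target_layer_name=None,
)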
_forward_decode

_forward_decode(
    q: Union[Tensor, tuple[Tensor, Tensor]],
    kv_c_and_k_pe_cache: Tensor,
    attn_metadata: MLACommonMetadata,
    layer: AttentionLayer,
) -> tuple[Tensor, Optional[Tensor]]
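To make the signature concrete: during MLA decode, q is typically the pair (q_nope, q_pe) after matrix absorption, and the cache holds the compressed latent kv_c concatenated with the rotary k_pe. Below is a plain-PyTorch sketch of the per-sequence computation that the FlashInfer kernel performs in paged, batched form; the function name and unpaged layout are illustrative assumptions, not this module's API.

import torch

def mla_decode_reference(
    q_nope: torch.Tensor,  # [num_heads, kv_lora_rank], absorbed query part
    q_pe: torch.Tensor,    # [num_heads, qk_rope_head_dim], rotary query part
    kv_c: torch.Tensor,    # [seq_len, kv_lora_rank], compressed latent KV
    k_pe: torch.Tensor,    # [seq_len, qk_rope_head_dim], rotary key part
    scale: float,
) -> torch.Tensor:
    # Scores combine the latent ("nope") and rotary ("pe") contributions.
    scores = q_nope @ kv_c.T + q_pe @ k_pe.T       # [num_heads, seq_len]
    probs = torch.softmax(scores * scale, dim=-1)
    # The result stays in the latent space; the caller up-projects it back
    # to per-head value dimensions afterwards.
    return probs @ kv_c                            # [num_heads, kv_lora_rank]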