vllm.attention.layers.cross_attention ¶
CrossAttention ¶
Bases: Attention
Cross-attention for encoder-decoder models. Handles attention between decoder queries and encoder keys/values.
Source code in vllm/attention/layers/cross_attention.py
__init__ ¶
__init__(
num_heads: int,
head_size: int,
scale: float,
cache_config: Optional[CacheConfig] = None,
attn_type: Optional[str] = None,
**kwargs,
)
Source code in vllm/attention/layers/cross_attention.py
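Example: a minimal construction sketch, not taken from the vLLM sources. It assumes the layer is built inside vLLM's usual model-construction context (so a current VllmConfig is available); the sizes are illustrative, and the prefix keyword is an assumed pass-through to the Attention base class via **kwargs.

from vllm.attention.layers.cross_attention import CrossAttention

# Illustrative sizes; real models take these from their own config.
hidden_size, num_heads = 1024, 16
head_size = hidden_size // num_heads

cross_attn = CrossAttention(
    num_heads=num_heads,
    head_size=head_size,
    scale=head_size**-0.5,                 # standard 1/sqrt(head_dim) scaling
    cache_config=None,                     # engine normally supplies a CacheConfig
    prefix="decoder.layers.0.cross_attn",  # assumed kwarg forwarded to Attention
)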
_get_cross_slot_mapping ¶
_get_cross_slot_mapping(
encoder_seq_lens: ndarray,
block_table_tensor: Tensor,
kv_cache_spec: CrossAttentionSpec,
device: device,
) -> Tensor
Get cross-attention slot mappings.
Source code in vllm/attention/layers/cross_attention.py
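Example: an illustrative re-implementation (not the library code) of the idea behind cross-attention slot mapping: each encoder token position of request i is mapped to a physical KV-cache slot through that request's block table, so encoder keys/values are written once and reused on every decode step. The function name, tensor dtypes, and the block_size parameter are assumptions for the sketch.

import torch

def cross_slot_mapping_sketch(
    encoder_seq_lens: torch.Tensor,  # [num_reqs] encoder tokens per request
    block_table: torch.Tensor,       # [num_reqs, max_blocks] physical block ids
    block_size: int,
) -> torch.Tensor:
    slots = []
    for i, seq_len in enumerate(encoder_seq_lens.tolist()):
        positions = torch.arange(seq_len)
        block_ids = block_table[i, positions // block_size]
        # slot = physical_block_id * block_size + offset_within_block
        slots.append(block_ids * block_size + positions % block_size)
    # Flat tensor with one slot index per encoder token across all requests.
    return torch.cat(slots) if slots else torch.empty(0, dtype=torch.long)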
_get_max_encoder_len ¶
_get_max_encoder_len(vllm_config: VllmConfig) -> int
Gets the max number of encoder input tokens from the config.
Source code in vllm/attention/layers/cross_attention.py
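Example: a hypothetical caller-side sketch of why this value matters: it bounds the cross-attention KV cache, since encoder keys/values are written once per request and never grow during decoding. The helper name and the block_size argument are assumptions.

from vllm.attention.layers.cross_attention import _get_max_encoder_len
from vllm.config import VllmConfig

def cross_kv_blocks_per_request(vllm_config: VllmConfig, block_size: int) -> int:
    max_encoder_len = _get_max_encoder_len(vllm_config)
    # Ceiling division: KV-cache blocks one request's encoder output can occupy.
    return -(-max_encoder_len // block_size)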
create_cross_attention_backend cached ¶
create_cross_attention_backend(
underlying_attn_backend: AttentionBackend,
) -> type[AttentionBackend]
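Example: a hedged usage sketch. underlying_backend_cls stands for whatever AttentionBackend class vLLM selected for the model and platform (e.g. its FlashAttention backend); resolving it is outside this sketch. Because the function is cached, repeated calls with the same backend return the same wrapped class.

from vllm.attention.layers.cross_attention import create_cross_attention_backend

# Assumption: `underlying_backend_cls` is the AttentionBackend class that
# vLLM already selected for this model/platform.
cross_backend_cls = create_cross_attention_backend(underlying_backend_cls)

# The wrapped class behaves like the original backend, except that its
# metadata builder produces cross-attention metadata (encoder sequence
# lengths and cross slot mappings) rather than decoder self-attention metadata.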