vllm.model_executor.warmup.deep_gemm_warmup ¶
Warm up deep_gemm kernels. DeepGEMM JIT-compiles its kernels, so the warmup aims to JIT-compile, ahead of time, all the kernels that would be used during model execution.
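The module-level entry point is not expanded in this excerpt; below is a minimal sketch of the overall flow, assuming a single pass over `model.modules()` that dispatches dense FP8 layers and fused-MoE layers to the per-layer warmups documented on this page. All helper names in the sketch are illustrative placeholders, not vLLM API.

```python
import torch.nn as nn


def warmup_deep_gemm_kernels(model: nn.Module, max_tokens: int) -> None:
    """Walk the model once and trigger DeepGEMM JIT compilation up front."""
    for module in model.modules():
        if _is_fp8_linear_for_deep_gemm(module):
            _warmup_dense_fp8_gemm(module, max_tokens)    # dense FP8 GEMM path
        elif _is_fused_moe(module):
            _warmup_grouped_fp8_gemm(module, max_tokens)  # grouped (MoE) GEMM path


# No-op stand-ins so the sketch runs on its own; the real module provides
# the predicates and per-layer warmups documented below.
def _is_fp8_linear_for_deep_gemm(m: nn.Module) -> bool:
    return False


def _is_fused_moe(m: nn.Module) -> bool:
    return False


def _warmup_dense_fp8_gemm(m: nn.Module, max_tokens: int) -> None:
    pass


def _warmup_grouped_fp8_gemm(m: nn.Module, max_tokens: int) -> None:
    pass


warmup_deep_gemm_kernels(nn.Linear(8, 8), max_tokens=4)
```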
GROUPED_FP8_GEMM_NT_CONTIGUOUS_WARMUP_CACHE module-attribute ¶
_deepgemm_fp8_gemm_nt_warmup ¶
Source code in vllm/model_executor/warmup/deep_gemm_warmup.py
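The signature of this helper is not expanded here. The sketch below shows what a dense FP8 warmup of this kind could look like, assuming the kernel is JIT-specialized per problem shape. The stand-in `_stub_fp8_gemm_nt`, the per-token-count loop, and the 128-wide scale blocks are assumptions, not the documented implementation.

```python
import torch


def _stub_fp8_gemm_nt(a, a_scale, w, w_scale, out):
    """Stand-in for the DeepGEMM FP8 NT GEMM; the real call JIT-compiles a
    kernel specialized to this problem shape the first time it runs."""
    out.zero_()


def warmup_dense_fp8_gemm(w: torch.Tensor, w_scale: torch.Tensor,
                          max_tokens: int, block: int = 128) -> None:
    n, k = w.shape
    # Run the GEMM once per token count so every JIT specialization that could
    # appear during prefill/decode is compiled before serving starts.
    for m in range(1, max_tokens + 1):
        a = torch.zeros(m, k, dtype=torch.float8_e4m3fn)
        a_scale = torch.ones(m, (k + block - 1) // block, dtype=torch.float32)
        out = torch.empty(m, n, dtype=torch.bfloat16)
        _stub_fp8_gemm_nt(a, a_scale, w, w_scale, out)


warmup_dense_fp8_gemm(
    torch.zeros(32, 256, dtype=torch.float8_e4m3fn),  # FP8 weight, shape (N, K)
    torch.ones(1, 2, dtype=torch.float32),            # block-wise scales
    max_tokens=4,
)
```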
_deepgemm_grouped_fp8_gemm_nt_contiguous_warmup ¶
_deepgemm_grouped_fp8_gemm_nt_contiguous_warmup(
    w1: Tensor,
    w2: Tensor,
    w1_scale: Tensor,
    w2_scale: Tensor,
    num_topk: int,
    max_tokens: int,
)
Source code in vllm/model_executor/warmup/deep_gemm_warmup.py
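A hypothetical call-site sketch for the signature above: it builds per-expert FP8 weight and scale tensors and passes them, with the routing width and token budget, to the warmup. Only the function name and parameter list come from this page; the expert count, tensor shapes, 128×128 scale blocks, and CUDA placement are assumptions.

```python
import torch

from vllm.model_executor.warmup.deep_gemm_warmup import (
    _deepgemm_grouped_fp8_gemm_nt_contiguous_warmup,
)

# Per-expert FP8 weights for the two MoE GEMMs (gate/up and down projections).
num_experts, hidden, inter = 8, 1024, 2048
w1 = torch.zeros(num_experts, 2 * inter, hidden,
                 dtype=torch.float8_e4m3fn, device="cuda")
w2 = torch.zeros(num_experts, hidden, inter,
                 dtype=torch.float8_e4m3fn, device="cuda")
w1_scale = torch.ones(num_experts, 2 * inter // 128, hidden // 128, device="cuda")
w2_scale = torch.ones(num_experts, hidden // 128, inter // 128, device="cuda")

# JIT-compile the grouped NT kernels for the shapes this MoE layer can hit.
_deepgemm_grouped_fp8_gemm_nt_contiguous_warmup(
    w1, w2, w1_scale, w2_scale, num_topk=2, max_tokens=8192
)
```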
_extract_data_from_fused_moe_module ¶
Extract weights, weight scales and num_topk from a FusedMoE module.
Source code in vllm/model_executor/warmup/deep_gemm_warmup.py
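A sketch of the kind of extraction this helper performs. The attribute names used here (`w13_weight`, `w2_weight`, `*_weight_scale_inv`, `top_k`) are assumptions about vLLM's FusedMoE layer, shown on a duck-typed dummy so the snippet runs standalone.

```python
from types import SimpleNamespace

import torch


def extract_moe_data(moe):
    """Pull the tensors the grouped warmup needs out of a MoE layer."""
    return (
        moe.w13_weight,            # stacked gate/up projection weights, per expert
        moe.w13_weight_scale_inv,  # matching block-wise FP8 scales
        moe.w2_weight,             # down projection weights, per expert
        moe.w2_weight_scale_inv,
        moe.top_k,                 # experts chosen per token (num_topk)
    )


# Duck-typed dummy standing in for a FusedMoE module.
dummy = SimpleNamespace(
    w13_weight=torch.zeros(4, 16, 8), w13_weight_scale_inv=torch.ones(4, 1, 1),
    w2_weight=torch.zeros(4, 8, 16), w2_weight_scale_inv=torch.ones(4, 1, 1),
    top_k=2,
)
assert extract_moe_data(dummy)[-1] == 2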
_extract_data_from_linear_base_module ¶
Extract weights, weight scales and quantization block sizes from the given LinearBase module.
Source code in vllm/model_executor/warmup/deep_gemm_warmup.py
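A sketch of the extraction described above, assuming a block-quantized FP8 linear layer that exposes `weight`, `weight_scale_inv`, and a quant config carrying `weight_block_size`; these attribute names are assumptions, demonstrated on a duck-typed dummy.

```python
from types import SimpleNamespace

import torch


def extract_linear_data(layer):
    """Pull weight, scales, and quant block size out of a linear layer."""
    w = layer.weight                 # FP8 weight, shape (N, K)
    ws = layer.weight_scale_inv      # block-wise inverse scales
    block_size = layer.quant_method.quant_config.weight_block_size
    return w, ws, block_size


# Duck-typed dummy standing in for a block-quantized LinearBase module.
dummy = SimpleNamespace(
    weight=torch.zeros(256, 512, dtype=torch.float8_e4m3fn),
    weight_scale_inv=torch.ones(2, 4),
    quant_method=SimpleNamespace(
        quant_config=SimpleNamespace(weight_block_size=[128, 128])),
)
print(extract_linear_data(dummy)[2])  # -> [128, 128]
```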
_fp8_linear_may_use_deep_gemm ¶
Return True if the input module/layer could be processed with DeepGEMM.
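A sketch of the kind of eligibility check described above, under the assumption that DeepGEMM only applies to block-quantized FP8 linear layers. The real check also consults vLLM's quantization method classes and platform/DeepGEMM availability, which are omitted here.

```python
import torch
import torch.nn as nn


def fp8_linear_may_use_deep_gemm(module: nn.Module) -> bool:
    """Heuristic eligibility check: FP8 weight plus block-wise scales."""
    weight = getattr(module, "weight", None)
    if weight is None or weight.dtype != torch.float8_e4m3fn:
        return False  # DeepGEMM's NT kernels expect FP8 weights
    # Block-quantized layers carry per-block inverse scales alongside the weight.
    return getattr(module, "weight_scale_inv", None) is not None


print(fp8_linear_may_use_deep_gemm(nn.Linear(8, 8)))  # -> False (fp32 weight)
```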