vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize ¶
FlashInferAllGatherMoEPrepareAndFinalize ¶
Bases: FlashInferCutlassMoEPrepareAndFinalize
FlashInfer implementation using AllGather communication.
Source code in vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
__init__ ¶
finalize ¶
finalize(
    output: Tensor,
    fused_expert_output: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    apply_router_weight_on_input: bool,
    weight_and_reduce_impl: TopKWeightAndReduce,
) -> None
prepare ¶
prepare(
    a1: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    num_experts: int,
    expert_map: Optional[Tensor],
    apply_router_weight_on_input: bool,
    quant_config: FusedMoEQuantConfig,
) -> PrepareResultType
FlashInferAllToAllMoEPrepareAndFinalize ¶
Bases: FlashInferCutlassMoEPrepareAndFinalize
FlashInfer implementation using AllToAll communication.
__init__ ¶
finalize ¶
finalize(
    output: Tensor,
    fused_expert_output: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    apply_router_weight_on_input: bool,
    weight_and_reduce_impl: TopKWeightAndReduce,
) -> None
prepare ¶
prepare(
    a1: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    num_experts: int,
    expert_map: Optional[Tensor],
    apply_router_weight_on_input: bool,
    quant_config: FusedMoEQuantConfig,
) -> PrepareResultType
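The two-phase contract these classes share can be sketched as follows. This is a hedged, plain-Python stand-in (lists instead of torch Tensors, an identity "expert", no quantization or cross-rank communication): prepare() readies the routed activations before the fused experts run, and finalize() folds each token's top-k expert outputs back into `output` using the routing weights. The class and method bodies here are illustrative assumptions, not the real FlashInfer path.

```python
class ToyPrepareAndFinalize:
    """Plain-Python stand-in for the prepare/finalize contract."""

    def prepare(self, a1, topk_weights, topk_ids):
        # Real implementations may quantize a1 and exchange it across
        # data-parallel ranks (all-gather or all-to-all); the toy
        # passes it through unchanged.
        return a1

    def finalize(self, output, fused_expert_output, topk_weights,
                 topk_ids, apply_router_weight_on_input):
        # Weighted reduction of each token's top_k expert outputs.
        # If the router weight was already applied on the input,
        # the outputs are summed with weight 1.0 instead.
        top_k = len(topk_ids[0])
        for t in range(len(output)):
            for i in range(len(output[t])):
                acc = 0.0
                for k in range(top_k):
                    w = (1.0 if apply_router_weight_on_input
                         else topk_weights[t][k])
                    acc += w * fused_expert_output[t * top_k + k][i]
                output[t][i] = acc


pf = ToyPrepareAndFinalize()
out = [[0.0, 0.0]]
pf.finalize(out, [[1.0, 2.0], [3.0, 4.0]], [[0.25, 0.75]], [[0, 1]], False)
# out is now [[2.5, 3.5]]
```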
FlashInferCutlassMoEPrepareAndFinalize ¶
Bases: FusedMoEPrepareAndFinalize
Base class for FlashInfer MoE prepare and finalize operations.
__init__ ¶
_apply_router_weight_on_input ¶
_apply_router_weight_on_input(
    a1: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    apply_router_weight_on_input: bool,
) -> None
Apply router weight on input if needed.
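Conceptually, applying the router weight on the input means folding each token's routing weight into its activations before the experts run, which is only well-defined when top_k == 1 (one weight per token). A hedged sketch on plain Python lists, standing in for the tensor operation:

```python
def apply_router_weight_on_input(a1, topk_weights, topk_ids,
                                 apply_router_weight_on_input):
    """Scale each token's activations by its routing weight, in place.

    Illustrative stand-in: with a single expert per token (top_k == 1)
    the routing weight can be applied to the input instead of the
    expert output.
    """
    if not apply_router_weight_on_input:
        return
    top_k = len(topk_ids[0])
    assert top_k == 1, "weight-on-input requires top_k == 1"
    for t, row in enumerate(a1):
        w = topk_weights[t][0]
        for i in range(len(row)):
            row[i] *= w


a1 = [[1.0, 2.0], [3.0, 4.0]]
apply_router_weight_on_input(a1, [[0.5], [2.0]], [[3], [1]], True)
# a1 is now [[0.5, 1.0], [6.0, 8.0]]
```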
max_num_tokens_per_rank ¶
create_flashinfer_prepare_finalize ¶
create_flashinfer_prepare_finalize(
    use_dp: bool,
    use_nvfp4: bool = False,
    enable_alltoallv: bool = False,
) -> FlashInferCutlassMoEPrepareAndFinalize
Factory function to create the appropriate FlashInfer implementation.
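The selection logic can be sketched with empty stand-in classes. The branching below is an assumption inferred from the class names in this module (AllToAll when alltoallv is enabled under data parallelism, AllGather otherwise); the real code may order its checks differently, use use_nvfp4 to configure quantization, and pass constructor arguments omitted here.

```python
# Stand-in class hierarchy mirroring the names in this module.
class FlashInferCutlassMoEPrepareAndFinalize:
    pass


class FlashInferAllGatherMoEPrepareAndFinalize(
        FlashInferCutlassMoEPrepareAndFinalize):
    pass


class FlashInferAllToAllMoEPrepareAndFinalize(
        FlashInferCutlassMoEPrepareAndFinalize):
    pass


def create_flashinfer_prepare_finalize(use_dp, use_nvfp4=False,
                                       enable_alltoallv=False):
    # Assumed branching: AllToAll only makes sense under data
    # parallelism with alltoallv enabled; everything else falls back
    # to the AllGather variant.
    if use_dp and enable_alltoallv:
        return FlashInferAllToAllMoEPrepareAndFinalize()
    return FlashInferAllGatherMoEPrepareAndFinalize()
```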
flashinfer_alltoall_combine ¶
flashinfer_alltoall_combine(
    all2all_manager: All2AllManagerBase,
    output: Tensor,
    top_k: int,
    token_count: int,
    alltoall_info,
)
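The local half of "combine" can be sketched without any framework: after the experts run, each token's top_k partial outputs are reduced into a single row using the routing weights. Plain lists stand in for tensors here; the real flashinfer_alltoall_combine additionally moves rows back to their home ranks through FlashInfer's all-to-all before this reduction.

```python
def combine(fused_expert_output, topk_weights, top_k, token_count):
    """Weighted-sum each token's top_k expert rows into one output row.

    fused_expert_output is laid out as token_count * top_k rows,
    token-major: rows [t*top_k, (t+1)*top_k) belong to token t.
    """
    hidden = len(fused_expert_output[0])
    out = [[0.0] * hidden for _ in range(token_count)]
    for t in range(token_count):
        for k in range(top_k):
            w = topk_weights[t][k]
            row = fused_expert_output[t * top_k + k]
            for i in range(hidden):
                out[t][i] += w * row[i]
    return out


combined = combine([[1.0, 2.0], [3.0, 4.0]], [[0.25, 0.75]], 2, 1)
# combined == [[2.5, 3.5]]
```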
flashinfer_alltoall_dispatch ¶
flashinfer_alltoall_dispatch(
    all2all_manager: All2AllManagerBase,
    global_num_tokens_cpu: list[int],
    x: Tensor,
    gs: Tensor,
    topk_ids: Tensor,
    topk_weights: Tensor,
    top_k: int,
    num_experts: int,
    quant_config: FusedMoEQuantConfig,
)
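The core of "dispatch" is the inverse of combine: each token is replicated once per selected expert and the copies are bucketed by expert id. A hedged, purely illustrative sketch on plain lists; the real flashinfer_alltoall_dispatch also quantizes via quant_config and exchanges the buckets across data-parallel ranks with FlashInfer's all-to-all.

```python
def dispatch(x, topk_ids, num_experts):
    """Bucket token copies by expert id.

    Returns one list per expert holding (token_index, activations)
    pairs, so each expert can process its assigned tokens.
    """
    buckets = [[] for _ in range(num_experts)]
    for t, experts in enumerate(topk_ids):
        for e in experts:
            buckets[e].append((t, x[t]))
    return buckets


buckets = dispatch([[1.0], [2.0]], [[0, 1], [1, 2]], 3)
# buckets == [[(0, [1.0])], [(0, [1.0]), (1, [2.0])], [(1, [2.0])]]
```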