# Fused MoE Kernel features
The purpose of this document is to provide an overview of the various MoE kernels (both modular and non-modular) so it will be easier to select an appropriate set of kernels for any particular situation. This includes information about the all2all backends used by modular kernels.
## Fused MoE Modular All2All backends
There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the `FusedMoE` layer. The different `FusedMoEPrepareAndFinalize` sub-classes provide an interface for each all2all backend.
The following table describes the relevant features of each backend, i.e. activation format, supported quantization schemes and async support.
The output activation format (standard or batched) corresponds to the output of the prepare step of the `FusedMoEPrepareAndFinalize` subclass; the finalize step requires the same format. All the backend `prepare` methods expect activations in standard format and all the `finalize` methods return activations in standard format. More details on the formats can be found in the Fused MoE Modular Kernel document.
The quantization types and formats enumerate which quantization schemes are supported by each `FusedMoEPrepareAndFinalize` class. Quantization can happen before or after dispatch, depending on which formats the all2all backend supports; e.g. deepep_high_throughput supports only the block-quantized fp8 format, so any other format results in dispatching in higher precision and quantizing afterwards. The output of the prepare step for each backend is the quantized type. The finalize step generally requires the same input type as the original activations; e.g. if the original input is bfloat16 and the quantization scheme is fp8 w/per-tensor scales, `prepare` will return fp8 activations with per-tensor scales and `finalize` will take bfloat16 activations. See the diagrams in Fused MoE Modular Kernel for more details on the types and formats of activations at each step of the MoE process. If no quantization type is specified, the kernel operates on float16 and/or bfloat16.
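To make the type flow concrete, here is a minimal toy sketch (plain PyTorch, not the actual vLLM interfaces; the `toy_prepare`/`toy_experts`/`toy_finalize` helpers and the per-tensor fp8 scheme are illustrative assumptions) showing how a bfloat16 input is quantized by the prepare step and returned to bfloat16 by the experts/finalize steps:

```python
import torch

def toy_prepare(a: torch.Tensor):
    # Hypothetical stand-in for FusedMoEPrepareAndFinalize.prepare:
    # dispatch + quantize the bfloat16 input to fp8 with a per-tensor scale.
    scale = a.abs().amax().float() / torch.finfo(torch.float8_e4m3fn).max
    a_q = (a.float() / scale).to(torch.float8_e4m3fn)
    return a_q, scale  # quantized activations in the backend's output format

def toy_experts(a_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Hypothetical expert computation: dequantize, apply a dummy "expert",
    # and produce results in the original activation dtype (bfloat16).
    a = a_q.float() * scale
    return (a * 2.0).to(torch.bfloat16)  # placeholder for the real expert GEMMs

def toy_finalize(out: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for FusedMoEPrepareAndFinalize.finalize:
    # the combine step consumes activations in the original input type.
    return out

a = torch.randn(4, 8, dtype=torch.bfloat16)
a_q, scale = toy_prepare(a)        # fp8 activations + per-tensor scale
out = toy_experts(a_q, scale)      # bfloat16 results
result = toy_finalize(out)         # combined bfloat16 output
```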
Async backends support the use of DBO (Dual Batch Overlap) and shared expert overlap (where shared experts are computed during the combine step).
Certain models require the topk weights to be applied to the input activations rather than the output activations when topk==1, e.g. llama. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalize` subclass; for non-modular kernels, it is up to the experts function to deal with this flag.
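As a minimal illustration of what the flag controls (toy code, not any model's actual implementation; the `expert` function and sizes are arbitrary assumptions), the two orderings below differ because the expert is non-linear:

```python
import torch

def expert(x: torch.Tensor) -> torch.Tensor:
    # Toy non-linear expert standing in for a real gated MLP.
    return torch.nn.functional.silu(x) * 3.0

x = torch.randn(4, 8)
w = torch.tensor(0.7)  # topk==1, so a single routing weight per token

out_weight_on_output = w * expert(x)  # default: weight applied to the output
out_weight_on_input = expert(w * x)   # "apply weight on input": weight applied first

# The two results differ, so the kernel must honor whichever convention
# the model (e.g. llama) expects.
assert not torch.allclose(out_weight_on_output, out_weight_on_input)
```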
Unless otherwise specified, backends are selected via the `VLLM_ALL2ALL_BACKEND` environment variable. All backends except flashinfer only work with EP+DP or EP+TP; flashinfer can work with EP or DP w/o EP.
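For example, a backend might be selected like this (a minimal sketch; the model name and parallel configuration are placeholder assumptions, and the DeepEP backends have their own installation and parallelism requirements):

```python
import os

# Select the all2all backend before vLLM is initialized.
os.environ["VLLM_ALL2ALL_BACKEND"] = "deepep_low_latency"

from vllm import LLM

# Expert parallelism is enabled alongside tensor/data parallelism;
# the flashinfer backends are the exception that can run EP or DP alone.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # placeholder MoE model
    enable_expert_parallel=True,
    tensor_parallel_size=2,
)
print(llm.generate("Hello")[0].outputs[0].text)
```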
Backend | Output act. format | Quant. types | Quant. format | Async | Apply Weight On Input | Sub-class |
---|---|---|---|---|---|---|
naive | standard | all¹ | G,A,T | N | ⁶ | layer.py |
pplx | batched | fp8,int8 | G,A,T | Y | Y | PplxPrepareAndFinalize |
deepep_high_throughput | standard | fp8 | G(128),A,T² | Y | Y | DeepEPHTPrepareAndFinalize |
deepep_low_latency | batched | fp8 | G(128),A,T³ | Y | Y | DeepEPLLPrepareAndFinalize |
flashinfer_all2allv | standard | nvfp4,fp8 | G,A,T | N | N | FlashInferAllToAllMoEPrepareAndFinalize |
flashinfer⁴ | standard | nvfp4,fp8 | G,A,T | N | N | FlashInferCutlassMoEPrepareAndFinalize |
MoEPrepareAndFinalizeNoEP⁵ | standard | fp8,int8 | G,A,T | N | Y | MoEPrepareAndFinalizeNoEP |
BatchedPrepareAndFinalize⁵ | batched | fp8,int8 | G,A,T | N | Y | BatchedPrepareAndFinalize |
Table key:

1. All types: mxfp4, nvfp4, int4, int8, fp8
2. A,T quantization occurs after dispatch.
3. All quantization happens after dispatch.
4. Controlled by a different env var (`VLLM_FLASHINFER_MOE_BACKEND`, "throughput" or "latency").
5. This is a no-op dispatcher that can be paired with any modular experts to produce a modular kernel that runs w/o dispatch or combine. These cannot be selected via environment variable. They are generally used for testing or for adapting an experts subclass to the `fused_experts` API.
6. This depends on the experts implementation.

Quantization format abbreviations (see the scale-shape sketch below):

- G - Grouped
- G(N) - Grouped w/block size N
- A - Per activation token
- T - Per tensor
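As a rough illustration of what these formats mean in practice (a toy sketch; the tensor sizes and the use of max-abs values as stand-ins for scales are arbitrary assumptions), the scale tensors for an `M x K` activation would typically have the following shapes:

```python
import torch

M, K, BLOCK = 4, 256, 128
a = torch.randn(M, K)

# T - per tensor: one max-abs value (hence one scale) for the whole tensor.
t_scale = a.abs().amax().reshape(1)                          # shape: (1,)

# A - per activation token: one value per row (token).
a_scale = a.abs().amax(dim=1, keepdim=True)                  # shape: (M, 1)

# G(128) - grouped w/block size 128: one value per 128-wide block of each row.
g_scale = a.reshape(M, K // BLOCK, BLOCK).abs().amax(dim=2)  # shape: (M, K // BLOCK)

print(t_scale.shape, a_scale.shape, g_scale.shape)
```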
Modular kernels are supported by the following `FusedMoEMethodBase` classes:

- `ModelOptFp8MoEMethod`
- `Fp8MoEMethod`
- `CompressedTensorsW4A4MoeMethod`
- `CompressedTensorsW8A8Fp8MoEMethod`
- `Mxfp4MoEMethod`
- `UnquantizedFusedMoEMethod`
## Fused MoE Experts Kernels
There are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton `fused_experts` function. Many have modular kernel adapters so they can be used with compatible all2all backends. The table below lists each experts kernel and its particular properties.
Each kernel must be provided with one of the supported input activation formats. Some flavors of kernels support both standard and batched formats through different entry points, e.g. `TritonExperts` and `BatchedTritonExperts`. Batched format kernels are currently only needed for matching with certain all2all backends, e.g. `pplx`, `DeepEPLLPrepareAndFinalize`.
Similar to the backend kernels, each experts kernel only supports certain quantization formats. For non-modular experts, the activations will be in the original type and quantized internally by the kernel. Modular experts will expect the activations to already be in the quantized format. Both types of experts will yield outputs in the original activation type.
Each experts kernel supports one or more activation functions, e.g. silu and gelu, which are applied to the intermediate results.
As with the backends, some experts support applying the topk weights on the input activations. The entries in the "Apply Weight On Input" column of this table only apply to the non-modular experts.
Most experts flavors include an equivalent modular interface which will be a subclass of `FusedMoEPermuteExpertsUnpermute`.
To be used with a particular `FusedMoEPrepareAndFinalize` sub-class, MoE kernels must have compatible activation formats, quantization types and quantization formats.
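The kind of check this implies could look roughly like the following (a toy sketch with made-up `KernelTraits` fields summarizing the two tables; vLLM's real compatibility logic lives inside the modular kernel machinery and differs in detail):

```python
from dataclasses import dataclass

@dataclass
class KernelTraits:
    # Hypothetical summary of the properties listed in the tables.
    activation_format: str          # "standard" or "batched"
    quant_dtypes: frozenset[str]    # e.g. {"fp8", "int8"}
    quant_formats: frozenset[str]   # e.g. {"G", "A", "T"}

def compatible(prepare_finalize: KernelTraits, experts: KernelTraits) -> bool:
    # The prepare/finalize output format must match the experts' input format,
    # and the two must share at least one quantization dtype and format.
    return (
        prepare_finalize.activation_format == experts.activation_format
        and bool(prepare_finalize.quant_dtypes & experts.quant_dtypes)
        and bool(prepare_finalize.quant_formats & experts.quant_formats)
    )

deepep_ll = KernelTraits("batched", frozenset({"fp8"}), frozenset({"G(128)", "A", "T"}))
batched_triton = KernelTraits("batched", frozenset({"fp8", "int8"}), frozenset({"G", "A", "T"}))
print(compatible(deepep_ll, batched_triton))  # True: formats overlap on A and T
```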
Kernel | Input act. format | Quant. types | Quant. format | Activation function | Apply Weight On Input | Modular | Source |
---|---|---|---|---|---|---|---|
triton | standard | all¹ | G,A,T | silu, gelu, swigluoai, silu_no_mul, gelu_no_mul | Y | Y | fused_experts, TritonExperts |
triton (batched) | batched | all¹ | G,A,T | silu, gelu | ⁶ | Y | BatchedTritonExperts |
deep gemm | standard, batched | fp8 | G(128),A,T | silu, gelu | ⁶ | Y | deep_gemm_moe_fp8, DeepGemmExperts, BatchedDeepGemmExperts |
cutlass_fp4 | standard, batched | nvfp4 | A,T | silu | Y | Y | cutlass_moe_fp4, CutlassExpertsFp4 |
cutlass_fp8 | standard, batched | fp8 | A,T | silu, gelu | Y | Y | cutlass_moe_fp8, CutlassExpertsFp8, CutlassBatchedExpertsFp8 |
flashinfer | standard | nvfp4, fp8 | T | ⁵ | N | Y | flashinfer_cutlass_moe_fp4, FlashInferExperts |
gpt oss triton | standard | N/A | N/A | ⁵ | Y | Y | triton_kernel_fused_experts, OAITritonExperts |
deep gemm+triton² | standard, batched | all¹ | G(128),A,T | silu, gelu | ⁶ | Y | TritonOrDeepGemmExperts, BatchedTritonOrDeepGemmExperts |
marlin | standard | ³ | ³ | silu, swigluoai | Y | N | fused_marlin_moe |
trtllm | standard | mxfp4, nvfp4 | G(16),G(32) | ⁵ | N | Y | TrtLlmGenExperts |
pallas | standard | N/A | N/A | silu | N | N | fused_moe |
iterative | standard | N/A | N/A | silu | N | N | fused_moe |
rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | rocm_aiter_fused_experts |
cpu_fused_moe | standard | N/A | N/A | silu | N | N | CPUFusedMOE |
naive batched⁴ | batched | int8, fp8 | G,A,T | silu, gelu | ⁶ | Y | NaiveBatchedExperts |
Table key:

1. All types: mxfp4, nvfp4, int4, int8, fp8
2. A dispatcher wrapper around the triton and deep gemm experts. It will select one based on type + shape + quantization params.
3. uint4, uint8, fp8, fp4
4. This is a naive implementation of experts that supports the batched format. Mainly used for testing.
5. The `activation` parameter is ignored and SwiGLU is used by default instead.
6. Only handled by, or supported when used with, modular kernels.
Modular Kernel "families"¶
The following table shows "families" of modular kernels that are intended to work together. There are some combinations which may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend will work with any non-modular experts.
backend | FusedMoEPrepareAndFinalize subclasses | FusedMoEPermuteExpertsUnpermute subclasses |
---|---|---|
pplx, deepep_low_latency | PplxPrepareAndFinalize, DeepEPLLPrepareAndFinalize | BatchedDeepGemmExperts, BatchedTritonExperts, BatchedTritonOrDeepGemmExperts, CutlassBatchedExpertsFp8 |
deepep_high_throughput | DeepEPHTPrepareAndFinalize | DeepGemmExperts, TritonExperts, TritonOrDeepGemmExperts, CutlassExpertsFp8 |
flashinfer | FlashInferCutlassMoEPrepareAndFinalize | FlashInferExperts |
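To show how a family is assembled, here is a rough composition sketch. It assumes a `FusedMoEModularKernel`-style wrapper with the constructor argument order shown, as well as pre-constructed prepare/finalize and experts objects; consult the Fused MoE Modular Kernel document for the actual interfaces.

```python
# A minimal sketch of composing one "family" into a modular kernel.
# The import path and constructor signature below are assumptions.

def build_modular_kernel(prepare_finalize, experts):
    from vllm.model_executor.layers.fused_moe.modular_kernel import (
        FusedMoEModularKernel,  # assumed location of the wrapper class
    )
    # The prepare/finalize backend handles dispatch/combine (all2all),
    # while the experts object performs the per-expert computation in between.
    return FusedMoEModularKernel(prepare_finalize, experts)

# e.g. for the pplx / deepep_low_latency family one would pair a
# DeepEPLLPrepareAndFinalize instance with BatchedTritonExperts;
# both require backend-specific construction that is not shown here.
```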