
Fused MoE Kernel features

The purpose of this document is to provide an overview of the various MoE kernels (both modular and non-modular) so it will be easier to select an appropriate set of kernels for any particular situation. This includes information about the all2all backends used by modular kernels.

Fused MoE Modular All2All backends

There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the FusedMoE layer. The different FusedMoEPrepareAndFinalize sub-classes provide an interface for each all2all backend.

The following table describes the relevant features of each backend, i.e. activation format, supported quantization schemes and async support.

The output activation format (standard or batched) corresponds to the output of the prepare step of the FusedMoEPrepareAndFinalize subclass; the finalize step requires its input in the same format. All backend prepare methods expect activations in standard format and all finalize methods return activations in standard format. More details on the formats can be found in the Fused MoE Modular Kernel document.
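
As a rough illustration of how the formats flow through a modular kernel, the sketch below strings the three steps together. The method names and signatures are simplified stand-ins of my own choosing, not the exact vLLM interfaces; see the Fused MoE Modular Kernel document for the real APIs.

```python
# Simplified stand-in for the modular MoE flow (not the exact vLLM interfaces).
def modular_moe_forward(prepare_finalize, experts, hidden_states, topk_ids, topk_weights):
    # prepare(): consumes standard-format activations, optionally quantizes them,
    # and dispatches tokens to expert ranks. Its output format (standard or
    # batched) is backend specific.
    dispatched = prepare_finalize.prepare(hidden_states, topk_ids, topk_weights)

    # The experts kernel must accept whatever format prepare() produced.
    expert_out = experts.apply(dispatched, topk_ids, topk_weights)

    # finalize(): combines expert outputs (same format as prepare()'s output)
    # and returns standard-format activations in the original dtype.
    return prepare_finalize.finalize(expert_out, topk_ids, topk_weights)
```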

The quantization types and formats enumerate which quantization schemes are supported by each FusedMoEPrepareAndFinalize class. Quantization can happen before or after dispatch, depending on which formats the all2all backend supports. For example, deepep_high_throughput supports only the block-quantized fp8 format; any other format is dispatched in higher precision and quantized after dispatch. The output of the prepare step for each backend is in the quantized type. The finalize step generally requires the same input type as the original activations, e.g. if the original input is bfloat16 and the quantization scheme is fp8 with per-tensor scales, prepare will return fp8 activations with per-tensor scales and finalize will take bfloat16 activations. See the diagrams in Fused MoE Modular Kernel for more details on the types and formats of activations at each step of the MoE process. If no quantization type is specified, the kernel operates on float16 and/or bfloat16.
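
To make the "quantize before vs. after dispatch" distinction concrete, here is a minimal per-tensor fp8 sketch in plain PyTorch (it needs a build with float8 dtypes). It is not vLLM's quantization kernel, just an illustration of the dtype/scale round trip.

```python
import torch

def quant_fp8_per_tensor(x: torch.Tensor):
    # Illustrative per-tensor fp8 quantization: one scale for the whole tensor.
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = x.abs().amax().float().clamp(min=1e-12) / finfo.max
    x_q = (x.float() / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_q, scale

x = torch.randn(16, 4096, dtype=torch.bfloat16)

# Quantize before dispatch: the all2all moves fp8 data plus a scale (less bandwidth).
x_q, scale = quant_fp8_per_tensor(x)

# Quantize after dispatch: the all2all moves bfloat16 data and each rank runs the
# same quantization on the tokens it received, just before the expert GEMMs.
```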

Async backends support the use of DBO (Dual Batch Overlap) and shared expert overlap (where shared experts are computed during the combine step).

Certain models require the topk weights to be applied to the input activations rather than the output activations when topk==1, e.g. llama. For modular kernels, this feature is handled by the FusedMoEPrepareAndFinalize subclass; for non-modular kernels, it is up to the experts function to deal with this flag.
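
A minimal sketch of what this flag amounts to; the name apply_router_weight_on_input follows the convention used here, but treat the exact signature as illustrative rather than the real API.

```python
import torch

def maybe_apply_weight_on_input(hidden_states: torch.Tensor,
                                topk_weights: torch.Tensor,
                                apply_router_weight_on_input: bool) -> torch.Tensor:
    # When the flag is set (topk == 1 models like the llama example above), the router
    # weight is folded into the inputs; otherwise it is applied to the expert outputs.
    if apply_router_weight_on_input:
        assert topk_weights.shape[-1] == 1, "only defined for topk == 1"
        hidden_states = hidden_states * topk_weights.to(hidden_states.dtype)
    return hidden_states
```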

Unless otherwise specified, backends are selected via the VLLM_ALL2ALL_BACKEND environment variable. All backends except flashinfer only work with EP+DP or EP+TP; flashinfer can work with EP alone or with DP without EP.
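
For example, to pick a specific backend for an expert-parallel deployment, the environment variable is set before vLLM starts. The engine arguments below (enable_expert_parallel, data_parallel_size) reflect my reading of the current vLLM API and may differ between versions, so treat this as a sketch.

```python
import os

# Select the all2all backend before vLLM is imported/started. Valid values include
# "naive", "pplx", "deepep_high_throughput", "deepep_low_latency" and
# "flashinfer_all2allv" (see the table below).
os.environ["VLLM_ALL2ALL_BACKEND"] = "deepep_low_latency"

from vllm import LLM

# EP+DP style deployment; argument names are assumptions, check your vLLM version.
llm = LLM(
    model="<your-moe-model>",
    enable_expert_parallel=True,
    data_parallel_size=2,
    tensor_parallel_size=1,
)
```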

| Backend | Output act. format | Quant. types | Quant. format | Async | Apply Weight On Input | Sub-class |
|---|---|---|---|---|---|---|
| naive | standard | all¹ | G,A,T | N | ⁶ | layer.py |
| pplx | batched | fp8, int8 | G,A,T | Y | Y | PplxPrepareAndFinalize |
| deepep_high_throughput | standard | fp8 | G(128),A,T² | Y | Y | DeepEPHTPrepareAndFinalize |
| deepep_low_latency | batched | fp8 | G(128),A,T³ | Y | Y | DeepEPLLPrepareAndFinalize |
| flashinfer_all2allv | standard | nvfp4, fp8 | G,A,T | N | N | FlashInferAllToAllMoEPrepareAndFinalize |
| flashinfer⁴ | standard | nvfp4, fp8 | G,A,T | N | N | FlashInferCutlassMoEPrepareAndFinalize |
| MoEPrepareAndFinalizeNoEP⁵ | standard | fp8, int8 | G,A,T | N | Y | MoEPrepareAndFinalizeNoEP |
| BatchedPrepareAndFinalize⁵ | batched | fp8, int8 | G,A,T | N | Y | BatchedPrepareAndFinalize |

Table key

  1. All types: mxfp4, nvfp4, int4, int8, fp8
  2. A,T quantization occurs after dispatch.
  3. All quantization happens after dispatch.
  4. Controlled by a separate environment variable, VLLM_FLASHINFER_MOE_BACKEND ("throughput" or "latency").
  5. This is a no-op dispatcher that can be paired with any modular experts to produce a modular kernel that runs without dispatch or combine. These cannot be selected via environment variable; they are generally used for testing or for adapting an expert subclass to the fused_experts API.
  6. This depends on the experts implementation.

  • G - Grouped
  • G(N) - Grouped w/block size N
  • A - Per activation token
  • T - Per tensor
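
The main practical difference between these formats is the granularity, and hence the shape, of the scale tensor that accompanies the quantized activations. A rough illustration, with shapes chosen arbitrarily for the example:

```python
import torch

num_tokens, hidden, block = 8, 4096, 128

# T - per tensor: a single scale for the whole activation tensor.
scale_t = torch.ones(())

# A - per activation token: one scale per token (row).
scale_a = torch.ones(num_tokens, 1)

# G / G(N) - grouped: one scale per group of N contiguous channels per token,
# e.g. G(128) gives hidden // 128 scales per token.
scale_g = torch.ones(num_tokens, hidden // block)
```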

Modular kernels are supported by the following FusedMoEMethodBase classes.

Fused MoE Experts Kernels

There are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton fused_experts function. Many have modular kernel adapters so they can be used with compatible all2all backends. This table lists each experts kernel and its particular properties.

Each kernel must be provided with one of the supported input activation formats. Some flavors of kernels support both standard and batched formats through different entry points, e.g. TritonExperts and BatchedTritonExperts. Batched format kernels are currently only needed for matching with certain all2all backends, e.g. pplx, DeepEPLLPrepareAndFinalize.
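
The two formats can be pictured with a small reference routine that scatters standard-format activations into a batched buffer. This is purely illustrative (the real kernels track padding and counts far more efficiently), with names of my own choosing.

```python
import torch

def standard_to_batched(hidden_states: torch.Tensor,
                        topk_ids: torch.Tensor,
                        num_experts: int,
                        max_tokens_per_expert: int):
    # standard: (num_tokens, hidden) plus a token -> expert mapping (topk_ids).
    # batched:  (num_experts, max_tokens_per_expert, hidden) with per-expert counts;
    #           unused slots are left as padding.
    num_tokens, hidden = hidden_states.shape
    batched = hidden_states.new_zeros(num_experts, max_tokens_per_expert, hidden)
    counts = torch.zeros(num_experts, dtype=torch.int64)
    for token in range(num_tokens):
        for expert in topk_ids[token].tolist():
            slot = int(counts[expert])
            assert slot < max_tokens_per_expert, "buffer too small for this routing"
            batched[expert, slot] = hidden_states[token]
            counts[expert] += 1
    return batched, counts
```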

Similar to the backend kernels, each experts kernel only supports certain quantization formats. For non-modular experts, the activations will be in the original type and quantized internally by the kernel. Modular experts will expect the activations to already be in the quantized format. Both types of experts will yield outputs in the original activation type.

Each experts kernel supports one or more activation functions, e.g. silu and gelu, which are applied to the intermediate results.
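
For the common silu case this is the gated "SiLU-and-mul" form: the first expert GEMM produces 2*d channels and the activation gates one half with the other, while the *_no_mul variants apply only the nonlinearity. A plain PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # x has 2*d channels from the first expert GEMM; gate one half with the other.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

def silu_no_mul(x: torch.Tensor) -> torch.Tensor:
    # "No mul" variant: just the nonlinearity, no gating multiply.
    return F.silu(x)
```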

As with the backends, some experts support applying topk weights on the input activations. The entries in the "Apply Weight On Input" column of this table apply only to the non-modular experts.

Most experts flavors include an equivalent modular interface, which is a subclass of FusedMoEPermuteExpertsUnpermute.

To be used with a particular FusedMoEPrepareAndFinalize sub-class, MoE kernels must have compatible activation formats, quantization types and quantization formats.
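
Conceptually, pairing amounts to a compatibility check like the following; the attribute names here are hypothetical, chosen only to spell out the three constraints, not taken from the vLLM API.

```python
def is_compatible(prepare_finalize, experts) -> bool:
    # Hypothetical attribute names; the point is the three constraints, not the API.
    return (
        prepare_finalize.activation_format == experts.activation_format  # standard vs batched
        and experts.quant_dtype in prepare_finalize.supported_quant_dtypes
        and experts.quant_format in prepare_finalize.supported_quant_formats
    )
```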

| Kernel | Input act. format | Quant. types | Quant. format | Activation function | Apply Weight On Input | Modular | Source |
|---|---|---|---|---|---|---|---|
| triton | standard | all¹ | G,A,T | silu, gelu, swigluoai, silu_no_mul, gelu_no_mul | Y | Y | fused_experts, TritonExperts |
| triton (batched) | batched | all¹ | G,A,T | silu, gelu | ⁶ | Y | BatchedTritonExperts |
| deep gemm | standard, batched | fp8 | G(128),A,T | silu, gelu | ⁶ | Y | deep_gemm_moe_fp8, DeepGemmExperts, BatchedDeepGemmExperts |
| cutlass_fp4 | standard, batched | nvfp4 | A,T | silu | Y | Y | cutlass_moe_fp4, CutlassExpertsFp4 |
| cutlass_fp8 | standard, batched | fp8 | A,T | silu, gelu | Y | Y | cutlass_moe_fp8, CutlassExpertsFp8, CutlassBatchedExpertsFp8 |
| flashinfer | standard | nvfp4, fp8 | T | ⁵ | N | Y | flashinfer_cutlass_moe_fp4, FlashInferExperts |
| gpt oss triton | standard | N/A | N/A | ⁵ | Y | Y | triton_kernel_fused_experts, OAITritonExperts |
| deep gemm+triton² | standard, batched | all¹ | G(128),A,T | silu, gelu | ⁶ | Y | TritonOrDeepGemmExperts, BatchedTritonOrDeepGemmExperts |
| marlin | standard | ³ | ³ | silu, swigluoai | Y | N | fused_marlin_moe |
| trtllm | standard | mxfp4, nvfp4 | G(16),G(32) | ⁵ | N | Y | TrtLlmGenExperts |
| pallas | standard | N/A | N/A | silu | N | N | fused_moe |
| iterative | standard | N/A | N/A | silu | N | N | fused_moe |
| rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | rocm_aiter_fused_experts |
| cpu_fused_moe | standard | N/A | N/A | silu | N | N | CPUFusedMOE |
| naive batched⁴ | batched | int8, fp8 | G,A,T | silu, gelu | ⁶ | Y | NaiveBatchedExperts |

Table key

  1. All types: mxfp4, nvfp4, int4, int8, fp8
  2. A dispatching wrapper around the triton and deep gemm experts. It selects between them based on data type, shape, and quantization parameters.
  3. uint4, uint8, fp8, fp4
  4. This is a naive implementation of experts that supports batched format. Mainly used for testing.
  5. The activation parameter is ignored and SwiGlu is used by default instead.
  6. Only handled or supported when used with modular kernels.

Modular Kernel "families"

The following table shows "families" of modular kernels that are intended to work together. There are some combinations which may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend will work with any non-modular experts.

| Backend | FusedMoEPrepareAndFinalize subclasses | FusedMoEPermuteExpertsUnpermute subclasses |
|---|---|---|
| deepep_low_latency, pplx | DeepEPLLPrepareAndFinalize, PplxPrepareAndFinalize | BatchedDeepGemmExperts, BatchedTritonExperts, BatchedTritonOrDeepGemmExperts, CutlassBatchedExpertsFp8 |
| deepep_high_throughput | DeepEPHTPrepareAndFinalize | DeepGemmExperts, TritonExperts, TritonOrDeepGemmExperts, CutlassExpertsFp8 |
| flashinfer | FlashInferCutlassMoEPrepareAndFinalize | FlashInferExperts |
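
Putting a family together means constructing one FusedMoEPrepareAndFinalize object and one matching FusedMoEPermuteExpertsUnpermute object and handing them to the composing modular-kernel class described in the Fused MoE Modular Kernel document. The import path and constructor below reflect my understanding of the current layout and may differ between vLLM versions, so treat this as a sketch.

```python
def build_modular_kernel(prepare_finalize, experts):
    # Compose a backend's prepare/finalize object with experts from the same family,
    # e.g. DeepEPLLPrepareAndFinalize(...) with BatchedTritonExperts(...).
    # Module path and constructor are assumptions; check your vLLM version.
    from vllm.model_executor.layers.fused_moe.modular_kernel import (
        FusedMoEModularKernel,
    )
    return FusedMoEModularKernel(prepare_finalize, experts)
```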