vllm.forward_context ¶
batchsize_logging_interval module-attribute ¶
batchsize_logging_interval: float = (
VLLM_LOG_BATCHSIZE_INTERVAL
)
BatchDescriptor ¶
Bases: NamedTuple
Batch descriptor for cudagraph dispatching. We should keep the number of fields as minimal as possible so that it properly and uniquely describes the padded batch for cudagraph.
Source code in vllm/forward_context.py
non_uniform property ¶
non_uniform: BatchDescriptor
Return a non-uniform version of current batch descriptor.
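To illustrate how such a descriptor can serve as a cudagraph dispatch key, here is a minimal sketch. The field names (num_tokens, uniform_decode) are assumptions for illustration and may not match the actual definition in vllm/forward_context.py.

```python
from typing import NamedTuple


class BatchDescriptorSketch(NamedTuple):
    """Illustrative sketch only; field names are assumed, not confirmed."""

    # Padded token count of the batch fed to the captured graph.
    num_tokens: int
    # Whether every request in the batch is a uniform decode step.
    uniform_decode: bool = False

    @property
    def non_uniform(self) -> "BatchDescriptorSketch":
        # Relax the descriptor so it can match graphs captured for
        # non-uniform (mixed prefill/decode) batches of the same size.
        return BatchDescriptorSketch(self.num_tokens, uniform_decode=False)


# Because NamedTuples are hashable, the descriptor can key a cache of
# captured cudagraphs:
graph_cache: dict[BatchDescriptorSketch, object] = {}
key = BatchDescriptorSketch(num_tokens=256, uniform_decode=True)
graph_cache.setdefault(key, object())  # stand-in for a captured graph
```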
DPMetadata dataclass ¶
Source code in vllm/forward_context.py
__init__ ¶
__init__(
max_tokens_across_dp_cpu: Tensor,
num_tokens_across_dp_cpu: Tensor,
local_sizes: Optional[list[int]] = None,
) -> None
chunked_sizes ¶
chunked_sizes(
    sequence_parallel_size: int,
    max_chunk_size_per_rank: int,
    chunk_idx: int,
)
Context manager to compute and temporarily set the per-rank local token sizes for a specific chunk during chunked forward execution.
This is necessary to ensure each DP (data parallel) rank processes its designated portion of tokens in lockstep with others, even when the token counts are uneven or some ranks have completed their input early.
For chunked execution, we break up the total tokens on each rank into multiple chunks (each of at most max_chunk_size_per_rank tokens), and for a given chunk_idx, this context manager sets self.local_sizes to the number of tokens to process in that chunk on each rank. self.local_sizes is only valid inside the context.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sequence_parallel_size | int | When Attn is TP and MoE layers are EP, we use SP between the layers to avoid redundant ops. We need this value to compute the chunked sizes. | required |
| max_chunk_size_per_rank | int | The max number of tokens each rank is allowed to process in this chunk. | required |
| chunk_idx | int | The index of the chunk to compute sizes for. | required |
Source code in vllm/forward_context.py
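As a rough illustration of the per-chunk bookkeeping described above, the sketch below computes, for a given chunk_idx, how many tokens each DP rank would process when its total is split into chunks of at most max_chunk_size_per_rank. This is a simplified stand-in, not the actual implementation, and it ignores the sequence-parallel adjustment.

```python
def chunk_sizes_for_idx(
    tokens_per_rank: list[int],
    max_chunk_size_per_rank: int,
    chunk_idx: int,
) -> list[int]:
    # Each rank walks through its tokens in fixed-size chunks; ranks that
    # have already consumed all of their tokens contribute 0 for this chunk.
    sizes = []
    for total in tokens_per_rank:
        start = chunk_idx * max_chunk_size_per_rank
        remaining = max(total - start, 0)
        sizes.append(min(remaining, max_chunk_size_per_rank))
    return sizes


# Example: three DP ranks with uneven token counts, chunks of 8 tokens.
print(chunk_sizes_for_idx([20, 5, 13], 8, 0))  # [8, 5, 8]
print(chunk_sizes_for_idx([20, 5, 13], 8, 1))  # [8, 0, 5]
print(chunk_sizes_for_idx([20, 5, 13], 8, 2))  # [4, 0, 0]
```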
cu_tokens_across_sp ¶
Source code in vllm/forward_context.py
get_chunk_sizes_across_dp_rank ¶
make staticmethod ¶
make(
parallel_config: ParallelConfig,
attn_metadata: Any,
num_tokens: int,
num_tokens_across_dp_cpu: Optional[Tensor] = None,
) -> DPMetadata
Source code in vllm/forward_context.py
num_tokens_across_dp staticmethod ¶
Gather the num_tokens across all DP ranks and return results in a CPU tensor of size dp_size.
Source code in vllm/forward_context.py
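A hedged sketch of the kind of collective this involves: each DP rank contributes its local num_tokens, and every rank ends up with the full per-rank vector on CPU. The exact process group and API used by vLLM may differ; this just shows one way to express it with torch.distributed.

```python
import torch
import torch.distributed as dist


def gather_num_tokens_across_dp(
    num_tokens: int, dp_size: int, dp_rank: int
) -> torch.Tensor:
    # Start with zeros, fill in our own slot, then all-reduce so every
    # rank ends up with the per-rank token counts in a dp_size-long tensor.
    counts = torch.zeros(dp_size, dtype=torch.int32)
    counts[dp_rank] = num_tokens
    dist.all_reduce(counts)  # assumes a CPU-capable (e.g. gloo) group
    return counts
```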
should_ubatch_across_dp staticmethod ¶
should_ubatch_across_dp(
should_ubatch: bool,
orig_num_tokens_per_ubatch: int,
padded_num_tokens_per_ubatch: int,
dp_size: int,
dp_rank: int,
) -> tuple[bool, Optional[Tensor]]
- Decides if each DP rank is going to microbatch. Either all ranks run with microbatching or none of them do. If this function decides not to run with microbatching, it will "abort", meaning that no padding information will be returned to the caller; it will return (False, None).
- Determines the total number of tokens that each rank will run. All ranks will be padded out so that they run with the same number of tokens.
Returns a tuple[bool, Optional[Tensor]]:

| Name | Type | Description |
|---|---|---|
| should_ubatch | bool | Whether all DP ranks are going to microbatch. |
| num_tokens_after_padding | Optional[Tensor] | A tensor containing the total number of tokens per microbatch for each DP rank, including padding. Will be None if should_ubatch is False. |
Source code in vllm/forward_context.py
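To make the "all ranks or none" agreement concrete, here is a simplified sketch of how the decision and padding could be coordinated across DP ranks. It is an assumption-laden stand-in for the real logic, using a plain all-reduce of (vote, original count, padded count) rather than whatever collective vLLM actually uses.

```python
from typing import Optional

import torch
import torch.distributed as dist


def should_ubatch_across_dp_sketch(
    should_ubatch: bool,
    orig_num_tokens_per_ubatch: int,
    padded_num_tokens_per_ubatch: int,
    dp_size: int,
    dp_rank: int,
) -> tuple[bool, Optional[torch.Tensor]]:
    # Each rank publishes its local vote plus its (original, padded) counts.
    buf = torch.zeros(3, dp_size, dtype=torch.int32)
    buf[0, dp_rank] = int(should_ubatch)
    buf[1, dp_rank] = orig_num_tokens_per_ubatch
    buf[2, dp_rank] = padded_num_tokens_per_ubatch
    dist.all_reduce(buf)  # assumes a CPU-capable (e.g. gloo) group

    # Microbatching only happens if every rank voted for it.
    if not bool(buf[0].all()):
        return False, None  # abort: no padding info is returned

    # Pad every rank up to the largest padded per-ubatch token count so all
    # ranks run the same number of tokens.
    num_tokens_after_padding = torch.full(
        (dp_size,), int(buf[2].max()), dtype=torch.int32
    )
    return True, num_tokens_after_padding
```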
sp_local_sizes ¶
sp_local_sizes(sequence_parallel_size: int)
Context manager for setting self.local_sizes. Same as self.chunked_sizes, but without any chunking.
Source code in vllm/forward_context.py
ForwardContext dataclass ¶
Source code in vllm/forward_context.py
attn_metadata instance-attribute ¶
attn_metadata: Union[
AttentionMetadata,
dict[str, AttentionMetadata],
list[dict[str, AttentionMetadata]],
]
batch_descriptor class-attribute instance-attribute ¶
batch_descriptor: Optional[BatchDescriptor] = None
cudagraph_runtime_mode class-attribute instance-attribute ¶
cudagraph_runtime_mode: CUDAGraphMode = NONE
no_compile_layers instance-attribute ¶
Set dynamically for each forward pass:
- Type AttentionMetadata for v0.
- Type Dict[str, AttentionMetadata] for v1: maps the layer_name of each attention layer to its attention metadata.
- Type List[Dict[str, AttentionMetadata]] for DBO: a list of size two, one for each microbatch.
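A brief sketch of the v1-style per-layer mapping described above; the layer names and the metadata class here are placeholders, not real vLLM objects.

```python
from dataclasses import dataclass


@dataclass
class FakeAttentionMetadata:
    # Placeholder standing in for vLLM's real AttentionMetadata.
    num_prefill_tokens: int
    num_decode_tokens: int


# v1-style: one metadata object per attention layer, keyed by layer name.
attn_metadata = {
    "model.layers.0.self_attn": FakeAttentionMetadata(128, 0),
    "model.layers.1.self_attn": FakeAttentionMetadata(128, 0),
}

# DBO-style: one such dict per microbatch (a list of size two).
dbo_attn_metadata = [attn_metadata, attn_metadata]
```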
__init__ ¶
__init__(
no_compile_layers: dict[str, Any],
attn_metadata: Union[
AttentionMetadata,
dict[str, AttentionMetadata],
list[dict[str, AttentionMetadata]],
],
virtual_engine: int,
dp_metadata: Optional[DPMetadata] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
ubatch_slices: Optional[UBatchSlices] = None,
) -> None
_compute_chunked_local_num_tokens ¶
_compute_chunked_local_num_tokens(
num_tokens_across_dp_cpu: Tensor,
sequence_parallel_size: int,
max_num_tokens: int,
chunk_idx: int,
) -> list[int]
Source code in vllm/forward_context.py
_compute_sp_num_tokens ¶
_compute_sp_num_tokens(
num_tokens_across_dp_cpu: Tensor,
sequence_parallel_size: int,
) -> list[int]
Source code in vllm/forward_context.py
create_forward_context ¶
create_forward_context(
attn_metadata: Any,
vllm_config: VllmConfig,
virtual_engine: int = 0,
dp_metadata: Optional[DPMetadata] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
ubatch_slices: Optional[UBatchSlices] = None,
)
Source code in vllm/forward_context.py
get_forward_context ¶
get_forward_context() -> ForwardContext
Get the current forward context.
Source code in vllm/forward_context.py
override_forward_context ¶
override_forward_context(
forward_context: Optional[ForwardContext],
)
A context manager that overrides the current forward context. This is used to override the forward context for a specific forward pass.
Source code in vllm/forward_context.py
set_forward_context ¶
set_forward_context(
attn_metadata: Any,
vllm_config: VllmConfig,
virtual_engine: int = 0,
num_tokens: Optional[int] = None,
num_tokens_across_dp: Optional[Tensor] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
ubatch_slices: Optional[UBatchSlices] = None,
)
A context manager that stores the current forward context, which can include attention metadata, etc. Here we can inject common logic for every model forward pass.
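A hedged usage sketch showing how a model runner might wrap a forward pass with set_forward_context and later read the context back with get_forward_context inside a layer. The model, inputs, config, and metadata objects are placeholders the caller would already have; only the two vllm.forward_context functions are taken from this module.

```python
from vllm.forward_context import get_forward_context, set_forward_context


def run_one_step(model, inputs, attn_metadata, vllm_config):
    # Everything executed inside this block can look up the per-step
    # context (attention metadata, DP metadata, cudagraph mode, ...).
    with set_forward_context(attn_metadata, vllm_config, num_tokens=inputs.shape[0]):
        return model(inputs)


def inside_some_attention_layer():
    # Layers registered in no_compile_layers can fetch their metadata from
    # the ambient forward context instead of taking it as an argument.
    ctx = get_forward_context()
    my_metadata = ctx.attn_metadata  # may be a dict keyed by layer name
    return my_metadata
```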