vllm.sequence
Sequence and its related classes.
ExecuteModelRequest
IntermediateTensors dataclass
For all pipeline stages except the last, we need to return the hidden states and residuals to be sent to the next stage. This data structure contains the hidden states and residuals for a request.
Each stage also needs to handle its own kv_connector_output.
Source code in vllm/sequence.py
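The hand-off described above can be illustrated with a toy pipeline. This is only a sketch under stated assumptions, not the vLLM implementation: `StageTensors`, `run_stage`, and `run_pipeline` are hypothetical names, plain Python lists stand in for tensors, the dictionary keys `"hidden_states"` and `"residual"` are chosen for illustration, and the per-stage arithmetic is invented. The kv_connector_output handling is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageTensors:
    """Hypothetical stand-in for the data one pipeline stage sends to the next."""
    tensors: dict[str, list[float]]  # plain lists stand in for real tensors


def run_stage(stage_idx: int, num_stages: int,
              incoming: Optional[StageTensors],
              token_embeddings: list[float]):
    """Run one toy stage; return StageTensors unless this is the last stage."""
    if incoming is None:
        # First stage: start from the embeddings with a zero residual.
        hidden = list(token_embeddings)
        residual = [0.0] * len(hidden)
    else:
        # Later stages: resume from what the previous stage handed over.
        hidden = incoming.tensors["hidden_states"]
        residual = incoming.tensors["residual"]

    # Toy "layer": fold the hidden states into the residual, then scale.
    residual = [h + r for h, r in zip(hidden, residual)]
    hidden = [2.0 * r for r in residual]

    if stage_idx < num_stages - 1:
        # Not the last stage: both tensors must travel to the next stage.
        return StageTensors({"hidden_states": hidden, "residual": residual})
    # Last stage: return the final hidden states directly.
    return hidden


def run_pipeline(token_embeddings: list[float], num_stages: int):
    """Chain the stages, carrying StageTensors between them."""
    carried = None
    out = None
    for stage_idx in range(num_stages):
        out = run_stage(stage_idx, num_stages, carried, token_embeddings)
        if isinstance(out, StageTensors):
            carried = out
    return out
```

In this sketch only the last stage returns bare hidden states; every earlier stage wraps both the hidden states and the residual so that the next stage can continue where the previous one stopped.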
RequestMetrics dataclass
Metrics associated with a request.
Attributes:
Name | Type | Description |
---|---|---|
arrival_time | float | The time when the request arrived. |
first_scheduled_time | Optional[float] | The time when the request was first scheduled. |
first_token_time | Optional[float] | The time when the first token was generated. |
time_in_queue | Optional[float] | The time the request spent in the queue. |
finished_time | Optional[float] | The time when the request was finished. |
scheduler_time | Optional[float] | The time spent in the scheduler when this request was being considered by the scheduler. |
model_forward_time | Optional[float] | The time spent in the model forward pass when this request was in the batch. |
model_execute_time | Optional[float] | The time spent in the model execute function. This includes the model forward pass, blocking/synchronization across workers, CPU-GPU synchronization time, and sampling time. |
Source code in vllm/sequence.py
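As an illustration of how these attributes might be consumed, the sketch below derives two common latencies from the timestamps. It rests on assumptions: `elapsed` and `summarize` are hypothetical helpers, a `SimpleNamespace` stands in for a real `RequestMetrics` instance, and the timestamps are assumed to be in seconds. Because most fields are `Optional`, the helper returns `None` whenever a timestamp has not been set yet.

```python
from types import SimpleNamespace
from typing import Optional


def elapsed(start: Optional[float], end: Optional[float]) -> Optional[float]:
    """Return end - start, or None if either timestamp is not set yet."""
    if start is None or end is None:
        return None
    return end - start


def summarize(metrics) -> dict:
    """Derive latencies from an object exposing the attributes listed above."""
    return {
        "time_to_first_token": elapsed(metrics.arrival_time,
                                       metrics.first_token_time),
        "end_to_end_latency": elapsed(metrics.arrival_time,
                                      metrics.finished_time),
        "time_in_queue": metrics.time_in_queue,
    }


# Hypothetical values; SimpleNamespace stands in for a RequestMetrics instance.
example = SimpleNamespace(
    arrival_time=100.0,
    first_scheduled_time=100.5,
    first_token_time=101.25,
    time_in_queue=0.5,
    finished_time=None,  # the request is still running
)
summary = summarize(example)
```

Here `summary["end_to_end_latency"]` stays `None` until `finished_time` is filled in, which avoids subtracting from a missing timestamp.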
__init__
__init__(
    arrival_time: float,
    last_token_time: float,
    first_scheduled_time: Optional[float],
    first_token_time: Optional[float],
    time_in_queue: Optional[float],
    finished_time: Optional[float] = None,
    scheduler_time: Optional[float] = None,
    model_forward_time: Optional[float] = None,
    model_execute_time: Optional[float] = None,
) -> None
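The signature above has five required parameters, three of which accept `None`, followed by four that default to `None`. The sketch below mirrors that shape with a stand-in dataclass so it runs without vLLM installed. `RequestMetricsSketch` is a hypothetical name, the timestamp values are invented, and initializing `last_token_time` to the arrival time is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestMetricsSketch:
    """Stand-in with the same fields as the signature above (not the vLLM class)."""
    arrival_time: float
    last_token_time: float
    first_scheduled_time: Optional[float]
    first_token_time: Optional[float]
    time_in_queue: Optional[float]
    finished_time: Optional[float] = None
    scheduler_time: Optional[float] = None
    model_forward_time: Optional[float] = None
    model_execute_time: Optional[float] = None


# Hypothetical arrival timestamp; a fixed value keeps the example deterministic.
arrived_at = 1000.0
metrics = RequestMetricsSketch(
    arrival_time=arrived_at,
    last_token_time=arrived_at,   # assumption: no token generated yet
    first_scheduled_time=None,    # required argument, but may be None
    first_token_time=None,
    time_in_queue=None,
)

# Later, once the request is first scheduled (hypothetical update step):
scheduled_at = arrived_at + 0.25
metrics.first_scheduled_time = scheduled_at
metrics.time_in_queue = scheduled_at - metrics.arrival_time
```

Note that `first_scheduled_time`, `first_token_time`, and `time_in_queue` have no default, so they must be passed explicitly even when the value is `None`.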