vllm.entrypoints.utils
VLLM_SUBCMD_PARSER_EPILOG module-attribute
VLLM_SUBCMD_PARSER_EPILOG = "For full list: vllm {subcmd} --help=all\nFor a section: vllm {subcmd} --help=ModelConfig (case-insensitive)\nFor a flag: vllm {subcmd} --help=max-model-len (_ or - accepted)\nDocumentation: https://docs.vllm.ai\n"
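The epilog contains a `{subcmd}` placeholder, so it is presumably filled in per subcommand before being shown. A minimal sketch of that substitution, assuming plain `str.format` is used (the value is copied from the attribute above, and `"serve"` is only an example subcommand name):

```python
# Value copied from the module attribute shown above.
VLLM_SUBCMD_PARSER_EPILOG = (
    "For full list: vllm {subcmd} --help=all\n"
    "For a section: vllm {subcmd} --help=ModelConfig (case-insensitive)\n"
    "For a flag: vllm {subcmd} --help=max-model-len (_ or - accepted)\n"
    "Documentation: https://docs.vllm.ai\n"
)

# Assumption: the placeholder is filled with str.format; "serve" is illustrative.
epilog = VLLM_SUBCMD_PARSER_EPILOG.format(subcmd="serve")
print(epilog)
```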
_validate_truncation_size
_validate_truncation_size(
max_model_len: int,
truncate_prompt_tokens: Optional[int],
tokenization_kwargs: Optional[dict[str, Any]] = None,
) -> Optional[int]
Source code in vllm/entrypoints/utils.py
cli_env_setup
decrement_server_load
get_max_tokens
get_max_tokens(
max_model_len: int,
request: Union[
ChatCompletionRequest, CompletionRequest
],
input_length: int,
default_sampling_params: dict,
) -> int
listen_for_disconnect async
Returns when a disconnect message is received.
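To illustrate the idea, here is a minimal sketch of waiting for an ASGI `http.disconnect` message. It is not the vLLM implementation: `listen_for_disconnect_sketch` and `make_fake_receive` are hypothetical names, and the fake `receive` callable stands in for the one a real ASGI server would provide.

```python
import asyncio


async def listen_for_disconnect_sketch(receive) -> None:
    """Return once an ASGI 'http.disconnect' message arrives (illustrative only)."""
    while True:
        message = await receive()
        if message["type"] == "http.disconnect":
            return
        # Any other message is consumed and discarded.


def make_fake_receive(messages):
    """Hypothetical stand-in for an ASGI receive callable, replaying fixed messages."""
    queue = list(messages)

    async def receive():
        await asyncio.sleep(0)  # yield to the event loop, as a real receive would
        return queue.pop(0)

    return receive


async def demo() -> str:
    receive = make_fake_receive(
        [
            {"type": "http.request", "body": b"", "more_body": False},
            {"type": "http.disconnect"},
        ]
    )
    await listen_for_disconnect_sketch(receive)
    return "disconnected"


result = asyncio.run(demo())
```

Note that the loop discards every message that is not a disconnect, which is why this pattern is only safe once the request body has already been read.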
load_aware_call
log_non_default_args
log_non_default_args(args: Union[Namespace, EngineArgs])
with_cancellation
Decorator that allows a route handler to be cancelled by client disconnections.
This does not use request.is_disconnected, which does not work with middleware. Instead, it follows the pattern from starlette.StreamingResponse, which awaits two tasks simultaneously: one that waits for an HTTP disconnect message, and another that does the work we want done. When either task finishes, the other is cancelled.
A core assumption of this method is that the body of the request has already been read. This is a safe assumption for FastAPI handlers, which have already parsed the body of the request into a Pydantic model for us. This decorator is unsafe to use elsewhere, as it will consume and throw away all incoming messages for the request while it looks for a disconnect message.
In the case where a StreamingResponse is returned by the handler, this wrapper will stop listening for disconnects and instead the response object will start listening for disconnects.
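The two-task pattern described above can be sketched with plain asyncio. This is an illustration under stated assumptions, not the vLLM source: `with_cancellation_sketch`, `wait_for_disconnect`, and the fake `receive` callables are hypothetical, and the wrapper takes `receive` directly instead of a FastAPI request object.

```python
import asyncio
import functools


async def wait_for_disconnect(receive) -> None:
    """Consume messages until an 'http.disconnect' message is seen."""
    while True:
        message = await receive()
        if message["type"] == "http.disconnect":
            return


def with_cancellation_sketch(handler):
    """Hypothetical decorator: race the handler against a disconnect listener."""

    @functools.wraps(handler)
    async def wrapper(receive, *args, **kwargs):
        handler_task = asyncio.create_task(handler(*args, **kwargs))
        disconnect_task = asyncio.create_task(wait_for_disconnect(receive))
        done, pending = await asyncio.wait(
            {handler_task, disconnect_task},
            return_when=asyncio.FIRST_COMPLETED,
        )
        # Whichever task is still running gets cancelled.
        for task in pending:
            task.cancel()
        if handler_task in done:
            return handler_task.result()
        return None  # the client disconnected before the handler finished

    return wrapper


@with_cancellation_sketch
async def slow_handler():
    await asyncio.sleep(10)
    return "finished"


@with_cancellation_sketch
async def fast_handler():
    return "finished"


async def disconnect_soon():
    """Fake receive: the client disconnects almost immediately."""
    await asyncio.sleep(0.01)
    return {"type": "http.disconnect"}


async def never_disconnect():
    """Fake receive: the client stays connected for a long time."""
    await asyncio.sleep(3600)
    return {"type": "http.disconnect"}


async def demo():
    cancelled = await slow_handler(disconnect_soon)
    completed = await fast_handler(never_disconnect)
    return cancelled, completed


results = asyncio.run(demo())
```

In this sketch a disconnect yields `None`, while a handler that finishes first returns its result and the listener is cancelled. The StreamingResponse hand-off described above is not modelled here.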