vllm.config.kv_transfer ¶
KVTransferConfig ¶
Configuration for distributed KV cache transfer.
Source code in vllm/config/kv_transfer.py
engine_id class-attribute instance-attribute ¶
The engine id for KV transfers.
kv_buffer_device class-attribute instance-attribute ¶
The device used by the KV connector to buffer the KV cache. Choices are 'cuda' and 'cpu'.
kv_buffer_size class-attribute instance-attribute ¶
kv_buffer_size: float = 1000000000.0
The buffer size for TorchDistributedConnector, measured in bytes. Recommended value: 1e9 (about 1 GB).
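To relate the recommended 1e9-byte buffer to token capacity, here is a rough sizing sketch. The model dimensions below are hypothetical examples, not values from this config:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Each token stores one key and one value vector (factor of 2)
    # per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 8, 128, 2)   # 131072 bytes per token
tokens_in_buffer = int(1e9) // per_token        # roughly 7600 tokens
```

A larger buffer lets more in-flight KV data queue between instances at the cost of device or host memory, which is why the choice interacts with kv_buffer_device above.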
kv_connector class-attribute instance-attribute ¶
The KV connector for vLLM to transmit KV caches between vLLM instances.
kv_connector_extra_config class-attribute instance-attribute ¶
Any extra config that the connector may need.
kv_connector_module_path class-attribute instance-attribute ¶
The Python module path to dynamically load the KV connector from. Only supported in V1.
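The dynamic-loading mechanics behind kv_connector_module_path can be sketched roughly as follows. The helper name and the connector class below are illustrative, not vLLM's actual loader:

```python
import importlib


def load_connector_class(module_path: str, class_name: str) -> type:
    """Import class_name from the dotted module_path at runtime."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Illustrative usage with a hypothetical third-party connector:
# cls = load_connector_class("my_pkg.connectors", "MyConnector")
```

This is what makes out-of-tree connectors possible: the module only needs to be importable on the Python path of the running instance.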
kv_ip class-attribute instance-attribute ¶
kv_ip: str = '127.0.0.1'
The KV connector IP address, used to build the distributed connection.
kv_parallel_size class-attribute instance-attribute ¶
kv_parallel_size: int = 1
The number of parallel instances for KV cache transfer. For P2pNcclConnector, this should be 2.
kv_port class-attribute instance-attribute ¶
kv_port: int = 14579
The KV connector port, used to build the distributed connection.
kv_rank class-attribute instance-attribute ¶
The rank of this vLLM instance in the KV cache transfer. Typical values: 0 for the prefill instance, 1 for the decode instance. Currently only 1P1D is supported.
kv_role class-attribute instance-attribute ¶
Whether this vLLM instance produces KV cache, consumes it, or both. Choices are 'kv_producer', 'kv_consumer', and 'kv_both'.
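Putting the fields above together, a 1P1D pairing is typically expressed as two JSON configs, one per instance. The sketch below builds such a pair; the connector name is illustrative, so check the connectors supported by your vLLM version:

```python
import json

# Prefill instance: produces KV cache, rank 0.
producer = {
    "kv_connector": "PyNcclConnector",  # illustrative connector name
    "kv_role": "kv_producer",
    "kv_rank": 0,
    "kv_parallel_size": 2,
    "kv_ip": "127.0.0.1",
    "kv_port": 14579,
}

# Decode instance: consumes KV cache, rank 1; all other fields match.
consumer = {**producer, "kv_role": "kv_consumer", "kv_rank": 1}

# Each JSON string would be passed to its instance, e.g. via
# --kv-transfer-config '<json>' on the command line.
producer_json = json.dumps(producer)
consumer_json = json.dumps(consumer)
```

Both instances must agree on kv_ip, kv_port, and kv_parallel_size; only kv_role and kv_rank differ between the pair.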
__post_init__ ¶
Source code in vllm/config/kv_transfer.py
compute_hash ¶
compute_hash() -> str
WARNING: Whenever a new field is added to this config, ensure that it is included in the factors list if it affects the computation graph.
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
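A minimal sketch of the factor-hashing idea described above. The exact factor list and hash function vLLM uses may differ; this only illustrates the contract that identical graph-affecting fields yield identical hashes:

```python
import hashlib


def compute_hash(factors: list) -> str:
    # Serialize the graph-affecting fields and hash them, so two configs
    # share a hash only when their computation graphs are identical.
    return hashlib.sha256(repr(factors).encode()).hexdigest()


# Same factors -> same hash; any graph-affecting change -> different hash.
```

This is why the warning matters: a new field left out of the factors list would let two configs with different computation graphs collide on the same hash.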