vllm.model_executor.layers.fla.ops.utils ¶
SUPPRESS_LEVEL module-attribute ¶
device module-attribute ¶

device = (
    get_available_device()
    if get_available_device() != "hip"
    else "cuda"
)
is_intel_alchemist module-attribute ¶

is_intel_alchemist = (
    is_intel and "Intel(R) Arc(TM) A" in get_device_name(0)
)
is_nvidia_hopper module-attribute ¶

is_nvidia_hopper = is_nvidia and (
    "NVIDIA H" in get_device_name(0)
    or get_device_capability()[0] >= 9
)
use_cuda_graph module-attribute ¶

use_cuda_graph = (
    is_nvidia and get("FLA_USE_CUDA_GRAPH", "0") == "1"
)
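A hypothetical illustration of how these module-level flags might be consulted when configuring a kernel launch; only the imported names come from this module, the tuning decisions shown are assumptions.

```python
# Hypothetical call site: only the imported names are documented above;
# the tuning choices here are illustrative assumptions, not vLLM code.
from vllm.model_executor.layers.fla.ops.utils import (
    device,
    is_nvidia_hopper,
    use_cuda_graph,
)

num_warps = 8 if is_nvidia_hopper else 4  # assume wider warps on Hopper-class GPUs
print(f"device={device}, num_warps={num_warps}, cuda_graph={use_cuda_graph}")
```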
_check_platform cached ¶

_check_platform() -> Literal[
    "nvidia", "amd", "intel", "musa"
]
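The body of `_check_platform` is collapsed in the rendered page; below is a minimal sketch of what a cached platform probe with this signature could look like, assuming PyTorch as the backend (the `torch.musa` attribute would come from the torch_musa plugin). The real check in vllm/model_executor/layers/fla/ops/utils.py may use different probes.

```python
from functools import lru_cache
from typing import Literal

import torch


@lru_cache(maxsize=None)
def _check_platform() -> Literal["nvidia", "amd", "intel", "musa"]:
    # Sketch only: the actual implementation may differ.
    if torch.version.hip is not None:  # ROCm build -> AMD GPU
        return "amd"
    if hasattr(torch, "musa") and torch.musa.is_available():  # torch_musa plugin
        return "musa"
    if hasattr(torch, "xpu") and torch.xpu.is_available():  # Intel XPU backend
        return "intel"
    return "nvidia"
```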
check_shared_mem cached ¶
get_all_max_shared_mem ¶
input_guard ¶
A decorator that ensures all input tensors are contiguous and sets the device based on the input tensors.
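A minimal sketch of what such a guard could look like, assuming tensor arguments are made contiguous and the device of the first tensor argument is made current; this is illustrative, not the vLLM implementation.

```python
import functools

import torch


def input_guard(fn):
    """Sketch only: make tensor args contiguous and set the device."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Make every tensor argument contiguous before calling the wrapped function.
        args = tuple(a.contiguous() if isinstance(a, torch.Tensor) else a for a in args)
        kwargs = {
            k: v.contiguous() if isinstance(v, torch.Tensor) else v
            for k, v in kwargs.items()
        }

        # Run under the device of the first tensor argument, if any.
        first = next(
            (t for t in (*args, *kwargs.values()) if isinstance(t, torch.Tensor)),
            None,
        )
        if first is not None and first.is_cuda:
            with torch.cuda.device(first.device):
                return fn(*args, **kwargs)
        return fn(*args, **kwargs)

    return wrapper
```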
tensor_cache ¶
A decorator that caches the most recent results of a function with tensor inputs.
This decorator stores the output of the decorated function for the most recent sets of input tensors. The cache is limited to a fixed size (4 by default); when it is full, the oldest entry is removed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Callable[..., Tensor]` | The function to be decorated. It should take tensor inputs and return tensor outputs. | required |

Returns:

| Type | Description |
|---|---|
| `Callable[..., torch.Tensor]` | A wrapped version of the input function with single-entry caching. |
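A minimal sketch of the caching behavior described above, assuming cache hits are detected by tensor identity; the helper below is illustrative, not the vLLM implementation.

```python
import functools
from collections import deque
from typing import Callable

import torch


def tensor_cache(fn: Callable[..., torch.Tensor]) -> Callable[..., torch.Tensor]:
    """Sketch only: keep outputs for the 4 most recent input sets."""
    cache = deque(maxlen=4)  # when full, the oldest entry is dropped automatically

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for c_args, c_kwargs, c_out in cache:
            # Hit only when the exact same tensor objects are passed again
            # (identity comparison keeps the lookup cheap).
            if (
                len(c_args) == len(args)
                and len(c_kwargs) == len(kwargs)
                and all(a is b for a, b in zip(c_args, args))
                and all(k in c_kwargs and c_kwargs[k] is v for k, v in kwargs.items())
            ):
                return c_out
        out = fn(*args, **kwargs)
        cache.append((args, kwargs, out))
        return out

    return wrapper
```

Under this scheme, repeated calls that pass the same tensor objects (for example, re-deriving metadata from the same sequence-length tensor) return the cached result without re-running the function.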