Performance benchmarks of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Topics: gpu, cuda, inference, nvidia, cutlass, mha, multi-head-attention, llm, tensor-core, large-language-model, flash-attention, flash-attention-2
Updated Sep 7, 2024 - C++
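
The description above concerns timing an attention forward pass called from C++. As a hedged illustration only (not the repository's actual harness and not the FlashAttention C++ API), the sketch below shows a common way to measure per-call latency with CUDA events; the function `attention_forward`, its signature, and the tensor shapes are hypothetical placeholders for whatever kernel entry point is being benchmarked.

```cpp
// Minimal latency-measurement sketch using CUDA events (hypothetical harness).
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in for the attention forward entry point under test.
// In a real benchmark this would dispatch the FlashAttention / FlashAttention-2 kernel.
void attention_forward(const float* q, const float* k, const float* v, float* out,
                       int batch, int heads, int seq_len, int head_dim,
                       cudaStream_t stream) {
    (void)q; (void)k; (void)v; (void)out;
    (void)batch; (void)heads; (void)seq_len; (void)head_dim; (void)stream;
}

int main() {
    // Example shapes typical of LLM inference (assumed, not from the source).
    const int batch = 1, heads = 32, seq_len = 2048, head_dim = 128;
    const size_t elems = (size_t)batch * heads * seq_len * head_dim;

    float *q, *k, *v, *out;
    cudaMalloc(&q, elems * sizeof(float));
    cudaMalloc(&k, elems * sizeof(float));
    cudaMalloc(&v, elems * sizeof(float));
    cudaMalloc(&out, elems * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up iterations so clocks and caches settle before timing.
    for (int i = 0; i < 10; ++i)
        attention_forward(q, k, v, out, batch, heads, seq_len, head_dim, stream);
    cudaStreamSynchronize(stream);

    // Time a batch of iterations with CUDA events recorded on the same stream.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 100;
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i)
        attention_forward(q, k, v, out, batch, heads, seq_len, head_dim, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg latency: %.3f ms per forward pass\n", ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(q); cudaFree(k); cudaFree(v); cudaFree(out);
    return 0;
}
```

Recording both events on the stream that launches the kernels keeps the measurement confined to device time for the calls under test, which is the usual approach when comparing attention implementations at fixed batch, head, and sequence-length settings.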