Replies: 1 comment
-
Hi @ket395, vLLM is a serving engine, not just an inference script. Its primary purpose is to run Large Language Models (LLMs) with high throughput: it is designed to serve many users simultaneously rather than to minimize latency for a single user.

The core design philosophy revolves around "PagedAttention." Traditional attention implementations allocate contiguous memory for the Key-Value (KV) cache, which leads to heavy fragmentation and can waste 60-80% of the KV-cache memory. vLLM instead manages KV-cache memory much like an operating system manages virtual memory, breaking it into non-contiguous "pages." Because of this efficiency, vLLM can batch many more requests into the same amount of VRAM, typically delivering several times higher throughput than serving with plain Hugging Face Transformers. It also uses continuous batching: rather than waiting for a whole batch to finish, it evicts completed requests and admits new ones at the iteration (token) level.

Regarding hardware, the RAM and VRAM requirements depend on the model you want to run rather than on the engine itself. A good rule of thumb is to double the model's parameter count (in billions) to estimate the weight footprint in gigabytes at FP16 precision; for example, Llama-3-8B needs roughly 16 GB of VRAM for the weights, plus headroom for the KV cache. vLLM's efficiency lets you fit significantly more context, such as longer chat history, into the remaining space than most other engines can. CPU execution is possible, but it requires AVX-512 support (available on many modern Intel and AMD CPUs) and is significantly slower than GPU execution. NVIDIA GPUs are the primary supported hardware, with Compute Capability 7.0 or higher recommended, and AMD ROCm support is active and improving. The engine does ship GPU-architecture-specific code: custom kernels, written in CUDA C++ or Triton, optimize operations such as attention and activation functions for particular GPU generations.

As for operating systems, vLLM works well on Arch Linux, since it is a Python library with compiled C++/CUDA extensions that can be installed via pip or the AUR. It does not support FreeBSD or Haiku. The primary blocker on those platforms is the lack of NVIDIA CUDA (or AMD ROCm) driver support, which is virtually non-existent there; without those drivers, vLLM cannot access the GPU hardware its core functionality requires.
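For concreteness, here is a minimal sketch of vLLM's offline Python API under the assumptions above; the model id, sampling values, and memory settings are illustrative choices, not recommendations:

```python
# Minimal sketch of vLLM's offline batch API (illustrative values throughout).
from vllm import LLM, SamplingParams

# Rule-of-thumb VRAM estimate for FP16 weights: ~2 bytes per parameter,
# so an 8B-parameter model needs roughly 8 * 2 = 16 GB before the KV cache.
params_billion = 8
print(f"Approx. FP16 weight footprint: {params_billion * 2} GB")

# The engine reserves most of the remaining VRAM for the paged KV cache.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id
    dtype="float16",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
    max_model_len=8192,           # context length to budget KV pages for
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Many prompts are submitted at once; internally the scheduler batches them
# continuously at the token level instead of waiting on the slowest request.
prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does KV-cache fragmentation waste memory?",
]
for out in llm.generate(prompts, sampling):
    print(out.prompt, "->", out.outputs[0].text.strip())
```

For actual serving, the same engine is usually exposed through the OpenAI-compatible HTTP server (e.g. `python -m vllm.entrypoints.openai.api_server --model <model>`), which is where continuous batching across many concurrent clients pays off.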
-
I'd tell a joke since maintainers don't read this.
And smart people, lazy people, "smart and lazy" / "smart but lazy" people, and security-focused people don't want to use yet another website for one single question, poll, or issue. Too many costs, too many risks, too many headaches, not enough value.
Unless you can build a positive reputation for providing great value to the community.
Also, I'm talking about all kinds of forums. Here we go -
They are private because I only have one personal email; it's secure and has only useful subscriptions - never a single spam, promo, newsletter, podcast, free gift, or any other marketing BS. Very anti-privacy, anti-security, and inconvenient - if I wanted to talk to the maintainers only.
TL;DR or AI-gen summary - read the joke first. Then you'll understand the other side of the story.
Also, why would I use vLLM?
The above was an old post from asd
What is this vLLM thing actually useful for in the real world? What are the benefits over the alternatives? What are the performance profiles? Can you give the RAM type (which spec, and ECC or non-ECC?), RAM size, CPU core count, GPU VRAM, and GPU CUDA core count? Can it handle GPU-specific, assembly-level code generation correctly for multiple NVIDIA GPU architectures? What is the design philosophy?
Why are the devs on Slack? Most people only use Slack for 9-to-5 jobs, so there's a clear essential-but-soul-draining vibe and sentiment to it. Is there a traditional mailing list for this, with a digest feed (RSS/Atom, and OPML if that last one is practical and has a safe implementation across OS platforms)? Does it build in one step and run smoothly on an Arch-like distro (I mean with the AUR), the FreeBSD family, or Haiku?