Replies: 1 comment
-
Hi @ket395, vLLM is a serving engine, not just an inference script. Its primary purpose is to run Large Language Models (LLMs) with high throughput: it is designed to serve many users simultaneously rather than to minimize latency for a single user.

The core design philosophy revolves around "PagedAttention." Traditional attention implementations allocate contiguous memory for the Key-Value (KV) cache, which leads to heavy fragmentation and can waste 60-80% of the KV-cache memory. vLLM instead manages KV-cache memory much like an operating system manages virtual memory, breaking it into non-contiguous "pages." Because of this efficiency, vLLM can batch many more requests into the same amount of VRAM, typically delivering several times higher throughput than serving with plain Hugging Face Transformers. It also uses continuous batching: rather than waiting for a whole batch to finish, it evicts completed requests and admits new ones at the iteration (token) level.

Regarding hardware, the RAM and VRAM requirements depend on the model you want to run rather than on the engine itself. A good rule of thumb is to double the model's parameter count (in billions) to estimate the weight footprint in gigabytes at FP16 precision; for example, Llama-3-8B needs roughly 16 GB of VRAM for the weights, plus headroom for the KV cache. vLLM's efficiency lets you fit significantly more context, such as longer chat history, into the remaining space than most other engines can. CPU execution is possible, but it requires AVX-512 support (available on many modern Intel and AMD CPUs) and is significantly slower than GPU execution. NVIDIA GPUs are the primary supported hardware, with Compute Capability 7.0 or higher recommended, and AMD ROCm support is active and improving. The engine does ship GPU-architecture-specific code: custom kernels, written in CUDA C++ or Triton, optimize operations such as attention and activation functions for particular GPU generations.

As for operating systems, vLLM works well on Arch Linux, since it is a Python library with compiled C++/CUDA extensions that can be installed via pip or the AUR. It does not support FreeBSD or Haiku. The primary blocker on those platforms is the lack of NVIDIA CUDA (or AMD ROCm) driver support, which is virtually non-existent there; without those drivers, vLLM cannot access the GPU hardware its core functionality requires.
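For concreteness, here is a minimal sketch of vLLM's offline Python API under the assumptions above; the model id, sampling values, and memory settings are illustrative choices, not recommendations:

```python
# Minimal sketch of vLLM's offline batch API (illustrative values throughout).
from vllm import LLM, SamplingParams

# Rule-of-thumb VRAM estimate for FP16 weights: ~2 bytes per parameter,
# so an 8B-parameter model needs roughly 8 * 2 = 16 GB before the KV cache.
params_billion = 8
print(f"Approx. FP16 weight footprint: {params_billion * 2} GB")

# The engine reserves most of the remaining VRAM for the paged KV cache.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id
    dtype="float16",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
    max_model_len=8192,           # context length to budget KV pages for
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# Many prompts are submitted at once; internally the scheduler batches them
# continuously at the token level instead of waiting on the slowest request.
prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does KV-cache fragmentation waste memory?",
]
for out in llm.generate(prompts, sampling):
    print(out.prompt, "->", out.outputs[0].text.strip())
```

For actual serving, the same engine is usually exposed through the OpenAI-compatible HTTP server (e.g. `python -m vllm.entrypoints.openai.api_server --model <model>`), which is where continuous batching across many concurrent clients pays off.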
-
I'd tell a joke since maintainers don't read this.
And smart people, lazy people, "smart and lazy" / "smart but lazy" people, and security-focused people don't want to use yet another website for one single question, poll, or issue. Too many costs, too many risks, too many headaches, not enough value.
Unless you can build a positive reputation for providing great value to the community.
Also, I'm talking about all kinds of forums. Here we go -
They are private because I only have one personal email; it's secure and has only useful subscriptions - never a single spam, promo, newsletter, podcast, free gift, or any other marketing BS. Very anti-privacy, anti-security, and inconvenient - if I wanted to talk to the maintainers only.
TL;DR or AI-gen summary - read the joke first. Then you'll understand the other side of the story.
Also, why would I use vLLM?
The above was an old post from asd
What is this vLLM thing actually useful for in the real world? What are the benefits over the alternatives? What are the performance profiles? Can you give the RAM type (which spec, and ECC or non-ECC?), RAM size, CPU core count, GPU VRAM, and GPU CUDA core count? Can it handle GPU-specific, assembly-level code generation correctly for multiple NVIDIA GPU architectures? What is the design philosophy?
Why are the devs on Slack? Most people only use Slack for 9-to-5 jobs, so there's a clear essential-but-soul-draining vibe and sentiment to it. Is there a traditional mailing list for this, with a digest feed (RSS/Atom, and OPML if that last one is practical and has a safe implementation across OS platforms)? Does it build in one step and run smoothly on an Arch-like distro (I mean with the AUR), the FreeBSD family, or Haiku?