Ditch the Ollama bottleneck. Learn how to set up vLLM with Open WebUI for 24x faster local AI inference. Includes Docker networking fixes and optimization tips.
If you are running local LLMs in 2026, you likely started with Ollama. It's the “Apple” of local AI: sleek, simple, and it just works. But eventually, you hit a wall. Maybe you tried to serve a model to three friends at once, or perhaps you noticed your 70B model chugging along at a painful 8 tokens per second.
This is the “Ollama Bottleneck.” It's built for ease of use, not raw speed or concurrency.
If you are ready to graduate from “hobbyist” to “server-grade” performance, you need vLLM. It is the engine that powers the world’s fastest API providers, and yes, you can run it at home.
In this guide, we will build the ultimate local AI stack: vLLM (the engine) + Open WebUI (the interface).
Why vLLM? The “Tetris” Effect
Why switch? In a word: Throughput. Benchmarks from late 2025 show vLLM achieving 24x higher throughput than standard transformers and consistently beating Ollama in concurrent request handling (793 TPS vs 41 TPS under load).
The Secret Sauce: PagedAttention
Ollama (and llama.cpp) often struggle with memory fragmentation. Imagine a library where every book must have 5 empty shelves reserved “just in case” the author writes a sequel. That is wasted VRAM.
vLLM uses PagedAttention. Think of it like Tetris or your OS's virtual memory. It breaks the model's memory (KV Cache) into tiny, non-contiguous blocks that fill every available gap in your GPU's VRAM.
- Result: You can fit larger batches of context into the same 24GB card.
- Benefit: Multiple users can chat with the model simultaneously without it slowing to a crawl.
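The difference can be sketched in a few lines of Python. This is a toy model of the allocation strategy, not vLLM's actual implementation, and the numbers are made up for illustration: a naive allocator reserves the full maximum context for every chat up front, while a paged allocator only hands out small blocks as tokens actually arrive.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

def naive_reserved(max_len: int) -> int:
    """Naive allocator: reserve the full max context per sequence up front."""
    return max_len

def paged_reserved(actual_len: int) -> int:
    """Paged allocator: allocate whole blocks only as tokens arrive."""
    blocks = -(-actual_len // BLOCK_SIZE)  # ceiling division
    return blocks * BLOCK_SIZE

# Three chats that actually used 100, 700, and 40 tokens,
# on a server configured for an 8192-token max context:
chats = [100, 700, 40]
naive = sum(naive_reserved(8192) for _ in chats)
paged = sum(paged_reserved(n) for n in chats)
print(naive, paged)  # the paged total is a small fraction of the naive one
```

The unused capacity in the naive scheme is exactly the VRAM that PagedAttention reclaims for more concurrent chats.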
Prerequisites
Before we touch the terminal, ensure you have the hardware. vLLM is optimized for NVIDIA GPUs (CUDA).
- OS: Linux (Ubuntu 22.04+) or Windows WSL2 (highly recommended over native Windows).
- Hardware: NVIDIA GPU with 8GB+ VRAM (RTX 3060 or better).
- Software: Python 3.10+, pip, and Docker (for Open WebUI).
Step 1: Install vLLM
Unlike Ollama's one-click installer, vLLM is a Python library. We will install it in a dedicated environment to keep things clean.
# Create a virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM (This pulls the latest CUDA kernels)
pip install vllm
Pro Tip: If you are on a multi-GPU setup (e.g., dual RTX 3090s), vLLM can shard the model across both cards via Tensor Parallelism. Just add --tensor-parallel-size 2 to the serve command in the next step. No complex config files needed.
Step 2: Launch the API Server
This is where the magic happens. We aren’t just running a model; we are launching an OpenAI-Compatible API Server. This means Open WebUI (or any app designed for GPT-4) will think your local machine is actually OpenAI’s servers.
Run this command to serve a model (e.g., Llama-3-8B):
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.95 \
--max-model-len 8192
Breakdown of Flags:
- --host 0.0.0.0: Exposes the server to your local network (crucial for Docker connections).
- --gpu-memory-utilization 0.95: Tells vLLM to use 95% of your VRAM. If you get Out-Of-Memory errors, lower this to 0.85.
- --max-model-len 8192: Caps the context window at 8,192 tokens to save memory.
The first launch will take a few minutes as it downloads the model weights from Hugging Face. Note that Meta's Llama models are gated: you may need to accept the license on Hugging Face and authenticate with huggingface-cli login (or the HF_TOKEN environment variable) first.
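Before wiring up a UI, you can sanity-check the server from a second terminal. This assumes the serve command above is still running on port 8000 with the same model name:

```shell
# List the models the server exposes
curl http://localhost:8000/v1/models

# Send a test chat completion (the model name must match the one you served)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```

If the second command streams back a JSON response with a "choices" array, the OpenAI-compatible endpoint is working and any client can connect to it.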
Step 3: Connect Open WebUI
Now that vLLM is humming along on port 8000, let’s give it a beautiful face. We will use Open WebUI (formerly Ollama WebUI).
1. Run Open WebUI via Docker
If you don’t have it running yet:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
2. The “Hidden” Networking Fix
This is the step that trips up most people. Because Open WebUI runs inside a Docker container, localhost inside the container refers to the container itself, not your host machine. You cannot use http://localhost:8000.
The Correct Configuration:
1. Open Open WebUI in your browser (http://localhost:3000).
2. Click your Profile Icon > Settings > Connections.
3. Turn OFF Ollama (optional, saves resources).
4. Under OpenAI API, enter the following:
   - Base URL: http://host.docker.internal:8000/v1
   - API Key: EMPTY (vLLM accepts any string here; literally typing “EMPTY” works).
5. Click the Refresh/Verify button.
If successful, you will see a green verification tick. Go to the “New Chat” dropdown, and you should see meta-llama/Meta-Llama-3-8B-Instruct available for selection.
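Open WebUI is not the only client that can talk to this endpoint: any OpenAI-compatible tool or script works. Here is a minimal sketch using only Python's standard library; the URL and model name assume the server from Step 2 is running locally, so swap in your own values:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's OpenAI-compatible endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",  # any key works unless --api-key was set
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("meta-llama/Meta-Llama-3-8B-Instruct", "Say hello in five words."))
```

Because the payload shape is the standard OpenAI one, the same script works against any OpenAI-compatible backend by changing BASE_URL.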
Optimization & Troubleshooting
“CUDA Out of Memory”
vLLM is greedy by design: it pre-allocates almost all available VRAM for the model weights and the KV cache.
- Fix: Lower the utilization flag (e.g., --gpu-memory-utilization 0.7), or add --enforce-eager to the launch command to disable CUDA graph capture and reclaim some VRAM (at a small latency cost).
“The Context Window is too small”
If you see errors about “position ids,” your model might be trying to handle more context than your GPU can fit.
- Fix: Explicitly cap the context with --max-model-len 4096. This trades maximum context length for stability.
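Putting these fixes together, a conservative launch for a VRAM-constrained card might look like the following (the model name is just the example from Step 2; substitute your own):

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096 \
  --enforce-eager
```

Once this runs cleanly, you can raise --gpu-memory-utilization and --max-model-len step by step until you find your card's limit.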
Conclusion: When to use which?
- Stick with Ollama if: You are on a MacBook (Apple Silicon support in vLLM is still maturing) or you just want to run a quick test on a laptop.
- Switch to vLLM if: You have a dedicated NVIDIA GPU rig, you want to serve models to multiple users in your house/office, or you are building a RAG (Retrieval Augmented Generation) pipeline where speed is critical.
vLLM essentially turns your gaming PC into an enterprise inference server. Pair it with Open WebUI, and you have a setup that rivals ChatGPT Plus, privacy included.