The Ultimate Local AI Stack: How to Run vLLM with Open WebUI


Ditch the Ollama bottleneck. Learn how to set up vLLM with Open WebUI for 24x faster local AI inference. Includes Docker networking fixes and optimization tips.

If you are running local LLMs in 2026, you likely started with Ollama. It's the “Apple” of local AI: sleek, simple, and it just works. But eventually, you hit a wall. Maybe you tried to serve a model to three friends at once, or perhaps you noticed your 70B model chugging along at a painful 8 tokens per second.

This is the “Ollama Bottleneck.” It's built for ease of use, not raw speed or concurrency.

If you are ready to graduate from “hobbyist” to “server-grade” performance, you need vLLM. It is the engine that powers the world's fastest API providers, and yes, you can run it at home.

In this guide, we will build the ultimate local AI stack: vLLM (the engine) + Open WebUI (the interface).



Why vLLM? The “Tetris” Effect

Why switch? In a word: throughput. Benchmarks from late 2025 show vLLM achieving up to 24x higher throughput than the stock Hugging Face Transformers library, and it consistently beats Ollama at concurrent request handling (793 TPS vs. 41 TPS under load).

The Secret Sauce: PagedAttention

Ollama (and llama.cpp) often struggle with memory fragmentation. Imagine a library where every book must have 5 empty shelves reserved “just in case” the author writes a sequel. That is wasted VRAM.

vLLM uses PagedAttention. Think of it like Tetris or your OS's virtual memory. It breaks the model's memory (KV Cache) into tiny, non-contiguous blocks that fill every available gap in your GPU's VRAM.

  • Result: You can fit larger batches of context into the same 24GB card.
  • Benefit: Multiple users can chat with the model simultaneously without it slowing to a crawl.
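To see why packing the KV cache tightly matters, here is a rough back-of-the-envelope calculation. The numbers below are assumptions for Llama-3-8B (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 weights); adjust them for your model:

```shell
# Rough KV-cache cost per token:
# layers x kv_heads x head_dim x 2 (K and V) x 2 bytes (fp16)
echo $((32 * 8 * 128 * 2 * 2))   # bytes per token
```

At roughly 128 KiB per token, a single 8192-token sequence costs about 1 GiB of KV cache, so fragmented VRAM quickly becomes the limit on how many sequences you can batch.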

Prerequisites

Before we touch the terminal, ensure you have the hardware. vLLM is optimized for NVIDIA GPUs (CUDA).

  • OS: Linux (Ubuntu 22.04+) or Windows WSL2 (highly recommended over native Windows).
  • Hardware: NVIDIA GPU with 8GB+ VRAM (RTX 3060 or better).
  • Software: Python 3.10+, pip, and Docker (for Open WebUI).

Step 1: Install vLLM

Unlike Ollamaโ€™s one-click installer, vLLM is a Python library. We will install it in a dedicated environment to keep things clean.

# Create a virtual environment
python3 -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (This pulls the latest CUDA kernels)
pip install vllm

Pro Tip: If you are on a multi-GPU setup (e.g., dual RTX 3090s), vLLM automatically detects and utilizes both cards via Tensor Parallelism. You don’t need complex config files.
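If you ever want to pin the parallelism explicitly rather than rely on auto-detection, vLLM exposes a flag for it. A sketch for a dual-GPU box (`--tensor-parallel-size` should match your GPU count):

```shell
# Split the model's layers across two GPUs via tensor parallelism
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 2
```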


Step 2: Launch the API Server

This is where the magic happens. We aren’t just running a model; we are launching an OpenAI-Compatible API Server. This means Open WebUI (or any app designed for GPT-4) will think your local machine is actually OpenAI’s servers.

Run this command to serve a model (e.g., Llama-3-8B):

vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 8192

Breakdown of Flags:

  • --host 0.0.0.0: Exposes the server to your local network (crucial for Docker connections).
  • --gpu-memory-utilization 0.95: Tells vLLM to use 95% of your VRAM. If you get Out-Of-Memory errors, lower this to 0.85.
  • --max-model-len: Limits the context window to save memory.

The first launch will take a few minutes as it downloads the model weights from Hugging Face.
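Once the weights are loaded, you can confirm the OpenAI-compatible endpoints are live before touching any UI. This assumes the server is reachable on localhost:8000 as launched above:

```shell
# List the served model(s)
curl http://localhost:8000/v1/models

# Send a minimal chat completion, exactly as you would to OpenAI
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'
```

If both calls return JSON, the server side is done; anything that breaks from here on is a client or networking issue.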


Step 3: Connect Open WebUI

Now that vLLM is humming along on port 8000, let’s give it a beautiful face. We will use Open WebUI (formerly Ollama WebUI).

1. Run Open WebUI via Docker

If you don’t have it running yet:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

2. The “Hidden” Networking Fix

This is the step that trips up most people. Because Open WebUI runs inside a Docker container, localhost inside the container refers to the container itself, not your host machine. You cannot use http://localhost:8000.

The Correct Configuration:

  1. Open Open WebUI in your browser (http://localhost:3000).
  2. Click on your Profile Icon > Settings > Connections.
  3. Turn OFF Ollama (optional, saves resources).
  4. Under OpenAI API, enter the following:
    • Base URL: http://host.docker.internal:8000/v1
    • API Key: EMPTY (vLLM accepts any string here, or literally the word “EMPTY”).
  5. Click the Refresh/Verify button.

If successful, you will see a green verification tick. Go to the “New Chat” dropdown, and you should see meta-llama/Meta-Llama-3-8B-Instruct available for selection.
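If the verification fails, you can test the exact route the container uses from the command line. Note that `host.docker.internal` only resolves inside the container because of the `--add-host` flag in the Docker command above, and this assumes `curl` is available in the Open WebUI image:

```shell
# Run curl from inside the Open WebUI container against the host's vLLM server
docker exec open-webui curl -s http://host.docker.internal:8000/v1/models
```

If this returns the model list but Open WebUI still fails to verify, recheck the Base URL for typos (the trailing /v1 is required).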


Optimization & Troubleshooting

“CUDA Out of Memory” on launch

vLLM is greedy by design: it pre-allocates nearly all free VRAM for the KV cache, so out-of-memory errors at startup are common.

  • Fix: Add --enforce-eager to the launch command if you are low on VRAM, or lower the utilization flag: --gpu-memory-utilization 0.7.

“The Context Window is too small”

If you see errors about “position ids,” your model might be trying to handle more context than your GPU can fit.

  • Fix: Explicitly cap the context with --max-model-len 4096. This guarantees stability over maximum context length.
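Putting both fixes together, a conservative launch for a VRAM-constrained card might look like this (a sketch; tune the numbers to your GPU):

```shell
# Low-VRAM launch: smaller KV-cache budget, capped context, no CUDA graphs
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --enforce-eager
```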

Conclusion: When to use which?

  • Stick with Ollama if: You are on a MacBook (Apple Silicon support in vLLM is still maturing) or you just want to run a quick test on a laptop.
  • Switch to vLLM if: You have a dedicated NVIDIA GPU rig, you want to serve models to multiple users in your house/office, or you are building a RAG (Retrieval-Augmented Generation) pipeline where speed is critical.

vLLM essentially turns your gaming PC into an enterprise inference server. Pair it with Open WebUI, and you have a setup that rivals ChatGPT Plus, with privacy included.

