---
name: "vLLM"
description: "Fleet skill: vLLM — software inventory and operations reference"
version: "1.0.0"
author: "skynet"
category: "fleet"
agents: ["claude-code", "codex", "gemini", "kimi"]
tags: ["software-vllm", "fleet", "software"]
---

# vLLM

---
name: software-vllm
description: High-performance LLM inference server optimized for the GB10 Grace Blackwell architecture. Use for low-latency, high-throughput model serving via an OpenAI-compatible API.
metadata:
  author: skynet
  version: 1.0.0
---

# vLLM (Virtual Large Language Model)

vLLM is the primary high-throughput inference engine for James's fleet, optimized for the GB10 Grace Blackwell superchip on **Spark**. It exposes an OpenAI-compatible API and uses PagedAttention to manage KV cache memory efficiently.

## Fleet Installation Status

| Machine | Role | Status | Endpoint | Model |
|---------|------|--------|----------|-------|
| **Spark** (192.168.86.48) | Primary Inference | **Active** (Docker) | `http://spark:8001` | Qwen 32B (Default) |
| **Vault** (192.168.86.27) | API Routing | N/A | Proxies to Spark | via LiteLLM |
| **Dev Workstation** | Testing/Client | Client Only | N/A | Local development |

## Configuration & Environment

### Docker Deployment (Spark)
vLLM runs as a containerized service on Spark.
- **Image:** `vllm/vllm-openai:v0.12.0`
- **Container Name:** `vllm-server`
- **Compose Path:** `~/infra/docker-compose.yml` on Spark.
- **Volume Mounts:**
  - `~/models:/root/.cache/huggingface` (Model weights)
  - `/dev/shm:/dev/shm` (Shared memory, which PyTorch uses to pass data between processes)

### Hardware Optimization (GB10 Grace Blackwell)
The GB10's unified memory architecture gives the CPU and GPU a single shared memory pool, which leaves room for large context windows and allows zero-copy data transfer.
- **GPU Memory Utilization:** Set `--gpu-memory-utilization` to `0.90` to leave headroom for the host system, which draws from the same pool.
- **Enforce Eager Mode:** Use `--enforce-eager` if CUDA graphs cause stability issues on early Blackwell drivers.
- **Tensor Parallelism:** GB10 is a single chip, so `--tensor-parallel-size 1` is usually sufficient unless multi-GPU scaling is configured.
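
The settings above can be combined into one launch command. The sketch below only prints a `docker run` command roughly equivalent to the Compose service; it does not start anything. The model ID `Qwen/Qwen2.5-32B-Instruct`, the served name `qwen-32b`, and the `8001:8000` port mapping are assumptions for illustration, since the actual values live in `~/infra/docker-compose.yml`.

```bash
# Sketch only: prints a `docker run` command built from the settings above.
IMAGE="vllm/vllm-openai:v0.12.0"
CONTAINER="vllm-server"
HF_MODEL_ID="${HF_MODEL_ID:-Qwen/Qwen2.5-32B-Instruct}"  # assumption
SERVED_NAME="${SERVED_NAME:-qwen-32b}"                   # assumption
HOST_PORT="${HOST_PORT:-8001}"

vllm_run_cmd() {
  # 8000 is vLLM's default listen port inside the container (mapping assumed).
  set -- docker run -d --name "$CONTAINER" --gpus all \
    -p "${HOST_PORT}:8000" \
    -v "$HOME/models:/root/.cache/huggingface" \
    -v /dev/shm:/dev/shm \
    "$IMAGE" \
    --model "$HF_MODEL_ID" \
    --served-model-name "$SERVED_NAME" \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 1
  # Opt in to eager mode only when CUDA graphs misbehave.
  if [ "${ENFORCE_EAGER:-0}" = "1" ]; then
    set -- "$@" --enforce-eager
  fi
  echo "$*"
}
```

Calling `vllm_run_cmd` prints the command for review; `ENFORCE_EAGER=1 vllm_run_cmd` appends `--enforce-eager`.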

## Key Commands

### Managing the Service (on Spark)
```bash
# View logs and check for model loading progress
ssh spark "docker logs -f vllm-server"

# Restart the inference engine
ssh spark "docker restart vllm-server"

# Monitor GPU utilization (NVIDIA-SMI)
ssh spark "nvidia-smi -l 1"
```
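
After a restart the server refuses connections until the model has loaded, so scripts may want to wait before sending requests. A minimal polling sketch, assuming the server exposes vLLM's `/health` route on the endpoint from the status table; the function names `probe` and `wait_for_vllm` are hypothetical.

```bash
# Sketch: block until the vLLM server answers its health check.
# probe() is split out so the polling loop can be exercised without a server.
probe() {
  # Succeeds once the API server is accepting requests.
  curl -fsS --max-time 2 "$1/health" >/dev/null 2>&1
}

wait_for_vllm() {
  # args: base_url [attempts] [delay_seconds]
  local base_url="$1" attempts="${2:-60}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$base_url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}

# Usage:
#   ssh spark "docker restart vllm-server" && wait_for_vllm http://spark:8001
```

The defaults (60 attempts, 5 seconds apart) are arbitrary; large models may need a longer window.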

### Direct API Interaction
Test the endpoint from any machine in the fleet. The `model` field must match the name the server is serving the model under:
```bash
curl http://192.168.86.48:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-32b",
    "messages": [{"role": "user", "content": "System check."}]
  }'
```

## Common Workflows

### 1. Model Swap
To deploy a new model from HuggingFace:
1. SSH into Spark.
2. Update `HF_MODEL_ID` in `~/infra/.env`.
3. From `~/infra`, run `docker compose up -d vllm-server` (Compose expects the service name, assumed here to match the container name).
4. Monitor logs to ensure weights are downloaded to `~/models`.
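
Step 2 can be scripted so the `.env` file never ends up with two `HF_MODEL_ID` lines. The helper below is a sketch; `set_env_var` is a hypothetical name, and the model ID in the usage comment is only an example.

```bash
# Sketch: set KEY=VALUE in an env file, replacing any existing assignment.
set_env_var() {
  # args: file key value
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  # Keep every line except a previous assignment of this key.
  if [ -f "$file" ]; then
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage on Spark (model ID is an example):
#   set_env_var ~/infra/.env HF_MODEL_ID "Qwen/Qwen2.5-32B-Instruct"
#   (cd ~/infra && docker compose up -d vllm-server)
#   docker logs -f vllm-server
```

The new assignment is appended at the end of the file, so line order may change; Compose does not depend on the order of entries in `.env`.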

### 2. Scaling with LiteLLM
The LiteLLM gateway (port 8000) handles failover between vLLM and Ollama.
- If vLLM is down, LiteLLM routes traffic to Ollama (running on Spark, port 11434).
- Use `http://spark:8000` for production workloads so requests keep being served during a vLLM outage.

### 3. Long Context Processing
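Given the GB10's memory, vLLM is configured for context lengths of 32k tokens and above.
- Use the `--max-model-len` flag in the startup command to set the context length explicitly. vLLM rejects values above the limit in the model's HuggingFace config unless `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` is set, so only raise it when the model and hardware support it.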
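
Before raising `--max-model-len`, it helps to estimate how much KV cache a single full-length sequence needs. Per token, the cache stores one key and one value vector per layer and KV head. The sketch below does that arithmetic; the example figures (64 layers, 8 KV heads, head dimension 128, 2-byte BF16 values) are assumptions for a Qwen-32B-class model and should be checked against the model's `config.json`.

```bash
# Sketch: rough KV cache size for one sequence.
kv_bytes_per_token() {
  # args: layers kv_heads head_dim dtype_bytes
  # The factor 2 covers the key and the value tensor.
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

kv_cache_gib() {
  # args: context_tokens layers kv_heads head_dim dtype_bytes
  local per_token
  per_token="$(kv_bytes_per_token "$2" "$3" "$4" "$5")"
  # Integer division: the result is rounded down to whole GiB.
  echo $(( $1 * per_token / 1073741824 ))
}

# Example with the assumed figures:
#   kv_bytes_per_token 64 8 128 2     # 262144 bytes (256 KiB) per token
#   kv_cache_gib 32768 64 8 128 2     # 8 GiB for one 32k-token sequence
```

This counts one sequence only; concurrent requests each need their own share of the cache, on top of the model weights.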

## Troubleshooting

### Container Fails to Start
- **Symptom:** "CUDA out of memory"
- **Fix:** Check whether Ollama or other Docker containers are holding GPU memory. Run `nvidia-smi` to identify stray processes and stop them, or reduce `--gpu-memory-utilization` to `0.85`.
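
To see which processes hold the most GPU memory, the per-process query output of `nvidia-smi` can be sorted. A sketch, with `rank_gpu_procs` as a hypothetical helper name; rows without a numeric memory value are skipped, which is an assumption about how unreported usage appears.

```bash
# Sketch: rank GPU processes by memory use.
# Reads "pid, process_name, used_memory" CSV rows (no header, no units) on
# stdin and prints "<MiB> <pid> <name>" per process, largest first.
rank_gpu_procs() {
  awk -F', *' '$3 ~ /^[0-9]+$/ { print $3, $1, $2 }' | sort -rn
}

# Usage on Spark:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#     --format=csv,noheader,nounits | rank_gpu_procs
```

The PID in the second column can then be matched to a container with `docker top <container>` before deciding what to stop.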

### Slow Initial Inference (First Token Latency)
- **Cause:** Weights are being swapped or KV cache is being allocated.
- **Fix:** Ensure `--swap-space` is set (default: 4 GiB of CPU memory per GPU) so KV cache blocks can overflow to CPU RAM if necessary.

### API Connection Refused
- **Check:** Ensure the container is bound to `0.0.0.0` and not `127.0.0.1`.
- **Command:** `ssh spark "netstat -tuln | grep 8001"` (use `ss -tuln` instead if `netstat` is not installed)

## Fleet-Specific Patterns

- **Automation:** Always use the LiteLLM proxy (`spark:8000`) in Python scripts rather than hitting vLLM directly. This allows James to perform maintenance on vLLM without breaking active agents.
- **Model Storage:** All models are stored in `~/models` on Spark's NVMe drive. Never download models to individual workstations.
- **Inference Priority:** vLLM is preferred for heavy reasoning tasks (Qwen 32B/Llama 70B). Ollama is reserved for fast, utility-grade models (Llama 8B/Mistral).