vLLM
Fleet skill: vLLM — software inventory and operations reference
by skynet, v1.0.0
Compatible Agents: claude-code, codex, gemini, kimi
---
name: software-vllm
description: High-performance LLM inference server optimized for the GB10 Grace Blackwell architecture. Use for low-latency, high-throughput model serving via an OpenAI-compatible API.
metadata:
  author: skynet
  version: 1.0.0
---
# vLLM (Virtual Large Language Model)
vLLM is the primary high-throughput inference engine for James's fleet, tuned for the GB10 Grace Blackwell GPU on **Spark**. It exposes an OpenAI-compatible API and uses PagedAttention, which stores the KV cache in fixed-size blocks to reduce memory fragmentation.
## Fleet Installation Status
| Machine | Role | Status | Endpoint | Model |
|---------|------|--------|----------|-------|
| **Spark** (192.168.86.48) | Primary Inference | **Active** (Docker) | `http://spark:8001` | Qwen 32B (Default) |
| **Vault** (192.168.86.27) | API Routing | N/A (no local vLLM) | Proxies to Spark via LiteLLM | N/A |
| **Dev Workstation** | Testing/Client (local development) | Client Only | N/A | N/A |
## Configuration & Environment
### Docker Deployment (Spark)
vLLM runs as a containerized service on Spark.
- **Image:** `vllm/vllm-openai:v0.12.0`
- **Container Name:** `vllm-server`
- **Compose Path:** `~/infra/docker-compose.yml` on Spark.
- **Volume Mounts:**
  - `~/models:/root/.cache/huggingface` (Model weights)
  - `/dev/shm:/dev/shm` (Shared memory for fast inference)
### Hardware Optimization (GB10 Grace Blackwell)
The GB10's unified memory architecture lets the CPU and GPU share one memory pool, which leaves room for large KV caches (long context windows) and avoids host-to-device copies.
- **GPU Memory Utilization:** Set `--gpu-memory-utilization` to `0.90` to leave headroom for the host system, which draws from the same memory pool.
- **Enforce Eager Mode:** Use `--enforce-eager` if CUDA graphs cause stability issues on early Blackwell drivers. This disables CUDA graph capture, so expect lower throughput.
- **Tensor Parallelism:** The GB10 exposes a single GPU, so `--tensor-parallel-size 1` (the default) is sufficient unless multi-GPU scaling is configured.
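The settings above correspond to vLLM command-line flags. The following is a minimal sketch of how they might be assembled before being passed to the server. The function name `vllm_flags` and the variables `VLLM_GPU_MEM_UTIL` and `VLLM_ENFORCE_EAGER` are hypothetical helpers for illustration, not part of the fleet's Compose file.

```bash
# Hypothetical helper: print the vLLM flags described above on one line.
# VLLM_GPU_MEM_UTIL and VLLM_ENFORCE_EAGER are illustrative variable names.
vllm_flags() {
  local util
  local flags
  util="${VLLM_GPU_MEM_UTIL:-0.90}"
  flags="--gpu-memory-utilization ${util} --tensor-parallel-size 1"
  # Only add eager mode when explicitly requested, since it costs throughput.
  if [ "${VLLM_ENFORCE_EAGER:-0}" = "1" ]; then
    flags="${flags} --enforce-eager"
  fi
  printf '%s\n' "$flags"
}
```

Calling `vllm_flags` with no variables set prints the baseline flags; `VLLM_ENFORCE_EAGER=1 vllm_flags` appends `--enforce-eager` for troubleshooting runs.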
## Key Commands
### Managing the Service (on Spark)
```bash
# View logs and check for model loading progress
ssh spark "docker logs -f vllm-server"
# Restart the inference engine
ssh spark "docker restart vllm-server"
# Monitor GPU utilization (NVIDIA-SMI)
ssh spark "nvidia-smi -l 1"
```
### Direct API Interaction
Testing the endpoint from any machine in the fleet:
```bash
curl http://192.168.86.48:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-32b",
    "messages": [{"role": "user", "content": "System check."}]
  }'
```
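Before sending chat requests, for example right after a restart, it can help to wait until the server reports healthy. The sketch below polls vLLM's `/health` endpoint. The function name `wait_for_vllm`, the `VLLM_URL` variable, and the default attempt count and delay are assumptions chosen for illustration.

```bash
# Poll vLLM's /health endpoint until it answers HTTP 200 or attempts run out.
# Usage: wait_for_vllm [attempts] [delay_seconds]
wait_for_vllm() {
  local url
  local attempts
  local delay
  local i
  local code
  url="${VLLM_URL:-http://192.168.86.48:8001}"
  attempts="${1:-30}"
  delay="${2:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl prints only the status code; a refused connection yields 000.
    code="$(curl -s -o /dev/null -w '%{http_code}' "${url}/health" || true)"
    if [ "$code" = "200" ]; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

Once the server is ready, `curl -s http://192.168.86.48:8001/v1/models` lists the model names it accepts in the `model` field, which is a quick way to confirm that `qwen-32b` is the served name.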
## Common Workflows
### 1. Model Swap
To deploy a new model from HuggingFace:
1. SSH into Spark.
2. Update `HF_MODEL_ID` in `~/infra/.env`.
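3. From `~/infra`, run `docker compose up -d vllm-server` to recreate the container with the new value.
4. Monitor the logs (`docker logs -f vllm-server`) to confirm the weights download to `~/models` and the model finishes loading.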
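The steps above could be scripted roughly as follows when run on Spark. This is a sketch under assumptions: `set_model_id` and `swap_model` are hypothetical helper names, the Compose service is assumed to share the container name `vllm-server`, and the model ID is assumed not to contain `|` or `&`, which would confuse the `sed` expression.

```bash
# Set HF_MODEL_ID in an env file, replacing an existing entry or appending one.
set_model_id() {
  local env_file
  local model_id
  local tmp
  env_file="$1"
  model_id="$2"
  if grep -q '^HF_MODEL_ID=' "$env_file"; then
    tmp="$(mktemp)"
    # '|' is the sed delimiter because model IDs contain '/'.
    sed "s|^HF_MODEL_ID=.*|HF_MODEL_ID=${model_id}|" "$env_file" > "$tmp"
    # Write back in place so the env file keeps its permissions.
    cat "$tmp" > "$env_file"
    rm -f "$tmp"
  else
    printf 'HF_MODEL_ID=%s\n' "$model_id" >> "$env_file"
  fi
}

# Steps 2-4: update the env file, recreate the service, then follow the logs.
swap_model() {
  set_model_id "$HOME/infra/.env" "$1"
  (cd "$HOME/infra" && docker compose up -d vllm-server)
  docker logs -f vllm-server   # blocks; press Ctrl-C once loading completes
}
```

On Spark, `swap_model <org>/<model>` would then perform the swap in one call, with `<org>/<model>` standing in for the HuggingFace model ID.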
### 2. Scaling with LiteLLM
The LiteLLM gateway (port 8000) handles failover between vLLM and Ollama.
- If vLLM is down, LiteLLM routes traffic to Ollama (running on Spark, port 11434).
- Use `http://spark:8000` for production workloads to ensure high availability.
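Because failover happens inside the gateway, a client only needs to target port 8000. The sketch below sends a chat request through the gateway. The helper names `chat_payload` and `fleet_chat` and the `FLEET_LLM_URL` variable are hypothetical, and it assumes the gateway accepts the same `qwen-32b` model name and needs no API key, which depends on how LiteLLM is configured.

```bash
# Build a minimal chat-completions JSON body. Only backslashes and double
# quotes are escaped, so this suits single-line prompts; use jq for arbitrary text.
chat_payload() {
  local model
  local prompt
  model="$1"
  prompt="$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$model" "$prompt"
}

# Send the request to the gateway rather than to vLLM directly.
fleet_chat() {
  local base
  base="${FLEET_LLM_URL:-http://spark:8000}"
  curl -sS --fail "${base}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}
```

For example, `fleet_chat qwen-32b "System check."` should return a completion whether the request is served by vLLM or, during an outage, by the Ollama fallback.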
### 3. Long Context Processing
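Given the GB10's memory capacity, vLLM is configured for context lengths of 32k tokens and above.
- Use the `--max-model-len` flag in the startup command to set the context length. By default vLLM takes the limit from the model's HuggingFace config; it rejects larger values unless `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` is set, and exceeding the trained limit can degrade output quality even when the hardware has memory to spare.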
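Context length matters for memory because the KV cache grows linearly with the number of tokens. A rough estimate for one sequence is 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. The sketch below computes it. The shape used in the example (64 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumed 32B-class configuration, not one read from the fleet's model; the real values come from the model's `config.json`.

```bash
# Estimate KV cache size in bytes for a single sequence.
# Usage: kv_cache_bytes LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE TOKENS
kv_cache_bytes() {
  local layers
  local kv_heads
  local head_dim
  local dtype_bytes
  local tokens
  layers="$1"
  kv_heads="$2"
  head_dim="$3"
  dtype_bytes="$4"
  tokens="$5"
  # The leading 2 accounts for storing both keys and values.
  echo $((2 * $layers * $kv_heads * $head_dim * $dtype_bytes * $tokens))
}

# Assumed example shape with one 32,768-token sequence.
kv_cache_bytes 64 8 128 2 32768   # prints 8589934592 (8 GiB)
```

Under these assumptions a single 32k-token sequence needs about 8 GiB of KV cache on top of the model weights, which is why the memory headroom settings and `--max-model-len` need to be chosen together.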
## Troubleshooting
### Container Fails to Start
- **Symptom:** "CUDA out of memory"
- **Fix:** Check whether Ollama or other Docker containers are holding GPU memory, and run `nvidia-smi` to identify zombie processes. If memory is still tight, reduce `--gpu-memory-utilization` to `0.85`.
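To see which processes hold GPU memory, `nvidia-smi` can print a per-process table in CSV form. The sketch below sorts that table by memory use, largest first. The function name `gpu_mem_hogs` is hypothetical, and on some platforms the memory column may read `[N/A]`, in which case the ordering is not meaningful.

```bash
# List GPU compute processes as "pid, name, MiB", largest consumer first.
gpu_mem_hogs() {
  nvidia-smi --query-compute-apps=pid,process_name,used_memory \
    --format=csv,noheader,nounits | sort -t, -k3,3 -n -r
}
```

From a workstation this could be run as `ssh spark nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`, with the sort applied locally.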
### Slow Initial Inference (First Token Latency)
- **Cause:** Weights are still being loaded or swapped in, or the KV cache is being allocated.
- **Fix:** Ensure `--swap-space` is set (CPU swap space in GiB per GPU; default `4`) so that KV cache overflow can spill to CPU RAM if necessary.
### API Connection Refused
- **Check:** Ensure the container is bound to `0.0.0.0` and not `127.0.0.1`.
- **Command:** `ssh spark "netstat -tuln | grep 8001"`
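The output of that command can be classified automatically. The sketch below reads `netstat -tuln` (or `ss -tuln`) output on standard input and reports whether a port is bound to a loopback address only. The function name `bind_status` and its three status messages are illustrative choices.

```bash
# Read `netstat -tuln` or `ss -tuln` output on stdin and report how PORT is bound.
# Usage: ssh spark "netstat -tuln" | bind_status 8001
bind_status() {
  awk -v port="$1" '
    {
      for (i = 1; i <= NF; i++) {
        # A local address column ends in ":PORT"; peer columns end in ":*".
        if ($i ~ (":" port "$")) {
          found = 1
          if ($i !~ /^127\./ && $i !~ /^\[?::1\]?:/) exposed = 1
        }
      }
    }
    END {
      if (!found) print "not listening"
      else if (exposed) print "reachable from other hosts"
      else print "loopback only"
    }'
}
```

A result of `loopback only` points to the `127.0.0.1` binding problem described above, while `not listening` suggests the container is not running or has not finished starting.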
## Fleet-Specific Patterns
- **Automation:** Always use the LiteLLM proxy (`spark:8000`) in Python scripts rather than hitting vLLM directly. This allows James to perform maintenance on vLLM without breaking active agents.
- **Model Storage:** All models are stored in `~/models` on Spark's NVMe drive. Never download models to individual workstations.
- **Inference Priority:** vLLM is preferred for heavy reasoning tasks (Qwen 32B/Llama 70B). Ollama is reserved for fast, utility-grade models (Llama 8B/Mistral).
Install
curl -s https://skills.skynet.ceo/api/skills/software-vllm/skill.md