Private LLM Platform for Company-wide Coding Assistant
A self‑hosted, GPU‑served coding assistant for a team, with full access control, usage analytics, and observability — running entirely on our own hardware.
We run Qwen3‑Coder‑30B‑A3B (4‑bit GGUF) on a single GPU via llama.cpp, exposed to ~20 developers through an OpenAI‑compatible API. Every request passes through a gateway that authenticates users, enforces rate limits and budgets, and records the full prompt/response for analytics. A separate observability stack tracks model throughput, queueing, GPU/host health, and aggregates all container logs. Nothing leaves our infrastructure.
Architecture
request / data flow admin & out‑of‑band access
Components
Layer
Tool
Role
Model serving
llama.cpp (CUDA)
Serves Qwen3‑Coder‑30B‑A3B (Q4) over an OpenAI API on the GPU.
Centralised, searchable logs from every container (LogQL in Grafana).
Dashboards
Grafana
Unified metrics + logs + alerting.
Key engineering decisions
Gatekeeper pattern. Users never touch the model directly — all traffic flows through LiteLLM so every call is authenticated, throttled, and attributable to a person.
Full conversation capture. LiteLLM streams prompts and replies to Langfuse, giving searchable usage data for analytics and topic modelling — self‑hosted, no third party.
Defence in depth. The raw model port and databases are bound to localhost only; remote DB access is via SSH tunnel. The open surface is just the keyed gateway.
Operational visibility. Queue depth, KV‑cache pressure, and GPU saturation are first‑class signals — the metrics that actually predict a bad experience under concurrent load.
Reproducible. The entire platform is two docker‑compose stacks on a shared network; nothing is installed on the host.