Private LLM Platform for Company-wide Coding Assistant

A self‑hosted, GPU‑served coding assistant for a team, with full access control, usage analytics, and observability — running entirely on our own hardware.

llama.cpp · CUDA LiteLLM gateway Langfuse analytics Prometheus · Grafana · Loki Docker Compose

Overview

We run Qwen3‑Coder‑30B‑A3B (4‑bit GGUF) on a single GPU via llama.cpp, exposed to ~20 developers through an OpenAI‑compatible API. Every request passes through a gateway that authenticates users, enforces rate limits and budgets, and records the full prompt/response for analytics. A separate observability stack tracks model throughput, queueing, GPU/host health, and aggregates all container logs. Nothing leaves our infrastructure.

Architecture

request / data flow admin & out‑of‑band access

Components

Layer	Tool	Role
Model serving	llama.cpp (CUDA)	Serves Qwen3‑Coder‑30B‑A3B (Q4) over an OpenAI API on the GPU.
Gateway	LiteLLM	Single entrypoint: per‑user keys, rate limits, budgets, spend tracking, admin UI.
Analytics	Langfuse	Records every prompt & response; dashboards and export for usage / topic analysis.
Storage	PostgreSQL ×2	Keys & spend (LiteLLM) and conversation traces (Langfuse).
Metrics	Prometheus + node‑exporter + DCGM	Throughput, queueing, KV‑cache, GPU/VRAM/temp, host health; alert rules.
Logs	Loki + Promtail	Centralised, searchable logs from every container (LogQL in Grafana).
Dashboards	Grafana	Unified metrics + logs + alerting.

Key engineering decisions

Gatekeeper pattern. Users never touch the model directly — all traffic flows through LiteLLM so every call is authenticated, throttled, and attributable to a person.
Full conversation capture. LiteLLM streams prompts and replies to Langfuse, giving searchable usage data for analytics and topic modelling — self‑hosted, no third party.
Defence in depth. The raw model port and databases are bound to localhost only; remote DB access is via SSH tunnel. The open surface is just the keyed gateway.
Operational visibility. Queue depth, KV‑cache pressure, and GPU saturation are first‑class signals — the metrics that actually predict a bad experience under concurrent load.
Reproducible. The entire platform is two docker‑compose stacks on a shared network; nothing is installed on the host.