Private LLM Platform for Company-wide Coding Assistant

A self‑hosted, GPU‑served coding assistant for a team, with full access control, usage analytics, and observability — running entirely on our own hardware.

llama.cpp · CUDA LiteLLM gateway Langfuse analytics Prometheus · Grafana · Loki Docker Compose

Overview

We run Qwen3‑Coder‑30B‑A3B (4‑bit GGUF) on a single GPU via llama.cpp, exposed to ~20 developers through an OpenAI‑compatible API. Every request passes through a gateway that authenticates users, enforces rate limits and budgets, and records the full prompt/response for analytics. A separate observability stack tracks model throughput, queueing, GPU/host health, and aggregates all container logs. Nothing leaves our infrastructure.

Architecture

OBSERVABILITY ≈20 Developers OpenAI‑compatible clients / editors LiteLLM Gateway · :4000 auth · per‑user keys · limits budgets · spend tracking llama‑server · :8000 Qwen3‑Coder‑30B‑A3B (Q4) llama.cpp · CUDA GPU · 256k ctx Langfuse · :3001 stores every prompt+reply usage analytics · export DBeaver SSH tunnel from laptop langfuse‑db Postgres · 127.0.0.1:5434 litellm‑db Postgres · 127.0.0.1:5433 Admin (you) direct, localhost only Exporters node‑exporter · DCGM (GPU) llama.cpp /metrics Prometheus · :9090 scrape + alert rules Promtail all container logs (Docker API) Loki · :3100 log store · LogQL Grafana · :3000 dashboards · log search metrics + alerts request + sk‑key inference logs keys · spend :8000 direct metrics + logs
request / data flow admin & out‑of‑band access

Components

LayerToolRole
Model servingllama.cpp (CUDA)Serves Qwen3‑Coder‑30B‑A3B (Q4) over an OpenAI API on the GPU.
GatewayLiteLLMSingle entrypoint: per‑user keys, rate limits, budgets, spend tracking, admin UI.
AnalyticsLangfuseRecords every prompt & response; dashboards and export for usage / topic analysis.
StoragePostgreSQL ×2Keys & spend (LiteLLM) and conversation traces (Langfuse).
MetricsPrometheus + node‑exporter + DCGMThroughput, queueing, KV‑cache, GPU/VRAM/temp, host health; alert rules.
LogsLoki + PromtailCentralised, searchable logs from every container (LogQL in Grafana).
DashboardsGrafanaUnified metrics + logs + alerting.

Key engineering decisions