
Building a private LLM Cluster

A hands-on experiment building a self-managed at-home AI cluster with k3s, Ollama, and LiteLLM.


This post covers my hands-on experiment of building a local, self-managed AI cluster at a reasonable price point for at-home AI/LLM usage.

Intro

I like using LLMs and exploring their capabilities; I'm not very fond of the direction the industry is taking.

But I don't think development in the AI/LLM space will be halted, and I do think there is a direction that is better for society overall.

That is a future in which good, small, performant AI models are available for everyone to self-host on their own compute, keeping all data private and allowing much better encapsulation and sandboxing of LLM workloads.

Considerations for building an AI cluster

First, we need hardware. As seen in my initial OS LLM tests, I have access to a basic Bosgame mini PC with the new Ryzen AI 395 chip. After those initial tests, and considering current RAM rarity and prices, I extended my setup with another Bosgame mini PC. This also allows me to test and use a more complex local multi-node Kubernetes setup.

Thus, for this article we will use two AMD Strix Halo mini PCs with 128GB RAM each; in the BIOS we allocate 94GB of that as VRAM on both nodes, leaving the remaining RAM for the backend, Kubernetes resources, and other services.

Since the setup gets more complex, I prefer not to rely on Docker Compose alone and instead choose an orchestration tool that I can also manage externally through an API. For me, that choice is Kubernetes; as the distribution (installed via my system configuration management) I opted for k3s instead of my usual microk8s.

Now we have solutions for hardware ✅ and clustering software ✅. Both nodes currently run in a home subnetwork with their own router ✅. But since I plan to set up a multi-region cluster in the future, I am already extending this setup to work across networks and across LAN/WLAN.

For that, we will use Tailscale in this setup. I'm also planning a more static setup based purely on WireGuard in the future, but for now we choose convenience.

Still, there are more missing parts:

  1. multi-node LLM backend and orchestration
  2. load balancing and deployment

So for 1. I'll choose a multi-node Ollama Helm chart install; the initial aim is to have two nodes with the same models available so they can be load-balanced. For 2. I'll choose LiteLLM, as it is fairly easy to configure, supports API authorization and the OpenAI API schema, and comes with a bunch of handy load-balancing and routing features.

Experiment Setup

For easy configuration and management of both nodes, I've connected both to KVMs. I can control, observe, and manage both nodes simply from my laptop; this is important as I do not have enough screens and keyboards, and I also don't want to manually set up all nodes.

Here is a picture of the wiring. I've also configured a simple Hyprland setup with screen mirroring at my display resolution, so I can easily view each node's screen both through the KVM and on another monitor.

Real-life KVM connection setup

System Set-Up

To configure and manage the cluster, I want to avoid extensive manual setup while still keeping the setup repeatable and orchestratable from outside. Thus I chose to install NixOS on both nodes; in my NixOS setup I can easily set up SSH access keys, my tailnet, and everything else in a central deterministic configuration.

Notably, I ensure each system runs:

  • NixOS, amdgpu, k3s (see below)

And of course all required basic system tooling:

  • git, vim, k9s, kubectl, helm, etc.

Also I've configured basic SSH keys for access to my nodes from inside my tailscale network. Similarly, I've created a setup that lets me pre-define hosts per system. This will be important later when we obtain DNS-01 certificates to resolve, for example, our LiteLLM dashboard through our tailnet.
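To give an idea of what that looks like, here is a stripped-down sketch of the per-node Nix configuration (the option names are standard NixOS modules, but user names, keys, and addresses are placeholders, not my exact config):

{ config, pkgs, ... }:
{
  # GPU stack: in-tree amdgpu driver plus userspace graphics support
  hardware.graphics.enable = true;

  # Basic system tooling
  environment.systemPackages = with pkgs; [ git vim k9s kubectl kubernetes-helm ];

  # SSH access from inside the tailnet (key is a placeholder)
  services.openssh.enable = true;
  users.users.tim.openssh.authorizedKeys.keys = [ "ssh-ed25519 AAAA... laptop" ];

  # Tailnet + k3s; role and flags differ between the server and the agent node
  services.tailscale.enable = true;
  services.k3s = {
    enable = true;
    role = "server";                      # "agent" on the second node
    tokenFile = "/etc/secrets/k3s/token";
    extraFlags = "--flannel-iface tailscale0";
  };

  # Pre-defined hosts, e.g. to resolve the LiteLLM dashboard over the tailnet
  networking.extraHosts = ''
    100.64.0.1 litellm.home.arpa
  '';
}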

Here you can see my two KVM control tabs open, one selected and showing current VRAM and CPU usage through btop and amdgpu.

KVM Control screenshot

K3s Cluster Set-Up

For networking, I chose to route cluster traffic over the tailnet, so the mini PCs can be connected via LAN or WLAN and can be moved across locations and used across networks in the future.

I configured one cluster node to be the k3s agent and the other to be the server. As said, I've opted to connect over the tailnet directly, and the cluster configuration is managed through Nix; it roughly equates to these k3s commands:

TOKEN="$(openssl rand -hex 32)" sudo install -d -m 700 /etc/secrets/k3s echo "$TOKEN" | sudo tee /etc/secrets/k3s/token >/dev/null k3s server \ --cluster-init \ --token-file "/etc/secrets/k3s/token" \ --node-external-ip "node-a" \ --tls-san "node-a" \ --flannel-iface "tailscale0" k3s agent \ --server "https://node-a:6443" \ --token-file "/etc/secrets/k3s/token" \ --flannel-iface "tailscale0"

Wait for the services to become available and retrieve the kubeconfig.yaml from the main k8s node.

sudo cp /etc/rancher/k3s/k3s.yaml ./kubeconfig.yaml
sudo chown "$USER":"$USER" ./kubeconfig.yaml
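The exported kubeconfig points at 127.0.0.1 by default, so I rewrite it to the server's tailnet name before using it from my laptop and then check that both nodes have joined (hostname is a placeholder):

sed -i 's/127.0.0.1/node-a/' ./kubeconfig.yaml
kubectl --kubeconfig ./kubeconfig.yaml get nodes -o wide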

Setting up DNS-01 challenges and certificate issuing behind NAT

Note: this part will differ based on your DNS provider

To access the LiteLLM API in the browser conveniently, we want to have a real certificate!

kubectl create namespace cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --set crds.enabled=true

Example DNS challenge set-up:

export KUBECONFIG=./kubeconfig.yaml
kubectl create namespace cert-manager || true
helm repo add cert-manager-webhook-ionos https://fabmade.github.io/cert-manager-webhook-ionos
helm repo update
helm upgrade --install cert-manager-webhook-ionos \
  cert-manager-webhook-ionos/cert-manager-webhook-ionos \
  --namespace cert-manager

Create the API secret (ionos-secret.yaml):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: ionos-secret
  namespace: cert-manager
type: Opaque
stringData:
  IONOS_PUBLIC_PREFIX: "<YOUR DNS PROVIDER KEY PREFIX>"
  IONOS_SECRET: "<YOUR DNS PROVIDER KEY>"
EOF

Create a production ClusterIssuer letsencrypt-ionos-prod (its ACME account key will be stored in the letsencrypt-ionos-prod-account-key secret):

kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-ionos-prod
spec:
  acme:
    email: "<your-email>"
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-ionos-prod-account-key
    solvers:
      - dns01:
          webhook:
            groupName: acme.fabmade.de
            solverName: ionos
            config:
              apiUrl: https://api.hosting.ionos.com/dns/v1
              publicKeySecretRef:
                name: ionos-secret
                key: IONOS_PUBLIC_PREFIX
              secretKeySecretRef:
                name: ionos-secret
                key: IONOS_SECRET
EOF
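Before requesting certificates, it's worth checking that the issuer has successfully registered its ACME account (the READY column should show True):

kubectl get clusterissuer letsencrypt-ionos-prod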

Ollama Multi-Node Installation

First we need to teach our cluster to work with our integrated graphics; then we can install Ollama with Vulkan enabled and run LLMs.

Set up k3s AMD ROCm resources

Now we have two devices with AMD integrated GPUs and Kubernetes installed. But we can see that the Kubernetes cluster doesn't yet have AMD GPU resources listed on nodes:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.amd\.com/gpu}{"\n"}{end}'

If the GPU plugin is not installed yet, the AMD column is usually empty or shows <none>.

helm repo add rocm https://rocm.github.io/k8s-device-plugin/
helm repo update

Now we can install the ROCm GPU plugin:

helm upgrade --install amd-gpu rocm/amd-gpu \
  --namespace kube-system

Now we can see the GPUs listed:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.amd\.com/gpu}{"\n"}{end}'

Set up Ollama Helm chart

I opted to use the otwld/ollama Helm chart. Add the Helm repo:

helm repo add otwld https://helm.otwld.com/
helm repo update
helm search repo otwld/ollama --versions

We create some basic values that allow us to replicate the setup 1:1 on our other node:

fullnameOverride: ollama
image:
  repository: ollama/ollama
  tag: "0.19.0"
ollama:
  gpu:
    enabled: true
    type: amd
    number: 1
persistentVolume:
  enabled: true
  size: 120Gi
service:
  type: NodePort
  port: 11434
  nodePort: 31434
ingress:
  enabled: true
  className: traefik
  annotations: {}
  hosts:
    - host: ollama.home.arpa
      paths:
        - path: /
          pathType: Prefix
  tls: []
nodeSelector:
  kubernetes.io/hostname: node-b
extraEnv: <...ENV set-up see benchmark below...>
resources:
  requests:
    cpu: "2000m"
    memory: 16Gi
  limits:
    cpu: "6000m"
    memory: 112Gi
tests:
  enabled: true

Note that we apply these values twice with a minimal modification:

nodeSelector:
  kubernetes.io/hostname: node-a

vs

nodeSelector:
  kubernetes.io/hostname: node-b

This ensures we have identical Ollama instances running on both hosts, but it also means we always need to keep the model state of both instances in sync; see the install and sync commands below.
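For completeness, both releases are then installed from the otwld repo, roughly like this (the values file names are illustrative, and the second release needs its own naming so the services don't collide; in my setup the resulting services end up as ollama and ollama-bosgame, which the LiteLLM config below relies on):

helm --kubeconfig "kubeconfig.yaml" upgrade --install ollama otwld/ollama \
  --namespace ollama --create-namespace \
  --values "ollama-values-node-a.yaml"

helm --kubeconfig "kubeconfig.yaml" upgrade --install ollama-bosgame otwld/ollama \
  --namespace ollama \
  --values "ollama-values-node-b.yaml"

And whenever I pull a new model, I make it land on both instances, e.g. (assuming the chart creates one Deployment per release):

for d in ollama ollama-bosgame; do
  kubectl --kubeconfig "kubeconfig.yaml" -n ollama exec "deploy/$d" -- ollama pull gemma4:e4b
done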

LiteLLM load balancing and routing setup

I chose to use this Helm chart as a base setup for LiteLLM.

We want our traffic to be routed based on system usage, i.e. when one mini PC is under NPU load it should receive less traffic. Ideally, we also want the API to reject traffic when a node is under so much load that the experience would become too bad (more on that later).

LiteLLM Dashboard Screenshot

Set up the basic chart values:

replicaCount: 1
image:
  repository: ghcr.io/berriai/litellm-database
  pullPolicy: IfNotPresent
  tag: main-v1.82.3-stable
service:
  type: ClusterIP
  port: 4000
ingress:
  enabled: true
  className: traefik
  annotations: {}
  hosts:
    - host: litellm.home.arpa
      paths:
        - path: /
          pathType: Prefix
  tls: []
db:
  useExisting: false
  deployStandalone: true
migrationJob:
  enabled: true
resources:
  requests:
    cpu: "500m"
    memory: 1Gi
  limits:
    cpu: "2000m"
    memory: 4Gi

Now, since we have set up initial cluster DNS-01 challenges, we can request an official Let's Encrypt certificate via:

ingress:
  enabled: true
  className: traefik
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-ionos-prod
  hosts:
    - host: <your-domain>
      paths:
        - path: /
          pathType: Prefix

This makes it much easier to access the LiteLLM UI through the browser, as it won't complain about SSL and self-signed certificates. But note that you still need to register <your-domain> in your own /etc/hosts, since we did NOT create a public DNS entry (shout-out to JannisT for this setup suggestion).
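For example, with a placeholder tailnet IP for the node that runs traefik:

# /etc/hosts on the client machine
100.64.0.1   <your-domain>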

Then start by adding your models. Depending on the node setup, you need to register the model endpoints; for me that's one per cluster node, so two. We choose to throttle requests depending on model size; this configuration will be further adjusted and tested in future blog posts.

proxy_config:
  model_list:
    - model_name: gemma4:e4b
      litellm_params:
        model: openai/gemma4:e4b
        api_base: http://ollama.ollama.svc.cluster.local:11434/v1
        api_key: "none"
        rpm: 60
    - model_name: gemma4:e4b
      litellm_params:
        model: openai/gemma4:e4b
        api_base: http://ollama-bosgame.ollama.svc.cluster.local:11434/v1
        api_key: "none"
        rpm: 60

Finally, we also want to configure the router. For now I opted to generate one general PROXY_MASTER_KEY instead of individual keys per endpoint, just for benchmarking.

router_settings:
  routing_strategy: usage-based-routing
  num_retries: 2
  timeout: 600
litellm_settings:
  drop_params: true
  request_timeout: 600
general_settings:
  master_key: os.environ/PROXY_MASTER_KEY

Configure LiteLLM to also route external providers

I found it very convenient to also use LiteLLM to route to paid external providers, as this gives me a central place to manage access and tokens. Since I also ran my benchmark against some public providers for comparison, this was set up as well; e.g. one could add this to integrate OpenAI gpt-5.5 or gpt-5.4-mini:

model_list:
  - model_name: gpt-5.5
    litellm_params:
      model: openai/gpt-5.5
      api_base: https://api.openai.com/v1
      api_key: os.environ/OPENAI_API_KEY
      rpm: 120
  - model_name: gpt-5.4-mini
    litellm_params:
      model: openai/gpt-5.4-mini
      api_base: https://api.openai.com/v1
      api_key: os.environ/OPENAI_API_KEY
      rpm: 180

This can also be used very conveniently to manage LLM access, e.g. in a company. And finally we can install the chart:

helm --kubeconfig "kubeconfig.yaml" upgrade --install litellm-ollama \
  oci://docker.litellm.ai/berriai/litellm-helm \
  --version 1.82.3 \
  --namespace ollama \
  --create-namespace \
  --values "litellm-values-ollama-lb.yaml"

Now we can retrieve the master key:

kubectl --kubeconfig "kubeconfig.yaml" -n ollama get secret litellm-ollama-masterkey \
  -o jsonpath='{.data.masterkey}' | base64 -d
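With the master key in hand, any OpenAI-compatible client can talk to the cluster. A quick smoke test with curl (domain and key are placeholders):

export LITELLM_KEY="sk-..."   # master key retrieved above
curl -s "https://<your-domain>/v1/chat/completions" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma4:e4b", "messages": [{"role": "user", "content": "Hello from the cluster"}]}'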

Setup complete. Now we can see the whole cluster with all nodes and workloads running, and we're ready for benchmarking.

K9s Cluster Online view

Benchmark

Compared to the last benchmark, I wanted to extend the test range. Specifically, I'm interested in finding out:

  • TTFT (time to first token); and
  • TTFT dependency on context length

To reset the tests, I always ensure that all models are killed and fully unloaded, and I restart the Ollama containers every time. ./scripts/kill_all_running_completions.sh --force-restart wraps a Kubernetes exec command into a simple script that can be hooked into my updated Ollama benchmark. I also refactored the benchmark to work in 'API only' mode so I can compare my results with known existing LLM hosters (for example OpenAI).
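The script itself is nothing fancy; a rough sketch of what it does (an illustrative reconstruction, not the exact script):

#!/usr/bin/env bash
# Unload all running models on both Ollama deployments and optionally restart the pods.
set -euo pipefail

for d in ollama ollama-bosgame; do
  # stop whatever models are currently loaded
  kubectl --kubeconfig kubeconfig.yaml -n ollama exec "deploy/$d" -- \
    sh -c 'ollama ps | awk "NR>1 {print \$1}" | xargs -r -n1 ollama stop'

  if [ "${1:-}" = "--force-restart" ]; then
    kubectl --kubeconfig kubeconfig.yaml -n ollama rollout restart "deployment/$d"
    kubectl --kubeconfig kubeconfig.yaml -n ollama rollout status "deployment/$d"
  fi
done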

I also observed that it is very important to preload the Ollama model, because TTFT is significantly reduced when the model is already loaded into memory. After observing this, I had to re-run all tests because the initial TTFT values would significantly drag down the overall average.
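Done by hand, preloading boils down to a tiny request against the Ollama API; an empty prompt just loads the model into memory (the in-cluster endpoint and the model name here are examples):

curl -s http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "gemma4:e4b", "keep_alive": "1h"}'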

LITELLM_KEY='<YOUR LITELLM API KEY>' \
LITELLM_UNLOAD_COMMAND='./scripts/kill_all_running_completions.sh --force-restart' \
CONTEXT_STEPS=8 \
API_PRIME_REQUESTS=2 \
./scripts/run-full-test.py model-configs-cluster/<llm-benchmark-config>.yaml

Benchmark Results 1: Model Comparison

In the first version of this article, I tested the following models:

  • deepseek-r1:1.5b
  • deepseek-r1:14b
  • deepseek-r1:32b
  • deepseek-r1:7b
  • deepseek-r1:8b
  • gemma4-31b
  • gemma4:26b
  • gemma4:e4b
  • nemotron-cascade-2:30b
  • qwen3.6:27b
  • qwen3.6:35b
  • extra: gpt-5.4-mini
  • extra: gpt-5.4-nano
  • extra: gpt-5.5

The selection stopped there mainly because I had already spent a lot of time setting up the cluster and benchmarking, and because Ollama rate-limited my model downloads after some time...

I later re-ran the benchmark with updated settings; see further below for the improved results.

1) Basic benchmark comparison

Model | Avg tokens/s | Avg TTFT (ms) | Prompts | Max tokens | Temp | Stream | Ctx min | Ctx max
deepseek-r1:1.5b | 150.76 | 293.03 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:14b | 24.34 | 510.09 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:32b | 11.21 | 722.59 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:7b | 43.63 | 391.04 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:8b | 38.84 | 378.64 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4-31b | 10.81 | 882.37 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4:26b | 34.91 | 463.75 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4:e4b | 22.23 | 881.12 | 5 | 120 | 0.2 | True | 256 | 8192
nemotron-cascade-2:30b | 63.54 | 477.57 | 5 | 120 | 0.2 | True | 256 | 8192
qwen3.6:27b | 10.90 | 556.31 | 5 | 120 | 0.2 | True | 256 | 8192
qwen3.6:35b | 43.82 | 393.53 | 5 | 120 | 0.2 | True | 256 | 8192
extra: gpt-5.4-mini | 146.22 | 820.37 | 5 | 120 | 0.2 | True | 256 | 8192
extra: gpt-5.4-nano | 154.10 | 932.67 | 5 | 120 | 0.2 | True | 256 | 8192
extra: gpt-5.5 | 240.36 | 2532.03 | 5 | 120 | 1 | True | 256 | 8192

2) Context dependency matrix (cell = TTFT ms / TPS)

Model | 256 | 420 | 689 | 1131 | 1855 | 3043 | 4993 | 8192
deepseek-r1:1.5b | 428.7 / 161.96 | 799.1 / 136.34 | 467.6 / 140.91 | 745.6 / 118.20 | 1101.9 / 136.64 | 2057.6 / 115.74 | 3734.5 / 2323.33 | 8365.7 / 109.83
deepseek-r1:14b | 1850.9 / 24.04 | 2675.2 / 23.12 | 2834.3 / 23.66 | 4459.5 / 21.69 | 8047.5 / 21.64 | 14473.9 / 20.20 | 28394.6 / 18.80 | 62332.4 / 0.00
deepseek-r1:32b | 4118.3 / 11.10 | 5963.4 / 10.90 | 6538.6 / 11.36 | 10339.6 / 10.94 | 17599.6 / 10.49 | 30579.9 / 10.22 | 56279.6 / 9.59 | 114042.7 / 17.24
deepseek-r1:7b | 1238.0 / 41.37 | 1520.1 / 42.54 | 1758.0 / 41.96 | 2279.7 / 43.08 | 3831.8 / 39.18 | 6966.6 / 38.41 | 13241.8 / 35.01 | 27059.7 / 34.08
deepseek-r1:8b | 1249.7 / 38.96 | 1759.3 / 36.36 | 1581.6 / 36.83 | 2639.5 / 34.08 | 4399.0 / 33.85 | 8336.1 / 31.38 | 16415.3 / 28.71 | 36080.3 / 24.59
gemma4-31b | 4276.4 / 10.04 | 6296.4 / 9.88 | 6747.6 / 9.75 | 11411.5 / 9.86 | 19858.7 / 9.65 | 40756.8 / 9.73 | 136644.3 / 9.08 | 412691.0 / 9.24
gemma4:26b | 1539.1 / 38.16 | 2353.9 / 29.87 | 1706.6 / 35.64 | 3227.9 / 30.30 | 4622.3 / 35.60 | 9168.0 / 29.73 | 16423.8 / 34.08 | 36809.2 / 27.72
gemma4:e4b | 7438.0 / 23.44 | 18421.3 / 19.24 | 12118.5 / 22.77 | 31385.4 / 18.70 | 33828.8 / 20.98 | 91792.6 / 16.06 | 119861.2 / 16.99 | 359684.5 / 11.55
nemotron-cascade-2:30b | 1097.9 / 65.78 | 1696.6 / 60.59 | 1098.9 / 60.52 | 1977.4 / 60.04 | 3135.1 / 61.99 | 5473.3 / 54.60 | 7581.8 / 51.78 | 13900.1 / 50.41
qwen3.6:27b | 3516.2 / 10.90 | 5487.2 / 10.62 | 5691.8 / 10.74 | 11068.7 / 10.54 | 20218.5 / 10.64 | 36026.7 / 10.49 | 51709.2 / 10.30 | 94600.6 / 10.06
qwen3.6:35b | 1555.5 / 39.17 | 1759.9 / 45.46 | 1760.5 / 41.22 | 3028.3 / 45.46 | 5332.8 / 39.64 | 9476.1 / 42.20 | 13865.3 / 38.54 | 25481.7 / 40.66
extra: gpt-5.4-mini | 2421.2 / 182.52 | 911.4 / 112.34 | 879.7 / 166.11 | 1037.7 / 177.67 | 864.7 / 181.73 | 1099.4 / 181.74 | 972.6 / 152.31 | 1101.8 / 136.78
extra: gpt-5.4-nano | 756.8 / 200.59 | 916.3 / 193.01 | 969.8 / 202.75 | 999.7 / 257.31 | 660.0 / 179.18 | 1309.9 / 181.85 | 879.4 / 166.73 | 1114.1 / 204.15
extra: gpt-5.5 | 675.8 / 56.38 | 1736.0 / 68.90 | 1401.5 / 27.76 | 2199.7 / 129.05 | 1917.5 / 131.38 | 2326.1 / 140.26 | 3584.2 / 61.83 | 2417.5 / 106.07

Legend: each context cell is TTFT(ms) / tokens_per_second.

Blog Addition: Benchmark Re-Runs

Re-run with a bigger default context and flash attention (these map into the Ollama chart's extraEnv, sketched below):
  • set OLLAMA_FLASH_ATTENTION=1
  • set OLLAMA_KV_CACHE_TYPE=q8_0
  • set OLLAMA_CONTEXT_LENGTH=32768
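These settings go into the extraEnv block that was elided from the Ollama values above; roughly (exact key names may differ between chart versions):

extraEnv:
  - name: OLLAMA_FLASH_ATTENTION
    value: "1"
  - name: OLLAMA_KV_CACHE_TYPE
    value: "q8_0"
  - name: OLLAMA_CONTEXT_LENGTH
    value: "32768"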

1) Basic benchmark comparison

Model | Avg tokens/s | Avg TTFT (ms) | Prompts | Max tokens | Temp | Stream | Ctx min | Ctx max
deepseek-r1:1.5b | 141.31 | 190.90 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:14b | 24.14 | 396.75 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:32b | 11.16 | 656.98 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:7b | 42.98 | 294.63 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:8b | 38.83 | 337.22 | 5 | 120 | 0.2 | True | 256 | 8192

2) Context dependency matrix (cell = TTFT ms / TPS)

Model | 256 | 420 | 689 | 1131 | 1855 | 3043 | 4993 | 8192
deepseek-r1:1.5b | 419.5 / 177.26 | 642.4 / 135.83 | 437.8 / 160.39 | 792.7 / 128.23 | 1042.8 / 136.44 | 2004.9 / 111.78 | 3980.5 / 2640.74 | 9439.9 / 91.71
deepseek-r1:14b | 1796.4 / 23.79 | 2631.6 / 22.26 | 2683.1 / 23.86 | 4479.4 / 21.28 | 8143.7 / 21.44 | 15357.6 / 20.14 | 30560.9 / 18.19 | 67705.2 / 0.00
deepseek-r1:32b | 3910.9 / 11.04 | 6071.4 / 10.88 | 6352.6 / 11.16 | 10480.8 / 10.66 | 17850.9 / 10.56 | 31701.4 / 10.12 | 58965.5 / 9.55 | 120763.5 / 17.22
deepseek-r1:7b | 1099.0 / 40.93 | 1469.8 / 45.49 | 1539.4 / 41.95 | 2250.2 / 43.15 | 3988.8 / 38.90 | 7254.0 / 38.20 | 14020.1 / 36.07 | 28857.7 / 32.66
deepseek-r1:8b | 1319.7 / 38.04 | 1512.0 / 38.40 | 1713.1 / 34.54 | 2639.4 / 37.76 | 4624.9 / 33.11 | 8594.2 / 31.36 | 17776.0 / 27.54 | 37400.9 / 24.62

Re-run with a different KV cache type:

  • OLLAMA_FLASH_ATTENTION=1
  • OLLAMA_KV_CACHE_TYPE=f16
  • OLLAMA_CONTEXT_LENGTH=32768
  • OLLAMA_KEEP_ALIVE=1h
  • OLLAMA_NUM_PARALLEL=1
  • OLLAMA_MAX_LOADED_MODELS=1

1) Basic benchmark comparison
Model | Avg tokens/s | Avg TTFT (ms) | Prompts | Max tokens | Temp | Stream | Ctx min | Ctx max
deepseek-r1:1.5b | 141.31 | 190.90 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:14b | 24.14 | 396.75 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:32b | 11.16 | 656.98 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:7b | 42.98 | 294.63 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:8b | 38.83 | 337.22 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4-31b | 10.71 | 881.20 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4:26b | 35.20 | 528.36 | 5 | 120 | 0.2 | True | 256 | 8192
nemotron-cascade-2:30b | 59.47 | 353.14 | 5 | 120 | 0.2 | True | 256 | 8192
qwen3.6:27b | 10.91 | 565.57 | 5 | 120 | 0.2 | True | 256 | 8192
qwen3.6:35b | 43.08 | 370.16 | 5 | 120 | 0.2 | True | 256 | 8192
2) Context dependency matrix (cell = TTFT ms / TPS)
Model | 256 | 420 | 689 | 1131 | 1855 | 3043 | 4993 | 8192
deepseek-r1:1.5b | 419.5 / 177.26 | 642.4 / 135.83 | 437.8 / 160.39 | 792.7 / 128.23 | 1042.8 / 136.44 | 2004.9 / 111.78 | 3980.5 / 2640.74 | 9439.9 / 91.71
deepseek-r1:14b | 1796.4 / 23.79 | 2631.6 / 22.26 | 2683.1 / 23.86 | 4479.4 / 21.28 | 8143.7 / 21.44 | 15357.6 / 20.14 | 30560.9 / 18.19 | 67705.2 / 0.00
deepseek-r1:32b | 3910.9 / 11.04 | 6071.4 / 10.88 | 6352.6 / 11.16 | 10480.8 / 10.66 | 17850.9 / 10.56 | 31701.4 / 10.12 | 58965.5 / 9.55 | 120763.5 / 17.22
deepseek-r1:7b | 1099.0 / 40.93 | 1469.8 / 45.49 | 1539.4 / 41.95 | 2250.2 / 43.15 | 3988.8 / 38.90 | 7254.0 / 38.20 | 14020.1 / 36.07 | 28857.7 / 32.66
deepseek-r1:8b | 1319.7 / 38.04 | 1512.0 / 38.40 | 1713.1 / 34.54 | 2639.4 / 37.76 | 4624.9 / 33.11 | 8594.2 / 31.36 | 17776.0 / 27.54 | 37400.9 / 24.62
gemma4-31b | 4397.6 / 10.06 | 6220.6 / 9.82 | 6652.5 / 9.76 | 11416.4 / 9.90 | 20962.1 / 10.34 | 40675.2 / 9.70 | 135690.2 / 9.05 | 411124.6 / 9.25
gemma4:26b | 1784.3 / 30.01 | 1912.4 / 36.35 | 2002.9 / 28.85 | 2857.8 / 36.30 | 5111.5 / 29.17 | 8601.7 / 34.24 | 16929.3 / 28.92 | 35758.9 / 32.77
nemotron-cascade-2:30b | 978.2 / 63.84 | 1539.3 / 60.44 | 1097.1 / 67.42 | 1798.6 / 56.09 | 2928.9 / 62.82 | 5282.0 / 54.56 | 7328.9 / 51.01 | 13637.7 / 52.39
qwen3.6:27b | 3537.2 / 10.85 | 5509.0 / 10.56 | 5746.9 / 10.68 | 11172.7 / 10.51 | 20261.7 / 10.68 | 36068.3 / 10.29 | 51735.6 / 10.32 | 94678.5 / 9.98
qwen3.6:35b | 1343.6 / 44.83 | 2194.7 / 39.74 | 1585.0 / 44.90 | 3080.7 / 39.62 | 5332.7 / 42.15 | 9675.2 / 39.09 | 13807.2 / 41.01 | 26637.5 / 37.21

After using some of the OSS LLMs in my workflows, I realized that my context-length test might be wrong: in a normal chat session you continue with the exact previous context, and in normal usage I experienced much lower TTFT than what I was seeing in my benchmarks. So I rewrote the context size test to actually emulate a "continuing" session. But then I also realized that cache reuse is only possible if the same request gets routed to the same node. So even after these changes, the initial settings performed best; I'll test further variations in future experiments.

Re-run: re-using the previous context for long-context runs

1) Basic benchmark comparison
Model | Avg tokens/s | Avg TTFT (ms) | Prompts | Max tokens | Temp | Stream | Ctx min | Ctx max
deepseek-r1:1.5b | 162.31 | 218.50 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:14b | 24.22 | 403.97 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:32b | 11.15 | 630.48 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:7b | 43.78 | 253.69 | 5 | 120 | 0.2 | True | 256 | 8192
deepseek-r1:8b | 38.75 | 267.04 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4-31b | 10.85 | 898.85 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4:26b | 37.45 | 502.19 | 5 | 120 | 0.2 | True | 256 | 8192
gemma4:e4b | 23.21 | 833.10 | 5 | 120 | 0.2 | True | 256 | 8192
nemotron-cascade-2:30b | 62.52 | 482.47 | 5 | 120 | 0.2 | True | 256 | 8192
qwen3.6:27b | 10.91 | 608.14 | 5 | 120 | 0.2 | True | 256 | 8192
qwen3.6:35b | 42.69 | 517.88 | 5 | 120 | 0.2 | True | 256 | 8192

2) Context dependency matrix (cell = TTFT ms / TPS)

Model | 256 | 420 | 689 | 1131 | 1855 | 3043 | 4993 | 8192
deepseek-r1:1.5b | 386.4 / 170.37 | 633.8 / 128.73 | 467.1 / 154.89 | 797.2 / 150.02 | 1047.0 / 152.96 | 2340.1 / 146.26 | 3648.6 / 208.68 | 8199.2 / 95.43
deepseek-r1:14b | 1925.3 / 23.73 | 2855.8 / 22.72 | 2860.2 / 22.71 | 4408.8 / 21.75 | 8118.4 / 21.00 | 14545.2 / 20.23 | 28582.2 / 19.31 | 62470.2 / 16.26
deepseek-r1:32b | 3958.9 / 11.09 | 6335.6 / 10.91 | 6457.5 / 10.80 | 10605.9 / 11.05 | 17941.0 / 10.49 | 30800.6 / 0.00 | 56235.0 / 9.53 | 114121.6 / 8.75
deepseek-r1:7b | 961.3 / 44.91 | 1520.0 / 41.48 | 1422.5 / 43.70 | 2272.6 / 40.85 | 3806.8 / 38.68 | 6812.1 / 37.37 | 12829.1 / 35.98 | 27943.1 / 33.45
deepseek-r1:8b | 1184.7 / 36.78 | 1582.0 / 37.40 | 1756.1 / 36.31 | 2634.8 / 37.90 | 4535.0 / 33.22 | 8299.6 / 30.55 | 16362.9 / 28.29 | 34171.7 / 24.80
gemma4-31b | 4578.4 / 10.10 | 6658.9 / 9.76 | 7037.5 / 9.88 | 11862.8 / 9.61 | 20464.6 / 9.74 | 41579.3 / 9.55 | 124721.5 / 9.00 | 561515.7 / 9.11
gemma4:26b | 1495.1 / 36.57 | 2438.6 / 29.14 | 1826.3 / 35.80 | 3453.3 / 28.67 | 4894.9 / 34.61 | 9459.4 / 28.73 | 16942.0 / 33.84 | 38030.7 / 28.35
gemma4:e4b | 7550.6 / 22.97 | 19217.4 / 15.42 | 12559.8 / 22.26 | 32229.7 / 15.67 | 34552.3 / 20.37 | 91452.3 / 15.03 | 120309.8 / 16.09 | 370979.0 / 11.58
nemotron-cascade-2:30b | 1043.1 / 63.70 | 1757.6 / 61.82 | 1249.8 / 63.68 | 2713.2 / 65.17 | 3504.5 / 61.79 | 8272.6 / 37.01 | 7918.5 / 52.50 | 15585.7 / 36.38
qwen3.6:27b | 3816.0 / 10.84 | 5900.3 / 10.54 | 6129.2 / 10.68 | 11886.5 / 10.19 | 20982.5 / 10.54 | 37356.6 / 10.29 | 53244.9 / 10.31 | 98825.9 / 9.58
qwen3.6:35b | 1856.4 / 38.02 | 1907.7 / 45.11 | 2123.4 / 37.48 | 3325.6 / 45.10 | 6113.1 / 36.96 | 10343.4 / 41.43 | 15384.7 / 34.27 | 26747.8 / 38.07

Legend: each context cell is TTFT(ms) / tokens_per_second.

Cross-setting TTFT context dependency (selected model families)

This comparison shows TTFT (ms) only, across context sizes, for gemma4-*, nemotron-cascade-*, and qwen3-* over the different benchmark settings.

A) Initial benchmark run (baseline: pre re-run values from the first matrix; no explicit env overrides listed in that section)

Model | 256 | 420 | 689 | 1131 | 1855 | 3043 | 4993 | 8192
gemma4-31b | 4276.4 | 6296.4 | 6747.6 | 11411.5 | 19858.7 | 40756.8 | 136644.3 | 412691.0
gemma4:26b | 1539.1 | 2353.9 | 1706.6 | 3227.9 | 4622.3 | 9168.0 | 16423.8 | 36809.2
gemma4:e4b | 7438.0 | 18421.3 | 12118.5 | 31385.4 | 33828.8 | 91792.6 | 119861.2 | 359684.5
nemotron-cascade-2:30b | 1097.9 | 1696.6 | 1098.9 | 1977.4 | 3135.1 | 5473.3 | 7581.8 | 13900.1
qwen3.6:27b | 3516.2 | 5487.2 | 5691.8 | 11068.7 | 20218.5 | 36026.7 | 51709.2 | 94600.6
qwen3.6:35b | 1555.5 | 1759.9 | 1760.5 | 3028.3 | 5332.8 | 9476.1 | 13865.3 | 25481.7

B) Re-run with KV cache = f16 (flash attention + f16 KV cache: OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=f16, OLLAMA_CONTEXT_LENGTH=32768, OLLAMA_KEEP_ALIVE=1h, OLLAMA_NUM_PARALLEL=1, OLLAMA_MAX_LOADED_MODELS=1)

Model | 256 | 420 | 689 | 1131 | 1855 | 3043 | 4993 | 8192
gemma4-31b | 4397.6 | 6220.6 | 6652.5 | 11416.4 | 20962.1 | 40675.2 | 135690.2 | 411124.6
gemma4:26b | 1784.3 | 1912.4 | 2002.9 | 2857.8 | 5111.5 | 8601.7 | 16929.3 | 35758.9
gemma4:e4b | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a
nemotron-cascade-2:30b | 978.2 | 1539.3 | 1097.1 | 1798.6 | 2928.9 | 5282.0 | 7328.9 | 13637.7
qwen3.6:27b | 3537.2 | 5509.0 | 5746.9 | 11172.7 | 20261.7 | 36068.3 | 51735.6 | 94678.5
qwen3.6:35b | 1343.6 | 2194.7 | 1585.0 | 3080.7 | 5332.7 | 9675.2 | 13807.2 | 26637.5

C) Re-run with context re-use in benchmark flow (benchmark changed to emulate continuing sessions with context re-use; same model family scope)

Model | 256 | 420 | 689 | 1131 | 1855 | 3043 | 4993 | 8192
gemma4-31b | 4578.4 | 6658.9 | 7037.5 | 11862.8 | 20464.6 | 41579.3 | 124721.5 | 561515.7
gemma4:26b | 1495.1 | 2438.6 | 1826.3 | 3453.3 | 4894.9 | 9459.4 | 16942.0 | 38030.7
gemma4:e4b | 7550.6 | 19217.4 | 12559.8 | 32229.7 | 34552.3 | 91452.3 | 120309.8 | 370979.0
nemotron-cascade-2:30b | 1043.1 | 1757.6 | 1249.8 | 2713.2 | 3504.5 | 8272.6 | 7918.5 | 15585.7
qwen3.6:27b | 3816.0 | 5900.3 | 6129.2 | 11886.5 | 20982.5 | 37356.6 | 53244.9 | 98825.9
qwen3.6:35b | 1856.4 | 1907.7 | 2123.4 | 3325.6 | 6113.1 | 10343.4 | 15384.7 | 26747.8

So after that change, TTFT still stayed roughly the same (but now we know we need some sort of cache-hit / node-reuse strategy in the future).

That must be it for now; the article is way too long already. I'm starting to actively use my own hosted models for different things and will report back further in the future.

Cheers, Tim

