server-config

Archived

Author	SHA1	Message	Date
Simon Gardling	12469de580	llama.cpp: things	2026-04-11 10:27:38 -04:00
Simon Gardling	dad3867144	grafana: fix llama-cpp annotation query format for Grafana 12 [CI: Build and Deploy / deploy (push), successful in 2m42s] Grafana 12 expects Prometheus annotation queries wrapped in a 'target' object with datasource, expr, refId, and range fields. The previous format had expr/step as top-level fields which Grafana silently ignored.	2026-04-09 22:19:21 -04:00
Simon Gardling	7ee55eca6b	typo: systemd.service -> systemd.services [CI: Build and Deploy / deploy (push), failing after 15m58s]	2026-04-09 20:48:06 -04:00
Simon Gardling	100999734b	ddns-updater: disable DynamicUser to fix secret perms [CI: Build and Deploy / deploy (push), failing after 10s]	2026-04-09 20:47:04 -04:00
Simon Gardling	ce1c335230	caddy: wildcard TLS via DNS-01 challenge + ddns-updater for Njalla [CI: Build and Deploy / deploy (push), failing after 31m3s] Build Caddy with the caddy-dns/njalla plugin to enable DNS-01 ACME challenges. This issues a single wildcard certificate for *.sigkill.computer instead of per-subdomain certificates, reducing Let's Encrypt API calls and certificate management overhead. Add ddns-updater service (nixpkgs services.ddns-updater) configured with Njalla provider to automatically update DNS records when the server's public IP changes.	2026-04-09 19:54:57 -04:00
Simon Gardling	e9ce1ce0a2	grafana: replace llama-cpp-annotations daemon with prometheus query	2026-04-09 19:54:57 -04:00
Simon Gardling	a3a6700106	grafana: replace disk-usage-collector with prometheus-zfs-exporter The custom disk-usage-collector shell script + minutely timer is replaced by prometheus-zfs-exporter (pdf/zfs_exporter, packaged in nixpkgs as services.prometheus.exporters.zfs). The exporter provides pool capacity metrics (allocated/free/size) natively. Partition metrics (/boot, /persistent, /nix) now use node_exporter's built-in filesystem collector (node_filesystem_*_bytes) which already runs and collects these metrics. Also fixes a latent race condition in serviceMountWithZpool: the -mounts service now orders after zfs-mount.service (which runs 'zfs mount -a'), not just after pool import. Without this, the mount check could run before datasets are actually mounted.	2026-04-09 19:54:57 -04:00
Simon Gardling	75319256f3	lib: add mkCaddyReverseProxy, mkFail2banJail, mkGrafanaAnnotationService, extractArrApiKey	2026-04-09 19:54:57 -04:00
Simon Gardling	c74d356595	xmrig: compile with compiler optimizations [CI: Build and Deploy / deploy (push), successful in 2m45s]	2026-04-09 16:25:30 -04:00
Simon Gardling	ae03c2f288	p2pool: don't disable on power loss p2pool is very light on resources, it's xmrig that should be disabled	2026-04-09 14:44:13 -04:00
Simon Gardling	0d87f90657	gitea: make gitea-runner wait for gitea.service [CI: Build and Deploy / deploy (push), failing after 4m18s] prevents spam on ntfy	2026-04-09 14:16:05 -04:00
Simon Gardling	d1e9c92423	update [CI: Build and Deploy / deploy (push), failing after 4s]	2026-04-09 14:03:34 -04:00
Simon Gardling	4f33b16411	llama.cpp: thing	2026-04-09 14:02:53 -04:00
Simon Gardling	4f41789995	Reapply "llama-cpp: enable" [CI: Build and Deploy / deploy (push), successful in 6m43s] This reverts commit `645a532ed7`.	2026-04-07 22:49:53 -04:00
Simon Gardling	c0390af1a4	llama-cpp: update [CI: Build and Deploy / deploy (push), successful in 2m33s]	2026-04-07 22:29:02 -04:00
Simon Gardling	98310f2582	organize patches + add gemma4 patch [CI: Build and Deploy / deploy (push), successful in 2m41s]	2026-04-07 20:57:54 -04:00
Simon Gardling	645a532ed7	Revert "llama-cpp: enable" [CI: Build and Deploy / deploy (push), successful in 1m52s] This reverts commit `fdc1596bce`.	2026-04-07 20:23:48 -04:00
Simon Gardling	2884a39eb1	llama-cpp: patch for vulkan support instead [CI: Build and Deploy / deploy (push), successful in 7m23s]	2026-04-07 20:07:02 -04:00
Simon Gardling	fdc1596bce	llama-cpp: enable [CI: Build and Deploy / deploy (push), successful in 7m16s]	2026-04-07 19:15:56 -04:00
Simon Gardling	778b04a80f	Reapply "llama-cpp: maybe use vulkan?" [CI: Build and Deploy / deploy (push), successful in 2m17s] This reverts commit `9addb1569a`.	2026-04-07 19:12:57 -04:00
Simon Gardling	88fc219f2d	update	2026-04-07 19:11:50 -04:00
Simon Gardling	a5c7c91e38	Power: disable a bunch of things [CI: Build and Deploy / deploy (push), successful in 1m42s] BROKE Intel Arc A380 completely because it was forced into L1.1/L1.2 PCIe substates. Forcewaking the device would fail and it would never come up. So I will be more conservative on power saving tuning.	2026-04-07 19:08:08 -04:00
Simon Gardling	628c16fe64	fix git-crypt key for dotfiles workflow [CI: Build and Deploy / deploy (push), successful in 2m32s]	2026-04-07 13:51:19 -04:00
Simon Gardling	0df5d98770	grafana: use postgresql [CI: Build and Deploy / deploy (push), successful in 2m45s] Grafana doesn't use it for data, only for annotations and other stuff	2026-04-07 12:44:59 -04:00
Simon Gardling	2848c7e897	grafana: keep data forever	2026-04-07 12:44:46 -04:00
Simon Gardling	e57c9cb83b	xmrig-auto-pause: raise thresholds for server background load [CI: Build and Deploy / deploy (push), successful in 1m59s]	2026-04-07 01:09:16 -04:00
Simon Gardling	d48f27701f	xmrig-auto-pause: add hysteresis to prevent stop/start thrashing xmrig's RandomX pollutes the L3 cache, making other processes appear ~3-8% busier. With a single 5% threshold for both stopping and resuming, the script oscillates: start xmrig -> cache pressure inflates CPU -> stop xmrig -> CPU drops -> restart -> repeat. Split into CPU_STOP_THRESHOLD (15%) and CPU_RESUME_THRESHOLD (5%). The stop threshold sits above xmrig's indirect pressure, so only genuine workloads trigger a pause. The resume threshold confirms the system is truly idle before restarting.	2026-04-07 01:09:06 -04:00
Simon Gardling	738861fd53	lanzaboote: fix was upstreamed	2026-04-06 19:21:20 -04:00
Simon Gardling	274ef40ccc	lanzaboote: pin to fork with pcrlock reinstall fix [CI: Build and Deploy / deploy (push), failing after 3h15m29s] Upstream PR: https://github.com/nix-community/lanzaboote/pull/566	2026-04-06 16:08:57 -04:00
Simon Gardling	a76a7969d9	nix-cache [CI: Build and Deploy / deploy (push), failing after 1h17m39s]	2026-04-06 14:21:31 -04:00
Simon Gardling	4be2eaed35	Reapply "update" [CI: Build and Deploy / deploy (push), failing after 10m49s] This reverts commit `655bbda26f`.	2026-04-06 13:40:52 -04:00
Simon Gardling	655bbda26f	Revert "update" [CI: Build and Deploy / deploy (push), successful in 1m19s] This reverts commit `960259b0d0`.	2026-04-06 13:39:32 -04:00
Simon Gardling	3b8aedd502	fix hardened kernel with nix sandbox	2026-04-06 13:36:38 -04:00
Simon Gardling	960259b0d0	update [CI: Build and Deploy / deploy (push), failing after 2m14s]	2026-04-06 13:12:50 -04:00
Simon Gardling	5fa6f37b28	llama-cpp: disable	2026-04-06 13:12:06 -04:00
Simon Gardling	7afd1f35d2	xmrig-auto-pause: fix	2026-04-06 13:11:54 -04:00
Simon Gardling	a12dcb01ec	llama-cpp: remove folder	2026-04-06 12:48:28 -04:00
Simon Gardling	6d47f02a0f	llama-cpp: set batch size to 4096 [CI: Build and Deploy / deploy (push), successful in 1m22s]	2026-04-06 02:29:37 -04:00
Simon Gardling	9addb1569a	Revert "llama-cpp: maybe use vulkan?" This reverts commit `0a927ea893`.	2026-04-06 02:28:26 -04:00
Simon Gardling	df04e36b41	llama-cpp: fix vulkan cache [CI: Build and Deploy / deploy (push), failing after 1m23s]	2026-04-06 02:23:29 -04:00
Simon Gardling	0a927ea893	llama-cpp: maybe use vulkan? [CI: Build and Deploy / deploy (push), successful in 8m30s]	2026-04-06 02:12:46 -04:00
Simon Gardling	3e46c5bfa5	llama-cpp: use turbo3 for everything [CI: Build and Deploy / deploy (push), successful in 1m20s]	2026-04-06 01:53:11 -04:00
Simon Gardling	06aee5af77	llama-cpp: gemma 4 E4B -> gemma 4 E2B [CI: Build and Deploy / deploy (push), successful in 2m5s]	2026-04-06 01:24:25 -04:00
Simon Gardling	8fddd3a954	llama-cpp: context: 32768 -> 65536 [CI: Build and Deploy / deploy (push), successful in 2m58s]	2026-04-06 01:04:23 -04:00
Simon Gardling	0e4f0d3176	llama-cpp: fix model name [CI: Build and Deploy / deploy (push), successful in 1m18s]	2026-04-06 00:59:20 -04:00
Simon Gardling	bbcd662c28	xmrig-auto-pause: fix stuck state after external restart, add startup cooldown [CI: Build and Deploy / deploy (push), successful in 8m47s] Two bugs found during live verification on the server: 1. Stuck state after external restart: if something else restarted xmrig (e.g. deploy-rs activation) while paused_by_us=True, the script never detected this and became permanently stuck, unable to stop xmrig on future load because it thought xmrig was already stopped. Fix: when paused_by_us=True and busy, check if xmrig is actually running. If so, reset paused_by_us=False and re-stop it. 2. Flapping on xmrig restart: RandomX dataset init takes ~3.7s of intense non-nice CPU, which the script detected as real workload and immediately re-stopped xmrig after every restart, creating a start-stop loop. Fix: add STARTUP_COOLDOWN (default 10s): after starting xmrig, skip CPU checks until the cooldown expires. Both bugs were present in production: the script had been stuck since Apr 3 (2+ days) with xmrig running unmanaged alongside llama-server.	2026-04-05 23:20:47 -04:00
Simon Gardling	324a9123db	better organize related monero and matrix services [CI: Build and Deploy / deploy (push), successful in 2m48s]	2026-04-04 14:32:26 -04:00
Simon Gardling	8ea96c8b8e	llama-cpp: fix model hash [CI: Build and Deploy / deploy (push), successful in 2m36s]	2026-04-04 00:28:07 -04:00
Simon Gardling	3f62b9c88e	grafana: replace custom metric collectors with community exporters Replace three custom Prometheus textfile collector scripts with dedicated community-maintained exporters: - jellyfin-collector.nix (25 LoC shell) -> rebelcore/jellyfin_exporter Metric: jellyfin_active_streams -> count(jellyfin_now_playing_state) Bonus: per-session labels (user, title, device, codec info) - qbittorrent-collector.nix (40 LoC shell) -> anriha/qbittorrent-metrics-exporter Metric: qbittorrent_{download,upload}_bytes_per_second -> qbit_{dl,up}speed Bonus: per-torrent metrics with category/tag aggregation - intel-gpu-collector.nix + .py (130 LoC Python) -> mike1808/igpu-exporter Metric: intel_gpu_engine_busy_percent -> igpu_engines_busy_percent Bonus: persistent daemon vs oneshot timer, no streaming JSON parser All three run as persistent daemons scraped by Prometheus, replacing the textfile-collector pattern of systemd timers writing .prom files. Dashboard PromQL queries updated to match new metric names.	2026-04-03 15:38:13 -04:00
Simon Gardling	479ec43b8f	llama-cpp: integrate native prometheus /metrics endpoint llama.cpp server has a built-in /metrics endpoint exposing prompt_tokens_seconds, predicted_tokens_seconds, tokens_predicted_total, n_decode_total, and n_busy_slots_per_decode. Enable it with --metrics and add a Prometheus scrape target, replacing the need for any external metric collection for LLM inference monitoring.	2026-04-03 15:19:11 -04:00
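
Note on the xmrig-auto-pause commits (d48f27701f, bbcd662c28): the hysteresis and startup-cooldown behaviour those messages describe can be sketched roughly as below. This is an illustrative reconstruction, not the repository's actual script. The threshold values, `STARTUP_COOLDOWN` and `paused_by_us` come from the commit messages; `PauseState`, `decide`, the `>=`/`<=` comparisons, and resetting the stale flag regardless of load are assumptions made for the example.

```python
from dataclasses import dataclass

CPU_STOP_THRESHOLD = 15.0    # percent, per commit d48f27701f
CPU_RESUME_THRESHOLD = 5.0   # percent, per commit d48f27701f
STARTUP_COOLDOWN = 10.0      # seconds, default named in commit bbcd662c28


@dataclass
class PauseState:
    """Hypothetical state holder; only paused_by_us is named in the commits."""
    paused_by_us: bool = False
    last_start: float = float("-inf")  # time of the last start we issued


def decide(state: PauseState, cpu_percent: float,
           xmrig_running: bool, now: float) -> str:
    """Return "stop", "start" or "none", updating state in place."""
    if state.paused_by_us and xmrig_running:
        # Something else (e.g. a deploy) restarted the miner: drop the
        # stale flag so the normal "running" logic applies again.
        state.paused_by_us = False

    if not state.paused_by_us:
        if not xmrig_running:
            return "none"  # stopped by someone else; not ours to resume
        if now - state.last_start < STARTUP_COOLDOWN:
            return "none"  # ignore the CPU spike from RandomX dataset init
        if cpu_percent >= CPU_STOP_THRESHOLD:
            state.paused_by_us = True
            return "stop"
        return "none"

    # We paused the miner and it is still stopped: resume only once the
    # system is clearly idle (the lower threshold gives the hysteresis band).
    if cpu_percent <= CPU_RESUME_THRESHOLD:
        state.paused_by_us = False
        state.last_start = now
        return "start"
    return "none"
```

A caller would poll non-nice CPU usage in a loop, pass it to `decide` together with the service state and a monotonic timestamp, and run the corresponding start or stop action. Between 5% and 15% the function never changes state, which is what prevents the stop/start oscillation described in d48f27701f.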


810 Commits