server-config

Archived

Author	SHA1	Message	Date
Simon Gardling	274ef40ccc	lanzaboote: pin to fork with pcrlock reinstall fix [checks failed: Build and Deploy / deploy (push), failing after 3h15m29s] Upstream PR: https://github.com/nix-community/lanzaboote/pull/566	2026-04-06 16:08:57 -04:00
Simon Gardling	a76a7969d9	nix-cache [checks failed: Build and Deploy / deploy (push), failing after 1h17m39s]	2026-04-06 14:21:31 -04:00
Simon Gardling	4be2eaed35	Reapply "update" [checks failed: Build and Deploy / deploy (push), failing after 10m49s] This reverts commit `655bbda26f`.	2026-04-06 13:40:52 -04:00
Simon Gardling	655bbda26f	Revert "update" [checks passed: Build and Deploy / deploy (push), successful in 1m19s] This reverts commit `960259b0d0`.	2026-04-06 13:39:32 -04:00
Simon Gardling	3b8aedd502	fix hardened kernel with nix sandbox	2026-04-06 13:36:38 -04:00
Simon Gardling	960259b0d0	update [checks failed: Build and Deploy / deploy (push), failing after 2m14s]	2026-04-06 13:12:50 -04:00
Simon Gardling	5fa6f37b28	llama-cpp: disable	2026-04-06 13:12:06 -04:00
Simon Gardling	7afd1f35d2	xmrig-auto-pause: fix	2026-04-06 13:11:54 -04:00
Simon Gardling	a12dcb01ec	llama-cpp: remove folder	2026-04-06 12:48:28 -04:00
Simon Gardling	6d47f02a0f	llama-cpp: set batch size to 4096 [checks passed: Build and Deploy / deploy (push), successful in 1m22s]	2026-04-06 02:29:37 -04:00
Simon Gardling	9addb1569a	Revert "llama-cpp: maybe use vulkan?" This reverts commit `0a927ea893`.	2026-04-06 02:28:26 -04:00
Simon Gardling	df04e36b41	llama-cpp: fix vulkan cache [checks failed: Build and Deploy / deploy (push), failing after 1m23s]	2026-04-06 02:23:29 -04:00
Simon Gardling	0a927ea893	llama-cpp: maybe use vulkan? [checks passed: Build and Deploy / deploy (push), successful in 8m30s]	2026-04-06 02:12:46 -04:00
Simon Gardling	3e46c5bfa5	llama-cpp: use turbo3 for everything [checks passed: Build and Deploy / deploy (push), successful in 1m20s]	2026-04-06 01:53:11 -04:00
Simon Gardling	06aee5af77	llama-cpp: gemma 4 E4B -> gemma 4 E2B [checks passed: Build and Deploy / deploy (push), successful in 2m5s]	2026-04-06 01:24:25 -04:00
Simon Gardling	8fddd3a954	llama-cpp: context: 32768 -> 65536 [checks passed: Build and Deploy / deploy (push), successful in 2m58s]	2026-04-06 01:04:23 -04:00
Simon Gardling	0e4f0d3176	llama-cpp: fix model name [checks passed: Build and Deploy / deploy (push), successful in 1m18s]	2026-04-06 00:59:20 -04:00
Simon Gardling	bbcd662c28	xmrig-auto-pause: fix stuck state after external restart, add startup cooldown [checks passed: Build and Deploy / deploy (push), successful in 8m47s] Two bugs found during live verification on the server: (1) Stuck state after external restart: if something else restarted xmrig (e.g. deploy-rs activation) while paused_by_us=True, the script never detected this and became permanently stuck, unable to stop xmrig on future load because it thought xmrig was already stopped. Fix: when paused_by_us=True and busy, check if xmrig is actually running. If so, reset paused_by_us=False and re-stop it. (2) Flapping on xmrig restart: RandomX dataset init takes ~3.7s of intense non-nice CPU, which the script detected as real workload and immediately re-stopped xmrig after every restart, creating a start-stop loop. Fix: add STARTUP_COOLDOWN (default 10s); after starting xmrig, skip CPU checks until the cooldown expires. Both bugs were present in production: the script had been stuck since Apr 3 (2+ days) with xmrig running unmanaged alongside llama-server.	2026-04-05 23:20:47 -04:00
Simon Gardling	324a9123db	better organize related monero and matrix services [checks passed: Build and Deploy / deploy (push), successful in 2m48s]	2026-04-04 14:32:26 -04:00
Simon Gardling	8ea96c8b8e	llama-cpp: fix model hash [checks passed: Build and Deploy / deploy (push), successful in 2m36s]	2026-04-04 00:28:07 -04:00
Simon Gardling	3f62b9c88e	grafana: replace custom metric collectors with community exporters. Replace three custom Prometheus textfile collector scripts with dedicated community-maintained exporters: (1) jellyfin-collector.nix (25 LoC shell) -> rebelcore/jellyfin_exporter; metric: jellyfin_active_streams -> count(jellyfin_now_playing_state); bonus: per-session labels (user, title, device, codec info). (2) qbittorrent-collector.nix (40 LoC shell) -> anriha/qbittorrent-metrics-exporter; metric: qbittorrent_{download,upload}_bytes_per_second -> qbit_{dl,up}speed; bonus: per-torrent metrics with category/tag aggregation. (3) intel-gpu-collector.nix + .py (130 LoC Python) -> mike1808/igpu-exporter; metric: intel_gpu_engine_busy_percent -> igpu_engines_busy_percent; bonus: persistent daemon vs oneshot timer, no streaming JSON parser. All three run as persistent daemons scraped by Prometheus, replacing the textfile-collector pattern of systemd timers writing .prom files. Dashboard PromQL queries updated to match new metric names.	2026-04-03 15:38:13 -04:00
Simon Gardling	479ec43b8f	llama-cpp: integrate native prometheus /metrics endpoint. llama.cpp server has a built-in /metrics endpoint exposing prompt_tokens_seconds, predicted_tokens_seconds, tokens_predicted_total, n_decode_total, and n_busy_slots_per_decode. Enable it with --metrics and add a Prometheus scrape target, replacing the need for any external metric collection for LLM inference monitoring.	2026-04-03 15:19:11 -04:00
Simon Gardling	37ac88fc0f	lib: replace deprecated overrideDerivation with overrideAttrs. overrideDerivation has been deprecated since 2019. The new overrideAttrs properly handles the env attribute set used by modern derivations to avoid the NIX_CFLAGS_COMPILE overlap error between env and top-level derivation arguments.	2026-04-03 15:18:22 -04:00
Simon Gardling	47aeb58f7a	llama-cpp: do logging [checks passed: Build and Deploy / deploy (push), successful in 2m27s]	2026-04-03 14:39:46 -04:00
Simon Gardling	daf82c16ba	fix xmrig pause	2026-04-03 14:39:20 -04:00
Simon Gardling	d4d01d63f1	llama-cpp: update + re-enable + gemma 4 E4B [checks failed: Build and Deploy / deploy (push), failing after 20m16s]	2026-04-03 14:06:35 -04:00
Simon Gardling	e765a98487	recyclarr: reset back to default basically [checks passed: Build and Deploy / deploy (push), successful in 2m15s]	2026-04-03 13:45:26 -04:00
Simon Gardling	124d33963e	organize [checks passed: Build and Deploy / deploy (push), successful in 2m43s]	2026-04-03 00:47:12 -04:00
Simon Gardling	1451f902ad	grafana: re-organize	2026-04-03 00:39:42 -04:00
Simon Gardling	8e6619097d	update [checks failed: Build and Deploy / deploy (push), failing after 4m59s]	2026-04-03 00:20:13 -04:00
Simon Gardling	c2ff07b329	llama-cpp: disable	2026-04-03 00:17:38 -04:00
Simon Gardling	9e235abf48	monitoring: fix disk-usage-collector timer calendar spec [checks passed: Build and Deploy / deploy (push), successful in 2m14s]	2026-04-03 00:17:21 -04:00
Simon Gardling	096ffeb943	llama-cpp: xmrig + grafana hooks	2026-04-03 00:17:17 -04:00
Simon Gardling	ab9c12cb97	llama-cpp: general changes	2026-04-03 00:17:14 -04:00
Simon Gardling	0aeb6c5523	llama-cpp: add API key auth via --api-key-file [checks failed: Build and Deploy / deploy (push), failing after 2m49s] Generate and encrypt a Bearer token for llama-cpp's built-in auth. Remove caddy_auth from the vhost since basic auth blocks Bearer-only clients. Internal sidecars (xmrig-pause, annotations) connect directly to localhost and are unaffected (/slots is public).	2026-04-02 18:02:23 -04:00
Simon Gardling	bfe7a65db2	monitoring: add zpool and boot partition usage metrics. Add textfile collector for ZFS pool utilization (tank, hdds) and boot drive partitions (/boot, /persistent, /nix). Runs every 60s. Add two Grafana dashboard panels: ZFS Pool Utilization and Boot Drive Partitions as Row 5.	2026-04-02 18:02:23 -04:00
Simon Gardling	e41f869843	trilium: add self-hosted note-taking service. Add trilium-server on port 8787 behind Caddy reverse proxy at notes.sigkill.computer. Data stored on ZFS tank pool with serviceMountWithZpool for mount ordering.	2026-04-02 17:44:04 -04:00
Simon Gardling	9baeaa5c23	llama-cpp: add grafana annotations for inference requests. Poll /slots endpoint, create annotations when slots start processing, close with token count when complete. Includes NixOS VM test with mock llama-cpp and grafana servers. Dashboard annotation entry added.	2026-04-02 17:43:49 -04:00
Simon Gardling	0235617627	monitoring: fix intel-gpu-collector crash resilience. Wrap entire read_one_sample() in try/except to handle all failures (missing binary, permission errors, malformed JSON, timeouts). Write zero-valued metrics on failure instead of exiting non-zero. Increase timeout from 5s to 8s for slower GPU initialization.	2026-04-02 17:43:13 -04:00
Simon Gardling	df15be01ea	llama-cpp: pause xmrig during active inference requests. Add sidecar service that polls llama-cpp /slots endpoint every 3s. When any slot is processing, stops xmrig. Restarts xmrig after 10s grace period when all slots are idle. Handles unreachable llama-cpp gracefully (leaves xmrig untouched).	2026-04-02 17:43:07 -04:00
Simon Gardling	50453cf0b5	llama-cpp: adjust args [checks passed: Build and Deploy / deploy (push), successful in 2m32s]	2026-04-02 16:09:17 -04:00
Simon Gardling	bb6ea2f1d5	llama-cpp: cpu only [checks passed: Build and Deploy / deploy (push), successful in 20m0s]	2026-04-02 15:32:39 -04:00
Simon Gardling	f342521d46	llama-cpp: re-add w/ turboquant [checks passed: Build and Deploy / deploy (push), successful in 28m52s]	2026-04-02 13:42:39 -04:00
Simon Gardling	7e779ca0f7	power optimizations	2026-04-02 13:13:38 -04:00
Simon Gardling	06b2016bd6	recyclarr: things	2026-04-01 20:37:18 -04:00
Simon Gardling	f9694ae033	qbt: fix categories [checks passed: Build and Deploy / deploy (push), successful in 2m24s]	2026-04-01 15:25:40 -04:00
Simon Gardling	f775f22dbf	recyclarr: restart service after config change	2026-04-01 15:25:31 -04:00
Simon Gardling	1bb0844649	update [checks failed: Build and Deploy / deploy (push), failing after 6m22s]	2026-04-01 13:12:14 -04:00
Simon Gardling	297264a34a	tests: extract shared jellyfin test helpers and use real jellyfin in annotations test [checks failed: Build and Deploy / deploy (push), failing after 2m35s]	2026-04-01 11:24:44 -04:00
Simon Gardling	a5206b9ec6	monitoring: add grafana annotations for zfs scrub events	2026-04-01 11:24:43 -04:00

832 Commits