Build and Deploy / deploy (push) Successful in 2m42s
Grafana 12 expects Prometheus annotation queries to be wrapped in a 'target'
object with datasource, expr, refId, and range fields. The previous
format put expr and step at the top level, where Grafana silently ignored them.
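A rough sketch of the shape described above, with the query nested under 'target'. Only the field names target, datasource, expr, refId, and range come from this message. The helper name, the datasource uid, the example expression, and the remaining keys are assumptions for illustration and should be checked against a dashboard exported from Grafana itself.

```python
import json


def prometheus_annotation(name, expr, datasource_uid, ref_id="Anno"):
    """Build an annotation entry with the query nested under 'target'.

    Hypothetical helper: only the 'target' fields (datasource, expr,
    refId, range) are taken from the commit message; the other keys
    are assumed for illustration.
    """
    datasource = {"type": "prometheus", "uid": datasource_uid}
    return {
        "name": name,
        "enable": True,
        "datasource": datasource,
        "target": {
            "datasource": datasource,
            "expr": expr,
            "refId": ref_id,
            "range": True,
        },
    }


annotation = prometheus_annotation(
    name="Deploys",  # made-up annotation name
    expr="changes(process_start_time_seconds[5m]) > 0",  # made-up query
    datasource_uid="prometheus",  # made-up datasource uid
)
print(json.dumps(annotation, indent=2))
```

Note that expr no longer appears at the top level of the annotation, which is the placement the old format used.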
Build and Deploy / deploy (push) Failing after 31m3s
Build Caddy with the caddy-dns/njalla plugin to enable DNS-01 ACME
challenges. This issues a single wildcard certificate for
*.sigkill.computer instead of per-subdomain certificates, reducing
Let's Encrypt API calls and certificate management overhead.
Add the ddns-updater service (nixpkgs services.ddns-updater), configured
with the Njalla provider, to automatically update DNS records when the
server's public IP changes.
The custom disk-usage-collector shell script and its minutely timer are
replaced by prometheus-zfs-exporter (pdf/zfs_exporter, packaged in nixpkgs as
services.prometheus.exporters.zfs). The exporter provides pool capacity
metrics (allocated/free/size) natively.
Partition metrics (/boot, /persistent, /nix) now come from node_exporter's
built-in filesystem collector (node_filesystem_*_bytes), which is already
running and already collects these metrics.
Also fixes a latent race condition in serviceMountWithZpool: the -mounts
service now orders after zfs-mount.service (which runs 'zfs mount -a'),
not just after pool import. Without this, the mount check could run
before datasets are actually mounted.
Build and Deploy / deploy (push) Successful in 1m42s
Broke the Intel Arc A380 completely because the card was forced into the
L1.1/L1.2 PCIe substates. Forcewaking the device would fail and it would
never come up. So I will be more conservative with power-saving tuning.
xmrig's RandomX pollutes the L3 cache, making other processes appear
~3-8% busier. With a single 5% threshold for both stopping and
resuming, the script oscillates: start xmrig -> cache pressure
inflates measured CPU usage -> stop xmrig -> CPU usage drops -> restart -> repeat.
Split into CPU_STOP_THRESHOLD (15%) and CPU_RESUME_THRESHOLD (5%).
The stop threshold sits above xmrig's indirect pressure, so only
genuine workloads trigger a pause. The resume threshold confirms the
system is truly idle before restarting.
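A minimal sketch of the two-threshold (hysteresis) decision, assuming the script samples the CPU usage of everything except xmrig as a percentage. The function name next_action and the strict comparisons are assumptions. Only the two constants and their values come from this message.

```python
CPU_STOP_THRESHOLD = 15.0    # percent; sits above xmrig's indirect cache pressure
CPU_RESUME_THRESHOLD = 5.0   # percent; system must be this idle before resuming


def next_action(cpu_percent, xmrig_running):
    """Return 'stop', 'start', or 'keep' for one CPU usage sample.

    Hypothetical helper illustrating the threshold split.
    """
    if xmrig_running and cpu_percent > CPU_STOP_THRESHOLD:
        return "stop"
    if not xmrig_running and cpu_percent < CPU_RESUME_THRESHOLD:
        return "start"
    return "keep"


# Made-up sample sequence: readings of 7-8% (inflated by cache pressure)
# fall between the two thresholds and leave the current state unchanged.
running = True
actions = []
for cpu in [2.0, 8.0, 7.0, 20.0, 9.0, 4.0]:
    action = next_action(cpu, running)
    if action == "stop":
        running = False
    elif action == "start":
        running = True
    actions.append(action)
    print(f"cpu={cpu:5.1f}%  action={action:5s}  running={running}")
```

Readings between 5% and 15% change nothing in either state, which is what breaks the start/stop loop.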
Build and Deploy / deploy (push) Successful in 8m47s
Two bugs found during live verification on the server:
1. Stuck state after external restart: if something else restarted xmrig
   (e.g. deploy-rs activation) while paused_by_us=True, the script never
   detected this and became permanently stuck, unable to stop xmrig on
   future load because it thought xmrig was already stopped.
   Fix: when paused_by_us=True and the system is busy, check whether xmrig
   is actually running. If so, reset paused_by_us=False and re-stop it.
2. Flapping on xmrig restart: RandomX dataset init takes ~3.7s of intense
   non-nice CPU, which the script detected as real workload, so it
   re-stopped xmrig immediately after every restart, creating a start-stop loop.
   Fix: add STARTUP_COOLDOWN (default 10s). After starting xmrig, skip
   CPU checks until the cooldown expires.
Both bugs were present in production: the script had been stuck since
Apr 3 (2+ days) with xmrig running unmanaged alongside llama-server.
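The two fixes can be sketched as a small state machine. This is an illustration under assumptions, not the actual script: the class XmrigPauser, its callbacks, and the injected clock are hypothetical, and only paused_by_us, STARTUP_COOLDOWN, and the two thresholds come from the commit messages.

```python
import time

CPU_STOP_THRESHOLD = 15.0    # percent
CPU_RESUME_THRESHOLD = 5.0   # percent
STARTUP_COOLDOWN = 10.0      # seconds to ignore CPU readings after a start


class XmrigPauser:
    """Hypothetical controller; is_running/stop/start are injected callbacks."""

    def __init__(self, is_running, stop, start, clock=time.monotonic):
        self.is_running = is_running
        self.stop = stop
        self.start = start
        self.clock = clock
        self.paused_by_us = False
        self.started_at = None

    def tick(self, cpu_percent):
        now = self.clock()
        # Fix 2: skip CPU checks while RandomX dataset init is likely running.
        if self.started_at is not None and now - self.started_at < STARTUP_COOLDOWN:
            return "cooldown"

        busy = cpu_percent > CPU_STOP_THRESHOLD
        idle = cpu_percent < CPU_RESUME_THRESHOLD

        # Fix 1: we believe xmrig is paused, but something else restarted it.
        if self.paused_by_us and busy and self.is_running():
            self.paused_by_us = False

        if not self.paused_by_us and busy and self.is_running():
            self.stop()
            self.paused_by_us = True
            return "stop"
        if self.paused_by_us and idle:
            self.start()
            self.paused_by_us = False
            self.started_at = now
            return "start"
        return "keep"


class FakeService:
    """Stand-in for the real xmrig service, for demonstration only."""

    def __init__(self):
        self.running = True

    def is_running(self):
        return self.running

    def stop(self):
        self.running = False

    def start(self):
        self.running = True


svc = FakeService()
fake_now = [0.0]
pauser = XmrigPauser(svc.is_running, svc.stop, svc.start, clock=lambda: fake_now[0])

results = [pauser.tick(50.0)]          # real load: stop xmrig
svc.running = True                     # simulate an external restart
fake_now[0] = 5.0
results.append(pauser.tick(50.0))      # fix 1: detected and re-stopped
fake_now[0] = 10.0
results.append(pauser.tick(1.0))       # idle: start xmrig
fake_now[0] = 13.0
results.append(pauser.tick(90.0))      # fix 2: init spike ignored
fake_now[0] = 25.0
results.append(pauser.tick(90.0))      # cooldown over, real load: stop
print(results)
```

Injecting the clock and the service callbacks keeps the decision logic testable without touching a real service.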
The llama.cpp server has a built-in /metrics endpoint exposing
prompt_tokens_seconds, predicted_tokens_seconds, tokens_predicted_total,
n_decode_total, and n_busy_slots_per_decode. Enable it with --metrics
and add a Prometheus scrape target, removing the need for any external
metric collection for LLM inference monitoring.
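For a quick check of what the endpoint returns, the Prometheus text format can be parsed in a few lines. This is a sketch under assumptions: the sample payload and its values are made up, the llamacpp: name prefix is how I understand the server to namespace its metrics and should be verified against real output, and the parser only handles simple samples whose labels contain no spaces.

```python
def parse_metrics(text):
    """Parse simple samples from Prometheus text exposition format.

    Hypothetical helper: skips comment lines and reads 'name value' pairs.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples


# Made-up payload for illustration; real values come from GET /metrics.
SAMPLE = """\
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 412.5
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 38.2
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 10240
"""

metrics = parse_metrics(SAMPLE)
for name, value in sorted(metrics.items()):
    print(f"{name} = {value}")
```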