BROKE Intel Arc A380 completely because it was forced into L1.1/L1.2
PCIe substates. Forcewaking the device would fail and it would never come up.
So I will be more conservative with power-saving tuning.
xmrig's RandomX pollutes the L3 cache, making other processes appear
~3-8% busier. With a single 5% threshold for both stopping and
resuming, the script oscillates: start xmrig -> cache pressure
inflates CPU -> stop xmrig -> CPU drops -> restart -> repeat.
Split into CPU_STOP_THRESHOLD (15%) and CPU_RESUME_THRESHOLD (5%).
The stop threshold sits above xmrig's indirect pressure, so only
genuine workloads trigger a pause. The resume threshold confirms the
system is truly idle before restarting.
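The two-threshold hysteresis can be sketched as follows. The function name and the busy/mining inputs are illustrative, not the real script's interface:

```python
# Sketch of the split thresholds; only the 15%/5% values come from the
# commit, the rest is a hypothetical decision helper.
CPU_STOP_THRESHOLD = 15.0   # pause mining only above this; sits above
                            # xmrig's ~3-8% indirect cache pressure
CPU_RESUME_THRESHOLD = 5.0  # resume only once the system is truly idle

def next_mining_state(cpu_busy: float, mining: bool) -> bool:
    """Return whether xmrig should be running after this check."""
    if mining and cpu_busy > CPU_STOP_THRESHOLD:
        return False          # genuine workload: stop xmrig
    if not mining and cpu_busy < CPU_RESUME_THRESHOLD:
        return True           # truly idle: restart xmrig
    return mining             # inside the hysteresis band: no change
```

With a single 5% threshold, the 3-8% of cache-pressure-inflated load lands above the stop line and below nothing, so the state flips every cycle; the band between 5% and 15% absorbs it.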
Two bugs found during live verification on the server:
1. Stuck state after external restart: if something else restarted xmrig
(e.g. deploy-rs activation) while paused_by_us=True, the script never
detected this and became permanently stuck — unable to stop xmrig on
future load because it thought xmrig was already stopped.
Fix: when paused_by_us=True and busy, check if xmrig is actually
running. If so, reset paused_by_us=False and re-stop it.
2. Flapping on xmrig restart: RandomX dataset init takes ~3.7s of intense
non-nice CPU, which the script detected as real workload and immediately
re-stopped xmrig after every restart, creating a start-stop loop.
Fix: add STARTUP_COOLDOWN (default 10s) — after starting xmrig, skip
CPU checks until the cooldown expires.
Both bugs were present in production: the script had been stuck since
Apr 3 (2+ days) with xmrig running unmanaged alongside llama-server.
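Both fixes can be sketched in one small controller. The `is_running`/`start`/`stop` hooks and the class itself are hypothetical stand-ins for the real script's systemd calls; only `paused_by_us`, the external-restart check, and the 10s cooldown come from the commit:

```python
import time

class MinerController:
    """Illustrative sketch of the two fixes, not the real script."""
    STARTUP_COOLDOWN = 10.0  # seconds to skip CPU checks after a start

    def __init__(self, is_running, start, stop, now=time.monotonic):
        self.is_running, self.start, self.stop = is_running, start, stop
        self.now = now               # injectable clock, eases testing
        self.paused_by_us = False
        self.started_at = None

    def tick(self, busy: bool) -> None:
        # Fix 2: ignore the ~3.7s CPU spike from RandomX dataset init
        if (self.started_at is not None
                and self.now() - self.started_at < self.STARTUP_COOLDOWN):
            return
        if busy:
            # Fix 1: something external (e.g. deploy-rs activation) may
            # have restarted xmrig while we thought it was paused.
            if self.paused_by_us and self.is_running():
                self.paused_by_us = False  # fall through and re-stop it
            if not self.paused_by_us:
                self.stop()
                self.paused_by_us = True
        elif self.paused_by_us:
            self.start()
            self.paused_by_us = False
            self.started_at = self.now()  # arm the cooldown
```

Without the `is_running()` check, an external restart leaves `paused_by_us=True` forever, which is exactly the stuck state seen in production.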
llama.cpp server has a built-in /metrics endpoint exposing
prompt_tokens_seconds, predicted_tokens_seconds, tokens_predicted_total,
n_decode_total, and n_busy_slots_per_decode. Enable it with --metrics
and add a Prometheus scrape target; no external metric collection is
needed for LLM inference monitoring.
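A minimal scrape target might look like this; the job name and the host:port (llama-server's default listen address) are assumptions, not values from this setup:

```yaml
# prometheus.yml fragment -- hypothetical target for llama-server --metrics
scrape_configs:
  - job_name: llama-server
    static_configs:
      - targets: ['127.0.0.1:8080']   # adjust to the server's --host/--port
    # llama.cpp serves Prometheus text format at the default /metrics path
```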
overrideDerivation has been deprecated since 2019. overrideAttrs
properly handles the env attribute set used by modern derivations,
avoiding the NIX_CFLAGS_COMPILE overlap error that occurs when the
same variable appears both in env and as a top-level derivation
argument.
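A sketch of the overrideAttrs form, assuming a package whose derivation already sets NIX_CFLAGS_COMPILE inside env; `pkg` and the appended flag are illustrative:

```nix
# Merge into env rather than setting a top-level NIX_CFLAGS_COMPILE,
# which would collide with the copy already living in env.
pkg.overrideAttrs (old: {
  env = (old.env or { }) // {
    NIX_CFLAGS_COMPILE = (old.env.NIX_CFLAGS_COMPILE or "") + " -O2";
  };
})
```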