xmrig-auto-pause: fix stuck state after external restart, add startup cooldown

Two bugs found during live verification on the server:

1. Stuck state after external restart: if something else restarted xmrig (e.g. deploy-rs activation) while paused_by_us=True, the script never detected this and became permanently stuck, unable to stop xmrig on future load because it thought xmrig was already stopped. Fix: when paused_by_us=True and the system is busy, check whether xmrig is actually running. If so, reset paused_by_us=False and re-stop it.

2. Flapping on xmrig restart: RandomX dataset init takes ~3.7s of intense non-nice CPU, which the script detected as real workload, so it immediately re-stopped xmrig after every restart, creating a start-stop loop. Fix: add STARTUP_COOLDOWN (default 10s). After starting xmrig, the script takes no stop/start action until the cooldown expires.

Both bugs were present in production: the script had been stuck since Apr 3 (2+ days) with xmrig running unmanaged alongside llama-server.
2026-04-05 23:20:47 -04:00
parent 324a9123db
commit bbcd662c28
3 changed files with 46 additions and 5 deletions
--- a/services/monero/xmrig-auto-pause.nix
+++ b/services/monero/xmrig-auto-pause.nix
@@ -26,6 +26,7 @@ lib.mkIf config.services.xmrig.enable {
      POLL_INTERVAL = "3";
      GRACE_PERIOD = "15";
      CPU_THRESHOLD = "5";
+      STARTUP_COOLDOWN = "10";
    };
  };
 }
--- a/services/monero/xmrig-auto-pause.py
+++ b/services/monero/xmrig-auto-pause.py
@@ -31,6 +31,10 @@ GRACE_PERIOD = float(os.environ.get("GRACE_PERIOD", "15"))
 # a pause. On a 12-thread system, one fully loaded core ≈ 8.3% of total.
 # Default 5% catches anything using more than ~60% of a single core.
 CPU_THRESHOLD = float(os.environ.get("CPU_THRESHOLD", "5"))
+# After starting xmrig, ignore CPU spikes for this many seconds to let
+# RandomX dataset initialization complete (~4s on the target hardware)
+# without retriggering a stop.
+STARTUP_COOLDOWN = float(os.environ.get("STARTUP_COOLDOWN", "10"))


 def log(msg):
@@ -77,10 +81,14 @@ def systemctl(action, unit):
 def main():
    paused_by_us = False
    idle_since = None
+    started_at = None  # monotonic time when we last started xmrig
    prev_total = None
    prev_work = None

-    log(f"Starting: poll={POLL_INTERVAL}s grace={GRACE_PERIOD}s threshold={CPU_THRESHOLD}%")
+    log(
+        f"Starting: poll={POLL_INTERVAL}s grace={GRACE_PERIOD}s "
+        f"threshold={CPU_THRESHOLD}% cooldown={STARTUP_COOLDOWN}s"
+    )

    while True:
        total, work = read_cpu_ticks()
@@ -102,10 +110,24 @@ def main():
        prev_total = total
        prev_work = work

+        # Don't act during startup cooldown — RandomX dataset init causes
+        # a transient CPU spike that would immediately retrigger a stop.
+        if started_at is not None:
+            if time.monotonic() - started_at < STARTUP_COOLDOWN:
+                time.sleep(POLL_INTERVAL)
+                continue
+            started_at = None
+
        busy = real_work_pct > CPU_THRESHOLD

        if busy:
            idle_since = None
+            if paused_by_us and is_active("xmrig.service"):
+                # Something else restarted xmrig (deploy, manual start, etc.)
+                # while we thought it was stopped. Reset ownership so we can
+                # manage it again.
+                log("xmrig was restarted externally while paused — reclaiming")
+                paused_by_us = False
            if not paused_by_us:
                # Only claim ownership if xmrig is actually running.
                # If something else stopped it (e.g. UPS battery hook),
@@ -122,6 +144,7 @@ def main():
                    log(f"Workload ended ({real_work_pct:.1f}% CPU) past grace period — starting xmrig")
                    if systemctl("start", "xmrig.service"):
                        paused_by_us = False
+                        started_at = time.monotonic()
                    idle_since = None

        time.sleep(POLL_INTERVAL)
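
For review purposes, the two fixes in this commit can be restated as a pure decision function with no systemd or /proc/stat access. This is only a sketch: `State` and `step` are hypothetical names that do not exist in xmrig-auto-pause.py, the parts of `main()` elided between the hunks above are guessed from their comments, and every `systemctl` call is assumed to succeed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class State:
    """Hypothetical container for the three loop variables touched by the diff."""
    paused_by_us: bool = False          # we stopped xmrig and intend to restart it
    idle_since: Optional[float] = None  # when load last dropped below the threshold
    started_at: Optional[float] = None  # when we last started xmrig


def step(state: State, now: float, busy: bool, xmrig_active: bool,
         grace: float = 15.0, cooldown: float = 10.0) -> Optional[str]:
    """Decide one poll iteration. Returns "stop", "start", or None.

    Mutates `state` as if the returned systemctl action succeeded
    (an assumption; the real script checks the return value).
    """
    # Fix 2: startup cooldown. While RandomX init is still running,
    # take no action at all, whatever the CPU reading says.
    if state.started_at is not None:
        if now - state.started_at < cooldown:
            return None
        state.started_at = None

    if busy:
        state.idle_since = None
        # Fix 1: xmrig is running although we believe we paused it,
        # so something else restarted it. Drop the stale ownership flag.
        if state.paused_by_us and xmrig_active:
            state.paused_by_us = False
        # Only claim ownership if xmrig is actually running.
        if not state.paused_by_us and xmrig_active:
            state.paused_by_us = True
            return "stop"
        return None

    # Idle: restart xmrig only if we were the ones who stopped it,
    # and only after the grace period has elapsed.
    if state.paused_by_us:
        if state.idle_since is None:
            state.idle_since = now
        elif now - state.idle_since >= grace:
            state.paused_by_us = False
            state.idle_since = None
            state.started_at = now
            return "start"
    return None
```

Under these assumptions, a busy poll with `xmrig_active=True` while `paused_by_us` is already set yields a second "stop" instead of doing nothing, which corresponds to the first fix. A busy poll that arrives 3 seconds after a "start" yields `None`, which corresponds to the second. Passing `now` in explicitly, rather than calling `time.monotonic()` inside, is a choice made here so the sketch can be exercised without sleeping.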