xmrig-auto-pause: use cgroup.freeze to pause and thaw

This commit is contained in:
2026-04-21 14:30:03 -04:00
parent a8cf95c7dd
commit 018b590e0d
3 changed files with 492 additions and 218 deletions

View File

@@ -2,15 +2,33 @@
config,
lib,
pkgs,
service_configs,
...
}:
let
cgroupDir = "/sys/fs/cgroup/system.slice/xmrig.service";
cgroupFreeze = "${cgroupDir}/cgroup.freeze";
in
lib.mkIf config.services.xmrig.enable {
systemd.services.xmrig-auto-pause = {
description = "Auto-pause xmrig when other services need CPU";
description = "Auto-pause xmrig via cgroup freezer when other services need CPU";
after = [ "xmrig.service" ];
# PartOf cascades stop/restart: when xmrig stops (deploy, apcupsd battery,
# manual), systemd stops auto-pause first and ExecStop thaws xmrig so
# xmrig's own stop does not hang on a frozen cgroup.
partOf = [ "xmrig.service" ];
wantedBy = [ "multi-user.target" ];
serviceConfig = {
ExecStart = "${pkgs.python3}/bin/python3 ${./xmrig-auto-pause.py}";
# Safety net: any exit path (SIGTERM from PartOf cascade, systemctl stop,
# crash with Restart=) must leave xmrig thawed. The Python SIGTERM
# handler does the same thing; this covers SIGKILL / hard crash paths
# too. Idempotent.
ExecStop = pkgs.writeShellScript "xmrig-auto-pause-thaw" ''
f=${cgroupFreeze}
[ -w "$f" ] && echo 0 > "$f" || true
'';
Restart = "always";
RestartSec = "10s";
NoNewPrivileges = true;
@@ -22,6 +40,9 @@ lib.mkIf config.services.xmrig.enable {
];
MemoryDenyWriteExecute = true;
StateDirectory = "xmrig-auto-pause";
# Required so the script can write to cgroup.freeze despite the unit's
# read-only filesystem sandboxing (ProtectSystem=strict and friends).
ReadWritePaths = [ cgroupDir ];
};
environment = {
POLL_INTERVAL = "3";
@@ -32,8 +53,19 @@ lib.mkIf config.services.xmrig.enable {
# steady-state floor to avoid restarting xmrig while services are active.
CPU_STOP_THRESHOLD = "40";
CPU_RESUME_THRESHOLD = "10";
STARTUP_COOLDOWN = "10";
STATE_DIR = "/var/lib/xmrig-auto-pause";
XMRIG_CGROUP_FREEZE = cgroupFreeze;
# Per-service CPU thresholds. Catches sub-threshold activity that never
# trips the system-wide gauge — a single Minecraft player uses 3-15% of
# one core (0.3-1.3% of a 12-thread host) which is pure noise in
# /proc/stat but dominant in the minecraft cgroup.
WATCHED_SERVICES = lib.concatStringsSep "," (
lib.optional config.services.minecraft-servers.enable "minecraft-server-${service_configs.minecraft.server_name}:2"
);
};
};
# Pull auto-pause along whenever xmrig starts. After= on auto-pause ensures
# correct order; Wants= here ensures it actually starts.
systemd.services.xmrig.wants = [ "xmrig-auto-pause.service" ];
}

View File

@@ -2,33 +2,54 @@
"""
Auto-pause xmrig when other services need CPU.
Monitors non-nice CPU usage from /proc/stat. Since xmrig runs at Nice=19,
its CPU time lands in the 'nice' column and is excluded from the metric.
When real workload (user + system + irq + softirq) exceeds the stop
threshold, stops xmrig. When it drops below the resume threshold for
GRACE_PERIOD seconds, restarts xmrig.
Two independent signals drive the decision; either one can trigger a pause:
This replaces per-service pause scripts with a single general-purpose
monitor that handles any CPU-intensive workload (gitea workers, llama-cpp
inference, etc.) without needing to know about specific processes.
1. System-wide non-nice CPU from /proc/stat. Catches any CPU-heavy workload
including non-systemd user work (interactive sessions, ad-hoc jobs).
Since xmrig runs at Nice=19, its CPU time lands in the 'nice' column and
is excluded from the metric.
2. Per-service CPU from cgroup cpu.stat usage_usec. Catches sub-threshold
service activity — a single Minecraft player drives the server JVM to
3-15% of one core, which is noise system-wide (0.3-1.3% of total on a
12-thread host) but dominant for the minecraft cgroup.
When either signal crosses its stop threshold, writes 1 to
/sys/fs/cgroup/system.slice/xmrig.service/cgroup.freeze. When both are quiet
for GRACE_PERIOD seconds, writes 0 to resume.
Why direct cgroup.freeze instead of systemctl freeze:
systemd 256+ has a bug class where `systemctl freeze` followed by any
process death (SIGKILL, watchdog, OOM, segfault, shutdown) strands the
unit in FreezerState=frozen ActiveState=failed with no recovery short of
a reboot. See https://github.com/systemd/systemd/issues/38517. Writing
directly to cgroup.freeze keeps systemd's FreezerState at "running" the
whole time, so there is no state machine to get stuck: if xmrig dies
while frozen, systemd transitions it to inactive normally.
Why scheduler priority alone isn't enough:
Nice=19 / SCHED_IDLE only affects which thread gets the next time slice.
RandomX's 2MB-per-thread scratchpad (24MB across 12 threads) pollutes
the shared 32MB L3 cache, and its memory access pattern saturates DRAM
bandwidth. Other services run slower even though they aren't denied CPU
time. The only fix is to stop xmrig entirely when real work is happening.
RandomX's 2MB-per-thread scratchpad (24MB across 12 threads) holds about
68% of the shared 32MB L3 cache on Zen 3, evicting hot lines from
interactive services. Measured on muffin: pointer-chase latency is 112ns
with xmrig running and 19ns with xmrig frozen — a 6x difference that
scheduler priority cannot address.
Hysteresis:
The stop threshold is set higher than the resume threshold to prevent
oscillation. When xmrig runs, its L3 cache pressure makes other processes
appear ~3-8% busier. A single threshold trips on this indirect effect,
causing stop/start thrashing. Separate thresholds break the cycle: the
resume threshold confirms the system is truly idle, while the stop
threshold requires genuine workload above xmrig's indirect pressure.
The system-wide stop threshold sits higher than the resume threshold
because background services (qbittorrent, bitmagnet, postgres) produce
15-25% non-nice CPU during normal operation, and xmrig's indirect cache
pressure inflates that by another few percent. A single threshold
thrashes on the floor; two thresholds break the cycle.
Per-service thresholds are single-valued: per-service CPU is a clean
signal with no background floor to calibrate against, so no second
threshold is needed. idle_since is reset whenever any watched service
is at or above its threshold, and the grace period only advances when
every watched service is below.
"""
import os
import signal
import subprocess
import sys
import time
@@ -37,19 +58,23 @@ POLL_INTERVAL = int(os.environ.get("POLL_INTERVAL", "3"))
GRACE_PERIOD = float(os.environ.get("GRACE_PERIOD", "15"))
# Percentage of total CPU ticks that non-nice processes must use to trigger
# a pause. On a 12-thread system, one fully loaded core ≈ 8.3% of total.
# Default 15% requires roughly two busy cores, which avoids false positives
# from xmrig's L3 cache pressure inflating other processes' apparent CPU.
CPU_STOP_THRESHOLD = float(os.environ.get("CPU_STOP_THRESHOLD", "15"))
# Percentage below which the system is considered idle enough to resume
# mining. Lower than the stop threshold to provide hysteresis.
CPU_RESUME_THRESHOLD = float(os.environ.get("CPU_RESUME_THRESHOLD", "5"))
# After starting xmrig, ignore CPU spikes for this many seconds to let
# RandomX dataset initialization complete (~4s on the target hardware)
# without retriggering a stop.
STARTUP_COOLDOWN = float(os.environ.get("STARTUP_COOLDOWN", "10"))
# Per-service CPU thresholds parsed from "unit1:threshold1,unit2:threshold2".
# Thresholds are percentage of TOTAL CPU capacity (same frame as
# CPU_STOP_THRESHOLD). Empty / unset disables the per-service path.
WATCHED_SERVICES_RAW = os.environ.get("WATCHED_SERVICES", "")
# Path to xmrig's cgroup.freeze file. Direct write bypasses systemd's
# freezer state machine; see module docstring.
XMRIG_CGROUP_FREEZE = os.environ.get(
"XMRIG_CGROUP_FREEZE",
"/sys/fs/cgroup/system.slice/xmrig.service/cgroup.freeze",
)
# Directory for persisting pause state across script restarts. Without
# this, a restart while xmrig is paused loses the paused_by_us flag and
# xmrig stays stopped permanently.
# xmrig stays frozen until something else thaws it.
STATE_DIR = os.environ.get("STATE_DIR", "")
_PAUSE_FILE = os.path.join(STATE_DIR, "paused") if STATE_DIR else ""
@@ -58,6 +83,51 @@ def log(msg):
print(f"[xmrig-auto-pause] {msg}", file=sys.stderr, flush=True)
def _parse_watched(spec):
out = {}
for entry in filter(None, (s.strip() for s in spec.split(","))):
name, _, pct = entry.partition(":")
name = name.strip()
pct = pct.strip()
if not name or not pct:
log(f"WATCHED_SERVICES: ignoring malformed entry '{entry}'")
continue
try:
out[name] = float(pct)
except ValueError:
log(f"WATCHED_SERVICES: ignoring non-numeric threshold in '{entry}'")
return out
def _resolve_cgroup_cpustat(unit):
"""Look up the unit's cgroup path via systemd. Returns cpu.stat path or
None if the unit has no cgroup (service not running, unknown unit)."""
result = subprocess.run(
["systemctl", "show", "--value", "--property=ControlGroup", unit],
capture_output=True,
text=True,
)
cg = result.stdout.strip()
if not cg:
return None
path = f"/sys/fs/cgroup{cg}/cpu.stat"
if not os.path.isfile(path):
return None
return path
def _read_service_usec(path):
"""Cumulative cpu.stat usage_usec, or None if the cgroup has vanished."""
try:
with open(path) as f:
for line in f:
if line.startswith("usage_usec "):
return int(line.split()[1])
except FileNotFoundError:
return None
return None
def read_cpu_ticks():
"""Read CPU tick counters from /proc/stat.
@@ -84,123 +154,241 @@ def is_active(unit):
return result.returncode == 0
def systemctl(action, unit):
def main_pid(unit):
"""Return the unit's MainPID, or 0 if unit is not running."""
result = subprocess.run(
["systemctl", action, unit],
["systemctl", "show", "--value", "--property=MainPID", unit],
capture_output=True,
text=True,
)
if result.returncode != 0:
log(f"systemctl {action} {unit} failed (rc={result.returncode}): {result.stderr.strip()}")
return result.returncode == 0
try:
return int(result.stdout.strip() or "0")
except ValueError:
return 0
def _save_paused(paused):
"""Persist pause flag so a script restart can resume where we left off."""
def _freeze(frozen):
"""Write 1 or 0 to xmrig's cgroup.freeze. Returns True on success.
Direct kernel interface — bypasses systemd's freezer state tracking."""
try:
with open(XMRIG_CGROUP_FREEZE, "w") as f:
f.write("1" if frozen else "0")
return True
except OSError as e:
action = "freeze" if frozen else "thaw"
log(f"cgroup.freeze {action} write failed: {e}")
return False
def _is_frozen():
"""Read the actual frozen state from cgroup.events. False if cgroup absent."""
events_path = os.path.join(os.path.dirname(XMRIG_CGROUP_FREEZE), "cgroup.events")
try:
with open(events_path) as f:
for line in f:
if line.startswith("frozen "):
return line.split()[1] == "1"
except FileNotFoundError:
return False
return False
def _save_paused(pid):
"""Persist the xmrig MainPID at the time of freeze. pid=0 clears claim."""
if not _PAUSE_FILE:
return
try:
if paused:
open(_PAUSE_FILE, "w").close()
if pid:
with open(_PAUSE_FILE, "w") as f:
f.write(str(pid))
else:
os.remove(_PAUSE_FILE)
except OSError:
pass
try:
os.remove(_PAUSE_FILE)
except FileNotFoundError:
pass
except OSError as e:
log(f"state file write failed: {e}")
def _load_paused():
"""Check if a previous instance left xmrig paused."""
"""Return True iff our claim is still valid: same PID and still frozen.
Restart of the xmrig unit gives it a new PID, which invalidates any
prior claim — we can't "own" a freeze we didn't perform on this
instance. Also confirms the cgroup is actually frozen so an external
thaw drops the claim.
"""
if not _PAUSE_FILE:
return False
return os.path.isfile(_PAUSE_FILE)
try:
with open(_PAUSE_FILE) as f:
saved = int(f.read().strip() or "0")
except (FileNotFoundError, ValueError):
return False
if not saved:
return False
if saved != main_pid("xmrig.service"):
return False
return _is_frozen()
def _cleanup(signum=None, frame=None):
"""On SIGTERM/SIGINT: thaw xmrig and clear the claim, so auto-pause
never exits leaving a cgroup it froze stuck frozen."""
if _is_frozen():
_freeze(False)
_save_paused(0)
sys.exit(0)
def main():
paused_by_us = _load_paused()
idle_since = None
started_at = None # monotonic time when we last started xmrig
prev_total = None
prev_work = None
watched_services = _parse_watched(WATCHED_SERVICES_RAW)
watched_paths = {}
for name in watched_services:
path = _resolve_cgroup_cpustat(name)
if path is None:
log(f"WATCHED_SERVICES: {name} has no cgroup — ignoring until it starts")
watched_paths[name] = path
nproc = os.cpu_count() or 1
signal.signal(signal.SIGTERM, _cleanup)
signal.signal(signal.SIGINT, _cleanup)
paused_by_us = _load_paused()
if paused_by_us:
log("Recovered pause state from previous instance")
log(
f"Starting: poll={POLL_INTERVAL}s grace={GRACE_PERIOD}s "
f"stop={CPU_STOP_THRESHOLD}% resume={CPU_RESUME_THRESHOLD}% "
f"cooldown={STARTUP_COOLDOWN}s"
f"sys_stop={CPU_STOP_THRESHOLD}% sys_resume={CPU_RESUME_THRESHOLD}% "
f"watched={watched_services or '(none)'}"
)
idle_since = None
prev_total = None
prev_work = None
prev_monotonic = None
prev_service_usec = {}
while True:
total, work = read_cpu_ticks()
now = time.monotonic()
if prev_total is None:
prev_total = total
prev_work = work
prev_monotonic = now
# seed per-service baselines too
for name, path in watched_paths.items():
if path is None:
# Re-resolve in case the service has started since startup
path = _resolve_cgroup_cpustat(name)
watched_paths[name] = path
if path is not None:
usec = _read_service_usec(path)
if usec is not None:
prev_service_usec[name] = usec
time.sleep(POLL_INTERVAL)
continue
dt = total - prev_total
if dt <= 0:
dt_s = now - prev_monotonic
if dt <= 0 or dt_s <= 0:
prev_total = total
prev_work = work
prev_monotonic = now
time.sleep(POLL_INTERVAL)
continue
real_work_pct = ((work - prev_work) / dt) * 100
# Per-service CPU percentages this window. Fraction of total CPU
# capacity used by this specific service, same frame as real_work_pct.
svc_pct = {}
for name in watched_services:
path = watched_paths.get(name)
if path is None:
# Unit wasn't running at startup; try resolving again in case
# it has started since.
path = _resolve_cgroup_cpustat(name)
watched_paths[name] = path
if path is None:
prev_service_usec.pop(name, None)
continue
cur = _read_service_usec(path)
if cur is None:
# Service stopped; drop prev so it doesn't compute a huge delta
# on next start.
prev_service_usec.pop(name, None)
watched_paths[name] = None # force re-resolution next poll
continue
if name in prev_service_usec:
delta_us = cur - prev_service_usec[name]
if delta_us >= 0:
svc_pct[name] = (delta_us / 1_000_000) / (dt_s * nproc) * 100
prev_service_usec[name] = cur
prev_total = total
prev_work = work
prev_monotonic = now
# Don't act during startup cooldown — RandomX dataset init causes
# a transient CPU spike that would immediately retrigger a stop.
if started_at is not None:
if time.monotonic() - started_at < STARTUP_COOLDOWN:
time.sleep(POLL_INTERVAL)
continue
# Cooldown expired — verify xmrig survived startup. If it
# crashed during init (hugepage failure, pool unreachable, etc.),
# re-enter the pause/retry cycle rather than silently leaving
# xmrig dead.
if not is_active("xmrig.service"):
log("xmrig died during startup cooldown — will retry")
paused_by_us = True
_save_paused(True)
started_at = None
above_stop_sys = real_work_pct > CPU_STOP_THRESHOLD
below_resume_sys = real_work_pct <= CPU_RESUME_THRESHOLD
above_stop = real_work_pct > CPU_STOP_THRESHOLD
below_resume = real_work_pct <= CPU_RESUME_THRESHOLD
busy_services = [
n for n in watched_services if svc_pct.get(n, 0) > watched_services[n]
]
any_svc_at_or_above = any(
svc_pct.get(n, 0) >= watched_services[n] for n in watched_services
)
if above_stop:
stop_pressure = above_stop_sys or bool(busy_services)
fully_idle = below_resume_sys and not any_svc_at_or_above
if stop_pressure:
idle_since = None
if paused_by_us and is_active("xmrig.service"):
# Something else restarted xmrig (deploy, manual start, etc.)
# while we thought it was stopped. Reset ownership so we can
# manage it again.
log("xmrig was restarted externally while paused — reclaiming")
if paused_by_us and not _is_frozen():
# Someone thawed xmrig while we believed it paused. Reclaim
# ownership so we can re-freeze.
log("xmrig was thawed externally while paused — reclaiming")
paused_by_us = False
_save_paused(False)
if not paused_by_us:
# Only claim ownership if xmrig is actually running.
# If something else stopped it (e.g. UPS battery hook),
# don't interfere — we'd wrongly restart it later.
if is_active("xmrig.service"):
log(f"Real workload detected ({real_work_pct:.1f}% CPU) — stopping xmrig")
if systemctl("stop", "xmrig.service"):
paused_by_us = True
_save_paused(True)
_save_paused(0)
if not paused_by_us and is_active("xmrig.service"):
# Only claim ownership if xmrig is actually running. If
# something else stopped it (e.g. UPS battery hook), don't
# interfere.
if busy_services:
reasons = ", ".join(
f"{n}={svc_pct[n]:.1f}%>{watched_services[n]:.1f}%"
for n in busy_services
)
log(f"Stop: watched service(s) busy [{reasons}] — freezing xmrig")
else:
log(
f"Stop: system CPU {real_work_pct:.1f}% > "
f"{CPU_STOP_THRESHOLD:.1f}% — freezing xmrig"
)
if _freeze(True):
paused_by_us = True
_save_paused(main_pid("xmrig.service"))
elif paused_by_us:
if below_resume:
if fully_idle:
if idle_since is None:
idle_since = time.monotonic()
elif time.monotonic() - idle_since >= GRACE_PERIOD:
log(f"Workload ended ({real_work_pct:.1f}% CPU) past grace period — starting xmrig")
if systemctl("start", "xmrig.service"):
log(
f"Idle past grace period (system {real_work_pct:.1f}%) "
"— thawing xmrig"
)
if _freeze(False):
paused_by_us = False
_save_paused(False)
started_at = time.monotonic()
_save_paused(0)
idle_since = None
else:
# Between thresholds — not idle enough to resume.
# Between thresholds or a watched service is borderline — not
# idle enough to resume.
idle_since = None
time.sleep(POLL_INTERVAL)
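The hysteresis rules from the docstring (system-wide stop/resume thresholds, single per-service thresholds, grace period before thaw) can be condensed into a pure decision function for illustration. This is a sketch, not code from the commit: `decide` and its `"freeze"`/`"thaw"` return values are invented names, and the real loop's startup seeding and ownership reclaim are omitted.

```python
def decide(state, now, sys_pct, svc_pct, svc_thresholds,
           sys_stop=15.0, sys_resume=5.0, grace=15.0):
    """One poll of the pause state machine.

    state: dict with keys "paused" (bool) and "idle_since" (float or
    None), mutated in place. sys_pct is system-wide non-nice CPU;
    svc_pct maps unit name to its share of total CPU this window.
    Returns "freeze", "thaw", or None (no action).
    """
    # Either signal alone is enough to freeze.
    busy = sys_pct > sys_stop or any(
        svc_pct.get(n, 0.0) > t for n, t in svc_thresholds.items()
    )
    # Thaw requires BOTH signals quiet: system below resume threshold
    # and every watched service strictly below its threshold.
    fully_idle = sys_pct <= sys_resume and not any(
        svc_pct.get(n, 0.0) >= t for n, t in svc_thresholds.items()
    )
    if busy:
        state["idle_since"] = None
        if not state["paused"]:
            state["paused"] = True
            return "freeze"
    elif state["paused"]:
        if not fully_idle:
            state["idle_since"] = None   # between thresholds: hold frozen
        elif state["idle_since"] is None:
            state["idle_since"] = now    # start the grace timer
        elif now - state["idle_since"] >= grace:
            state["paused"] = False
            state["idle_since"] = None
            return "thaw"
    return None
```

The "between thresholds" band is what breaks the oscillation: a reading above the resume threshold but below the stop threshold resets the grace timer without triggering a freeze, so xmrig only thaws once the system has been genuinely quiet for the full grace period.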