Commit Graph

71 Commits

Author SHA1 Message Date
674d3cf539 fix tests 2026-04-12 15:36:04 -04:00
e9ce1ce0a2 grafana: replace llama-cpp-annotations daemon with prometheus query 2026-04-09 19:54:57 -04:00
d48f27701f xmrig-auto-pause: add hysteresis to prevent stop/start thrashing
xmrig's RandomX pollutes the L3 cache, making other processes appear
~3-8% busier. With a single 5% threshold for both stopping and
resuming, the script oscillates: start xmrig -> cache pressure
inflates CPU -> stop xmrig -> CPU drops -> restart -> repeat.

Split into CPU_STOP_THRESHOLD (15%) and CPU_RESUME_THRESHOLD (5%).
The stop threshold sits above xmrig's indirect pressure, so only
genuine workloads trigger a pause. The resume threshold confirms the
system is truly idle before restarting.
2026-04-07 01:09:06 -04:00
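
A minimal sketch of the two-threshold decision described in d48f27701f. It is not the actual script: the function name `decide`, the action strings, and the choice of inclusive comparisons are assumptions made for illustration. Only the two threshold names and values come from the commit message.

```python
CPU_STOP_THRESHOLD = 15.0    # percent; value from the commit message
CPU_RESUME_THRESHOLD = 5.0   # percent; value from the commit message


def decide(cpu_percent: float, paused: bool) -> str:
    """Return "stop", "resume", or "hold" (hypothetical action names).

    cpu_percent is the CPU usage of everything except xmrig.
    paused is True when xmrig is currently stopped by this logic.
    """
    if not paused and cpu_percent >= CPU_STOP_THRESHOLD:
        return "stop"
    if paused and cpu_percent <= CPU_RESUME_THRESHOLD:
        return "resume"
    # Readings between the two thresholds never change state, so the
    # 3-8% of indirect cache pressure cannot cause a stop/start loop.
    return "hold"
```

With a reading of 8%, the sketch holds whichever state it is already in, which is the band the single 5% threshold lacked.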
7afd1f35d2 xmrig-auto-pause: fix 2026-04-06 13:11:54 -04:00
bbcd662c28 xmrig-auto-pause: fix stuck state after external restart, add startup cooldown
Two bugs found during live verification on the server:

1. Stuck state after external restart: if something else restarted xmrig
   (e.g. deploy-rs activation) while paused_by_us=True, the script never
   detected this and became permanently stuck — unable to stop xmrig on
   future load because it thought xmrig was already stopped.

   Fix: when paused_by_us=True and busy, check if xmrig is actually
   running. If so, reset paused_by_us=False and re-stop it.

2. Flapping on xmrig restart: RandomX dataset init takes ~3.7s of intense
   non-nice CPU, which the script detected as real workload and immediately
   re-stopped xmrig after every restart, creating a start-stop loop.

   Fix: add STARTUP_COOLDOWN (default 10s) — after starting xmrig, skip
   CPU checks until the cooldown expires.

Both bugs were present in production: the script had been stuck since
Apr 3 (2+ days) with xmrig running unmanaged alongside llama-server.
2026-04-05 23:20:47 -04:00
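
The two fixes in bbcd662c28 can be sketched as a single decision step. This is an illustration under assumptions, not the real script: `State`, `step`, and the action strings are hypothetical names, and `busy` and `xmrig_running` stand in for whatever CPU and service checks the script actually performs. Only `paused_by_us` and `STARTUP_COOLDOWN` (default 10s) come from the commit message.

```python
from dataclasses import dataclass

STARTUP_COOLDOWN = 10.0  # seconds; default named in the commit message


@dataclass
class State:
    paused_by_us: bool = False
    started_at: float = float("-inf")  # time of our last xmrig start


def step(state: State, now: float, busy: bool, xmrig_running: bool) -> str:
    """Return the action for this tick: "stop", "start", or "none"."""
    # Fix 2: skip CPU checks while RandomX dataset init may still be running.
    if now - state.started_at < STARTUP_COOLDOWN:
        return "none"
    if busy:
        if state.paused_by_us and xmrig_running:
            # Fix 1: something else restarted xmrig, so the flag is stale.
            state.paused_by_us = False
        if not state.paused_by_us:
            state.paused_by_us = True
            return "stop"
        return "none"
    if state.paused_by_us:
        state.paused_by_us = False
        state.started_at = now
        return "start"
    return "none"
```

Keeping the step free of I/O makes both failure modes easy to reproduce in a test by feeding in `xmrig_running=True` while `paused_by_us` is set, or a busy reading shortly after a start.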
324a9123db better organize related monero and matrix services 2026-04-04 14:32:26 -04:00
daf82c16ba fix xmrig pause 2026-04-03 14:39:20 -04:00
124d33963e organize 2026-04-03 00:47:12 -04:00
1451f902ad grafana: re-organize 2026-04-03 00:39:42 -04:00
096ffeb943 llama-cpp: xmrig + grafana hooks 2026-04-03 00:17:17 -04:00
9baeaa5c23 llama-cpp: add grafana annotations for inference requests
Poll /slots endpoint, create annotations when slots start processing,
close with token count when complete. Includes NixOS VM test with
mock llama-cpp and grafana servers. Dashboard annotation entry added.
2026-04-02 17:43:49 -04:00
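
A rough sketch of the polling logic described in 9baeaa5c23: compare each /slots poll with the previous one to find slots that started or finished processing. The function name `diff_slots` and the response fields `id` and `is_processing` are assumptions about the payload shape, not taken from the commit. Creating and closing the Grafana annotations is left out.

```python
def diff_slots(active: set, slots: list) -> tuple:
    """Compare one poll of /slots against the previous poll.

    active: slot ids that were processing at the previous poll.
    slots:  decoded JSON list from the current poll (assumed shape).
    Returns (started, finished, now_active).
    """
    now_active = {s["id"] for s in slots if s.get("is_processing")}
    started = sorted(now_active - active)    # open an annotation for these
    finished = sorted(active - now_active)   # close the annotation for these
    return started, finished, now_active
```

A caller would keep `now_active` between polls and pass it back in as `active` on the next iteration.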
297264a34a tests: extract shared jellyfin test helpers and use real jellyfin in annotations test 2026-04-01 11:24:44 -04:00
a5206b9ec6 monitoring: add grafana annotations for zfs scrub events 2026-04-01 11:24:43 -04:00
3196b38db7 tests: extract shared mock grafana server from jellyfin test 2026-04-01 11:24:43 -04:00
c6b889cea3 grafana: more things
1. Smoothed out power draw
- UPS only reports in 9-watt intervals, so smoothing it out gives more
relative detail on trends
2. Add jellyfin integration
- Good for seeing correlations between statistics and jellyfin streams
3. intel gpu stats
- Provides info on utilization of the gpu
2026-03-31 17:25:06 -04:00
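
To illustrate why smoothing helps with a reading that only moves in 9-watt steps, here is a small trailing moving average. The commit does not say how the dashboard smooths the series, so the function name `moving_average`, the window size, and the sample values are all illustrative assumptions.

```python
from collections import deque


def moving_average(samples, window):
    """Trailing moving average; early outputs use the samples seen so far."""
    buf = deque(maxlen=window)
    out = []
    for x in samples:
        buf.append(x)
        out.append(sum(buf) / len(buf))
    return out
```

For example, `moving_average([90, 90, 99, 99], 2)` yields an intermediate value of 94.5 between the two quantized levels instead of a single hard step.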
5375f8ee34 gitea: add actions runner and CI/CD deploy workflow
This will avoid me having to run "deploy" myself on my laptop.
All I will need to do is push a commit and it will self-deploy.
2026-03-31 12:38:43 -04:00
cc8761a304 torrent-audit: init 2026-03-27 18:13:21 -07:00
a5f3af5ff3 ports refactor 2026-03-21 12:13:53 -04:00
643df612ad jellyfin: patch port 8096 being open
All jellyfin traffic should go through caddy.
Leaving this port open caused a lot of confusion: I was getting
traffic from typo'd domain names such as `jellfin.gardling.com`,
which made NO SENSE! Because those requests went directly to
port 8096, they skipped caddy entirely and still got through.
2026-03-04 13:29:54 -05:00
d4b679d1a5 cleanup 2026-03-03 19:39:10 -05:00
c34bd1626f fmt 2026-03-03 14:31:40 -05:00
b977b578e0 arr-init: extract to standalone flake repo 2026-03-03 14:31:39 -05:00
ad33f94e32 minecraft: make more responsive 2026-03-03 14:31:39 -05:00
294cb6453e ntfy-alerts: init 2026-03-03 14:31:36 -05:00
745d0ea4c2 arr-init: add module for API-based configuration 2026-03-03 14:31:28 -05:00
5f6aa2e200 jellyfin-qbittorrent-monitor: fix upload 2026-03-03 14:31:23 -05:00
a890508267 jellyfin-qbittorrent-monitor: dynamic bandwidth management 2026-03-03 14:31:22 -05:00
fb305cc9f4 fmt 2026-03-03 14:31:20 -05:00
0d1205210d feat(tmpfiles): defer per-service file permissions to reduce boot time 2026-03-03 14:31:18 -05:00
683a4f903d potentially fix fail2ban 2026-03-03 14:31:11 -05:00
12b681c8f2 cleanup 2026-03-03 14:31:05 -05:00
f7a0eef88f cleanup minecraft test 2026-03-03 14:31:05 -05:00
4de717a20d Revert "minecraft: fail2ban"
This reverts commit a23b3d8c5f1786204e3de18c3b8ba579a0e0e693.
2026-03-03 14:31:03 -05:00
a184dcee5b minecraft: fail2ban 2026-03-03 14:31:03 -05:00
c9fc1b028e hostPlatform -> targetPlatform 2026-03-03 14:31:02 -05:00
c6c96528a9 jellyfin-qbittorrent-monitor: don't use mock qbittorrent 2026-03-03 14:31:00 -05:00
9874c13052 jellyfin-qbittorrent-monitor: fix mock qbittorrent 2026-03-03 14:31:00 -05:00
a6a9196137 fmt 2026-03-03 14:30:59 -05:00
bd0c7cde6d tests: fix all fail2ban NixOS VM tests
- Add explicit iptables banaction in security.nix for test compatibility
- Force IPv4 in all curl requests to prevent IPv4/IPv6 mismatch issues
- Fix caddy test: use basic_auth directive (not basicauth)
- Override service ports in tests to match direct connections (not via Caddy)
- Vaultwarden: override ROCKET_ADDRESS and ROCKET_LOG for external access
- Immich: increase VM memory to 4GB for stability
- Jellyfin: create placeholder log file and reload fail2ban after startup
- Add tests.nix entries for all 6 fail2ban tests

All tests now pass: ssh, caddy, gitea, vaultwarden, immich, jellyfin
2026-03-03 14:30:59 -05:00
dc71dbc188 jellyfin-qbittorrent-monitor: handle qbittorrent going down state 2026-03-03 14:30:55 -05:00
0c677db3e0 jellyfin-qbittorrent-monitor: don't mock out jellyfin for testing 2026-03-03 14:30:52 -05:00
ecfc282526 rework qbittorrent jellyfin monitor test 2026-03-03 14:30:52 -05:00
da58597889 fix pkgs.system deprecation 2026-03-03 14:30:49 -05:00
165532bae3 nit: cleanup imports 2026-03-03 14:30:47 -05:00
7159e90186 organize 2026-03-03 14:30:43 -05:00
14539caad4 zfs: expand testing to include a failing multi case 2026-03-03 14:30:31 -05:00
e891d6f1ab zfs: fix qbittorrent 2026-03-03 14:30:26 -05:00
4ce1cb862e zfs: HEAVILY REFACTOR subvolume handling 2026-03-03 14:30:26 -05:00
7f9cd75902 zfs: fix zfs escaped spaces test 2026-03-03 14:30:24 -05:00
6a73b2f4f4 update 2026-03-03 14:30:16 -05:00