deploy-guard: block activation while users are online
Some checks failed
Build and Deploy / mreow (push) Successful in 51s
Build and Deploy / yarn (push) Successful in 47s
Build and Deploy / muffin (push) Failing after 1m9s

- modules/server-deploy-guard.nix: extendable aggregator registered via
  services.deployGuard.checks.<name>.{description,command}. Installs
  deploy-guard-check with per-check timeout, pass/block reporting, JSON
  output, DEPLOY_GUARD_BYPASS / /run/deploy-guard-bypass (single-shot).
- services/jellyfin/jellyfin-deploy-guard.nix: curl+jq on /Sessions,
  blocks when any session carries NowPlayingItem; soft-fails when unreachable.
- services/minecraft-deploy-guard.nix: mcstatus SLP query on 25565, blocks
  when players.online > 0; soft-fails when unreachable.
- flake.nix: wrap deploy.nodes.muffin activation with activate.custom so
  deploy-guard-check runs before switch-to-configuration. Auto-rollback
  catches the failure. dryActivate/boot branches preserved.
- deploy.sh: SSH preflight for ./deploy.sh muffin with --force /
  DEPLOY_GUARD_FORCE=1 (touches remote bypass marker). Connectivity
  failure is soft; activation still enforces.
- tests/deploy-guard.nix: aggregator contract, bypass mechanics, timeout,
  JSON output.
2026-04-22 00:36:21 -04:00
parent ddac5e3f04
commit aef99e7365
11 changed files with 603 additions and 7 deletions


@@ -156,6 +156,39 @@ Hard requirements that are asserted at eval time:
- **Hugepages**: services that need 2 MiB hugepages declare their budget in `service-configs.nix` under `hugepages_2m.services`. The `vm.nr_hugepages` sysctl is derived from the total.
- **PostgreSQL-first**: any service that supports PostgreSQL uses it (via peer-auth Unix socket when possible). Per-service SQLite (or similar) is discouraged.
## Deploy guard (muffin)
`modules/server-deploy-guard.nix` blocks `./deploy.sh muffin` / deploy-rs activation when a service it covers is in active use. Two paths enforce it:
- **Preflight**: `./deploy.sh muffin` SSHes to `server-public` and runs `deploy-guard-check` before the build. Connectivity failure is soft (activation still enforces). `./deploy.sh muffin --force` or `DEPLOY_GUARD_FORCE=1 ./deploy.sh muffin` touches `/run/deploy-guard-bypass` remotely (single-shot) and skips the preflight.
- **Activation**: the `activate.custom` wrapper in `flake.nix` runs `$PROFILE/sw/bin/deploy-guard-check` before `switch-to-configuration switch`. A non-zero exit triggers deploy-rs auto-rollback. The same bypass applies: set `DEPLOY_GUARD_BYPASS=1` in the environment or pre-touch `/run/deploy-guard-bypass`.
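
The single-shot bypass described above could be sketched in shell roughly as follows. This is an illustration, not the module's actual implementation: the function name `bypass_requested` and the overridable `BYPASS_MARKER` variable are hypothetical, and the sketch assumes the marker file is consumed by deleting it on first use.

```shell
# Hypothetical sketch of the bypass decision; the real deploy-guard-check
# may be structured differently.

# Marker path from the text above; overridable so the sketch can be
# exercised without write access to /run.
BYPASS_MARKER="${BYPASS_MARKER:-/run/deploy-guard-bypass}"

# Returns 0 when the guard should be skipped for this run, 1 otherwise.
bypass_requested() {
  # Environment bypass: applies only to the invocation that sets it.
  if [ "${DEPLOY_GUARD_BYPASS:-0}" = "1" ]; then
    return 0
  fi
  # File bypass is single-shot (assumed: consumed by deletion), so the
  # next deploy is guarded again.
  if [ -e "$BYPASS_MARKER" ]; then
    rm -f -- "$BYPASS_MARKER"
    return 0
  fi
  return 1
}
```

A guard script built this way would call `bypass_requested` first and only run the registered checks when it returns non-zero.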
### Adding a new check
In the service's own file (or a sibling `<service>-deploy-guard.nix`):
```nix
{ config, lib, pkgs, ... }:
let
  check = pkgs.writeShellApplication {
    name = "deploy-guard-check-<service>";
    runtimeInputs = [ /* curl, jq, etc. */ ];
    text = ''
      # exit 0 when the service is idle / unreachable (soft-fail)
      # exit 1 with a reason on stdout/stderr when live users would be disrupted
    '';
  };
in
lib.mkIf config.services.<service>.enable {
  services.deployGuard.checks.<service> = {
    description = "Active <service> users";
    command = check;
  };
}
```
Existing registrations live in `services/jellyfin/jellyfin-deploy-guard.nix` (REST `/Sessions` via curl+jq) and `services/minecraft-deploy-guard.nix` (Server List Ping via `mcstatus`). Prefer soft-fail when the service is unreachable: a service that is already down has no users to disrupt.
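
A check body along the lines of the Jellyfin registration might look like the sketch below. It rests on assumptions: `JELLYFIN_URL`, `JELLYFIN_API_KEY`, both function names, and the 5-second curl timeout are hypothetical, and the real check in `services/jellyfin/jellyfin-deploy-guard.nix` may authenticate or filter differently.

```shell
# Assumed defaults; the real check may obtain these some other way.
JELLYFIN_URL="${JELLYFIN_URL:-http://127.0.0.1:8096}"
JELLYFIN_API_KEY="${JELLYFIN_API_KEY:-}"

# Reads the /Sessions JSON array on stdin and prints how many sessions
# carry a NowPlayingItem, i.e. are actively playing media.
count_playing_sessions() {
  jq '[.[] | select(.NowPlayingItem != null)] | length'
}

# Return status 0 = safe to deploy, 1 = block.
jellyfin_guard() {
  # Soft-fail: an unreachable server has no users to disrupt.
  if ! sessions="$(curl --silent --fail --max-time 5 \
      -H "Authorization: MediaBrowser Token=\"$JELLYFIN_API_KEY\"" \
      "$JELLYFIN_URL/Sessions")"; then
    echo "jellyfin unreachable; not blocking"
    return 0
  fi
  # Soft-fail on an unparseable response as well (an assumption).
  if ! playing="$(printf '%s' "$sessions" | count_playing_sessions)"; then
    echo "jellyfin returned unexpected data; not blocking"
    return 0
  fi
  if [ "$playing" -gt 0 ]; then
    echo "$playing Jellyfin session(s) are playing media"
    return 1
  fi
  return 0
}
```

Calling `jellyfin_guard` as the last line of the `text` body would make its return status the check's exit code, matching the exit 0 / exit 1 contract in the template.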
## Technical details
- **Privilege escalation**: `doas` everywhere; `sudo` is disabled on every host.