nixos/AGENTS.md
Simon Gardling 26401f5316 yarn: rotate tpm identity after fTPM reset
BIOS 2423→4101 update on yarn required an fTPM reset, which broke the
sealed age identity at /var/lib/agenix/tpm-identity. Bootstrapped a new
identity against the new SRK and rotated yarn's recipient.

age-plugin-tpm 1.0+ emits age1tag1… (p256tag) recipients by default and
refuses to encrypt to legacy age1tpm1… ones, so rotated mreow's recipient
to the same encoding (same key, new bech32 HRP) and added an
age-plugin-tag→age-plugin-tpm symlink in the rage wrapper so rage's
plugin dispatch finds the binary under the new prefix. Stripped the
trailing host labels from the tpm recipient strings — rage's stricter
bech32 parser now rejects the trailing whitespace; labels live in
adjacent Nix comments instead.
2026-04-30 18:41:36 -04:00


# AGENTS.md
## Project Overview
Unified NixOS flake for three hosts:
| Host | Role | nixpkgs channel | Activation |
|------|------|----------------|-----------|
| `mreow` | Framework 13 AMD AI 300 laptop (niri, greetd, swaylock) | `nixos-unstable` | `./deploy.sh` locally |
| `yarn` | AMD Zen 5 desktop (niri + Jovian-NixOS Steam deck mode, impermanence) | `nixos-unstable` | pull from CI binary cache |
| `muffin` | AMD Zen 3 server (Caddy, ZFS, agenix, deploy-rs, 25+ services) | `nixos-25.11` | deploy-rs from CI |
One `flake.nix` declares both channels (`nixpkgs` and `nixpkgs-stable`) and composes each host from the correct channel. No single-channel migration is intended.
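A minimal sketch of what that composition could look like. The input names `nixpkgs` and `nixpkgs-stable` come from the paragraph above; the URLs and the bare `nixosSystem` calls are assumptions, and the real `flake.nix` also wires in patches, overlays, and the extended `lib`, none of which is shown here.

```nix
{
  inputs = {
    # assumed URLs; only the input names are confirmed by this document
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    nixpkgs-stable.url = "github:NixOS/nixpkgs/nixos-25.11";
  };

  outputs =
    {
      nixpkgs,
      nixpkgs-stable,
      ...
    }:
    {
      nixosConfigurations = {
        # desktops evaluate against nixos-unstable
        mreow = nixpkgs.lib.nixosSystem { modules = [ ./hosts/mreow ]; };
        yarn = nixpkgs.lib.nixosSystem { modules = [ ./hosts/yarn ]; };
        # the server evaluates against the stable channel
        muffin = nixpkgs-stable.lib.nixosSystem { modules = [ ./hosts/muffin ]; };
      };
    };
}
```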
History pre-dating this repo lives in the merged subtree branches from `dotfiles` (commit `e9a44f6`) and `server-config` (commit `4bc5d57`). Use `git log <path>` (without `--follow`) and traverse back through the merge commits `dc481c2` and `6448a04` for pre-unify history.
## Layout
```
flake.nix # 3 hosts, 2 channels
deploy.sh # wrapper: current-host rebuild or `muffin` deploy-rs
hosts/<host>/ # host entrypoints (default.nix, home.nix, disk.nix, …)
modules/ # flat namespace; see module naming below
  common-*.nix # imported by ALL hosts (nix settings, doas, fish shim)
  desktop-*.nix # imported by mreow/yarn only
  server-*.nix # imported by muffin only
  <bare>.nix # scoped by filename (age-secrets, zfs, no-rgb, …)
home/
  profiles/{gui,desktop,no-gui}.nix # home-manager profiles
  progs/<program>.nix # one file per program (fish, helix, niri, zen/, emacs, …)
  util/<helper>.nix # small derivations
services/ # muffin-only: caddy, jellyfin, gitea, matrix, monero, …
tests/ # pkgs.testers.runNixOSTest suite
lib/
  default.nix # extends nixpkgs-stable.lib with mkCaddyReverseProxy, serviceMountWithZpool, …
  overlays.nix # jellyfin-exporter, igpu-exporter, reflac, ensureZfsMounts
patches/nixpkgs/ # applied to nixpkgs-stable for muffin builds
secrets/
  secrets.nix # agenix recipients (who can decrypt each .age)
  desktop/ # agenix *.age (mreow + yarn) + disk-password (install-time only, git-crypt)
  home/ # git-crypt: per-user HM secrets (api keys, steam id)
  server/ # agenix *.age + git-crypt *.nix/*.tar/livekit_keys (muffin)
  usb-secrets/ # USB-resident agenix identity for muffin (git-crypt inside the repo)
```
**Never read or write files under `secrets/`.** They are encrypted at rest (git-crypt for plaintext, agenix for `.age`). The git-crypt key is delivered to `muffin` at runtime as `/run/agenix/git-crypt-key-nixos.age`.
## Build & Deploy
```sh
# --- from any host ---
nix fmt # nixfmt-tree
nix flake update # bump both channels + inputs
nix flake update nixpkgs # bump just desktops' channel
nix flake update nixpkgs-stable # bump just muffin's channel
# --- per-host eval / build (add -L for verbose logs) ---
nix build .#nixosConfigurations.mreow.config.system.build.toplevel -L
nix build .#nixosConfigurations.yarn.config.system.build.toplevel -L
nix build .#nixosConfigurations.muffin.config.system.build.toplevel -L
# --- quick eval without building ---
nix eval .#nixosConfigurations.muffin.config.system.build.toplevel --no-build 2>&1 | head -5
# --- activate on current host (mreow / yarn only) ---
./deploy.sh # boot (default; next reboot)
./deploy.sh switch # apply immediately
./deploy.sh test # apply without boot entry
./deploy.sh build # build only
# --- deploy to muffin from anywhere ---
./deploy.sh muffin
# equivalent to:
nix run .#deploy -- .#muffin
# --- tests (muffin) ---
nix build .#packages.x86_64-linux.tests -L # all tests (slow)
nix build .#test-zfsTest -L # one test by name
# test names are the keys of tests/tests.nix; pattern is test-<name>
```
Desktop configs have no unit tests. Validation is a zero `nix build` exit code plus a `nix-diff` against the previous generation.
If Nix complains about a missing file, `git add` it first — flakes only see tracked files.
## Module naming
| Prefix | Meaning | Example |
|--------|---------|---------|
| `common-` | imported by ALL hosts | `common-doas.nix`, `common-nix.nix`, `common-shell-fish.nix` |
| `desktop-` | imported by mreow + yarn only | `desktop-common.nix`, `desktop-steam.nix`, `desktop-networkmanager.nix` |
| `server-` | imported by muffin only | `server-security.nix`, `server-power.nix`, `server-impermanence.nix`, `server-lanzaboote-agenix.nix` |
| *(none)* | scoped by filename; check the file contents for which hosts import it | `zfs.nix`, `no-rgb.nix` (yarn + muffin) |
New modules: pick the narrowest prefix that's true, then add the import explicitly in the host's `default.nix` (there is no auto-discovery).
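As an illustration of the explicit import step, a host entrypoint might look like the excerpt below. This is a sketch, not the real `hosts/mreow/default.nix`: the first four module names appear in the table above, while `desktop-example.nix` is a hypothetical new module.

```nix
# hosts/mreow/default.nix (illustrative excerpt)
{
  ...
}:
{
  imports = [
    ../../modules/common-nix.nix
    ../../modules/common-doas.nix
    ../../modules/common-shell-fish.nix
    ../../modules/desktop-common.nix
    # hypothetical new module; it has no effect until it is listed here
    ../../modules/desktop-example.nix
  ];
}
```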
## Code style
- **Formatter**: `nixfmt-tree` (declared in `flake.nix`). Run `nix fmt` before every commit.
- **Indentation**: 2 spaces, enforced by the formatter.
- **Function args**: one per line, trailing comma, always end with `...`:
```nix
{
  config,
  lib,
  pkgs,
  username,
  ...
}:
```
- **Imports**: relative paths, one per line. Use the `../../modules/` style from `hosts/`; do not invent new aggregator modules unless more than one host uses the aggregation.
- **Package paths**: `lib.getExe pkgs.foo` over `"${pkgs.foo}/bin/foo"` when the derivation declares `meta.mainProgram`.
- **Unfree packages**: allowlisted per-module via `nixpkgs.config.allowUnfreePredicate`. Do not add a global permit.
- **Comments**: lowercase, `#` style. Use `# TODO!` / `# BUG!` / `# FIX:` prefixes for known issues that should be searchable.
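- **No commas in lists or attribute sets**: Nix separates list elements with whitespace and attribute bindings with `;`. Commas appear only in function argument sets.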
- **`lib.mkDefault` / `lib.mkForce`**: prefer `mkDefault` in shared modules so hosts can override without fighting priority; use `mkForce` only to beat inherited defaults you can't reach any other way.
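A short sketch that combines several of the rules above in one shared module. The option choices and the `example-hello` unit are made up for illustration; only the `lib.mkDefault` and `lib.getExe` usage reflects the style being described.

```nix
{
  lib,
  pkgs,
  ...
}:
{
  # shared default; a host can override it with a plain assignment
  networking.firewall.enable = lib.mkDefault true;

  # hypothetical unit: lib.getExe expands to <store path>/bin/<meta.mainProgram>
  systemd.services.example-hello = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig.ExecStart = lib.getExe pkgs.hello;
  };
}
```

A host that wants the opposite setting can write `networking.firewall.enable = false;` without `mkForce`, because a plain assignment outranks `mkDefault`.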
## Secrets
- **git-crypt** covers `secrets/**` per the root `.gitattributes`. Initialized with a single symmetric key checked into `secrets/server/git-crypt-key-nixos.age` (agenix-encrypted to the USB SSH identity).
- **agenix** decrypts `*.age` into `/run/agenix/` at activation on every host:
  - **muffin**: identity is `/mnt/usb-secrets/usb-secrets-key` (ssh-ed25519 on a physical USB). Wired in `modules/usb-secrets.nix`.
  - **mreow + yarn**: identity is `/var/lib/agenix/tpm-identity` (an `age-plugin-tpm` handle sealed by the host's TPM 2.0). Wired in `modules/desktop-age-secrets.nix`; yarn persists `/var/lib/agenix` through impermanence.
- **Recipients** are declared in `secrets/secrets.nix`. Desktop secrets are encrypted to the admin SSH key + each host's TPM recipient; server secrets stay encrypted to the muffin USB key.
- **Bootstrap a new desktop**: run `doas scripts/bootstrap-desktop-tpm.sh` on the host. It generates a TPM-sealed identity at `/var/lib/agenix/tpm-identity` and prints an `age1tag1…` recipient. Append the recipient to the `tpm` list in `secrets/secrets.nix`, run `agenix -r` to re-encrypt, commit, then `./deploy.sh switch`. Two caveats apply. First, legacy `age1tpm1…` recipients still decrypt, but `age-plugin-tpm` 1.0+ refuses to encrypt to them; `modules/desktop-age-secrets.nix` symlinks `age-plugin-tag → age-plugin-tpm` so rage's plugin dispatch finds the binary under both prefixes. Second, put the host label in a Nix `# host` comment, not as a trailing word inside the recipient string, because rage's bech32 parser rejects the trailing whitespace.
- **Encrypting a new server secret** uses the SSH public key directly with `age -R`:
```sh
age -R <(ssh-keygen -y -f secrets/usb-secrets/usb-secrets-key) \
  -o secrets/server/<name>.age \
  /path/to/plaintext
```
For desktop secrets, prefer `agenix -e secrets/desktop/<name>.age` from a shell with `age-plugin-tpm` on PATH — it reads `secrets/secrets.nix` and encrypts to every recipient listed there.
- **DO NOT use `ssh-to-age`**. It produces `X25519` recipient stanzas, which the SSH private key on muffin cannot decrypt (it only decrypts `ssh-ed25519` stanzas produced by `age -R` against the SSH pubkey). Mismatched stanzas show up as `age: error: no identity matched any of the recipients` at deploy time.
- Never read or commit plaintext secrets. Never log secret values.
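For orientation, the recipient rules described above could take roughly the following shape in agenix's `secrets.nix` format. This is an assumed layout, not a copy of the real file: the `admin` binding, the `desktop/example.age` entry, and all key strings are placeholders; only the `tpm` list name and the comment-label convention come from the text above.

```nix
# secrets/secrets.nix (illustrative shape only)
let
  admin = "ssh-ed25519 AAAA…"; # placeholder for the admin SSH public key
  tpm = [
    "age1tag1…" # mreow
    "age1tag1…" # yarn
  ];
in
{
  # hypothetical desktop secret: admin key plus every host's TPM recipient
  "desktop/example.age".publicKeys = [ admin ] ++ tpm;
}
```

Keeping the host label in a trailing comment leaves each recipient string free of whitespace, which is what rage's parser requires.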
## Service pattern (muffin)
Each file under `services/` follows this shape:
1. `imports` block with `lib.serviceMountWithZpool` and (optionally) `lib.serviceFilePerms`.
2. The service configuration (`services.<name> = { … }`).
3. Caddy reverse-proxy vhost (usually via `lib.mkCaddyReverseProxy` in `lib/default.nix`).
4. Firewall rules (`networking.firewall.allowed{TCP,UDP}Ports`) if externally reachable.
5. `services.fail2ban.jails.<name>` if the service authenticates users.
Custom lib helpers (in `lib/default.nix`) to prefer over reinventing:
- `lib.serviceMountWithZpool <service> <zpool> [dirs]`
- `lib.serviceFilePerms <service> [tmpfilesRules]`
- `lib.optimizePackage <pkg>` — applies `-O3 -march=znver3 -mtune=znver3`
- `lib.vpnNamespaceOpenPort <port> <service>` — confines service to the WireGuard namespace
- `lib.mkCaddyReverseProxy { subdomain|domain, port, auth ? false, vpn ? false }`
- `lib.mkFail2banJail { name, unitName ? "${name}.service", failregex }`
- `lib.mkGrafanaAnnotationService { name, description, script, after ? [], environment ? {}, loadCredential ? null }`
- `lib.extractArrApiKey <configXmlPath>` — shell snippet to read the `<ApiKey>` element
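A hedged sketch of how a service file might combine these helpers. Argument order follows the signatures listed above, but everything else is assumed: `services.example` is not a real option, the pool name `tank` and the paths are placeholders, and the document does not state how the results of `lib.mkCaddyReverseProxy` and `lib.mkFail2banJail` are consumed (they are treated as importable modules here).

```nix
# services/example.nix (hypothetical service)
{
  config,
  lib,
  ...
}:
{
  imports = [
    # step 1: storage and permissions
    (lib.serviceMountWithZpool "example" "tank" [ "/var/lib/example" ])
    (lib.serviceFilePerms "example" [ "Z /var/lib/example 0700 example example" ])

    # step 3: reverse proxy (assumed to be a module)
    (lib.mkCaddyReverseProxy {
      subdomain = "example";
      port = 8080;
    })

    # step 5: fail2ban jail (assumed to be a module); failregex is a placeholder
    (lib.mkFail2banJail {
      name = "example";
      failregex = "^.*Failed login from <HOST>.*$";
    })
  ];

  # step 2: the service itself
  services.example = {
    enable = true;
    port = 8080;
  };
}
```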
Hard requirements that are asserted at eval time:
- **Port uniqueness**: every port in `hosts/muffin/service-configs.nix` `ports.{public,private}` must be unique. The flake asserts this.
- **Public/private segregation**: public ports must appear in the firewall allow-list; private ports must not. The flake asserts both directions.
- **Hugepages**: services that need 2 MiB hugepages declare their budget in `service-configs.nix` under `hugepages_2m.services`. The `vm.nr_hugepages` sysctl is derived from the total.
- **PostgreSQL-first**: any service that supports PostgreSQL uses it (via peer-auth Unix socket when possible). Per-service SQLite (or similar) is discouraged.
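The port rules can be pictured as NixOS `assertions`, as in the sketch below. It assumes `ports.public` and `ports.private` are attribute sets mapping names to port numbers and uses made-up values; the real checks live in the flake and may be written differently.

```nix
{
  config,
  lib,
  ...
}:
let
  # hypothetical stand-in for hosts/muffin/service-configs.nix
  ports = {
    public = {
      https = 443;
      minecraft = 25565;
    };
    private = {
      jellyfin = 8096;
      grafana = 3000;
    };
  };

  publicPorts = lib.attrValues ports.public;
  privatePorts = lib.attrValues ports.private;
  allPorts = publicPorts ++ privatePorts;

  fw = config.networking.firewall;
  allowed = fw.allowedTCPPorts ++ fw.allowedUDPPorts;
in
{
  assertions = [
    {
      # deduplicating must not shrink the list
      assertion = lib.length (lib.unique allPorts) == lib.length allPorts;
      message = "duplicate port in ports.{public,private}";
    }
    {
      assertion = lib.all (p: lib.elem p allowed) publicPorts;
      message = "a public port is missing from the firewall allow-list";
    }
    {
      assertion = !(lib.any (p: lib.elem p allowed) privatePorts);
      message = "a private port is exposed in the firewall allow-list";
    }
  ];
}
```

A failed assertion aborts evaluation with its `message`, so a violation is caught before anything is built or deployed.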
## Deploy guard (muffin)
`modules/server-deploy-guard.nix` aggregates per-service "is anyone using this right now?" checks into a single `deploy-guard-check` binary on muffin. Enforcement is **preflight-only** — the guard runs over SSH *before* deploy-rs is invoked; activation itself is never gated. This matters because deploy-rs sets the new profile pointer before running the activation script, so a failed activation triggers auto-rollback which re-runs `switch-to-configuration` on the previous generation — that re-activation rotates agenix secrets, reinstalls lanzaboote, and reloads systemd units. The only safe place to stop a deploy is before deploy-rs starts.
Two drivers invoke the preflight:
- **`./deploy.sh muffin`** SSHes to `server-public` and runs `deploy-guard-check`. SSH connection failure is a hard abort (rc=255) because there is no second gate. `./deploy.sh muffin --force` (or `DEPLOY_GUARD_FORCE=1 ./deploy.sh muffin`) skips the preflight entirely.
- **CI (`.gitea/workflows/deploy.yml`)** has a `Deploy guard preflight` step between `Build muffin` and `Deploy via deploy-rs`. A non-zero exit fails the job before any closure copy or activation.
### Adding a new check
In the service's own file (or a sibling `<service>-deploy-guard.nix`):
```nix
{ config, lib, pkgs, ... }:
let
  check = pkgs.writeShellApplication {
    name = "deploy-guard-check-<service>";
    runtimeInputs = [ /* curl, jq, etc. */ ];
    text = ''
      # exit 0 when the service is idle / unreachable (soft-fail)
      # exit 1 with a reason on stdout/stderr when live users would be disrupted
    '';
  };
in
lib.mkIf config.services.<service>.enable {
  services.deployGuard.checks.<service> = {
    description = "Active <service> users";
    command = check;
  };
}
```
Existing registrations live in `services/jellyfin/jellyfin-deploy-guard.nix` (REST `/Sessions` via curl+jq) and `services/minecraft-deploy-guard.nix` (Server List Ping via `mcstatus`). Prefer soft-fail on unreachable — a service that's already down has no users to disrupt.
## Deploy finalize (muffin)
`modules/server-deploy-finalize.nix` solves the self-deploy problem: the gitea-actions runner driving CI deploys lives on muffin itself, so a direct `switch-to-configuration switch` restarts the runner mid-activation, killing the SSH session, the CI job, and deploy-rs's magic-rollback handshake. The failure mode is visible as "deploy appears to fail even though the new config landed" (or worse, a rollback storm).
The fix is a two-phase activation wired into `deploy.nodes.muffin.profiles.system.path` in `flake.nix`:
1. `switch-to-configuration boot` — bootloader-only, no service restarts. The runner, SSH session, and magic-rollback survive.
2. `deploy-finalize` — schedules a detached `systemd-run --on-active=N` transient unit (default 60s). The unit is owned by pid1, so it survives the eventual runner restart. If `/run/booted-system/{kernel,initrd,kernel-modules}` differs from the new profile's, the unit runs `systemctl reboot`; otherwise it runs `switch-to-configuration switch`.
In other words, the host reboots only when the kernel, initrd, or kernel modules changed. The 60s delay is tuned so the CI job (or manual `./deploy.sh muffin`) has time to emit status/notification steps before the runner is recycled.
Back-to-back deploys supersede each other: each invocation cancels any still-pending `deploy-finalize-*.timer` before scheduling its own. `deploy-finalize --dry-run` prints the decision without scheduling anything — useful when debugging.
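The reboot-or-switch decision can be sketched as a small `writeShellApplication`, in the same spirit as the comparison in nixpkgs's `system.autoUpgrade`. The script name and the optional profile argument are hypothetical, and the sketch only prints the decision; it does not schedule a timer or reproduce the real `deploy-finalize`.

```nix
{
  pkgs,
  ...
}:
let
  # hypothetical helper; the real logic lives in modules/server-deploy-finalize.nix
  finalizeDecision = pkgs.writeShellApplication {
    name = "deploy-finalize-decision-sketch";
    runtimeInputs = [ pkgs.coreutils ];
    text = ''
      # profile to compare against; defaults to the system profile
      profile="''${1:-/nix/var/nix/profiles/system}"

      # resolve the three symlinks on each side, one target per line
      booted="$(readlink /run/booted-system/{kernel,initrd,kernel-modules})"
      built="$(readlink "$profile"/{kernel,initrd,kernel-modules})"

      if [ "$booted" = "$built" ]; then
        echo "switch"
      else
        echo "reboot"
      fi
    '';
  };
in
{
  environment.systemPackages = [ finalizeDecision ];
}
```

Comparing symlink targets is enough here because each target is a store path, and a store path changes whenever its contents do.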
Prior art: the 3-path `{kernel,initrd,kernel-modules}` diff is lifted from nixpkgs's `system.autoUpgrade` module (the `allowReboot = true` branch) and was packaged the same way in [obsidiansystems/obelisk#957](https://github.com/obsidiansystems/obelisk/pull/957). nixpkgs#185030 tracks lifting it into `switch-to-configuration` proper but has been stale since 2025-07. The self-deploy `systemd-run` detachment is the proposed fix from [deploy-rs#153](https://github.com/serokell/deploy-rs/issues/153), also unmerged upstream.
## Technical details
- **Privilege escalation**: `doas` everywhere; `sudo` is disabled on every host.
- **Shell**: fish. The login shell stays `bash`; interactive `bash` sessions re-exec into fish via `programs.bash.interactiveShellInit` (see `modules/common-shell-fish.nix`).
- **Secure boot**: lanzaboote. Every host extracts keys from an agenix-decrypted tar at activation — desktops via `modules/desktop-lanzaboote-agenix.nix`, muffin via `modules/server-lanzaboote-agenix.nix`.
- **Impermanence**: muffin is tmpfs-root with `/persistent` surviving reboots (`modules/server-impermanence.nix`); yarn binds `/home/primary` from `/persistent` (`hosts/yarn/impermanence.nix`).
- **Disks**: disko.
- **Binary cache**: muffin runs harmonia; desktops consume it at `https://nix-cache.sigkill.computer`.
- **Kernel**:
- Desktops: `linux-cachyos-bore-lto`, `processorOpt = "x86_64-v3"` (see `modules/desktop-common.nix` — also trims ~80 legacy subsystems).
- muffin: `linuxPackages_6_12` (pinned; 6.18 has a ZFS deadlock in `dbuf_evict`).
- **Domain**: `sigkill.computer`. The old `gardling.com` redirects automatically.
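To make the fish shim from the Shell item concrete, here is a sketch of the common NixOS pattern for re-executing into fish from an interactive `bash`. It is an assumption that `modules/common-shell-fish.nix` looks like this; the real module may differ in its guards.

```nix
{
  lib,
  pkgs,
  ...
}:
{
  programs.fish.enable = true;

  programs.bash.interactiveShellInit = ''
    # skip when the parent process is already fish, or when bash was started with -c
    if [[ $(${lib.getExe' pkgs.procps "ps"} --no-header --pid=$PPID --format=comm) != "fish" && -z ''${BASH_EXECUTION_STRING} ]]; then
      # carry the login flag over so fish behaves as a login shell too
      shopt -q login_shell && LOGIN_OPTION='--login' || LOGIN_OPTION=""
      exec ${lib.getExe pkgs.fish} $LOGIN_OPTION
    fi
  '';
}
```

The parent-process guard keeps a `bash` started from inside fish usable, since without it every nested `bash` would immediately jump back into fish.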
## Agent-specific instructions
- If instructed to commit, **disable GPG signing** (`git commit --no-gpg-sign`). The author's GPG key is not available in this environment.
- Use `nix-shell -p <package>` if a tool is missing from the environment.
- For `nix build`, always append `-L` for verbose logs.
- If Nix reports a missing file, run `git add <file>` first — flakes only see git-tracked files.
- Do not read files under `secrets/`.
- Run `nix fmt` after editing any `.nix` file.
- Validate every change with `nix build .#nixosConfigurations.<host>.config.system.build.toplevel -L`.
- Commit messages are terse, lowercase; prefix with `<scope>:` when narrowly scoped (`caddy: add redirect`, `zfs: remove unneeded options`, `mreow: bump kernel`). Generic changes use `update` or a short description.