Simon Gardling 26401f5316 yarn: rotate tpm identity after fTPM reset
BIOS 2423→4101 update on yarn required an fTPM reset, which broke the
sealed age identity at /var/lib/agenix/tpm-identity. Bootstrapped a new
identity against the new SRK and rotated yarn's recipient.

age-plugin-tpm 1.0+ emits age1tag1… (p256tag) recipients by default and
refuses to encrypt to legacy age1tpm1… ones, so rotated mreow's recipient
to the same encoding (same key, new bech32 HRP) and added an
age-plugin-tag→age-plugin-tpm symlink in the rage wrapper so rage's
plugin dispatch finds the binary under the new prefix. Stripped the
trailing host labels from the tpm recipient strings — rage's stricter
bech32 parser now rejects the trailing whitespace; labels live in
adjacent Nix comments instead.
2026-04-30 18:41:36 -04:00


AGENTS.md

Project Overview

Unified NixOS flake for three hosts:

| Host | Role | nixpkgs channel | Activation |
|------|------|-----------------|------------|
| mreow | Framework 13 AMD AI 300 laptop (niri, greetd, swaylock) | nixos-unstable | ./deploy.sh locally |
| yarn | AMD Zen 5 desktop (niri + Jovian-NixOS Steam Deck mode, impermanence) | nixos-unstable | pull from CI binary cache |
| muffin | AMD Zen 3 server (Caddy, ZFS, agenix, deploy-rs, 25+ services) | nixos-25.11 | deploy-rs from CI |

One flake.nix declares both channels (nixpkgs and nixpkgs-stable) and composes each host from the correct channel. No single-channel migration is intended.

History predating this repo lives in the merged subtree branches from dotfiles (commit e9a44f6) and server-config (commit 4bc5d57). For pre-unification history, use git log <path> (without --follow) and traverse back through the merge commits dc481c2 and 6448a04.

Layout

flake.nix                  # 3 hosts, 2 channels
deploy.sh                  # wrapper: current-host rebuild or `muffin` deploy-rs
hosts/<host>/              # host entrypoints (default.nix, home.nix, disk.nix, …)
modules/                   # flat namespace; see module naming below
  common.nix               # imported by ALL hosts (nix settings, doas, fish shim)
  desktop-*.nix            # imported by mreow/yarn only
  server-*.nix             # imported by muffin only
  <bare>.nix               # scoped by filename (age-secrets, zfs, no-rgb, …)
home/
  profiles/{gui,desktop,no-gui}.nix   # home-manager profiles
  progs/<program>.nix                 # one file per program (fish, helix, niri, zen/, emacs, …)
  util/<helper>.nix                   # small derivations
services/                  # muffin-only: caddy, jellyfin, gitea, matrix, monero, …
tests/                     # pkgs.testers.runNixOSTest suite
lib/
  default.nix              # extends nixpkgs-stable.lib with mkCaddyReverseProxy, serviceMountWithZpool, …
  overlays.nix             # jellyfin-exporter, igpu-exporter, reflac, ensureZfsMounts
patches/nixpkgs/           # applied to nixpkgs-stable for muffin builds
secrets/
  secrets.nix              # agenix recipients (who can decrypt each .age)
  desktop/                 # agenix *.age (mreow + yarn) + disk-password (install-time only, git-crypt)
  home/                    # git-crypt: per-user HM secrets (api keys, steam id)
  server/                  # agenix *.age + git-crypt *.nix/*.tar/livekit_keys (muffin)
  usb-secrets/             # USB-resident agenix identity for muffin (git-crypt inside the repo)

Never read or write files under secrets/. They are encrypted at rest (git-crypt for plaintext, agenix for .age). The git-crypt key is delivered to muffin at runtime as /run/agenix/git-crypt-key-nixos.age.

Build & Deploy

# --- from any host ---
nix fmt                                                          # nixfmt-tree
nix flake update                                                 # bump both channels + inputs
nix flake update --input-name nixpkgs                            # bump just desktops' channel
nix flake update --input-name nixpkgs-stable                     # bump just muffin's channel

# --- per-host eval / build (add -L for verbose logs) ---
nix build .#nixosConfigurations.mreow.config.system.build.toplevel -L
nix build .#nixosConfigurations.yarn.config.system.build.toplevel -L
nix build .#nixosConfigurations.muffin.config.system.build.toplevel -L

# --- quick eval without building ---
nix eval .#nixosConfigurations.muffin.config.system.build.toplevel.drvPath 2>&1 | head -5

# --- activate on current host (mreow / yarn only) ---
./deploy.sh                 # boot (default; next reboot)
./deploy.sh switch          # apply immediately
./deploy.sh test            # apply without boot entry
./deploy.sh build           # build only

# --- deploy to muffin from anywhere ---
./deploy.sh muffin
# roughly equivalent to (minus the deploy-guard preflight):
nix run .#deploy -- .#muffin

# --- tests (muffin) ---
nix build .#packages.x86_64-linux.tests -L        # all tests (slow)
nix build .#test-zfsTest -L                        # one test by name
# test names are the keys of tests/tests.nix; pattern is test-<name>

No unit tests for desktop configs. Validation is a successful nix build (exit code 0) plus a review of the nix-diff against the previous generation.

If Nix complains about a missing file, git add it first — flakes only see tracked files.

Module naming

| Prefix | Meaning | Example |
|--------|---------|---------|
| common- | imported by ALL hosts | common-doas.nix, common-nix.nix, common-shell-fish.nix |
| desktop- | imported by mreow + yarn only | desktop-common.nix, desktop-steam.nix, desktop-networkmanager.nix |
| server- | imported by muffin only | server-security.nix, server-power.nix, server-impermanence.nix, server-lanzaboote-agenix.nix |
| (none) | host-specific; filename-scoped, see file contents | zfs.nix, no-rgb.nix (yarn + muffin) |

New modules: pick the narrowest prefix that's true, then add the import explicitly in the host's default.nix (there is no auto-discovery).
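As an illustration, suppose a new desktop-only module is added (modules/desktop-example.nix is a hypothetical name). It has to be listed by hand in each desktop host. The fragment below is a sketch: the import list is illustrative and not copied from the repo, and the real hosts/yarn/default.nix imports more.

```nix
# hosts/yarn/default.nix (illustrative fragment): every module is listed explicitly
{ ... }:
{
  imports = [
    ../../modules/common-nix.nix
    ../../modules/desktop-common.nix
    ../../modules/desktop-steam.nix
    ../../modules/no-rgb.nix # bare name: used by yarn + muffin only
    # hypothetical new desktop- module; hosts/mreow/default.nix needs the same line
    ../../modules/desktop-example.nix
  ];
}
```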

Code style

  • Formatter: nixfmt-tree (declared in flake.nix). Run nix fmt before every commit.
  • Indentation: 2 spaces, enforced by the formatter.
  • Function args: one per line, each followed by a comma, ending with ... (which takes no comma):
    {
      config,
      lib,
      pkgs,
      username,
      ...
    }:
    
  • Imports: relative paths, one per line. Use the ../../modules/ style from hosts/; do not invent new aggregator modules unless more than one host uses the aggregation.
  • Package paths: prefer lib.getExe pkgs.foo over "${pkgs.foo}/bin/foo" when the derivation declares meta.mainProgram.
  • Unfree packages: allowlisted per-module via nixpkgs.config.allowUnfreePredicate. Do not add a global permit.
  • Comments: lowercase, # style. Use # TODO! / # BUG! / # FIX: prefixes for known issues that should be searchable.
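  • Commas: none between list elements (Nix lists are whitespace-separated) and none after attribute bindings (those end in ;). The trailing commas in the function-args rule above apply only to argument patterns.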
  • lib.mkDefault / lib.mkForce: prefer mkDefault in shared modules so hosts can override without fighting priority; use mkForce only to beat inherited defaults you can't reach any other way.
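The getExe and priority rules together, in a small hypothetical shared module. The options (services.fwupd.enable, environment.sessionVariables) are real NixOS options but are chosen only as examples; nothing here is taken from the repo.

```nix
# hypothetical shared module, e.g. modules/desktop-example.nix
{
  lib,
  pkgs,
  ...
}:
{
  # mkDefault is priority 1000; a host's plain `services.fwupd.enable = false;`
  # (priority 100) wins without needing mkForce (priority 50)
  services.fwupd.enable = lib.mkDefault true;

  # getExe follows meta.mainProgram ("hx" for helix), so it yields .../bin/hx;
  # guessing "${pkgs.helix}/bin/helix" by hand would be wrong
  environment.sessionVariables.EDITOR = lib.getExe pkgs.helix;
}
```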

Secrets

  • git-crypt covers secrets/** per the root .gitattributes. Initialized with a single symmetric key checked into secrets/server/git-crypt-key-nixos.age (agenix-encrypted to the USB SSH identity).
  • agenix decrypts *.age into /run/agenix/ at activation on every host:
    • muffin: identity is /mnt/usb-secrets/usb-secrets-key (ssh-ed25519 on a physical USB). Wired in modules/usb-secrets.nix.
    • mreow + yarn: identity is /var/lib/agenix/tpm-identity (an age-plugin-tpm handle sealed by the host's TPM 2.0). Wired in modules/desktop-age-secrets.nix; yarn persists /var/lib/agenix through impermanence.
  • Recipients are declared in secrets/secrets.nix. Desktop secrets are encrypted to the admin SSH key + each host's TPM recipient; server secrets stay encrypted to the muffin USB key.
  • Bootstrap a new desktop: run doas scripts/bootstrap-desktop-tpm.sh on the host. It generates a TPM-sealed identity at /var/lib/agenix/tpm-identity and prints an age1tag1… recipient. Append that recipient to the tpm list in secrets/secrets.nix. Label it with a Nix # <host> comment, not a trailing word inside the string, because rage's bech32 parser rejects the trailing whitespace. Then run agenix -r to re-encrypt, commit, and ./deploy.sh switch.
  • Recipient encodings: legacy age1tpm1… recipients still decrypt, but age-plugin-tpm 1.0+ refuses to encrypt to them. modules/desktop-age-secrets.nix symlinks age-plugin-tag → age-plugin-tpm so rage's plugin dispatch finds the binary under both prefixes.
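  • Encrypting a new server secret uses the SSH public key directly with age -R. The <(…) process substitution below is bash syntax; in fish, write (ssh-keygen -y -f … | psub) instead: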
    age -R <(ssh-keygen -y -f secrets/usb-secrets/usb-secrets-key) \
        -o secrets/server/<name>.age \
        /path/to/plaintext
    
    For desktop secrets, prefer agenix -e secrets/desktop/<name>.age from a shell with age-plugin-tpm on PATH — it reads secrets/secrets.nix and encrypts to every recipient listed there.
  • DO NOT use ssh-to-age. It produces X25519 recipient stanzas, which the SSH private key on muffin cannot decrypt (it only decrypts ssh-ed25519 stanzas produced by age -R against the SSH pubkey). Mismatched stanzas show up as age: error: no identity matched any of the recipients at deploy time.
  • Never read or commit plaintext secrets. Never log secret values.
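The sketch below shows one way to lay out that recipient list. It assumes the standard agenix secrets.nix format, an attrset mapping each .age path to its publicKeys. The binding names admin and tpm and the file desktop/example.age are hypothetical, and every key string is a placeholder.

```nix
# secrets/secrets.nix: illustrative shape only, real keys elided
let
  admin = "ssh-ed25519 AAAA..."; # admin SSH public key (placeholder)
  tpm = [
    "age1tag1..." # mreow
    "age1tag1..." # yarn
    # wrong: "age1tag1... yarn" (rage's bech32 parser rejects the trailing word)
  ];
in
{
  # hypothetical desktop secret: admin key plus every host's TPM recipient
  "desktop/example.age".publicKeys = [ admin ] ++ tpm;
}
```

After you edit the list, agenix -r re-encrypts every listed file to its updated recipient set.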

Service pattern (muffin)

Each file under services/ follows this shape:

  1. imports block with lib.serviceMountWithZpool and (optionally) lib.serviceFilePerms.
  2. The service configuration (services.<name> = { … }).
  3. Caddy reverse-proxy vhost (usually via lib.mkCaddyReverseProxy in lib/default.nix).
  4. Firewall rules (networking.firewall.allowed{TCP,UDP}Ports) if externally reachable.
  5. services.fail2ban.jails.<name> if the service authenticates users.

Custom lib helpers (in lib/default.nix) to prefer over reinventing:

  • lib.serviceMountWithZpool <service> <zpool> [dirs]
  • lib.serviceFilePerms <service> [tmpfilesRules]
  • lib.optimizePackage <pkg> — applies -O3 -march=znver3 -mtune=znver3
  • lib.vpnNamespaceOpenPort <port> <service> — confines service to the WireGuard namespace
  • lib.mkCaddyReverseProxy { subdomain|domain, port, auth ? false, vpn ? false }
  • lib.mkFail2banJail { name, unitName ? "${name}.service", failregex }
  • lib.mkGrafanaAnnotationService { name, description, script, after ? [], environment ? {}, loadCredential ? null }
  • lib.extractArrApiKey <configXmlPath> — shell snippet to read the <ApiKey> element
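Here is a rough sketch of a new service file that follows the five-step shape, using the helper signatures listed above. Everything is a placeholder: the service name example, the zpool tank, port 8123, the data directory, and the fail2ban regex. The doc confirms that serviceMountWithZpool and serviceFilePerms go in imports. That mkCaddyReverseProxy and mkFail2banJail also return importable modules is an assumption; if they return plain config instead, those calls move out of imports.

```nix
# services/example.nix (hypothetical; names, zpool, port, and paths are placeholders)
{
  lib, # the repo's extended lib from lib/default.nix
  ...
}:
let
  port = 8123; # normally taken from hosts/muffin/service-configs.nix ports.private
in
{
  imports = [
    # 1. storage on the right zpool, plus file permissions
    (lib.serviceMountWithZpool "example" "tank" [ "/var/lib/example" ])
    (lib.serviceFilePerms "example" [ "Z /var/lib/example 0750 example example -" ])
    # 3. reverse-proxy vhost (assumed to return a module); auth and vpn default to false
    (lib.mkCaddyReverseProxy {
      subdomain = "example";
      inherit port;
    })
    # 5. only because this service authenticates users (assumed to return a module)
    (lib.mkFail2banJail {
      name = "example";
      failregex = "^.*login failed for .* from <HOST>.*$";
    })
  ];

  # 2. the service itself (services.example is a hypothetical module)
  services.example = {
    enable = true;
    inherit port;
  };

  # 4. no networking.firewall entry: a private port must stay off the allow-list
}
```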

Hard requirements (the two port rules are asserted by the flake at eval time):

  • Port uniqueness: every port in hosts/muffin/service-configs.nix ports.{public,private} must be unique. The flake asserts this.
  • Public/private segregation: public ports must appear in the firewall allow-list; private ports must not. The flake asserts both directions.
  • Hugepages: services that need 2 MiB hugepages declare their budget in service-configs.nix under hugepages_2m.services. The vm.nr_hugepages sysctl is derived from the total.
  • PostgreSQL-first: any service that supports PostgreSQL uses it (via a peer-auth Unix socket when possible). Per-service SQLite (or a similar embedded database) is discouraged.
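The port rules fit the standard NixOS assertions option, and the hugepages rule is a simple sum. The sketch below is not the flake's actual code. It assumes that hosts/muffin/service-configs.nix evaluates to a plain attrset, that ports.public and ports.private map names to bare port numbers, and that hugepages_2m.services maps names to 2 MiB page counts (a 1 GiB budget would then be 512). The import path assumes the module sits in modules/.

```nix
# hypothetical module sketch; the real checks live in the flake
{
  config,
  lib,
  ...
}:
let
  cfg = import ../hosts/muffin/service-configs.nix; # assumed plain attrset
  public = lib.attrValues cfg.ports.public;
  private = lib.attrValues cfg.ports.private;
  allPorts = public ++ private;
  fw = config.networking.firewall;
  allowed = fw.allowedTCPPorts ++ fw.allowedUDPPorts;
in
{
  assertions = [
    {
      # lib.unique drops repeats, so a shorter result means a duplicate port
      assertion = lib.length (lib.unique allPorts) == lib.length allPorts;
      message = "service-configs.nix: duplicate port in ports.{public,private}";
    }
    {
      assertion = lib.all (p: lib.elem p allowed) public;
      message = "a public port is missing from the firewall allow-list";
    }
    {
      assertion = !(lib.any (p: lib.elem p allowed) private);
      message = "a private port is open in the firewall";
    }
  ];

  # the sysctl is the total of every per-service budget (in 2 MiB pages)
  boot.kernel.sysctl."vm.nr_hugepages" = lib.foldl' builtins.add 0 (
    lib.attrValues cfg.hugepages_2m.services
  );
}
```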

Deploy guard (muffin)

modules/server-deploy-guard.nix aggregates per-service "is anyone using this right now?" checks into a single deploy-guard-check binary on muffin. Enforcement is preflight-only — the guard runs over SSH before deploy-rs is invoked; activation itself is never gated. This matters because deploy-rs sets the new profile pointer before running the activation script, so a failed activation triggers auto-rollback which re-runs switch-to-configuration on the previous generation — that re-activation rotates agenix secrets, reinstalls lanzaboote, and reloads systemd units. The only safe place to stop a deploy is before deploy-rs starts.

Two drivers invoke the preflight:

  • ./deploy.sh muffin SSHes to server-public and runs deploy-guard-check. SSH connection failure is a hard abort (rc=255) because there is no second gate. ./deploy.sh muffin --force (or DEPLOY_GUARD_FORCE=1 ./deploy.sh muffin) skips the preflight entirely.
  • CI (.gitea/workflows/deploy.yml) has a Deploy guard preflight step between Build muffin and Deploy via deploy-rs. A non-zero exit fails the job before any closure copy or activation.

Adding a new check

In the service's own file (or a sibling <service>-deploy-guard.nix):

{
  config,
  lib,
  pkgs,
  ...
}:
let
  check = pkgs.writeShellApplication {
    name = "deploy-guard-check-<service>";
    runtimeInputs = [ /* curl, jq, etc. */ ];
    text = ''
      # exit 0 when the service is idle / unreachable (soft-fail)
      # exit 1 with a reason on stdout/stderr when live users would be disrupted
    '';
  };
in
lib.mkIf config.services.<service>.enable {
  services.deployGuard.checks.<service> = {
    description = "Active <service> users";
    command = check;
  };
}

Existing registrations live in services/jellyfin/jellyfin-deploy-guard.nix (REST /Sessions via curl+jq) and services/minecraft-deploy-guard.nix (Server List Ping via mcstatus). Prefer soft-fail on unreachable — a service that's already down has no users to disrupt.
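Below is a filled-in version of the template that shows the soft-fail convention. It is written for a hypothetical HTTP service whose GET /api/sessions endpoint on 127.0.0.1:8123 returns a JSON array of active sessions. The service name, port, and endpoint are all assumptions, not one of the existing registrations.

```nix
{
  config,
  lib,
  pkgs,
  ...
}:
let
  check = pkgs.writeShellApplication {
    name = "deploy-guard-check-example";
    runtimeInputs = [
      pkgs.curl
      pkgs.jq
    ];
    text = ''
      # soft-fail: a service that is down or unreachable has no users to disrupt
      if ! body=$(curl -fsS --max-time 5 http://127.0.0.1:8123/api/sessions); then
        echo "example unreachable; treating as idle"
        exit 0
      fi

      # malformed JSON makes jq fail, which (under set -e) blocks the deploy
      n=$(jq 'length' <<<"$body")
      if [ "$n" -gt 0 ]; then
        echo "example has $n active session(s)"
        exit 1
      fi
    '';
  };
in
lib.mkIf config.services.example.enable {
  services.deployGuard.checks.example = {
    description = "Active example users";
    command = check;
  };
}
```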

Deploy finalize (muffin)

modules/server-deploy-finalize.nix solves the self-deploy problem: the gitea-actions runner driving CI deploys lives on muffin itself, so a direct switch-to-configuration switch restarts the runner mid-activation, killing the SSH session, the CI job, and deploy-rs's magic-rollback handshake. The failure mode is visible as "deploy appears to fail even though the new config landed" (or worse, a rollback storm).

The fix is a two-phase activation wired into deploy.nodes.muffin.profiles.system.path in flake.nix:

  1. switch-to-configuration boot — bootloader-only, no service restarts. The runner, SSH session, and magic-rollback survive.
  2. deploy-finalize — schedules a detached systemd-run --on-active=N transient unit (default 60s). The unit is owned by pid1, so it survives the eventual runner restart. If /run/booted-system/{kernel,initrd,kernel-modules} differs from the new profile's, the unit runs systemctl reboot; otherwise it runs switch-to-configuration switch.

In other words, muffin reboots only when the kernel, initrd, or kernel modules changed; otherwise it does a plain switch. The 60s delay gives the CI job (or a manual ./deploy.sh muffin) time to emit its status/notification steps before the runner is recycled.
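The two-phase wiring in flake.nix might look roughly like the fragment below. It relies on deploy-rs's activate.custom helper, which runs the given script with $PROFILE set to the new profile; deploy-rs's own activate.nixos is built the same way. deployFinalize is a hypothetical binding for whatever package implements phase 2, and the repo's actual expression may differ.

```nix
# flake.nix fragment (hypothetical): deploy-rs, self, lib and deployFinalize
# are assumed to be in scope
deploy.nodes.muffin.profiles.system.path =
  deploy-rs.lib.x86_64-linux.activate.custom
    self.nixosConfigurations.muffin.config.system.build.toplevel
    ''
      # phase 1: new bootloader entry only; no unit restarts, so the runner survives
      "$PROFILE/bin/switch-to-configuration" boot
      # phase 2: delegate the real switch (or reboot) to a pid1-owned timer
      ${lib.getExe deployFinalize} "$PROFILE"
    '';
```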

Back-to-back deploys supersede each other: each invocation cancels any still-pending deploy-finalize-*.timer before scheduling its own. deploy-finalize --dry-run prints the decision without scheduling anything — useful when debugging.
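The decision and supersession logic can be sketched as a writeShellApplication. This is not the code in modules/server-deploy-finalize.nix. The binary name deploy-finalize-sketch, its argument order, and the choice to decide at scheduling time are all assumptions. Deciding when the timer fires would give the same answer, since /run/booted-system only changes across a reboot.

```nix
{
  pkgs,
  ...
}:
let
  deployFinalizeSketch = pkgs.writeShellApplication {
    name = "deploy-finalize-sketch";
    runtimeInputs = [
      pkgs.coreutils
      pkgs.systemd
    ];
    text = ''
      dry_run=0
      if [ "''${1:-}" = "--dry-run" ]; then
        dry_run=1
        shift
      fi
      profile=$(readlink -f "$1") # new system profile, made absolute
      delay=''${2:-60}           # seconds until the transient unit fires

      # reboot iff any boot-critical path differs from the running system
      action=switch
      for p in kernel initrd kernel-modules; do
        if [ "$(readlink -f "/run/booted-system/$p")" != "$(readlink -f "$profile/$p")" ]; then
          action=reboot
        fi
      done

      if [ "$dry_run" -eq 1 ]; then
        echo "would $action in ''${delay}s"
        exit 0
      fi

      # a newer deploy supersedes any finalize timer that has not fired yet
      systemctl stop 'deploy-finalize-sketch-*.timer' || true

      if [ "$action" = reboot ]; then
        cmd=(${pkgs.systemd}/bin/systemctl reboot)
      else
        cmd=("$profile/bin/switch-to-configuration" switch)
      fi

      # the transient timer + service are owned by pid1, so they outlive the CI runner
      systemd-run --unit="deploy-finalize-sketch-$(date +%s)" --on-active="$delay" "''${cmd[@]}"
    '';
  };
in
{
  environment.systemPackages = [ deployFinalizeSketch ];
}
```

Running deploy-finalize-sketch --dry-run /nix/var/nix/profiles/system prints the decision without touching any timers.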

Prior art: the 3-path {kernel,initrd,kernel-modules} diff is lifted from nixpkgs's system.autoUpgrade module (the allowReboot = true branch) and was packaged the same way in obsidiansystems/obelisk#957. nixpkgs#185030 tracks lifting it into switch-to-configuration proper but has been stale since 2025-07. The self-deploy systemd-run detachment is the proposed fix from deploy-rs#153, also unmerged upstream.

Technical details

  • Privilege escalation: doas everywhere; sudo is disabled on every host.
  • Shell: fish. bash login shells re-exec into fish via programs.bash.interactiveShellInit (see modules/common-shell-fish.nix).
  • Secure boot: lanzaboote. Every host extracts keys from an agenix-decrypted tar at activation — desktops via modules/desktop-lanzaboote-agenix.nix, muffin via modules/server-lanzaboote-agenix.nix.
  • Impermanence: muffin is tmpfs-root with /persistent surviving reboots (modules/server-impermanence.nix); yarn binds /home/primary from /persistent (hosts/yarn/impermanence.nix).
  • Disks: disko.
  • Binary cache: muffin runs harmonia; desktops consume it at https://nix-cache.sigkill.computer.
  • Kernel:
    • Desktops: linux-cachyos-bore-lto, processorOpt = "x86_64-v3" (see modules/desktop-common.nix — also trims ~80 legacy subsystems).
    • muffin: linuxPackages_6_12 (pinned; 6.18 has a ZFS deadlock in dbuf_evict).
  • Domain: sigkill.computer. The old gardling.com redirects automatically.

Agent-specific instructions

  • If instructed to commit, disable GPG signing (git commit --no-gpg-sign). The author's GPG key is not available in this environment.
  • Use nix-shell -p <package> if a tool is missing from the environment.
  • For nix build, always append -L for verbose logs.
  • If Nix reports a missing file, run git add <file> first — flakes only see git-tracked files.
  • Do not read files under secrets/.
  • Run nix fmt after editing any .nix file.
  • Validate every change with nix build .#nixosConfigurations.<host>.config.system.build.toplevel -L.
  • Commit messages are terse, lowercase; prefix with <scope>: when narrowly scoped (caddy: add redirect, zfs: remove unneeded options, mreow: bump kernel). Generic changes use update or a short description.