Add GPG signing to linux binary tarball builds#5289

Open
PlatCore wants to merge 13 commits into master from
PlatCore/5160-add-signing-linux-binaries

Conversation

@PlatCore
Contributor

@PlatCore PlatCore commented Apr 16, 2026

Why this should be merged

Linux binary tarballs are uploaded to S3 unsigned. RPM packages already have GPG signing. This adds detached signatures (.tar.gz.sig) to close that gap.

Closes #5160

How this works

  • build-tgz-pkg.sh: imports GPG key into temp GNUPGHOME, signs each tarball with gpg --detach-sign, verifies inline, uploads .sig to S3. No-op when no key is provided.
  • build-linux-binaries.yml: both jobs import GPG key from RPM_GPG_PRIVATE_KEY secret, pass it to the script, and include .sig in artifacts.
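The signing step described above can be sketched as follows (the helper name and exact flags are illustrative; the real `build-tgz-pkg.sh` may differ in detail):

```shell
#!/usr/bin/env bash
# Sketch of the per-tarball signing step: no-op without a key, isolated
# keyring, detach-sign, inline verify. Assumes GPG_KEY_FILE points at an
# exported private key when signing is wanted.
set -euo pipefail

sign_tarball() {
  local tarball="$1"
  # No-op when no key is provided (unsigned local/fork builds).
  if [[ -z "${GPG_KEY_FILE:-}" ]]; then
    echo "no GPG key provided; skipping signature for ${tarball}"
    return 0
  fi
  # Import into an isolated keyring so ~/.gnupg is never touched.
  GNUPGHOME="$(mktemp -d)"
  export GNUPGHOME
  trap 'rm -rf "${GNUPGHOME}"' EXIT
  gpg --batch --import "${GPG_KEY_FILE}"
  gpg --batch --yes --detach-sign --output "${tarball}.sig" "${tarball}"
  # Verify inline before anything is uploaded.
  gpg --verify "${tarball}.sig" "${tarball}"
}
```

In CI the workflow writes the `RPM_GPG_PRIVATE_KEY` secret to a temp file and points `GPG_KEY_FILE` at it; leaving the variable unset locally yields unsigned tarballs.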

How this was tested

  • No GPG key: tarballs produced unsigned (backward compat)
  • With ephemeral GPG key: .sig files produced, gpg --verify passes
  • Empty key file: signing skipped (fork build scenario)
  • ~/.gnupg never touched (temp GNUPGHOME isolation verified)
  • shellcheck and yamllint clean

Need to be documented in RELEASES.md?

No

@PlatCore PlatCore moved this to In Progress 🏗️ in avalanchego Apr 16, 2026
@PlatCore PlatCore self-assigned this Apr 16, 2026
@PlatCore PlatCore added the ci (This focuses on changes to the CI process) and devinfra labels Apr 16, 2026
@PlatCore PlatCore force-pushed the PlatCore/5160-add-signing-linux-binaries branch from 7af6acf to 9b6a785 Compare April 16, 2026 18:31
@PlatCore PlatCore marked this pull request as ready for review April 16, 2026 19:06
@PlatCore PlatCore requested a review from a team as a code owner April 16, 2026 19:06
@PlatCore PlatCore requested a review from maru-ava April 16, 2026 19:06
@PlatCore PlatCore force-pushed the PlatCore/5160-add-signing-linux-binaries branch from 9b6a785 to ef7dc92 Compare April 27, 2026 18:54
Contributor

@maru-ava maru-ava left a comment

What’s here is entirely reasonable, and I appreciate you doing the work. My original instinct was to make the minimal change you’ve proposed here and add signing to the existing workflow.

But after having just reviewed the DEB packaging series, I’d prefer to see this follow the same general pattern: locally reproducible and locally testable. We shouldn’t be extending release/signing behavior in a way that can only really be validated in CI or through ad-hoc manual testing. That means a local task-driven validation path and CI exercising that same path so that we have regression coverage when we change the workflow or its supporting functionality. That suggests:

  • a Taskfile entrypoint
  • a separate validation step/script
  • detached-signature verification in a fresh environment
  • CI invoking that same validation path
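A minimal sketch of what such Taskfile entrypoints could look like, following the RPM/DEB pattern (task and script names here are illustrative, not the final ones):

```yaml
# Hypothetical .github/packaging/Taskfile.yml fragment
tasks:
  build-tarballs:
    desc: Build (and optionally sign) linux binary tarballs in a container
    cmds:
      - ./scripts/build-tgz.sh
  validate-tarballs:
    desc: Verify detached signatures and smoke-test binaries in a fresh container
    cmds:
      - ./scripts/validate-tgz.sh
  test-build-tarballs:
    desc: Local end-to-end check; CI invokes this same path
    cmds:
      - task: build-tarballs
      - task: validate-tarballs
```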

# Optional env vars:
# DOCKERFILE - Dockerfile name within CONTEXT_DIR (default: "Dockerfile")
# BUILDER_PLATFORM - Target platform for the image (e.g., "linux/amd64").
Contributor

Please explain why you think this is a good idea. Not everything that can be done, should be done.

Contributor Author

Addressed.

PS the comment gives "if Jurassic Park scientists should" vibes )

@PlatCore PlatCore force-pushed the PlatCore/5160-add-signing-linux-binaries branch from 5dcefe7 to 5ce8a46 Compare May 2, 2026 01:10
PlatCore added 13 commits May 1, 2026 18:37
Add detached GPG signatures (.sig) to the tarball packaging pipeline,
reusing the same key infrastructure as RPM signing.
Use a job-scoped temp directory for GNUPGHOME instead of the default
~/.gnupg to avoid mutating shared state on persistent runners. Clean
up via trap EXIT so the keyring is removed even on script failure.

Pass passphrase via stdin (--passphrase-fd 0) instead of command line
to avoid exposure in /proc/<pid>/cmdline.
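The isolation pattern can be sketched as below. The pattern runs in a child shell here purely so the trap's cleanup is observable afterwards; in the real script the trap fires when the script itself exits:

```shell
#!/usr/bin/env bash
# Sketch: job-scoped GNUPGHOME removed by a trap on EXIT, so the keyring
# vanishes even if the script fails mid-sign.
set -euo pipefail

keyring="$(bash -s <<'EOF'
set -euo pipefail
GNUPGHOME="$(mktemp -d)"
export GNUPGHOME
trap 'rm -rf "${GNUPGHOME}"' EXIT
# Signing work happens here; the passphrase travels on stdin, e.g.:
#   printf '%s' "${GPG_PASSPHRASE}" | gpg --batch --pinentry-mode loopback \
#     --passphrase-fd 0 --detach-sign --output pkg.tar.gz.sig pkg.tar.gz
echo "${GNUPGHOME}"
EOF
)"
# By the time the child shell has exited, its trap has removed the keyring.
[ -d "${keyring}" ] && echo "keyring leaked" || echo "keyring cleaned up"
```

Note that `--passphrase-fd 0` generally needs `--pinentry-mode loopback` on modern GnuPG for batch use.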
Add an optional DOCKERFILE env var (defaulting to "Dockerfile") so the
same script can build different packaging builder images. RPM's task
continues to work unchanged because it does not set DOCKERFILE.

This unblocks adding additional packaging builder images (e.g. for
linux tarballs) without duplicating the script.
Add a new packaging path for signed linux tarballs that mirrors the
existing RPM pattern: a Dockerfile, a build script that runs inside
the container, a validation script that runs in a fresh container,
and Taskfile entrypoints (build-tarballs, validate-tarballs,
test-build-tarballs).

The build container (Ubuntu 22.04 / glibc 2.35) compiles avalanchego
and subnet-evm from source, stages each, tar+gzips them, and signs
each archive with gpg --detach-sign (binary .sig). Signing is gated
on a GPG_KEY_FILE env var so unsigned local builds work without
secrets. The build script uses a temp GNUPGHOME with trap-EXIT
cleanup so it never mutates a shared keyring.

The validation script runs ubuntu:22.04 fresh, imports the public
key (when present), verifies each detached signature, extracts the
archives, runs --version on each binary, and asserts the embedded
git commit matches the build's commit.

The public key is exported into the local build dir solely for the
validation container's use; it is never uploaded as a release
artifact (the canonical public key continues to live only in S3).

Existing build-tgz-pkg.sh / build-linux-binaries.yml wiring is
unchanged in this commit; CI continues using the old path until
the next commit flips the switch.

Verified locally on macOS arm64: task test-build-tarballs produces
signed linux/arm64 tarballs whose signatures verify cleanly in the
validation container and whose binaries report the correct commit.
Replace the inline build-tgz-pkg.sh invocation with a call to
task test-build-tarballs (defined under .github/packaging/),
giving CI the same locally-reproducible build+validate path a
developer can run on macOS via task --taskfile.

Drop the host-side "Build the avalanchego binaries" and "Build
subnet-evm plugin" steps since the build container now owns
binary compilation as well as tarballing. This removes the
host-vs-container glibc skew and makes "what CI does" identical
to "what task test-build-tarballs does locally."

S3 upload is now its own step that explicitly uploads only
*.tar.gz and *.tar.gz.sig from build/tgz/ — the GPG public
key file produced for the validation container stays local
and is excluded from S3 and from GitHub artifacts.

Refactor a couple of run-step expressions to env-var indirection
(${{ github.event.inputs.tag }} and the GPG private key secret)
to address the workflow security hardening hook.

Delete the now-unused .github/workflows/build-tgz-pkg.sh.
Tarballs are overwritten by tar but .sig and GPG-KEY-avalanchego
files are not, so a signed run followed by an unsigned run with
the same tag/arch leaves stale signatures whose contents no longer
match the freshly built tarballs. Validation then fails with
"BAD signature", and signed-then-signed re-runs would also pick up
a stale public key file.

Remove *.tar.gz.sig and GPG-KEY-avalanchego from OUTPUT_DIR at the
top of build-tgz.sh so each run starts from a clean signing state.
The tarballs themselves are preserved (they are about to be
overwritten by tar anyway).
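The cleanup step this commit describes might look like the following (function and file names are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the clean-signing-state step at the top of build-tgz.sh:
# stale signatures and the exported public key are removed; tarballs
# are left in place for tar to overwrite.
set -euo pipefail

clean_signing_state() {
  local out_dir="$1"
  rm -f "${out_dir}"/*.tar.gz.sig "${out_dir}/GPG-KEY-avalanchego"
}
```

Using `rm -f` keeps the step idempotent: it succeeds whether or not a previous signed run left anything behind.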
The tarball build pipeline accepts TGZ_ARCH to choose the target
architecture, but the underlying builder image was always built for
the host arch (uname -m). When TGZ_ARCH differed from the host
(e.g. TGZ_ARCH=amd64 on an arm64 workstation), the subsequent
docker run --platform linux/<TGZ_ARCH> failed because the local
image manifest had no matching platform.

Add an optional BUILDER_PLATFORM env var to build-builder-image.sh:
when set, derive the Go SHA256 checksum from that platform and pass
--platform to docker build so the resulting image's manifest lines
up with what `docker run --platform <target>` expects. When unset,
behavior is unchanged (host arch), so the RPM task is unaffected.

Plumb BUILDER_PLATFORM=linux/${TGZ_ARCH} into the Taskfile's
build-tgz-builder-docker-image task so cross-builds work via:

  TGZ_ARCH=amd64 task --taskfile .github/packaging/Taskfile.yml \
    test-build-tarballs

Verified locally on macOS arm64 by cross-building amd64 tarballs;
`file` on the extracted binaries reports ELF x86-64 and the
validation container (also amd64 via Docker emulation) ran the
binaries successfully.
Following review feedback, removing the BUILDER_PLATFORM env var and the
--platform flags introduced earlier. Each invocation now builds for the
host arch only; CI's per-arch runner matrix (ubuntu-22.04 +
custom-arm64-jammy) provides coverage natively, matching how the RPM
packaging path already works.

The TGZ_ARCH var stays as a filename-only knob that defaults to
the host arch via PACKAGING_TGZ_HOST_ARCH, mirroring RPM's
RPM_ARCH | default .PACKAGING_HOST_ARCH pattern.

Verified locally on macOS arm64: task test-build-tarballs still
produces signed and unsigned arm64 tarballs that validate
end-to-end in the fresh ubuntu:22.04 container.
Two regressions reported on the cross-arch simplification commit
(`83ce0f1b99`):

1. DOCKER_DEFAULT_PLATFORM divergence breaks the builder image build.
   The Dockerfile uses ${TARGETARCH} (resolved by Docker from the
   host platform, --platform flag, or DOCKER_DEFAULT_PLATFORM) to
   download Go, while build-builder-image.sh fetched the SHA256
   for `uname -m`. On Apple Silicon with DOCKER_DEFAULT_PLATFORM=
   linux/amd64 (a common setup) the values diverged and the
   sha256sum -c step inside the Dockerfile failed.

   Fix: build-builder-image.sh now passes --platform linux/${goarch}
   to docker build, pinning Docker's TARGETARCH to the same arch the
   script computed the checksum for. Same script is used by RPM, so
   this also closes the latent equivalent in the RPM path.

   Also pin --platform on the build-tarballs `docker run` and the
   validate-tgz.sh `docker run` so the entire pipeline stays at host
   arch even when DOCKER_DEFAULT_PLATFORM points elsewhere.

2. TGZ_ARCH override silently produced mislabeled tarballs.
   The Taskfile forwarded a user-supplied TGZ_ARCH into the build
   container as PACKAGING_TGZ_ARCH, but neither scripts/build.sh
   nor the subnet-evm build set GOARCH from it — the binaries were
   always at the container's native arch. On arm64,
   `task build-tarballs TGZ_ARCH=amd64` produced
   *-linux-amd64-*.tar.gz containing arm64 binaries, and validation
   would still pass because its container also ran at host arch.

   Fix: build-tgz.sh and validate-tgz.sh now derive arch from
   `uname -m` at runtime (deb-style mapping). The Taskfile no
   longer forwards TGZ_ARCH/PACKAGING_TGZ_ARCH to either script.
   Since each script computes arch from its own runtime env (which
   is pinned to host arch via --platform), filenames always match
   the binary contents.
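The deb-style runtime mapping could be sketched as (the real scripts may cover additional machine strings):

```shell
#!/usr/bin/env bash
# Map `uname -m` to the arch label used in tarball filenames, computed
# at runtime inside the (platform-pinned) container.
set -euo pipefail

host_arch() {
  case "$(uname -m)" in
    x86_64)        echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported machine: $(uname -m)" >&2; return 1 ;;
  esac
}
```

Because each script derives the arch from its own runtime environment, a filename can never disagree with the binaries inside the archive.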

   The validate-tarballs task is also moved from `env:` to
   command-line env-prefix for TAG/GIT_COMMIT, so parent shell
   env vars can no longer shadow them either (a separate Task v3
   quirk: `env:` block doesn't override the parent shell).

Verified locally on macOS arm64:
- `DOCKER_DEFAULT_PLATFORM=linux/amd64 task test-build-tarballs`
  succeeds end-to-end and produces correctly-labeled arm64
  tarballs (host arch overrides DOCKER_DEFAULT_PLATFORM).
- `TGZ_ARCH=amd64 task test-build-tarballs` ignores the override
  and still produces -arm64- tarballs that validate cleanly.
- Both unsigned and signed flows still work; signed run produces
  .sig files and a local GPG-KEY-avalanchego that validation
  imports and verifies in the fresh ubuntu:22.04 container.
The containerized build via test-build-tarballs writes
build/plugins/<vm-id> as root (graft/subnet-evm/scripts/build.sh
mkdir -p's it inside the container). On linux runners without
userns-remap, that directory is root-owned on the host and the
cleanup step's `rm -rf ./build` running as the runner user can't
recurse into it — the cleanup exits non-zero and the workflow job
is marked failed even after the artifacts uploaded successfully.

Use `sudo rm -rf ./build` (passwordless on GitHub runners) to
clean reliably across container-produced files. RPM doesn't hit
this because its cleanup only removes build/rpm (a directory the
script fully owns), not the whole build/ tree.
The combined check `[[ -n "${GPG_KEY_FILE:-}" && -s ... ]]` lumped
two distinct cases into the same "skip signing" branch:

  - GPG_KEY_FILE unset entirely (local dev — unsigned OK)
  - GPG_KEY_FILE set but file is 0 bytes (CI signing secret
    missing or blank — should fail closed, not silently ship
    unsigned release artifacts)

The release workflow writes secrets.RPM_GPG_PRIVATE_KEY to a temp
file unconditionally. If that secret is misconfigured, the file
ends up empty and the prior code silently produced unsigned
.tar.gz files. Validation skipped sig-verify (no public key
present), and the workflow happily uploaded unsigned tarballs to
S3 — exactly the failure mode signing exists to prevent.

Tri-state the check: unset → unsigned (local dev), set-but-empty
→ hard error with actionable message, set-and-non-empty → sign.
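The tri-state gate can be sketched as a small function (names and the exact error message are illustrative):

```shell
#!/usr/bin/env bash
# Tri-state gate on GPG_KEY_FILE:
#   unset          -> unsigned build (local dev)
#   set but empty  -> hard error (misconfigured CI secret; fail closed)
#   set, non-empty -> sign
set -euo pipefail

signing_mode() {
  if [[ -z "${GPG_KEY_FILE+set}" ]]; then
    echo "unsigned"
  elif [[ ! -s "${GPG_KEY_FILE}" ]]; then
    echo "Refusing to produce unsigned release artifacts: GPG_KEY_FILE is empty" >&2
    return 1
  else
    echo "sign"
  fi
}
```

The `${GPG_KEY_FILE+set}` expansion distinguishes "unset" from "set but empty" even under `set -u`, which is what makes the three cases separable.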

Verified locally:
  - No GPG_KEY_FILE env: produces 2 unsigned tarballs (validation
    skips sig-verify, smoke tests pass).
  - GPG_KEY_FILE=$(mktemp) (empty): exits non-zero with the
    "Refusing to produce unsigned release artifacts" error,
    output dir stays empty.
  - GPG_KEY_FILE pointing at a real key: full signed flow,
    .sig + GPG-KEY-avalanchego produced, validation passes.
Previously the build-tarballs task templated the secret values
directly into the docker run command line:

  {{if .GPG_KEY_FILE}}-e GPG_KEY_FILE={{.GPG_KEY_FILE}}{{end}}
  {{if .GPG_PASSPHRASE}}-e GPG_PASSPHRASE={{.GPG_PASSPHRASE}}{{end}}

Task hands the rendered cmd to `sh -c`, so any whitespace or
shell metacharacter in the templated value (a space splits the
arg in two; a $ triggers shell expansion; a quote breaks parsing)
makes docker see a truncated value or extra arguments instead of
the real secret. With a real production passphrase that contains
any of these characters, the release job would fail at signing
even though the secret itself is valid.

Switch to `-e VAR` (no `=value`), which tells docker to forward
the variable from the host process's environment. The value
never touches the shell command line, so it's insensitive to
whitespace, $, quotes, or anything else the secret might contain.
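A stand-in demonstration of the hazard, with `sh -c` playing the role of Task handing off the rendered command (no docker required):

```shell
#!/usr/bin/env bash
# A secret templated into the command text (the old `-e VAR=value` shape)
# is re-parsed by sh and mangled; a secret forwarded by NAME through the
# environment (docker's bare `-e VAR`) arrives intact.
set -euo pipefail

secret='spa ce$and"quote'
export GPG_PASSPHRASE="${secret}"

# Templated: the space splits the arg, the $ expands, the quote breaks
# parsing -- the consumer never sees the real value.
templated="$(sh -c "printf '%s' ${GPG_PASSPHRASE}" 2>/dev/null || true)"

# Forwarded by name: only the variable NAME appears in the command text;
# the value travels through the environment untouched.
forwarded="$(sh -c 'printf %s "$GPG_PASSPHRASE"')"

echo "forwarded=[${forwarded}] templated=[${templated}]"
```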

Verified locally with a passphrase containing a space and `$`:
end-to-end signed build succeeds, all 5 expected artifacts are
produced, validation passes.
The earlier cleanup at the top of build-tgz.sh removed stale .sig
files and the exported public key but left .tar.gz files intact,
on the assumption that tar would just overwrite them. That holds
for same-tag re-runs but not when the tag changes between runs:
the previous tag's tarballs persist with their original filenames.

The release workflow's S3 upload step matches *.tar.gz with a
wildcard, so on persistent runners (notably custom-arm64-jammy)
or after a failed cleanup, a re-run for vY would publish vX
tarballs alongside the vY release.

Extend the cleanup to also remove *.tar.gz from the output dir.
Same-tag re-runs are unaffected (the new run rewrites them);
tag-switch re-runs no longer leak old archives.

Verified locally: built v1.0.0-old, then re-ran with v2.0.0-new
without manual cleanup; only the v2.0.0-new files remained.
@PlatCore PlatCore force-pushed the PlatCore/5160-add-signing-linux-binaries branch from 5ce8a46 to 1d50125 Compare May 2, 2026 01:38
# into it as the unprivileged runner user on linux runners.
sudo rm -rf ./build

build-arm64-binaries-tarball:
Contributor

The arm64 and amd64 workflows appear to be substantially duplicative. Maybe refactor to use a common workflow that can be configured with arch and runner?
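One way to deduplicate, as a sketch (job names and the custom runner label are taken from this PR; the exact step list is illustrative):

```yaml
# Hypothetical shared job parameterized by arch/runner via a matrix
jobs:
  build-binaries-tarball:
    strategy:
      matrix:
        include:
          - arch: amd64
            runner: ubuntu-22.04
          - arch: arm64
            runner: custom-arm64-jammy
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - run: task --taskfile .github/packaging/Taskfile.yml test-build-tarballs
```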

@@ -0,0 +1,115 @@
#!/usr/bin/env bash
Contributor

Is there room for sharing validation logic across tgz/rpm/deb?

Labels

ci (This focuses on changes to the CI process), devinfra

Projects

Status: In Progress 🏗️

Development

Successfully merging this pull request may close these issues.

Update the linux binary builds to include signing

2 participants