Using local GPG private keys on an HPC cluster behind a load balancer
Abstract
Recently, I needed to run a process on an HPC cluster that required a secret, but I wanted to avoid storing my private key as a file on the cluster for security reasons. Instead, I looked for a way to decrypt an encrypted secret on the HPC while keeping my private key securely on my local machine. A great solution for this is GPG agent forwarding, which allows a remote machine to use a local GPG agent to decrypt secrets. This worked well when I could log into a single head node, but it broke when my HPC cluster implemented a load balancer that assigned me to a random node each time I logged in. The typical approach of deleting the existing agent socket and then reconnecting became unreliable. This post explains the problem in detail, walks through several failed solutions, and ultimately presents the working method I found to maintain secure GPG agent forwarding even when connecting through a randomized load balancer.
Background
I needed to run a process on a remote HPC that required access to a secret. Normally, I store secrets as encrypted files and decrypt them into memory using a private key. However, storing my private key on the HPC was not an option: administrators have superuser privileges, which makes it a security risk. Instead, I wanted to use agent forwarding, which lets me keep my private key on my local machine while still decrypting secrets on the remote server over SSH.
GPG provides a method for this via GPG Agent Forwarding, explained in the GPG Agent Forwarding documentation. The core idea is to map the local GPG agent's extra socket (agent-extra-socket) to the standard agent socket (agent-socket) on the remote server using an SSH RemoteForward. This setup allows the remote system to use my local GPG keyring, prompting for a passphrase on my local machine whenever the key is accessed remotely.
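In practice, the mapping can live in ~/.ssh/config. Here is a minimal sketch, assuming a single fixed login node reachable under the hypothetical alias hpc-login and the default socket locations for UID 1000 (substitute the paths that gpgconf reports on each side):

# ~/.ssh/config (hpc-login is a placeholder; adjust paths to your gpgconf output)
Host hpc-login
    # RemoteForward <remote agent-socket> <local agent-extra-socket>
    RemoteForward /run/user/1000/gnupg/S.gpg-agent /run/user/1000/gnupg/S.gpg-agent.extra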
A small annoyance in this setup is that the remote agent-socket is created automatically by GPG at startup, and it must be deleted or overwritten before the local one can be forwarded onto it. Ideally, the SSH option StreamLocalBindUnlink would handle this automatically, but my HPC does not have it enabled, and enabling it requires admin privileges. Without it, I had to manually delete the socket before establishing the SSH connection with RemoteForward.
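Concretely, the routine looked roughly like this (again using the hypothetical hpc-login alias; the first command assumes gpgconf is available on the login node):

ssh -o ClearAllForwardings=yes hpc-login 'rm -f $(gpgconf --list-dir agent-socket)'   # remove the socket the remote GPG agent created
ssh hpc-login                                                                         # reconnect; the RemoteForward can now bind cleanly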
This was an acceptable workaround (log in, delete the socket, then reconnect with forwarding enabled), but then the load balancer ruined everything.
The problem
Our local HPC recently started allowing logins only through a load-balanced head node, meaning that when I SSH into the cluster, I get forwarded to a random login node. My previous workaround, logging in to a specific node to delete the socket before reconnecting to that same node, became unreliable: since I had no control over which node I landed on, I couldn't guarantee that I was reconnecting to the same one where I had just deleted the socket. This randomness made the whole approach fall apart.
Failed solutions
Understanding GPG agent socket architecture
Before exploring solutions, it’s useful to understand how GPG agent sockets work. A GPG agent creates several socket files upon initialization:
- S.gpg-agent (standard socket)
- S.gpg-agent.extra (extra socket, intended for remote forwarding)
- S.gpg-agent.ssh (SSH authentication socket)
- S.gpg-agent.browser (browser integration socket)
These sockets follow the Unix "everything is a file" philosophy: they live on the filesystem and can be deleted, replaced, and forwarded much like regular files. Their paths can be determined using:
gpgconf --list-dir agent-socket         # standard socket (run on the remote node to get the forward target)
gpgconf --list-dir agent-extra-socket   # extra socket (run locally to get the forward source)
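On a typical systemd-based Linux system these resolve to the per-user runtime directory, for example (assuming UID 1000):

/run/user/1000/gnupg/S.gpg-agent
/run/user/1000/gnupg/S.gpg-agent.extra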
For forwarding to work, the local extra socket (S.gpg-agent.extra) is mapped to the standard socket (S.gpg-agent) on the remote system via SSH RemoteForward. Any request to the remote GPG agent then gets redirected back to the local machine, keeping private keys secure while still allowing you to use them to decrypt secrets on the remote machine.
Attempted (and failed) solutions
- Brute-force deletion across all nodes
  - I considered writing a shell script that SSHes into the cluster repeatedly, deleting the socket on every possible login node, and then keeps reconnecting until I randomly land on a node where the socket has already been deleted. This felt extremely inefficient and hacky, relying purely on luck.
- Using SSH to delete the socket before forwarding
  - Ideally, the SSH command itself could delete the socket before attempting to set up forwarding.
  - However, SSH seems to establish forwards before running the remote command, making this ineffective.
  - Additionally, I struggled to get SSH to return a shell after executing the remote command.
- Relocating the socket to a shared filesystem
  - If the GPG agent could use a different directory for its socket, one located on a shared filesystem rather than the per-node /run/user directory, then deleting the socket from one node would affect all nodes.
  - I found some documentation on configuring extra-socket in gpg-agent.conf (see the sketch after this list), but despite my efforts, I couldn't get GPG to recognize a custom socket location.
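For reference, the option the documentation describes looks roughly like this; the shared-filesystem path is hypothetical:

# ~/.gnupg/gpg-agent.conf (hypothetical shared path for illustration)
extra-socket /shared/home/username/.gnupg/S.gpg-agent.extra

In my case, the agent kept creating its sockets in the default per-node location regardless.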
The working solution: SSH ControlMaster
The breakthrough came with SSH's ControlMaster feature, which allows multiple SSH sessions to share a single persistent connection. Because every session reuses that connection, all commands run on the same randomly assigned login node.
Steps to make it work
Start a persistent master SSH connection:
ssh -M -S ~/.ssh/control-socket user@hpc-domain
This establishes an SSH session and keeps the connection open for reuse.
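If you would rather not keep an interactive shell open just for the master, the connection can also be started in the background with no remote command:

ssh -f -N -M -S ~/.ssh/control-socket user@hpc-domain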
Delete the GPG agent socket on the same connection:
ssh -S ~/.ssh/control-socket user@hpc-domain "rm -f /run/user/$(id -u)/gnupg/S.gpg-agent"
Since this runs over the same ControlMaster session, it executes on the same node.
Reconnect with RemoteForward enabled on the same session:
ssh -S ~/.ssh/control-socket -o "RemoteForward /run/user/$(id -u)/gnupg/S.gpg-agent $(gpgconf --list-dir agent-extra-socket)" user@hpc-domain
Now that the socket is deleted, the forwarding works reliably!
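As a quick sanity check, further commands over the control socket should land on the same node and reach the local agent; here secret.gpg is a hypothetical file encrypted to my public key:

ssh -S ~/.ssh/control-socket user@hpc-domain "hostname"                  # prints the same node name every time
ssh -S ~/.ssh/control-socket user@hpc-domain "gpg --decrypt secret.gpg"  # passphrase prompt appears on the local machine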
This works because the persistent connection ensures that all commands execute on the same randomly assigned node. This allows me to delete the GPG socket before establishing forwarding, solving the problem of the load balancer. Once the forwarding is set up, any subsequent GPG operations on the HPC use my local private key without storing anything sensitive on the remote system.
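To avoid typing the control-socket flags every time, the same behaviour can be baked into ~/.ssh/config; the host alias and timeout below are placeholders:

# ~/.ssh/config (hpc is a hypothetical alias)
Host hpc
    HostName hpc-domain
    User user
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h-%p
    ControlPersist 10m

With this, every plain ssh hpc invocation shares one connection, and therefore one login node, for up to ten minutes after the last session closes.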
Conclusion
GPG agent forwarding is a powerful way to use encrypted secrets remotely without compromising security. However, load-balanced clusters introduce challenges when trying to delete and replace agent sockets. While traditional methods fail due to node randomness, SSH’s ControlMaster provides a reliable way to maintain the connection to a specific node, ensuring the necessary setup steps occur in the correct order.
If you’re working with HPC clusters and need secure secret management, this technique might save you a lot of frustration!