Hundreds of Clusters Attacked Due to Unpatched Flaw in Ray AI Framework
Thousands of servers running AI workloads are under attack by threat actors exploiting an unpatched vulnerability in the open-source Ray AI framework – widely used by such companies as OpenAI, Uber, Amazon, Netflix, and Cohere – giving hackers entrée to huge amounts of data and compute power.
The campaign has been ongoing for at least seven months, compromising AI production workloads. The exposure could lead to stolen or infected models, as well as leaked database credentials, private SSH keys that give attackers access to more machines built from the same VM image template, and tokens for various platforms. Those tokens give hackers access to accounts on services like Hugging Face, Slack, and Stripe and to cloud environments ranging from Microsoft Azure and Google Cloud Platform to Amazon Web Services and Lambda Labs.
In addition, the attackers were able to compromise infrastructure, including hundreds of compute clusters running GPUs from Nvidia, and install cryptomining tools, according to researchers with cybersecurity firm Oligo. Each cluster includes nodes that run on powerful GPUs, such as Nvidia’s A6000 chips, which are foundational to AI workloads. Hackers can leverage the GPUs to run their power-hungry cryptominers, including XMRig, NBMiner, and Java-based Zephyr miners.
Some attackers also are using reverse shells, which enable them to run arbitrary code in the production environment and maintain persistence in the systems. They’re also using the open-source service Interactsh – which is designed to detect out-of-band interactions – to evade detection.
“When attackers get their hands on a Ray production cluster, it is a jackpot,” Oligo researchers Avi Lumelsky, Guy Kaplan, and Gal Elbaz wrote in a report. “Valuable company data plus remote code execution makes it easy to monetize attacks – all while remaining in the shadows, totally undetected (and, with static security tools, undetectable). A trove of sensitive information has been leaked via the compromised servers.”
A Disputed CVE
Last year, Anyscale, which maintains the Ray framework, was notified of five vulnerabilities that were disclosed in November. Four of them were fixed immediately in the master branch and released as part of Ray 2.8.1.
However, the fifth vulnerability – CVE-2023-48022 – remains in dispute, with Anyscale saying the issue is not a flaw. The company in a blog post last year wrote that the compute framework does not enforce authentication or support an authorization model. The onus is on the user, not the framework’s maintainer, according to the company.
“Due to Ray’s nature as a distributed execution framework, Ray’s security boundary is outside of the Ray cluster,” Anyscale wrote. “That is why we emphasize that you must prevent access to your Ray cluster from untrusted machines (e.g., the public internet). This is why the 5th CVE (the lack of authentication built into Ray) has not been addressed, and why it is not in our opinion a vulnerability, or even a bug.”
At the time, the company wrote that it would include isolation controls within Ray like authentication in a future release.
Anyscale officials this week, noting that Oligo had notified them of active exploitation of CVE-2023-48022, wrote in a blog post that the company is releasing tools that let users verify their clusters are properly configured and guard against accidental exposure. Anyscale provided both client-side and server-side code, which can be found on GitHub. The tooling is made available under the Apache 2.0 license.
“We’ve also pre-configured the defaults of the client side script to reach out to a server we have set up to simplify the process of determining whether or not ports are unexpectedly open,” the company wrote.
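The basic idea behind such a check can be sketched in a few lines of Python. This is not Anyscale's tool, only a minimal illustration of testing whether a TCP port accepts connections; the function name `is_port_open` and the timeout value are assumptions made for this example. To say anything about exposure, it would have to be run from a machine outside the cluster's trusted network, against a cluster you own.

```python
import socket


def is_port_open(host: str, port: int = 8265, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper for illustration. 8265 is the default Ray dashboard
    port cited in the Oligo report.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A result of `True` from an untrusted network would suggest the dashboard port is reachable and the firewall or security-group rules need review; a result of `False` only shows that this one port did not answer from this one vantage point.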
Shadow Vulnerability
The problem, according to Oligo, is that because the vulnerability was disputed, many development teams and static scanning tools are unaware of it. Because it’s being actively exploited in the wild, the disputed issue becomes a “shadow vulnerability,” a CVE that doesn’t pop up in static scans but can still lead to breaches and losses.
“Ray does not include any kind of authorization in its Jobs API,” they wrote. “The result: anyone with dashboard network access (HTTP port 8265) could potentially invoke arbitrary jobs on the remote host, without authorization. Ray includes code execution capabilities by design, so Anyscale believes the users should be responsible for its locality and security.”
Thousands of publicly exposed Ray servers around the globe have been compromised due to the vulnerability, which Oligo calls “ShadowRay.”
“Many of the machines included command history, making it much easier for attackers to understand what resides on the current machine and possibly leaking sensitive secrets from production that were used in previous commands,” Lumelsky, Kaplan, and Elbaz wrote. “A typical AI environment contains a wealth of sensitive information – enough to take an entire company down.”
Ray is Now a Best Practice
The Ray framework is used to scale both AI and Python applications and includes a distributed runtime – Ray Core – and complementary AI libraries and extensions for accelerating and distributing domain-specific machine-learning workloads. It has 30,000 stars on GitHub, indicating its high popularity.
“Models like [OpenAI’s] GPT-4 comprise billions of parameters, requiring massive computational power,” the Oligo researchers wrote. “Such large models cannot possibly fit on the memory of a single machine. Ray is the enabling technology that allows these models to run.”
It’s become a best practice in the industry, they wrote.
Shadow vulnerabilities are often invisible to scanning approaches. The researchers said that to detect these hard-to-see flaws, organizations need to continuously monitor the runtime environment, looking for exploit signs like specially crafted inputs or data being loaded from untrusted sources.
To secure a Ray environment, the framework needs to be run in a secured and trusted environment, with firewall rules or security groups added to prevent unauthorized access. In addition, users need to add authorization on top of the Ray Dashboard port – 8265 by default – and “if you do need Ray’s dashboard to be accessible, implement a proxy that adds an authorization layer to the Ray API when exposing it over the network,” they wrote.
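What such an authorization layer checks can be illustrated with a short Python sketch. It is an assumption-laden example rather than anything Oligo or Anyscale published: the bearer-token scheme, the function names `is_authorized` and `gate`, and the `forward` callback standing in for the call to the real dashboard are all hypothetical. A production deployment would more likely rely on an established reverse proxy with TLS.

```python
import hmac
from typing import Callable, Mapping, Tuple

Response = Tuple[int, bytes]


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Check for an 'Authorization: Bearer <token>' header matching the secret.

    Hypothetical helper; the header lookup here is case-sensitive.
    """
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how many characters matched via timing.
    return hmac.compare_digest(token.encode(), expected_token.encode())


def gate(headers: Mapping[str, str], expected_token: str,
         forward: Callable[[], Response]) -> Response:
    """Forward the request only if authorized; otherwise answer 401."""
    if not is_authorized(headers, expected_token):
        return 401, b"Unauthorized"
    # `forward` stands in for relaying the request to the Ray dashboard.
    return forward()
```

The point of the design is that the request never reaches the Ray API unless the check passes, so the framework's lack of built-in authentication is compensated for at the network boundary.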