Microsoft AI Researchers Exposed 38TB of Private Info

Sometimes sensitive data isn't safe, even in the hands of those who are typically charged with securing it. The cloud, AI and a host of other technologies essential to digitization are increasing the odds that data will be exposed. The latest example: Microsoft AI researchers exposed terabytes of data, including private keys and passwords, when they published a link to a storage bucket of open source training data on GitHub.

The 38 terabytes of private data included disk backups of two employees' workstations, something that Microsoft blamed on an excessively permissive Shared Access Signature (SAS) token, a feature in Azure that researchers from Wiz said is difficult to monitor and revoke.

“In addition to the overly permissive access scope, the token was also misconfigured to allow ‘full control’ permissions instead of read-only. Meaning not only could an attacker view all the files in the storage account, but they could delete and overwrite existing files as well,” the researchers said in a blog post.

Given the repository’s original purpose—to offer AI models to be used in training code—researchers found this interesting. “The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library,” they said. “It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.”
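The pickle risk the researchers describe can be shown in a few lines. The sketch below is a minimal, hypothetical demonstration (the `attacker_code` function and payload are illustrative, not artifacts from the incident): during deserialization, pickle calls an object's `__reduce__` method and then invokes whatever callable it returns, so merely loading a crafted file executes attacker-chosen code.

```python
import pickle

# Hypothetical sketch of the pickle deserialization risk: during
# unpickling, pickle calls an object's __reduce__ and then invokes the
# callable it returns. Loading is enough; the victim never has to call
# anything on the "model" object itself.

log = []  # stands in for any side effect an attacker wants

def attacker_code(msg):
    # placeholder for arbitrary attacker logic (e.g. os.system calls)
    log.append(msg)
    return msg

class MaliciousPayload:
    def __reduce__(self):
        # tells pickle: "to rebuild me, call attacker_code(...)"
        return (attacker_code, ("attacker code ran",))

# the attacker publishes this blob as a "model checkpoint"
payload = pickle.dumps(MaliciousPayload())

# the victim merely loads it -- attacker_code runs immediately
result = pickle.loads(payload)
print(log)  # ['attacker code ran']
```

This is why a writable model file is so dangerous: any user who downloads and loads a tampered checkpoint runs the injected code, with no further action required.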

They did stress that the storage account wasn’t directly exposed to the public. “In fact, it was a private storage account. The Microsoft developers used an Azure mechanism called ‘SAS tokens,’ which allows you to create a shareable link granting access to an Azure Storage account’s data—while upon inspection, the storage account would still seem completely private,” the researchers said.
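For contrast, a SAS token can be scoped far more tightly than the one in the incident. The following is a hypothetical Azure CLI sketch (the account and container names are placeholders) minting a container-level, read/list-only token with an explicit expiry, rather than a full-control token over the whole account:

```shell
# Hypothetical example using the Azure CLI: mint a container-scoped SAS
# token that is read (r) and list (l) only and expires on a set date,
# instead of a full-control, account-wide token.
# "aimodels" and "robust-models" are placeholder names.
az storage container generate-sas \
    --account-name aimodels \
    --name robust-models \
    --permissions rl \
    --expiry 2024-12-31T00:00Z \
    --https-only \
    --output tsv
```

Short expiry matters because a SAS signed with the account key cannot be individually revoked; invalidating it requires rotating the account key or tying the token to a stored access policy, which is part of why the Wiz researchers called such tokens hard to monitor and revoke.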

The situation underscores the new risks of leveraging AI, which has researchers handling large amounts of training data, the Wiz researchers noted. The speedy move to create AI solutions means that organizations must put additional security checks and safeguards on the data handled by data scientists and engineers.

“Data is meant to be shared; but sharing data securely on the cloud today is like driving a car with a tiller fast and close to the edge of a cliff,” said Mohit Tiwari, co-founder and CEO at Symmetry Systems.

“Even security experts routinely get cloud access controls wrong. SAS tokens are risky because they are similar to shared links to folders that you hand out but have no good way to keep track of,” said Tiwari. “AWS has a deep web of policies covering role, attribute, bucket, cross-account, service control, etc.—to the extent that once you have 8,000 S3 buckets and 900PB [of data], you can’t say who is accessing what across your environment.”

Monitoring data access, he said, “is like drinking from the firehose of billions of data activity logs per day. Microsoft has recently been in the news for making data events expensive; and even when free, almost no cloud customer monitors data events at scale.”

The rush to embrace AI has increased the potential for security problems. “While AI can be a useful tool, organizations need to be aware of the potential risks associated with utilizing tools that leverage this relatively nascent technology,” said Patrick Tiquet, vice president, security and architecture, at Keeper Security.

“The recent incident involving the exposure of AI training data through SAS tokens highlights the need to treat AI tools with the same level of caution and security diligence as any other tool or system that may be used to store and process sensitive data,” Tiquet said.

“With AI in particular, the amount of sensitive data being used and stored can be extremely large,” he said.

“In some cases, organizations need assurances from AI providers that sensitive information will be kept isolated within their organization and not be used to cross-train AI products, potentially divulging sensitive information outside of the organization through AI collective knowledge,” he said. “The implementation of AI-powered cybersecurity tools requires a comprehensive strategy that also includes supplementary technologies to boost limitations as well as human expertise to provide a layered defense against the evolving threat landscape.”

Tiwari said that detecting SAS tokens, finding public S3 buckets and the like "are useful point solutions—but a data security foundation is critical if we are to innovate and be secure."

He took issue with framing the incident as a SAS token problem, which he called chasing the symptom. “There are hundreds of such pitfalls; while the root problem is data inventory and access permissions and monitoring,” Tiwari said. “Without that, a bug in any mechanism will land you at the same outcome—look at public S3 buckets, Wiz’s chaosdb vulnerability from last year, Twitter’s insider threat whistleblower reports, etc.”

SAS tokens authorize anyone holding them "to access something, but lack the authentication and auditing that come with other more secure IAM mechanisms," he said.

“This focus on authorization seems to be unfortunately necessary in certain scenarios, where you want to share data more broadly, such as open sourcing,” Tiwari said. “But without knowing what data it provides access to and monitoring in place to ensure that the data being accessed is in line with expectations, you are reliant on upfront configuration that can’t be audited until it’s too late.”

Teri Robinson

From the time she was 10 years old and her father gave her an electric typewriter for Christmas, Teri Robinson knew she wanted to be a writer. What she didn’t know is how the path from graduate school at LSU, where she earned a Masters degree in Journalism, would lead her on a decades-long journey from her native Louisiana to Washington, D.C. and eventually to New York City where she established a thriving practice as a writer, editor, content specialist and consultant, covering cybersecurity, business and technology, finance, regulatory, policy and customer service, among other topics; contributed to a book on the first year of motherhood; penned award-winning screenplays; and filmed a series of short movies. Most recently, as the executive editor of SC Media, Teri helped transform a 30-year-old, well-respected brand into a digital powerhouse that delivers thought leadership, high-impact journalism and the most relevant, actionable information to an audience of cybersecurity professionals, policymakers and practitioners.