Microsoft AI involuntarily exposed a secret giving access to 38TB of confidential data for 3 years

by Thomas Segura on September 26, 2023

The Wiz Research team recently discovered that an overprovisioned SAS token had been exposed on GitHub for nearly three years. The token granted access to a 38-terabyte trove of private data in Azure Storage, which also contained additional secrets, such as private SSH keys, inside the disk backups of two Microsoft employees' workstations. The incident underscores the importance of robust data security measures.

What happened?

Wiz Research recently disclosed a data exposure incident found on Microsoft’s AI GitHub repository on June 23, 2023.

The researchers managing the GitHub repository used an Azure Storage sharing feature, a SAS token, to give access to a bucket of open-source AI training data.

This token was misconfigured: it gave access to the entire storage account rather than only the intended bucket.

The storage account held 38TB of data, including disk backups of two employees’ workstations containing secrets, private keys, passwords, and more than 30,000 internal Microsoft Teams messages.

Shared Access Signatures (SAS) are signed URLs for sharing Azure Storage resources. They are configured with fine-grained controls over how a client can access the data: which resources are exposed (the full account, a container, or a selection of files), with what permissions, and for how long. See the Azure Storage documentation.
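To make those controls concrete, here is a minimal sketch that reads the scoping parameters out of a SAS URL using only the Python standard library. The parameter names (`sv`, `ss`, `srt`, `sr`, `sp`, `st`, `se`, `si`, `sig`) are the ones Azure Storage uses in SAS query strings; the helper names and the example URL are made up for illustration, and the URL is not the token from the incident.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameters used in Azure Storage SAS URLs and what each one controls.
SAS_FIELDS = {
    "sv": "signed storage service version",
    "ss": "signed services (account SAS)",
    "srt": "signed resource types (account SAS)",
    "sr": "signed resource (service SAS)",
    "sp": "signed permissions",
    "st": "start time",
    "se": "expiry time",
    "si": "stored access policy identifier",
    "sig": "signature",
}


def describe_sas(url: str) -> dict:
    """Return the SAS-related query parameters found in a URL."""
    query = parse_qs(urlsplit(url).query)
    return {name: query[name][0] for name in SAS_FIELDS if name in query}


def sas_kind(fields: dict) -> str:
    """Guess the SAS type from which scoping parameters are present."""
    if "ss" in fields and "srt" in fields:
        return "account"
    if "sr" in fields:
        return "service or user delegation"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    example = (
        "https://exampleaccount.blob.core.windows.net/"
        "?sv=2020-08-04&ss=b&srt=sco&sp=rwdlac"
        "&se=2051-10-06T00:00:00Z&sig=REDACTED"
    )
    fields = describe_sas(example)
    print("SAS type:", sas_kind(fields))
    for name, value in fields.items():
        print(f"{name:>3} = {value:<22} ({SAS_FIELDS[name]})")
```

Reading a token this way before sharing it is a cheap sanity check: the presence of `ss` and `srt` means the URL is scoped to the account, not to a single container.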

After Wiz disclosed the incident to Microsoft, the SAS token was invalidated. Nearly three years elapsed between its first commit to GitHub (July 20, 2020) and its revocation. See the timeline presented by the Wiz Research team.

Why did the token have such an extended lifespan? The timeline shows that, once the token had expired, its expiration date was extended by a further 30 years. This longevity isn't surprising when you consider that the token was intentionally created to be shared and to grant access to training data.

Yet, as emphasized by the Wiz Research team, the Shared Access Signature (SAS) was misconfigured.

Data Exposure

The token allowed anyone to access an additional 38TB of data, including sensitive data such as secret keys, personal passwords, and over 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees.

Here is an excerpt from some of the most sensitive data recovered by the Wiz team:

Not only was the access scope excessively permissive, but the token was also misconfigured to grant "full control" permissions instead of read-only. This means that an attacker not only had the ability to view all the files in the storage account but could also delete and overwrite existing files.
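The three problems described so far (account-wide scope, more than read access, and a very long lifetime) can all be read from the token's own query string. The sketch below flags them with the Python standard library. The function name, the seven-day threshold, and the example URLs are assumptions made for illustration; this is not how Wiz or Microsoft assessed the token.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Permission letters that allow changing data: write, delete, add, create.
MODIFYING_PERMISSIONS = set("wdac")


def audit_sas(url: str, now: datetime, max_days: int = 7) -> list:
    """Return human-readable findings for an over-permissive SAS URL."""
    query = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    findings = []

    # `srt` (signed resource types) only appears on account-level SAS tokens.
    if "srt" in query:
        findings.append("account-level scope")

    risky = sorted(MODIFYING_PERMISSIONS & set(query.get("sp", "")))
    if risky:
        findings.append("grants more than read/list: " + "".join(risky))

    expiry = query.get("se")
    if expiry:
        expires = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires.tzinfo is None:  # date-only values carry no timezone
            expires = expires.replace(tzinfo=timezone.utc)
        days_left = (expires - now).days
        if days_left > max_days:
            findings.append(f"expires in {days_left} days (limit {max_days})")
    return findings


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    url = (
        "https://exampleaccount.blob.core.windows.net/"
        "?sv=2020-08-04&ss=b&srt=sco&sp=rwdlac"
        "&se=2051-10-06T00:00:00Z&sig=REDACTED"
    )
    for finding in audit_sas(url, now=datetime(2023, 6, 23, tzinfo=timezone.utc)):
        print("-", finding)
```

A read-only, container-scoped token that expires within the threshold produces no findings, which is roughly the shape the shared link should have had.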

As highlighted by the researchers, this could have allowed an attacker to inject malicious code into the storage blob that could then automatically execute with every download by a user (presumably an AI researcher) trusting in Microsoft's reputation, which could have led to a supply chain attack.

Also read Examples of software supply chain attacks

Security Risks

According to the researchers, Account SAS tokens such as the one described in their research pose a high security risk, because they are highly permissive, long-lived tokens that escape the monitoring perimeter of administrators.

When a user generates a new token, it is signed on the client side, in the browser, and doesn't trigger any Azure event. To revoke a token, an administrator needs to rotate the signing account key, which revokes every other token signed with that key at once.
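The sketch below illustrates why that is. A SAS signature is an HMAC-SHA256 computed over a "string to sign" with the storage account key, so producing one is a purely local computation. The string to sign used here is deliberately simplified and is not the exact field layout Azure uses; the keys and function names are also made up for illustration.

```python
import base64
import hashlib
import hmac


def sign(string_to_sign: str, account_key_b64: str) -> str:
    """Compute a base64 HMAC-SHA256 signature with a base64-encoded key."""
    key = base64.b64decode(account_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid(string_to_sign: str, signature: str, current_key_b64: str) -> bool:
    """What the service does: recompute the signature with the current key."""
    expected = sign(string_to_sign, current_key_b64)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    # Hypothetical account keys; real keys are generated by Azure.
    old_key = base64.b64encode(b"old-account-key").decode("ascii")
    new_key = base64.b64encode(b"rotated-account-key").decode("ascii")

    # Simplified stand-ins for the real string to sign (assumption).
    token_a = "exampleaccount\nrl\n2023-06-25T00:00:00Z"
    token_b = "exampleaccount\nrwdlac\n2051-10-06T00:00:00Z"

    # Signing happens locally: no request is sent, so nothing is logged.
    sig_a = sign(token_a, old_key)
    sig_b = sign(token_b, old_key)

    print("before rotation:", is_valid(token_a, sig_a, old_key), is_valid(token_b, sig_b, old_key))
    # Rotating the key invalidates every token that was signed with it.
    print("after rotation: ", is_valid(token_a, sig_a, new_key), is_valid(token_b, sig_b, new_key))
```

Because the service only checks the signature against the current key, there is no list of issued tokens to inspect or revoke individually, which is the monitoring gap the researchers describe.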

Ironically, the security risk of a Microsoft product feature (Azure SAS tokens) caused an incident for a Microsoft research team, a risk recently referenced by the second version of the Microsoft threat matrix for storage services:

Secrets Sprawl

This example perfectly underscores the pervasive issue of secrets sprawl within organizations, even those with advanced security measures. Intriguingly, it highlights how an AI research team, or any data team, can independently create tokens that could potentially jeopardize the organization. These tokens can cleverly sidestep the security safeguards designed to shield the environment.

Read the State of Secrets Sprawl 2023

Mitigation strategies

For Azure Storage users:

1 – Avoid Account SAS tokens

The lack of monitoring makes this feature a security hole in your perimeter. A better way to share data externally is using a Service SAS with a Stored Access Policy. This feature binds a SAS token to a policy, providing the ability to centrally manage token policies.

Better still, if you don't need this Azure Storage sharing feature, simply disable SAS access for each storage account you own.

2 – Enable Azure Storage analytics

Active SAS token usage can be monitored through the Storage Analytics logs for each of your storage accounts. Azure Metrics allows the monitoring of SAS-authenticated requests and identifies storage accounts that have been accessed through SAS tokens, for up to 93 days.
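As a rough illustration of what that monitoring can look like, the sketch below counts SAS-authenticated requests per storage account from Storage Analytics log lines. It assumes the version 1.0 log format, which is semicolon-delimited with the authentication type in the eighth field and the owner account name in the tenth; check the format of your own logs before relying on these positions. The sample lines are invented for the example.

```python
import csv
import io
from collections import Counter

AUTH_TYPE_FIELD = 7      # assumed position of the authentication type
OWNER_ACCOUNT_FIELD = 9  # assumed position of the owner account name


def sas_requests_per_account(log_text: str) -> Counter:
    """Count requests authenticated with a SAS token, per storage account."""
    counts = Counter()
    # The request URL is quoted and may contain ';', so use a CSV reader.
    reader = csv.reader(io.StringIO(log_text), delimiter=";", quotechar='"')
    for row in reader:
        if len(row) > OWNER_ACCOUNT_FIELD and row[AUTH_TYPE_FIELD].lower() == "sas":
            counts[row[OWNER_ACCOUNT_FIELD]] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines in the assumed format, for illustration only.
    sample = "\n".join([
        '1.0;2023-06-20T10:00:00.0000000Z;GetBlob;SASSuccess;200;18;12;sas;;'
        'exampleaccount;blob;"https://exampleaccount.blob.core.windows.net/data/a.bin?sv=2020-08-04&sr=c&sig=REDACTED"',
        '1.0;2023-06-20T10:01:00.0000000Z;GetBlob;Success;200;20;15;authenticated;'
        'exampleaccount;exampleaccount;blob;"https://exampleaccount.blob.core.windows.net/data/b.bin"',
        '1.0;2023-06-20T10:02:00.0000000Z;ListBlobs;SASSuccess;200;25;20;sas;;'
        'exampleaccount;blob;"https://exampleaccount.blob.core.windows.net/data?sv=2020-08-04&sr=c&sig=REDACTED"',
    ])
    for account, count in sas_requests_per_account(sample).items():
        print(f"{account}: {count} SAS-authenticated requests")
```

An account that shows SAS traffic nobody can explain is a good candidate for a key rotation and a review of who generated the token.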

For all:

1 – Audit your GitHub perimeter for sensitive secrets

With around 90 million developer accounts, 300 million hosted repositories, and 4 million active organizations, including 90% of Fortune 100 companies, GitHub holds a much larger attack surface than meets the eye.

Last year, GitGuardian uncovered 10 million leaked secrets on public repositories, up 67% from the previous year.

GitHub must be actively monitored as part of any organization's security perimeter. Incidents involving leaked credentials on the platform continue to cause massive breaches for large companies, and this security hole in Microsoft's protective shell is a reminder of the Toyota data breach from a year ago.

On October 7, 2022, Toyota, the Japan-based automotive manufacturer, revealed they had accidentally exposed a credential allowing access to customer data in a public GitHub repo for nearly 5 years. The code was public from December 2017 through September 2022. While Toyota says they have invalidated the key, an exposure this long could mean multiple malicious actors had already acquired access.

Being able to detect exposed sensitive tokens on GitHub is a unique feature of GitGuardian's Public Monitoring system. It allows security analysts to quickly inspect an organization's footprint on the platform, identify valid secrets, and assess the severity of incidents. What is more, the engine can include developers’ personal public repositories, where 80% of corporate credentials are leaked, in an organization's perimeter.

If your company has development teams, it is very likely that some of your company's secrets (API keys, tokens, passwords) will end up on public GitHub, so you should evaluate your GitHub attack surface by requesting a complimentary audit.
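For the specific kind of secret involved in this incident, even a naive check goes a long way. The sketch below scans text for Azure Storage URLs that carry a `sig=` parameter and reports them with the signature redacted. It is a deliberately simple pattern match written for this article, not a description of how GitGuardian's detection engine works, and the sample text is invented.

```python
import re

# Azure Storage endpoint URLs; account names are 3-24 lowercase letters/digits.
STORAGE_URL = re.compile(
    r"https://[a-z0-9]{3,24}\.(?:blob|file|queue|table|dfs)"
    r"\.core\.windows\.net/[^\s\"'<>]*"
)
HAS_SIGNATURE = re.compile(r"[?&]sig=")
SIGNATURE_VALUE = re.compile(r"(?<=[?&]sig=)[^&]+")


def find_sas_urls(text: str) -> list:
    """Return (line number, redacted URL) for each SAS URL found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STORAGE_URL.finditer(line):
            url = match.group(0)
            if HAS_SIGNATURE.search(url):
                # Never echo the signature itself into scan output.
                hits.append((lineno, SIGNATURE_VALUE.sub("REDACTED", url)))
    return hits


if __name__ == "__main__":
    # Invented README content, for illustration only.
    readme = (
        "Download the models with:\n"
        "azcopy copy 'https://exampleaccount.blob.core.windows.net/models"
        "?sv=2020-08-04&sp=rl&sig=abc123' ./models\n"
        "Docs: https://exampleaccount.blob.core.windows.net/docs/index.html\n"
    )
    for lineno, url in find_sas_urls(readme):
        print(f"line {lineno}: {url}")
```

A check like this can run as a pre-commit hook or CI step, but it only covers one secret type in one repository; the historical commits and personal repositories mentioned above still need a broader audit.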

2 – Lay out traps in the form of honeytokens

Do you need time to restructure governance around cloud storage access, yet need to be alerted if highly sensitive parts get scanned by a malicious actor?

Your best allies are honeytokens. These tokens are decoy AWS secrets you can deploy strategically across your software assets to regain observability in the grey areas of your IT infrastructure. Getting the attackers' IP addresses, user agent, what actions they were attempting, and the timestamps of each attempt will help you thwart attempts before they can inflict damage on your software supply chain.

Final words

Every organization, regardless of size, needs to be prepared to tackle a wide range of emerging risks. These risks often stem from insufficient monitoring of extensive software operations within today's modern enterprises. In this case, an AI research team inadvertently created and exposed a misconfigured cloud storage sharing link, bypassing security guardrails. But how many other departments – support, sales, operations, or marketing – could find themselves in a similar situation? The increasing dependence on software, data, and digital services amplifies cyber risks on a global scale.

Combatting the spread of confidential information and its associated risks necessitates reevaluating security teams' oversight and governance capabilities. It also requires the provision of appropriate tools to identify and counteract emerging threat categories. While human errors are an inevitable part of the process, GitGuardian is here to guide you along your security journey.

Thomas Segura

What You Need to Scale AppSec

Thomas Segura - Content Writer @ GitGuardian

Author Bio

Thomas has worked both as an analyst and as a software engineer consultant for various big French companies. His passion for tech and open source led him to join GitGuardian as a technical content writer. He now focuses on clarifying the transformative changes that cybersecurity and software are going through.

Website: https://www.gitguardian.com/
Twitter handle: https://twitter.com/GitGuardian
LinkedIn: https://www.linkedin.com/company/gitguardian

Introduction

Security is a dilemma for many leaders. On the one hand, it is largely recognized as an essential feature. On the other hand, it does not drive business. Of course, as we mature, security can become a business enabler. But the roadmap is unclear. With the rise of agile practices, DevOps and the cloud, development timeframes have been considerably compressed, but application security remains essentially the same.

DevSecOps emerged as an answer to this dilemma. Its promise consists literally in inserting security principles, practices, and tools into the DevOps activity stream, reducing risk without compromising deliverability. So there is a question that many are asking: why isn't DevSecOps already the norm?

As we analyzed in our latest report, DevSecOps: Protecting the Modern Software Factory, the answer can be summarized as follows: only by enabling new capacities across Dev, Sec and Ops teams can the culture be changed. This post provides a high-level overview of the prerequisite steps needed to scale up application security across departments and enable such capabilities.

From requirements to expectations

Scaling application security is a company-wide project that requires thorough thinking before any decision is made. A first requirement is to talk to product and engineering teams to understand the current global AppSec maturity.

The objective at this point is to have a comprehensive understanding of how your products are made (the processes, tools, components, and stacks involved). Mapping development tools and practices will take time if you want the best possible visibility. The mapping should include product development practices and the perceived risk awareness and appetite of managers. One of your objectives would be to nudge them to take security into account in every decision they make for their products, and maybe end up thinking like adversaries.

You should be able to derive security requirements from the different perceived risks you are going to encounter. Your job is to consolidate these into a common set for all applications, setting goals to align the different teams collaborating to build your product(s).

Communicating transparently with all relevant stakeholders (CISO, technical security, product owner, and development leads) about goals and expectations is essential to create a common ground for improvement. It will also be necessary to ensure alignment throughout the implementation.

Open and accessible guardrails

Guardrails are the cornerstone of security requirements. Their nature and implementation are completely up to the needs of your organization and can be very different from one company to another (if starting from scratch, look no further than the OWASP Top 10). What is most important, however, is that these guardrails are open to those who need them. A good example would be to centralize a common, security-approved library of open-source components that any team can pull from.

Keep users' accessibility and usability as a priority. Designing an AppSec program at scale requires asking “how can we build confidence and visibility with trusted tools in our ecosystem?”. For instance, control gates should never be implemented without considering a break-glass option (“what happens if the control is blocking in an emergency situation?”).

State-of-the-art security means having off-the-shelf secure solutions chosen by the developers, approved by security, and maintained by ops. This will be a big leap forward in preventing vulnerabilities from creeping into source code. It will bring security to the masses at a very low cost (low friction). But to truly scale application security, it would be silly not to use the software engineer's best ally: the continuous integration pipeline.

Embed controls in the CI/CD

Rolling out AppSec testing across all development pipelines is the implementation step. If your organization has multiple development teams, it is very likely that different CI/CD pipeline configurations exist in parallel. They may use different tools, or simply define different steps in the build process. This is not a problem per se, but to scale application security, centralization and harmonization are needed.

As illustrated in the following example CI/CD pipeline, you can have a lot of security control steps: secrets detection, SAST, artifact signing, access controls, but also container or Infrastructure as Code scanning (not shown in the example).

(Example pipeline taken from the DevSecOps whitepaper.)

The idea is that you can progressively activate more and more control steps, fine-tune the existing ones, and scale your “AppSec infrastructure” both horizontally and vertically, on one condition: you need to centralize metrics and controls in a stand-alone platform able to handle the load corresponding to your organization’s size.

Security processes can only be automated when you have metrics and proper visibility across your development targets; otherwise, it is just more burden on the AppSec team's shoulders. In turn, metrics and visibility help drive change and provide the spark to ignite a cultural change within your organization. Security ownership shifts to every engineer involved in the delivery process, and each one is able to leverage their own deep (yet partial) knowledge of the system to support the effort.

This unlocks a world of possibilities: most security flaws can be treated like regular tickets, rule sets can be optimized for each pipeline based on criticality, capabilities or regulatory compliance, and progress can be tracked (saved time, avoided vulnerabilities, etc.). In simpler terms, security can finally move at DevOps speed.

Conclusion

Security can’t scale if it’s siloed, and slowing down the development process is no longer an option in a world led by DevOps innovation. The design and implementation of security controls are bound to evolve. In this article, we’ve given a high-level overview of the steps to consider to scale AppSec.

This starts with establishing a set of security requirements that involve all the departments, in particular product-related ones. From there it becomes possible to design guardrails that make security truly accessible with a mix of hard and soft gates. By carefully selecting automated detection and remediation that provide visibility and control, you will be laying a solid foundation for a real model of shared responsibility for security.

Finally, embedding checks in the CI/CD system can be rolled out in multiple phases to progressively scale your security operations. With automated feedback in place, you can start incrementally adjusting your policies.

A centralized platform creates a common interface to facilitate the exchange between application security and developer teams while enforcing processes. It is a huge opportunity to automate and propagate best practices across teams. Developers are empowered to develop faster with more ownership. When security is rethought as a partnership between software-building stakeholders, a flywheel effect can take place: reduced friction leads to better communication and visibility, more best practices get automated, and everyone's work gets easier while security improves with fewer defects. This is how application security will finally be able to scale through continuous improvement.
