Almost 10% of GenAI Prompts Include Sensitive Data: Study
Since OpenAI first released ChatGPT more than two years ago, a persistent enterprise worry about generative AI has been that sensitive or corporate data fed into large language models (LLMs) could be exposed. A recent study by cybersecurity firm Harmonic Security puts some numbers to those concerns.
The San Francisco- and London-based startup, in an analysis of tens of thousands of prompts entered into ChatGPT and other AI models last year by enterprise employees, found that 8.5% of them contained sensitive information, the bulk of it customer and employee data.
The findings reinforce data security concerns that organizations have about generative AI, despite the emerging technology’s promise of business benefits such as greater efficiency and speed. That puts companies in a quandary, Harmonic researchers wrote: “Organizations risk losing their competitive edge if they expose sensitive data. Yet at the same time, they also risk losing out if they don’t adopt GenAI and fall behind.”
“The combination of diverse sensitive data types and heavy reliance on free tools is alarming,” the researchers wrote in a blog post accompanying the research. “Blocking access to GenAI tools, however, is not a viable solution. Their appeal is so strong that employees bypass restrictions, often resorting to personal devices and networks, which increases Shadow IT use.”
That puts a greater burden on companies to ensure employees can use generative AI securely and to educate them about the risks of uploading sensitive data and of using less secure, free-tier versions of the AI models, the researchers wrote.
Thousands of Prompts Analyzed
Harmonic, which was founded in 2023 and in October 2024 secured $17.5 million in Series A funding for its AI model-based data protection technology, analyzed tens of thousands of prompts that were input into ChatGPT and other models, including Microsoft’s Copilot, Google Gemini, Anthropic’s Claude, and Perplexity.
“In the vast majority of cases, employees are not doing anything crazy,” the researchers wrote. “They want to summarize a piece of text. They want to edit a blog. They want to write documentation for code. This is the mundane reality of GenAI usage across the enterprise. While most prompts were benign, 8.5% of prompts contained sensitive information.”
Customer, Employee Data High on List
Of that sensitive information, more than 45% involved customer information and another almost 27% involved employee data. Almost 15% related to legal and financials, with security (6.88%) and code (5.64%) rounding out the top five.
The researchers were able to drill down into the various categories. For example, customer reports – such as insurance claims – accounted for almost 66% of the sensitive customer data entered into AI models. Payroll and personally identifiable information (PII) accounted for more than 99% of the sensitive employee information, while data about sales pipelines, M&A, investment portfolios, and legal discussions were prominent in the financial and legal category.
Information about network configurations, security incident reports, access keys, and proprietary source code also ranked high within their respective categories.
The Risk of Free-Tier AI Models
Organizations often try to get some control over the use of AI models by designating a corporate-approved tool, the researchers wrote. This is usually done via agreements from vendors that include assurances that the company’s data won’t be used to train the AI model. However, not all employees will use only the paid corporate accounts.
Last year, 63.8% of ChatGPT users relied on the free tier, which accounted for 53.5% of the sensitive prompts entered into ChatGPT. In addition, 58.62% of Gemini users, 75% of Claude users, and 50.48% of Perplexity users used free versions of those models.
“The use of free tiers is significant and can exacerbate risks given they don’t have the security features that come with enterprise versions,” they wrote. “Many free-tier tools explicitly state they train on customer data, meaning sensitive information entered could be used to improve models, further compounding risks.”
An Ongoing Concern
Data leakage from generative AI models is an ongoing problem, as incidents over the past couple of years involving companies such as Amazon and Samsung have shown. Most companies are trying to get ahead of it. A study by the Enterprise Strategy Group (ESG) last year found that 96% of organizations are either enforcing or creating governance structures for the use of generative AI, and that 82% are worried about data leakage.
The Harmonic researchers wrote in the report that blocking employee use of certain AI models isn’t enough to keep sensitive data from leaking. They urged organizations to track and manage in real time the data being entered into AI models, ensure employees use paid plans, create workflows that dictate how specific departments can use generative AI, and continue to educate users on best practices and the risks of using the models.
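To make that first recommendation a little more concrete, here is a minimal, hypothetical sketch in Python of what screening prompts for sensitive data before they reach a GenAI tool could look like. It uses a handful of crude regular expressions to flag obvious PII; the pattern names and the screen_prompt function are illustrative assumptions, not Harmonic’s technology or any vendor’s API, and a production system would rely on far more sophisticated detection.

```python
import re

# Hypothetical illustration only: a crude regex screen for a few common PII
# patterns. Real data-protection tools are far more sophisticated; this just
# shows the general idea of checking a prompt before it leaves the company.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number_like": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def screen_prompt(prompt: str) -> list[str]:
    """Return the names of any PII patterns found in the prompt."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(prompt)]

if __name__ == "__main__":
    prompt = "Summarize this claim for jane.doe@example.com, SSN 123-45-6789."
    findings = screen_prompt(prompt)
    if findings:
        print(f"Blocked: prompt appears to contain {', '.join(findings)}")
    else:
        print("Prompt passed screening; safe to send to the GenAI tool.")
```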