AI, Web Scraping and the Transformation of Data Privacy: What the EDPB’s Rulings Mean for Businesses
Web scraping — the automated extraction of publicly available information from websites — has long been a standard method for gathering data at scale. Traditionally, businesses, researchers and developers have used it for purposes ranging from price comparison tools to market analysis. But with the rise of artificial intelligence, web scraping is no longer just about collecting raw data. AI transforms this data, embedding it into machine learning models that can generate insights, predict behaviors and even infer new information about individuals in ways that were never intended when the data was first made public.
As AI-driven scraping becomes more sophisticated, so do the legal and ethical challenges surrounding its use. The European Data Protection Board (EDPB) has responded with a series of rulings, most notably Opinion 28/2024 on AI models, adopted in December 2024, reinforcing the principle that even publicly accessible data is still subject to privacy protections under the General Data Protection Regulation (GDPR). The latest rulings make it clear: Businesses leveraging scraped data for AI development must rethink their strategies to remain compliant.
The Legal Landscape: From Unauthorized Access to AI-Driven Privacy Risks
In the United States, legal disputes over web scraping have typically revolved around whether accessing publicly available data without permission constitutes unauthorized access under the Computer Fraud and Abuse Act (CFAA). In hiQ Labs v. LinkedIn, the Ninth Circuit held that scraping publicly accessible data does not constitute unauthorized access under the CFAA, though the ruling left open questions about how website terms of service and contractual limitations apply.
Europe, however, has taken a different approach, shifting the focus from unauthorized access to data privacy violations. Under GDPR, personal data remains protected regardless of whether it is publicly available. This creates a stark contrast with U.S. legal frameworks, particularly when AI is used to aggregate, analyze and infer new insights from scraped data.
The EDPB’s recent opinions emphasize that AI transforms the nature of scraped data, turning isolated public records into highly structured, inferential models that can re-identify individuals, predict behavior and process information in ways that go beyond what the data subjects originally consented to. The key question is no longer whether data is public, but rather how it is used and whether its processing is lawful under GDPR.
How AI Changes the Privacy Risks of Scraped Data
Web scraping alone does not inherently violate privacy laws. But when AI models use scraped data, they fundamentally alter its characteristics. Unlike static datasets, AI models generalize, predict and infer information based on the training data they ingest. This creates several privacy risks:
First, AI models can reveal information beyond what was originally scraped. Even if direct identifiers such as names or email addresses are removed, AI models can learn patterns that make it possible to re-identify individuals through inference. The EDPB warns that AI models trained on personal data may still be subject to GDPR if they allow for indirect identification through queries or membership inference attacks.
Second, AI models can amplify biases and inaccuracies present in scraped data. Publicly available data is often incomplete, outdated, or biased toward certain demographics. When AI models ingest this data, they inherit these biases, creating ethical and legal concerns about fairness, accuracy and potential discrimination.
Finally, the retention of scraped data for AI model training raises compliance risks. Unlike traditional databases, AI models do not merely store information; they encode it in ways that make deletion difficult. The EDPB’s opinions stress that if a dataset includes unlawfully processed personal data, the resulting AI model may also be unlawful, unless it has been sufficiently anonymized.
What Businesses Need to Do in Response to the EDPB’s Rulings
Companies that rely on web scraping for AI model training must act now to ensure compliance with the latest EDPB guidance. The days of indiscriminate data collection are over. Businesses must be able to justify their scraping activities, document their data sources and implement safeguards that align with GDPR principles.
First and foremost, businesses should re-evaluate their legal basis for scraping and AI model training. The EDPB’s December 2024 opinion reiterates that legitimate interest can sometimes serve as a basis for processing personal data, but only if strict conditions are met. Companies must demonstrate that their data collection is necessary, proportionate and does not override the rights of data subjects. For example, if an AI model is used to improve cybersecurity or provide beneficial automated services, legitimate interest may be arguable, but only if individuals’ privacy rights are adequately protected.
In addition, businesses need to assess whether their AI models can truly be considered anonymous. The EDPB makes it clear that anonymity is not a given. Regulators will evaluate AI models on a case-by-case basis, considering whether personal data could be extracted through reverse engineering or inference. If a model retains identifiable information, it remains subject to GDPR. Companies should conduct privacy impact assessments (PIAs) and adversarial testing to assess whether their models are vulnerable to re-identification attacks.
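Such adversarial testing need not be elaborate to be informative. The sketch below is a deliberately simplified illustration of the underlying idea, not a reference to any real model or tool: a toy model that memorizes its training points reports noticeably higher confidence on those points than on unseen ones, and that confidence gap is precisely the signal a membership inference attack exploits. All names here (`ToyModel`, `membership_inference_risk`, the threshold) are hypothetical.

```python
import math

class ToyModel:
    """Toy stand-in for a model under audit: a 1-nearest-neighbour
    'model' that effectively memorizes its training set."""

    def __init__(self, train_points):
        self.train_points = train_points

    def confidence(self, point):
        # Confidence rises as the query nears a training point --
        # the leakage that membership inference attacks exploit.
        d = min(math.dist(point, t) for t in self.train_points)
        return 1.0 / (1.0 + d)

def membership_inference_risk(model, members, non_members, threshold=0.9):
    """Crude audit: compare how often training members vs. outsiders
    clear a high-confidence threshold. A large gap signals leakage."""
    member_rate = sum(model.confidence(p) > threshold for p in members) / len(members)
    outsider_rate = sum(model.confidence(p) > threshold for p in non_members) / len(non_members)
    return member_rate, outsider_rate

train = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
outside = [(5.0, 5.0), (7.0, 3.0), (6.0, 6.0)]
model = ToyModel(train)
member_rate, outsider_rate = membership_inference_risk(model, train, outside)
print(member_rate, outsider_rate)  # → 1.0 0.0
```

A real assessment would use established attack tooling against the production model, but even this toy gap illustrates why "the data was public" does not make a trained model anonymous.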
Another critical step is implementing robust documentation and compliance mechanisms. Businesses should maintain detailed records of their data collection practices, sources and filtering techniques. This includes:
Keeping logs of web scraping activities and ensuring that data collection aligns with a clearly defined purpose.
Implementing strict data minimization measures to exclude unnecessary personal information from datasets before AI training.
Regularly auditing AI models to ensure that they do not retain or infer personal data in unintended ways.
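As an illustration of the minimization step, filtering can begin before any training run. The sketch below is a minimal example under assumed inputs, not a complete solution: the field names and the email pattern are illustrative, and a production pipeline would need far broader identifier detection.

```python
import re

# Hypothetical pre-training filter: drop obvious direct identifiers
# from scraped records before they reach a training set.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DIRECT_ID_FIELDS = {"name", "email", "phone", "user_id"}  # assumed schema

def minimize(record):
    """Keep only fields outside the direct-identifier set,
    and scrub email-like strings from free text."""
    cleaned = {}
    for key, value in record.items():
        if key in DIRECT_ID_FIELDS:
            continue
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted]", value)
        cleaned[key] = value
    return cleaned

record = {"name": "Jane Doe", "email": "jane@example.com",
          "bio": "Contact me at jane@example.com", "city": "Berlin"}
print(minimize(record))
# e.g. {'bio': 'Contact me at [redacted]', 'city': 'Berlin'}
```

Running such a filter at ingestion, and logging what it removed, also produces exactly the kind of documentation trail the EDPB expects.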
Furthermore, businesses should adopt stronger governance frameworks for data deletion and opt-out mechanisms. Under GDPR, individuals have the right to request deletion of their personal data. The challenge for AI developers is that data in a trained model is not easily removable. Companies should develop mechanisms for model retraining or fine-tuning that allow for compliance with data subject rights.
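One pragmatic pattern, sketched below under assumed record shapes, is to keep a registry of erasure requests and filter the training corpus against it before every retraining run, so that honoring a deletion request does not require surgery on the already-trained model. The subject IDs and field names are illustrative.

```python
# Hypothetical erasure registry consulted before each retraining run.
erasure_requests = {"subject-123"}

training_records = [
    {"subject_id": "subject-123", "text": "scraped profile text"},
    {"subject_id": "subject-789", "text": "another scraped record"},
]

def filter_for_retraining(records, erased):
    """Drop every record tied to a subject who has requested deletion,
    so the next training run never sees their data."""
    return [r for r in records if r["subject_id"] not in erased]

clean = filter_for_retraining(training_records, erasure_requests)
print([r["subject_id"] for r in clean])  # → ['subject-789']
```

This does not erase what an existing model has already encoded, which is why the EDPB's anonymization analysis still matters, but it ensures each retraining cycle moves the model toward compliance rather than away from it.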
For companies that allow or facilitate scraping on their platforms, the implications are equally serious. Platforms that do not implement safeguards against scraping — such as robots.txt, rate limits, or access controls — may be seen as failing to protect user data, potentially exposing them to regulatory penalties. The EDPB’s opinions suggest that platforms have a responsibility to ensure that publicly available data is not misused for AI model training without oversight.
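On the platform side, even the simplest of the safeguards named above, robots.txt, expresses a machine-readable scraping policy that compliant crawlers check. The sketch below uses Python's standard `urllib.robotparser` to evaluate a hypothetical policy that shuts out an AI-training crawler entirely while restricting other agents only from sensitive paths; the bot names and URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block an AI-training crawler everywhere,
# and keep all agents out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("ExampleAIBot", "https://example.com/profiles"))  # → False
print(parser.can_fetch("OtherBot", "https://example.com/profiles"))      # → True
print(parser.can_fetch("OtherBot", "https://example.com/private/x"))     # → False
```

robots.txt is advisory rather than enforceable on its own, which is why the EDPB's framing pairs it with technical controls such as rate limits and authentication.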
Finally, businesses must stay ahead of upcoming AI regulations. The EDPB has signaled that more specific guidelines on AI-driven data collection are forthcoming, including regulations that may impose stricter controls on the use of scraped data for machine learning. Companies should prepare by aligning their AI development processes with GDPR’s privacy-by-design principles, ensuring that data protection is embedded into AI systems from the outset.
Looking Ahead: The Future of AI and Web Scraping
The EDPB’s opinions on web scraping and AI mark a shift in regulatory thinking. Rather than focusing solely on access restrictions, regulators are now scrutinizing how AI transforms and processes data, assessing whether models trained on scraped information remain compliant with privacy laws.
For businesses, this means that compliance can no longer be an afterthought. Companies must adopt a proactive approach, integrating privacy safeguards into their AI workflows from the start. Those that fail to adjust to the new regulatory environment may face not only legal challenges but also reputational risks, as consumers and regulators alike demand greater accountability in AI development.
Ultimately, the future of AI-driven data collection will be defined by those who embrace responsible AI practices, prioritize data protection and build transparency into their systems. Organizations that adapt now will be well-positioned to navigate the evolving regulatory landscape while continuing to innovate in an AI-powered world.