AI’s original sin: scraping your data

08/06/2026

Today’s most popular generative AI tools are built on mass privacy violations. The Amnesty International report, Unlawful by Design, examines the models behind GPT-3, Google’s Gemini, Meta’s Llama, DeepSeek, Midjourney, and Stable Diffusion. All rely on scraping billions of public posts, images, and personal data from the web without the explicit consent of the people who created or appear in them. Amnesty calls the practice what it is under EU law and most national laws: unlawful. The scale is staggering: GPT-3 alone was built on data from at least 60 million scraped web domains, and Meta’s CEO openly boasted that the company’s competitive edge came from hundreds of billions of images and videos from Facebook and Instagram.

Bias, environment, and unequal costs

The consequences go beyond privacy. Scraped data carries real-world biases into AI outputs, disproportionately harming racialised and marginalised communities. Research cited in the report found that racial bias in outputs actually worsens as model size increases. A UNESCO study found that women were described in domestic roles four times more often than men across GPT and LLaMA outputs. These aren’t bugs – they’re features of training on a web that already reflects systemic inequality. Environmental costs are mounting too. Google’s greenhouse gas emissions have risen 48% since 2019, partly driven by data centre expansion for AI. Communities in Chile, Mexico, and the American Southwest are already resisting data centres being built in drought-stricken areas. Meanwhile, Indigenous communities in Arizona remain without mains electricity as the state approves increased power production for those same data centres.

A threat to thought and expression

The report also raises a less visible but growing concern: the capacity of generative AI to shape what people believe. Repeated exposure to AI-generated content and algorithmic bias can subtly alter users’ mental models without their awareness – a threat the EU’s AI Act directly addresses through prohibitions on subliminal manipulation and dark patterns. On free expression, the systems’ overwhelming reliance on English-language, Western training data makes them structurally prone to censoring content in other languages and cultural contexts, with documented cases of Palestinian expression being flagged or suppressed.

The regulatory tools exist

For those working on digital rights, tech accountability, or democratic governance, this report is a useful tool. It reframes AI development not as an inevitable march of progress, but as a set of deliberate design choices – choices that can, and must, be regulated. Regulators are beginning to act: Italy’s data protection authority has formally accused OpenAI of GDPR violations, Germany’s Regional Court of Munich found that OpenAI’s models infringe copyright, and the Netherlands’ data protection authority found that chatbots produce politically skewed voting advice. The legal scaffolding exists. What’s needed is the political will to use it. Amnesty’s ask is clear: states should prohibit standalone generative AI systems built on unlawful web scraping, and companies must stop the practice immediately. The AI debate too often gets lost in abstraction. This report brings it back to fundamentals: consent, rights, accountability.

Read the full report
