Data Collection and Cleaning for Large Language Model Pretraining at Web Scale

Training a large language model isn’t about throwing more GPUs at the problem anymore. It’s about data. The difference between a model that understands context and one that just repeats garbage comes down to what you feed it - and how clean it is. Companies like OpenAI, Meta, and Apple now spend more time and money cleaning data than building architectures. In 2025, the most advanced models are trained on datasets of 10-13 trillion tokens, pulled from billions of web pages. But 90% of that raw data? It’s useless. Or worse - harmful.

Where Does the Data Even Come From?

Most large language models start with Common Crawl, a non-profit archive that’s been scraping the web since 2008. It holds over 25 billion web pages - text from blogs, forums, news sites, product pages, and spammy SEO dumps. It’s the closest thing we have to a mirror of the public internet. But it’s messy. A single crawl might include duplicate product listings from Amazon, Reddit threads with 500 identical replies, and entire websites copied verbatim across dozens of domains.

Other sources include RefinedWeb (a cleaned version of Common Crawl), GitHub repositories, Wikipedia, books from Project Gutenberg, and curated academic datasets. Some companies also license proprietary data - legal documents, medical records (with consent), or customer support logs. But even then, you’re not just copying and pasting. You’re building a pipeline.

The Four-Stage Cleaning Pipeline

There’s no single tool that fixes everything. Cleaning web-scale data is a multi-stage process, like filtering water through layers of sand, charcoal, and membranes.

  1. URL and Domain Filtering - First, you remove entire domains known for low quality: spam sites, link farms, adult content, and sites that auto-generate text. This alone cuts 40-60% of the data. Tools like the Dolma pipeline use heuristics: if a page has fewer than 100 words or more than 50% non-alphabetic characters, it’s tossed.
  2. Document Quality Scoring - Not all pages from good domains are good. You run each document through a lightweight classifier - often a small transformer model trained to recognize “high-quality” text. It looks for things like sentence structure, lexical diversity, and whether the text reads like a human wrote it. Pages with repetitive phrases (“buy now,” “click here,” “best price guaranteed”) get flagged. This removes another 25-35%.
  3. Deduplication - This is where things get technical. Duplicate content doesn’t just waste compute - it pushes the model toward memorizing exact phrases instead of learning general patterns. Simhash fingerprints (64-bit hashes of text chunks) are used to find near-identical paragraphs; a minimal sketch pairing these fingerprints with the stage-1 heuristics follows this list. One team reduced deduplication time on a 50TB corpus from 14 days to 9 hours using this method. Some teams do paragraph-level deduplication instead of document-level - it’s slower but improves downstream performance by 7.3%.
  4. Safety and Toxicity Filtering - This is the hardest part. You don’t want your model spitting out hate speech, misinformation, or illegal content. But over-filtering kills nuance. A 2024 survey of 127 ML engineers found 68% struggled with false positives - especially in medical or legal text. One model flagged “abortion” as toxic because it appeared in anti-choice forums. Others flagged scientific terms like “chemical weapons” because they were mentioned in news reports. The solution? Use multiple filters: rule-based lists for obvious cases, then a reward model trained to score safety on a sliding scale. Meta’s research suggests you can retain 30-40% of the data after this stage without hurting performance.
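
To make stages 1 and 3 concrete, here’s a minimal sketch of the word-count and non-alphabetic heuristics plus simhash-based near-duplicate detection. It uses word-level features, MD5 as a stand-in hash function, and a brute-force pairwise comparison; real pipelines bucket fingerprints into bands so they never compare every pair. The 100-word and 50% thresholds mirror the ones above; the 3-bit Hamming threshold is illustrative.

```python
import hashlib
import re

def simhash64(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint from word-level features."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def passes_heuristics(text: str, min_words: int = 100, max_nonalpha: float = 0.5) -> bool:
    """Stage-1 style checks: minimum word count and share of non-alphabetic
    characters (whitespace is not counted against the document)."""
    if len(text.split()) < min_words:
        return False
    stripped = text.replace(" ", "").replace("\n", "")
    nonalpha = sum(1 for ch in stripped if not ch.isalpha())
    return len(stripped) > 0 and nonalpha / len(stripped) <= max_nonalpha

def clean_corpus(docs, threshold: int = 3):
    """Drop documents that fail the heuristics or sit within `threshold` bits
    of an already-kept document's fingerprint."""
    kept, fingerprints = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        fp = simhash64(doc)
        if all(hamming_distance(fp, seen) > threshold for seen in fingerprints):
            kept.append(doc)
            fingerprints.append(fp)
    return kept
```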

Why More Data Isn’t Always Better

You might think: “If 13 trillion tokens are good, 20 trillion must be better.” Not true. Apple’s BETR method (Benchmark-Targeted Ranking), published in November 2024, showed something surprising: models trained on smaller, smarter datasets outperformed those trained on massive, unfiltered ones. BETR didn’t just filter out bad data - it picked documents that looked like the kinds of questions and answers used in evaluation benchmarks. The result? A 2.1x improvement in performance with the same compute. Larger models (70B+ parameters) actually benefit less from aggressive filtering - they can handle noise better. But smaller models? They need precision.

This flips the old assumption: scaling up isn’t the answer anymore. Targeting is.
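
The ranking idea itself is easy to sketch. The snippet below is a simplified benchmark-similarity filter in the spirit of BETR, not Apple’s exact recipe: it embeds benchmark examples and candidate documents, scores each document by its closest benchmark example, and keeps the top slice. It assumes the sentence-transformers library; the model name and the 20% keep fraction are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def rank_by_benchmark_similarity(documents, benchmark_examples, top_fraction=0.2):
    """Keep the documents most similar to evaluation-style examples."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    doc_emb = model.encode(documents, normalize_embeddings=True)
    bench_emb = model.encode(benchmark_examples, normalize_embeddings=True)
    # With normalized embeddings, cosine similarity is a plain dot product.
    scores = (doc_emb @ bench_emb.T).max(axis=1)  # best benchmark match per document
    keep = max(1, int(len(documents) * top_fraction))
    top_idx = np.argsort(-scores)[:keep]
    return [documents[i] for i in top_idx]
```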

Synthetic Data: The New Wild Card

When real data is scarce - say, for high-level math reasoning or legal analysis - you generate it. DeepSeek-R1’s “cold-start” method uses reinforcement learning to create synthetic chain-of-thought examples. A model generates a math problem, then a second model checks the reasoning. Only examples with correct logic and clear steps get kept. Rejection sampling filters out 90% of the output. The result? High-quality, verified data that didn’t exist on the web.
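
A schematic of that rejection-sampling loop, with generate_example() and verify_reasoning() as hypothetical stand-ins for the generator and checker models described above:

```python
def collect_synthetic_examples(generate_example, verify_reasoning,
                               target_count=1_000, max_attempts=20_000):
    """Keep generating candidates until enough pass verification (or we give up)."""
    kept, attempts = [], 0
    while len(kept) < target_count and attempts < max_attempts:
        candidate = generate_example()   # e.g. a math problem plus a chain-of-thought solution
        attempts += 1
        if verify_reasoning(candidate):  # second model or rule-based checker validates the steps
            kept.append(candidate)
    # Expect most candidates to be discarded; that selectivity is the point.
    return kept
```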

But there’s a catch. Synthetic data can introduce hallucinations that look real. If you train on too many generated examples, the model starts believing its own fiction. That’s why teams use hybrid approaches: 70% real, 30% synthetic. Gartner predicts 65% of enterprise LLMs will use synthetic data by 2026 - up from 25% in 2024.
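
Enforcing that ratio can be as simple as capped sampling; the 30% cap below mirrors the split above, and everything else is illustrative.

```python
import random

def mix_corpora(real_docs, synthetic_docs, synthetic_ratio=0.3, total=100_000, seed=0):
    """Draw a training mix that caps synthetic examples at the given ratio."""
    rng = random.Random(seed)
    n_synth = min(int(total * synthetic_ratio), len(synthetic_docs))
    n_real = min(total - n_synth, len(real_docs))
    return rng.sample(real_docs, n_real) + rng.sample(synthetic_docs, n_synth)
```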

The Hidden Costs

Most people don’t realize how expensive this process is. Building a pipeline takes 3-6 months. You need distributed systems experts (Spark, Flink), NLP engineers, and cloud architects. One Reddit user spent eight weeks just removing forum spam - it made up 22% of their raw corpus. Copyright filtering eats up 35-40% of resources, yet adds little to model performance. And GDPR compliance? Processing user deletion requests can consume 15% of your pipeline’s bandwidth.

The Dolma dataset, which processed 3.8 trillion tokens, required 50 specialized nodes running for months. The learning curve? Four to six months just to get new engineers up to speed.

Legal and Ethical Landmines

The EU AI Act, effective February 2025, requires full documentation of every piece of training data - where it came from, who owns it, whether consent was obtained. That’s a nightmare for web-scraped data. Law firms like DLA Piper estimate this adds 20-30% more preprocessing work. Fenwick & West warns that up to 25% of existing datasets could be legally questionable due to copyright violations.

And then there’s privacy. Princeton researchers developed Min-K% Prob, a method that can detect if a specific sentence was in a model’s training data - just by analyzing its probability scores. That means even if you delete a website, the model might still remember it. This isn’t theoretical. In late 2024, a model was shown to regurgitate entire patient records from anonymized medical forums.
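
The core of Min-K% Prob is simple to sketch: score a passage by the average log-probability of its least-likely k% of tokens; passages the model saw in training tend to score noticeably higher. Below is a minimal version assuming a Hugging Face causal LM - the gpt2 model name and k=20% are placeholders, and the paper’s calibration details are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob_score(text, model_name="gpt2", k=0.2):
    """Average log-probability of the lowest-probability k% of tokens in `text`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    k_count = max(1, int(len(token_log_probs) * k))
    lowest = torch.topk(token_log_probs, k_count, largest=False).values
    return lowest.mean().item()  # higher score = more likely the text was in training data
```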

What Works Today - And What Doesn’t

Here’s what the best teams are doing in 2025:

  • Using simhash for deduplication - fast, scalable, accurate.
  • Training small models to score quality before using big ones - saves millions in compute.
  • Aligning data to benchmarks, not just volume - Apple’s BETR is now a standard.
  • Keeping synthetic data under 30% - enough to fill gaps, not enough to distort.
  • Documenting everything - not just for compliance, but for debugging.

And here’s what fails:

  • Trying to clean everything with rules - language is too messy.
  • Filtering out all “controversial” topics - you end up with a bland, useless model.
  • Ignoring multilingual data - 100+ languages need separate identification pipelines.
  • Thinking data cleaning is a one-time task - models evolve, so must your data.

The Future Is Data-Centric

McKinsey found that in 2024, 57% of companies spent more on data preparation than model development. That’s up from 38% in 2022. The era of “bigger models” is over. The next leap won’t come from a new attention mechanism - it’ll come from better data.

By 2027, Gartner predicts 80% of enterprise LLMs will use task-specific corpora: a model for legal advice won’t train on Reddit threads. It’ll train on court rulings, contracts, and legal FAQs. A medical assistant won’t read Wikipedia - it’ll read peer-reviewed journals and clinical notes.

The bottleneck isn’t compute. It’s curation. The teams that win aren’t the ones with the most GPUs. They’re the ones with the cleanest data.

How much data do you actually need to train a large language model?

State-of-the-art models like GPT-4 are trained on around 13 trillion tokens. But that’s after cleaning. Raw data collection starts at 100+ terabytes - often 500TB to 2PB for big tech companies. After filtering, only 10-25% remains. For smaller models (7B-30B parameters), 1-5 trillion cleaned tokens are often enough. Quality matters more than quantity.

Can I use Common Crawl for my own LLM project?

Yes - and most teams do. Common Crawl is free and open. But it’s raw. You’ll need your own pipeline to clean it. Expect to spend weeks just removing duplicates and spam. Tools like Dolma and RefinedWeb offer pre-cleaned versions, but they’re still large (3-4 trillion tokens) and require heavy filtering for domain-specific use. If you’re building a medical or legal model, you’ll need to supplement it with curated data.
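
If you do pull Common Crawl yourself, the snapshots ship as WARC files, and a typical first step is iterating over HTML response records and extracting plain text before any of the filtering above. A minimal sketch assuming the warcio and trafilatura packages (the file path is a placeholder):

```python
import trafilatura
from warcio.archiveiterator import ArchiveIterator

def extract_texts(warc_path):
    """Yield plain text from HTML response records in a Common Crawl WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)  # strips boilerplate; returns None when it fails
            if text:
                yield text

# Usage: for doc in extract_texts("CC-MAIN-sample.warc.gz"): ...
```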

Is synthetic data reliable for training LLMs?

It can be - if used carefully. Synthetic data works best for niche tasks where real examples are rare: advanced math, code generation, or technical reasoning. The key is validation. Use rejection sampling, human review, or a second model to check outputs. Never rely on synthetic data alone. Mix it with real data (70/30 ratio) to avoid hallucinations. Teams using it effectively report 20-40% performance gains on targeted tasks.

What’s the biggest mistake people make in data cleaning?

Thinking it’s a one-time task. Data cleaning isn’t a step - it’s an ongoing process. As your model improves, you’ll find new patterns of failure: bias in certain regions, overfitting to forum language, or hallucinations from low-quality sources. Re-evaluate your pipeline every 3-6 months. Also, don’t over-filter. Removing too much “controversial” content makes your model useless in real-world scenarios.

How do I handle multilingual data?

Use language identification models like FastText or cld3. They can detect over 100 languages with 95%+ accuracy. But don’t just filter by language - clean each one separately. English data has different noise patterns than Japanese or Arabic. A spam filter trained on English will miss spam in other languages. Build language-specific pipelines, or at least apply different thresholds per language.
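
A minimal language-routing pass with fastText’s public lid.176 model; the confidence threshold is illustrative, and the model file has to be downloaded separately.

```python
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # fastText's released language-ID model

def detect_language(text, min_confidence=0.9):
    """Return an ISO language code, or None if the prediction is low-confidence."""
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)  # predict() wants one line
    lang = labels[0].replace("__label__", "")
    return lang if probs[0] >= min_confidence else None

def split_by_language(documents):
    """Route documents into per-language buckets so each gets its own cleaning pipeline."""
    buckets = {}
    for doc in documents:
        lang = detect_language(doc)
        if lang is not None:
            buckets.setdefault(lang, []).append(doc)
    return buckets
```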

Do I need to worry about copyright if I use web data?

Absolutely. Courts are still deciding whether training on publicly scraped data violates copyright. Legal analysts warn that 15-25% of existing datasets could be challenged. If you’re building a commercial model, consider licensed data sources or focus on public domain content (like Wikipedia, Project Gutenberg). Document every source. Under the EU AI Act, you’ll need to prove data provenance - or risk fines.
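
Provenance is much easier to prove if it travels with each document through the pipeline. A minimal, purely illustrative record (field names and values are hypothetical) might look like this:

```python
# Illustrative per-document provenance record; field names are hypothetical.
provenance_record = {
    "doc_id": "cc-2025-08-000123",
    "source_url": "https://example.org/article",
    "crawl_snapshot": "CC-MAIN-2025-08",
    "license": "unknown",            # flags the document for legal review
    "collected_at": "2025-02-14",
    "filters_passed": ["url_filter", "quality_score>=0.8", "dedup", "toxicity<0.2"],
    "consent_basis": None,           # populated only for licensed or user-provided data
}
```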

Comments

Destiny Brumbaugh

lol they spend billions cleaning data but still can't stop models from saying 'as an AI' every 3 sentences. we're training AIs to be polite robots instead of letting them be messy human mirrors. fix the output, not the input.

January 1, 2026 AT 06:38

Sara Escanciano

This is why we can't have nice things. Every time someone tries to make AI 'safe', they end up neutering it. You think banning the word 'abortion' from training data makes the model better? No. It just makes it dumb. The real problem is people who think language can be sanitized like a hospital ward.

January 1, 2026 AT 12:09

Elmer Burgos

Honestly i think the synthetic data part is wild. I get why people do it but it feels like teaching a kid to ride a bike by having them watch a video of someone else riding it. Eventually they forget what real roads look like. Still, if it helps with niche tasks i guess its worth it as long as you dont overdo it.

January 1, 2026 AT 21:24
