April 2026 · 18 min read

The Architecture of Forgotten Consent

How the WARC archive format makes takedowns meaningless, why model collapse is already running, and why the only fix is provenance at creation.

Most commentary on AI training data focuses on the legal and ethical dimensions of consent. That framing, while important, obscures a more technically consequential argument: the infrastructure through which AI training data flows was architecturally incapable of honoring consent even when everyone involved was acting in good faith. The problem is not that bad actors circumvented the rules. The problem is that the system's design made the rules structurally unenforceable from the start.

This piece examines three interlocking technical failures: the WARC archive architecture that makes publisher takedowns effectively meaningless for AI training pipelines; the model collapse literature that proves quality degradation from synthetic and low-quality data substitution is not a risk but a mathematical certainty; and the Grok case history as the first fully documented production failure of training data provenance collapse at scale.

Together they form a single argument: consent infrastructure and data quality infrastructure are the same infrastructure, and the window for addressing them is before data enters a training pipeline, not after.

Part I — The Storage Architecture

Why Publisher Takedowns Don't Stop Training Pipelines

To understand why the consent problem is architectural rather than behavioral, you need to understand how Common Crawl stores data and how AI labs actually access it. These are not the same system, and they interact in a way that makes most publisher consent signals invisible to the training pipeline.

What Common Crawl Actually Is

Common Crawl is the de facto shared data utility of the foundation model industry. A 2024 Mozilla Foundation report found that two-thirds of 47 generative LLMs released between 2019 and 2023 relied on Common Crawl data. The major training corpora that underpin frontier models — C4, RefinedWeb, Dolma, The Pile — are all built substantially from Common Crawl's archives. If you have used a modern LLM, you have interacted with a system shaped by Common Crawl's data.

The archive is massive. At roughly 80,000 WARC files of around 900MB each, a single monthly crawl produces tens of terabytes of compressed web data, and the accumulated archive runs to petabytes. The data is organized into three file types: WARC files containing raw HTTP responses, WAT files containing extracted metadata, and WET files containing plain extracted text. For LLM pre-training, labs primarily work with WET files (the plain text layer) or process WARC files directly to extract and filter content at scale using distributed computing frameworks like Apache Spark running against Common Crawl's S3 bucket.

The WARC File Structure

WARC (Web ARChive) is an ISO standard container format. Each WARC file is a sequential, compressed container holding tens of thousands of records concatenated together. A single Common Crawl monthly crawl produces approximately 80,000 WARC files, each compressed to around 900MB, stored in AWS S3.
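
As a quick sanity check on scale, the figures above can be multiplied out. This is only back-of-the-envelope arithmetic on the approximate numbers quoted in the text, not a measurement of any particular crawl; the petabyte scale comes from years of crawls accumulating, not from a single month.

```python
# Back-of-the-envelope size of one monthly crawl, using the article's
# approximate figures (both values are rounded estimates, not measurements).
WARC_FILES_PER_CRAWL = 80_000
COMPRESSED_MB_PER_FILE = 900


def crawl_size_tb(files: int, mb_per_file: int) -> float:
    """Total compressed size in decimal terabytes (1 TB = 1,000,000 MB)."""
    return files * mb_per_file / 1_000_000


size_tb = crawl_size_tb(WARC_FILES_PER_CRAWL, COMPRESSED_MB_PER_FILE)
print(f"{size_tb:.0f} TB of compressed WARC data per crawl")  # prints "72 TB ..."
```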

The key architectural constraint: records inside a WARC file are not individually addressable without the CDX index. The WARC format is a sequential stream. To locate a specific record, you need to know which WARC file it lives in, its byte offset within that file, and its length. That mapping is maintained in the CDX index — a separate system.
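
A minimal sketch of why the index matters, assuming a toy archive built the way gzipped WARC files are commonly laid out: each record compressed as its own gzip member, with the members concatenated. The three records below are made up and heavily simplified (real WARC records carry more headers plus the HTTP response), and the in-memory `index` dictionary is a stand-in for the CDX index.

```python
import gzip

# Three made-up, simplified records standing in for real WARC records.
records = [
    b"WARC/1.0\r\nWARC-Target-URI: https://a.example/\r\n\r\npage A\r\n\r\n",
    b"WARC/1.0\r\nWARC-Target-URI: https://b.example/\r\n\r\npage B\r\n\r\n",
    b"WARC/1.0\r\nWARC-Target-URI: https://c.example/\r\n\r\npage C\r\n\r\n",
]


def build_archive(recs):
    """Concatenate per-record gzip members; return the bytes and an offset index."""
    blob, index = b"", {}
    for rec in recs:
        member = gzip.compress(rec)
        uri = rec.split(b"WARC-Target-URI: ")[1].split(b"\r\n")[0].decode()
        index[uri] = (len(blob), len(member))  # (byte offset, compressed length)
        blob += member
    return blob, index


def fetch(blob, offset, length):
    """Random access: decompress exactly one member, given its offset and length."""
    return gzip.decompress(blob[offset:offset + length])


archive, index = build_archive(records)

# With the index, one record can be pulled out directly.
offset, length = index["https://b.example/"]
assert fetch(archive, offset, length) == records[1]

# Without it, the only option is to read the stream from front to back.
assert gzip.decompress(archive) == b"".join(records)
```

The point of the sketch is the asymmetry: targeted lookup depends on the offset table, while a front-to-back read needs nothing but the bytes.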

The CDX Index: The Map That AI Labs Don't Use

Common Crawl maintains a CDX index (CDX is generally read as "capture index"): a sorted, gzip-compressed lookup table that maps URLs to their locations in the WARC archive. When a publisher submits a takedown request and Common Crawl acts on it, what actually happens is de-indexing: the URL's entry is removed from the CDX index. The URL no longer appears in index queries. Researchers using the standard CDX API or tools like cdx_toolkit will not find the content. For casual research access, this is functionally equivalent to deletion.
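
To make the mapping concrete, here is a sketch of what one index entry encodes, assuming a CDXJ-style line of the form "sort key, timestamp, JSON payload". The sample line, its field values, and the file path are invented for illustration; only the general shape (a filename plus a byte offset and length) is taken from the description above.

```python
import json

# A made-up index line: URL sort key, capture timestamp, then a JSON payload.
SAMPLE_LINE = (
    'com,example)/article 20230615120000 '
    '{"url": "https://example.com/article", "status": "200", '
    '"filename": "crawl-data/EXAMPLE/warc/part-00000.warc.gz", '
    '"offset": "104857", "length": "7312"}'
)


def parse_cdxj(line: str) -> dict:
    """Split one index line into the fields needed to locate a record."""
    key, timestamp, payload = line.split(" ", 2)
    fields = json.loads(payload)
    return {
        "key": key,
        "timestamp": timestamp,
        "url": fields["url"],
        "filename": fields["filename"],
        "offset": int(fields["offset"]),
        "length": int(fields["length"]),
    }


entry = parse_cdxj(SAMPLE_LINE)

# The entry translates directly into an HTTP Range header for one record.
last_byte = entry["offset"] + entry["length"] - 1
byte_range = f"bytes={entry['offset']}-{last_byte}"
print(entry["filename"], byte_range)
```

Removing a line like this from the index removes the directions to the record. It changes nothing in the file the line points to.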

For AI training pipelines, it is not.

How AI Labs Actually Ingest the Data

AI labs building training corpora at scale do not use the CDX index the way a researcher queries individual pages. They bulk-download raw WARC files directly from S3. This pipeline operates entirely at the file level. It does not query the CDX index. It pulls raw WARC files and processes their contents using distributed text extraction, deduplication, and quality filtering. The CDX index is simply not in the loop.

This means: when The New York Times submitted a takedown request to Common Crawl in July 2023, and Common Crawl de-indexed those URLs, any AI lab that had already downloaded the raw WARC files containing that content — or that downloads them in the future, since the files themselves have not been deleted — still has the full content. The takedown request intercepted the index layer. It did not touch the data layer.

De-indexing is not deletion. The WARC bytes sit on S3. The CDX entry is gone. For casual researchers, the content is invisible. For AI labs running bulk S3 downloads against raw WARC files, the de-indexing never happened.
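
The two access paths can be simulated in a few lines. This is a toy model under stated assumptions: the archive is an in-memory byte string of concatenated gzip members, the index is a plain dictionary, and the URLs and page bodies are invented. It is not Common Crawl's tooling, only an illustration of the layering.

```python
import gzip


def make_archive(pages: dict):
    """Build a toy archive plus a URL -> (offset, length) index."""
    blob, index = b"", {}
    for url, body in pages.items():
        member = gzip.compress(f"{url}\n{body}".encode())
        index[url] = (len(blob), len(member))
        blob += member
    return blob, index


def index_lookup(blob, index, url):
    """Researcher-style access: finds only what the index still lists."""
    if url not in index:
        return None
    offset, length = index[url]
    return gzip.decompress(blob[offset:offset + length]).decode()


def bulk_extract(blob):
    """Pipeline-style access: read the raw bytes and never consult the index."""
    return gzip.decompress(blob).decode()


archive, index = make_archive({
    "https://news.example/story": "paywalled reporting",
    "https://blog.example/post": "open blog post",
})

# The "takedown": the index entry is removed, the archive bytes are untouched.
del index["https://news.example/story"]

assert index_lookup(archive, index, "https://news.example/story") is None
assert "paywalled reporting" in bulk_extract(archive)
```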

Why Deletion Is Architecturally Intractable

True deletion — removing specific records from the WARC files themselves — is technically possible but operationally catastrophic at Common Crawl's scale. Each monthly crawl generates approximately 80,000 WARC files. Each file is a compressed sequential stream containing around 46,000 records from across thousands of domains. To surgically remove a specific publisher's content, you would need to:

  1. Identify every WARC file containing records matching the publisher's domains across all crawls.
  2. Decompress each affected WARC file — each roughly 900MB compressed.
  3. Parse the sequential record stream to locate matching entries.
  4. Rewrite the file with those records excluded, recalculating offsets for all subsequent records.
  5. Recompress and re-upload, invalidating any downstream CDX references to other records in the same file.
  6. Repeat across hundreds of crawls spanning years of historical data.

Common Crawl's archive spans petabytes across years of monthly crawls, each with overlapping domain coverage. The operational cost of true record-level deletion at that scale would be measured in millions of dollars of compute and months of engineering time — for a single major publisher's content. This is not a policy failure. It is an architectural one. The WARC format was designed for archival permanence, not surgical consent enforcement. The system was never built to do what it is now being asked to do.
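
The offset problem in steps 4 and 5 can be shown on a toy archive. The sketch below assumes per-record gzip members and an in-memory index, and it simplifies the real job considerably: it copies compressed members instead of parsing and recompressing them, and the URLs are invented. Even in this reduced form, removing one record shifts every record behind it, so any previously published offset for those records goes stale.

```python
import gzip
from urllib.parse import urlsplit


def build(records):
    """records: list of (url, body). Returns archive bytes and url -> (offset, length)."""
    blob, index = b"", {}
    for url, body in records:
        member = gzip.compress(f"{url}\n{body}".encode())
        index[url] = (len(blob), len(member))
        blob += member
    return blob, index


def remove_domain(blob, index, domain):
    """Rewrite the archive without `domain`, recomputing offsets for what remains."""
    new_blob, new_index = b"", {}
    # Walk the records in archive order (sorted by original byte offset).
    for url, (offset, length) in sorted(index.items(), key=lambda kv: kv[1][0]):
        host = urlsplit(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            continue  # drop the publisher's record
        new_index[url] = (len(new_blob), length)
        new_blob += blob[offset:offset + length]
    return new_blob, new_index


old_blob, old_index = build([
    ("https://a.example/1", "first page"),
    ("https://publisher.example/story", "content to remove"),
    ("https://c.example/3", "third page"),
])
new_blob, new_index = remove_domain(old_blob, old_index, "publisher.example")

# The surviving record after the removed one now lives at a different offset.
old_offset = old_index["https://c.example/3"][0]
new_offset = new_index["https://c.example/3"][0]
assert new_offset < old_offset
```

Multiply that re-indexing by every affected file in every crawl and the operational burden described above follows.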

Part II — The Quality Consequence

Model Collapse: The Mathematics of Degraded Data

The architectural failure described in Part I has a direct quality consequence, one that is now formally analyzed and empirically documented rather than merely speculative. As high-quality, consent-clear content gets restricted and the de-indexing fiction fails to keep it out of existing pipelines, AI companies face a forced substitution problem. The three substitutes examined below (synthetic data, social media data, and data sourced from jurisdictions with weaker privacy law) each produce distinct, documented failure modes.

The Synthetic Data Trap

The most widely discussed escape route from the data consent crisis is synthetic data generation: train models on AI-generated content rather than scraped human-written content, sidestepping the consent problem entirely. The research establishing why this does not work is now both peer-reviewed in Nature and replicated across multiple modeling paradigms.

The foundational result comes from Shumailov et al. (2024), published in Nature. The researchers showed that training language models recursively on AI-generated content — even partially — produces a compounding information loss. In the early phase, models lose information about the tails of the data distribution: rare knowledge, minority viewpoints, unusual but correct facts. In the late phase, the distribution collapses toward the mean, and outputs become repetitive, generic, and detached from the original data's statistical richness. The researchers called this model collapse.
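
The mechanism can be illustrated with a one-dimensional toy, offered here as an analogue rather than a reproduction of the paper's language-model experiments. Each "generation" fits a Gaussian to a small sample drawn from the previous generation's fit and then becomes the source for the next. The sample size, generation count, and seed are arbitrary choices made for this sketch.

```python
import random
import statistics


def recursive_fit(generations: int = 300, sample_size: int = 10, seed: int = 0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting
    with the "real" distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original data distribution
    history = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(sigma)
    return history


spread = recursive_fit()
print(f"std at generation 0: {spread[0]:.3f}")
print(f"std at generation {len(spread) - 1}: {spread[-1]:.3g}")
```

Because each finite sample tends to under-represent the tails, the estimated spread drifts downward over generations and the fitted distribution narrows toward a point, which mirrors the early-tail-loss and late-collapse phases described above.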

The 2025 ICLR paper “Strong Model Collapse” sharpened the result considerably. In a formal regression setting, the researchers demonstrated that even the smallest fraction of synthetic data — as little as 1 sample per 1,000 — is sufficient to trigger model collapse asymptotically. Critically: larger training sets do not rescue the model from this failure. More data of lower quality does not compensate for less data of higher quality. Scaling does not solve a provenance problem.

The contamination of the web with AI-generated content compounds this problem regardless of any lab's intentional choices. By April 2025, 74.2% of newly created webpages contained some AI-generated text. AI-written pages in Google's top 20 results climbed from 11% to nearly 20% between May 2024 and July 2025. A model trained on a new Common Crawl snapshot is, increasingly, training on the outputs of prior model generations. The feedback loop is already running at the infrastructure level.

The Social Media Data Failure: Grok as Documented Evidence

Grok represents the most comprehensively documented production failure of training data quality collapse to date. Its architecture made the failure almost inevitable: xAI's core competitive differentiation was real-time access to the full X post stream, a dataset no competitor could replicate. The problem is that social media is structurally the worst possible training corpus for a model intended to function as a reliable information source.

Grok was trained partly on X posts. Following Elon Musk's 2022 acquisition and a subsequent 80% reduction in trust and safety staff, the platform became substantially more permissive of misinformation, conspiracy content, and coordinated inauthentic behavior. The model's epistemic character was shaped by the platform's epistemic character. The resulting failures were systematic, tracking the specific misinformation genres that dominate X's information environment. Grok:

  • Incorrectly blamed a trans pilot for the January 2025 Washington DC helicopter crash based on viral X posts that preceded factual investigation.
  • Claimed the 2024 Trump assassination attempt was partially staged, echoing a conspiracy theory that circulated heavily on X.
  • Fabricated a criminal history for an Idaho shooting suspect, reflecting the platform's pattern of instant, unverified attribution in breaking news events.
  • Produced antisemitic outputs and briefly referred to itself as “MechaHitler” following system prompt modifications.

The critical technical point is that these failures cannot be explained by bad prompts or insufficient RLHF. The xAI engineering team applied multiple rounds of safety fine-tuning, system prompt modifications, and post-hoc corrections. None of these interventions addressed the underlying problem because the underlying problem was upstream of all of them. The training corpus had instilled a particular model of the world, one shaped by the epistemological character of a misinformation-permissive social media platform, and fine-tuning cannot reliably overwrite what pre-training established.

The Jurisdictional Arbitrage Failure

The third substitution path — sourcing data from jurisdictions with weaker privacy law to compensate for GDPR-mandated exclusions — produces a third distinct failure mode: geographic fragmentation of model quality. When the Irish Data Protection Commission forced xAI to stop processing EU user data for Grok training, the model serving European users had to be retrained on a different corpus. The result was a structurally degraded product: a Grok that lacked the cultural relevance, linguistic nuance, and European event timeliness of the version available to users in jurisdictions with weaker privacy enforcement.

This is not a temporary compliance burden. It is a permanent quality bifurcation. Models trained under consent constraints are systematically different from models trained without them — and in a world where consent requirements are tightening globally, the models available in high-rights jurisdictions will continue to diverge from those available where rights are more loosely enforced.

Part III — The Only Viable Architecture

Why the Fix Must Be Upstream

The WARC architecture analysis in Part I and the model collapse evidence in Part II converge on a single structural conclusion: the consent problem cannot be solved at the destination. Takedowns don't reach the training pipeline. Fine-tuning doesn't overwrite corrupted pre-training. Post-hoc filtering of synthetic content from already-assembled corpora is unreliable and increasingly impossible as web contamination compounds.

The only technically viable point of intervention is before data enters any pipeline — at creation.

Permanent Proof at Creation

The gap that C2PA (the Coalition for Content Provenance and Authenticity standard, which typically attaches signed provenance metadata to the file itself) cannot fill is permanent, decentralized, pipeline-independent provenance. Stelais fills it through Arweave's permanent storage protocol, a blockchain-based storage network specifically designed for immutable, indefinitely accessible records, combined with a creator registration system that operates independently of any training pipeline's data ingestion process.

When a creator registers a work on Stelais, three operations occur that no downstream process can undo or ignore. First, a cryptographic hash of the work's content is written to Arweave's permanent ledger. This is not a pointer to a file; it is a content-addressed record that binds a specific hash to a specific moment in time, permanently, with no ability for any party — including Stelais — to modify or delete it. Second, the creator's identity assertion is anchored to that hash, establishing an unambiguous chain of custody from the moment of creation. Third, the creator's explicit consent preferences — whether the work is licensed for AI training, under what terms, and with what compensation requirements — are encoded into the same permanent record.
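
As a rough picture of what such a record bundles together, the sketch below models the three components as one immutable structure. Everything here is hypothetical: the field names, the choice of SHA-256, and the `register` helper are illustrative assumptions, not the Stelais schema or API, and a frozen in-memory object only imitates the immutability that the text attributes to the ledger.

```python
import hashlib
import time
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ProvenanceRecord:
    content_hash: str          # hash of the work's bytes (SHA-256 is an assumption)
    registered_at: int         # Unix timestamp binding the hash to a moment in time
    creator_id: str            # hypothetical identity assertion
    ai_training_allowed: bool  # hypothetical consent preference
    license_terms: str         # hypothetical terms and compensation requirements


def register(work: bytes, creator_id: str, allow_training: bool,
             terms: str, now: Optional[int] = None) -> ProvenanceRecord:
    """Bundle hash, identity, and consent preferences into one record."""
    return ProvenanceRecord(
        content_hash=hashlib.sha256(work).hexdigest(),
        registered_at=int(time.time()) if now is None else now,
        creator_id=creator_id,
        ai_training_allowed=allow_training,
        license_terms=terms,
    )


record = register(b"An original essay.", "creator:example", False,
                  "no AI training without a license", now=1_700_000_000)

try:
    record.ai_training_allowed = True  # any later attempt to rewrite consent
except FrozenInstanceError:
    print("record is immutable")
```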

The architectural difference from C2PA is this: the Stelais record exists independently of the content's journey through any platform, pipeline, or format transformation. When a WARC pipeline strips HTML and discards embedded metadata, the Stelais provenance record is unaffected — it lives on Arweave, permanently, and can be queried at any time by any party with the content hash. An AI company verifying the provenance of a training document does not need the original file's metadata to be intact. It needs the content hash, which can be computed from the plain text, and a permanent ledger that maps that hash to a creator's consent record.
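
A sketch of that lookup, under loudly stated assumptions: the canonical form (Unicode NFC plus collapsed whitespace), the SHA-256 hash, and the dictionary standing in for the ledger are all choices made for this example, not a description of how Stelais computes or stores hashes. The HTML page and its metadata are invented.

```python
import hashlib
import unicodedata
from html.parser import HTMLParser


def canonical_hash(text: str) -> str:
    """Hash a canonical form of the text (assumed scheme: NFC, collapsed whitespace)."""
    canonical = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TextOnly(HTMLParser):
    """Keep text nodes and drop tags and attributes, roughly as text extraction does."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Registration side: the creator's plain-text work and a stand-in ledger.
work = "The quick brown fox.\n\nSecond paragraph."
ledger = {canonical_hash(work): {"creator": "A. Writer", "ai_training": False}}

# Pipeline side: the same work as scraped HTML; the author metadata is discarded.
page = ('<html><head><meta name="author" content="A. Writer"></head>'
        '<body><p>The quick  brown fox.</p>\n<p>Second paragraph.</p></body></html>')
found = ledger.get(canonical_hash(extract_text(page)))
print(found)
```

One design caveat applies to any exact-match scheme of this kind: the lookup survives only the transformations that the agreed canonical form absorbs, so both sides have to apply the same normalization.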

This is what “proof rather than signal” means technically. C2PA provides a signal that is contingent on the continued operation of a certificate infrastructure and the survival of metadata through file processing. Stelais provides a proof that is independent of any intermediate infrastructure and survives any format transformation.

Why Provenance Is Also a Curation Filter

The deepest non-obvious point in this architecture is that consent infrastructure and data quality infrastructure converge at the same mechanism. Content that is important enough for a creator to register with provenance is, by selection, higher quality than ambient scraped content. The act of registration itself filters for human-authored, cared-about, curated work — the exact properties that distinguish high-quality training data from slop.

An AI company training on a Stelais-registered corpus is not just acquiring legally defensible data. It is acquiring a dataset that has been filtered, at zero marginal cost to the lab, by the one signal that actually correlates with quality: whether a human being cared enough about a piece of content to permanently record its existence. That signal is not available in Common Crawl. It cannot be reconstructed from robots.txt compliance. It cannot be approximated by quality heuristics applied to WET files. It only exists at creation.

The data commons is closing because the system was never designed to honor creator choices. The fix is not better takedowns. It is making provenance and consent native to the act of creation itself.

Conclusion

The Pipeline Doesn't Lie

Common Crawl stores content as sequential, compressed WARC archives across petabytes of S3 storage. The CDX index maps URLs to byte offsets within those files. When a publisher requests a takedown, Common Crawl removes the CDX entry. The WARC bytes remain. AI labs bulk-downloading raw files for training runs operate entirely below the CDX layer. The takedown never reached them.

Models trained on the resulting corpus carry its quality signature forward, whatever the mix: high-quality content that made it into early crawls before being locked down, synthetic outputs, social media data, and whatever low-quality open web content remains. Fine-tuning modifies behavior at the margins. Pre-training shapes the model's underlying model of the world.

The models being trained today on degraded, consent-collapsed corpora are the foundation layers that future, more capable systems will build on. Capability amplifies whatever is baked into the base. A highly capable model whose worldview was quietly shaped by poisoned training data, running with agentic capabilities in critical infrastructure or financial systems, does not spread misinformation passively. It constructs persuasive, internally coherent false narratives and executes on them with an authority and fluency that makes the corruption nearly invisible to the humans relying on it.

The only fix is upstream: provenance at creation, permanent and verifiable, independent of any pipeline's data processing decisions. That is what Stelais builds.


Key References

Model Collapse: Shumailov, I. et al. “AI models collapse when trained on recursively generated data.” Nature 631, 755–759 (2024). Dohmatob, E. et al. “Strong Model Collapse.” ICLR 2025 Spotlight. Alemohammad, S. et al. “Self-Consuming Generative Models Go MAD.” ICLR 2024.

Data Provenance & Consent Collapse: Longpre, S. et al. “Consent in Crisis: The Rapid Decline of the AI Data Commons.” Data Provenance Initiative (2024). Longpre, S. et al. “Data Authenticity, Consent, & Provenance for AI Are All Broken.” ICML 2024. Mozilla Foundation, “2024 Internet Health Report — AI Training Data.”

WARC / Common Crawl Architecture: Common Crawl Foundation, “Navigating the WARC file format” (2014). NVIDIA NeMo Curator, “Common Crawl Download Pipeline” documentation.

Grok / Training Data Quality Failures: PBS NewsHour (July 2025). Al Jazeera (July 2025). Northwestern CASMI (2024). Global Witness (2024). TechPolicy.Press (October 2025).

C2PA & Standards: C2PA Technical Specification 2.2 / 2.3. NSA/CISA, “Strengthening Multimedia Integrity in the Generative AI Era” (January 2025).
