How AI Scraping Actually Works — and What Stelais Is Building to Stop It

If you publish images, writing, music, or any creative work on the internet, it has almost certainly been ingested into an AI training dataset. Not because someone targeted you specifically, but because the infrastructure that feeds foundation models operates at a scale that makes individual creator consent structurally irrelevant. The system does not ask permission. It does not check licenses. It downloads the internet in bulk, extracts the content, and moves on.

This guide explains exactly how that system works — the actual technical pipeline, not the policy abstractions — and then explains what Stelais is building to give creators real protection: adversarial perturbations that make images unusable for AI training, and detection systems that identify when your work has been scraped.

Part I — The Scraping Infrastructure

Common Crawl: the shared data utility of AI

The majority of foundation models, including GPT-4, Claude, Llama, and Gemini, are trained on datasets built substantially from Common Crawl, a nonprofit that has been archiving the open web since 2008. Common Crawl publishes a new snapshot of the web roughly every month. Each snapshot is tens of terabytes of compressed data stored as WARC (Web ARChive) files on Amazon S3, and the full archive runs to petabytes. These snapshots are the raw material from which the major training corpora (C4, RefinedWeb, Dolma, The Pile) are assembled.

The scale is difficult to overstate. Each monthly crawl produces approximately 80,000 WARC files, each containing around 46,000 web page captures, compressed to roughly 900MB per file. A single crawl captures billions of pages across millions of domains. If you have a website, a blog, a portfolio, or a social media presence with public content, your work is in these archives.
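As a quick sanity check, the arithmetic below multiplies out the per-crawl figures quoted above. The constants are this article's rounded estimates, not measured values, and the variable names are only illustrative.

```python
# Back-of-the-envelope totals for one crawl, using the rounded figures above.
WARC_FILES_PER_CRAWL = 80_000
CAPTURES_PER_FILE = 46_000
COMPRESSED_MB_PER_FILE = 900

total_captures = WARC_FILES_PER_CRAWL * CAPTURES_PER_FILE
# 1 TB is taken as 1,000,000 MB (decimal units).
total_compressed_tb = WARC_FILES_PER_CRAWL * COMPRESSED_MB_PER_FILE / 1_000_000

print(f"captures per crawl: {total_captures:,}")
print(f"compressed size per crawl: {total_compressed_tb:.0f} TB")
```

On these estimates a single crawl holds roughly 3.7 billion captures in about 72 TB of compressed WARC data.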

How AI labs actually use this data

AI companies building training corpora do not browse these archives page by page. They bulk-download raw WARC files directly from S3 and process them using distributed computing frameworks. The pipeline is automated, running at the file level — it pulls the raw archives and applies text extraction, deduplication, quality filtering, and domain classification at scale.

For text content, the pipeline extracts plain text from HTML, discarding all markup, metadata, and container-level information. For images, crawlers download the image files directly, often re-encoding them and stripping all embedded metadata — EXIF, IPTC, XMP, and any C2PA Content Credentials — in the process.

This is the critical architectural fact: everything that is not the raw content itself gets discarded. Copyright notices in HTML footers. Licensing metadata in image files. robots.txt directives. C2PA manifests. Terms of service. All of it is stripped away before the content reaches the training pipeline. The AI model never sees it.
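The text-extraction step can be sketched with nothing more than the standard library. This is a minimal illustration, not any lab's actual pipeline: the set of skipped tags is an assumption standing in for real boilerplate filters, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Keeps visible body text; drops markup, <head>, and boilerplate regions."""

    # Assumed filter set: real pipelines use more elaborate boilerplate removal.
    SKIPPED = {"head", "script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """<html><head><title>Portfolio</title>
<meta name="license" content="All rights reserved"></head>
<body><article><h1>Harbor at Dawn</h1><p>An essay about light.</p></article>
<footer>&copy; 2024 Jane Artist. No AI training.</footer></body></html>"""

extracted = extract_text(page)
print(extracted)
```

The license meta tag and the footer notice are both present in the page, yet neither survives into the extracted text that a downstream training job would see.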

WARC files are immutable

WARC is an ISO standard container format (ISO 28500) designed for archival permanence. Records are appended sequentially, and in Common Crawl's files each record is compressed as its own gzip member so that an index can address it by byte offset. Once written, a record cannot simply be deleted in place: removing it means rewriting the file with the record excluded, recalculating byte offsets for every subsequent record, updating every index that points into the file, and re-uploading it, repeated across hundreds of crawls spanning years of data.

When a publisher requests a takedown from Common Crawl, what actually happens is de-indexing: the URL is removed from the CDX lookup index. Researchers using the standard API will no longer find the content. But AI labs bulk-downloading raw WARC files bypass the CDX index entirely. The de-indexed content is still in the archive, still on S3, still available to any bulk download pipeline. For a full technical treatment of why this makes consent structurally unenforceable, see our guide on the architecture of forgotten consent.

De-indexing is not deletion. The bytes remain on S3. AI labs operating below the CDX layer never see the takedown. The architecture was designed for archival permanence, not consent enforcement.
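The gap between de-indexing and deletion can be shown with a toy model. The sketch below is an in-memory stand-in, not the real WARC or CDX formats: it assumes each record is stored as its own gzip member and that the index maps a URL to a byte offset and length. All function names are hypothetical.

```python
import gzip


def build_archive(records):
    """Concatenate per-record gzip members; return (archive_bytes, index)."""
    archive = bytearray()
    index = {}
    for url, body in records:
        member = gzip.compress(body)
        index[url] = (len(archive), len(member))  # (byte offset, length)
        archive += member
    return bytes(archive), index


def lookup(archive, index, url):
    """Index-based access: only works for URLs still listed in the index."""
    if url not in index:
        return None
    offset, length = index[url]
    return gzip.decompress(archive[offset:offset + length])


def bulk_scan(archive):
    """Bulk access: decompress every record in the raw file, ignoring the index."""
    return gzip.decompress(archive)


archive, index = build_archive([
    ("https://example.com/a", b"page A\n"),
    ("https://example.com/art", b"artist portfolio\n"),
    ("https://example.com/c", b"page C\n"),
])

# The "takedown": the URL is removed from the index, the archive is untouched.
del index["https://example.com/art"]

print(lookup(archive, index, "https://example.com/art"))
print(b"artist portfolio" in bulk_scan(archive))
```

After the index entry is deleted, the index-based lookup finds nothing, while a bulk scan of the same bytes still returns the record.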

Beyond Common Crawl: direct scraping

Common Crawl is the largest single data source for AI training, but it is not the only one. AI companies also operate their own crawlers: OpenAI's GPTBot, Anthropic's ClaudeBot, and others fetch pages directly for training data, while Google exposes a Google-Extended robots.txt token that controls whether content fetched by its existing crawlers may be used for its AI models. For images specifically, LAION-5B, the dataset behind Stable Diffusion, was assembled from billions of image-text pairs gathered from across the open web.
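Image-text pairs of this kind are typically harvested from img tags and their alt text. The following is a minimal sketch under that assumption. The class name is hypothetical, and real dataset builders add language detection, deduplication, and similarity filtering on top.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImagePairCollector(HTMLParser):
    """Collects (absolute image URL, alt text) pairs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Images without a caption are useless as training pairs, so skip them.
        if src and alt:
            self.pairs.append((urljoin(self.base_url, src), alt))


html = """
<img src="/work/harbor.jpg" alt="Oil painting of a harbor at dawn">
<img src="spacer.gif" alt="">
<img src="https://cdn.example.net/cat.png" alt="Sketch of a cat">
"""

collector = ImagePairCollector("https://example.com/gallery/")
collector.feed(html)
pairs = collector.pairs
for url, caption in pairs:
    print(url, "|", caption)
```

Note that nothing in this step reads a license, a copyright notice, or an opt-out flag: the only inputs are the image address and its caption.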

These direct crawlers are theoretically subject to robots.txt directives. In practice, compliance is spotty. A 2024 TollBit study found that only 12.9% of websites had implemented AI-specific robots.txt blocks. OpenAI quietly stopped honoring robots.txt for its ChatGPT-User agent in late 2025. And even perfect robots.txt compliance only affects future crawls — it cannot retroactively remove content from snapshots already captured and distributed. For more on why robots.txt fails as a consent mechanism, see the provenance problem in AI training data.

Part II — Why Current Defenses Fail

Creators are not passive in this. There are tools and techniques marketed as AI scraping defenses. The problem is that each one operates at a layer that the scraping pipeline bypasses or ignores.

robots.txt and meta tags

robots.txt is a voluntary protocol. No law requires crawlers to obey it. Even when crawlers comply, its directives only affect future crawls: content already archived in WARC files or downloaded by prior crawls is unaffected. And the standard was designed for search engines, not AI training pipelines. The granularity is wrong: you can block a crawler from your site or from a path, but you cannot express per-file licensing terms, consent conditions, or compensation requirements.
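To make the voluntary nature concrete, here is how a well-behaved crawler would consult robots.txt using Python's standard urllib.robotparser module. The robots.txt content and the "UnlistedScraper" agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/portfolio/harbor.jpg"

# A compliant crawler asks this question about itself before fetching.
gptbot_allowed = parser.can_fetch("GPTBot", url)
# Any agent not named in the file falls through to the wildcard rule.
other_allowed = parser.can_fetch("UnlistedScraper", url)

print("GPTBot allowed:", gptbot_allowed)
print("UnlistedScraper allowed:", other_allowed)
```

The check is a lookup the crawler runs on its own behalf. Nothing on the server side blocks the request if the crawler skips the check, changes its user-agent string, or simply is not listed.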

Platform opt-outs

Some platforms (DeviantArt, ArtStation, Shutterstock) offer opt-out toggles that signal “do not use for AI training.” These are meaningful only if the AI company has a licensing agreement with the platform and chooses to honor the flag. They have no effect on Common Crawl, on direct web crawlers, or on any party that scrapes the content independently. An opt-out toggle on a platform is a request, not an enforcement mechanism.

Metadata and watermarks

Embedded metadata (EXIF, IPTC, C2PA Content Credentials) is stripped during ingestion. Traditional watermarks — visible overlays — degrade the image for legitimate viewers and are trivially removable with inpainting tools. Invisible watermarks (LSB steganography, DCT-domain marks) can survive some transformations but are not checked by training pipelines — they require active cooperation from the party ingesting the data, which is precisely the party you are trying to constrain.
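For a sense of how an invisible watermark works and why it depends on someone choosing to read it, here is a toy LSB example on a row of grayscale pixel values. The quantization step is a crude stand-in for lossy re-encoding, not a real JPEG codec, and LSB marks are the most fragile kind; DCT-domain marks are designed to tolerate more. All names are illustrative.

```python
def embed_lsb(pixels, bits):
    """Write one watermark bit into the least significant bit of each pixel."""
    marked = [(p & ~1) | b for p, b in zip(pixels, bits)]
    return marked + pixels[len(bits):]


def extract_lsb(pixels, count):
    """Read the watermark back: only happens if the recipient runs this."""
    return [p & 1 for p in pixels[:count]]


def lossy_reencode(pixels, step=4):
    """Crude stand-in for lossy compression: quantize to multiples of `step`."""
    return [(p // step) * step for p in pixels]


pixels = [52, 117, 200, 33, 90, 181, 64, 250]
bits = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_lsb(pixels, bits)
reencoded = lossy_reencode(marked)

print("recovered from marked copy:   ", extract_lsb(marked, len(bits)))
print("recovered after re-encoding:  ", extract_lsb(reencoded, len(bits)))
```

The mark changes each pixel by at most one level and reads back cleanly from the untouched copy, but it is wiped out by the simulated re-encode. Even when a more robust mark survives, it has an effect only if the ingesting party runs the extractor and acts on the result.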

The common thread

Every defense above relies on the same assumption: that the party scraping your content will voluntarily cooperate with your protection mechanism. robots.txt requires crawler compliance. Platform opt-outs require licensing agreements. Metadata requires pipeline preservation. Watermarks require detection cooperation. None of these assumptions hold when the scraping infrastructure operates at bulk scale, strips metadata by default, and faces no legal consequence for ignoring voluntary signals.

A defense that requires the attacker's cooperation is not a defense. The only protections that work are the ones that operate on the content itself, independent of any downstream system's behavior.

Part III — Adversarial Perturbations: Protection That Works on the Content Itself

What adversarial perturbations are

Adversarial perturbations are subtle, mathematically computed modifications to an image that are invisible (or nearly invisible) to the human eye but fundamentally disrupt how AI models interpret the content. The concept comes from adversarial machine learning research: neural networks are sensitive to carefully crafted input modifications that exploit the gap between human perception and computational feature extraction.

Applied to AI training defense, adversarial perturbations alter the pixel-level structure of an image in a way that causes AI models to learn incorrect representations from it. The image looks the same, or nearly the same, to a human viewer. To an AI model, the signal is corrupted.
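The core idea can be illustrated with the fast gradient sign method (FGSM) on a toy linear classifier. This is a teaching sketch only: the weights, bias, and eight-pixel "image" are invented, and it is not the method used by Glaze, Nightshade, or Stelais, which target far larger models with more sophisticated objectives.

```python
# Toy linear classifier over 8 pixel intensities in the range 0.0 to 1.0.
WEIGHTS = [0.5, -0.3, 0.8, -0.6, 0.2, -0.9, 0.4, -0.7]
BIAS = -0.3


def score(pixels):
    return sum(w * p for w, p in zip(WEIGHTS, pixels)) + BIAS


def predict(pixels):
    return 1 if score(pixels) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def fgsm_perturb(pixels, epsilon):
    """Move each pixel by epsilon against the gradient of the score.

    For a linear model the gradient with respect to the input is just the
    weight vector, so the step direction is -sign(weight). Results are
    clamped to the valid pixel range.
    """
    return [
        min(1.0, max(0.0, p - epsilon * sign(w)))
        for p, w in zip(pixels, WEIGHTS)
    ]


image = [0.6, 0.4, 0.7, 0.3, 0.5, 0.2, 0.6, 0.4]
protected = fgsm_perturb(image, epsilon=0.05)

largest_change = max(abs(a - b) for a, b in zip(image, protected))
print("prediction before:", predict(image))
print("prediction after: ", predict(protected))
print("largest pixel change:", round(largest_change, 3))
```

No pixel moves by more than 0.05 on a 0 to 1 scale, yet the classifier's decision flips, because every small change is aligned with the direction the model is most sensitive to.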

The research landscape

The academic foundations are substantial and peer-reviewed:

Glaze (University of Chicago, 2023) — the first widely deployed adversarial tool for artists. Glaze applies style-specific perturbations that prevent diffusion models from learning a creator's artistic style. When a model trains on Glazed images, it fails to replicate the artist's style in generated outputs. Over 2.3 million artists have used Glaze since its release.
Nightshade (University of Chicago, 2024) — a more aggressive approach that “poisons” training data. Nightshade perturbations cause models to learn incorrect associations — a poisoned image of a dog might cause the model to associate dog-like features with an entirely different concept. At scale, Nightshade corrupts the model's learned representations in ways that are difficult to detect and expensive to reverse.
Mist (2023) — targets diffusion model fine-tuning specifically, disrupting the ability of models like Stable Diffusion to learn from perturbed images during LoRA or DreamBooth training.
Anti-DreamBooth (2023) — another approach focused on preventing identity-specific fine-tuning, particularly relevant for protecting portraits and face data from being used to create deepfakes.

These tools demonstrate a crucial principle: adversarial perturbations operate on the content itself, not on any metadata, protocol, or policy layer. They do not require the scraper's cooperation. They do not depend on robots.txt compliance, platform opt-outs, or metadata preservation. The protection travels with the pixels.

What Stelais is building

Stelais is developing an adversarial perturbation layer integrated directly into the proof-of-origin workflow. When a creator uploads an image to Stelais, the system will apply adversarial perturbations optimized for current-generation training pipelines before the creator distributes the work publicly.

The design goals are:

Perceptual transparency. The protected image must be visually indistinguishable from the original to human viewers. Creators should not have to choose between protection and presentation quality.
Pipeline resilience. The perturbations must survive the transformations that scraping pipelines apply: JPEG re-compression, resizing, format conversion, and cropping. A perturbation that disappears after a platform re-encodes the image is not useful.
Model generality. The perturbations must be effective across multiple model architectures — diffusion models, vision transformers, and future architectures — not just one specific model version. As AI architectures evolve, the perturbation strategy must evolve with them.
Integrated workflow. Protection should be a single step in the existing Stelais proof creation flow, not a separate tool with its own interface. Upload, protect, prove, publish.

This is active development work. The adversarial perturbation layer is not yet available in the current version of Stelais. We are investing in this because we believe that protection at the content level — not the policy level — is the only approach that scales against bulk scraping infrastructure.

Part IV — Detection: Knowing When Your Work Has Been Scraped

The detection problem

Prevention is half the equation. The other half is detection: knowing when your work has been ingested into a training dataset, used to generate derivative outputs, or scraped and republished without attribution.

Current detection is fragmented. Reverse image search finds exact copies but misses transformed versions. Perceptual hashing catches near-duplicates but fails when images are meaningfully altered. Style analysis can identify when a model has learned from a specific artist's work, but it requires manual comparison and expert judgment.

How Stelais approaches detection

Stelais is building detection capabilities that work in conjunction with the permanent provenance layer. The approach combines multiple signals:

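Cryptographic hash matching. The SHA-256 hash anchored on Arweave at proof creation serves as a ground truth. Any byte-identical copy of the work, anywhere on the internet, can be identified by computing its hash and checking it against the permanent ledger. A copy that has been altered by even a single byte produces a completely different hash, which is why the next signal is needed.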
Perceptual fingerprinting. Beyond exact matches, Stelais uses perceptual hashing (pHash) and DCT-domain analysis to identify images that have been cropped, resized, re-compressed, or otherwise transformed — the standard modifications that scraping pipelines apply. A scraped and re-encoded version of your image produces a different SHA-256 hash but a similar perceptual fingerprint.
On-demand similarity scanning. Creators can scan a suspect URL or uploaded file against their registered works. When a match is detected — whether exact or perceptual — the result includes the source, the match confidence, and the original proof record for comparison. Scheduled, automated monitoring is on the roadmap.
Training set membership inference. An emerging area of research: techniques that can determine whether a specific data point was included in a model's training set. While this research is still maturing, Stelais is tracking developments in membership inference, model auditing, and output attribution with the goal of integrating detection at the model output level — identifying not just when your work was scraped, but when a model has learned from it.
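The difference between the first two signals can be sketched in a few lines. The example below uses an average hash, a simpler relative of pHash, as a plainly swapped-in technique; it is not Stelais's implementation. The 8x8 grids stand in for downscaled grayscale images, and the "re-encode" is a simulated brightness shift plus quantization.

```python
import hashlib


def sha256_hex(pixels):
    """Exact fingerprint: hash of the raw pixel bytes."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


def average_hash(pixels):
    """64-bit average hash of an 8x8 grayscale grid: 1 where pixel > mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]


def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))


# Horizontal gradient as the registered work.
original = [[x * 32 + y for x in range(8)] for y in range(8)]
# Simulated re-encode: brighten slightly, then quantize to multiples of 4.
reencoded = [[(v + 5) // 4 * 4 for v in row] for row in original]
# A different image: vertical gradient.
unrelated = [[y * 32 + x for x in range(8)] for y in range(8)]

same_sha = sha256_hex(original) == sha256_hex(reencoded)
near_distance = hamming(average_hash(original), average_hash(reencoded))
far_distance = hamming(average_hash(original), average_hash(unrelated))

print("SHA-256 matches after re-encode:", same_sha)
print("perceptual distance, re-encoded copy:", near_distance)
print("perceptual distance, unrelated image:", far_distance)
```

The re-encoded copy no longer matches on SHA-256 but stays at a small perceptual distance from the original, while the unrelated image lands far away. A match threshold on that distance is what turns the fingerprint into a detection signal.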

Detection plus provenance

Detection alone is not enough — you also need proof. Identifying a scraped copy of your work is useful only if you can prove it is yours. This is where Stelais's permanent provenance layer becomes essential. The Arweave-anchored proof establishes an immutable, timestamped record of creation. When detection identifies unauthorized use, the proof provides the evidentiary foundation for enforcement — whether that means a DMCA takedown, a licensing negotiation, or a legal claim.
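As a rough picture of what such a proof ties together, the sketch below builds a record from a content hash, a creator name, a consent flag, and a UTC timestamp, then checks a candidate file against it. The field names and functions are hypothetical, the schema is not Stelais's actual format, and nothing here writes to Arweave.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_proof_record(content: bytes, creator: str, ai_training_consent: bool) -> dict:
    """Build an illustrative proof record (hypothetical schema)."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "creator": creator,
        "ai_training_consent": ai_training_consent,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def matches_record(record: dict, candidate: bytes) -> bool:
    """True only if the candidate is byte-identical to the registered work."""
    return hashlib.sha256(candidate).hexdigest() == record["sha256"]


artwork = b"raw bytes of the original image file"
record = make_proof_record(artwork, creator="Jane Artist", ai_training_consent=False)

print(json.dumps(record, indent=2))
print("original matches:", matches_record(record, artwork))
print("altered copy matches:", matches_record(record, artwork + b"\x00"))
```

What makes a record like this useful as evidence is where it is stored: anchoring it on a permanent, append-only ledger is what fixes the timestamp and prevents later edits.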

The combination is the point. Prevention (adversarial perturbations) reduces the value of scraping your work. Detection (fingerprinting and monitoring) identifies when it happens anyway. Provenance (Arweave-anchored proof) gives you the evidence to act on what you find.

Prevention makes scraping costly. Detection makes it visible. Provenance makes it actionable. You need all three.

What You Can Do Now

The adversarial perturbation layer is in active development. In the meantime, here is what you can do today:

Register your work on Stelais. Establish permanent proof of creation on Arweave before publishing. This gives you an immutable, timestamped record that exists independently of any platform or pipeline.
Use Glaze and Nightshade. These are free, publicly available tools from the University of Chicago. Apply them to images before publishing online. They are not perfect, but they are the best available content-level protection today.
Set your consent terms. Stelais proofs include your explicit AI training consent preferences. When detection and enforcement infrastructure matures — and it will — your terms will already be on the permanent record.
Use Stelais similarity scanning. The existing scan feature identifies instances of your work appearing online using perceptual fingerprinting and hash matching. Run scans regularly to monitor for unauthorized use.

Learn more about the WARC architecture that makes consent unenforceable and why Stelais uses Arweave for permanent provenance.