AI Scrapers Are Costing Wiki Operators Real Money

The scraping surge nobody warned wiki operators about

AI bots are scraping the web for LLM training data at rates the internet has never seen before, and wikis are taking the worst of it. The reason is straightforward: wikis are packed with clean, structured, encyclopedic text — exactly what AI companies want to feed their models. That makes them high-priority targets, and the scraping pressure has escalated sharply in recent months.

Weird Gloop, the company behind some of the largest video game wikis on the internet — including Minecraft Wiki, Old School RuneScape Wiki, and League of Legends Wiki — has been fighting this battle for three years. The fight has grown significantly harder. Without continuous, active mitigation efforts, bot traffic would consume roughly ten times more of Weird Gloop’s compute resources than all legitimate human traffic combined. That human traffic is not trivial: it includes tens of millions of pageviews from actual players looking up game mechanics, item stats, and quest guides.

The scale of that disparity exposes a structural problem that goes beyond any single operator. Wikis exist in a fundamentally different financial reality than the platforms AI companies typically scrape alongside them. A social media giant or major news organization has engineering teams, legal departments, and revenue streams large enough to absorb sudden infrastructure cost spikes. A wiki operation does not. When a wave of aggressive bots hits, the costs land directly and immediately on a small team that was already running lean.

The bots have also gotten harder to stop. Early scraper traffic was relatively easy to identify and block. Now, crawlers increasingly mimic human browsing behavior, making detection a genuine technical challenge that demands ongoing engineering time — time that wiki operators would otherwise spend improving content or maintaining infrastructure for their actual users. The hidden subsidy that wikis are providing to the AI industry is not just financial. It is the labor of technical staff who now spend significant portions of their working hours fighting off the companies profiting from the content those wikis spent years building.

What makes wikis uniquely vulnerable

Wikis were engineered from the start to be open. Every page is publicly crawlable, every edit is logged and linkable, and search engine accessibility was always the point. That openness made wikis like the Minecraft Wiki, the Old School RuneScape Wiki, and the League of Legends Wiki some of the most useful reference sites on the internet. In the era of Google, being crawlable was a feature. In the era of industrial AI scraping, it became a structural vulnerability.

Weird Gloop, the company that operates those wikis, has spent three years in an escalating fight against bot traffic. The math is stark: without active mitigation, AI scrapers alone would consume roughly ten times more compute resources than all human visitors combined — and those human visitors number in the tens of millions. The bots don’t arrive at a steady, manageable rate. They hit in spikes, hammering infrastructure unpredictably and causing the kind of outages and performance degradation that drive away real users trying to look something up.

That spiky, destabilizing pattern is what separates AI scraping from ordinary web traffic. A site can plan for baseline load. It cannot easily absorb sudden, massive surges that overwhelm servers without warning. The cost isn’t just financial — it’s an attack on the reliability that wiki communities have built their reputations on.

Underneath the infrastructure problem sits a human one. Wiki content isn’t produced by paid editorial teams. Volunteer contributors write, fact-check, format, and maintain articles on games they love, asking nothing in return except that the information stays useful and accessible to other fans. When AI companies scrape that content to train commercial models, they extract the value of thousands of hours of unpaid labor. The contributors get no compensation, no credit, and no say. The operators get a server bill. The AI companies get a dataset.

The missing context: this isn’t a Google-scale problem, it’s a small-operator crisis

When headlines cover AI and copyright, they zoom in on the New York Times suing OpenAI, or Reddit negotiating data-licensing deals worth tens of millions of dollars. Those are real fights, but they involve organizations with legal departments, revenue streams, and the institutional weight to force a negotiation. The operators running independent wikis have none of that.

Weird Gloop, which hosts major video game wikis for Minecraft, Old School RuneScape, and League of Legends, has spent three years absorbing the operational damage directly. Without constant mitigation work, AI scrapers alone would consume roughly ten times the compute resources used by everything else combined — and that “everything else” includes tens of millions of human pageviews. The scrapers don’t pay for that compute. The wiki operator does.

This is the part of the story that gets lost. A small wiki host can’t call a lawyer, can’t threaten litigation, and can’t offer a licensing deal that makes aggressive scraping less attractive than negotiating. The only tools available are technical countermeasures — bot detection, rate limiting, blocking — and those require constant maintenance as scrapers evolve to mimic human browsing behavior. That engineering time costs money too.

The self-defeating logic underneath all of this deserves more attention than it gets. AI models are trained on the open, community-built web. Wikis represent some of the most accurate, collaboratively maintained knowledge on the internet, exactly the kind of high-signal data that makes a language model more useful. When scraping pressure forces wiki operators to restrict access, go offline, or simply burn out and shut down, the training data pool shrinks. The industry is extracting value from a resource while systematically degrading the conditions that produced it. No major AI lab has publicly acknowledged this dynamic, let alone acted on it. The burden stays where it landed: on the people running the very sites the AI industry depends on most.

How operators are fighting back — and why it’s an arms race they’re losing

Wiki operators are not sitting still. Teams like Weird Gloop — which runs the Minecraft, Old School RuneScape, and League of Legends wikis — have spent three years building and continuously updating bot-detection systems just to keep their infrastructure functional. That engineering time does not come free. Every hour a developer spends writing new filtering rules or analyzing traffic anomalies is an hour not spent improving search, fixing editor tools, or building features that actual human visitors would notice and use.

The core problem is that the standard opt-out mechanisms are broken. Robots.txt was designed as a good-faith protocol, a way for site owners to signal which automated traffic they welcome and which they want to stay away. It is purely advisory: nothing in the protocol enforces it, and scrapers operating on behalf of AI training pipelines routinely ignore it. Others spoof their user-agent strings, impersonating legitimate crawlers like Googlebot, so that operators must either block them and risk collateral damage to real indexing or let them through and absorb the cost. When the rules rely on honesty, bad actors have an obvious structural advantage.
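One common way to tell a genuine Googlebot from an impersonator is a forward-confirmed reverse DNS check: look up the hostname for the requesting IP, confirm it sits under a Google-owned domain, then resolve that hostname back and confirm it returns the same IP. The sketch below is a minimal illustration of that idea, not a description of what Weird Gloop runs. The function names, the suffix list, and the sample addresses are assumptions made for the example, and the DNS lookups are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Set

# Assumed suffix list for illustration; re-check against Google's current
# crawler documentation before relying on it.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address (real DNS query)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> Set[str]:
    """Return every address a hostname resolves to (real DNS query)."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_googlebot(
    ip: str,
    user_agent: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Set[str]] = forward_lookup,
) -> bool:
    """True only if the claimed Googlebot passes both DNS directions."""
    if "googlebot" not in user_agent.lower():
        return False  # does not even claim to be Googlebot
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False  # PTR record is not under a Google domain
        # A scraper can set any PTR on its own IP range, so the hostname
        # must also resolve back to the address that made the request.
        return ip in forward(hostname)
    except OSError:
        return False  # lookup failed: treat as unverified


# Offline demo with hypothetical lookup tables standing in for DNS.
def fake_reverse(ip: str) -> str:
    table = {
        "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
        "203.0.113.9": "crawl-1.googlebot.com",  # forged PTR record
    }
    if ip not in table:
        raise OSError("no PTR record")
    return table[ip]


def fake_forward(hostname: str) -> Set[str]:
    table = {"crawl-66-249-66-1.googlebot.com": {"66.249.66.1"}}
    return table.get(hostname, set())


UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
genuine = is_verified_googlebot("66.249.66.1", UA, fake_reverse, fake_forward)
forged = is_verified_googlebot("203.0.113.9", UA, fake_reverse, fake_forward)
print(genuine, forged)  # True False
```

A check like this only addresses impersonation of crawlers that publish verifiable hostnames. It does nothing against scrapers that present themselves as ordinary browsers, and the lookups add latency, so results would normally be cached per IP.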

Each countermeasure Weird Gloop and operators like them deploy creates a temporary ceiling — and the scrapers eventually punch through it. Block a known IP range and the traffic shifts to residential proxies. Detect unusual request patterns and the scrapers slow down to mimic human browsing cadence. Rate-limit aggressively and run the risk of degrading the experience for real users who look, momentarily, like bots. Without mitigation, Weird Gloop estimates scraper traffic would consume roughly ten times the compute resources of all human traffic combined. With mitigation, the engineering team is locked in a permanent defensive crouch.
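The rate-limiting trade-off described above can be made concrete with a token bucket, a widely used limiting scheme. This is a minimal sketch under stated assumptions, not Weird Gloop's implementation: the class names, the per-client key, and the `rate` and `capacity` values are all hypothetical. Each client gets a bucket that refills at `rate` tokens per second up to `capacity`, and each request spends one token.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new visitor is not penalized
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client key (for example, an IP address)."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_key: str, now: Optional[float] = None) -> bool:
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_key] = bucket
        return bucket.allow(now)


# Hypothetical settings: 1 request per second sustained, bursts of 3.
limiter = PerClientLimiter(rate=1.0, capacity=3.0)
burst = [limiter.allow("198.51.100.7", now=0.0) for _ in range(5)]
print(burst)  # [True, True, True, False, False]
print(limiter.allow("198.51.100.7", now=2.0))  # True: two tokens refilled
```

The two parameters carry the tension the paragraph describes. A small `capacity` throttles scrapers sooner but also rejects a human who opens several pages at once, while a large one lets short scraping bursts through. Keying buckets by IP address is also an assumption that residential proxies undermine, since the same scraper can spread its requests across many keys.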

The structural mismatch here is stark. The AI companies deploying these scrapers operate with hundreds of millions of dollars in funding and dedicated infrastructure teams. The wiki operators trying to stop them run lean, often rely on volunteer contributors, and exist to serve niche communities, not to fight asymmetric technical warfare. The cat-and-mouse dynamic does not trend toward equilibrium. It trends toward exhaustion on one side and data extraction on the other.

What needs to change — and who has to act

The current situation places the entire burden of defense on the operators least equipped to bear it. Weird Gloop spends significant engineering time and money on mitigation because, left unchecked, bot traffic would consume roughly ten times the compute resources of all legitimate human traffic combined. That cost belongs to the companies extracting the value, not the volunteers and small teams producing it. AI developers and the cloud providers hosting their crawler infrastructure need to be held to enforceable crawling standards (rate limits, identification requirements, and real financial consequences for violations), not voluntary guidelines that bad actors ignore by definition.

Policymakers have not caught up. The regulatory conversation around AI training data has focused almost entirely on copyright: who owns the text, whether scraping constitutes reproduction, what licensing frameworks should look like. These are legitimate questions, but they leave an entire category of harm untouched. The infrastructure damage inflicted on small operators — the bandwidth bills, the server strain, the engineering hours diverted from actual product work — has no legal framework addressing it. A wiki operator whose site is destabilized by crawler floods has no meaningful recourse today.

The AI industry also has a direct self-interest in solving this problem, not just an ethical obligation. The wikis, forums, and community knowledge bases being hammered by scrapers are precisely the high-quality, human-generated, deeply specific content that makes training data valuable. Minecraft Wiki and the Old School RuneScape Wiki exist because dedicated communities spent years building them. If aggressive, uncompensated scraping degrades the economics of running those communities — or forces operators to block all bots entirely — the training data pipeline degrades with them. An AI industry that treats the open web as an extraction resource without limits will eventually find that resource depleted.

Fixing this requires action on three fronts simultaneously: AI companies adopting and actually enforcing responsible crawling conduct, cloud providers refusing to host infrastructure that violates those standards, and legislators expanding the AI data policy conversation beyond copyright to include the infrastructure harm that smaller operators are absorbing right now with no relief in sight.

AI-Assisted Content: This article was produced with AI assistance. Sources are cited below. Factual claims are verified automatically; uncertain claims are flagged for human review.
