How crawlers impact the operations of the Wikimedia projects
- This article was originally published at Diff on April 1, 2025. It is licensed under CC BY-SA 4.0.
Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we’ll discuss the reasons for this trend and its impact.
The Wikimedia projects are the largest collection of open knowledge in the world. Our sites are an invaluable destination for humans searching for information, and for all kinds of businesses that access our content automatically as a core input to their products. Most notably, the content has been a critical component of search engine results, which in turn has brought users back to our sites. But with the rise of AI, the dynamic is changing: We are observing a significant increase in request volume, with most of this traffic driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to driving new users to participate in the movement, and it is placing a significant load on the underlying infrastructure that keeps our sites available for everyone.
A view behind the scenes: The Jimmy Carter case
When Jimmy Carter died in December 2024, his page on English Wikipedia saw more than 2.8 million views over the course of a day. This was relatively high, but manageable. At the same time, quite a few users played a 1.5-hour-long video of Carter’s 1980 presidential debate with Ronald Reagan. This caused a surge in network traffic, doubling its normal rate. As a consequence, for about one hour a small number of Wikimedia’s connections to the Internet filled up entirely, causing slow page load times for some users. The sudden traffic surge alerted our Site Reliability team, who were swiftly able to address it by rerouting our internet connections to reduce the congestion. Still, this should not have caused any issues, as the Foundation is well equipped to handle high traffic spikes during exceptional events. So what happened?
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons catalog of openly licensed images to feed AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
The graph below shows that the base bandwidth demand for multimedia content has been growing steadily since early 2024 – and there’s no sign of this slowing down. This increase in baseline usage means we have less headroom to accommodate traffic surges during exceptional events: a significant amount of our time and resources goes into responding to non-human traffic.
65% of our most expensive traffic comes from bots
The Wikimedia Foundation serves content to its users through a global network of datacenters. This enables us to provide a faster, more seamless experience for readers around the world. When an article is requested multiple times, we memorize – or cache – its content in the datacenter closest to the user. If an article hasn’t been requested in a while, its content needs to be served from the core datacenter: the request “travels” all the way from the user’s location to the core datacenter, which looks up the requested page and serves it back to the user, while also caching it in the regional datacenter for any subsequent users.
While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” large numbers of pages and also visit the less popular ones. This means their requests are more likely to be forwarded to the core datacenter, which makes them much more expensive in terms of the resources they consume.
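To make that cost asymmetry concrete, here is a minimal, illustrative sketch in Python – a toy edge cache that forwards misses to a simulated core datacenter, not the Foundation’s actual caching code. Repeated human-style requests for one popular page mostly hit the cache, while a single crawler-style pass over many rarely read pages misses every time.

```python
import time


class EdgeCache:
    """Toy model of a regional cache in front of a core datacenter.
    Illustrative only; the real CDN logic is far more involved."""

    def __init__(self, core_fetch, ttl=300):
        self.core_fetch = core_fetch  # callback standing in for the core datacenter
        self.ttl = ttl                # seconds a cached page stays "fresh"
        self.store = {}               # title -> (content, cached_at)
        self.hits = 0
        self.misses = 0

    def get(self, title):
        entry = self.store.get(title)
        if entry and time.time() - entry[1] < self.ttl:
            self.hits += 1            # cheap: served from the nearby datacenter
            return entry[0]
        self.misses += 1              # expensive: request travels to the core
        content = self.core_fetch(title)
        self.store[title] = (content, time.time())
        return content


def core_fetch(title):
    return f"<html>Article: {title}</html>"  # stand-in for the core datacenter


cache = EdgeCache(core_fetch)

# Human-like traffic: many readers requesting the same popular page -> mostly hits.
for _ in range(1000):
    cache.get("Jimmy Carter")

# Crawler-like traffic: one pass over a long tail of rarely read pages -> all misses.
for i in range(1000):
    cache.get(f"Obscure article {i}")

print(f"hits={cache.hits} misses={cache.misses}")  # 999 hits, 1001 misses
```

In this toy run, one crawler pass over the long tail generates roughly a thousand times more core-datacenter fetches than a thousand human reads of the same popular article.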
While migrating our systems, we noticed that only a fraction of the expensive traffic hitting our core datacenters was behaving the way web browsers usually do, i.e. interpreting JavaScript code. When we took a closer look, we found that at least 65% of this resource-intensive traffic to the website comes from bots – a disproportionate share, given that bots account for only about 35% of overall pageviews. This heavy usage also causes constant disruption for our Site Reliability team, who have to block overwhelming traffic from such crawlers before it causes issues for our readers.
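As a rough illustration of this kind of accounting – the field names and figures below are invented for the example, not Wikimedia’s actual telemetry – one can take the expensive (cache-miss) requests and compute what share comes from clients that never executed JavaScript:

```python
# Hypothetical, simplified log records; field names are made up for illustration.
requests = [
    {"client": "browser-1", "cache": "miss", "executed_js": True},
    {"client": "crawler-7", "cache": "miss", "executed_js": False},
    {"client": "crawler-7", "cache": "miss", "executed_js": False},
    {"client": "browser-2", "cache": "hit",  "executed_js": True},
    # ... in practice, vastly more entries, aggregated per client
]

# "Expensive" traffic: requests that missed the cache and hit the core datacenter.
expensive = [r for r in requests if r["cache"] == "miss"]

# Clients that never interpret JavaScript behave unlike ordinary web browsers.
likely_bots = [r for r in expensive if not r["executed_js"]]

share = len(likely_bots) / len(expensive)
print(f"{share:.0%} of cache-miss traffic shows no JavaScript execution")  # 67% here
```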
Wikimedia is not alone in facing this challenge. As noted in our 2025 global trends report, technology companies are racing to scrape websites for human-created and verified information. Content publishers, open source projects, and websites of all kinds report similar issues. Moreover, crawlers tend to access any URL they can find. Within the Wikimedia infrastructure, we are observing scraping not only of the Wikimedia projects, but also of key systems in our developer infrastructure, such as our code review platform and our bug tracker. All of this consumes time and resources that we need to support the Wikimedia projects, contributors, and readers.
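One basic courtesy for crawler operators is to honor a site’s robots.txt before fetching any URL. The sketch below uses Python’s standard urllib.robotparser to check whether a crawler may fetch a given page; the crawler name and contact address are made-up examples, and the verdict simply reflects whatever the site’s robots.txt says.

```python
from urllib.robotparser import RobotFileParser

# A well-behaved crawler fetches and respects robots.txt before crawling.
robots = RobotFileParser()
robots.set_url("https://en.wikipedia.org/robots.txt")
robots.read()

user_agent = "ExampleResearchBot/1.0 (contact@example.org)"  # hypothetical crawler

for url in [
    "https://en.wikipedia.org/wiki/Jimmy_Carter",            # an ordinary article
    "https://en.wikipedia.org/w/index.php?title=Special:Random",  # a dynamic page
]:
    allowed = robots.can_fetch(user_agent, url)
    print(f"{'OK' if allowed else 'SKIP'}  {url}")
```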
Our content is free, our infrastructure is not: Establishing responsible use of infrastructure
Delivering trustworthy content also means supporting a “knowledge as a service” model, where we acknowledge that the whole internet draws on Wikimedia content. But this has to happen in ways that are sustainable for us: How can we continue to enable our community, while also putting boundaries around automated content consumption? How might we funnel developers and reusers into preferred, supported channels of access? What guidance do we need to incentivize responsible content reuse?
We have started to work towards addressing these questions systemically, and have set a major focus on establishing sustainable ways for developers and reusers to access knowledge content in the Foundation’s upcoming fiscal year. You can read more in our draft annual plan: WE5: Responsible Use of Infrastructure. Our content is free, our infrastructure is not: We need to act now to re-establish a healthy balance, so we can dedicate our engineering resources to supporting and prioritizing the Wikimedia projects, our contributors and human access to knowledge.
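Supported channels already exist for reusers who want Wikimedia content programmatically, including the MediaWiki Action API, the Wikimedia REST API, Wikimedia Enterprise, and the database dumps at dumps.wikimedia.org. The sketch below is an illustration rather than an official client: it queries the Action API while identifying itself with a descriptive User-Agent (the value shown is a made-up example), passes the maxlag parameter so the servers can ask it to back off, and throttles between requests. For bulk collection, the dumps are the better choice.

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
# A descriptive User-Agent with contact details, per the Wikimedia User-Agent
# policy; this particular value is a made-up example.
USER_AGENT = "ExampleReuseBot/1.0 (https://example.org/bot; contact@example.org)"


def api_get(params, retries=3):
    """Query the MediaWiki Action API politely: identify yourself, pass maxlag
    so the servers can signal overload, and retry slowly when they do."""
    params = dict(params, format="json", maxlag="5")
    url = API + "?" + urllib.parse.urlencode(params)
    for _ in range(retries):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(5)  # the servers are lagging: wait instead of hammering
            continue
        return data
    raise RuntimeError("API kept reporting high replication lag")


result = api_get({"action": "query", "prop": "info", "titles": "Jimmy Carter"})
print(list(result["query"]["pages"].values())[0]["title"])
time.sleep(1)  # throttle between requests in any larger job
```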
Discuss this story