The frwiki article Cookie (informatique) (dedicated to HTTP cookies) has been receiving an unusually large number of pageviews in July 2022.
Yesterday, this amounted to 3.4 million pageviews, almost 10% of all frwiki pageviews.
Pyb | Jul 15 2022, 7:53 AM
Status | Subtype | Assigned | Task |
---|---|---|---|
Open | | None | T138207 [Open question] Improve bot identification at scale |
Open | | None | T313114 Analyze possible bot traffic for frwiki article Cookie (informatique) |
Thanks for the report! I checked the data available in Turnilo for this article. It looks like public clouds are requesting the article (https://w.wiki/5TBA), identifying as a wide variety of devices (https://w.wiki/5TBC).
At first sight, I'm not sure how to improve the spider detection (the only common thread is the use of public clouds). Data-Engineering might know. Tagging them too.
I did a bit of digging yesterday, looking at webrequest_sampled_128 in Turnilo. At first glance, Uri host fr.wikipedia.org only shows 800k hits per day, which looks surprisingly low to me. For July 13th - July 20th:
Uri Path | Hits |
---|---|
Total | 6.2m |
/w/load.php | 1.6m |
/w/api.php | 0.9m |
/wiki/Circuit_de_Pau-Ville | 387.9k |
/wiki/Cookie_(informatique) | 278.0k |
/w/index.php | 91.4k |
/ | 77.2k |
/w/rest.php/v1/search/title | 75.2k |
/wiki/Wikip%C3%A9dia:Accueil_principal (the main page) | 74.7k |
/beacon/media | 51.5k |
/wiki/Wikipédia:Accueil_principal (the main page) | 35.0k |
So it looks like Circuit_de_Pau-Ville is affected as well.
The articles are:
https://fr.wikipedia.org/wiki/Cookie_(informatique) | About HTTP web browser cookies |
https://fr.wikipedia.org/wiki/Circuit_de_Pau-Ville | A motor racing circuit in a French city, with an event held on 20-22 May 2022 |
The reported page views for Circuit_de_Pau-Ville from April 1st to July 20th https://pageviews.wmcloud.org/?project=fr.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2022-04-01&end=2022-07-20&pages=Circuit_de_Pau-Ville :
So roughly 20 page views per day, with a bump at the beginning of May and a spike at 210 page views, after which it quickly went back to fewer than 20 views per day. Yet Turnilo gives us 380k hits for /wiki/Circuit_de_Pau-Ville over the last 7 days.
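To compare what is counted as user traffic with what is already tagged as automated, the same numbers can be pulled from the public Wikimedia Pageviews REST API rather than the web tool. The sketch below only builds the request URLs; the helper name `per_article_url` is made up for illustration, and fetching and interpreting the JSON is left out.

```python
from urllib.parse import quote

# Public Wikimedia Pageviews REST API, per-article endpoint.
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    agent="user", access="all-access", granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    start/end use the YYYYMMDD format; agent is one of
    all-agents, user, spider, automated.
    """
    # Titles use underscores and must be percent-encoded in the path.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{API_ROOT}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"


# Same article and date range as the pageviews link above,
# once for traffic counted as "user" and once for "automated".
for agent in ("user", "automated"):
    print(per_article_url("fr.wikipedia.org", "Circuit_de_Pau-Ville",
                          "20220401", "20220720", agent=agent))
```

If the webrequest hits were being recognised as bots, they should show up under the `automated` agent rather than vanish entirely.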
pageviews for Cookie_(informatique) for the same April 1st / July 20th period:
I am not sure what is happening with webrequest_sampled_128, but it looks off. Maybe it is an internal probe of some sort, or a glitch in our infrastructure, that keeps requesting those two specific pages? Why is the huge hit count for Cookie_(informatique) not reflected in the number of pageviews?
https://fr.wikipedia.org/wiki/Circuit_de_Pau-Ville shows 1.7M hits over the last 30 days from a single IP, yet it has barely any pageviews reported. It looks like the two are not related. Either way, something looks off.
With an ancient user-agent nonetheless. Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
This looks fishy.
The phenomenon seems to be correlated with a publication about cookies in our local Signpost subpage: https://fr.m.wikipedia.org/wiki/Sp%C3%A9cial:MobileDiff/195092300
We'll consider next week whether a member of Product-Analytics should do some additional investigation.
It looks like this is one of several examples for ways that the automated tag could be improved. My understanding is that the Data Engineering team does not have bandwidth to take up adjustments to the automated/user tags anytime in the near future (@odimitrijevic, is that right?). So I'm moving this to our Icebox until it looks like work might be able to get started.
Another article with a lot of pageviews: Russia (automated 7 million, user 5 million). It's only the first day, so it might disappear. Stats for the French Wikipedia are less and less relevant ;)
Improving the bot traffic detection is on the longer term planning horizon, however we are not likely to get to this in the near future given other priorities.
I'm going to heavily throttle page views to that article coming from the bot's AS(es) which would reduce the problem for now.
I honestly think this is an important topic to tackle. There are also some security and safety aspects to this that I can't discuss publicly, but I humbly ask you to reconsider the priorities on this. Feel free to contact me or @JanWMF privately for more information.
@Ladsgroup rate limiting or blocking at the edge should be used only for traffic putting the infrastructure at risk (e.g. DDoS). So I don't think we should throttle this.
Such a rate limit seems like sweeping the problem under the rug, and it will become a game of whack-a-mole (see T313114#8316051), adding extra load on SRE and cluttering critical tools.
We also received this from one of our users through the Android support email:
Hello,
I am contacting you because the page "cookie (informatique)" is always showing on the first position of the top 5 for the French language.
It seems its a bug (screenshot attached).
Could you please double-check and resolve the problem?
Wanted to note a couple of things I had explored earlier:
Nice! For example we could imagine someone doing QA testing of a browser from those hosting providers.
"is from public cloud" is a semi-manually curated list used for DDoS mitigation. So far we've been adding major cloud providers as we saw attacks originating from them.
Semi-manual to reduce the risk of blocking legit traffic, and (I think) for better integration with our existing tooling.
Maxmind provides a global IP database called enterprise-database which has a "User Type" field.
Indicates type of user of the IP address.
Outputs: business, cafe, cellular, college, consumer_privacy_network, content_delivery_network, government, hosting, library, military, residential, router, school, search_engine_spider, traveler
Here of course it's "hosting" that could help remove the false positives, but also "search_engine_spider" (even though I guess we filter on UA) or "content_delivery_network". From my quick look it could maybe replace our current Maxmind DB in the analytics pipeline (as the enterprise one seems to include everything the current one does).
One downside is that it would increase our dependency on Maxmind (which can be seen as a black box) and maybe cost.
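As a rough illustration of how such a "User Type" value could be combined with the outdated user agent noted earlier in this thread, here is a minimal sketch. It is not the analytics pipeline's logic: the function names, the choice of user types, and the `min_firefox_major` threshold are all assumptions, and it returns signals to review rather than a verdict, since legitimate traffic (for example browser QA run from a hosting provider) can match too.

```python
import re

# User Type values from the list above that suggest non-human traffic.
# Treating exactly these three as suspicious is an assumption.
SUSPICIOUS_USER_TYPES = {"hosting", "search_engine_spider", "content_delivery_network"}

FIREFOX_RE = re.compile(r"Firefox/(\d+)")


def firefox_major(user_agent):
    """Return the Firefox major version claimed by a UA string, or None."""
    match = FIREFOX_RE.search(user_agent)
    return int(match.group(1)) if match else None


def automation_signals(user_type, user_agent, min_firefox_major=90):
    """List reasons a request might be automated (hypothetical heuristic).

    min_firefox_major is an arbitrary cut-off chosen for illustration.
    """
    signals = []
    if user_type in SUSPICIOUS_USER_TYPES:
        signals.append(f"user_type:{user_type}")
    version = firefox_major(user_agent)
    if version is not None and version < min_firefox_major:
        signals.append(f"old_firefox:{version}")
    return signals


# The user agent quoted earlier in this thread.
UA = "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
print(automation_signals("hosting", UA))
```

Keeping the output as a list of named signals makes it easier to see which rule fired and to tune each one separately against false positives.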
@PBradley-WMF: Nope. It very infrequently shows up in top 20 search results when users search for "cookies" but the most clicks it got in a single day (across all various search queries) from the past 16 months is 200 (with ~10K impressions).
I'm wondering about the Links report - does it tell us which third party sites might be linking to the cookies article? https://support.google.com/webmasters/answer/9049606 ("Top linking sites for a given page")
Last time I checked, most of the traffic to https://fr.wikipedia.org/wiki/Cookie_(informatique) came from a handful of IP addresses. The pattern has changed slightly, and for the past few days it has come from a single IP hitting the page at 200k requests per hour (if my math is correct). From (private link) Turnilo https://w.wiki/6J3$
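For a sense of scale, the 200k requests per hour estimate can be converted to other units. This is plain arithmetic on the figure quoted above and assumes the rate is sustained, which the thread does not confirm.

```python
# Estimated rate from the comment above (assumed constant for illustration).
REQUESTS_PER_HOUR = 200_000

per_second = REQUESTS_PER_HOUR / 3600   # about 55.6 requests per second
per_day = REQUESTS_PER_HOUR * 24        # 4,800,000 requests per day
per_31_days = per_day * 31              # 148,800,000 requests in a 31-day month

print(f"{per_second:.1f} req/s, {per_day:,} per day, {per_31_days:,} per 31 days")
```

At that rate a single client would generate on the order of a hundred million requests per month.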
@PBradley-WMF I took a look at the top linking sites for https://fr.wikipedia.org/wiki/Cookie_(informatique), and compared it to https://de.wikipedia.org/wiki/HTTP-Cookie, https://en.wikipedia.org/wiki/Cookie, https://en.wikipedia.org/wiki/HTTP_Cookie, and https://en.wikipedia.org/wiki/HTTP_Cookie. Several of the top linking sites for the French & German cookie pages are the same, actually. One common thread is that the links to Wikipedia appear in the privacy banner on pages (I saw this in several, but not all, of the websites I checked).
Note that the German page (https://de.wikipedia.org/wiki/HTTP-Cookie) does NOT have the same problem with automated activity that the French page does.
However, there was 1 site that linked to the French page over 1 million times. This puts it in the top 250 sites that link to anywhere on Wikipedia.org, and the site only links to that one page. But I don't think this necessarily explains the traffic - there are some other sites that link to a single Wikipedia.org page over a million times, and those pages don't get a suspicious level of traffic.
If you want to explore, I downloaded the data in this spreadsheet (private link): https://docs.google.com/spreadsheets/d/1efqQMonnfIW88il_f2DG5xRAmlOUz-D_jTjg8NQpf3w/edit?usp=sharing
Good news: the page Cookie (informatique) no longer shows up in the top reads of the Android Wikipedia application, which is how I noticed it originally. So I guess something is now flagging those views as automated, which is great.
For January 2023 the page had almost 200 million views, while the second most viewed page had only 764 000 views. https://pageviews.wmcloud.org/topviews/?project=fr.wikipedia.org&platform=all-access&date=last-month&excludes=
I think that the change is within the Android app. Daily topviews still put Cookie (informatique) at the top (and so does the REST API).
I don't get it from the Wikipedia Android app (r/2.7.50426-r-2022-12-08 built by F-Droid)
But the others match, so I am guessing it is somehow skipped/filtered out by the app.
Apps use the wikifeeds service, which contains a hard-coded denylist of article titles that we update occasionally, and the Cookie article was added recently:
https://gerrit.wikimedia.org/r/c/mediawiki/services/wikifeeds/+/884303/3/lib/most-read.js
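The idea of a hard-coded denylist applied to a most-read list can be sketched as follows. This is not the wikifeeds code (which is JavaScript, see the Gerrit change above); the function name, the data shape, and the placeholder titles and view counts are made up for illustration.

```python
# Hypothetical denylist; the real list lives in wikifeeds' most-read.js.
DENYLIST = {"Cookie_(informatique)"}


def filter_most_read(articles, denylist=DENYLIST):
    """Drop denylisted titles from a most-read list, keeping the order."""
    return [entry for entry in articles if entry["article"] not in denylist]


# Made-up input: titles other than the Cookie article are placeholders,
# and all view counts are invented.
top = [
    {"article": "Cookie_(informatique)", "views": 1000},
    {"article": "Example_A", "views": 10},
    {"article": "Example_B", "views": 5},
]
print([entry["article"] for entry in filter_most_read(top)])
```

A denylist like this hides the symptom in the app feed only; the topviews tool and the REST API still report the article, as noted above.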