
Implement exclusion for IPs generating artificially high levels of banner impressions
Closed, Resolved (Public)

Description

Background
Throughout our Big English banner pre-testing in FY2425, we've observed an extreme increase in impressions served for some banner types in some geographies, notably desktop large in the U.S.

Concurrently, Product Analytics have been observing increases in page views and/or unique devices in the U.S. and other markets. They believe bots and automated traffic are responsible for some significant portion of those increases, and we believe that same automated traffic is bloating our banner impression data.

Planned action
The suggestion from Product, backed by Fundraising, is to exclude from our internal impression reporting any traffic from IPs generating an abnormally high number of banner impressions.

Considerations

  • What threshold should we use? Elliott observed that in a recent Big English pre-test, 5% of impressions served came from IPs with more than 10 hits.
  • How will we report on the impact? Can we model a given approach with recent data before implementing it? Can we roll the change back after it is made?
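One way to model a candidate threshold against recent data is sketched below. It assumes the impressions are available as a plain list of IP strings, one entry per impression; the function name `threshold_impact` and that input shape are hypothetical and not taken from DjangoBannerStats. It reports two numbers: the share of impressions coming from IPs with more than `threshold` hits (the figure Elliott quoted), and the share that would be dropped if only the hits beyond the threshold were discarded.

```python
from collections import Counter


def threshold_impact(ips, threshold=10):
    """Estimate the effect of a per-IP hit threshold on impression counts.

    ips: iterable of IP strings, one per impression (assumed input shape).
    Returns a tuple (share_from_heavy_ips, share_discarded_after_limit).
    """
    counts = Counter(ips)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    # All impressions from IPs that exceeded the threshold.
    heavy = sum(n for n in counts.values() if n > threshold)
    # Only the impressions beyond the threshold for each such IP.
    over = sum(n - threshold for n in counts.values() if n > threshold)
    return heavy / total, over / total


# Toy data: one IP with 15 hits, one with 5.
sample = ["192.0.2.1"] * 15 + ["192.0.2.2"] * 5
print(threshold_impact(sample, threshold=10))
```

Running this over a recent pre-test at several thresholds would give a before/after comparison without changing the live aggregation.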

Event Timeline

Hey @Ejegg just starting to draft this task, could you review my description and correct anything I misstated?

Also I realized I should clarify: the action we're planning to take will only impact internal reporting, right? It isn't "user facing", e.g. some change in CentralNotice code that would actually stop trying to serve impressions to IPs above a given threshold?

Change #1085694 had a related patch set uploaded (by Ejegg; author: Ejegg):

[wikimedia/fundraising/tools/DjangoBannerStats@master] WIP count recent hits per IP to discard after limit

https://gerrit.wikimedia.org/r/1085694

Notes on solution coded up in the patch:

  • It only discards the hits from an IP after it counts 10 hits from that IP in a day; it can't go back and subtract the numbers from the previously-aggregated counts.
  • It slows down the calculations quite a bit, but on my machine was still fast enough to keep up with new file generation.
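The behaviour described in these notes can be illustrated with a minimal sketch. This is not the code from the patch: the class `DailyIpLimiter`, its method name, and the in-memory dictionary are hypothetical stand-ins (the real tool keeps its counts in a database table), and treating the 10th hit as still counted is an assumption.

```python
from collections import defaultdict


class DailyIpLimiter:
    """Hypothetical helper: count a hit only while its IP is within the daily limit."""

    def __init__(self, limit=10):
        self.limit = limit
        # (day, ip) -> number of hits seen so far; stands in for a database table.
        self.counts = defaultdict(int)

    def should_count(self, day, ip):
        """Record the hit and return True if it should be included in aggregates."""
        self.counts[(day, ip)] += 1
        # Assumption: the first `limit` hits are kept, later ones are discarded.
        return self.counts[(day, ip)] <= self.limit


limiter = DailyIpLimiter(limit=10)
kept = sum(limiter.should_count("2024-11-01", "192.0.2.1") for _ in range(12))
print(kept)
```

Because each hit is judged as it arrives, the hits seen before an IP crosses the limit stay in the aggregates, which matches the first note above.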

Change #1085694 merged by jenkins-bot:

[wikimedia/fundraising/tools/DjangoBannerStats@master] Count recent hits per IP to discard after limit

https://gerrit.wikimedia.org/r/1085694

Database table created on analytics origin db (frdb2003) and verified it has replicated out.

XenoRyet set Final Story Points to 4.