API endpoint for URLs added within a diff
Open, Needs Triage | Public | Bug Report

Description

I am working on https://en.wikipedia.org/wiki/Wikipedia:Citation_Watchlist, a user script that scans revision log entries in recent changes, the watchlist, and page history for URLs belonging to certain domains. The domains are defined in lists, and the list of these lists is here: https://en.wikipedia.org/wiki/Wikipedia:Citation_Watchlist/Lists.

At the moment I am using the REST API to compute diffs between the scanned revision and the one coming before it. This is an expensive operation, and it is far too easy to run up against the API limit. The reason I am doing this is to be able to identify which links are added by a given edit. (Really I want all references, not just link-based ones, but that's for another time.)
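To make "links added by a given edit" concrete, here is a minimal sketch of the comparison itself. It is not the user script's actual code and it does not parse REST diff output; it assumes you already have the wikitext of the parent and child revisions and simply takes the set difference of the URLs found in each. The regular expression and the function names are illustrative assumptions, not anything MediaWiki provides.

```python
import re

# Rough pattern for http(s) URLs in wikitext. It stops at whitespace and at
# characters that usually end a link in wiki markup. This is an assumption
# for illustration, not the link parser MediaWiki itself uses.
URL_PATTERN = re.compile(r'https?://[^\s\[\]<>|}"]+')


def extract_urls(wikitext):
    """Return the set of URLs that appear anywhere in the given wikitext."""
    return set(URL_PATTERN.findall(wikitext))


def added_urls(old_wikitext, new_wikitext):
    """Return URLs present in the new revision but not in the old one, sorted."""
    return sorted(extract_urls(new_wikitext) - extract_urls(old_wikitext))


if __name__ == "__main__":
    old = "Text <ref>[http://example.org/a Source A]</ref>"
    new = old + " <ref>{{cite web|url=https://example.com/b|title=B}}</ref>"
    print(added_urls(old, new))
```

A set difference like this misses a URL that is added a second time to a page that already contains it, which is one reason a server-side answer would be preferable.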

A thought then occurred to me: isn't Wikipedia already processing the external links being added within a revision? A feed of this data is consumed by the Internet Archive for archiving links added on Wikimedia projects. So then why not just make this data available over the API, with an action that is less expensive because it is based on work that is already being done?

My ideal API takes revision IDs as input and provides as output URLs added for each listed revision ID. Ideally I can request data for multiple revision IDs in the same web request.
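As a sketch of the client side of such a request: MediaWiki Action API parameters that accept multiple values take them separated by `|`, and the number of values per request is capped (commonly 50 for ordinary clients, which is treated as an assumption here). The helper below only groups revision IDs into request-sized batches; the function name is hypothetical, and the endpoint that would consume these batches is the one being requested in this task, not something that exists today.

```python
def batch_revision_ids(revids, batch_size=50):
    """Group revision IDs into pipe-separated strings, one per request.

    batch_size=50 is an assumed per-request cap; adjust it to whatever
    limit applies to your client.
    """
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(revids))
    return [
        "|".join(str(revid) for revid in unique[start:start + batch_size])
        for start in range(0, len(unique), batch_size)
    ]


if __name__ == "__main__":
    # Small batch size just to show the grouping.
    print(batch_revision_ids([101, 102, 102, 103], batch_size=2))
```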

Please add/remove project tags from this task as you see fit.

Event Timeline

What I have learned, much to my delight, is that this API endpoint exists... for current revisions.

https://en.wikipedia.org/w/api.php?action=query&titles=Easter_Island&prop=extlinks&format=json

If you specify a revision ID instead, the API "normalizes" it to the page ID and gives you the links for the current revision.
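For reference, a minimal sketch of building that query and reading the link list out of the JSON it returns. The helper names are illustrative, and the sample response in the usage section is made up to show the shape, not real data. The parser accepts both the older output (links under a `*` key, pages keyed by page ID) and the `formatversion=2` output (links under `url`, pages as a list). Long link lists may also be split across several responses through the API's continuation mechanism, which this sketch does not handle.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extlinks_url(title, limit="max"):
    """Build a prop=extlinks query URL for the current revision of a page."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "extlinks",
        "ellimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def parse_extlinks(response):
    """Collect external link URLs from a decoded prop=extlinks response."""
    pages = response.get("query", {}).get("pages", {})
    # Pages are a dict keyed by page ID by default, a list with formatversion=2.
    page_iter = pages.values() if isinstance(pages, dict) else pages
    urls = []
    for page in page_iter:
        for link in page.get("extlinks", []):
            url = link.get("*") or link.get("url")
            if url:
                urls.append(url)
    return urls


if __name__ == "__main__":
    print(build_extlinks_url("Easter_Island"))
    # Made-up sample response, for illustration only.
    sample = {"query": {"pages": {"1": {"title": "Example",
                                        "extlinks": [{"*": "http://example.org/"}]}}}}
    print(parse_extlinks(sample))
```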

A concrete step you could take is to add extlinks support for old revisions. That would make the user script I am writing even more efficient.