Expose basic article identifiers (title, variant, revision...) to HTML scrapers
Open, Low, Public, Feature

Description

There doesn't seem to be any easy way for a tool that takes a URL and processes the wiki article at that URL (spider, browser extension, etc.) to identify the article well enough to interact with the API. The title and variant are contained in the URL, but they might be in the path or in the query, the path format might depend on wiki configuration, the URL might use a non-canonical encoding, and so on. MediaWiki scripts rely on page variables instead, but those use mw.config so they are not in a parsable format. The most important variables identifying the content (title, variant, revision, maybe page ID) should be embedded in the HTML in a machine-readable format.
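To illustrate the guesswork involved, here is a minimal Python sketch of what a tool has to do today when it only has the URL. The helper name `guess_title` and the `/wiki/` article path prefix are assumptions for illustration; the real prefix depends on the wiki's configuration, which is exactly the problem.

```python
from urllib.parse import urlsplit, parse_qs, unquote


def guess_title(url, article_prefix="/wiki/"):
    """Best-effort guess of the page title from a wiki URL.

    article_prefix is an assumption: the actual article path depends on
    the wiki's configuration, which cannot be read from the URL itself.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # parse_qs percent-decodes the values
    # Case 1: title passed in the query string, e.g. index.php?title=Foo
    if "title" in query:
        return query["title"][0]
    # Case 2: title embedded in the path, e.g. /wiki/Foo
    if parts.path.startswith(article_prefix):
        return unquote(parts.path[len(article_prefix):])
    # Anything else (custom article paths, variant prefixes) is not handled.
    return None
```

Even this small sketch has to hard-code a path prefix and silently gives up on any other layout, and it does not attempt to recover the variant or the revision at all.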

Event Timeline

(The use case where this came up is a browser extension for sending the current article to action=readinglists&command=createentry.)

MediaWiki scripts rely on page variables instead, but those use mw.config so they are not in a parsable format.

Why not? /"wgPageName"\s*:\s*"([^"]+)"/, /"wgRelevantArticleId"\s*:\s*(\d+)/, or something similar depending on what you actually want, will extract the relevant data from the HTML.
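For reference, a rough Python sketch of that regex approach might look like the following. The two patterns are the ones quoted above; the function name `extract_identifiers` and the sample HTML string are made up for illustration and are not taken from a real page.

```python
import re

# Patterns taken from the comment above.
PAGE_NAME_RE = re.compile(r'"wgPageName"\s*:\s*"([^"]+)"')
ARTICLE_ID_RE = re.compile(r'"wgRelevantArticleId"\s*:\s*(\d+)')


def extract_identifiers(html):
    """Pull the page name and article ID out of raw HTML (hypothetical helper)."""
    name = PAGE_NAME_RE.search(html)
    article_id = ARTICLE_ID_RE.search(html)
    return {
        "title": name.group(1) if name else None,
        "article_id": int(article_id.group(1)) if article_id else None,
    }


# Made-up sample; real pages embed many more variables.
sample = (
    '<script>mw.config.set({"wgPageName":"Main_Page",'
    '"wgRelevantArticleId":15580374});</script>'
)
result = extract_identifiers(sample)
```

Note that the title pattern stops at the first double quote, so a title containing an escaped quote (`\"`) would be truncated, and any JSON escape sequences in the value are returned undecoded.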

The advice about parsing HTML with regex probably applies here.

We could just add the variables as a bunch of meta keywords, or something like JSON-LD, to get well-defined, machine-readable syntax.
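As a sketch of what consuming the JSON-LD option could look like, the Python below collects every `application/ld+json` script block using only the standard library. Nothing here is an existing MediaWiki feature: the class name `JsonLdExtractor`, the sample markup, and the property names `name` and `identifier` are all assumptions, since this task does not define a schema.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents arrive as raw text; buffer them until the end tag.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical markup; the property names are placeholders, not a proposal.
sample = (
    '<html><head><script type="application/ld+json">'
    '{"name": "Main Page", "identifier": 15580374}'
    '</script></head><body></body></html>'
)
parser = JsonLdExtractor()
parser.feed(sample)
parser.close()
```

The point of the sketch is that the consumer only needs an HTML parser and a JSON parser, with no pattern matching against inline JavaScript.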

Vvjjkkii renamed this task from "Expose basic article identifiers (title, variant, revision...) to HTML scrapers" to "79daaaaaaa". Jul 1 2018, 1:14 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot raised the priority of this task from High to Needs Triage. Jul 3 2018, 1:40 AM
Aklapper changed the subtype of this task from "Task" to "Feature Request".