Maniphest T193000

Expose basic article identifiers (title, variant, revision...) to HTML scrapers
Open, Low, Public, Feature

Assigned To

None

Authored By

	Tgr
	Apr 25 2018, 10:33 AM

Description

There doesn't seem to be any easy way for a tool that takes a URL and processes the wiki article at that URL (spider, browser extension, etc.) to identify the article well enough to interact with the API. The title and variant are contained in the URL, but they might be in the path or the query, the path format might depend on wiki configuration, the URL might be in a non-canonical encoding, etc. MediaWiki scripts rely on page variables instead, but those use mw.config, so they are not in a parsable format. The most important variables identifying the content (title, variant, revision, maybe page ID) should be embedded in the HTML in a machine-readable format.
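To illustrate why URL parsing alone is fragile, here is a minimal Python sketch of what a tool has to do today. The function name `guess_title_from_url` and the default `article_path="/wiki/"` are assumptions for illustration; a wiki can configure a different path, and variant-specific paths are not handled at all.

```python
from urllib.parse import urlsplit, parse_qs, unquote


def guess_title_from_url(url, article_path="/wiki/"):
    """Best-effort guess of a page title from a wiki URL.

    article_path is an assumption; it depends on wiki configuration.
    Returns None when neither known URL form matches.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if "title" in query:
        # index.php?title=Foo form; parse_qs already percent-decodes.
        title = query["title"][0]
    elif parts.path.startswith(article_path):
        # Pretty-URL form; decode the percent-encoded path remainder.
        title = unquote(parts.path[len(article_path):])
    else:
        return None
    # Normalize spaces to underscores; empty titles count as no match.
    return title.replace(" ", "_") or None
```

Even this small helper has to guess at configuration, which is the gap the task describes: a URL such as `https://example.org/index.php/Foo` yields nothing with the default path.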

Event Timeline

Tgr created this task. Apr 25 2018, 10:33 AM

Restricted Application added a subscriber: Aklapper. Apr 25 2018, 10:33 AM

(The use case where this came up is a browser extension for sending the current article to action=readinglists&command=createentry.)

• Mholloway subscribed. Apr 25 2018, 12:43 PM

Anomie moved this task from Unsorted to Non-core-API stuff on the MediaWiki-Action-API board. Apr 25 2018, 1:47 PM

MediaWiki scripts rely on page variables instead, but those use mw.config so they are not in a parsable format.

Why not? /"wgPageName"\s*:\s*"([^"]+)"/, /"wgRelevantArticleId"\s*:\s*(\d+)/, or something similar depending on what you actually want, will extract the relevant data from the HTML.
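The regex approach suggested above could be sketched in Python roughly as follows. This is an illustration rather than a tested scraper: the function name `extract_page_vars` is hypothetical, the string pattern is slightly stricter than the one quoted (it tolerates escaped quotes inside the value), and it assumes the value is serialized as a JSON-compatible string literal.

```python
import json
import re

# Match a double-quoted string value, allowing backslash escapes inside it.
PAGE_NAME_RE = re.compile(r'"wgPageName"\s*:\s*("(?:[^"\\]|\\.)*")')
ARTICLE_ID_RE = re.compile(r'"wgRelevantArticleId"\s*:\s*(\d+)')


def extract_page_vars(html):
    """Pull selected mw.config values out of raw page HTML with regexes."""
    result = {}
    match = PAGE_NAME_RE.search(html)
    if match:
        # Decode escapes, assuming the literal is valid JSON.
        result["wgPageName"] = json.loads(match.group(1))
    match = ARTICLE_ID_RE.search(html)
    if match:
        result["wgRelevantArticleId"] = int(match.group(1))
    return result
```

This works only as long as the inline script keeps its current shape, which is not a documented contract.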

The advice about parsing HTML with regex probably applies here.

We could just add the variables as a bunch of meta keywords, or something like JSON-LD, to get well-defined, machine-readable syntax.
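If the identifiers were exposed as JSON-LD, a consumer could read them with a real HTML parser instead of a regex. The sketch below uses only the Python standard library; the class and function names are hypothetical, and the property names in the usage note are placeholders, since this task does not define a schema.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._chunks)))


def extract_jsonld(html):
    """Return a list of JSON-LD documents embedded in the page."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    return parser.documents
```

For a page containing `<script type="application/ld+json">{"name": "Foo bar", "identifier": 123}</script>`, `extract_jsonld` returns a one-element list holding that object, with no dependence on how the surrounding markup is laid out.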

• Vvjjkkii renamed this task from Expose basic article identifiers (title, variant, revision...) to HTML scrapers to 79daaaaaaa. Jul 1 2018, 1:14 AM

• Vvjjkkii triaged this task as High priority.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed a subscriber: Aklapper.

• Community_Tech_bot renamed this task from 79daaaaaaa to Expose basic article identifiers (title, variant, revision...) to HTML scrapers. Jul 1 2018, 6:53 AM

• Community_Tech_bot updated the task description.

• Community_Tech_bot removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

• Community_Tech_bot added a subscriber: Aklapper.

CommunityTechBot raised the priority of this task from High to Needs Triage. Jul 3 2018, 1:40 AM

Aklapper triaged this task as Low priority. Oct 9 2023, 12:47 AM

Aklapper changed the subtype of this task from "Task" to "Feature Request".
