- Affected components: TBD.
- Engineer for initial implementation: TBD.
- Code steward: TBD.
Motivation
OpenGraph has become a universal standard for providing information suitable for showing a small preview of a webpage; it is used by a variety of social media sites (e.g. Twitter cards), messaging apps, and search engines.
The PageImages extension provides decent support for og:image, but no similar support for og:description exists. The lack of decent page descriptions is a painful deficiency in MediaWiki, affecting both our internal tools (e.g. T185017) and our ability to share our content with the world (e.g. T142090). There is interest in WMF Readers in changing that.
Requirements
- The HTML response for viewing articles contains <meta property="og:description" content="…">.
- The description is plain text and auto-generated from the first paragraph of the article (also known as a "page summary" or "text extract"); see the sketch after this list for how the tag itself can be emitted.
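For illustration, here is a minimal sketch of how an extension could satisfy the first requirement from a hook. Everything here is hypothetical: the class name and the getDescription() helper are placeholders, and the sketch assumes (as with PageImages' og:image) that OutputPage renders og:-prefixed meta names with a property attribute.

```php
<?php
// Hypothetical hook handler class; getDescription() is a placeholder for
// whatever mechanism ends up producing the plain-text summary (that question
// is the subject of the Exploration section below).
class OpenGraphDescriptionHooks {

	/**
	 * BeforePageDisplay hook: attach the description to article views.
	 */
	public static function onBeforePageDisplay( OutputPage $out, Skin $skin ) {
		if ( !$out->isArticle() ) {
			return;
		}
		$description = self::getDescription( $out->getTitle() );
		if ( $description !== '' ) {
			// OutputPage renders meta names starting with "og:" using the
			// property="" attribute (this is how PageImages emits og:image),
			// so this becomes <meta property="og:description" content="…">.
			$out->addMeta( 'og:description', $description );
		}
	}

	private static function getDescription( Title $title ): string {
		// Placeholder: generating this value is what the RfC is about.
		return '';
	}
}
```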
Exploration
Existing solutions
Some extensions exist for this purpose (e.g. OpenGraphMeta), but they all assume the description is manually provided by users, which doesn't scale.
The TextExtracts extension can provide automatic page descriptions, and it would be straightforward to display those, but the quality is not great: the WMF doesn't really maintain the extension anymore and has opted for a summary service (part of the Page Content Service) instead. That service has significantly more complicated logic for dealing with templates, parentheses full of low-relevance data in the first sentence, and similar intricacies of Wikipedia articles (non-Wikipedias are not supported). The difference is substantial enough that only the logic used in PCS is acceptable for OpenGraph summaries.
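The quality difference is easy to inspect, since both sources are publicly queryable on Wikimedia wikis. The following standalone comparison script reflects the public APIs as far as I know (action API prop=extracts vs. the REST page/summary endpoint backed by PCS); treat the exact parameters and field names as assumptions to verify.

```php
<?php
// Compare the TextExtracts intro with the PCS/REST summary for one article.
$title = 'Albert_Einstein'; // any Wikipedia article title
$opts = stream_context_create( [ 'http' => [
	'header' => "User-Agent: og-description-rfc-demo/0.1\r\n",
] ] );

// TextExtracts: intro text via the action API.
$teUrl = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts' .
	'&exintro=1&explaintext=1&format=json&formatversion=2&titles=' . $title;
$te = json_decode( file_get_contents( $teUrl, false, $opts ), true );
echo "TextExtracts:\n", $te['query']['pages'][0]['extract'] ?? '', "\n\n";

// Page Content Service: summary via the REST API (served through RESTBase).
$pcsUrl = 'https://en.wikipedia.org/api/rest_v1/page/summary/' . $title;
$pcs = json_decode( file_get_contents( $pcsUrl, false, $opts ), true );
echo "PCS summary:\n", $pcs['extract'] ?? '', "\n";
```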
This leaves two potential approaches:
- port the relevant part of PCS to PHP,
- or, figure out how to include output from an external service in the HTML of a page.
Neither of those options looks great. The goal of this RfC is to determine which one is acceptable or preferable, and whether better alternatives exist. Feedback from third-party wikis on how they would generate the text would also be valuable (most people probably don't want to rely on PCS/RESTBase, and it's fairly Wikipedia-specific anyway; what would be the best level at which to abstract it away? e.g. should we fall back to TextExtracts?).
Porting the summary logic in the Page Content Service to MediaWiki
The code that would have to be ported is fairly small, but right now it is not performance-sensitive, uses DOM traversal liberally, and takes a long time for very large pages, whereas as part of parsing/rendering it would have to finish quickly even for a big article. Also, the input would have to be the HTML rendered by the PHP parser instead of Parsoid, which might cause problems. So this would probably be a major rewrite effort, and we'd end up doing the same thing in entirely different ways in MediaWiki and PCS, with a double maintenance burden for description-generation code.
See T214000: Evaluate difficulty of porting PCS summary logic to PHP for details.
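To make "DOM traversal over the rendered HTML" concrete, here is a deliberately naive sketch that just returns the first non-empty <p> of parser output. The real PCS logic (template stripping, dropping low-relevance parentheticals, and so on) is exactly what makes the port expensive, so this illustrates only the general approach, not the logic that would actually need porting.

```php
<?php
/**
 * Naive first-paragraph extraction from rendered page HTML.
 * This intentionally omits all of the PCS-specific cleanup.
 */
function naiveFirstParagraph( string $html ): string {
	$doc = new DOMDocument();
	// Suppress warnings about real-world (imperfect) parser HTML.
	libxml_use_internal_errors( true );
	$doc->loadHTML( '<?xml encoding="utf-8"?>' . $html );
	libxml_clear_errors();

	foreach ( $doc->getElementsByTagName( 'p' ) as $p ) {
		$text = trim( $p->textContent );
		if ( $text !== '' ) {
			return $text;
		}
	}
	return '';
}
```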
Using Page Content Service data in MediaWiki page HTML
Page Content Service uses Parsoid HTML (and to a lesser extent, MediaWiki APIs) as input; Parsoid uses the wikitext provided by the MediaWiki API. So when an edit happens, it needs to be published in MediaWiki, then processed by Parsoid, then processed by PCS. That's too slow for MediaWiki HTML rendering, which is typically invoked immediately after an edit (since the editing user gets redirected back to the page). So a naive approach of simply querying PCS from MediaWiki when a page is rendered wouldn't work.
On the other hand, the description is used by the sharing functionality of social media sites, which is triggered on demand, and maybe to a small extent by web crawlers, which might be triggered by an edit but probably not within seconds. So if the description is wrong or missing for a short time, that should not be a big deal. That means we can use the following strategy when rendering a page (sketched in code after the list):
- Look up the description in some local cache (see below).
- If that failed, query PCS.
  - If PCS gives a fast response and the revision ID matches, use the description it returned, and cache it.
  - If it gives a 404, takes too much time, or the response has an older revision ID, use some kind of fallback (the outdated description returned by the service, TextExtracts, or simply the empty string), and schedule an update job.
  - If the response has a newer revision ID, use a fallback or set the description to empty: we are looking at an old revision, and the description probably won't matter for any real-world use case. (FlaggedRevs and revision deletion might complicate this, though. See T163462 and T203835.)
- The update job ensures some small delay (maybe rescheduling itself a few times if needed, although hopefully there's a cleaner way), then fetches the PCS description, caches it, and purges the page from Varnish.
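A rough sketch of the render-time side of that strategy, expanding the minimal handler from the Requirements section. The cache key, job type, TTL, timeout, and the fetchSummaryJson() helper are all invented for illustration, and the MediaWiki service and job-queue calls should be checked against the target core version.

```php
<?php
use MediaWiki\MediaWikiServices;

class OpenGraphDescriptionHooks {

	public static function onBeforePageDisplay( OutputPage $out, Skin $skin ) {
		if ( !$out->isArticle() || !$out->getTitle() ) {
			return;
		}
		$title = $out->getTitle();
		$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
		$key = $cache->makeKey( 'og-description', $title->getArticleID() );

		// 1. Look up the description in a local cache.
		$description = $cache->get( $key );
		if ( !is_string( $description ) ) {
			// 2. Cache miss: query PCS with a short timeout.
			$summary = self::fetchSummaryJson( $title, 1 /* second */ );
			$pageRevId = $out->getRevisionId();
			if ( $summary && (int)$summary['revision'] === $pageRevId ) {
				// Fast response for the right revision: use it and cache it.
				$description = $summary['extract'];
				$cache->set( $key, $description, 86400 );
			} else {
				// 404 / timeout / revision mismatch: fall back to the possibly
				// outdated extract (or the empty string) and let a job fix
				// things up later. (A real implementation would skip the job
				// when an old revision is being viewed.)
				$description = $summary['extract'] ?? '';
				JobQueueGroup::singleton()->push( new JobSpecification(
					'ogDescriptionUpdate',
					[ 'revId' => $pageRevId ],
					[],
					$title
				) );
			}
		}

		if ( $description !== '' ) {
			$out->addMeta( 'og:description', $description );
		}
	}

	/**
	 * Hypothetical helper: GET the PCS summary for $title and return the
	 * decoded JSON (containing 'extract' and 'revision'), or null on
	 * 404/timeout. Body omitted; e.g. HttpRequestFactory with a timeout.
	 */
	public static function fetchSummaryJson( Title $title, int $timeout ): ?array {
		return null;
	}
}
```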
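And a similarly rough sketch of the corresponding update job (registration in $wgJobClasses omitted). The delayed re-scheduling assumes a job queue backend that supports delayed jobs, and the purge call stands in for whatever CDN/parser cache purge mechanism the target MediaWiki version offers.

```php
<?php
use MediaWiki\MediaWikiServices;

/**
 * Hypothetical 'ogDescriptionUpdate' job: wait for PCS to catch up with the
 * edit, then fetch the summary, cache it, and purge the page so the cached
 * HTML picks up the new description.
 */
class OgDescriptionUpdateJob extends Job {

	public function __construct( Title $title, array $params ) {
		parent::__construct( 'ogDescriptionUpdate', $title, $params );
	}

	public function run() {
		$title = $this->getTitle();
		$summary = OpenGraphDescriptionHooks::fetchSummaryJson( $title, 10 /* seconds */ );

		if ( !$summary || (int)$summary['revision'] < (int)$this->params['revId'] ) {
			// PCS hasn't processed the edit yet: try again a bit later
			// (requires a queue backend with delayed-job support).
			JobQueueGroup::singleton()->push( new JobSpecification(
				'ogDescriptionUpdate',
				[ 'revId' => $this->params['revId'], 'jobReleaseTimestamp' => time() + 10 ],
				[],
				$title
			) );
			return true;
		}

		$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
		$key = $cache->makeKey( 'og-description', $title->getArticleID() );
		$cache->set( $key, $summary['extract'], 86400 );

		// Purge the parser cache and the CDN (Varnish) copy of the page.
		WikiPage::factory( $title )->doPurge();
		return true;
	}
}
```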
This would make page generation non-deterministic and hard to debug or reason about, and it would create a cyclic dependency between MediaWiki and PCS (as PCS relies on Parsoid, which relies on MediaWiki APIs).
Other options considered
- Add a "functional" mode to the PCS summary endpoint, where it takes all data (or at least the wikitext) from the request, uses that data to get the Parsoid HTML (Parsoid already has a "functional" endpoint, or at least an approximation that's close enough for our purposes), and processes that to get the description. This is too slow to be acceptable during page rendering (p99 latency of PCS is tens of seconds). Although it might be used together with the job queue option to remove the cyclic dependency, if that's a major concern.