- Affected components: TBD.
- Engineer for initial implementation: TBD.
- Code steward: TBD.
Motivation
OpenGraph has become a universal standard for providing information suitable for showing a small preview of a webpage; it is used by a variety of social media sites (e.g. Twitter cards), messaging apps, and search engines.
The PageImages extension provides decent support for og:image, but no similar support for og:description exists. The lack of decent page descriptions is a painful deficiency in MediaWiki, affecting both our internal tools (e.g. T185017) and our ability to share our content with the world (e.g. T142090). There is interest in WMF Readers in changing that.
Requirements
- The HTML response for viewing articles contains <meta property="og:description" content="…">.
- The description is plain text and auto-generated from the first paragraph of the article (also known as a "page summary" or "text extract"); see the sketch after this list for how the tag itself can be emitted.
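For illustration, here is a minimal sketch of how an extension could satisfy the first requirement from a hook. Everything here is hypothetical: the class name and the getDescription() helper are placeholders, and the sketch assumes (as with PageImages' og:image) that OutputPage renders og:-prefixed meta names with a property attribute.

```php
<?php
// Hypothetical hook handler class; getDescription() is a placeholder for
// whatever mechanism ends up producing the plain-text summary (that question
// is the subject of the Exploration section below).
class OpenGraphDescriptionHooks {

	/**
	 * BeforePageDisplay hook: attach the description to article views.
	 */
	public static function onBeforePageDisplay( OutputPage $out, Skin $skin ) {
		if ( !$out->isArticle() ) {
			return;
		}
		$description = self::getDescription( $out->getTitle() );
		if ( $description !== '' ) {
			// OutputPage renders meta names starting with "og:" using the
			// property="" attribute (this is how PageImages emits og:image),
			// so this becomes <meta property="og:description" content="…">.
			$out->addMeta( 'og:description', $description );
		}
	}

	private static function getDescription( Title $title ): string {
		// Placeholder: generating this value is what the RfC is about.
		return '';
	}
}
```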
Exploration
Existing solutions
Some extensions exist for this purpose (e.g. OpenGraphMeta), but they all assume the description is manually provided by users, which doesn't scale.
The TextExtracts extension can provide automatic page descriptions, and it would be straightforward to display those, but the quality is not great: the WMF doesn't really maintain the extension anymore and has opted for a summary service (part of the Page Content Service) instead. That service has significantly more complicated logic for dealing with templates, parentheses full of low-relevance data in the first sentence, and similar intricacies of Wikipedia articles (non-Wikipedias are not supported). The difference is substantial enough that only the logic used in PCS is acceptable for OpenGraph summaries.
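The quality difference is easy to inspect, since both sources are publicly queryable on Wikimedia wikis. The following standalone comparison script reflects the public APIs as far as I know (action API prop=extracts vs. the REST page/summary endpoint backed by PCS); treat the exact parameters and field names as assumptions to verify.

```php
<?php
// Compare the TextExtracts intro with the PCS/REST summary for one article.
$title = 'Albert_Einstein'; // any Wikipedia article title
$opts = stream_context_create( [ 'http' => [
	'header' => "User-Agent: og-description-rfc-demo/0.1\r\n",
] ] );

// TextExtracts: intro text via the action API.
$teUrl = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts' .
	'&exintro=1&explaintext=1&format=json&formatversion=2&titles=' . $title;
$te = json_decode( file_get_contents( $teUrl, false, $opts ), true );
echo "TextExtracts:\n", $te['query']['pages'][0]['extract'] ?? '', "\n\n";

// Page Content Service: summary via the REST API (served through RESTBase).
$pcsUrl = 'https://en.wikipedia.org/api/rest_v1/page/summary/' . $title;
$pcs = json_decode( file_get_contents( $pcsUrl, false, $opts ), true );
echo "PCS summary:\n", $pcs['extract'] ?? '', "\n";
```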
This leaves two potential approaches:
- port the relevant part of PCS to PHP,
- or, figure out how to include output from an external service in the HTML of a page.
Neither of those options looks great. The goal of this RfC is to determine which one is acceptable or preferable, and whether better alternatives exist. Feedback from third-party wikis on how they would generate the text would also be valuable (most people probably don't want to rely on PCS/RESTBase, and it's fairly Wikipedia-specific anyway; what would be the best level at which to abstract it away? e.g. should we fall back to TextExtracts?).
Porting the summary logic in the Page Content Service to MediaWiki
The code that would have to be ported is fairly small, but right now it is not performance-sensitive, uses DOM traversal liberally, and takes a long time for very large pages, whereas as part of parsing/rendering it would have to finish quickly even for a big article. Also, the input would have to be the HTML rendered by the PHP parser instead of Parsoid, which might cause problems. So this would probably be a major rewrite effort, and we'd end up doing the same thing in entirely different ways in MediaWiki and PCS, with a double maintenance burden for description-generation code.
See T214000: Evaluate difficulty of porting PCS summary logic to PHP for details.
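To make "DOM traversal over the rendered HTML" concrete, here is a deliberately naive sketch that just returns the first non-empty <p> of parser output. The real PCS logic (template stripping, dropping low-relevance parentheticals, and so on) is exactly what makes the port expensive, so this illustrates only the general approach, not the logic that would actually need porting.

```php
<?php
/**
 * Naive first-paragraph extraction from rendered page HTML.
 * This intentionally omits all of the PCS-specific cleanup.
 */
function naiveFirstParagraph( string $html ): string {
	$doc = new DOMDocument();
	// Suppress warnings about real-world (imperfect) parser HTML.
	libxml_use_internal_errors( true );
	$doc->loadHTML( '<?xml encoding="utf-8"?>' . $html );
	libxml_clear_errors();

	foreach ( $doc->getElementsByTagName( 'p' ) as $p ) {
		$text = trim( $p->textContent );
		if ( $text !== '' ) {
			return $text;
		}
	}
	return '';
}
```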
Using Page Content Service data in MediaWiki page HTML
Page Content Service uses Parsoid HTML (and to a lesser extent, MediaWiki APIs) as input; Parsoid uses the wikitext provided by the MediaWiki API. So when an edit happens, it needs to be published in MediaWiki, then processed by Parsoid, then processed by PCS. That's too slow for MediaWiki HTML rendering, which is typically invoked immediately after an edit (since the editing user gets redirected back to the page). So a naive approach of simply querying PCS from MediaWiki when a page is rendered wouldn't work.
On the other hand, the description is used by the sharing functionality of social media sites, which is triggered on demand, and maybe to a small extent by web crawlers, which might be triggered by an edit but probably not within seconds. So if the description is wrong or missing for a short time, that should not be a big deal. That means we can use the following strategy when rendering a page (sketched in code after the list):
- Look up the description in some local cache (see below).
- If that failed, query PCS.
  - If PCS gives a fast response and the revision ID matches, use the description it returned, and cache it.
  - If it gives a 404, takes too much time, or the response has an older revision ID, use some kind of fallback (the outdated description returned by the service, TextExtracts, or simply the empty string), and schedule an update job.
  - If the response has a newer revision ID, use a fallback or set the description to empty: we are looking at an old revision, and the description probably won't matter for any real-world use case. (FlaggedRevs and revision deletion might complicate this, though. See T163462 and T203835.)
- The update job ensures some small delay (maybe rescheduling itself a few times if needed, although hopefully there's a cleaner way), then fetches the PCS description, caches it, and purges the page from Varnish.
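A rough sketch of the render-time side of that strategy, expanding the minimal handler from the Requirements section. The cache key, job type, TTL, timeout, and the fetchSummaryJson() helper are all invented for illustration, and the MediaWiki service and job-queue calls should be checked against the target core version.

```php
<?php
use MediaWiki\MediaWikiServices;

class OpenGraphDescriptionHooks {

	public static function onBeforePageDisplay( OutputPage $out, Skin $skin ) {
		if ( !$out->isArticle() || !$out->getTitle() ) {
			return;
		}
		$title = $out->getTitle();
		$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
		$key = $cache->makeKey( 'og-description', $title->getArticleID() );

		// 1. Look up the description in a local cache.
		$description = $cache->get( $key );
		if ( !is_string( $description ) ) {
			// 2. Cache miss: query PCS with a short timeout.
			$summary = self::fetchSummaryJson( $title, 1 /* second */ );
			$pageRevId = $out->getRevisionId();
			if ( $summary && (int)$summary['revision'] === $pageRevId ) {
				// Fast response for the right revision: use it and cache it.
				$description = $summary['extract'];
				$cache->set( $key, $description, 86400 );
			} else {
				// 404 / timeout / revision mismatch: fall back to the possibly
				// outdated extract (or the empty string) and let a job fix
				// things up later. (A real implementation would skip the job
				// when an old revision is being viewed.)
				$description = $summary['extract'] ?? '';
				JobQueueGroup::singleton()->push( new JobSpecification(
					'ogDescriptionUpdate',
					[ 'revId' => $pageRevId ],
					[],
					$title
				) );
			}
		}

		if ( $description !== '' ) {
			$out->addMeta( 'og:description', $description );
		}
	}

	/**
	 * Hypothetical helper: GET the PCS summary for $title and return the
	 * decoded JSON (containing 'extract' and 'revision'), or null on
	 * 404/timeout. Body omitted; e.g. HttpRequestFactory with a timeout.
	 */
	public static function fetchSummaryJson( Title $title, int $timeout ): ?array {
		return null;
	}
}
```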
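And a similarly rough sketch of the corresponding update job (registration in $wgJobClasses omitted). The delayed re-scheduling assumes a job queue backend that supports delayed jobs, and the purge call stands in for whatever CDN/parser cache purge mechanism the target MediaWiki version offers.

```php
<?php
use MediaWiki\MediaWikiServices;

/**
 * Hypothetical 'ogDescriptionUpdate' job: wait for PCS to catch up with the
 * edit, then fetch the summary, cache it, and purge the page so the cached
 * HTML picks up the new description.
 */
class OgDescriptionUpdateJob extends Job {

	public function __construct( Title $title, array $params ) {
		parent::__construct( 'ogDescriptionUpdate', $title, $params );
	}

	public function run() {
		$title = $this->getTitle();
		$summary = OpenGraphDescriptionHooks::fetchSummaryJson( $title, 10 /* seconds */ );

		if ( !$summary || (int)$summary['revision'] < (int)$this->params['revId'] ) {
			// PCS hasn't processed the edit yet: try again a bit later
			// (requires a queue backend with delayed-job support).
			JobQueueGroup::singleton()->push( new JobSpecification(
				'ogDescriptionUpdate',
				[ 'revId' => $this->params['revId'], 'jobReleaseTimestamp' => time() + 10 ],
				[],
				$title
			) );
			return true;
		}

		$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
		$key = $cache->makeKey( 'og-description', $title->getArticleID() );
		$cache->set( $key, $summary['extract'], 86400 );

		// Purge the parser cache and the CDN (Varnish) copy of the page.
		WikiPage::factory( $title )->doPurge();
		return true;
	}
}
```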
This would make page generation non-deterministic and hard to debug or reason about, and it would create a cyclic dependency between MediaWiki and PCS (as PCS relies on Parsoid, which relies on MediaWiki APIs).
Other options considered
- Add a "functional" mode to the PCS summary endpoint, where it takes all data (or at least the wikitext) from the request, uses that data to get the Parsoid HTML (Parsoid already has a "functional" endpoint, or at least an approximation that's close enough for our purposes), and processes that to get the description. This is too slow to be acceptable during page rendering (p99 latency of PCS is tens of seconds). Although it might be used together with the job queue option to remove the cyclic dependency, if that's a major concern.