[go: up one dir, main page]

Page MenuHomePhabricator

syntaxhighlight and translate don't play well together in Parsoid
Open, LowPublic

Description

This may not be specific to syntaxhighlight, but more an outcome of how translate works right now (just a string transformation without any adherence to structures like extension boundaries), but in any case, that is one of the impacted extensions in practice on mediawiki.org.

Check the output of https://www.mediawiki.org/wiki/Parsoid?useparsoid=1 and you will see literal translate markup in the syntaxhighlight extension output.

Event Timeline

@ssastry What do you suggest to do instead of this? Being able to translate comments in code blocks is a must-have, but regular extension tags should not be expanded there.

Since Translate's processing of translate markup is post-save rather than post-parse, I am sure we can find a solution in Parsoid for this since Parsoid doesn't right now need to support translation region detection. So, for example, https://www.mediawiki.org/wiki/Parsoid/de?useparsoid=1 works just fine without any special treatment needed.

So, that implies that for extensions that don't process wikitext, Parsoid can simply strip translate markup from the content before passing on that content to the extension's handler. For extensions that process wikitext, Parsoid's existing annotation handling will take care of it (since those extensions will need to implement the sourceToDom handler).

So, we just need a way of distinguishing between these types of extensions and if we don't have a reliable way of doing that with existing config options, we can add one and that should take care of it.

I'll let @ihurbain correct me if I got this wrong since she worked on this originally and probably has better insight into the problem than me.

So, that implies that for extensions that don't process wikitext, Parsoid can simply strip translate markup from the content before passing on that content to the extension's handler.

Wouldn’t this cause roundtrip issues? If I edit a <syntaxhighlight> block in VisualEditor, I don’t want the <translate> tags to get lost.

ssastry triaged this task as High priority.Jul 5 2023, 7:37 PM

It depends. Parsoid's data-mw block will have the original source. So, if VE is using the data-mw block to initialize its native support for SyntaxHighlight, this is okay since VE will then serialize this back into data-mw and Parsoid uses that to generate wikitext.

But, if VE somehow uses the rendered HTML to initialize the editing surface, then yes this can lead to dirty diffs.

That is why I distinguished between wikitext-based extensions and non-wikitext-based extensions. My understanding is that for the latter (like SyntaxHighlight), either there is no native editing support in VE (and you have to directly edit data-mw) OR the native editing support is based on information in data-mw.

But, yes, any solution we implement will be based on consultations with the Editing Team.

I see, I wasn’t aware of data-mw. If it’s documented that the rendered HTML of “non-wikitext-based” extension tags isn’t guaranteed to roundtrip (and VE is checked for taking this into account), that’s a solution. (However, it’s an interesting question what counts as “non-wikitext-based”. Cite is obviously wikitext-based, SyntaxHighlight is obviously non-wikitext-based, but what about extensions that transform wikitext to various degrees before sending it to the parser, like Poem or CharInsert or Quiz?)

So, non-wikitext based extension tags *do* roundtrip but, they rely on changes being set in data-mw by clients ... how clients to do it is up to them.

As for what is wikkitext-based, yes, you are asking the right questions. It is not explicitly signalled in that form right now although at one point, we had considered such options but poem, gallery, imagemap, (you can look at the Parsoid implementatoins in the parsoid repo) and others that mangle their input source to generate target wikitext.

Broadly, extensions can define one or both of source2dom and dom2wikitext handlers. So, Cite (in the Parsoid repo) does this for its Ref & Reference tags. That code knows how to take the edited HTML and convert it back to wikitext. It doesn't rely on data-mw, for example.

But, Poem does not define a dom2wikitext handler. So, that means roundripping after edits relies on the data-mw on the poem output being properly updated.

(TANGENT: In the future, extensions targeting Parsoid & VE could in the future also define a VE editing surface support to bridge the output from its source2dom handler into a form that its dom2wikitext handler knows. That for some future time.).

We also have another option in the config called outputHasCoreMwDomSpecMarkup which is an indirect signal that the extension might be processing some kind of wikitext in its input. So, the strategy I alluded to in my earlier message would be to examine the set of handlers and options to infer something (which is a little unsatisfactory) or define a new config option as to how annotations should be handled.

Alternatively, the extension would have to explicitly deal with annotations by calling a Parsoid Extension API method like "$extApi->processAnnnotations( $extsrc, $handler)" where the default handling would be to simply strip them.

Anyway, to be continued.

I think I prefer the idea of the extension dealing with annotations by calling a Parsoid extension API method.

That said, I think that this would still need to be handled at the level of the extension defining the annotation. The weak, but pragmatic, argument is that, for Translate, we don't need to strip only the annotation markup, but also the translation unit markup (the <!--T:XX--> comments). Additionally, it's not necessarily easily generalizable to strip the annotation markup: we're (by definition) dealing with raw strings, and annotation markup may have attributes - handling that without loss of generality may be a large hammer for a (at this time) pretty small fly.

This also makes me pause on the "processAnnotations" fairly generic method proposal, because it feels like there's too many moving pieces for that.

My proposal would consequently be as follows:

  • Add a stripAnnotations( $text ) in ParsoidExtensionAPI
  • That method would be called from the "non-wikitext-extension" as needed (so in the case of SyntaxHighlight, in SyntaxHighlight.php:sourceToDom before processing the content)
  • It would fetch all annotation tags in the config, find the ExtensionTagHandler, and call stripMarkup for each of them
  • Add an abstract stripMarkup( $extApi, $text ) method in ExtensionTagHandler
  • Implement that stripMarkup for translate and tvar, mostly using what's currently used for TranslatablePageParser::cleanUpTags; smuggle the removal of markers in the translate tag handler.

Thoughts?

https://www.mediawiki.org/wiki/Parsoid/fr is an example of translation inside SyntaxHighlight working as expected.
Interestingly, https://www.mediawiki.org/wiki/Parsoid/fr?useparsoid=1 works properly too. Worth investigation.

Do translated pages have translation markup in them? (I think not?)

They don’t; translation markup is removed when rendering translation pages (this is the official name of pages like Parsoid/fr), as it’s not necessary, since these pages are always automatically rendered, never edited directly.

Change 957757 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Introduce stripping of annotation in non-wikitext extension content

https://gerrit.wikimedia.org/r/957757

Change 957761 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Translate@master] Implement stripMarkup method

https://gerrit.wikimedia.org/r/957761

Change 957763 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Set hasWikitextInput flag to false

https://gerrit.wikimedia.org/r/957763

Change 957757 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Introduce stripping of annotation in non-wikitext extension content

https://gerrit.wikimedia.org/r/957757

Change 962706 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.18.0-a26

https://gerrit.wikimedia.org/r/962706

Change 962706 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.18.0-a26

https://gerrit.wikimedia.org/r/962706

Change 957763 merged by jenkins-bot:

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Set hasWikitextInput flag to false

https://gerrit.wikimedia.org/r/957763

Change 957761 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Configure and implement stripAnnotationMarkup method

https://gerrit.wikimedia.org/r/957761

After the deploy and a purge, https://www.mediawiki.org/wiki/Parsoid?useparsoid=1 looks like the issue has been fixed.

Re-opening this one because there's the case of Translate documentation pages (e.g https://www.mediawiki.org/wiki/Help:Extension:Translate/Page_translation_administration) for which we'd actually want the example translate tags to not be stripped.

Subbu mentioned there may be overlaps with the nowiki handling for syntaxhighlight that may be more relevant (probably in the vicinity of T299103).

ihurbain lowered the priority of this task from High to Low.Jan 23 2024, 5:48 PM

This is not "high" anymore, because it works for the very vast majority of use cases.