

MediaWiki 1.36/wmf.13 needlessly HTML encodes ASCII characters in DjVu text layer
Closed, Resolved (Public)

Description

Since the deployment of MediaWiki 1.36/wmf.13, text extracted from the text layer of DjVu files is subjected to HTML entity encoding for even some basic ASCII characters. This shows up to users as numeric character references such as &#32; (space), &#39; (single quote), and &#10; (newline) in the ProofreadPage page editing interface in the Page: namespace. example page

There should generally be no HTML entity encoding visible here: the displayed text in the WikiEditor should be very nearly the raw content of the file's text layer.
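
To make the symptom concrete (the values below are made up, not taken from the affected file): the preloaded text arrives with numeric character references where plain ASCII is expected, and decoding those references recovers the text that should have been shown.

```php
<?php
// Made-up example of the symptom: the preload text contains numeric character
// references instead of plain ASCII; decoding them yields the expected text.
$preloaded = "It&#39;s&#32;a&#32;fine&#32;day.&#10;Second line.";
$expected  = "It's a fine day.\nSecond line.";

var_dump( html_entity_decode( $preloaded, ENT_QUOTES | ENT_HTML5 ) === $expected ); // bool(true)
```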

So far as I know the provenance here is that when a file is uploaded its text layer is extracted by MediaWiki-DjVu and stuffed into some field of the image metadata in the database, lightly wrapped in an XML structure (see T192866). When a user opens a (redlink) wikipage in the Page: namespace, ProofreadPage extracts that field, removes the XML, and nukes a few select character codes (CR, LF, VS, PB: control characters in this context, not obviously relevant to this issue. See T230415.), before preloading the text into the WikiEditor text field.
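
A rough sketch of that pipeline (illustrative only; the XML element and attribute names here are assumptions, not the real MediaWiki-DjVu/ProofreadPage schema):

```php
<?php
// Illustrative sketch: pull one page's OCR text out of the stored XML metadata
// and strip a few control characters before preloading it into the editor.
// The element/attribute names are assumed, not the actual schema (see T192866).
function getPreloadText( string $xmlMetadata, int $pageNumber ): string {
    $xml = simplexml_load_string( $xmlMetadata );
    if ( $xml === false ) {
        return '';
    }
    // Assumed layout: one <PAGE value="..."> element per page.
    $pages = $xml->xpath( '//PAGE' );
    $text = isset( $pages[ $pageNumber - 1 ] ) ? (string)$pages[ $pageNumber - 1 ]['value'] : '';
    // Drop a handful of control characters (cf. the CR/LF/VS/PB handling, T230415).
    return str_replace( [ "\r", "\x0b", "\x0c" ], '', $text );
}
```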

In one of these steps it seems like a change has introduced HTML entity encoding that was not there before.

The obvious suspects here are MediaWiki-DjVu, ProofreadPage, and WikiEditor, with greatest probability being MediaWiki-DjVu. I'm not aware of changes to any of these in this release, and I've found nothing in the release notes. I know some work has been done on using Remex for character entity reference support and validation in MW over the last two-ish years, but I'm not aware of any recent such changes (and, again, nothing in the release notes). No clue who owns Remex and its uses. Parser? VE?

(CC to @Soda, who has been working on PRP recently and might know off-hand whether there have been any relevant changes to the code there)

Event Timeline

Looking at the whole metadata via the API, the text is fine (search for "name": 307) in https://en.wikisource.org/w/api.php?action=query&titles=File:The%20Victoria%20History%20of%20the%20County%20of%20Lincoln%20Volume%202.pdf&prop=imageinfo&iiprop=metadata

However, when fetched as a single page (https://en.wikisource.org//w/api.php?titles=Page:The+Victoria+History+of+the+County+of+Lincoln+Volume+2.pdf/308&inprop=preload&prop=info&action=query) and transformed into the "proofread-page" content model, the text is corrupted.

The problem is also visible at the API level, so WikiEditor seems an unlikely culprit, I guess.
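
(For what it's worth, the comparison is easy to reproduce against the two endpoints quoted above. Assuming the needless encoding shows up as &#32; for spaces, as in the description, something like the sketch below distinguishes the two responses; this is illustrative, not a test we ship anywhere.)

```php
<?php
// Check the two API responses quoted above for needlessly encoded spaces
// (&#32;). XML-escaping alone never encodes spaces, so a hit points at the
// extra escaping step rather than at the stored metadata itself.
$urls = [
    'imageinfo metadata   ' => 'https://en.wikisource.org/w/api.php?action=query&format=json'
        . '&titles=File:The%20Victoria%20History%20of%20the%20County%20of%20Lincoln%20Volume%202.pdf'
        . '&prop=imageinfo&iiprop=metadata',
    'ProofreadPage preload' => 'https://en.wikisource.org/w/api.php?action=query&format=json'
        . '&titles=Page:The+Victoria+History+of+the+County+of+Lincoln+Volume+2.pdf/308'
        . '&prop=info&inprop=preload',
];
foreach ( $urls as $label => $url ) {
    $body = file_get_contents( $url );
    echo $label . ': ' . ( strpos( $body, '&#32;' ) !== false ? 'contains &#32;' : 'clean' ) . "\n";
}
```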

Hmm. If one API endpoint returns unencoded text, then lower-level components seem unlikely culprits. If one API returns encoded text, then higher-level components seem unlikely. IOW: those results suggest to me that this is happening at the API layer.

There've been some pushes to sanitize and clean up the various API endpoints and their outputs recently (fsvo "recent"). Could this be a side-effect of an attempt to "fix" API output that's also used in a different context?

Change 634105 had a related patch set uploaded (by Tpt; owner: Tpt):
[mediawiki/extensions/ProofreadPage@master] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634105

I'm happy if there is another way to fix this. But note that it's not as easy as the commit message in https://gerrit.wikimedia.org/r/634105 makes it sound. T263371 was about an actual XSS attack vector. I might be wrong, but I believe this means we should apply an alternative fix first, before reverting the previous one.

I'm happy if there is another way to fix this. But note that it's not as easy as the commit message in https://gerrit.wikimedia.org/r/634105 makes it sound. T263371 was about an actual XSS attack vector. I might be wrong, but I believe this means we should apply an alternative fix first, before reverting the previous one.

Would just sanitizing HTML entities work? That way we could pass the text through Sanitizer::removeHTMLtags(), which would offer some amount of protection against XSS. wfEscapeWikiText() outright escapes all single quotes ('), double quotes ("), and double newlines (\n\n, \r\n, or \r\r), which makes proofreading quite a bit more difficult.
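
A minimal sketch of that suggestion, assuming it could be dropped into the preload path and that MediaWiki core's Sanitizer class is available there ($ocrText is a stand-in for the extracted text layer, not a real variable from the extension):

```php
<?php
// Hypothetical sketch of the suggestion above: rather than entity-encoding
// everything, neutralise HTML tags with the core sanitizer, leaving plain ASCII
// such as spaces, apostrophes and newlines in the OCR text alone.
// Requires MediaWiki core to be loaded.
$ocrText = "Tom's line one\n<script>alert(1)</script>\nline two";
$preloadText = Sanitizer::removeHTMLtags( $ocrText );
// The <script> tag is escaped/neutralised rather than passed through as live
// markup, while the surrounding text is left readable for proofreading.
```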

@thiemowmde T263371 is indeed an XSS attack vector if you output the file content directly in HTML. But ProofreadPage only uses the file content to prefill a wikitext content area when a Page: page is created. So, I guess there is no extra threat here compared with just allowing anyone to edit the wiki.

… then I probably still don't get how the scary output shown in T263371 was created. If this can not happen in production, why is there a ticket?

Not having access to T263371, it's hard to say anything intelligent about the specific issue, but…

…untrusted input should ideally be sanitized at the point it first hits the system. For PDF/DjVu text layers this would be the point when it is first extracted and stuffed into iiprop (from where PRP subsequently gets it). Otherwise we'll have untrusted data sitting in a metadata field in the database, and every single bit of code, current or future, that hits it will have to worry about untainting it. And data you get from your own DB is very easy to assume is already sanitized (unless you introduce a concept of "tainted storage" like Perl's taint mode); it tends to "leak" into actually sanitized data when you least expect it.
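
For illustration, "sanitize at the point it first hits the system" could look roughly like the sketch below, run at extraction time before the text is stored in the image metadata (a sketch of the idea, not the actual MediaWiki-DjVu code):

```php
<?php
// Sketch only: normalise the extracted text layer once, at ingestion time, so
// every later consumer reads already-cleaned data from the metadata field.
function sanitizeExtractedTextLayer( string $raw ): string {
    // Replace invalid UTF-8 byte sequences with the substitute character.
    $clean = mb_convert_encoding( $raw, 'UTF-8', 'UTF-8' );
    // Strip control characters (Unicode category C) except newline and tab.
    return preg_replace( '/[^\P{C}\n\t]/u', '', $clean );
}
```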

Also, Commons must have dealt with this before for EXIF, so maybe there are useful examples lurking around somewhere to use as a template. Granted Commons can just nerf the data with prejudice, but the approach could still be transferable.

… then I probably still don't get how the scary output shown in T263371 was created. If this can not happen in production, why is there a ticket?

T263371 is about a proprietary extension that directly outputs the PDF/DjVu text layer in HTML. Not about ProofreadPage's text layer preload usage.

This problem is particularly annoying on the French Wikisource because the HTML entities are not even valid (an extra space is added before the semicolon, maybe because of French typography rules), which makes it necessary to fix all of these entities by hand on every page (example).

Apparently the HTML entities are fixed automatically in the English Wikisource (when I try in this book). ~~~~
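
(To illustrate the invalid-entity form reported above for the French Wikisource: the entity values below are invented, and the inserted character may well be a non-breaking space rather than a plain one, but closing the gap before decoding repairs such broken references.)

```php
<?php
// Made-up example of the French Wikisource symptom: a space (possibly U+00A0)
// has crept in before the ";", so the reference is invalid and shown as-is
// unless the gap is closed before decoding.
$broken   = "l&#8217 ;ancien r&#233 ;gime";
$repaired = preg_replace( '/(&#\d+)[ \x{00A0}]+;/u', '$1;', $broken );
echo html_entity_decode( $repaired, ENT_QUOTES | ENT_HTML5 ); // l’ancien régime
```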

Change 634416 had a related patch set uploaded (by Reedy; owner: Tpt):
[mediawiki/extensions/ProofreadPage@wmf/1.36.0-wmf.13] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634416

… then I probably still don't get how the scary output shown in T263371 was created. If this can not happen in production, why is there a ticket?

A third-party wiki (and extension) using the functionality but not escaping it themselves, seemingly.

Change 634105 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634105

Change 634416 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@wmf/1.36.0-wmf.13] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634416

Mentioned in SAL (#wikimedia-operations) [2020-10-16T10:35:41Z] <reedy@deploy1001> Synchronized php-1.36.0-wmf.13/extensions/ProofreadPage/: Revert excessive escaping T265571 (duration: 01m 12s)

Apparently the HTML entities are fixed automatically in the English Wikisource (when I try in this book). ~~~~

English Wikisource deployed a default Gadget that simply does the reverse of the original patch earlier today (it replaces character entity references with literal characters). But judging by the logs and my quick testing this appears to have now been rolled back so the Gadget is no longer needed.

English Wikisource deployed a default Gadget that simply does the reverse of the original patch earlier today (it replaces character entity references with literal characters). But judging by the logs and my quick testing this appears to have now been rolled back so the Gadget is no longer needed.

Yes, I confirm it appears to be fixed on the French Wikisource too.