Description

Flow dumps, and especially the history dumps, run slowly on any project with a large number of revisions; see e.g. mediawikiwiki with 300k revs. Every month this is one of the last steps to finish; this month the external store dbs were rebooted, which caused the step to be interrupted and restarted, so it finished after the end of the month. Time to fix this once and for all.

Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Stalled | | ArielGlenn | T164262 Make flow dumps run faster |
| Resolved | • | Catrope | T167456 Some Flow content contains control characters (e.g. \b (backspace)) |
| Duplicate | | None | T167821 Parsoid will faithfully output characters from mwentity entries that are illegal in XML, such as backspace (0x08) |
| Open | | None | T172025 Flow isAllowed gets actual revision text before it is needed |
Event Timeline
See also https://phabricator.wikimedia.org/T89398 where it was thought that T138306 (now T105859) was the problem. I have done some investigating, and it looks like the issue is much simpler: we get about 300k revisions from external store, one at a time, every time the flow history dumps run for mediawikiwiki (200k for the "current" dumps). That's enough by itself. I'm going to implement prefetch for these dumps if at all possible, reading the revision content from the previous successful run and making sure the content length according to the db matches the content length of the data read from the file. This means checking to see how boards are ordered when dumped (I hope board_id ascending!) and how revisions are ordered within a board when dumped. If we don't have a determinate order, that will need to be fixed first.
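Roughly, the prefetch check I have in mind is no fancier than this (a sketch with made-up names, not code that exists anywhere yet):

```php
<?php
/**
 * Sketch only (hypothetical helper, nothing that exists yet): reuse text from
 * the previous dump only if its length matches what the database says this
 * revision should have; otherwise fall back to fetching from external store.
 */
function usePrefetchedText( $prefetchedText, $dbContentLength ) {
	if ( $prefetchedText === null ) {
		// revision not found in the old dump file
		return null;
	}
	if ( strlen( $prefetchedText ) !== (int)$dbContentLength ) {
		// length mismatch: don't trust the old dump, refetch instead
		return null;
	}
	return $prefetchedText;
}
```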
I've been running test stubs against mediawikiwiki, and have a draft of a flow-based dumpTextPass. However, while a stubs (metadata only) dump of "current" flow revisions completes in 4 hours, the stubs dump with all revisions is very slow indeed. I'm looking into which revisions take longer and why; in some cases we see 10 revisions a minute, which is extremely slow. As far as I can tell this is without content loading, though I'll be doing some double-checking to be sure that retrieving just the metadata does not also retrieve the revision content as an unplanned side effect.
Here's a sample: 11 revisions in 60 seconds. The metadata for ten of those 11 is displayed below.
grep '<revision id' /home/datasets/testing/public/mediawiki/20170513/mediawikiwiki-20170513-flowhistory.xml | wc -l ; sleep 60; grep '<revision id' /home/datasets/testing/public/mediawiki/20170513/mediawikiwiki-20170513-flowhistory.xml | wc -l
106676
106687

grep '<revision id' /home/datasets/testing/public/mediawiki/20170513/mediawikiwiki-20170513-flowhistory.xml | tail -10
<revision id="r14uksmdm0tyr3vn" userid="0" userip="88.130.117.234" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="r14uksmdm0tyr3vn" flags="utf-8,html" modstate="" contentlength="933" previouscontentlength="0" treeparentid="r14u0vzwfj6g7mu4" treedescendantid="r14uksmdm0tyr3vn" treerevid="r14uksmdm0tyr3vn" treeoriguserid="0" treeoriguserip="88.130.117.234" treeoriguserwiki="mediawikiwiki" />
<revision id="r14urkxckawlt3if" userid="0" userip="88.130.117.234" userwiki="mediawikiwiki" parentid="r14uksmdm0tyr3vn" changetype="edit-post" type="post" typeid="r14uksmdm0tyr3vn" flags="utf-8,html" modstate="" lasteditid="r14urkxckawlt3if" edituserid="0" edituserip="88.130.117.234" edituserwiki="mediawikiwiki" contentlength="1031" previouscontentlength="933" treeparentid="r14u0vzwfj6g7mu4" treedescendantid="r14uksmdm0tyr3vn" treerevid="r14urkxckawlt3if" treeoriguserid="0" treeoriguserip="88.130.117.234" treeoriguserwiki="mediawikiwiki" />
<revision id="r14utlu3qrkl5uvi" userid="0" userip="220.244.174.211" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="r14utlu3qrkl5uvi" flags="utf-8,html" modstate="" contentlength="251" previouscontentlength="0" treeparentid="r14uksmdm0tyr3vn" treedescendantid="r14utlu3qrkl5uvi" treerevid="r14utlu3qrkl5uvi" treeoriguserid="0" treeoriguserip="220.244.174.211" treeoriguserwiki="mediawikiwiki" />
<revision id="r14uwxnu5p2amfdw" userid="0" userip="220.244.174.211" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="r14uwxnu5p2amfdw" flags="utf-8,html" modstate="" contentlength="160" previouscontentlength="0" treeparentid="r14utlu3qrkl5uvi" treedescendantid="r14uwxnu5p2amfdw" treerevid="r14uwxnu5p2amfdw" treeoriguserid="0" treeoriguserip="220.244.174.211" treeoriguserwiki="mediawikiwiki" />
<revision id="r14v45el0x1yj6b6" userid="0" userip="88.130.117.234" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="r14v45el0x1yj6b6" flags="utf-8,html" modstate="" contentlength="324" previouscontentlength="0" treeparentid="r14utlu3qrkl5uvi" treedescendantid="r14v45el0x1yj6b6" treerevid="r14v45el0x1yj6b6" treeoriguserid="0" treeoriguserip="88.130.117.234" treeoriguserwiki="mediawikiwiki" />
<revision id="r14vglxk8r5p7882" userid="0" userip="220.244.174.211" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="r14vglxk8r5p7882" flags="utf-8,html" modstate="" contentlength="273" previouscontentlength="0" treeparentid="r14v45el0x1yj6b6" treedescendantid="r14vglxk8r5p7882" treerevid="r14vglxk8r5p7882" treeoriguserid="0" treeoriguserip="220.244.174.211" treeoriguserwiki="mediawikiwiki" />
<revision id="r1h0riajuiuyqi2p" userid="0" userip="88.130.102.253" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="r1h0riajuiuyqi2p" flags="utf-8,html" modstate="" contentlength="202" previouscontentlength="0" treeparentid="r14vglxk8r5p7882" treedescendantid="r1h0riajuiuyqi2p" treerevid="r1h0riajuiuyqi2p" treeoriguserid="0" treeoriguserip="88.130.102.253" treeoriguserwiki="mediawikiwiki" />
<revision id="r1h1dmnf0y0zg0tg" userid="0" userip="72.205.51.228" userwiki="mediawikiwiki" changetype="new-post" type="post" typeid="r1h1dmx2obomyuf7" flags="utf-8,topic-title-wikitext" modstate="" contentlength="43" previouscontentlength="0" treedescendantid="r1h1dmx2obomyuf7" treerevid="r1h1dmnf0y0zg0tg" treeoriguserid="0" treeoriguserip="72.205.51.228" treeoriguserwiki="mediawikiwiki" />
<revision id="r1h1dmx48lgn520l" userid="0" userip="72.205.51.228" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="r1h1dmx48lgn520l" flags="utf-8,html" modstate="" contentlength="578" previouscontentlength="0" treeparentid="r1h1dmx2obomyuf7" treedescendantid="r1h1dmx48lgn520l" treerevid="r1h1dmx48lgn520l" treeoriguserid="0" treeoriguserip="72.205.51.228" treeoriguserwiki="mediawikiwiki" />
<revision id="qyda2tpu079n742a" userid="0" userip="37.4.83.50" userwiki="mediawikiwiki" changetype="new-post" type="post" typeid="qyda2tovpffm0bz3" flags="utf-8,topic-title-wikitext" modstate="" contentlength="37" previouscontentlength="0" treedescendantid="qyda2tovpffm0bz3" treerevid="qyda2tpu079n742a" treeoriguserid="0" treeoriguserip="37.4.83.50" treeoriguserwiki="mediawikiwiki" />
Note that this is a "stubs" run, with output going to a plain text file, no compression. Earlier revisions were produced much more quickly.
The stubs run, though it explicitly does not call $revision->getContent() in formatRevision(), still requests revision content from the external storage servers, as seen via strace. Hunting that down now.
It seems that somehow Flow\RevisionActionPermissions->isAllowed() winds up getting actual revision text if you follow the code far enough down the rabbit hole.
- $last = $cache->getLastRevisionFor( $revision ); leads to
- $lastRevision = $revision->getCollection()->getLastRevision(); which eventually leads to
- AbstractCollection::getAllRevisions(), which calls
- Flow\Data\ObjectLocator::getStorage()->find(), which after a few layers gets around to
- Flow\Data\Storage\RevisionStorage::mergeExternalContent(), which, once you get through mergeMulti(), is going to do
- ExternalStore::batchFetchFromURLs(), and there you have it.
We only need the metadata in order to make permission decisions; revision content ought to be lazy-loaded, or at the very least there ought to be an option to lazy-load it.
I should point out that this is all kicked off from Flow\Dump\Exporter by formatRevisions(), where isAllowed() is (rightly) called (formatWorkflow -> formatTopic -> formatPost -> formatRevisions). I'm not sure which of the code paths to formatRevisions() is travelled first, if that matters.
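To make the lazy-loading point above concrete, here's a minimal sketch of what I mean (hypothetical wrapper, not Flow's actual classes; only ExternalStore::fetchFromURL() is real):

```php
<?php
/**
 * Hypothetical illustration only, not Flow's actual classes: keep just the
 * external store URL in the metadata object and hit the storage cluster the
 * first time getContent() is actually called.
 */
class LazyRevisionContent {
	/** @var string */
	private $contentUrl;
	/** @var string|null */
	private $content = null;
	/** @var bool */
	private $fetched = false;

	public function __construct( $contentUrl ) {
		$this->contentUrl = $contentUrl;
	}

	public function getContent() {
		if ( !$this->fetched ) {
			// only now do we talk to the external storage cluster
			$this->content = ExternalStore::fetchFromURL( $this->contentUrl );
			$this->fetched = true;
		}
		return $this->content;
	}
}
```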
There's a point in my text pass code (will run after stubs, once stubs don't look up revision text :-D) where I need to be able to get an AbstractRevision object from the rev id (UUID). I didn't see any way to do that offhand; what's the right procedure? The idea is that the stub file has the revision metadata including the rev id, but no content; in order to get the content I want AbstractRevision::getContent() but how do I instantiate that object?
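For what it's worth, the route I've been poking at looks roughly like the sketch below; I'm not at all sure it's the approved way, and the assumption that ManagerGroup::get() hands back the right AbstractRevision subclass is exactly what I'd like confirmed:

```php
<?php
use Flow\Container;
use Flow\Model\UUID;

/**
 * Sketch only, quite possibly via unapproved methods: turn an alphadecimal
 * rev id read out of the stub file back into a revision object so that
 * getContent() can be called on it later.
 */
function getRevisionFromStubId( $alphadecimalRevId, $revClass = 'PostRevision' ) {
	$uuid = UUID::create( $alphadecimalRevId );
	// assumption: Container::get( 'storage' ) returns the ManagerGroup and
	// ManagerGroup::get( $class, $id ) hands back the AbstractRevision subclass
	$storage = Container::get( 'storage' );
	return $storage->get( $revClass, $uuid );
}
```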
Change 355077 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] split up flow dumps into stubs and content passes
Change 355080 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/Flow@master] generate metadata dumps as first pass, rev text as second pass dumps
Change 355085 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] move most of dumpTextPass to an include file so it can be subclassed
The code as committed works OK, though I have yet to test it for multiple topics. However, it does depend on the order in which revisions are currently dumped; i.e. right now it expects that revisions added to a board since the previous dump will be displayed after the currently existing revisions, and that the order in which existing revisions are dumped does not change. @Mattflaschen-WMF can you weigh in on this? Are new revisions added at the bottom of a topic, or of a parent post? Where can they be inserted? Is the order of revisions, posts and children, and topics fixed, or can everything move around? Once I have a clearer picture on that, I can fix up anything needed in these patches.
Also note that I wrote a rather horrid function that will get a PostSummary/PostRevision/Header from a revision id as read out of an xml file. It's probably using all the unapproved methods, but since I don't know which ones those are, perhaps you might have a look? Though that can wait until the rev order issue is sorted out.
Answered inline.
By "displayed" I understand you to mean the order of the XML file in the old dump format (Flow\Dump\Exporter).
If I understand correctly, I don't think the first one holds. If we assume there is a topic title/topic T with two post children, A and B (each with only one revision), it will basically be (simplified structure, but the order should be right):
<first revision of T> <children> <first revision of A> <first revision of B> </children>
Then say you edit A (say you're fixing the spelling of your own comment). It then looks basically like:
<first revision of T> <children> <first revision of A> <second revision of A and newest revision overall> <first revision of B> </children>
The newest revision is added to the A group (not after the B revision), which invalidates "will be displayed after the currently existing revisions". This is the structure of formatPost.
I think "the order in which existing revisions are dumped does not change" holds, except there can be missing revisions in the before or after dumps due to deletion/undeletion.
While I sort out the post id stack I'm going to need to keep in order to make this work, just a quick question: posts (as in <post id="blah">) only live under topics, nothing else, right? Not in descriptions or summaries?
I've been working with a test wrapper to check my walking of the board/topic/post id stack. In the meantime I have come across a flow revision with a literal ^H in the mediawikiwiki flow history dumps. I'm trying to track that down; XML readers break on the CDATA there, of course.
Verified that the revision text has ^H (\b) buried in it after gunzip of the compressed data direct from the external storage cluster with no transformations applied. I'll open a separate ticket for this with all the details.
T167456: Some Flow content contains control characters (e.g. \b (backspace)) has the details of the bad revision. I will check shortly to see if there are others; if so I will note them on that ticket.
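For the scan, something along these lines should do (just a sketch; the character class covers the control characters that XML 1.0 forbids):

```php
<?php
/**
 * Sketch of the check: XML 1.0 allows only #x9, #xA and #xD below #x20, so
 * any revision text matching this character class needs to be flagged.
 */
function hasIllegalXmlChars( $text ) {
	return (bool)preg_match( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $text );
}
```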
Well, bad news. I had counted on topic ids within a board, posts at the same tree level within a topic, etc., being ordered ascending. Without this, prefetch cannot work, because we cannot know reliably when to stop looking for an item that may not be in the file used for prefetch.
But as we see from the below (contents extracted from mediawikiwiki flow history), this is not so.
<board id="t5teroqw3qjxdm5g" title="Talk:ORES_review_tool">
<topic id="t5teroseo1rm857o">
<topic id="tagv0i8yso811d8t">
<topic id="takkdfuhsrxyti5b">
<topic id="tb135cnjjnzkmlg8">
<topic id="tb7oeu8ar5q7wzdz">
<topic id="tc4gikdsra0uhwld">
<topic id="tfgrv608gpuuyl94">
<topic id="tfx6tpcmxqzhbu16">
---><topic id="tb34n8tdmpv4vc38">
<topic id="tflhjj5x1numzg67">
<topic id="tg02ussctcvtfmtp">
<topic id="tk7xilv8wsgui390">
---><topic id="ta80e2wkxahx4ns5">
I have code that otherwise works (walks the post stack properly etc), which I'll stuff into gerrit just for the record. But at this point, I don't see a way forward without altering the existing dumps. @Mattflaschen-WMF Can the dump and the import be rewritten to maintain ascending order of each item within its parent?
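For clarity, the ordering I'm after is nothing more than this (a sketch of the desired sort, not a claim about how Exporter builds its list today; it assumes the alphadecimal form of Flow UUIDs is fixed-width, which it appears to be in the dumps above):

```php
<?php
/**
 * Sketch of the ordering I'm after: the alphadecimal form of Flow UUIDs looks
 * to be fixed-width base36 (e.g. "t5teroqw3qjxdm5g"), so a plain string sort
 * gives ascending UUID order; if that assumption is wrong, the binary form
 * would have to be compared instead.
 */
function sortTopicIds( array $topicAlphadecimalIds ) {
	sort( $topicAlphadecimalIds, SORT_STRING );
	return $topicAlphadecimalIds;
}
```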
https://gerrit.wikimedia.org/r/#/c/355080/3/maintenance/backupPrefetch.inc And there's the code. It's been tested against a script that pretends to have read through the stubs and come across revision(s) it wants.
Change 362177 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/Flow@master] force topics to be exported by UUID order
Yes (always nested somewhere under a topic), but as you've probably seen, not necessarily *directly* under the topic. It can be nested inside another <post>.
Change 362177 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Force topics to be exported by UUID order
Change 433680 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/Flow@master] option to dump by range of workflow ids b36-encoded
Yesterday's run took 3 hours; today's went even better: 1.5 hours, with batches of 200 workflows each. I tried the same batch size on snapshot1007 (php5) and it slowed to a crawl on the 6th batch, the behavior we've seen all this time in the flow history dumps. I'm trying a full run with no batching on the php7/stretch host now to see if that's magically better also.
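For reference, the batching on the driver side is nothing more elaborate than this sketch (how each start/end pair then gets passed to the Flow dump script depends on whatever option names the merged change settled on, not necessarily anything shown here):

```php
<?php
/**
 * Sketch of the driver-side batching: cut the workflow ids (already in
 * ascending b36 order) into chunks of $batchSize and turn each chunk into a
 * start/end pair for one invocation of the Flow dump script.
 */
function makeWorkflowBatches( array $sortedWorkflowIds, $batchSize = 200 ) {
	$ranges = [];
	foreach ( array_chunk( $sortedWorkflowIds, $batchSize ) as $chunk ) {
		$ranges[] = [ 'start' => reset( $chunk ), 'end' => end( $chunk ) ];
	}
	return $ranges;
}
```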
Change 433680 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] option to dump by range of workflow ids b36-encoded
This means "i am not going to look at it for the next long while". I just saw a really fast run on php7/stretch: hours instead of days. No idea why but I'll take it. Downgrading priority too.
Change 355085 merged by jenkins-bot:
[mediawiki/core@master] move most of dumpTextPass to an include file so it can be subclassed