
Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors)
Closed, Duplicate · Public · PRODUCTION ERROR

Description

A baseline of these exceptions is almost always present, but the last 12 hours show a significant increase over that baseline:

image.png (265×684 px, 18 KB)

exception.class is one of Wikimedia\Rdbms\DBTransactionStateError, Wikimedia\Rdbms\DBQueryError


Sample error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Rdbms\DBQueryError: Error 1205: Lock wait timeout exceeded; try restarting transaction
Function: MediaWiki\Extension\DiscussionTools\ThreadItemStore::insertThreadItems
Query: INSERT INTO `discussiontools_item_ids` (it
exception.trace
from /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/Database.php(1523)
#0 /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/Database.php(1507): Wikimedia\Rdbms\Database->getQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/Database.php(1481): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
#2 /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/Database.php(854): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#3 /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/DatabaseMysqlBase.php(587): Wikimedia\Rdbms\Database->query(string, string, integer)
#4 /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/Database.php(1885): Wikimedia\Rdbms\DatabaseMysqlBase->doUpsert(string, array, array, array, string)
#5 /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->upsert(string, array, string, array, string)
#6 /srv/mediawiki/php-1.41.0-wmf.2/includes/libs/rdbms/database/DBConnRef.php(551): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.41.0-wmf.2/extensions/DiscussionTools/includes/ThreadItemStore.php(383): Wikimedia\Rdbms\DBConnRef->upsert(string, array, string, array, string)
#8 /srv/mediawiki/php-1.41.0-wmf.2/extensions/DiscussionTools/includes/Hooks/DataUpdatesHooks.php(48): MediaWiki\Extension\DiscussionTools\ThreadItemStore->insertThreadItems(MediaWiki\Revision\RevisionStoreRecord, MediaWiki\Extension\DiscussionTools\ContentThreadItemSet)
#9 /srv/mediawiki/php-1.41.0-wmf.2/includes/deferred/MWCallableUpdate.php(38): MediaWiki\Extension\DiscussionTools\Hooks\DataUpdatesHooks->MediaWiki\Extension\DiscussionTools\Hooks\{closure}()
#10 /srv/mediawiki/php-1.41.0-wmf.2/includes/deferred/DeferredUpdates.php(473): MWCallableUpdate->doUpdate()
#11 /srv/mediawiki/php-1.41.0-wmf.2/includes/deferred/RefreshSecondaryDataUpdate.php(103): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#12 /srv/mediawiki/php-1.41.0-wmf.2/includes/deferred/DeferredUpdates.php(473): RefreshSecondaryDataUpdate->doUpdate()
#13 /srv/mediawiki/php-1.41.0-wmf.2/includes/Storage/DerivedPageDataUpdater.php(1836): DeferredUpdates::attemptUpdate(RefreshSecondaryDataUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#14 /srv/mediawiki/php-1.41.0-wmf.2/includes/page/WikiPage.php(2145): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates(array)
#15 /srv/mediawiki/php-1.41.0-wmf.2/includes/jobqueue/jobs/RefreshLinksJob.php(244): WikiPage->doSecondaryDataUpdates(array)
#16 /srv/mediawiki/php-1.41.0-wmf.2/includes/jobqueue/jobs/RefreshLinksJob.php(153): RefreshLinksJob->runForTitle(MediaWiki\Title\Title)
#17 /srv/mediawiki/php-1.41.0-wmf.2/extensions/EventBus/includes/JobExecutor.php(79): RefreshLinksJob->run()
#18 /srv/mediawiki/rpc/RunSingleJob.php(77): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#19 {main}
Impact
Notes

Event Timeline

Looking at the latest errors, it seems related to ruwiki & discussion tools.

It affects more than one wiki/section; ruwiki alone seems to account for about half of all errors.

The errors are "Duplicate entry '0-X' for key 'itr_itemid_id_revision_id'" on INSERT INTO discussiontools_item_revisions, and "Lock wait timeout exceeded" on INSERT INTO discussiontools_item_ids and INSERT INTO discussiontools_items.

Either the write logic needs performance tuning, or some db traffic has shifted in a way that requires optimization to avoid db contention.

If this is caused by a deploy, I would revert, as it is overloading ruwiki dbs (s6). If this is caused by traffic, I would start banning, as this could bring down s6 wikis:

Screenshot_20230404_230934.png (798×2 px, 275 KB)

It has stopped now, but for a while writes were at 40x the normal rate on s6:
https://grafana.wikimedia.org/goto/qCzK_vLVz?orgId=1

matmarex subscribed.

(summarizing from IRC:) The DiscussionTools errors are a known issue (T323077), but their rate so far has been low. We didn't make any related deployments recently (since this feature was enabled on ruwiki on March 21 in T315353#8716239).

Can it be related to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/905727 ?

If not, could it simply be that a refreshlinks job started reparsing all of ruwiki, or that the maint script got to ruwiki?

After looking at the binlogs, there were a lot of inserts happening in a very short period of time. While T323077 itself is not super concerning, the fact that something triggered so many writes (which is why the above produced so many errors) is concerning, as it caused lag and general performance issues (significantly increased latency). I'm not sure whether this was caused by a job/normal operation or by increased user traffic, but we should try to contain the write rate to avoid overloading other write operations. MediaWiki\Extension\DiscussionTools\ThreadItemStore::insertThreadItems should at the very least be throttled (unless it was triggered by something else that shouldn't have happened).
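A minimal sketch of what such a throttle could look like, purely as an assumption on my part rather than the extension's actual code; the 100-row batch size and the $insertChunk callback stand in for the real upsert() calls in ThreadItemStore::insertThreadItems():

```php
<?php
// Sketch only: chunk the thread item rows and wait for replication between
// batches, using MediaWiki's standard transaction-ticket pattern.
use MediaWiki\MediaWikiServices;

function insertThreadItemsThrottled( array $rows, callable $insertChunk ): void {
	$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
	$ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );

	foreach ( array_chunk( $rows, 100 ) as $chunk ) {
		// Perform the actual INSERT/UPSERTs for this chunk (placeholder).
		$insertChunk( $chunk );

		// Commit the pending writes and block until replicas catch up,
		// so a burst of refreshLinks-driven updates cannot pile up lag.
		$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
	}
}
```

This is the same pattern maintenance scripts and large deferred updates already use to pace their writes against replication.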

T323077 is a deadlock, and my guess is that refreshlinks tries to run the extension data updates twice, leading to this mess. @jcrespo, if you can give me some more of the data you have, I'll take a look. There might be a bug in the code that we missed.

jcrespo renamed this task from Greater than average number of DBTransactionStateError/DBQueryErrors to Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors).Apr 4 2023, 10:07 PM

I've updated the ticket to make explicit that T323077 is not that concerning at the moment; the increased write throughput is (it merely amplifies the errors from deadlocks). The deadlocks just generate noise and make the inserts slightly less performant. The write load is what causes the lag and the app server latency issues.

It has subsided drastically now. I tried to find some of it in the binlog but couldn't. I think a template edit led to a lot of refreshes and subsequently a lot of writes; given that the maint script hasn't been run on ruwiki yet, any refreshlinks job on a page will naturally lead to a lot of writes. So this is a problem that I think will go away.

That being said, I really wish we didn't have the rev-based part of it, but there is product reasoning for it.

I'm inclined to mark this as declined. As I mentioned above, the new table is blank and hasn't been filled yet, so refreshlinks might do a lot of writes until we actually backfill the data. Not an invalid issue, but something that will simply go away in the future.

That's ok to me. It just looked bad if it continued at that level for a long period of time.

Yeah, it was totally the correct thing to do.

The lag spike seemed pretty bad: https://grafana-rw.wikimedia.org/d/G9kbQdRVz/mediawiki-loadbalancer?orgId=1&from=1680599979411&to=1680629413992&viewPanel=109

Ideally, we'd throttle writes at that point. Read-only mode throttling was removed in 6365db67ba34aded441f28b8e00463c74e2f21d9.
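For illustration, a rough sketch of the kind of lag-aware write throttle this implies (the threshold, retry count, and function name are made up, and note that it only observes replicas that the local load balancer knows about):

```php
<?php
// Rough sketch (assumption): before doing another batch of deferred writes,
// check replica lag on the local cluster and back off until it drops,
// instead of flipping the wiki into read-only mode.
use MediaWiki\MediaWikiServices;

function waitUntilReplicationLagIsLow( int $maxAllowedLag = 5, int $maxTries = 10 ): bool {
	$lb = MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->getMainLB();

	for ( $i = 0; $i < $maxTries; $i++ ) {
		// getMaxLag() returns [ host, lag in seconds, server index ].
		[ , $lag ] = $lb->getMaxLag();
		if ( $lag <= $maxAllowedLag ) {
			return true;
		}
		sleep( 1 ); // crude backoff; a job could also re-queue itself instead
	}
	return false; // still lagged; the caller may skip or defer the write
}
```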

We probably should enable rpl_semi_sync_master_wait_no_slave and maybe bump rpl_semi_sync_master_timeout at least.

The excessive writes caused replag not on the main dc but on the secondary dc, because waitForReplication doesn't wait for replicas of the secondary datacenter, nor for semi-sync, so upping the timeout or tweaking semi-sync won't do anything here. Adding a throttle won't fix the issue either, as it was caused by an overzealous job, not by users, and making wikis read-only would turn the problem of "some pages serve outdated reads in the secondary dc" (arguably small impact) into "the wiki goes fully read-only", which is basically a full outage. I prefer graceful degradation here rather than panic.

Tweaking the timeouts and async mode behavior seems useful for throttling fast writes. Our current timeout is around 100ms. That doesn't involve a binary read-only mode, just slower writes.

In this specific case, it wasn't slow writes, it was a flood of writes. I'm not sure we should solve a problem that doesn't exist.

I never said "slow" writes. I literally said "fast writes".

Lag that is only in the secondary DC does indeed make things tricky. Since both DCs handle traffic, minutes of lag affect both readers and editors.

  • For readers, the lag will not affect most page views, so it's not worth mitigating if doing so is complicated.
  • The lag will cause confusing edit conflicts among editors of popular pages. After the user merges the conflicts, things proceed normally (since save/preview uses HTTP POST), so it's tolerable. Clients doing high-rate edits will normally be pinned to the primary DC as long as they respect the sticky DC cookie, which avoids such conflicts on subsequent edits to any pages for 10 seconds.
  • The lag also causes editors' own changes to seemingly disappear once the 10-second sticky DC cookie expires. The $wgDatabaseReplicaLagWarning notice is buried in the footer and easily missed. Assuming the typical post-edit workflow does not involve navigating back to the same page or to another affected page (e.g. Special:Contributions) in the next few minutes (except within the first 10 seconds), and the lag resolves in that time frame, most editors won't notice. I don't think that's a safe enough assumption though, especially without any throttle to stop the lag from growing at some point.

Needing manual SRE/dev-ops intervention to handle cross-DC lag due to network/hardware problems makes sense (otherwise availability would suffer). What I don't like is that intervention is needed due to high write volume from jobs or clients.