
Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC
Closed, Resolved (Public)

Description

From #wikimedia-operations

04:24 < icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen

04:24 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.57 seconds                                                                                                                                     

04:24 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db1099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.85 seconds                                                                                                                                     

04:24 < icinga-wm> PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 628.36 seconds                                                                                                                                     

05:18 < icinga-wm> PROBLEM - High lag on wdqs1005 is CRITICAL: 3.041e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen                                      

05:20 < icinga-wm> PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1979 bytes 

05:22 < icinga-wm> RECOVERY - MariaDB Slave Lag: s8 on db2086 is OK: OK slave_sql_lag Replication lag: 0.33 seconds                                                                                                                                                  

05:22 < icinga-wm> RECOVERY - MariaDB Slave Lag: s8 on db2085 is OK: OK slave_sql_lag Replication lag: 0.00 seconds                                                                                                                                                  

05:29 < icinga-wm> RECOVERY - MariaDB Slave Lag: s8 on db1099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds

db1099, a recentchanges slave, was indeed lagging: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?orgId=1&panelId=12&fullscreen&from=1529898238254&to=1529905275896&var-dc=eqiad%20prometheus%2Fops

db1071, the master, had no writes: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1071&var-port=9104&from=1529862857226&to=1529906057227

This could possibly have caused a writes incident; it needs to be investigated.

Could this be a consequence of the load balancer misbehaviour (T180918) where a single slave brings down the entire section?

Event Timeline

jcrespo added a parent task: Restricted Task.Jun 25 2018, 6:09 AM

Edits halved, which is consistent with a wikidata outage:

Screenshot_20180625_081919.png (979×2 px, 78 KB)

No long running queries on master or db1099:3318:

Screenshot_20180625_081013.png (921×2 px, 140 KB)

Screenshot_20180625_080008.png (581×1 px, 60 KB)

Revisions dropped to roughly a tenth of the expected rate:

root@dbstore1002[wikidatawiki]> SELECT CONCAT(RIGHT(LEFT(rev_timestamp, 11), 3), '0') AS time, count(*) FROM revision where rev_timestamp like '20180625%' GROUP BY LEFT(rev_timestamp, 11); 
+------+----------+
| time | count(*) |
+------+----------+
| 0000 |     4494 |
| 0010 |     4650 |
| 0020 |     4575 |
| 0030 |     4009 |
| 0040 |     4008 |
| 0050 |     3888 |
| 0100 |     3623 |
| 0110 |     3411 |
| 0120 |     3489 |
| 0130 |     3618 |
| 0140 |     3751 |
| 0150 |     3727 |
| 0200 |     3829 |
| 0210 |     3965 |
| 0220 |     3887 |
| 0230 |     3303 |
| 0240 |     3886 |
| 0250 |     3576 |
| 0300 |     2923 |
| 0310 |     3247 |
| 0320 |     3142 |
| 0330 |     2973 |
| 0340 |     2908 |
| 0350 |     3156 |
| 0400 |     2856 |
| 0410 |     1232 |
| 0420 |      168 |
| 0430 |      168 |
| 0440 |      148 |
| 0450 |      240 |
| 0500 |      260 |
| 0510 |      251 |
| 0520 |      257 |
| 0530 |     2118 |
| 0540 |     2564 |
| 0550 |     2735 |
| 0600 |     2938 |
| 0610 |     3013 |
| 0620 |     2789 |
| 0630 |     2643 |
| 0640 |     1131 |
+------+----------+
41 rows in set (0.06 sec)

51,715 exceptions with:

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBReplicationWaitError from line 426 of /srv/mediawiki/php-1.32.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to db1071
Vvjjkkii renamed this task from Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC to ydaaaaaaaa.Jul 1 2018, 1:02 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from ydaaaaaaaa to Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC.Jul 1 2018, 8:03 PM
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description. (Show Details)
CommunityTechBot raised the priority of this task from Medium to Needs Triage.Jul 5 2018, 7:06 PM

51,715 exceptions with:

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBReplicationWaitError from line 426 of /srv/mediawiki/php-1.32.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to db1071

These exceptions are from maintenance scripts. Prior to December 2015 the policy was for maintenance scripts to just wait forever in the case of replication lag, but since fedfee628c377eeea0453ed82af02b6878bd525b it throws an exception after a hard-coded 60 seconds. According to the commit message this "makes failure more explicit". I suppose it also allows a job runner to reconfigure itself if a slave is dropped.
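
Roughly, the behaviour being described is the following (a Python sketch for illustration only; the real code lives in LBFactory.php and the names here are invented):

import time

class DBReplicationWaitError(Exception):
    """Stand-in for the MediaWiki exception quoted above (sketch only)."""

def wait_for_replicas(get_max_lag, timeout=60, poll_interval=1.0):
    """Wait for replication lag to clear, giving up after `timeout` seconds.

    `get_max_lag` is a hypothetical callable returning the worst replica lag
    in seconds. Under the pre-2015 policy this loop simply never timed out.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_max_lag() <= 0:
            return
        time.sleep(poll_interval)
    raise DBReplicationWaitError("Could not wait for replica DBs to catch up")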

The reason for the prior policy was that, at the time, lag was very common and scripts were not always restartable.

Unfortunately it does not help to isolate the problem since exception spam is now the inevitable and harmless consequence of >60s lag.

I am not too worried about exceptions/error messages; I only pointed them out in case they helped debug the real issues, the ones I mentioned at T198049#4310282 (which at first appearance were caused by a single production host having issues).

The drop may have been caused by the API maxlag parameter. Wikidata:Bots recommends using a maxlag parameter, and some client libraries set maxlag=5 by default. The point of this feature is to make bots pause during replication lag, to prioritise human users and avoid worsening the situation.
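
As a rough illustration, a maxlag-aware client loop looks something like the sketch below (the helper name and retry policy are invented; the maxlag parameter and the "maxlag" error code are standard action API behaviour):

import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_request_with_maxlag(session, params, maxlag=5, max_retries=5):
    """Send an action API request with maxlag set, pausing while the servers
    report replication lag. Illustrative helper, not a real client library."""
    params = dict(params, maxlag=maxlag, format="json")
    for _ in range(max_retries):
        resp = session.post(API, data=params)
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data  # success, or an unrelated error for the caller to handle
        # Lag exceeded maxlag: back off instead of piling more writes onto a
        # lagged section. Retry-After is normally set on maxlag responses.
        time.sleep(int(resp.headers.get("Retry-After", maxlag)))
    raise RuntimeError("replication lag stayed above maxlag, giving up")

During the event the replicas were hundreds of seconds behind, so any bot following this pattern would simply stop editing, which matches the revision counts above.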

The relevant logs from this period have been deleted, so some guesswork is needed. I looked at the edits before the issue, between 3:50 and 4:00, and during the issue, between 4:20 and 4:30. In the earlier period, 7 of the top 10 editors had "bot" in their names, and 2 of the remaining 3 were using QuickStatements, a Tool Labs tool which edits at a high rate via OAuth. These 9 fully automated editors made 2727 edits in the earlier period and only 3 edits during the event.

The last of the top 10, Ghuron, made 240 edits in the earlier period and 48 during the event. The edits were repetitive, and similar in character across the two time periods.

If you exclude those top 10 earlier-period editors from the analysis, the remaining edit rate only dropped from 194 to 115, and the number of unique users dropped from 53 to 38.

Maybe we should add some maxlag metrics to statsd.
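
For example, a counter incremented whenever a request is turned away for maxlag would make events like this visible on a dashboard. A minimal sketch, assuming a plain statsd UDP endpoint; the host and metric name are placeholders, not existing WMF ones:

import socket

def count_maxlag_rejection(host="statsd.example.org", port=8125):
    """Fire-and-forget statsd counter for requests rejected because of maxlag
    (hypothetical metric name, for illustration)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"MediaWiki.api.maxlag_rejected:1|c", (host, port))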

db1071, the master, had no writes

It actually had a factor of 10 fewer writes, not zero writes.

I'm pretty sure there was no outage.

I had a closer look at the exceptions. Most of them came from jobs, with a sizeable minority from appservers. Sampling a few shows they come from DeferredUpdates: for example, LinksUpdate and GlobalUsage write batches of rows in a loop with commitAndWaitForReplication(). This means updates are being skipped, and unlike a job there is no possibility of a retry. This behaviour dates from a change made in 2016 as part of T95501. I'll file a bug for this. Otherwise, I think I'm done with this incident.
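
Schematically, the pattern looks like this (a sketch of the shape described above, not MediaWiki code; the callables are stand-ins):

def run_deferred_update(batches, write_batch, wait_for_replicas):
    """Shape of the LinksUpdate/GlobalUsage pattern: write a batch of rows,
    then wait for replicas before the next one. If the wait raises
    DBReplicationWaitError, the remaining batches are abandoned, and since
    this runs as a deferred update rather than a job, nothing retries it."""
    for batch in batches:
        write_batch(batch)       # stands in for the actual row writes
        wait_for_replicas()      # raises after the hard-coded 60s timeout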