https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/ has been timing out (45mins+) for a while (first instance at 22 Feb 2024, 17:20 BST)
Description
Details
Related Objects
Mentioned In:
- T358620: LoadBalancer::awaitSessionPrimaryPos needs a timeout
- T358484: maintenance/lag.php and maintenance/getLagTimes.php give confusing output when replication is stopped
- T358364: Edits not saved on beta cluster

Mentioned Here:
- T215217: deployment-prep (beta cluster): Code stewardship request
- T358236: Cannot create a new wiki on beta cluster
Event Timeline
T358236 could be the cause of this issue: the addWiki.php script failed in the middle of the process and I had to drop the test2wiki database manually multiple times (the script was failing for multiple reasons, which I fixed one by one). Those things happened around the same time.
The first time I executed the addWiki.php script was at 15:28:51 BST, a little more than an hour before the first timeout. I then dropped test2wiki around 17:30:52 BST, a little after the first occurrence.
Job now disabled as of 2024-02-23Z12:29:35: https://sal.toolforge.org/log/kIjy1Y0BxE1_1c7sK1mK
This job runs the `wmf-beta-update-databases.py` script; I wonder if I can run it manually and see the output. I compared the last successful run with the first failed run:
Last success: https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/74102/consoleFull
First failure: https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/74103/consoleFull
and it looks like the failed run just stopped after wikidatawiki and then never came back to life. Something must have happened around 18:23:41.
I'm going to manually run mwscript update.php --wiki=aawiki --quick --skip-config-validation and check the results.
So I executed the script by hand and it looks like it gets stuck.
pmiazga@deployment-deploy03:~$ mwscript update.php --wiki=aawiki --quick --skip-config-validation
#!/usr/bin/env php
Warning: session_name(): Cannot change session name when headers already sent in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 619
MediaWiki 1.42.0-alpha Updater
Your composer.lock file is up to date with current dependencies!
Going to run database updates for aawiki
Depending on the size of your database this may take a while!
...collations up-to-date.
...have rev_actor field in revision table.
After that it just sits there waiting. After a couple of minutes I killed the script with CTRL+C. The next line should have been:
...watchlist_expiry table already exists.
I left the script running for 15 minutes and it looks like it is slowly progressing:
pmiazga@deployment-deploy03:~$ mwscript update.php --wiki=aawiki --quick --skip-config-validation
#!/usr/bin/env php
Warning: session_name(): Cannot change session name when headers already sent in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 619
MediaWiki 1.42.0-alpha Updater
Your composer.lock file is up to date with current dependencies!
Going to run database updates for aawiki
Depending on the size of your database this may take a while!
...collations up-to-date.
...have rev_actor field in revision table.
...watchlist_expiry table already exists.
...page_restrictions field does not exist in page table, skipping modify field patch.
...index ipb_address_unique already set on ipblocks table.
Previously, though, it processed all the databases in ~12 minutes.
This seems to be causing cascading failures (edits not being saved, user preferences not persisting, etc.), so I am making this UBN.
It is hard to tell what is actually broken. I have checked with sql aawiki and sql --write aawiki and looked at the SHOW PROCESSLIST; output, but nothing stands out.
I have run the lag checker command:
$ mwscript lag.php --wiki=aawiki
deployment-db11    0
deployment-db12    false
deployment-db13    false
What is false? Well, who knows; no lag, I assume?
Another maintenance script gives more entries, all set to 0
$ mwscript getLagTimes.php --wiki=aawiki
172.16.5.150        deployment-db11     0
172.16.3.239        deployment-db12     0
172.16.4.172        deployment-db13     0
172.16.5.150:3306   172.16.5.150:3306   0
172.16.3.239:3306   172.16.3.239:3306   0
172.16.4.172:3306   172.16.4.172:3306   0
172.16.5.150:3306   172.16.5.150:3306   0
172.16.3.239:3306   172.16.3.239:3306   0
172.16.4.172:3306   172.16.4.172:3306   0
172.16.5.150:3306   172.16.5.150:3306   0
172.16.3.239:3306   172.16.3.239:3306   0
172.16.4.172:3306   172.16.4.172:3306   0
172.16.5.150:3306   172.16.5.150:3306   0
172.16.3.239:3306   172.16.3.239:3306   0
172.16.4.172:3306   172.16.4.172:3306   0
That is not conclusive.
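One way to cross-check, assuming shell access to the replicas and root access to mysql over the local socket there, would be to ask each replica about its replication threads directly; a stopped SQL thread would explain both the false from lag.php and the flat zeros above. A minimal sketch:

# Sketch: check replication thread state directly on each replica
# (assumes ssh access and passwordless "sudo mysql" via the unix socket).
for host in deployment-db12 deployment-db13; do
  echo "== $host =="
  ssh "$host" "sudo mysql -e 'SHOW SLAVE STATUS\G'" \
    | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_(IO|SQL)_Error'
done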
Wouldn't it be rad if one could execute a maintenance command and have the debug logging spat directly to the console (that is worth a feature request; bonus if it comes with nice coloring of the severity level and the log bucket, but I digress). The X-Wikimedia-Debug header lets us turn that on for web requests, and looking at our wmf-config/logging.php there is:
if ( getenv( 'MW_DEBUG_LOCAL' ) ) {
	// Route all log messages to a local file
	$wgDebugLogFile = '/tmp/wiki.log';
	$wmgDefaultMonologHandler = 'wgDebugLogFile';
	$wmgEnableLogstash = false;
	$wmgMonologChannels = [];
	$wgDebugDumpSql = true;
}
So you could theoretically run MW_DEBUG_LOCAL=1 mwscript update.php and expect some debugging output in /tmp/wiki.log. Except that mwscript is a shell script which invokes sudo, and somewhere along the line the environment variable is not passed down to the invoked PHP script (I blame sudo). I have thus run bash -x /usr/local/bin/mwscript to retrieve the command line, then copy-pasted it and manually inserted the environment variable:
sudo -u www-data MW_DEBUG_LOCAL=1 php /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=aawiki --quick --skip-config-validation --skip-external-dependencies
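An alternative, assuming the installed sudo supports --preserve-env=VAR and the sudoers policy does not strip the variable again, would be to let sudo carry it across instead of re-typing the command:

# Sketch: let sudo carry MW_DEBUG_LOCAL through to the PHP process
# (env_reset in sudoers may still discard it, in which case the form above is needed).
MW_DEBUG_LOCAL=1 sudo --preserve-env=MW_DEBUG_LOCAL -u www-data \
  php /srv/mediawiki-staging/multiversion/MWScript.php update.php \
  --wiki=aawiki --quick --skip-config-validation --skip-external-dependencies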
I have copied the resulting /tmp/wiki.log to /home/hashar/T358329.log which has at the bottom:
2024-02-26 10:25:38.934047 [5c23a0ea7579b5e77516f6ad] deployment-deploy03 aawiki 1.42.0-alpha rdbms INFO: Wikimedia\Rdbms\LoadBalancer::awaitSessionPrimaryPos: waiting for replica DB deployment-db12 to catch up... {"db_server":"deployment-db12","db_domain":""}
2024-02-26 10:25:38.935444 [5c23a0ea7579b5e77516f6ad] deployment-deploy03 aawiki 1.42.0-alpha rdbms DEBUG: Wikimedia\Rdbms\Replication\MysqlReplicationReporter::getReplicaPos [1.289ms] deployment-db12: SHOW GLOBAL VARIABLES LIKE 'gtid_%' {"db_server":"deployment-db12","db_name":null,"db_user":"wikiadmin","method":"Wikimedia\\Rdbms\\Replication\\MysqlReplicationReporter::getReplicaPos","sql":"SHOW GLOBAL VARIABLES LIKE 'gtid_%'","domain":"","runtime_ms":1.289,"db_log_category":"query"}
2024-02-26 10:25:38.936687 [5c23a0ea7579b5e77516f6ad] deployment-deploy03 aawiki 1.42.0-alpha rdbms DEBUG: Wikimedia\Rdbms\Replication\MysqlReplicationReporter::getReplicaPos [1.083ms] deployment-db12: SHOW SESSION VARIABLES LIKE 'gtid_%' {"db_server":"deployment-db12","db_name":null,"db_user":"wikiadmin","method":"Wikimedia\\Rdbms\\Replication\\MysqlReplicationReporter::getReplicaPos","sql":"SHOW SESSION VARIABLES LIKE 'gtid_%'","domain":"","runtime_ms":1.083,"db_log_category":"query"}
So the script waits for deployment-db12 to catch up and surely that should be reported to the user when that happens.
On deployment-db12 I connected to mysql, ran show slave status \G in a very wide terminal to prevent the message from being truncated, and got:
Last_SQL_Error: Query caused different errors on master and slave.
Error on master: message (format)='Cannot load from %s.%s. The table is probably corrupted' error code=1728 ;
Error on slave: actual message='no error', error code=0. Default database: 'test2wiki'.
Query: 'drop database test2wiki'
So I don't know. Someone/something tried to drop test2wiki and that broke replication? Fixing the replication is outside of my skillset though :-\
@hashar yes, I dropped the database because the addWiki.php script had created a partial schema (it failed in the middle of the run); re-running the script caused it to fail again because some tables already existed.
The drop database test2wiki error makes sense: I dropped it and then recreated it, and this happened 3 times because the script was also broken in other places.
Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 9 [Warning] Aborted connection 9 to db: 'unconnected' user: 'unauthenticated' host: '172.16.2.65' (This connection closed normally without authentication)
Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 5 [Note] Slave I/O thread: connected to master 'repl@deployment-db11.deployment-prep.eqiad1.wikimedia.cloud:3306',replication started in log 'deployment-db11-bin.000088' at position 218279745
Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 6 [ERROR] Slave SQL: Query caused different errors on master and slave. Error on master: message (format)='Cannot load from %s.%s. The table is probably corrupted' error code=1728 ; Error on slave: a>
Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 6 [Warning] Slave: Can't drop database 'test2wiki'; database doesn't exist Error_code: 1008
Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 6 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'deployment-db11-bin.000088' position 55979645
Feb 26 13:06:45 deployment-db13 mysqld[619520]: 2024-02-26 13:06:45 6 [Note] Slave SQL thread exiting, replication stopped in log 'deployment-db11-bin.000088' at position 55979645, master: deployment-db11.deployment-prep.eqiad1.wikimedia.cloud:3306
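For the record, the in-place alternative to rebuilding a replica would have been to skip the offending event on each broken replica and restart the SQL thread. This is only a sketch, and only defensible here because the skipped statement (drop database test2wiki) contains nothing the replicas need; with GTID replication the position may instead need to be advanced via gtid_slave_pos:

# Sketch: run on each broken replica (deployment-db12 / deployment-db13) as root.
sudo mysql <<'SQL'
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skip exactly one event group
START SLAVE;
SHOW SLAVE STATUS\G
SQL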
Mentioned in SAL (#wikimedia-releng) [2024-02-26T14:35:01Z] <TheresNoTime> deployment-prep, starting the creation of deployment-db14 per https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases for T358329
Just to explain to folks who might be following along what's happening: the primary server (db11) will be cloned (via mariabackup --innobackupex) to the new replica; a new instance is being created now for that. While that is happening, replication will be stopped and the primary will remain read-only. I am guessing that this will be a matter of some hours rather than a day. More updates as available.
Mentioned in SAL (#wikimedia-releng) [2024-02-26T14:59:46Z] <TheresNoTime> deployment-prep, root@deployment-db14:~# /opt/wmf-mariadb106/scripts/mysql_install_db --user=mysql --basedir=/opt/wmf-mariadb106 --datadir=/srv/sqldata T358329
Mentioned in SAL (#wikimedia-releng) [2024-02-26T15:05:14Z] <TheresNoTime> deployment-prep, db11, root@BETA[(none)]> set global read_only = true; T358329
Mentioned in SAL (#wikimedia-releng) [2024-02-26T15:09:45Z] <TheresNoTime> deployment-prep root@deployment-db11:~# mariabackup --innobackupex --stream=xbstream /srv/sqldata --user=root --host=127.0.0.1 --slave-info | nc deployment-db14 9210 T358329
Mentioned in SAL (#wikimedia-releng) [2024-02-26T15:10:35Z] <TheresNoTime> deployment-prep prev. command resulted in 2024-02-26 15:06:43 0 [ERROR] InnoDB: Operating system error number 24 in a file operation. T358329
Mentioned in SAL (#wikimedia-releng) [2024-02-26T15:11:39Z] <TheresNoTime> deployment-prep, db11, root@BETA[(none)]> set global read_only = false; while I figure out mariabackup error T358329
After setting ulimit -n 4094, the error is:
root@deployment-db11:~# mariabackup --innobackupex --stream=xbstream /srv/sqldata --user=root --host=127.0.0.1 --slave-info | nc deployment-db14 9210
240226 15:25:13 innobackupex: Starting the backup operation
IMPORTANT: Please check that the backup run completes successfully. At the end of a successful backup run innobackupex prints "completed OK!".
[00] 2024-02-26 15:25:13 Connecting to server host: 127.0.0.1, user: root, password: not set, port: 3306, socket: /run/mysqld/mysqld.sock
[00] 2024-02-26 15:25:13 Using server version 10.6.12-MariaDB-log
--innobackupex based on MariaDB server 10.5.23-MariaDB debian-linux-gnu (x86_64)
[00] 2024-02-26 15:25:13 uses posix_fadvise().
[00] 2024-02-26 15:25:13 cd to /srv/sqldata/
[00] 2024-02-26 15:25:13 open files limit requested 0, set to 4094
[00] 2024-02-26 15:25:13 mariabackup: using the following InnoDB configuration:
[00] 2024-02-26 15:25:13 innodb_data_home_dir =
[00] 2024-02-26 15:25:13 innodb_data_file_path = ibdata1:12M:autoextend
[00] 2024-02-26 15:25:13 innodb_log_group_home_dir = ./
[00] 2024-02-26 15:25:13 InnoDB: Using Linux native AIO
2024-02-26 15:25:13 0 [Note] InnoDB: Number of pools: 1
[00] 2024-02-26 15:25:13 mariabackup: Generating a list of tablespaces
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
2024-02-26 15:25:13 0 [Note] InnoDB: Header page consists of zero bytes in datafile: ./test2wiki/job.ibd, Space ID:0, Flags: 0
[00] FATAL ERROR: 2024-02-26 15:25:13 Failed to validate first page of the file test2wiki/job, error 39
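One way to confirm that the tablespace file itself is damaged (rather than the backup tooling), assuming the file can be examined while the server does not have it open, or on a copy of it, is a sketch like:

# Sketch: innochecksum ships with MariaDB and validates InnoDB page checksums;
# it refuses to run on a file the server currently has open, so use a copy if needed.
sudo ls -l /srv/sqldata/test2wiki/job.ibd   # a zero-length or truncated file is already telling
sudo innochecksum /srv/sqldata/test2wiki/job.ibd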
I would now be clutching at straws to diagnose this, so for the next person who takes a look, I got up to https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases#On_Existing_DB
nb. dropping test2wiki on db11, then setting ulimit -n 40940 and retrying, resolved the too-many-open-files / test2wiki job error. We now hit mbstream: Can't create/write to file './mysql/innodb_index_stats.ibd' (Errcode: 17 "File exists") on db14 (which stopped the mariabackup stream from db11). Stopping for the moment to allow @ArielGlenn to take a look.
In the end @TheresNoTime figured it out: puppet was automatically starting mariadb when we didn't want it running, and mariadb was hence creating the file complained about in the error above. The cloning process looks like it's going OK at the moment.
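For completeness, the receiving end of that stream looks roughly like this (a sketch, assuming port 9210 and the paths used above; netcat flag syntax varies between the traditional and OpenBSD variants):

# Sketch: run on deployment-db14 before re-running the mariabackup | nc pipe from db11.
sudo puppet agent --disable "cloning db14; keep mariadb stopped"
sudo systemctl stop mariadb
sudo rm -rf /srv/sqldata && sudo mkdir /srv/sqldata
# Receive the xbstream and unpack it into the empty datadir:
nc -l -p 9210 | sudo mbstream -x -C /srv/sqldata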
Mentioned in SAL (#wikimedia-releng) [2024-02-26T16:38:17Z] <TheresNoTime> deployment-prep, db11, mariabackup --innobackupex --apply-log --use-memory=10G /srv/sqldata T358329
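Once the prepare step reports "completed OK!", the remaining bring-up on the new replica is roughly the usual mariabackup restore tail (a sketch, assuming the paths above):

# Sketch: on deployment-db14, after --apply-log / --prepare has finished.
sudo chown -R mysql:mysql /srv/sqldata
sudo systemctl start mariadb
# The clone was taken from the primary, so the binlog/GTID position to start
# replicating from was recorded at backup time (the file name can vary by version):
sudo cat /srv/sqldata/xtrabackup_binlog_info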
Stepping away for a bit. Note for whoever picks this up: replication hasn't been started yet (@ArielGlenn is taking a look), and once it's confirmed working, db-labs.php needs updating.
The cloning procedure is done for db14, but we are currently hunting around for the replication password; it is not where the docs ( https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases#Starting_Replication ) say it should be, and apparently never was anywhere in that repo.
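For reference, the step the docs describe boils down to pointing db14 at db11 with the repl credentials and starting the threads. A sketch with placeholder values (the GTID shown is only an example taken from the debug log further down; the real password is whatever ends up in the secret store):

# Sketch: on deployment-db14, as root, once the cloned data is in place.
sudo mysql <<'SQL'
-- Position captured by the backup; replace with the value from xtrabackup_binlog_info.
SET GLOBAL gtid_slave_pos = '0-2886731158-480605821';
CHANGE MASTER TO
  MASTER_HOST='deployment-db11.deployment-prep.eqiad1.wikimedia.cloud',
  MASTER_USER='repl',
  MASTER_PASSWORD='REPLACE_WITH_REPL_PASSWORD',
  MASTER_USE_GTID=slave_pos;
START SLAVE;
SQL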
Hi 👋 I suggest waiting for me; I will take a look soon. Regarding databases: if you're not 100% sure what you're doing, things usually end up worse. Been there, done that.
Okay. So for the record, what has been done: created the new instance db14, cloned the data from the primary db11 following https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Databases, and stopped before enabling replication. Waiting for your input.
Mentioned in SAL (#wikimedia-releng) [2024-02-26T17:29:25Z] <apergos> waiting on Amir's input as per his comment on T358329; db14 remains up but not replicating.
db14 should now be replicating and working just fine. I'll write up more details on the why/what/etc. tomorrow (when I'm actually back from my honeymoon).
Someone should update the labs db config to pick up the new replica.
Change 1006593 had a related patch set uploaded (by Jforrester; author: Jforrester):
[operations/mediawiki-config@master] [BETA CLUSTER] List deployment-db14, new replica
Change 1006594 had a related patch set uploaded (by Jforrester; author: Jforrester):
[operations/mediawiki-config@master] [BETA CLUSTER] Drop deployment-db12 and deployment-db13 replicas
Change 1006593 merged by jenkins-bot:
[operations/mediawiki-config@master] [BETA CLUSTER] List deployment-db14, new replica
Change 1006594 merged by jenkins-bot:
[operations/mediawiki-config@master] [BETA CLUSTER] Drop deployment-db12 and deployment-db13 replicas
I had to reclone it from db11 again; it's usually better to set the db to read-only before doing the cloning (read-only if you're cloning from the master, "stop slave" if you're cloning from another replica): even though the binlog can be replayed, there are statements happening that wouldn't be written to the binlogs. Regardless, that's fixed now.
Regarding the password: it's just a user on the master with that password, so you can simply change its password. I did exactly that; the new password is in my home directory on db11 and only readable by root.
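In other words, something along these lines on the primary (a sketch; the actual host pattern of the repl grant may differ):

# Sketch: on deployment-db11, as root; the real password is kept in root's
# home directory there and must match what the replica uses in CHANGE MASTER TO.
sudo mysql -e "ALTER USER 'repl'@'%' IDENTIFIED BY 'REPLACE_WITH_NEW_PASSWORD';"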
The update is still timing out, but not on the run of update.php; I will check why.
Would you be willing to put that password into /var/lib/git/labs/private/modules/secret/secrets/mysql/repl_password on deployment-puppetmaster04? That is where the docs, such as they are, say it ought to be.
Part of the goal of these docs is that we do not have to poke you or Manuel for beta cluster things, and that anyone with some DB admin familiarity is able to use those instructions to stand up a new replica as needed.
Thanks for all the work!
Yes, I agree, but that path is inside a public git repository (from what I'm seeing), which means it'll be confusing and will potentially be removed again. That's why I didn't put it there. It needs a better solution.
This is where I wish deployment-prep had an owner (see T215217, sigh).
If I had my druthers I'd put it in /srv/private on the active deployment-prep puppetmaster with some suitable file name, but there's not even a directory there. Maybe someone who is involved with beta a lot more regularly can chime in.
I have a feeling I know what's going on. It's trying to run virtual domain updates in the wrong way. I need to dig deeper.
The historical answer for storing actually-secret secrets in a Cloud VPS project is for that project to set up a standalone Puppetmaster (which deployment-prep has) and then to store the secret values as local commits to the /var/lib/git/labs/private checkout. The timer task that keeps this checkout up to date with the upstream labs/private.git repo rebases the local patches on the upstream HEAD.
The repo on deployment-puppetmaster04.deployment-prep.eqiad.wmflabs currently has a number of these patches:
$ cd /var/lib/git/labs/private
$ git status
On branch master
Your branch is ahead of 'origin/master' by 11 commits.
  (use "git push" to publish your local commits)
nothing to commit, working tree clean
$ git log --pretty=oneline origin/master..HEAD
bf3392927b7b1b37267b395f60352c59926faa79 (HEAD -> master, tag: snapshot-202402261338) [ACTUALLY PRIVATE] acme_chief: Add openstack designate password
0201b86f2468d593d66fdcf2d16326e2ccbe0430 [local] generate SSH CA keys
9c12188cc979c1d3ccbee0976d4b211c239f08dd [LOCAL] acme_chief account secrets, and keyholder keys
a39de7e1d4ca497930bfa49759a062d36a03a0d6 Add secrets for kafka logging cluster
6131e23e792e85e2006e6197026133fe81f49d0a Recommendation API passwords
ed696ed2dfa41c0635fc266aff7ced007b5ffd28 Add DKIM keys
4a2b615fc537c7f7a8e15032db580dfece37e2df Add authdns-certcentral key
741b86109f80a508bee41067f9e382a635d47f84 Follow-up I8ea442cc: Add dumpsdeploy passphrase
db0679044f98547c972858138f071ac087dda3ce LOCAL HACK: Add keyholder "secret" keys
1c7ed610a939d697b9ee75ef2f0a543eb6d8c4ad [LOCAL HACK] Add SSL certificate for beta
9143bfee079ed3559a1b5b32704b580ee27cc66e [LOCAL] Keys for mwdeploy user
This solution is certainly a hack, but it is a hack that has worked for us (for some values of "worked") for the last 10 years or so.
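Concretely, stashing the replication password that way would look something like this on the deployment-prep puppetmaster (paths taken from this thread; the commit message convention follows the log above):

# Sketch: on deployment-puppetmaster04, as root.
cd /var/lib/git/labs/private
mkdir -p modules/secret/secrets/mysql
echo 'REPLACE_WITH_REPL_PASSWORD' > modules/secret/secrets/mysql/repl_password
git add modules/secret/secrets/mysql/repl_password
git commit -m '[LOCAL] Add beta cluster mysql replication password'
# The sync timer rebases this local commit on top of upstream labs/private HEAD.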
I ran the update with debug mode on. It required a bit of magic: I ran eval.php with the -d 3 argument.
And then:
> require_once __DIR__ . '/update.php';
#!/usr/bin/env php
> $maint = new UpdateMediaWiki();
> $maint->loadParamsAndArgs( 'update', ['skip-config-validation' => true ] );
> $maint->setup();
> $maint->execute();
It gets stuck in:
...urlshortcodes table already exists.
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getAnyOpenConnection: found 'round' connection to #0.
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getAnyOpenConnection: found 'round' connection to #0.
[debug] [rdbms] Wikimedia\Rdbms\Replication\MysqlReplicationReporter::getPrimaryPos [1.858ms] deployment-db11: SHOW GLOBAL VARIABLES LIKE 'gtid_%'
[debug] [rdbms] Wikimedia\Rdbms\Replication\MysqlReplicationReporter::getPrimaryPos [2.006ms] deployment-db11: SHOW SESSION VARIABLES LIKE 'gtid_%'
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getAnyOpenConnection: found 'round' connection to #0.
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getAnyOpenConnection: found 'round' connection to #0.
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::awaitSessionPrimaryPos: replica DB deployment-db14 known to be caught up (pos >= 0-2886731158-480605821).
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 1/
[info] [rdbms] Wikimedia\Rdbms\LoadBalancer::awaitSessionPrimaryPos: waiting for replica DB 172.16.4.172:3306 to catch up...
It's really obvious what's going on. Give me a bit to fix this.
Change 1006990 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/mediawiki-config@master] beta: Remove more mentions of the old replicas
Change 1006990 merged by jenkins-bot:
[operations/mediawiki-config@master] beta: Remove more mentions of the old replicas
After pulling the patch and running update.php, it now runs fast:
Purging caches...done. Done in 1.7 s.
I'm going to re-enable the job now. LoadBalancer::awaitSessionPrimaryPos needs a timeout; feel free to create a follow-up and assign it to me and I'll take a look (but not now). The root cause is that if replication to any replica breaks, awaitSessionPrimaryPos waits until the whole script essentially times out; it should probably wait for five seconds and then move on. In other words, the whole reason the update got slow was that the drop database broke replication.
db12 and db13 should simply be deleted. Having two replicas for beta is overkill and a waste of resources.
Mentioned in SAL (#wikimedia-releng) [2024-02-27T20:32:21Z] <James_F> Disabling deployment-db12 and deployment-db13 as follow-up to T358329; no longer used
Change 1010269 had a related patch set uploaded (by Jforrester; author: Jforrester):
[operations/mediawiki-config@master] [BETA CLUSTER] Disable MobileFrontend for Wikifunctions
Change #1010269 merged by jenkins-bot:
[operations/mediawiki-config@master] [BETA CLUSTER] Disable MobileFrontend for Wikifunctions
Mentioned in SAL (#wikimedia-releng) [2024-08-09T13:33:25Z] <zabe> delete deployment-db12 and deployment-db13 # T358329