Automate the pre/post switchover tasks related to databases
Closed, Resolved · Public

Description

As well documented by Manuel, we should automate all the steps required to prepare the databases for the switchover a few days before it, and to clean up the switchover-specific settings after it is completed.

Opening this task for tracking purposes with the datacenter switchover tag.

Event Timeline

Marostegui triaged this task as Medium priority. Jul 30 2024, 8:48 AM
Marostegui moved this task from Triage to Ready on the DBA board.

Change #1058225 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] mysql_legacy: instance improvements

https://gerrit.wikimedia.org/r/1058225

Change #1058225 merged by jenkins-bot:

[operations/software/spicerack@master] mysql_legacy: instance improvements

https://gerrit.wikimedia.org/r/1058225

Change #1059052 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.switchdc.databases.preparation: new cookbooks

https://gerrit.wikimedia.org/r/1059052

Change #1059053 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] cookbooks: add config for sre.switchdc.databases

https://gerrit.wikimedia.org/r/1059053

Change #1059053 merged by Volans:

[operations/puppet@production] cookbooks: add config for sre.switchdc.databases

https://gerrit.wikimedia.org/r/1059053

Status update:
Some additions to spicerack were made in this patch. I would actually be tempted to rename Instance.run_vertical_query() to Instance.run_query(), replacing the existing one, as it's the only way to sensibly parse some output (with the caveats explained in the docstring).
With the puppet patch saving the replication credentials to the cumin hosts in a way accessible by the cookbooks, it will be possible to start testing (with test-cookbook, passing the DRY-RUN mode option) the two new cookbooks introduced here for the pre and post steps required for the datacenter switch; they are a good draft but still WIP.
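
As context for why the vertical format is the one that parses sensibly, here is a minimal sketch of what parsing mysql --vertical (\G) output involves; the function name and the exact row-separator regex are illustrative, not the actual spicerack implementation:

    import re
    from typing import Dict, List

    # MariaDB separates vertical rows with lines like:
    # *************************** 1. row ***************************
    ROW_SEPARATOR = re.compile(r"^\*+ \d+\. row \*+$")

    def parse_vertical_output(output: str) -> List[Dict[str, str]]:
        """Parse vertical query output into one dict per row."""
        rows: List[Dict[str, str]] = []
        current: Dict[str, str] = {}
        for line in output.splitlines():
            if ROW_SEPARATOR.match(line.strip()):
                if current:
                    rows.append(current)
                current = {}
            elif ":" in line:
                key, _, value = line.partition(":")
                current[key.strip()] = value.strip()
        if current:
            rows.append(current)
        return rows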

Final update before vacations :)

I've updated the above patch and performed a DRY-RUN test of both cookbooks, forcing them to run on only one section (x1):

  • test-cookbook -c 1059052 --dry-run sre.switchdc.databases.prepare -t AAA eqiad codfw
  • test-cookbook -c 1059052 --dry-run sre.switchdc.databases.finalize -t AAA codfw eqiad (with the DCs inverted, as it would otherwise not pass the initial validation)

@Marostegui you can inspect the logs of the runs on cumin2002:/home/volans/cookbooks_testing/logs/sre/switchdc/databases

You will not find the CHANGE MASTER query in the logs, due to the hack to avoid leaking the replication password to the logs (to be properly fixed when I'm back). Here's the query that was actually run:

Executing commands ['/usr/local/bin/mysql --socket /run/mysqld/mysqld.sock --batch --execute "CHANGE MASTER TO master_host=\'db2196.codfw.wmnet\',\nmaster_port=3306,\nmaster_ssl=1,\nmaster_log_file=\'db2196-bin.002532\',\nmaster_log_pos=347987607,\nmaster_user=\'repl2024\',\nmaster_password=\'REDACTED\'"'] on 1 hosts: db1220.eqiad.wmnet
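
For reference, the kind of log redaction that could replace the hack might look like this minimal sketch (the helper is hypothetical, not the planned fix):

    import re

    def redact_master_password(query: str) -> str:
        # Mask the replication password before the query ever reaches a logger.
        return re.sub(
            r"(master_password\s*=\s*)'[^']*'",
            r"\1'REDACTED'",
            query,
            flags=re.IGNORECASE,
        )

    # redact_master_password("CHANGE MASTER TO ... master_password='secret'")
    # -> "CHANGE MASTER TO ... master_password='REDACTED'"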

Below are the summaries (there will be one per section) that will be saved to Phabricator (the formatting can be fine-tuned ofc, probably shortening the FQDNs to hostnames and making various bits monospace, TBD :D).

sre.switchdc.databases.prepare
  • x1 (PASS)
    • Validated replication topology for section x1 between MASTER_FROM db1220.eqiad.wmnet and MASTER_TO db2196.codfw.wmnet
    • MASTER_TO db2196.codfw.wmnet STOP SLAVE.
    • MASTER_TO db2196.codfw.wmnet MASTER_USE_GTID=no.
    • MASTER_TO db2196.codfw.wmnet START SLAVE.
    • MASTER_TO db2196.codfw.wmnet Ignoring wrong SLAVE STATUS in DRY-RUN mode Using_Gtid=Slave_Pos, expected no instead.
    • Disabled GTID on MASTER_TO db2196.codfw.wmnet
    • MASTER_TO db2196.codfw.wmnet STOP SLAVE.
    • MASTER_TO db2196.codfw.wmnet stopped pt-heartbeat.
    • MASTER_TO db2196.codfw.wmnet Ignoring MASTER STATUS is not stable in DRY-RUN
    • MASTER_FROM db1220.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2196.codfw.wmnet', binlog='db2196-bin.002533', position=394662451, port=3306) and user repl2024
    • MASTER_FROM db1220.eqiad.wmnet START SLAVE
    • MASTER_FROM db1220.eqiad.wmnet skipping replication from MASTER_TO db2196.codfw.wmnet verification
    • MASTER_TO db2196.codfw.wmnet started pt-heartbeat.
    • MASTER_TO db2196.codfw.wmnet START SLAVE.
    • MASTER_TO db2196.codfw.wmnet replication from MASTER_FROM db1220.eqiad.wmnet verified
sre.switchdc.databases.finalize
  • x1 (PASS)
    • Validated replication topology for section x1 between MASTER_TO db1220.eqiad.wmnet and MASTER_FROM db2196.codfw.wmnet
    • MASTER_TO db1220.eqiad.wmnet has no replication set, skipping.
    • MASTER_TO db2196.codfw.wmnet heartbeat server IDs to delete are: []
    • MASTER_FROM db2196.codfw.wmnet STOP SLAVE.
    • MASTER_FROM db2196.codfw.wmnet MASTER_USE_GTID=slave_pos.
    • MASTER_FROM db2196.codfw.wmnet START SLAVE.
    • Ignoring failed check for GTID change in DRY-RUN mode.
    • Enabled GTID on MASTER_FROM db2196.codfw.wmnet

The procedure did not exercise 100% of the code (for example, there were no rows to delete in the heartbeat table), but I guess at some point we could do a live test of it, or a test in a controlled environment.
@Marostegui if you could look at the code and the logs to validate that the procedure is sound and does what you want before your sabbatical, that would be great!

I haven't checked the code, but the order of the steps in the prepare summary above looks good to me for preparing before the switch.

I need to check the code in more detail to understand this finalize step, because there are some things that I don't get.
If this is the cookbook to run once the DC switchover has finished (so a day after it, as the code says), the order doesn't seem correct.

If we are switching from eqiad to codfw, that means that after the switch codfw is primary and eqiad is secondary.
After the switch, db1220 would have replication enabled, and the only things needed would be (sketched in code after this list):

  • db2196: remove circular replication, so stop slave; reset slave all. No need to enable GTID there; primary masters do not replicate from anyone.
  • db1220 (now a normal slave): enable GTID (it should have replication enabled already, coming from the prepare step).
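
A sketch of that proposed order as cookbook-style pseudocode (run_query() is a stand-in for the real instance API; the hosts are the ones from this section):

    def finalize_section(master_to, master_from):
        # master_to: db2196 (new primary); master_from: db1220 (now a replica).
        # 1. New primary: drop the circular replication leg entirely.
        master_to.run_query("STOP SLAVE")
        master_to.run_query("RESET SLAVE ALL")
        # No GTID change here: primary masters do not replicate from anyone.

        # 2. Old primary, now a normal replica (replication set up by prepare):
        master_from.run_query("STOP SLAVE")
        master_from.run_query("CHANGE MASTER TO master_use_gtid=slave_pos")
        master_from.run_query("START SLAVE")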

After testing the cookbooks in T372893:
@Marostegui spotted a SQL syntax issue which I fixed, but it somehow broke CI. I've also spotted a minor case-sensitivity issue on MasterUseGTID.

cookbooks.sre.switchdc.databases for the switch from test-s4 to test-s4 started by arnaudb@cumin1002 completed:

  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • Failed to enable GTID on db2230.codfw.wmnet, current value: Slave_Pos

it appears that the slight issue I'd spotted on class MasterUseGTID(Enum): [...]NO = "no" applies to that value as well: the server reports Using_Gtid capitalized (Slave_Pos), while the enum stores the lowercase slave_pos, so the comparison fails.
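
A minimal sketch of a case-insensitive comparison that would sidestep this (the NO member is from the snippet above; the SLAVE_POS member and the helper are assumptions):

    from enum import Enum

    class MasterUseGTID(Enum):
        SLAVE_POS = "slave_pos"  # assumed from MASTER_USE_GTID=slave_pos above
        NO = "no"                # from the quoted snippet; other members elided

    def gtid_matches(expected: MasterUseGTID, reported: str) -> bool:
        # SHOW SLAVE STATUS reports e.g. "Slave_Pos" / "No": compare case-insensitively.
        return reported.lower() == expected.value.lower()

    assert gtid_matches(MasterUseGTID.SLAVE_POS, "Slave_Pos")
    assert gtid_matches(MasterUseGTID.NO, "No")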

@Volans: Testing platform is still up and running so we can try and tie the remaining few loose ends, lmk if and how I can help!

As I mentioned in my previous message, the finalize one is run inverted:

test-cookbook -c 1059052 --dry-run sre.switchdc.databases.finalize -t AAA codfw eqiad (with the DCs inverted, as it would otherwise not pass the initial validation)

I executed:

test-cookbook -c 1059052 --dry-run sre.switchdc.databases.prepare -t T371351 eqiad codfw

and got:

DRY-RUN: Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet

later, I wanted to test this erroring, so I did:

test-cookbook -c 1059052 --dry-run sre.switchdc.databases.prepare -t T371351 codfw eqiad

(note the wrong dc direction)

But it validated correctly:

DRY-RUN: Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet

Shouldn't it fail? I asked to switch over in the wrong direction, and it should fail and warn rather than silently autocorrect. Otherwise, if for some reason replication were flowing in the wrong direction, it would just add the other link, rather than warn and let the operator either continue, fix or abort.

I was able to make it fail when setting up circular replication, as it did:

DRY-RUN: Failed to run cookbooks.sre.switchdc.databases.prepare.PrepareSection.validate: Expected all replicas of MASTER_TO db2230.codfw.wmnet to be in the same datacenter, got db1125.eqiad.wmnet instead

I think this is ok, although I would slightly alter the error message and/or add a dedicated check to not touch the topology if circular replication was already set up.
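
Such a dedicated check could be as small as this sketch (replication_status() is a hypothetical accessor returning None when no replication is configured):

    def check_not_already_circular(master_from, master_to):
        # Refuse to touch the topology if replication already flows both ways.
        if (master_from.replication_status() is not None
                and master_to.replication_status() is not None):
            raise RuntimeError(
                f"Circular replication already set up between {master_from} "
                f"and {master_to}: refusing to touch the topology.")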

This is not a blocker, but noting everything weird I see. I will keep testing the rest.

cookbooks.sre.switchdc.databases for the switch from eqiad to codfw started by jynus@cumin1002

cookbooks.sre.switchdc.databases for the switch from eqiad to codfw started by jynus@cumin1002 completed:

  • test-s4 (PASS)
    • Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet
    • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
    • MASTER_TO db2230.codfw.wmnet MASTER_USE_GTID=no.
    • MASTER_TO db2230.codfw.wmnet START SLAVE.
    • Disabled GTID on MASTER_TO db2230.codfw.wmnet

cookbooks.sre.switchdc.databases for the switch from eqiad to codfw started by jynus@cumin1002

cookbooks.sre.switchdc.databases for the switch from eqiad to codfw started by jynus@cumin1002 completed:

  • test-s4 (PASS)
    • Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet
    • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
    • MASTER_TO db2230.codfw.wmnet MASTER_USE_GTID=no.
    • MASTER_TO db2230.codfw.wmnet START SLAVE.
    • Disabled GTID on MASTER_TO db2230.codfw.wmnet

I am going to create a dedicated task for production testing, to also avoid noise here and on IRC: T374972

The other thing I saw after T371351#10153483 is that on the next step, if I run the disabling of GTID twice, there is no error or warning.

In this case, I think it is ok for this not to be a fatal error, as it would be fine to disable GTID and even leave it disabled for some time (it is not a huge problem). But I would like to see at least a warning that it was already disabled in the first place (fine if a previous execution aborted, not ok if it had been disabled for a long time). I think this should be a check done beforehand (even if not an error).
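
That beforehand check could look like this sketch (slave_status() is a hypothetical accessor; warning instead of failing keeps the re-run non-fatal):

    import logging

    logger = logging.getLogger(__name__)

    def disable_gtid(instance):
        status = instance.slave_status()  # hypothetical: dict of SHOW SLAVE STATUS
        if status.get("Using_Gtid", "").lower() == "no":
            logger.warning(
                "GTID already disabled on %s: either a previous run aborted, "
                "or it has been left disabled for some time.", instance)
        instance.run_query("STOP SLAVE")
        instance.run_query("CHANGE MASTER TO master_use_gtid=no")
        instance.run_query("START SLAVE")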

@jcrespo the validation passing with the DCs inverted is expected: due to the patch to make it run on test-s4, we bypassed the input DCs, so dc_from and dc_to are not really used, as we have forced master_from and master_to to the specific hosts.

FYI, these are the validation checks performed (sketched in code after this list):

  • MASTER_FROM is RW
  • MASTER_TO is RO
  • the replication master of MASTER_TO is MASTER_FROM
  • all the hosts replicating from MASTER_TO are in the same DC as MASTER_TO
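
In cookbook-style pseudocode (the accessors are assumptions; the real code lives in sre.switchdc.databases.prepare):

    def validate(master_from, master_to, dc_to):
        assert not master_from.is_read_only(), "MASTER_FROM must be RW"
        assert master_to.is_read_only(), "MASTER_TO must be RO"
        assert master_to.replication_source() == master_from.fqdn, (
            "the replication master of MASTER_TO must be MASTER_FROM")
        for replica in master_to.replicas():
            assert replica.datacenter == dc_to, (
                f"all replicas of MASTER_TO must be in {dc_to}, got {replica}")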

Is there anything else I could add as a validation step for the prepare cookbook?

Thank you, I hadn't thought of that, it makes sense.

Is there anything else I could add as a validation step for the prepare cookbook?

See T371351#10153616. Although please note, I am using this ticket as my step-by-step notes, and given the time constraints not everything may be necessary or a big blocker. I will summarize at the end, once I have gone over multiple iterations/failure scenarios.

One additional comment about the process, not necessarily the script: the post-maintenance script is confusing, as it will be run post-maintenance, but the parameters will be in the direction of the maintenance (while replication will still be flowing in the previous direction).

While the topology check will take care of that, I wonder if there could be a way to prevent the confusion.

The idea was that this is all part of the A->B datacenter switchover, and the finalize steps are still part of that process, hence to be called with the same A->B flow. If you prefer otherwise, we can invert it and/or improve the help messages.

Another, totally alternative, approach could be to check which one is the primary DC for MediaWiki and get the DC names from there automatically, without the user having to pass them.
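
That alternative could look like this sketch (get_master_datacenter() is an assumed accessor shape, not necessarily the real spicerack API):

    def get_datacenters(spicerack, finalizing: bool):
        # Derive (dc_from, dc_to) from MediaWiki's primary DC instead of user input.
        primary = spicerack.mediawiki().get_master_datacenter()  # assumed accessor
        secondary = "codfw" if primary == "eqiad" else "eqiad"
        # prepare runs before the switch (primary is still dc_from);
        # finalize runs after it (primary is already dc_to).
        return (secondary, primary) if finalizing else (primary, secondary)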

I found an actual bug: this is failing:

Failed to run cookbooks.sre.switchdc.databases.finalize.FinalizeSection.clean_heartbeat: Failed to run 'DELETE FROM heartbeat WHERE server_id=180360463' on db1125.eqiad.wmnet

Either:

  • it reports failure but is not actually failing (e.g. it deletes 0 rows) [confirmed it is not the case]
  • it is not running it on the right db (heartbeat), or is not using heartbeat.heartbeat as a fully qualified name
  • it is using a user without superuser permissions [confirmed it is not the case, as read_only is 0 here, but it could be another permission issue]

My guess would be #2
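
If guess #2 is right, the fix is a one-liner: always use the fully qualified name (a sketch, not the actual cookbook code):

    def clean_heartbeat(instance, server_id: int):
        # Qualify the schema explicitly instead of relying on a default
        # database being selected on the connection.
        instance.run_query(
            f"DELETE FROM heartbeat.heartbeat WHERE server_id = {int(server_id)}")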

Re:

The idea was that this is all part of the A->B datacenter switchover and the finalize steps are still part of that process

I think I would leave it as is, but I would add somewhere noticeable "TO_DC IS EXPECTED TO BE THE PRIMARY" or something like that.
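
One noticeable place for it would be the argument help itself; an illustrative argparse snippet (not the real parser):

    import argparse

    parser = argparse.ArgumentParser(
        prog="sre.switchdc.databases.finalize",
        epilog=("NOTE: DC_TO IS EXPECTED TO BE THE CURRENT PRIMARY; replication "
                "is still flowing in the pre-switchover direction when this runs."))
    parser.add_argument("dc_from", help="datacenter the switchover moved away from")
    parser.add_argument("dc_to", help="datacenter the switchover moved to (now primary)")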

Change #1059052 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases: new cookbooks

https://gerrit.wikimedia.org/r/1059052

Change #1073750 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.switchdc.databases.prepare: reduce wait time

https://gerrit.wikimedia.org/r/1073750

Change #1073750 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases.prepare: reduce wait time

https://gerrit.wikimedia.org/r/1073750

Change #1073757 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.switchdc.databases: fix Phabricator message

https://gerrit.wikimedia.org/r/1073757

Change #1073757 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases: fix Phabricator message

https://gerrit.wikimedia.org/r/1073757

Change #1074127 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.switchdc.databases.prepare: add check

https://gerrit.wikimedia.org/r/1074127

Change #1074128 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.switchdc.databases: update Phabricator more

https://gerrit.wikimedia.org/r/1074128

There is a minor usability issue (which could be confusing under pressure): I get this text:

==> Run on section test-s4 was manually aborted. Continue with the remaining sections or abort completely?

However, if it is the last or the only section, it doesn't make much sense, as both choices would basically do the same thing. Maybe just change the wording when there are no sections left, even if you want to keep the pause?

What do you think?
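
The wording change could be as simple as this sketch (a hypothetical helper; the real prompt lives in the cookbook):

    def abort_prompt(section: str, remaining: int) -> str:
        if remaining:
            return (f"Run on section {section} was manually aborted. Continue "
                    f"with the remaining {remaining} sections or abort completely?")
        return (f"Run on section {section} was manually aborted. This was the "
                f"last section; nothing is left to run.")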

I cannot reproduce the bad scenario at the moment, because this is the binlog from, e.g., x1:

# at 1199094
#241017  9:53:45 server id 180360966  end_log_pos 1199136 CRC32 0x45c248cc      GTID 180360966-180360966-115215619 trans
/*!100001 SET @@session.gtid_seq_no=115215619*//*!*/;
START TRANSACTION
/*!*/;
# at 1199136
# at 1199400
#241017  9:53:45 server id 180360966  end_log_pos 1199400 CRC32 0xd37d7cf7      Annotate_rows:
#Q> REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2024-10-17T09:53:45.001510', '180360966', 'db2196-bin.003776', '1199094', NULL, NULL, 'x1', 'codfw')
#241017  9:53:45 server id 180360966  end_log_pos 1199474 CRC32 0xe050a53c      Table_map: `heartbeat`.`heartbeat` mapped to number 23
# at 1199474
#241017  9:53:45 server id 180360966  end_log_pos 1199642 CRC32 0x14ae218b      Update_rows: table id 23 flags: STMT_END_F
### UPDATE `heartbeat`.`heartbeat`
### WHERE
###   @1='2024-10-17T09:53:44.001370' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003775' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1048427332 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### SET
###   @1='2024-10-17T09:53:45.001510' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003776' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1199094 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
# Number of rows: 1
# at 1199642
#241017  9:53:45 server id 180360966  end_log_pos 1199673 CRC32 0xc6f0d777      Xid = 3891994318
COMMIT/*!*/;
# at 1199673

And this is the binlog from test-s4, even after changing the format, doing flush tables and restarting pt-heartbeat-wikimedia:

# at 417291
#241017 10:35:31 server id 171978825  end_log_pos 417333 CRC32 0x3f3b8917       GTID 171978825-171978825-13743836 trans
/*M!100001 SET @@session.gtid_seq_no=13743836*//*!*/;
START TRANSACTION
/*!*/;
# at 417333
# at 417412
#241017 10:35:31 server id 171978825  end_log_pos 417412 CRC32 0x025cddf4       Annotate_rows:
#Q> INSERT INTO test (s) VALUES ('ca977ebccb5f6ea211903ebf')
#241017 10:35:31 server id 171978825  end_log_pos 417464 CRC32 0x32152a67       Table_map: `test`.`test` mapped to number 428
# at 417464
#241017 10:35:31 server id 171978825  end_log_pos 417532 CRC32 0x92c81f62       Write_rows: table id 428 flags: STMT_END_F
### INSERT INTO `test`.`test`
### SET
###   @1=877001 /* INT meta=0 nullable=0 is_null=0 */
###   @2='ca977ebccb5f6ea211903ebf' /* VARSTRING(1000) meta=1000 nullable=1 is_null=0 */
###   @3=1729161331 /* TIMESTAMP(0) meta=0 nullable=0 is_null=0 */
# Number of rows: 1
# at 417532
#241017 10:35:31 server id 171978825  end_log_pos 417563 CRC32 0xa5142826       Xid = 22234452
COMMIT/*!*/;
# at 417563
#241017 10:35:32 server id 171978825  end_log_pos 417605 CRC32 0x2cb0630f       GTID 171978825-171978825-13743837 trans
/*M!100001 SET @@session.gtid_seq_no=13743837*//*!*/;
START TRANSACTION
/*!*/;
# at 417605
#241017 10:35:32 server id 171978825  end_log_pos 417922 CRC32 0x93269457       Query   thread_id=1737128       exec_time=0     error_code=0    xid=0
SET TIMESTAMP=1729161332/*!*/;
REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2024-10-17T10:35:32.000810', '171978825', 'db1125-bin.000026', '417563', NULL, NULL, 'test-s4', 'eqiad')
/*!*/;
# at 417922
#241017 10:35:32 server id 171978825  end_log_pos 417953 CRC32 0xfaf53238       Xid = 22234456
COMMIT/*!*/;
# at 417953

You can see that we are in ROW format (because of the test traffic I created), but heartbeat is using STATEMENT, not ROW. That is good, but not what I need for testing the issue. I will restart the server, but this may be the source of the issue in T375144 (?)

Indeed it is. Let me use blame to see when this happened. This is good news, because we finally know WHY this happened only recently; the sad part is that we may have to partially discard this patch.

I've updated the last CR of the chain with this wording change; I've also added a "progress" indicator in a log line and in the ask_confirmation message.

Nice catch! I don't mind discarding the patch if we solved the problem :)

We should keep the check part. What I am not sure about is whether to keep the logic but check the configuration instead, or whether that is too much. I accept suggestions if it is best to just fix the root issue and simplify the script, as we cannot foresee future issues.

Actually, I answered my own question while testing the script. My original proposed solution doesn't work, because it solves the problem for the "from" server but moves it to the secondary, through replication. The only fix when this issue happens is to do it without sql_log_bin on all hosts of one side, which I did manually during the incident, but that is not a good solution unattended and may not be doable reliably. We need to drop the insertion and monitor that heartbeat is always using STATEMENT.
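
The monitoring part could start as simple as this sketch; it checks the global binlog_format as a proxy, though the real condition is about the format pt-heartbeat's own session ends up using (run_query() returning a list of dicts is an assumption):

    def check_heartbeat_binlog_format(instance):
        rows = instance.run_query("SELECT @@GLOBAL.binlog_format AS fmt")
        fmt = rows[0]["fmt"]
        if fmt != "STATEMENT":
            # Heartbeat REPLACEs logged in ROW format are what resurfaced T375144.
            raise RuntimeError(f"binlog_format is {fmt}, expected STATEMENT")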

Change #1074127 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases.prepare: add check

https://gerrit.wikimedia.org/r/1074127

Change #1074128 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases: update Phabricator more

https://gerrit.wikimedia.org/r/1074128

As soon as the other patches are merged, this is done for me IMHO. Core section ordering can be discussed afterwards.

All pending patches have been tested and merged. Resolving.

@Volans can we document these steps in the DC switchover page on Wikitech? Thanks!

@Marostegui I've added some notes to those two pages and removed one paragraph that I think was obsolete due to active/active MediaWiki. I think, though, that some of the steps listed there might be outdated.

Let me know if you want to go into more detail there, or whether the help message of the cookbook is enough (it will also keep itself up to date in the future).

Diffs:
https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&diff=2242496&oldid=2233628
https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2FSwitch_Datacenter&diff=2242499&oldid=1932299

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad was aborted for section test-s4:
test-s4 (FAIL)

  • Expected all replicas of MASTER_TO db1125.eqiad.wmnet to be in the same datacenter, got db2230.codfw.wmnet instead
  • Execution for this section was manually aborted

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (FAIL)

  • MASTER_FROM db2230.codfw.wmnet should be read only
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
  • MASTER_TO db1125.eqiad.wmnet has no replication set.
  • MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db2230.codfw.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db2230.codfw.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db2230.codfw.wmnet and MASTER_FROM db1125.eqiad.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet RESET SLAVE ALL.
  • MASTER_TO db2230.codfw.wmnet has no replication set.
  • MASTER_FROM db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_FROM db1125.eqiad.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db1125.eqiad.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.