Switchover s7 master db1136 -> db1181
Closed, Resolved (Public)

Description

When: Tuesday 28th June at 06:00 AM UTC

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s7.dblist

Checklist:

NEW primary: db1181
OLD primary: db1136

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1136.eqiad.wmnet h=db1181.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s7 T311033" 'A:db-section-s7'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1181 set-weight 0
sudo dbctl config commit -m "Set db1181 with weight 0 T311033"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1136 db1181
  • Disable puppet on both nodes
sudo cumin 'db1136* or db1181*' 'disable-puppet "primary switchover T311033"'
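
The prep commands above change between switchovers only in the host names, the section and the task ID. A minimal sketch of a dry-run wrapper follows. The names `run`, `prep_steps` and `DRY_RUN` are hypothetical and not part of the Wikimedia tooling; the sketch only prints the checklist commands so they can be reviewed before being run by hand.

```shell
# Hypothetical dry-run helper: the variable and function names are illustrative.
OLD=db1136
NEW=db1181
SECTION=s7
TASK=T311033
DRY_RUN="${DRY_RUN:-1}"   # default to printing only

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Quoting is flattened in the printed form; it is for review, not copy-paste.
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

prep_steps() {
  # Same commands and order as the "Failover prep" checklist.
  run sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover $SECTION $TASK" "A:db-section-$SECTION"
  run sudo dbctl instance "$NEW" set-weight 0
  run sudo dbctl config commit -m "Set $NEW with weight 0 $TASK"
  run sudo db-switchover --timeout=25 --only-slave-move "$OLD" "$NEW"
  run sudo cumin "$OLD* or $NEW*" "disable-puppet \"primary switchover $TASK\""
}

prep_steps
```

Printing by default is deliberate: every step in the checklist changes production state, so the sketch is meant as a review aid, not as automation.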

Failover:

  • Log the failover:
!log Starting s7 eqiad failover from db1136 to db1181 - T311033
  • Set section read-only:
sudo dbctl --scope eqiad section s7 ro "Maintenance until 06:15 UTC - T311033"
sudo dbctl config commit -m "Set s7 eqiad as read-only for maintenance - T311033"
  • Check s7 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1136 db1181
echo "===== db1136 (OLD)"; sudo db-mysql db1136 -e 'show slave status\G'
echo "===== db1181 (NEW)"; sudo db-mysql db1181 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s7 set-master db1181
sudo dbctl --scope eqiad section s7 rw
sudo dbctl config commit -m "Promote db1181 to s7 primary and set section read-write T311033"
  • Restart puppet on both hosts:
sudo cumin 'db1136* or db1181*' 'run-puppet-agent -e "primary switchover T311033"'
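
The two `show slave status\G` commands above are read by eye. A small sketch of a filter that turns that output into a pass/fail result follows. `replication_running` is a hypothetical helper name. The sketch assumes that, after the switch, db1136 (OLD) should report both replication threads as running and db1181 (NEW) should report no replication at all.

```shell
# Reads `show slave status\G` text on stdin.
# Exit status 0 only if both Slave_IO_Running and Slave_SQL_Running are "Yes";
# empty input (no replication configured) gives exit status 1.
replication_running() {
  awk -F': *' '
    /Slave_IO_Running:/  { io  = $2 }
    /Slave_SQL_Running:/ { sql = $2 }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}

# Illustrative input only, not output captured from a real host.
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'
printf '%s\n' "$sample" | replication_running && echo "replicating"

# Possible use against the hosts in this task (not executed here):
#   sudo db-mysql db1136 -e 'show slave status\G' | replication_running
```

The patterns include the trailing colon so that the `Slave_SQL_Running_State` field is not mistaken for `Slave_SQL_Running`.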

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1181 heartbeat -e "delete from heartbeat where file like 'db1136%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1181
events_coredb_slave.sql on the new slave db1136
sudo dbctl instance db1136 set-candidate-master --section s7 true
sudo dbctl instance db1181 set-candidate-master --section s7 false
(dborch1001): sudo orchestrator-client -c untag -i db1181 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1136 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's7';"
  • (If needed): Depool db1136 for maintenance.
sudo dbctl instance db1136 depool
sudo dbctl config commit -m "Depool db1136 T311033"
  • Apply outstanding schema changes to db1136 (if any) T311033#8031845
  • Update/resolve this ticket.

Event Timeline

Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui moved this task from Ready to In progress on the DBA board.

db1181 (candidate master) needs a reboot for {T310485}

Done

Change 808801 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1181 to s7 master

https://gerrit.wikimedia.org/r/808801

Change 808802 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update s7-master CNAME

https://gerrit.wikimedia.org/r/808802

Change 808801 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1181 to s7 master

https://gerrit.wikimedia.org/r/808801

Change 809049 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1136: Disable notifications

https://gerrit.wikimedia.org/r/809049

Mentioned in SAL (#wikimedia-operations) [2022-06-28T06:00:25Z] <marostegui> Starting s7 eqiad failover from db1136 to db1181 - T311033

Change 808802 merged by Marostegui:

[operations/dns@master] wmnet: Update s7-master CNAME

https://gerrit.wikimedia.org/r/808802

This was done.
Read only start: 06:00:51
Read only stop: 06:01:36

Total read only time: 45 seconds
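
The 45-second figure follows directly from the two timestamps above. A short sketch of that arithmetic in POSIX shell follows. `to_seconds` and `elapsed_seconds` are hypothetical helper names, and both timestamps are assumed to fall on the same UTC day.

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
  old_ifs=$IFS
  IFS=:
  set -- $1            # split HH:MM:SS into $1 $2 $3
  IFS=$old_ifs
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Seconds between a start and a stop timestamp on the same day.
elapsed_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed_seconds 06:00:51 06:01:36   # prints 45
```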

All done, the pending schema changes will be done on their own tasks.

Change 809049 merged by Marostegui:

[operations/puppet@production] db1136: Disable notifications

https://gerrit.wikimedia.org/r/809049

Marostegui updated the task description.