Volans (Riccardo Coccioli)
SRE

User Details

User Since
Feb 10 2016, 11:25 AM (457 w, 3 d)
Availability
Available
IRC Nick
volans
LDAP User
Volans
MediaWiki User
RCoccioli (WMF) [ Global Accounts ]

Recent Activity

Thu, Nov 14

Volans added a comment to T378854: an-presto1018.eqiad.wmnet: DRAC is down.

Did you go through https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands ?

Thu, Nov 14, 5:50 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), SRE, ops-eqiad, DC-Ops
Volans added a comment to T379887: Brief hardware history on server metadata.

Could a Phabricator search with the hostname for open+closed tasks cover the needs?

Thu, Nov 14, 9:08 AM · Data-Persistence-SRE

Mon, Nov 11

Volans added a comment to T310009: Make it easier to create a new requestctl object.

With the new requestctl web UI, I think it would be very useful if the current requestctl generator ( https://superset.wikimedia.org/requestctl-generator?q= ) were adapted to work with it or, even better, ditched in favor of embedding the same functionality directly into the requestctl web UI, just by providing the URL of a superset dashboard with filters.

Mon, Nov 11, 1:36 PM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool
Volans added a comment to T377699: [EPIC] FY 24/25 WE 4.3.7 Roll out a user-friendly web application that enables assisted editing and creation of requestctl rules.

For 3. an EASY way to link the superset dashboard from a given requestctl rule is to generate links of this type:

Mon, Nov 11, 1:26 PM · User-Joe, Epic, conftool, Traffic

Thu, Nov 7

Volans claimed T379258: Add an ownership field to cookbooks..
Thu, Nov 7, 4:50 PM · SRE-tools, Infrastructure-Foundations
Volans claimed T379259: Outdated cookbooks cleanup.
Thu, Nov 7, 4:50 PM · Infrastructure-Foundations, SRE-tools
Volans added a comment to T371351: Automate the pre/post switchover tasks related to databases.

@Marostegui I've added some notes in those two pages and removed one paragraph that I think was obsolete due to active/active mediawiki. I think, though, that some of the steps listed there might be outdated.

Thu, Nov 7, 4:36 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
Volans added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

I think we could put the code directly into the move-vlan cookbook, if the host is present in dbctl, update it. I don't see too much use for an update-ip specific cookbook just for databases, but let me know if you see other use cases for it.

Thu, Nov 7, 4:10 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations

Wed, Nov 6

Volans added a comment to T379072: sre.netbox.update-extras hits KeyError with logging.

In our netbox config we have for the logging formatters:

'django.server': {
        '()': 'django.utils.log.ServerFormatter',
        'format': '[%(server_time)s] %(message)s',
}
Wed, Nov 6, 9:22 AM · Infrastructure-Foundations, SRE
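
A minimal sketch of one way a format string like the one above can fail, using only the Python standard library and not Netbox's or Django's actual code path: a `%(server_time)s` placeholder only resolves when the log record carries a `server_time` attribute. The `FillServerTime` class and the helper names are hypothetical, for illustration only.

```python
import logging

FMT = "[%(server_time)s] %(message)s"


def make_record(msg):
    # Build a bare record, as a logger unaware of server_time would emit it.
    return logging.LogRecord("demo", logging.INFO, "demo.py", 1, msg, None, None)


class FillServerTime(logging.Formatter):
    """Hypothetical formatter that adds server_time when it is missing."""

    def format(self, record):
        if not hasattr(record, "server_time"):
            record.server_time = self.formatTime(record, self.datefmt)
        return super().format(record)


def plain_format_fails(msg):
    # Older Python versions raise KeyError here; newer ones wrap it in ValueError.
    try:
        logging.Formatter(FMT).format(make_record(msg))
    except (KeyError, ValueError):
        return True
    return False
```

Whether this is the cause of the error in the task is an assumption; the sketch only shows why the placeholder is fragile when records come from loggers that do not set it.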
Volans added a comment to T336485: Setup zero touch provisioning (ZTP) for network devices.

I don't see that forced in /etc/ssh/ssh_config though. Also when I tried sudo ssh fasw2-c1b-eqiad.mgmt.eqiad.wmnet manually I was prompted to accept the host key. So I did that for both of them but even afterwards the cookbook reported the same thing. So we somehow need to tell cumin for these network devices to accept the host key when it connects.

Wed, Nov 6, 8:12 AM · Patch-For-Review, SRE, Infrastructure-Foundations, netops, SRE-tools

Mon, Nov 4

Volans triaged T378835: Test 1G NIC compatibility, default to TFTP in sre.hosts.reimage cookbook as Medium priority.

@bking we had a brief chat in the I/F meeting today about this. We think that this would mostly be a step backward instead of forward.

Mon, Nov 4, 4:17 PM · DC-Ops, Infrastructure-Foundations

Wed, Oct 30

Volans updated subscribers of T378667: IRC recent changes provider fails in Huggle after recent irc.wikimedia.org upgrade.
Wed, Oct 30, 11:12 PM · SRE, Infrastructure-Foundations, Huggle
Volans added a comment to T378572: Improve sre.mysql.depool cookbook.

I've tried on the es7 eqiad master, which is pooled with a weight of 10 and gets only 4.7% of requests.

Wed, Oct 30, 12:36 PM · Patch-For-Review, Data-Persistence-SRE
Volans added a comment to T378572: Improve sre.mysql.depool cookbook.

I've some comments/questions on the above:

Wed, Oct 30, 9:00 AM · Patch-For-Review, Data-Persistence-SRE
ABran-WMF awarded T378572: Improve sre.mysql.depool cookbook a Burninate token.
Wed, Oct 30, 8:57 AM · Patch-For-Review, Data-Persistence-SRE
Volans created T378572: Improve sre.mysql.depool cookbook.
Wed, Oct 30, 8:44 AM · Patch-For-Review, Data-Persistence-SRE

Mon, Oct 28

Volans triaged T378039: exception raised for "sre.dns.admin show" as Medium priority.
Mon, Oct 28, 2:21 PM · Infrastructure-Foundations, SRE-tools, SRE
Volans added a comment to T378346: Create cookbook to set up ganeti host network.

Should it be part of the sre.ganeti.addnode cookbook or done at reimage time?

Mon, Oct 28, 11:48 AM · netops, Infrastructure-Foundations, SRE
Devrepo awarded T378331: Puppet module hiera_lookup not working a Blobhaj token.
Mon, Oct 28, 9:23 AM · Infrastructure-Foundations, SRE-tools, Spicerack
Volans triaged T378331: Puppet module hiera_lookup not working as High priority.
Mon, Oct 28, 9:17 AM · Infrastructure-Foundations, SRE-tools, Spicerack

Wed, Oct 23

Volans closed T377738: Create a dbctl depool/pool cookbook as Resolved.

The sre.mysql.pool and sre.mysql.depool cookbooks have been merged and are ready to be used. Future expansions will be handled separately from this task.

Wed, Oct 23, 10:24 AM · Data-Persistence-SRE

Mon, Oct 21

Volans triaged T377738: Create a dbctl depool/pool cookbook as Medium priority.
Mon, Oct 21, 2:47 PM · Data-Persistence-SRE
Volans closed T295774: WMCS VIPs: Netbox netmask inconsistencies as Resolved.

Tentatively resolving, I can't repro it anymore. If you encounter the same issue please re-open it.

Mon, Oct 21, 2:47 PM · Patch-For-Review, SRE, Infrastructure-Foundations, SRE-tools
Volans placed T336485: Setup zero touch provisioning (ZTP) for network devices up for grabs.
Mon, Oct 21, 2:44 PM · Patch-For-Review, SRE, Infrastructure-Foundations, netops, SRE-tools
Volans closed T295774: WMCS VIPs: Netbox netmask inconsistencies, a subtask of T295762: Netbox - PuppetDB audit 2021-11, as Resolved.
Mon, Oct 21, 2:43 PM · SRE, Infrastructure-Foundations, SRE-tools, netops
Volans placed T328593: redfish: minimum version support up for grabs.

De-assigning from me as I haven't worked on it and don't plan to do so soon.

Mon, Oct 21, 2:42 PM · Infrastructure-Foundations, SRE-tools
Volans added a comment to T360029: Integrate dbctl IP changes as part of VLAN changes. .

Now that we have dbctl support in Spicerack it should be doable to add the step to modify the IP in dbctl when needed.

Mon, Oct 21, 2:40 PM · conftool, Data-Persistence, SRE, Infrastructure-Foundations

Fri, Oct 18

Volans closed T371351: Automate the pre/post switchover tasks related to databases as Resolved.

All pending patches have been tested and merged. Resolving.

Fri, Oct 18, 10:20 AM · Data-Persistence-SRE, DBA, Datacenter-Switchover
Volans closed T371351: Automate the pre/post switchover tasks related to databases, a subtask of T370962: Southward Datacenter Switchover (September 2024), as Resolved.
Fri, Oct 18, 10:20 AM · Patch-For-Review, Datacenter-Switchover, serviceops
Volans added a comment to T377534: Prepare/deploy new IPs for codfw cp nodes.

This seems a perfect opportunity to re-evaluate why we're hardcoding those values in Puppet's hiera and evaluate if that can be avoided.
Some possible options could be to gather them via a PuppetDB query, from DNS or exporting them from Netbox.

Fri, Oct 18, 7:00 AM · Traffic, netops, Infrastructure-Foundations

Oct 17 2024

Volans added a comment to T371351: Automate the pre/post switchover tasks related to databases.

Indeed it is. Let me use blame to see when this happened. This is good news because we finally know WHY this happened only recently; the sad part is that we may partially discard this patch.

Oct 17 2024, 2:21 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
Volans added a comment to T371351: Automate the pre/post switchover tasks related to databases.

There is a minor usability issue (which could be confusing under pressure). I get this text:

==> Run on section test-s4 was manually aborted. Continue with the remaining sections or abort completely?

However, if it is the last or the only section, it doesn't make much sense, as it would do the same, basically. Maybe just changing the wording if there are no more sections left even if you want to keep the pause?

What do you think?

Oct 17 2024, 2:20 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover

Oct 16 2024

Volans added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

The cause of that dry-run failure was the check, added here, that replication of MASTER_FROM from MASTER_TO is working

Oct 16 2024, 5:07 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 15 2024

Volans added a comment to T348730: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye.

A forced reboot via mgmt seems to have put it back in a working state for now.

Oct 15 2024, 3:33 PM · Infrastructure-Foundations, SRE

Oct 11 2024

Volans added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

@jcrespo I had it almost finished yesterday but then I had to step out. I've sent the patches. If you test with test-cookbook using the last one (1079537) as the CR, you'll also be testing all the other pending improvements that were done but not yet merged.
The last one also allows testing on a custom section, so you can pass --section test-s4 and it should do the right thing.

Oct 11 2024, 3:00 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 10 2024

Volans added a comment to T375144: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication.

Thanks @jcrespo for the detailed request. I'll get to it. Only one question, are you sure we want to use REPLACE and not INSERT? I thought that replace contributed to the issue.

Oct 10 2024, 2:00 PM · SRE-OnFire, Sustainability, Data-Persistence-SRE, DBA

Oct 7 2024

Volans closed T362893: Spicerack support for dbctl as Resolved.

This has been released and tested. Resolving.

Oct 7 2024, 1:40 PM · DBA, Infrastructure-Foundations, conftool, Spicerack, SRE-tools
Volans added a comment to T375014: Support listing pooled / active authdns hosts (rather than all).

@ssingh what do you think of the above draft patch proposal? If that works for you I'll complete it and get it included into spicerack.

Oct 7 2024, 9:21 AM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
Volans added a comment to T376596: spicerack mysql_legacy: support fetch metrics for instance.

Spicerack has support for prometheus, why not getting the metrics directly from there in the cookbooks?

Oct 7 2024, 9:04 AM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack, Data-Persistence-SRE, DBA

Oct 2 2024

Volans added a comment to T376291: Authdns: automate reverse DNS zone delegation for k8s pod IP ranges.

Is there a plan to try to get away from the very long hardcoded lists in hiera?
How often do you expect the data to change? This might affect how we want to trigger the changes.

Oct 2 2024, 3:42 PM · Patch-For-Review, Traffic, Infrastructure-Foundations, SRE

Sep 30 2024

Volans added a comment to T362824: Q#:rack/setup/install dbproxy200[5-8].

The above was me aborting the leftover execution of the cookbook that had been left waiting for user input.

Sep 30 2024, 3:12 PM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops

Sep 25 2024

Volans added a comment to T375590: SREBatchBase: Print allowed aliases in help message.

With the current API that's not possible because allowed_aliases is an instance property (not a class property) of the runner class, not the cookbook class.
This means that we need an instance of the runner to get the list of allowed aliases and to get that instance we need an instance of spicerack that is passed only to the get_runner method and not the argument_parser one.

Sep 25 2024, 9:20 AM · Infrastructure-Foundations, SRE-tools
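
The limitation described above can be sketched roughly as follows. This is not the Spicerack API: `Runner`, `Cookbook`, and the dict standing in for the spicerack instance are hypothetical stand-ins that only mirror the `allowed_aliases`, `argument_parser`, and `get_runner` names mentioned in the comment.

```python
import argparse


class Runner:
    """Hypothetical runner: aliases are only known once an instance exists."""

    def __init__(self, args, spicerack):
        self._args = args
        self._spicerack = spicerack

    @property
    def allowed_aliases(self):
        # Instance property: depends on state passed at construction time.
        return self._spicerack["aliases"]


class Cookbook:
    """Hypothetical cookbook exposing the two hooks named in the comment."""

    def argument_parser(self):
        # No spicerack instance is available here, so no Runner can be built
        # and the help text cannot list the allowed aliases.
        parser = argparse.ArgumentParser(prog="demo")
        parser.add_argument("alias", help="One of the allowed aliases.")
        return parser

    def get_runner(self, args, spicerack):
        # Only here does the spicerack instance become available.
        return Runner(args, spicerack)
```

In this sketch the aliases become readable only after `get_runner` has been called, which is later than the point where the help message is built.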

Sep 24 2024

Volans added a comment to T375285: sre.discovery.datacenter should handle depooled authdns hosts.

If this is not super urgent, do you think it could wait for an "upstream" solution in spicerack as discussed in T375014?

Sep 24 2024, 8:33 AM · Patch-For-Review, Datacenter-Switchover, serviceops

Sep 23 2024

Volans added a comment to T375382: Post pc1013 crash.

AFAIK pc1015 should be the candidate host if we want to fail it over, from dbctl:

"note": "Hot spare for pc4 and cold spare for pc3",
Sep 23 2024, 2:06 PM · Wikimedia-production-error, Sustainability (Incident Followup), SRE, DBA
Volans added a comment to T375014: Support listing pooled / active authdns hosts (rather than all).

Thanks for the summary @ssingh. I have a local proposal that I will send out when ready. There is one main point to decide, which is the "caching" time of this information:

Sep 23 2024, 1:25 PM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack

Sep 18 2024

Volans triaged T375014: Support listing pooled / active authdns hosts (rather than all) as Medium priority.

Thanks for the task. I think the main decision to make is how fresh the data needs to be. If we opt for refreshed every time it's called then we need to think well how to inject it into spicerack as it would not be possible to just have it as a helper in the Spicerack class.

Sep 18 2024, 9:37 AM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack

Sep 17 2024

Volans added a comment to T371351: Automate the pre/post switchover tasks related to databases.

The idea was that this is all part of the A->B datacenter switchover and the finalize steps are still part of that process hence to be called with the same flow A->B. If you prefer otherwise we can invert it and/or improve the help messages.

Sep 17 2024, 4:32 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
Volans added a comment to T371351: Automate the pre/post switchover tasks related to databases.

I executed:

test-cookbook -c 1059052 --dry-run sre.switchdc.databases.prepare -t T371351 eqiad codfw

and got:

DRY-RUN: Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet

later, I wanted to test this erroring, so I did:

test-cookbook -c 1059052 --dry-run sre.switchdc.databases.prepare -t T371351 codfw eqiad

(note the wrong dc direction)

But it validated correctly:

DRY-RUN: Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet

Shouldn't it fail and warn, rather than silently autocorrect, because I asked to switch over in the wrong direction? Otherwise, if for some reason replication was flowing in the wrong direction, it would just add the other link, rather than warn and let the operator either continue, fix or abort.

Sep 17 2024, 3:03 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
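
The behavior requested above, failing instead of silently autocorrecting, could be sketched as below. This is not the cookbook's code: `WrongDirectionError` and `validate_direction` are hypothetical names, and the sketch assumes that before the switch replication is expected to flow from the first datacenter argument to the second.

```python
class WrongDirectionError(Exception):
    """Raised when replication does not flow the way the operator asked."""


def validate_direction(dc_from, dc_to, observed_source_dc, observed_replica_dc):
    """Compare the requested direction with the observed replication link.

    Assumption: the observed link is described by the datacenter of the
    replication source and the datacenter of the replica.
    """
    if (observed_source_dc, observed_replica_dc) != (dc_from, dc_to):
        # Fail loudly so the operator can continue, fix or abort.
        raise WrongDirectionError(
            f"Requested {dc_from} -> {dc_to} but replication flows "
            f"{observed_source_dc} -> {observed_replica_dc}"
        )
    return f"Validated replication topology from {dc_from} to {dc_to}"
```

With the hosts from the comment, a call with `codfw eqiad` would raise, because the observed link still goes from eqiad to codfw.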
Volans added a comment to T271139: Some WMCS clusters apparently do not support IPv6.

This is the updated list as of today: clouddb2002-dev,cloudlb2004-dev,clouddb[1013-1020].
I guess that the clouddb hosts are expected, and none of them have AAAA records. That leaves only cloudlb2004-dev as the outlier.

Sep 17 2024, 9:01 AM · cloud-services-team, Infrastructure-Foundations, IPv6, User-crusnov, SRE-tools

Sep 16 2024

Volans added a comment to T283204: Clarify 'wipe bootloader' step in sre.hosts.decommission.

As there was no agreement here on the task and multiple years have passed, we decided to close it. Feel free to reopen in case there is more consensus.

Sep 16 2024, 2:36 PM · Infrastructure-Foundations, SRE-tools
Volans closed T293209: Spicerack: add support for Alertmanager as Resolved.

The alertmanager support has been in place for a long time. Resolving. Any additional feature will be developed separately.

Sep 16 2024, 2:35 PM · User-fgiunchedi, Observability-Alerting, Infrastructure-Foundations, SRE-tools

Sep 10 2024

Volans added a comment to T374443: Move puppet-merge (bash script) to puppetserver1001.

With a view to a cookbook replacing puppet-merge, I'd try to use https://doc.wikimedia.org/spicerack/master/api/spicerack.reposync.html#spicerack.reposync.RepoSync

Sep 10 2024, 1:18 PM · User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
Volans added a comment to T374351: Race condition on puppetdb in sre.hosts.rename cookbook.

I don't think it does anymore unfortunately...

Sep 10 2024, 10:24 AM · SRE-tools, Infrastructure-Foundations, serviceops-radar
Volans added a comment to T374351: Race condition on puppetdb in sre.hosts.rename cookbook.

While the above is totally true the probability that a rename+reimage happens exactly at the time of the timer that runs once a day is fairly low.

Sep 10 2024, 8:04 AM · SRE-tools, Infrastructure-Foundations, serviceops-radar

Sep 5 2024

Volans added a comment to T372666: Exclude legacy facts by default.

I need to recollect my old memories and check local branches. The hardest part, IIRC, is not the code changes but the grammar changes to support it. Do you have a timeline in mind, so that we can plan accordingly?

Sep 5 2024, 10:01 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
Volans added a comment to T374073: Unified pattern for RemoteHosts accessors in Spicerack.

Thanks for the task, we'll evaluate the various options and come up with a final proposal.

Sep 5 2024, 9:39 AM · User-Elukey, Spicerack, Infrastructure-Foundations, SRE-tools

Sep 4 2024

Volans added a comment to T371889: Upgrade Netbox to 4.1.

Netbox 4.1 is out, published yesterday.

Sep 4 2024, 4:25 PM · netbox, Infrastructure-Foundations
Volans added a comment to T372666: Exclude legacy facts by default.

FYI Cumin's puppet backend too will need to be refactored to support structured facts.

Sep 4 2024, 10:07 AM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
Volans added a comment to T368257: generate_vrts_aliases failing on mx-in1001.

The patch was needed; the last error was at Sep 03 17:08:32. After that it ran smoothly except for one run at Sep 03 20:10:57 that failed with ERROR:/usr/local/bin/vrts_aliases:Connection unexpectedly closed.

Sep 4 2024, 8:25 AM · Infrastructure-Foundations, Mail, vrts, collaboration-services

Sep 3 2024

Volans added a comment to T371899: Review how the debmonitor server processes hosts/images when starting fresh.

We've 3.2 in prod right now ( https://docs.djangoproject.com/en/3.2/ref/models/querysets/#bulk-update ) but yes this is an option, although save() will not be called and so we need to verify if that will skip some steps.

Sep 3 2024, 3:11 PM · User-Elukey, Infrastructure-Foundations
Volans added a comment to T368257: generate_vrts_aliases failing on mx-in1001.

@LSobanski the current failures are because there are 2 email addresses in the config that are managed by gsuite (I've redacted some part):

Sep 3 2024, 10:01 AM · Infrastructure-Foundations, Mail, vrts, collaboration-services

Sep 2 2024

Volans added a comment to T271136: Some Foundation clusters do not appear to support IPv6.

Updated list of ganeti hosts without AAAA records (all the others have them): ganeti[1009-1022,2009-2024]

Sep 2 2024, 3:40 PM · Infrastructure-Foundations, SRE, IPv6, SRE-tools, User-jbond
Volans added a comment to T371351: Automate the pre/post switchover tasks related to databases.
sre.switchdc.databases.finalize
  • x1 (PASS)
    • Validated replication topology for section x1 between MASTER_TO db1220.eqiad.wmnet and MASTER_FROM db2196.codfw.wmnet
    • MASTER_TO db1220.eqiad.wmnet has no replication set, skipping.
    • MASTER_TO db2196.codfw.wmnet heartbeat server IDs to delete are: []
    • MASTER_FROM db2196.codfw.wmnet STOP SLAVE.
    • MASTER_FROM db2196.codfw.wmnet MASTER_USE_GTID=slave_pos.
    • MASTER_FROM db2196.codfw.wmnet START SLAVE.
    • Ignoring failed check for GTID change in DRY-RUN mode.
    • Enabled GTID on MASTER_FROM db2196.codfw.wmnet

I need to check the code in more detail to see what this finalize step does, because there are some things that I don't get.
If this is the cookbook to run once the DC switchover has finished (so a day after it - as the code says) the order doesn't seem correct

If we are switching from eqiad to codfw that means that after the switch codfw is primary and eqiad is secondary.
After the switch, db1120 would have replication enabled, and the only needed thing would be to:

  • db2196, remove circular replication so: stop slave; reset slave all; so no need to enable GTID there. Primary masters do not replicate from anyone
  • db1120 (now a normal slave), enable GTID (it should have replication enabled already, coming from the prepare step)
Sep 2 2024, 1:09 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover

Aug 1 2024

Volans updated subscribers of T371351: Automate the pre/post switchover tasks related to databases.

Final update before vacations :)

Aug 1 2024, 6:42 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
Volans added a comment to T371351: Automate the pre/post switchover tasks related to databases.

Status update:
Some addition to spicerack were made in this patch. I would actually be tempted to rename Instance.run_vertical_query() to Instance.run_query() and replace it as it's the only way to sensibly parse some output (with the caveats explained in the docstring).
With the puppet patch we save the replication credentials to the cumin hosts in a way accessible by the cookbooks, so it will be possible to start testing (with test-cookbook, passing the DRY-RUN mode option) the two new cookbooks, introduced here as a good draft but still WIP, for the pre and post steps required for the datacenter switch.

Aug 1 2024, 4:00 PM · Data-Persistence-SRE, DBA, Datacenter-Switchover
Volans added a comment to T362893: Spicerack support for dbctl.

Status update:
The conftool improvements (here and here) made it possible to support dry-run for conftool in spicerack, which gets added here (to be merged), and finally to add a dbctl wrapper in spicerack here (to be merged).

Aug 1 2024, 12:48 PM · DBA, Infrastructure-Foundations, conftool, Spicerack, SRE-tools
Volans merged task T371559: Degraded RAID on cp7015 into Restricted Task.
Aug 1 2024, 5:27 AM · SRE, ops-magru

Jul 31 2024

Volans added a comment to T369654: Q1:rack/setup/install db22[21-40].

@Papaul We've debugged this live during the tooling and automation office hours and we think it's a race condition due to the fact that late_command.sh still defaults to puppet 5 if unable to read the version. Luca is changing that.

Jul 31 2024, 4:35 PM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
Volans added a comment to T369654: Q1:rack/setup/install db22[21-40].

@Papaul what's the timeline for deciding to no longer reimage buster hosts, so we can just have puppet 7 and solve all problems?

Jul 31 2024, 2:16 PM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops

Jul 30 2024

Volans updated the task description for T371132: Provision cookbook not setting serial console and other settings.
Jul 30 2024, 2:29 PM · Data-Persistence, User-Elukey, DC-Ops, Infrastructure-Foundations
Volans updated the task description for T371132: Provision cookbook not setting serial console and other settings.
Jul 30 2024, 2:28 PM · Data-Persistence, User-Elukey, DC-Ops, Infrastructure-Foundations
Volans updated the task description for T371132: Provision cookbook not setting serial console and other settings.
Jul 30 2024, 2:23 PM · Data-Persistence, User-Elukey, DC-Ops, Infrastructure-Foundations
Volans added a comment to T369654: Q1:rack/setup/install db22[21-40].

I've just cleaned the host from both puppetmaster and puppetserver to be sure it was not there and ran:

sudo cookbook sre.hosts.reimage -t T369654 --os bookworm -p 7 --force-dhcp-tftp --new db2227
Jul 30 2024, 10:32 AM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
Volans created T371351: Automate the pre/post switchover tasks related to databases.
Jul 30 2024, 8:46 AM · Data-Persistence-SRE, DBA, Datacenter-Switchover
Volans added a comment to T369654: Q1:rack/setup/install db22[21-40].

The reimage cookbook did set it to puppet 7 each time:

$ grep "node=db2227.codfw.wmnet, command='printf" reimage-extended.log
2024-07-29 12:27:45,345 jhancock 2656784 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version'
2024-07-29 17:59:29,599 pt1979 3022287 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version'
2024-07-29 19:35:24,239 pt1979 3092430 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version'
Jul 30 2024, 8:43 AM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
Volans added a comment to T369654: Q1:rack/setup/install db22[21-40].

Has it been cleared from puppet5 between each reimage? After the first reimage if the host is in puppetdb the reimage cookbook will use the current puppet version of the host. That could explain it. Let me know if you want me to do something with it.

Jul 30 2024, 7:53 AM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
Volans added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

That's the same it had during the last reboot+reimage cycle, see T370304#10016765 . That's supposedly the NIC, that surely needs investigating, but at first sight doesn't look like the culprit to me (but it could be ofc).

Jul 30 2024, 7:41 AM · User-notice-archive, MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, Wikimedia-Incident, DBA, Wikimedia-production-error

Jul 29 2024

Volans added a comment to T371132: Provision cookbook not setting serial console and other settings.

Fix is in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1057927

Jul 29 2024, 10:46 PM · Data-Persistence, User-Elukey, DC-Ops, Infrastructure-Foundations
Volans renamed T371132: Provision cookbook not setting serial console and other settings from Provision cookbook not setting serial console on 450 and 650xs model to Provision cookbook not setting serial console and other settings.
Jul 29 2024, 10:46 PM · Data-Persistence, User-Elukey, DC-Ops, Infrastructure-Foundations
Volans added a comment to T371132: Provision cookbook not setting serial console and other settings.

I've noticed from the logs that we haven't been setting any additional values since May 31st. From SAL I saw that we didn't have reimages between the 31st and when we merged the above patch.

Jul 29 2024, 4:32 PM · Data-Persistence, User-Elukey, DC-Ops, Infrastructure-Foundations
Volans added a comment to T371132: Provision cookbook not setting serial console and other settings.

@Papaul can you give us some provision cookbook run that didn't set it so we could check the logs please? Hostname and date/time if they were run multiple times.

Jul 29 2024, 3:50 PM · Data-Persistence, User-Elukey, DC-Ops, Infrastructure-Foundations
Volans placed T357756: Cookbook sre.hardware.upgrade-firmware fails to get firmwares from Dell's website up for grabs.
Jul 29 2024, 2:39 PM · User-Elukey, Infrastructure-Foundations, DC-Ops, SRE-tools
Volans added a project to T371252: conftool and pyparsing requirements: conftool.
Jul 29 2024, 1:51 PM · Patch-For-Review, conftool, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
Volans added a comment to T371216: Route FR to esams.

FYI 86% of maxmind's French prefixes have the subdivisions information. How accurate they are, both in terms of geolocation and, as a consequence, routing for this specific case, remains to be seen.

Jul 29 2024, 1:24 PM · Traffic
Volans closed T370029: cumin2002 db-switchover debug as Resolved.

Yep, I thought it was already :)

Jul 29 2024, 12:10 PM · DBA
Volans updated the task description for T367949: Spin down api_appserver and appserver clusters.
Jul 29 2024, 10:47 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Volans added a comment to T371079: Change icinga link to alerts.w.o in netbox device page.

My 2 cents would be to keep both for now as in icinga you can also see what are the icinga checks configured for that device.

Jul 29 2024, 10:17 AM · netbox, Infrastructure-Foundations
Volans added a comment to T371216: Route FR to esams.

If deemed useful those are the subdivisions in maxmind db for France (nda-only) and the related docs for gdnsd are here (assuming up to date ;)):
https://github.com/gdnsd/gdnsd/wiki/GdnsdPluginGeoip#geoip2-location-data-hierarchy

Jul 29 2024, 9:02 AM · Traffic

Jul 26 2024

Volans added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

During the last occurrence I gathered a full processlist in cumin1002:/home/volans/2024-07-26_01.18_s4_master.processlist for those interested.

Jul 26 2024, 12:06 AM · User-notice-archive, MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, Wikimedia-Incident, DBA, Wikimedia-production-error
Volans added a comment to T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis.

Some general timeline from the host PoV.

Jul 26 2024, 12:06 AM · User-notice-archive, MediaWiki-Platform-Team (Radar), Vuln-DoS, SecTeam-Processed, Security, Essential-Work, Content-Transform-Team-WIP, Wikimedia-Incident, DBA, Wikimedia-production-error

Jul 25 2024

Volans added a comment to T371049: prometheus-mysqld-exporter doesn't fully support multi-instances for pt-heartbeat.

As a side note there were 2 other alerts that were failing and for those the solution was just to add the missing:

GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`
Jul 25 2024, 6:41 PM · DBA
Volans renamed T371049: prometheus-mysqld-exporter doesn't fully support multi-instances for pt-heartbeat from prometheus-mysqld-exporter doesn't take fully support multi-instances for pt-heartbeat to prometheus-mysqld-exporter doesn't fully support multi-instances for pt-heartbeat.
Jul 25 2024, 6:36 PM · DBA
Volans created T371049: prometheus-mysqld-exporter doesn't fully support multi-instances for pt-heartbeat.
Jul 25 2024, 6:35 PM · DBA
Volans added a comment to T367278: Migrate mysql icinga alerts to alert manager - pt-heartbeat + scaffolding.

EDIT: sorry I noticed the task was closed only afterwards. Created T371049 as a follow up.

Jul 25 2024, 6:34 PM · Patch-For-Review, DBA

Jul 24 2024

Volans added a comment to T370386: statograph_post errors with out of range float values since 2024-07-16.

Thanks for handling this while I was out!

Jul 24 2024, 7:24 PM · observability
Volans added a comment to T368023: Move the private Puppet repository to puppetserver1001.

Let's just make sure that requestctl works fine on the puppetmasters. It's installed, but the pyparsing version on bookworm is technically outside of the range allowed in setup.py.

Jul 24 2024, 2:42 PM · Patch-For-Review, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
Volans added a comment to T370852: Migrate codfw row C & D database hosts to new Leaf switches.

@Ladsgroup that looks very useful, I didn't know about it. Is it mentioned anywhere? I can't find it on wikitech.
Having glimpsed at it I also have some questions and potential concerns about the data it uses. Where would be a good place to raise them?

Jul 24 2024, 12:58 PM · DBA, ops-codfw, Infrastructure-Foundations, netops, SRE, DC-Ops
Volans added a comment to T370852: Migrate codfw row C & D database hosts to new Leaf switches.

Without too much previous experience from past migrations, I think we could tackle it per DB section (aka shard), moving all easily depoolable hosts in a section first and then leaving the harder to depool ones (like masters) for last.
Let me know if I should extract a distribution of hosts per section and row/rack

Jul 24 2024, 11:24 AM · DBA, ops-codfw, Infrastructure-Foundations, netops, SRE, DC-Ops
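
The per-section ordering proposed above could be sketched like this. The host records, the `role` values, and the `migration_order` helper are hypothetical; real data would have to come from an inventory source not shown here.

```python
from collections import defaultdict


def migration_order(hosts):
    """Group hosts by DB section and put masters last within each section.

    Each host is a dict with hypothetical keys: name, section, role.
    """
    by_section = defaultdict(list)
    for host in hosts:
        by_section[host["section"]].append(host)

    plan = []
    for section in sorted(by_section):
        # sorted() is stable and False sorts before True, so easily
        # depoolable hosts keep their order and masters move to the end.
        ordered = sorted(by_section[section], key=lambda h: h["role"] == "master")
        plan.append((section, [h["name"] for h in ordered]))
    return plan
```

Treating only masters as hard to depool is a simplification; other roles could be added to the sort key in the same way.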
Volans added a comment to T370118: Register Citoid as a "friendly bot" (or alternatively verified bot) with Cloudflare.

@akosiaris with the wikimedia account we have, we do have access to the Add Verified Bot form and potentially we could fill that one in. The verification methods in that case are:

  • Reverse DNS
  • IP List
  • ASN
Jul 24 2024, 10:44 AM · serviceops, Goal, VisualEditor-MediaWiki-References, Editing-team (Kanban Board), VisualEditor, Citoid
Volans updated subscribers of T370808: Consider registering citoid as a verified or friendly bot with Cloudflare.

@joanna_borun I think this should probably be discussed in the next SRE I/F meeting for approval.

Jul 24 2024, 9:32 AM · Infrastructure-Foundations, Citoid, Editing-team