ops-monitoring-bot (Operations Monitoring Bot)
UserBot

User Details

User Since
Aug 12 2016, 1:45 PM (431 w, 6 d)
Roles
Bot
Availability
Available
LDAP User
Unknown
MediaWiki User
Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.

Recent Activity

Today

ops-monitoring-bot added a comment to T378988: codfw: (3x) aux-k8s-etcd nodes.

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-etcd2005.codfw.wmnet with OS bookworm completed:

  • aux-k8s-etcd2005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411220046_herron_3849689_aux-k8s-etcd2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Fri, Nov 22, 1:00 AM · vm-requests, SRE, Kubernetes
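The "First Puppet run completed and logged in ..." step above names a log file whose basename appears to follow a `<timestamp>_<user>_<number>_<host>.out` pattern. Below is a minimal sketch for splitting such a path into its parts. The field meanings are assumptions read off the examples in this feed: the 12-digit prefix is treated as `YYYYMMDDHHMM` with no time zone attached, and the numeric third field is called `run_id` here only as a hypothetical label (the feed does not say what it is). The function and class names are illustrative and not part of Spicerack.

```python
import re
from datetime import datetime
from typing import NamedTuple


class ReimageLog(NamedTuple):
    """Fields recovered from a reimage log file name (names are hypothetical)."""
    started: datetime  # assumed YYYYMMDDHHMM, time zone unknown
    user: str
    run_id: int        # meaning not stated in the feed; assumption
    host: str


# Matches the basename pattern seen in the feed, e.g.
# 202411220046_herron_3849689_aux-k8s-etcd2005.out
_LOG_NAME = re.compile(
    r"(?P<ts>\d{12})_(?P<user>[^_/]+)_(?P<run_id>\d+)_(?P<host>[^/]+)\.out$"
)


def parse_reimage_log_path(path: str) -> ReimageLog:
    """Split a reimage log path into timestamp, user, number, and host."""
    match = _LOG_NAME.search(path)
    if match is None:
        raise ValueError(f"not a recognised reimage log path: {path!r}")
    return ReimageLog(
        started=datetime.strptime(match["ts"], "%Y%m%d%H%M"),
        user=match["user"],
        run_id=int(match["run_id"]),
        host=match["host"],
    )
```

For example, passing the path from the aux-k8s-etcd2005 entry yields the user `herron` and the short host name `aux-k8s-etcd2005`, which can then be matched against the task comment it came from.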
ops-monitoring-bot added a comment to T378988: codfw: (3x) aux-k8s-etcd nodes.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-etcd2005.codfw.wmnet with OS bookworm

Fri, Nov 22, 12:27 AM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T378988: codfw: (3x) aux-k8s-etcd nodes.

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-etcd2004.codfw.wmnet with OS bookworm completed:

  • aux-k8s-etcd2004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411212356_herron_3840382_aux-k8s-etcd2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Fri, Nov 22, 12:11 AM · vm-requests, SRE, Kubernetes

Yesterday

ops-monitoring-bot added a comment to T378988: codfw: (3x) aux-k8s-etcd nodes.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-etcd2004.codfw.wmnet with OS bookworm

Thu, Nov 21, 11:36 PM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T378988: codfw: (3x) aux-k8s-etcd nodes.

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-etcd2003.codfw.wmnet with OS bookworm completed:

  • aux-k8s-etcd2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411212255_herron_3831695_aux-k8s-etcd2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 11:09 PM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T378146: Q2:rack/setup/install es204[1-6].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed with errors:

  • es2041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es2041.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Thu, Nov 21, 11:06 PM · Data-Persistence-SRE, DBA, Patch-For-Review, SRE, Data-Persistence, ops-codfw, DC-Ops
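Reimage summaries in this feed open with a header line that ends in either "completed:" or "executed with errors:". The sketch below shows one way to pull the cookbook, user, cumin host, target host, OS, and outcome out of such a line and to count outcomes per host. The regular expression is inferred only from the comment wording visible here, so it is an assumption about the format rather than a documented interface, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Header wording as it appears in the bot comments in this feed.
# "was started by" announcements deliberately do not match.
_HEADER = re.compile(
    r"Cookbook (?P<cookbook>\S+) started by (?P<user>[^@\s]+)@(?P<cumin>\S+) "
    r"for host (?P<host>\S+) with OS (?P<os>\S+) "
    r"(?P<outcome>completed|executed with errors):"
)


def parse_reimage_header(line: str) -> Optional[Dict[str, str]]:
    """Return the header fields as a dict, or None if the line is not a summary header."""
    match = _HEADER.search(line)
    return match.groupdict() if match else None


def tally_outcomes(lines: Iterable[str]) -> Counter:
    """Count (host, outcome) pairs across a sequence of comment lines."""
    counts: Counter = Counter()
    for line in lines:
        parsed = parse_reimage_header(line)
        if parsed is not None:
            counts[(parsed["host"], parsed["outcome"])] += 1
    return counts
```

Run over the lines of this feed, a tally like this would make repeated failures on a single host easier to spot than scanning the entries by eye.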
ops-monitoring-bot added a comment to T378988: codfw: (3x) aux-k8s-etcd nodes.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-etcd2003.codfw.wmnet with OS bookworm

Thu, Nov 21, 10:38 PM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T378146: Q2:rack/setup/install es204[1-6].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm

Thu, Nov 21, 10:23 PM · Data-Persistence-SRE, DBA, Patch-For-Review, SRE, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T369308: Decommission clouddb2002-dev.codfw.wmnet.

cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: clouddb2002-dev.codfw.wmnet

  • clouddb2002-dev.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Thu, Nov 21, 5:40 PM · SRE, DC-Ops, ops-codfw, wikitech.wikimedia.org, Data-Persistence, cloud-services-team
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2157.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2157 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211643_cgoubert_3783156_wikikube-worker2157.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 5:03 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2140 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211626_cgoubert_3782080_wikikube-worker2140.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Nov 21, 4:47 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2157.codfw.wmnet with OS bookworm

Thu, Nov 21, 4:21 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2157.codfw.wmnet with OS bookworm executed with errors:

  • wikikube-worker2157 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker2157.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Thu, Nov 21, 4:20 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm

Thu, Nov 21, 4:04 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm executed with errors:

  • wikikube-worker2140 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker2140.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Thu, Nov 21, 4:03 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm

Thu, Nov 21, 3:24 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm executed with errors:

  • wikikube-worker2140 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker2140.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Thu, Nov 21, 3:23 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm

Thu, Nov 21, 3:16 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm executed with errors:

  • wikikube-worker2140 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker2140.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Thu, Nov 21, 3:16 PM · serviceops
ops-monitoring-bot added a comment to T380236: Refresh restbase202[1-3] w/ restbase203[6-8].

Icinga downtime and Alertmanager silence (ID=328d8f7f-4fde-44b1-abaa-53eda8f15600) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommissioning — T380236

restbase2021.codfw.wmnet
Thu, Nov 21, 3:10 PM · Cassandra
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2169.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2169 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211436_cgoubert_3757703_wikikube-worker2169.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 2:56 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:56 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2168.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2168 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211433_cgoubert_3757653_wikikube-worker2168.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 2:53 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2170.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2170 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211431_cgoubert_3757789_wikikube-worker2170.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 2:51 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2167.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2167 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211428_cgoubert_3757599_wikikube-worker2167.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 2:49 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm executed with errors:

  • wikikube-worker2140 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wikikube-worker2140.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Thu, Nov 21, 2:47 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2166.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2166 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211425_cgoubert_3757560_wikikube-worker2166.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 2:47 PM · serviceops
ops-monitoring-bot added a comment to T363214: kafka-main100[6789] and kafka-main1010 implementation tracking.

Icinga downtime and Alertmanager silence (ID=9f9d188a-551c-412a-8d68-ca67db96a150) set by jynus@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Per claime's recommendation

kafka-main1001.eqiad.wmnet
Thu, Nov 21, 2:46 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2157.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:46 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2140.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:39 PM · serviceops
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin2002 for host kubestage1006.eqiad.wmnet with OS bookworm completed:

  • kubestage1006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211354_jayme_2722797_kubestage1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 2:11 PM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin2002 for host kubestage1005.eqiad.wmnet with OS bookworm completed:

  • kubestage1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211347_jayme_2722015_kubestage1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 2:06 PM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2170.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:06 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2169.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:05 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2168.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:04 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2167.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:04 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2166.codfw.wmnet with OS bookworm

Thu, Nov 21, 2:04 PM · serviceops
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin2002 for host kubestage1006.eqiad.wmnet with OS bookworm

Thu, Nov 21, 1:34 PM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

Cookbook cookbooks.sre.hosts.rename started by jayme@cumin2002 from kubernetes1008 to kubestage1006 completed:

  • kubernetes1008 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new
Thu, Nov 21, 1:33 PM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin2002 for host kubestage1005.eqiad.wmnet with OS bookworm

Thu, Nov 21, 1:28 PM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

Cookbook cookbooks.sre.hosts.rename started by jayme@cumin2002 from kubernetes1007 to kubestage1005 completed:

  • kubernetes1007 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new
Thu, Nov 21, 1:25 PM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2160.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2160 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211258_cgoubert_3728890_wikikube-worker2160.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 1:18 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2164.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2164 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211255_cgoubert_3729280_wikikube-worker2164.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 1:16 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2162.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2162 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211252_cgoubert_3729126_wikikube-worker2162.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 1:11 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2165.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2165 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211249_cgoubert_3729365_wikikube-worker2165.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 1:10 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2163.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2163 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211246_cgoubert_3729215_wikikube-worker2163.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 1:05 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2158.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2158 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211242_cgoubert_3728774_wikikube-worker2158.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 1:02 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2161.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2161 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211239_cgoubert_3729032_wikikube-worker2161.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 12:58 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2156.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2156 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411211236_cgoubert_3728713_wikikube-worker2156.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Nov 21, 12:55 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2165.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:19 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2164.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:19 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2163.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:18 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2162.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:17 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2160.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:17 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2161.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:16 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2158.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:14 PM · serviceops
ops-monitoring-bot added a comment to T376966: wikikube-worker21[56-70] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2156.codfw.wmnet with OS bookworm

Thu, Nov 21, 12:13 PM · serviceops
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin2002 depool for host kubernetes[1007-1008].eqiad.wmnet completed:

  • kubernetes[1007-1008].eqiad.wmnet (PASS)
    • Host kubernetes[1007-1008].eqiad.wmnet depooled from wikikube-eqiad
Thu, Nov 21, 10:41 AM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T380043: Add 2 more nodes per DC to wikikube-staging.

depool host kubernetes[1007-1008].eqiad.wmnet by jayme@cumin2002 with reason: None

Thu, Nov 21, 10:40 AM · serviceops, Kubernetes
ops-monitoring-bot added a comment to T380451: Lumen codfw-ulsfo down (Nov 2024).

Automated diagnostic for Netbox circuit ID 102

Interface cr4-ulsfo:xe-0/1/1

  • admin-status: up
  • ⚠️ oper-status: down
  • interface-flapped: 2024-11-20 17:04:03 UTC (17:15:19 ago)
  • ⚠️ errors: {'input-errors': 1077, 'framing-errors': 1077, 'carrier-transitions': 136, 'output-errors': 24}
  • laser-output-power: 0.7030
  • laser-output-power-dbm: -1.53
  • rx-signal-avg-optical-power: 0.0004
  • ⚠️ rx-signal-avg-optical-power-dbm: -33.98
Thu, Nov 21, 10:19 AM · netops, Infrastructure-Foundations
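
The optical readings in the diagnostic above come in pairs: a linear power value and the same value in dBm. Assuming the unsuffixed values are in milliwatts (which is consistent with the reported dBm figures), the relationship is dBm = 10 · log10(P / 1 mW). The sketch below reproduces both dBm numbers from the linear ones; the function name is illustrative only.

```python
import math


def mw_to_dbm(power_mw: float) -> float:
    """Convert optical power in milliwatts to dBm (decibels relative to 1 mW)."""
    if power_mw <= 0:
        raise ValueError("power must be positive to take a logarithm")
    return 10.0 * math.log10(power_mw)


# Values copied from the diagnostic; the mW unit is an assumption.
tx_mw = 0.7030  # laser-output-power
rx_mw = 0.0004  # rx-signal-avg-optical-power

print(f"tx: {mw_to_dbm(tx_mw):.2f} dBm")
print(f"rx: {mw_to_dbm(rx_mw):.2f} dBm")
```

The receive figure is roughly 32 dB below the transmit figure, in line with the warning marker the bot attaches to that reading.
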
ops-monitoring-bot added a comment to T380236: Refresh restbase202[1-3] w/ restbase203[6-8].

Icinga downtime and Alertmanager silence (ID=38479039-f507-4251-8172-d1957f1540a8) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommissioning — T380236

restbase2023.codfw.wmnet
Thu, Nov 21, 12:42 AM · Cassandra
ops-monitoring-bot added a comment to T380236: Refresh restbase202[1-3] w/ restbase203[6-8].

Icinga downtime and Alertmanager silence (ID=941362f7-c8d8-42d0-8eec-c2f1f00b7709) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommissioning — T380236

restbase2022.codfw.wmnet
Thu, Nov 21, 12:42 AM · Cassandra
ops-monitoring-bot added a comment to T380236: Refresh restbase202[1-3] w/ restbase203[6-8].

Icinga downtime and Alertmanager silence (ID=aa62de12-5de6-4c95-aca3-db5cc67a1e73) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommissioning — T380236

restbase2021.codfw.wmnet
Thu, Nov 21, 12:42 AM · Cassandra

Wed, Nov 20

ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye completed:

  • thanos-be2005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411202227_jhathaway_2560287_thanos-be2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 10:50 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Wed, Nov 20, 10:12 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye completed:

  • thanos-be2005 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411202147_jhathaway_2554797_thanos-be2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Wed, Nov 20, 10:11 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Wed, Nov 20, 9:32 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Wed, Nov 20, 9:31 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
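
Each completed cookbook comment carries one summary line of the form `• <host> (PASS|WARN|FAIL)`, followed by the individual steps. When a host is retried several times, as thanos-be2005 is in this feed, tallying those summary lines gives a quick view of the attempts. The following is a rough sketch that assumes the comment text has already been extracted into plain lines; the sample lines are taken from entries in this feed, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches only the per-host summary bullet, not the step bullets beneath it.
STATUS_LINE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|WARN|FAIL)\)\s*$")


def tally(lines):
    """Count (host, status) pairs found in cookbook summary lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "  • thanos-be2005 (FAIL)",
    "    • Removed from Debmonitor if present",
    "  • thanos-be2005 (WARN)",
    "  • aux-k8s-ctrl2003 (PASS)",
    "  • thanos-be2005 (FAIL)",
]

for (host, status), count in sorted(tally(sample).items()):
    print(host, status, count)
```
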
ops-monitoring-bot added a comment to T378986: codfw: (2x) aux-k8s-ctrl nodes.

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm completed:

  • aux-k8s-ctrl2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411202051_herron_3604007_aux-k8s-ctrl2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 9:05 PM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Wed, Nov 20, 9:03 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Wed, Nov 20, 9:00 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T378146: Q2:rack/setup/install es204[1-6].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed with errors:

  • es2041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console es2041.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Wed, Nov 20, 8:47 PM · Data-Persistence-SRE, DBA, Patch-For-Review, SRE, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T378986: codfw: (2x) aux-k8s-ctrl nodes.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm

Wed, Nov 20, 8:32 PM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Wed, Nov 20, 8:30 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Wed, Nov 20, 8:30 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Wed, Nov 20, 8:08 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye executed with errors:

  • thanos-be2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console thanos-be2005.codfw.wmnet" to get a root shell, but depending on the failure this may not work.
Wed, Nov 20, 8:05 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T378146: Q2:rack/setup/install es204[1-6].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm

Wed, Nov 20, 8:03 PM · Data-Persistence-SRE, DBA, Patch-For-Review, SRE, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye

Wed, Nov 20, 7:47 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T378986: codfw: (2x) aux-k8s-ctrl nodes.

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm completed:

  • aux-k8s-ctrl2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201920_herron_3591728_aux-k8s-ctrl2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 7:35 PM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T378986: codfw: (2x) aux-k8s-ctrl nodes.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm

Wed, Nov 20, 7:04 PM · vm-requests, SRE, Kubernetes
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool for host wikikube-worker[2136-2139,2141-2155].codfw.wmnet completed:

  • wikikube-worker[2136-2139,2141-2155].codfw.wmnet (PASS)
    • Host wikikube-worker[2136-2139,2141-2155].codfw.wmnet pooled in wikikube-codfw
Wed, Nov 20, 1:57 PM · serviceops
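
Host sets in these comments use a compact bracket notation, for example `wikikube-worker[2136-2139,2141-2155].codfw.wmnet`. As a minimal illustration of what that notation denotes, the sketch below expands a single bracket group into individual host names. It is a standalone helper written for this example, not the routine the cookbooks use, and it only handles one numeric bracket group per expression.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand one bracketed numeric range set such as 'host[1-3,5].example'."""
    match = re.fullmatch(r"([^\[\]]*)\[([0-9,\-]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat as a single host name
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        width = len(start)  # preserve zero padding if the range uses it
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


hosts = expand_hosts("wikikube-worker[2136-2139,2141-2155].codfw.wmnet")
print(len(hosts), hosts[0], hosts[-1])
```

Note that the range skips 2140, so the expression covers 19 hosts rather than 20.
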
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

pool host wikikube-worker[2136-2139,2141-2155].codfw.wmnet by cgoubert@cumin1002 with reason: None

Wed, Nov 20, 1:57 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2151.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2151 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201321_cgoubert_3531896_wikikube-worker2151.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 1:41 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2152.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2152 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201317_cgoubert_3531981_wikikube-worker2152.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 1:38 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2154.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2154 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201314_cgoubert_3532146_wikikube-worker2154.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 1:33 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2155.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2155 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201311_cgoubert_3532242_wikikube-worker2155.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 1:33 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2153.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2153 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201307_cgoubert_3532062_wikikube-worker2153.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 1:26 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2150.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2150 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201303_cgoubert_3531839_wikikube-worker2150.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 1:23 PM · serviceops
ops-monitoring-bot added a comment to T376737: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged).

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp7007.magru.wmnet with OS bullseye completed:

  • cp7007 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201250_sukhe_2463690_cp7007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 1:17 PM · Patch-For-Review, SRE, Traffic, ops-magru
ops-monitoring-bot added a comment to T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022.

Draining ganeti1017.eqiad.wmnet of running VMs

Wed, Nov 20, 1:01 PM · Ganeti, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022.

Draining ganeti1017.eqiad.wmnet of running VMs

Wed, Nov 20, 12:49 PM · Ganeti, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2155.codfw.wmnet with OS bookworm

Wed, Nov 20, 12:44 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2154.codfw.wmnet with OS bookworm

Wed, Nov 20, 12:43 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2153.codfw.wmnet with OS bookworm

Wed, Nov 20, 12:42 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2152.codfw.wmnet with OS bookworm

Wed, Nov 20, 12:42 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2151.codfw.wmnet with OS bookworm

Wed, Nov 20, 12:41 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2143.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2143 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201222_cgoubert_3509827_wikikube-worker2143.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 12:41 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2150.codfw.wmnet with OS bookworm

Wed, Nov 20, 12:41 PM · serviceops
ops-monitoring-bot added a comment to T377028: wikikube-worker21[36-55] implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2146.codfw.wmnet with OS bookworm completed:

  • wikikube-worker2146 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411201218_cgoubert_3510059_wikikube-worker2146.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Nov 20, 12:39 PM · serviceops
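
The completion comments above share one shape: a host line carrying a status in parentheses (PASS or WARN in these entries), followed by one bullet per step. Below is a minimal sketch of how such a comment could be scanned for hosts that need a follow-up. It is not part of the reimage cookbook or of Phabricator. The names `ReimageResult`, `parse_reimage_comment` and `problem_steps` are hypothetical, and the line patterns and the "Unable to" prefix are assumptions inferred only from the entries shown here.

```python
import re
from dataclasses import dataclass, field

# Assumed host line shape, inferred from the feed: "  • wikikube-worker2146 (WARN)".
HOST_LINE = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")
# Any other bulleted line is treated as a step of the most recent host.
STEP_LINE = re.compile(r"^\s*•\s+(?P<step>.+?)\s*$")


@dataclass
class ReimageResult:
    """Hypothetical container for one host's outcome in a completion comment."""

    host: str
    status: str
    steps: list[str] = field(default_factory=list)

    def problem_steps(self) -> list[str]:
        # Assumption: failed steps start with "Unable to", as in the WARN entry above.
        return [step for step in self.steps if step.startswith("Unable to")]


def parse_reimage_comment(text: str) -> list[ReimageResult]:
    """Split a completion comment into per-host results; non-bullet lines are ignored."""
    results: list[ReimageResult] = []
    for line in text.splitlines():
        host_match = HOST_LINE.match(line)
        if host_match:
            results.append(ReimageResult(host_match["host"], host_match["status"]))
            continue
        step_match = STEP_LINE.match(line)
        if step_match and results:
            results[-1].steps.append(step_match["step"])
    return results
```

Given the text of a comment, `[r for r in parse_reimage_comment(text) if r.status != "PASS"]` would list the hosts to look at, and `problem_steps()` would narrow each one down to the steps that did not complete.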