Q#:rack/setup/install dbproxy200[5-8]
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of dbproxy200[5-8]

Hostname / Racking / Installation Details

Hostnames: dbproxy2005, dbproxy2006, dbproxy2007, dbproxy2008
Racking Proposal: One per rack and if possible one per row.
Networking Setup: # of Connections:1/2 - Speed:1G/10G. - VLAN:Private/Public/Other(Specify) : AAAA records:N, Additional IP records (Cassandra)? Yes/No
Partitioning/Raid: HW Raid: N, Partman recipe and/or desired Raid Level: @Marostegui will take care of this
OS Distro: Bookworm
Sub-team Technical Contact: @Marostegui
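
The bracket notation in the task title (dbproxy200[5-8]) is shorthand for the four hostnames listed above. As a minimal sketch of how such a range could be expanded in a script, assuming a single numeric bracket range per pattern (the function name is hypothetical and not part of any Wikimedia tooling):

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand one numeric bracket range, e.g. 'dbproxy200[5-8]'.

    Patterns without a bracket range are returned unchanged in a list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


print(expand_host_range("dbproxy200[5-8]"))
# ['dbproxy2005', 'dbproxy2006', 'dbproxy2007', 'dbproxy2008']
```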

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

dbproxy2005: xe-0/0/40
  • Receive in system on procurement task T361352 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
dbproxy2006:
  • Receive in system on procurement task T361352 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
dbproxy2007: xe-7/0/37
  • Receive in system on procurement task T361352 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
dbproxy2008: xe-4/0/17
  • Receive in system on procurement task T361352 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
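
The per-host checklist above is the same ordered sequence for every host, with the DNS and provision cookbooks following the Netbox script. A minimal sketch of tracking progress through it is shown here. It assumes the steps are followed strictly in the listed order, which the checklist implies but only states explicitly for the DNS and provision steps. The short labels for the non-cookbook steps and the function name are hypothetical.

```python
from typing import Optional

# Step order copied from the per-host checklist. Cookbook names come from the
# task; the labels for the other steps are hypothetical shorthand.
CHECKLIST = [
    "receive in system",
    "rack and update Netbox",
    "Netbox provision script",
    "sre.dns.netbox",
    "sre.hosts.provision",
    "sre.hardware.upgrade-firmware",
    "update operations/puppet",
    "sre.hosts.reimage",
]


def next_step(done: list[str]) -> Optional[str]:
    """Return the first checklist step not yet done, or None when finished.

    Raises ValueError if a later step is marked done while an earlier one is
    still open (assumption: the checklist is followed strictly in order).
    """
    completed = set(done)
    unknown = completed - set(CHECKLIST)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    pending = [step for step in CHECKLIST if step not in completed]
    if not pending:
        return None
    first_open = CHECKLIST.index(pending[0])
    ran_early = [s for s in CHECKLIST[first_open + 1:] if s in completed]
    if ran_early:
        raise ValueError(f"{pending[0]!r} is open but {ran_early} already ran")
    return pending[0]


print(next_step(["receive in system", "rack and update Netbox", "Netbox provision script"]))
```

With the first three steps done, the sketch reports sre.dns.netbox as the next step, matching the note that the DNS cookbook must follow the Netbox script.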

Event Timeline

RobH mentioned this in Unknown Object (Task). Apr 17 2024, 8:21 PM
RobH added a parent task: Unknown Object (Task).

@Jhancock.wm reminder, we do not need AAAA records on these hosts.

Change #1042820 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: New dbproxies

https://gerrit.wikimedia.org/r/1042820

Change #1042820 abandoned by Marostegui:

[operations/puppet@production] site.pp: New dbproxies

Reason:

vim did something weird

https://gerrit.wikimedia.org/r/1042820

Change #1042822 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: New dbproxy hosts

https://gerrit.wikimedia.org/r/1042822

Change #1042822 merged by Marostegui:

[operations/puppet@production] site.pp: New dbproxy hosts

https://gerrit.wikimedia.org/r/1042822

@Marostegui thank you for the reminder. I will most likely be getting this racked on Friday. Also, thank you for updating the puppet files!

@Marostegui was there a preference for 1G or 10G on these servers?

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm completed:

  • dbproxy2005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407111909_pt1979_2208237_dbproxy2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
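
The first Puppet run is logged under a file name that appears to encode a start timestamp, the user, a numeric identifier, and the host. A minimal sketch of splitting such a name, assuming that field layout (the meaning of the numeric field is a guess, so it is exposed here under the hypothetical key "run_id"):

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log_name(path: str) -> dict:
    """Split a reimage log path into its underscore-separated fields.

    Assumed layout: <YYYYmmddHHMM>_<user>_<number>_<host>.out
    """
    stem = PurePosixPath(path).stem
    stamp, user, number, host = stem.split("_", 3)
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": int(number),  # hypothetical name, meaning not confirmed
        "host": host,
    }


info = parse_reimage_log_name(
    "/var/log/spicerack/sre/hosts/reimage/"
    "202407111909_pt1979_2208237_dbproxy2005.out"
)
print(info["started"].isoformat(), info["user"], info["host"])
```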

@Jhancock.wm I think you missed @Marostegui's comment about not setting IPv6 on those hosts. I fixed it.

@Marostegui as we discussed this morning, I was able to install dbproxy2005 with the workaround of using the 1G NIC for the install and switching to 10G afterwards. Please check that everything looks good on dbproxy2005 so I can proceed with the others.

Thanks

@Papaul I cannot access the host via ssh remotely, but the host is up and has network. I've connected via supermicro idrac and I think it is related to DNS:

root@dbproxy2005:~# host dbproxy2005.codfw.wmnet
Host dbproxy2005.codfw.wmnet not found: 3(NXDOMAIN)


root@dbproxy2005:~# ip addre
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: ens1f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7c:c2:55:97:5c:ce brd ff:ff:ff:ff:ff:ff
    altname enp138s0f0np0
    inet 10.192.23.11/24 brd 10.192.23.255 scope global ens1f0np0
       valid_lft forever preferred_lft forever
    inet6 2620:0:860:113:10:192:23:11/64 scope global
       valid_lft 2591992sec preferred_lft 604792sec
    inet6 fe80::7ec2:55ff:fe97:5cce/64 scope link
       valid_lft forever preferred_lft forever
3: ens1f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7c:c2:55:97:5c:cf brd ff:ff:ff:ff:ff:ff
    altname enp138s0f1np1
4: enxbe3af2b6059f: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether be:3a:f2:b6:05:9f brd ff:ff:ff:ff:ff:ff
5: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7c:c2:55:ac:1a:56 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f0
6: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7c:c2:55:ac:1a:57 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f1


root@dbproxy2005:~# ping cumin2002.codfw.wmnet
PING cumin2002.codfw.wmnet(cumin2002.codfw.wmnet (2620:0:860:103:10:192:32:49)) 56 data bytes
64 bytes from cumin2002.codfw.wmnet (2620:0:860:103:10:192:32:49): icmp_seq=1 ttl=60 time=0.340 ms
64 bytes from cumin2002.codfw.wmnet (2620:0:860:103:10:192:32:49): icmp_seq=2 ttl=60 time=0.385 ms

dbproxy2006 temp 1G -> B7 lsw port 47
dbproxy2007 temp 1G -> C7 asw port 43
dbproxy2008 temp 1G -> D4 asw port 43

@Marostegui thank you for checking. You are right, it looks like the host still has its IPv6 address, or we removed it in Netbox only after the re-image.

2: ens1f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7c:c2:55:97:5c:ce brd ff:ff:ff:ff:ff:ff
    altname enp138s0f0np0
    inet 10.192.23.11/24 brd 10.192.23.255 scope global ens1f0np0
       valid_lft forever preferred_lft forever
    inet6 2620:0:860:113:10:192:23:11/64 scope global
       valid_lft 2591992sec preferred_lft 604792sec
    inet6 fe80::7ec2:55ff:fe97:5cce/64 scope link
       valid_lft forever preferred_lft forever

I will try to re-image it again, but in the meantime can you try to remove that IPv6 entry and reboot the server?
Thank you
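
The check done by eye on the ip output quoted above (looking for a global-scope inet6 address on a host that should have none) could be scripted. A minimal sketch that scans saved `ip addr` text follows; the function name is hypothetical, and it assumes the usual iproute2 layout of numbered interface headers followed by indented address lines:

```python
import re


def global_ipv6_addresses(ip_addr_output: str) -> list[tuple[str, str]]:
    """Return (interface, address) pairs for each global-scope inet6 entry."""
    found = []
    current_iface = None
    for line in ip_addr_output.splitlines():
        # Interface header, e.g. "2: ens1f0np0: <BROADCAST,...> mtu 1500 ..."
        header = re.match(r"^\d+:\s+([^:\s]+):", line)
        if header:
            current_iface = header.group(1)
            continue
        # Indented address line, e.g. "    inet6 2620:.../64 scope global"
        addr = re.match(r"^\s+inet6\s+(\S+)\s+scope\s+global", line)
        if addr and current_iface is not None:
            found.append((current_iface, addr.group(1)))
    return found


sample = """\
2: ens1f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7c:c2:55:97:5c:ce brd ff:ff:ff:ff:ff:ff
    inet 10.192.23.11/24 brd 10.192.23.255 scope global ens1f0np0
    inet6 2620:0:860:113:10:192:23:11/64 scope global
    inet6 fe80::7ec2:55ff:fe97:5cce/64 scope link
"""
print(global_ipv6_addresses(sample))
# [('ens1f0np0', '2620:0:860:113:10:192:23:11/64')]
```

Link-local (fe80::) and loopback entries are skipped on purpose, since only the global address is relevant to the "no AAAA records" requirement.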

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host dbproxy2005.codfw.wmnet with OS bookworm

@Papaul the interface wasn't in netbox anymore, but the DNS entry for that host is still gone.
I've tried to reimage the host but it gets stuck on the PXE boot and keeps rebooting upon timing out waiting for the boot:

Intel(R) Boot Agent XE (X550) v2.4.45
Copyright (C) 1997-2019, Intel Corporation

PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Intel Boot Agent.


Broadcom UNDI PXE-2.1 v226.0.135.0
Copyright (C) 2000-2023 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: 7C C2 55 97 5C CE  GUID: 07663200-04E3-11EF-8000-7CC255AC1A56
CLIENT IP: 10.192.23.11  MASK: 255.255.255.0  DHCP IP: 208.80.153.105
GATEWAY IP: 10.192.23.1

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
.................

Just talked to @Papaul - the reimage was expected to fail since the iface was moved back to the 10G one.
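
The console capture above contains two Intel Boot Agent status codes (PXE-E61 and PXE-M0F) before the Broadcom adapter obtains its DHCP lease. As a rough sketch, such codes could be pulled out of saved console text like this; the function name is hypothetical, and the pattern only assumes codes of the form PXE-E or PXE-M followed by two hexadecimal digits:

```python
import re

# Matches lines such as "PXE-E61: Media test failure, check cable".
PXE_CODE = re.compile(r"(PXE-[EM][0-9A-F]{2}):\s*(.*)")


def pxe_status_codes(console_text: str) -> list[tuple[str, str]]:
    """Return (code, message) pairs found in PXE console output."""
    return [
        (m.group(1), m.group(2).strip())
        for m in PXE_CODE.finditer(console_text)
    ]


console = """\
Intel(R) Boot Agent XE (X550) v2.4.45
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Intel Boot Agent.
Broadcom UNDI PXE-2.1 v226.0.135.0
"""
for code, message in pxe_status_codes(console):
    print(code, "-", message)
```

The version string "PXE-2.1" in the Broadcom banner is not matched, because the pattern requires an E or M class letter after the dash.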

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm completed:

  • dbproxy2005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407151358_pt1979_1933053_dbproxy2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

@Papaul dbproxy2005 looks good now: no IPv6, and I can reach it just fine. If you want to move it back to 10G, that's great, and if you want to reimage the other hosts, that'd be great too.

Thanks!

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2006.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2006.codfw.wmnet with OS bookworm completed:

  • dbproxy2006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407161656_pt1979_3112148_dbproxy2006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm executed with errors:

  • dbproxy2007 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbproxy2007.codfw.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors:

  • dbproxy2008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbproxy2008.codfw.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors:

  • dbproxy2008 (FAIL)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbproxy2008.codfw.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors:

  • dbproxy2008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbproxy2008.codfw.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm executed with errors:

  • dbproxy2008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbproxy2008.codfw.wmnet to get a root shell but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2008.codfw.wmnet with OS bookworm completed:

  • dbproxy2008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407171403_pt1979_4028800_dbproxy2008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2007.codfw.wmnet with OS bookworm completed:

  • dbproxy2007 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407171503_pt1979_4076091_dbproxy2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Papaul updated the task description.

@Marostegui this is ready for you. All the servers are on 10G. I hope we fix this PXE boot issue on 10G before we think about re-imaging those servers; otherwise we will have to put them back on 1G again.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host dbproxy2005.codfw.wmnet with OS bookworm executed with errors:

  • dbproxy2005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" dbproxy2005.codfw.wmnet to get a root shell but depending on the failure this may not work.

The above was me aborting the leftover execution of the cookbook, which had been left waiting for user input.
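
The cookbook summaries posted throughout this task each carry a "host (PASS)" or "host (FAIL)" line. A minimal sketch of tallying those outcomes per host from a saved copy of the timeline text is given below. The function name is hypothetical, and the pattern assumes the bullet layout shown in this task rather than any documented cookbook output format.

```python
import re
from collections import defaultdict

# Matches summary lines such as "  • dbproxy2008 (FAIL)".
RESULT_LINE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL)\)\s*$")


def tally_reimage_results(timeline_text: str) -> dict[str, list[str]]:
    """Map each host to its reimage outcomes, in the order they appear."""
    results = defaultdict(list)
    for line in timeline_text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            results[match.group(1)].append(match.group(2))
    return dict(results)


excerpt = """\
  • dbproxy2007 (FAIL)
    • Forced PXE for next reboot
  • dbproxy2008 (FAIL)
  • dbproxy2008 (PASS)
  • dbproxy2007 (PASS)
"""
print(tally_reimage_results(excerpt))
# {'dbproxy2007': ['FAIL', 'PASS'], 'dbproxy2008': ['FAIL', 'PASS']}
```

A tally like this does not distinguish real failures from manually aborted runs, such as the last dbproxy2005 entry, so the surrounding comments still need to be read.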