Cloudvirt1013 server has spontaneously shut down twice in the last week:
- On 2019-12-22 at about 09:30 UTC, restarted by iLO
- On 2019-12-27 at 10:45 UTC, restarted by me at the mgmt console in response to a page
| Status | Assigned | Task |
|---|---|---|
| Resolved | Cmjohnson | T138509 rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014) |
| Resolved | Jclark-ctr | T241313 cloudvirt1013: server down for no reason (power issue?) |
Mentioned in SAL (#wikimedia-cloud) [2019-12-22T09:45:10Z] <arturo> cloudvirt1013 is back (did it alone) T241313
I'm not sure it's true that I couldn't reach the iLO. I was running install_console against cloudvirt1013.eqiad.wmnet instead of cloudvirt1013.mgmt.eqiad.wmnet.
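For anyone hitting the same thing later, a minimal sketch of the difference (hostnames are the ones from this task; that install_console wraps an SSH session to the management controller is an assumption here):

```
# The iLO lives on the .mgmt interface, not on the host's primary interface.
ssh root@cloudvirt1013.mgmt.eqiad.wmnet   # reaches the iLO even when the host is down
ssh cloudvirt1013.eqiad.wmnet             # reaches the OS only while the host is up

# install_console also wants the mgmt FQDN (assumption: it opens a console
# session via the management controller):
install_console cloudvirt1013.mgmt.eqiad.wmnet
```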
Latest events in the iLO console:
```
</>hpiLO-> show /system1/log1

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:05 2019

/system1/log1
  Targets
    record1  record2  record3  record4  record5  record6  record7  record8  record9
    record10 record11 record12 record13 record14 record15 record16 record17 record18
    record19 record20 record21 record22 record23 record24 record25 record26 record27
  Properties
  Verbs
    cd version exit show delete

</>hpiLO-> show /system1/log1/record27

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:30 2019

/system1/log1/record27
  Targets
  Properties
    number=27
    severity=Critical
    date=12/22/2019
    time=09:38
    description=ASR Detected by System ROM
  Verbs
    cd version exit show

</>hpiLO-> show /system1/log1/record26

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:40 2019

/system1/log1/record26
  Targets
  Properties
    number=26
    severity=Caution
    date=12/22/2019
    time=04:24
    description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1, service information: 0x07). Action: 1. Remove additional devices. 2. Consult server troubleshooting guide. 3. Gather AHS log and contact Support
  Verbs
    cd version exit show

</>hpiLO-> show /system1/log1/record25

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:09:06 2019

/system1/log1/record25
  Targets
  Properties
    number=25
    severity=Repaired
    date=10/24/2019
    time=07:10
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show
```
Change 560837 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud-vps: depool cloudvirt1013, pool cloudvirt1024
Change 560837 merged by Andrew Bogott:
[operations/puppet@production] cloud-vps: depool cloudvirt1013, pool cloudvirt1024
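Independently of the puppet change above, a hypervisor can also be taken out of scheduling directly in Nova; a sketch of that, for reference (the depool here was actually done via the patch):

```
# Sketch, assuming admin credentials for the eqiad1 OpenStack deployment.
# Stop the scheduler from placing new VMs on the failing hypervisor:
openstack compute service set --disable \
    --disable-reason "T241313: spontaneous shutdowns" \
    cloudvirt1013 nova-compute

# Re-enable it once the hardware is fixed:
openstack compute service set --enable cloudvirt1013 nova-compute
```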
Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:07:48Z] <andrewbogott> migrating cyberbot-db-01 to cloudvirt1009 in response to T241313
Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:12:59Z] <andrewbogott> migrating osmit-test to cloudvirt1009 in response to T241313
Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:13:24Z] <andrewbogott> migrating deployment-aqs03 to cloudvirt1009 in response to T241313
I've drained all VMs off this server and put it in downtime until March 1st for investigation or repair. I don't have any good ideas about how to repair it.
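The drain itself is per-VM live migration; roughly like this (a sketch using the standard OpenStack CLI rather than whatever WMCS wrapper was actually used):

```
# List instances still on the failing hypervisor:
openstack server list --all-projects --host cloudvirt1013 -c ID -c Name

# Live-migrate one of them to a healthy hypervisor (flag syntax varies between
# client versions; this is the older form):
openstack server migrate --live cloudvirt1009 <instance-uuid>
```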
cloudvirt1013, cloudvirt1014, and cloudvirt1023 are the only cloudvirts running Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11).
cloudvirt1023 is held back as a spare, so not under load.
The kernel is probably unrelated: they're running that newer kernel because of the post-crash reboot, and were running the standard kernel before that.
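(Trivial, but for completeness: checking the running kernel versus the kernels installed on a Debian host looks like this.)

```
uname -r                                              # kernel the host is currently running
dpkg -l 'linux-image-*' | awk '/^ii/ {print $2, $3}'  # kernel packages installed on disk
```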
Note to DCOps: this system is already drained of VMs. We may simply need to set downtime on it so it can be shut down for troubleshooting.
Mentioned in SAL (#wikimedia-cloud) [2020-01-23T20:17:52Z] <jeh> cloudvirt1013 set icinga downtime and powering down for hardware maintenance T241313
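Power control from the iLO CLI, for reference (same shell as the log excerpts above; a sketch based on the standard iLO commands):

```
</>hpiLO-> power            # show current power state
</>hpiLO-> power off        # graceful power-off ("power off hard" forces it)
</>hpiLO-> power on         # bring the server back up after maintenance
```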
313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x400
Action: restart system

Needs a replacement BBU. @wiki_willy, can we order a new one?
Sure, no problem @Jclark-ctr. I've opened a procurement task (T243547) for @RobH to order a replacement BBU. Thanks, Willy
Replaced the BBU; no errors at this time. Closing procurement task T243547 as it is not needed at this time.
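For future reference, the Smart Storage Battery / backup power state can also be checked from the OS (a sketch, assuming HPE's ssacli utility is installed on the host):

```
# Controller status includes the battery/capacitor ("backup power source") state:
ssacli controller all show status

# More detail for a specific controller (slot number is an assumption):
ssacli controller slot=0 show detail | grep -i -A2 battery
```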
Change 567024 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: eqiad1: repool cloudvirt1013
Mentioned in SAL (#wikimedia-cloud) [2020-01-24T12:52:52Z] <arturo> repooling cloudvirt1013 after HW got fixed (T241313)
Change 567024 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: eqiad1: repool cloudvirt1013
Mentioned in SAL (#wikimedia-cloud) [2020-01-24T15:10:53Z] <jeh> remove icinga downtime for cloudvirt1013 T241313