Degraded RAID on puppetmaster1003
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host puppetmaster1003. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 2, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sda2[2](F) sdb2[1]
      185469952 blocks super 1.2 [2/1] [_U]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md0 : active raid1 sda1[2](F) sdb1[1]
      48793600 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
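The snapshot above can be read mechanically. The following is a minimal sketch, not the actual `get-raid-status-md` plugin: it assumes the `/proc/mdstat` layout shown in this task (active arrays only), and the names `parse_mdstat` and `SAMPLE` are hypothetical. A member flagged `(F)` is treated as failed, and an underscore in the `[_U]` map marks the array as degraded.

```python
import re

# Snapshot copied from the task description (hypothetical constant name).
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[2](F) sdb2[1]
      185469952 blocks super 1.2 [2/1] [_U]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md0 : active raid1 sda1[2](F) sdb1[1]
      48793600 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"failed": [...], "expected": n, "active": m, "degraded": bool}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Header line, e.g. "md1 : active raid1 sda2[2](F) sdb2[1]"
        header = re.match(r"^(md\d+) : (\S+) (\S+) (.*)$", line)
        if header:
            members = header.group(4).split()
            failed = [m.split("[")[0] for m in members if m.endswith("(F)")]
            current = {"failed": failed, "expected": None,
                       "active": None, "degraded": False}
            arrays[header.group(1)] = current
            continue
        # Status line, e.g. "185469952 blocks super 1.2 [2/1] [_U]"
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            current["degraded"] = "_" in status.group(3)
    return arrays


status = parse_mdstat(SAMPLE)
degraded = [name for name, info in sorted(status.items()) if info["degraded"]]
print(degraded)                 # ['md0', 'md1']
print(status["md1"]["failed"])  # ['sda2']
```

Both arrays report one active member out of two, and in both the failed member is a partition of sda.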

Event Timeline

Restricted Application added a subscriber: ABran-WMF.

Change #1073321 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove puppetmaster1003 from active Puppet 5 servers

https://gerrit.wikimedia.org/r/1073321

Change #1073321 merged by Muehlenhoff:

[operations/puppet@production] Remove puppetmaster1003 from active Puppet 5 servers

https://gerrit.wikimedia.org/r/1073321

Device: /dev/sda
ID_SERIAL=SSDSC2KG240G7R_PHYM812600ZH240AGN
ID_SERIAL_SHORT=PHYM812600ZH240AGN
ID_PATH=pci-0000:00:11.5-ata-3
ID_PATH_TAG=pci-0000_00_11_5-ata-3
Device: /dev/sdb
ID_SERIAL=SSDSC2KG240G7R_PHYM812601AH240AGN
ID_SERIAL_SHORT=PHYM812601AH240AGN
ID_PATH=pci-0000:00:11.5-ata-4
ID_PATH_TAG=pci-0000_00_11_5-ata-4
Jclark-ctr added a subscriber: Muehlenhoff.

Checking the hardware inventory confirmed that the failed sda is in slot 0. @Muehlenhoff, would you be a good point of contact for rebuilding the array once the drive is replaced?

Disk 0 on Embedded AHCI Controller 1

BlockSizeInBytes: 512 Bytes
BusProtocol: SATA
Connector: 2
DeviceDescription: Disk 0 on Embedded AHCI Controller 1
DeviceType: PhysicalDisk
DriveFormFactor: 2.5 inch
FQDD: Disk.Direct.0-0:AHCI.Embedded.1-1
FreeSizeInBytes: 0 Bytes
HotSpareStatus: No
InstanceID: Disk.Direct.0-0:AHCI.Embedded.1-1
LastSystemInventoryTime: 2018-05-30T23:37:53
LastUpdateTime: 2018-05-30T22:48:01
Manufacturer: INTEL
ManufacturingDay: 31
ManufacturingWeek: 12
ManufacturingYear: 2008
MaxCapableSpeed: 6Gbs
MediaType: Solid State Drive
Model: SSDSC2KG240G7R
OperationName: None
OperationPercentComplete: 0%
PPID: TW0V6YD5ITT0083V05H4A02
PredictiveFailureState: Smart Alert Absent
PrimaryStatus: Unknown
RaidStatus: Non-RAID
RAIDType: Unknown
RemainingRatedWriteEndurance: Unknown
Revision: SCV1
RollupStatus: Unknown
SASAddress: 55CD2E414F2786AA
SecurityState: Not Capable
SerialNumber: PHYM812500AP240AGN
SizeInBytes: 240057409536 Bytes
Slot: 0
SystemEraseCapability: 2
T10PICapability: Not supported
UsedSizeInBytes: 0 Bytes

Disk 1 on Embedded AHCI Controller 1

BlockSizeInBytes: 512 Bytes
BusProtocol: SATA
Connector: 3
DeviceDescription: Disk 1 on Embedded AHCI Controller 1
DeviceType: PhysicalDisk
DriveFormFactor: 2.5 inch
FQDD: Disk.Direct.1-1:AHCI.Embedded.1-1
FreeSizeInBytes: 0 Bytes
HotSpareStatus: No
InstanceID: Disk.Direct.1-1:AHCI.Embedded.1-1
LastSystemInventoryTime: 2018-05-30T23:37:53
LastUpdateTime: 2018-05-30T22:48:01
Manufacturer: INTEL
ManufacturingDay: 31
ManufacturingWeek: 12
ManufacturingYear: 2008
MaxCapableSpeed: 6Gbs
MediaType: Solid State Drive
Model: SSDSC2KG240G7R
OperationName: None
OperationPercentComplete: 0%
PPID: TW0V6YD5ITT0083V05JAA02
PredictiveFailureState: Smart Alert Absent
PrimaryStatus: Unknown
RaidStatus: Non-RAID
RAIDType: Unknown
RemainingRatedWriteEndurance: Unknown
Revision: SCV1
RollupStatus: Unknown
SASAddress: 55CD2E414F27873A
SecurityState: Not Capable
SerialNumber: PHYM812601AH240AGN
SizeInBytes: 240057409536 Bytes
Slot: 1
SystemEraseCapability: 2
T10PICapability: Not supported
UsedSizeInBytes: 0 Bytes

@VRiley-WMF It looks like you replaced the drive in T373888 with @MoritzMuehlenhoff, but the iDRAC is still listing the old s/n. From the iDRAC hardware log:

		2024-09-05 10:48:59 	USR0032 	The session for root from 10.64.48.98 using GUI is logged off.	
		2024-09-05 10:31:45 	PDR1000 	Drive 0 is installed in disk drive bay 1.	
		2024-09-05 10:31:35 	PDR1016 	Drive 0 is removed from disk drive bay 1.
This comment was removed by Jclark-ctr.

@Jclark-ctr Hi, indeed the drive was swapped last week. I took puppetmaster1003 out of active service this morning, so if we have another disk from a decom host that we can swap in, it can be replaced at any time. If the error repeats, I guess that would rather indicate an issue with the mainboard, the disk slot or similar? Then we'd most likely need to decom it (and find some other fallback, since we cannot get rid of the legacy Puppet 5 servers for another 3-6 months).

@MoritzMuehlenhoff I did notice that the iDRAC held onto the previous drive's serial number in its inventory. We have recently had a number of issues with mdadm RAIDs failing right after a drive swap; aqs1013, aqs1014 and prometheus1008 are examples of servers we are having ongoing issues with. I am updating the iDRAC firmware from 3.18 to 7.0.0.0, but it still seems to hold onto the original hard drive serial number.

@MoritzMuehlenhoff sda has been replaced again.

New Disk

Device: /dev/sda
ID_SERIAL=SSDSC2KG240G7R_PHYM812600BL240AGN
ID_SERIAL_SHORT=PHYM812600BL240AGN
ID_PATH=pci-0000:00:11.5-ata-3
ID_PATH_TAG=pci-0000_00_11_5-ata-3
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb2[1]
      185469952 blocks super 1.2 [2/1] [_U]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md0 : active raid1 sdb1[1]
      48793600 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[Tue Sep 17 21:47:36 2024] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[Tue Sep 17 21:47:36 2024] sd 2:0:0:0: Attached scsi generic sg0 type 0
[Tue Sep 17 21:47:36 2024] sd 2:0:0:0: [sda] Attached SCSI disk

The hardware inventory in the iDRAC still lists the old serial number, even after an iDRAC reboot:
SerialNumber: PHYM812500AP240AGN
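The mismatch described here can be checked by comparing the serial the OS reports with the one in the iDRAC inventory. This is only an illustrative sketch: the helper names `parse_udev_serials` and `stale_inventory` are hypothetical, the inventory values are typed in from this task rather than queried from an iDRAC, and the slot 1 to /dev/sdb mapping is an assumption (only slot 0 = sda is stated in the task).

```python
# udev properties as pasted in this task after the second replacement.
UDEV_TEXT = """\
Device: /dev/sda
ID_SERIAL=SSDSC2KG240G7R_PHYM812600BL240AGN
ID_SERIAL_SHORT=PHYM812600BL240AGN
Device: /dev/sdb
ID_SERIAL=SSDSC2KG240G7R_PHYM812601AH240AGN
ID_SERIAL_SHORT=PHYM812601AH240AGN
"""

# Serial numbers from the iDRAC inventory in this task.
# Slot 0 -> /dev/sda is stated in the task; slot 1 -> /dev/sdb is assumed.
IDRAC_SERIALS = {
    "/dev/sda": "PHYM812500AP240AGN",
    "/dev/sdb": "PHYM812601AH240AGN",
}


def parse_udev_serials(text):
    """Map device path -> ID_SERIAL_SHORT from 'Device:' / 'KEY=value' blocks."""
    serials = {}
    device = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Device:"):
            device = line.split(":", 1)[1].strip()
        elif line.startswith("ID_SERIAL_SHORT=") and device:
            serials[device] = line.split("=", 1)[1]
    return serials


def stale_inventory(os_serials, inventory_serials):
    """Return {device: (os_serial, inventory_serial)} where the two differ."""
    return {
        dev: (serial, inventory_serials[dev])
        for dev, serial in os_serials.items()
        if dev in inventory_serials and serial != inventory_serials[dev]
    }


os_serials = parse_udev_serials(UDEV_TEXT)
mismatches = stale_inventory(os_serials, IDRAC_SERIALS)
print(mismatches)
# {'/dev/sda': ('PHYM812600BL240AGN', 'PHYM812500AP240AGN')}
```

Only sda is flagged; the inventory entry for sdb matches what the OS sees.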

Great, many thanks! I'll rebuild the RAID and then I'll add the server back to active duty. Hopefully it works now for longer than a week :-)

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[2] sdb2[1]
      185469952 blocks super 1.2 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md0 : active raid1 sda1[2] sdb1[1]
      48793600 blocks super 1.2 [2/2] [UU]

unused devices: <none>
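The exact commands used for the rebuild are not recorded in this task. As a rough sketch of a common approach for a two-disk mdadm RAID1, the helper below (hypothetical name `rebuild_commands`) only prints the commands it would suggest and runs nothing: copy the partition table from the healthy disk with `sfdisk`, then add the new partitions back with `mdadm --manage ... --add`. The array-to-partition mapping is taken from the mdstat output above; everything else is an assumption.

```python
def rebuild_commands(new_disk, healthy_disk, arrays):
    """Build a dry-run command list for re-adding a replaced disk.

    arrays maps md device -> partition number on each disk,
    e.g. {"/dev/md0": 1} means /dev/md0 is built from sda1 and sdb1.
    """
    # Copy the partition layout from the surviving disk to the new one.
    cmds = [f"sfdisk -d {healthy_disk} | sfdisk {new_disk}"]
    # Re-add each new partition to its array; the kernel then resyncs it.
    for md, part in sorted(arrays.items()):
        cmds.append(f"mdadm --manage {md} --add {new_disk}{part}")
    return cmds


# Layout taken from the mdstat output in this task.
commands = rebuild_commands("/dev/sda", "/dev/sdb",
                            {"/dev/md0": 1, "/dev/md1": 2})
for cmd in commands:
    print(cmd)
# sfdisk -d /dev/sdb | sfdisk /dev/sda
# mdadm --manage /dev/md0 --add /dev/sda1
# mdadm --manage /dev/md1 --add /dev/sda2
```

Depending on the boot setup, the boot loader may also need to be reinstalled on the new disk; that step is left out here because the task does not say how this host boots.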
Jclark-ctr claimed this task.