
DC-Ops
Group · Active · Public


Description

Tasks handled by the Wikimedia Foundation's datacenter operations team, which is a sub-team of the SRE department.

This project includes the sub-projects procurement and decommission-hardware, plus every datacenter site-specific project: ops-codfw, ops-drmrs, ops-eqdfw, ops-eqiad, ops-eqord, ops-esams, ops-eqsin, ops-ulsfo, and ops-magru.

This can be linked to via: https://phabricator.wikimedia.org/tag/dc-ops/

Please note that any Wikitech documentation maintained by DC-Ops is linked from https://wikitech.wikimedia.org/wiki/Dc-operations

SLAs

DC-Ops makes every attempt to resolve all tasks and requests in a timely manner. We've implemented the following SLA targets.

Please note that no SLA clock starts until both conditions are met: the SLA start event listed for that task type has occurred, and the task carries the proper project tags. See the section for each type of request below for details, and please use the templates listed in the table.

Project                   | Days to Resolve | SLA start                                                             | Template
procurement               | 90              | Date of Task filing                                                   | Procurement Template
Racking/Installation      | 30              | Arrival of Hardware to DC site                                        | (none listed)
Hardware Failure / Repair | 10              | Date of Task filing                                                   | Hardware Failure Template
Decommission              | 45              | When all sub-team steps are complete and task is assigned to on-site | Server Decommission Template
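
As a rough illustration of the SLA arithmetic in the table above, the following sketch computes a target resolution date from an SLA start date. It is not an official DC-Ops tool. The dictionary keys and function names are hypothetical, and it assumes the targets are counted in calendar days, which the page does not state.

```python
from datetime import date, timedelta

# Days-to-resolve targets copied from the SLA table.
# The keys are hypothetical labels chosen for this sketch.
SLA_DAYS = {
    "procurement": 90,
    "racking": 30,
    "hardware-failure": 10,
    "decommission": 45,
}


def sla_due_date(task_type: str, sla_start: date) -> date:
    """Return the target resolution date for a task.

    Assumes calendar days; the source does not say whether
    business days are intended.
    """
    return sla_start + timedelta(days=SLA_DAYS[task_type])


def days_remaining(task_type: str, sla_start: date, today: date) -> int:
    """Return days left before the target date (negative if overdue)."""
    return (sla_due_date(task_type, sla_start) - today).days


if __name__ == "__main__":
    filed = date(2024, 9, 20)
    print(sla_due_date("hardware-failure", filed))  # 2024-09-30
    print(days_remaining("hardware-failure", filed, date(2024, 9, 22)))  # 8
```

Note that sla_start must be the event named in the "SLA start" column (for example, hardware arrival for racking), not necessarily the task filing date.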

Hardware Repair

If you need to file a task requesting hardware troubleshooting, please use the File Hardware Failure Task link here or in the navbar on the left.

Troubleshooting covers hardware failures, RAID reconfiguration, and similar issues.

A full runbook on how to troubleshoot hardware failures can be viewed here: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook

Requesting Hardware

If you have a budget line item and want to request pricing, please file your procurement request via this link. If you do not yet have a budget line for the request in this fiscal year, you can still file via that link; just note in the budget section of the task that there is no budget allocation.

Once hardware has been ordered, a racking task must be entered using the form. This form may also be used if a system has to be moved and re-imaged.

Decommissioning Hardware

This covers all hardware being returned to DC-Ops, whether for processing into spares or for decommissioning and removal from the rack.

Any hardware no longer required for use should have a task filed for decommission via the pre-defined server decommission request form.

Netbox Reporting

The template for netbox report errors is here: https://phabricator.wikimedia.org/maniphest/task/edit/form/133/

Recent Activity

Today

phaultfinder updated the task description for T375218: PDU sensor over limit.
Sun, Sep 22, 4:15 AM · SRE, ops-eqiad, DC-Ops

Yesterday

phaultfinder updated the task description for T375218: PDU sensor over limit.
Sat, Sep 21, 6:35 PM · SRE, ops-eqiad, DC-Ops
Maintenance_bot added a project to T375328: ManagementSSHDown: SRE.
Sat, Sep 21, 5:29 PM · SRE, DC-Ops, ops-codfw
phaultfinder created T375328: ManagementSSHDown.
Sat, Sep 21, 5:22 PM · SRE, DC-Ops, ops-codfw
phaultfinder updated the task description for T375218: PDU sensor over limit.
Sat, Sep 21, 3:40 PM · SRE, ops-eqiad, DC-Ops
dcaro added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

Not yet, I'm still draining the C8 rack, early next week I'll have something

Sat, Sep 21, 1:41 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS
phaultfinder updated the task description for T375218: PDU sensor over limit.
Sat, Sep 21, 1:25 PM · SRE, ops-eqiad, DC-Ops
phaultfinder updated the task description for T375218: PDU sensor over limit.
Sat, Sep 21, 12:30 PM · SRE, ops-eqiad, DC-Ops
Maintenance_bot added a project to T375314: ManagementSSHDown: SRE.
Sat, Sep 21, 12:29 AM · SRE, DC-Ops, ops-eqiad
phaultfinder created T375314: ManagementSSHDown.
Sat, Sep 21, 12:26 AM · SRE, DC-Ops, ops-eqiad

Fri, Sep 20

Jclark-ctr closed T374652: Degraded RAID on aqs1014 as Resolved.

duplicate of T362841

Fri, Sep 20, 11:59 PM · SRE, DC-Ops, ops-eqiad
Jclark-ctr added a comment to T374540: Degraded RAID on prometheus1008.

If sdd was the drive replaced which assume from dmesg

Fri, Sep 20, 11:57 PM · SRE Observability (FY2024/2025-Q1), SRE, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T374215: db1246 crashed, doesn't reboot cleanly.

[Attached screenshot: Screenshot 2024-09-20 at 7.44.07 PM.png]
I got this failure and will not go past. @VRiley-WMF have you gotten anywhere with dell?

Fri, Sep 20, 11:46 PM · SRE, ops-eqiad, DC-Ops, DBA
Jclark-ctr added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

@dcaro did you have an update with what servers and drives I can send? I will reach out over irc monday to discuss this also

Fri, Sep 20, 11:30 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS
Jclark-ctr added a comment to T372814: Put cloudcephosd10[39-41] into service.

@Andrew i see this ticket is in my name. is there something i need to do for this?

Fri, Sep 20, 11:29 PM · Patch-For-Review, SRE, ops-eqiad, cloud-services-team, DC-Ops
Jclark-ctr claimed T374901: Degraded RAID on puppetmaster1003.
Fri, Sep 20, 11:27 PM · SRE, ops-eqiad, DC-Ops
Jclark-ctr closed T374901: Degraded RAID on puppetmaster1003 as Resolved.
Fri, Sep 20, 11:27 PM · SRE, ops-eqiad, DC-Ops
Jclark-ctr added a comment to T374901: Degraded RAID on puppetmaster1003.
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[2] sdb2[1]
      185469952 blocks super 1.2 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk
Fri, Sep 20, 11:26 PM · SRE, ops-eqiad, DC-Ops
Jclark-ctr updated subscribers of T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

@MoritzMuehlenhoff can you update puppet site.pp is missing these servers. also please verify preseed.yaml is updated
Thanks

Fri, Sep 20, 11:25 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host ganeti1040.eqiad.wmnet with OS bookworm

Fri, Sep 20, 11:02 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
Jclark-ctr closed T375304: ManagementSSHDown as Resolved.

duplicate

Fri, Sep 20, 10:54 PM · SRE, ops-eqiad, DC-Ops
Dwisehaupt added a comment to T369565: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002.

Can we get verification on the status of these hosts? Are they racked, cabled, and ready for buildout?

Fri, Sep 20, 10:42 PM · SRE, ops-eqiad, fundraising-tech-ops, DC-Ops
Dwisehaupt added a comment to T369947: Q1:rack/setup/install frban1002.

Can we get verification on the status of this host? Are they racked, cabled, and ready for build out?

Fri, Sep 20, 10:41 PM · SRE, fundraising-tech-ops, ops-eqiad, DC-Ops
Dwisehaupt added a comment to T369940: Q1:rack/setup/install fran1002.

Can we get verification on the status of this host? Are they racked, cabled, and ready for build out?

Fri, Sep 20, 10:41 PM · SRE, fundraising-tech-ops, ops-eqiad, DC-Ops
Dwisehaupt added a comment to T369922: Q1:rack/setup/install frdb1007.

Can we get verification on the status of these hosts? Are they racked, cabled, and ready for build out?

Fri, Sep 20, 10:40 PM · SRE, fundraising-tech-ops, ops-eqiad, DC-Ops
Dwisehaupt moved T375239: Decommission frack hosts: frlog2001 from In Progress to Watching on the fundraising-tech-ops board.
Fri, Sep 20, 10:30 PM · SRE, DC-Ops, ops-codfw, Patch-For-Review, decommission-hardware, fundraising-tech-ops
Dwisehaupt moved T375297: Decommission frack hosts: frpm2001 from Triage to Watching on the fundraising-tech-ops board.
Fri, Sep 20, 10:30 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, decommission-hardware, fundraising-tech-ops
Maintenance_bot added a project to T375239: Decommission frack hosts: frlog2001: SRE.
Fri, Sep 20, 10:29 PM · SRE, DC-Ops, ops-codfw, Patch-For-Review, decommission-hardware, fundraising-tech-ops
Maintenance_bot added a project to T375297: Decommission frack hosts: frpm2001: SRE.
Fri, Sep 20, 10:29 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, decommission-hardware, fundraising-tech-ops
phaultfinder updated the task description for T375218: PDU sensor over limit.
Fri, Sep 20, 10:00 PM · SRE, ops-eqiad, DC-Ops
Dwisehaupt placed T375239: Decommission frack hosts: frlog2001 up for grabs.
Fri, Sep 20, 9:37 PM · SRE, DC-Ops, ops-codfw, Patch-For-Review, decommission-hardware, fundraising-tech-ops
Dwisehaupt placed T375297: Decommission frack hosts: frpm2001 up for grabs.
Fri, Sep 20, 9:36 PM · SRE, ops-codfw, DC-Ops, Patch-For-Review, decommission-hardware, fundraising-tech-ops
Maintenance_bot added a project to T375304: ManagementSSHDown: SRE.
Fri, Sep 20, 8:29 PM · SRE, ops-eqiad, DC-Ops
phaultfinder created T375304: ManagementSSHDown.
Fri, Sep 20, 8:24 PM · SRE, ops-eqiad, DC-Ops
Jhancock.wm closed T372512: Q1:rack/setup/install logging-hd200[4-5] as Resolved.

@colewhite this is complete!

Fri, Sep 20, 7:29 PM · SRE, ops-codfw, observability, DC-Ops
Jhancock.wm updated the task description for T372512: Q1:rack/setup/install logging-hd200[4-5].
Fri, Sep 20, 7:28 PM · SRE, ops-codfw, observability, DC-Ops
Dwisehaupt moved T374741: decommission frban2001.frack.codfw.wmnet from Watching to Done on the fundraising-tech-ops board.
Fri, Sep 20, 7:24 PM · SRE, DC-Ops, ops-codfw, fundraising-tech-ops, decommission-hardware
Dzahn merged T375281: ManagementSSHDown - elastic1089 into T374897: ManagementSSHDown - elastic1089.
Fri, Sep 20, 6:59 PM · SRE, DC-Ops, ops-eqiad
Dzahn merged task T375281: ManagementSSHDown - elastic1089 into T374897: ManagementSSHDown - elastic1089.
Fri, Sep 20, 6:58 PM · SRE, DC-Ops, ops-eqiad
Dzahn renamed T375281: ManagementSSHDown - elastic1089 from ManagementSSHDown to ManagementSSHDown - elastic1089.
Fri, Sep 20, 6:57 PM · SRE, DC-Ops, ops-eqiad
Jhancock.wm updated the task description for T372512: Q1:rack/setup/install logging-hd200[4-5].
Fri, Sep 20, 5:47 PM · SRE, ops-codfw, observability, DC-Ops
phaultfinder updated the task description for T375218: PDU sensor over limit.
Fri, Sep 20, 5:25 PM · SRE, ops-eqiad, DC-Ops
Maintenance_bot added a project to T375281: ManagementSSHDown - elastic1089: SRE.
Fri, Sep 20, 4:29 PM · SRE, DC-Ops, ops-eqiad
phaultfinder created T375281: ManagementSSHDown - elastic1089.
Fri, Sep 20, 4:23 PM · SRE, DC-Ops, ops-eqiad
VRiley-WMF merged task T375130: ManagementSSHDown into T374897: ManagementSSHDown - elastic1089.
Fri, Sep 20, 4:08 PM · SRE, DC-Ops, ops-eqiad
VRiley-WMF merged T375130: ManagementSSHDown into T374897: ManagementSSHDown - elastic1089.
Fri, Sep 20, 4:08 PM · SRE, DC-Ops, ops-eqiad
Dwisehaupt added a comment to T373893: Relocate servers in C8 to make room for new Network devices .

@Jhancock.wm pay-lb2001 looks good. Thanks.

Fri, Sep 20, 4:00 PM · SRE, DC-Ops, fundraising-tech-ops, ops-codfw
Jhancock.wm updated the task description for T373893: Relocate servers in C8 to make room for new Network devices .
Fri, Sep 20, 2:41 PM · SRE, DC-Ops, fundraising-tech-ops, ops-codfw
Jhancock.wm added a comment to T373893: Relocate servers in C8 to make room for new Network devices .

@Dwisehaupt moved and powering up. Let me know if anything looks amiss.

Fri, Sep 20, 2:39 PM · SRE, DC-Ops, fundraising-tech-ops, ops-codfw
Dwisehaupt added a comment to T373893: Relocate servers in C8 to make room for new Network devices .

@Jhancock.wm It's all ready for you.

Fri, Sep 20, 2:30 PM · SRE, DC-Ops, fundraising-tech-ops, ops-codfw