herron (Keith Herron)
Site Reliability Engineer

User Details

User Since
May 30 2017, 5:25 PM (389 w, 3 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Yesterday

herron updated the task description for T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Fri, Nov 15, 6:49 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron closed T378989: eqiad: (2x) aux-k8s-worker nodes, a subtask of T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams, as Resolved.
Fri, Nov 15, 6:49 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron closed T378989: eqiad: (2x) aux-k8s-worker nodes as Resolved.

I think we're good here!

Fri, Nov 15, 6:49 PM · Kubernetes, vm-requests, SRE

Wed, Nov 13

herron updated the task description for T379678: Requesting access to deployment for dbrant.
Wed, Nov 13, 6:07 PM · SRE, SRE-Access-Requests
herron moved T379678: Requesting access to deployment for dbrant from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Hello! A few next-steps to move forward with this request:

Wed, Nov 13, 5:42 PM · SRE, SRE-Access-Requests
herron triaged T379678: Requesting access to deployment for dbrant as Medium priority.
Wed, Nov 13, 5:38 PM · SRE, SRE-Access-Requests

Tue, Nov 12

herron closed T379630: Grant Access to ldap/wmf for HArroyo-WMF as Resolved.

membership to ldap group wmf has been provisioned, thanks!

Tue, Nov 12, 3:47 PM · SRE, LDAP-Access-Requests
herron added a member for WMF-NDA: hector.arroyo.
Tue, Nov 12, 3:39 PM
herron closed T379409: Grant Access to ldap/wmf for khantstop as Resolved.

uid=khantstop has been added to ldap group wmf

Tue, Nov 12, 2:56 PM · SRE, LDAP-Access-Requests
herron added a member for WMF-NDA: Khantstop.
Tue, Nov 12, 2:47 PM

Fri, Nov 8

lmata awarded T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514) a Like token.
Fri, Nov 8, 8:17 PM · SRE Observability, sre-alert-triage
herron added a comment to T378989: eqiad: (2x) aux-k8s-worker nodes.
ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:C9d5dCz/suTrJQqv6NtW3q/R241NXTC1GL3JMLfaMY4.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending DSA key in /var/lib/ganeti/known_hosts:2
  remove with:
  ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqiad.wmnet"
RSA host key for ganeti01.svc.eqiad.wmnet has changed and you have requested strict checking.
Host key verification failed.
Failure: command execution error:
Connection to console of instance aux-k8s-worker1004.eqiad.wmnet failed, please check cluster configuration

Looks like this VM got assigned to new ganeti host ganeti1045.eqiad.wmnet which might still be a work in progress T378921?

Fri, Nov 8, 3:20 PM · Kubernetes, vm-requests, SRE

Thu, Nov 7

herron added a comment to T378989: eqiad: (2x) aux-k8s-worker nodes.
ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:C9d5dCz/suTrJQqv6NtW3q/R241NXTC1GL3JMLfaMY4.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending DSA key in /var/lib/ganeti/known_hosts:2
  remove with:
  ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqiad.wmnet"
RSA host key for ganeti01.svc.eqiad.wmnet has changed and you have requested strict checking.
Host key verification failed.
Failure: command execution error:
Connection to console of instance aux-k8s-worker1004.eqiad.wmnet failed, please check cluster configuration
Thu, Nov 7, 7:22 PM · Kubernetes, vm-requests, SRE

Mon, Nov 4

herron updated the task description for T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Mon, Nov 4, 3:44 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
herron created T378989: eqiad: (2x) aux-k8s-worker nodes.
Mon, Nov 4, 3:40 PM · Kubernetes, vm-requests, SRE
herron created T378988: codfw: (3x) aux-k8s-etcd nodes.
Mon, Nov 4, 3:40 PM · vm-requests, SRE, Kubernetes
herron created T378987: codfw: (4x) aux-k8s-worker nodes.
Mon, Nov 4, 3:40 PM · vm-requests, SRE, Kubernetes
herron created T378986: codfw: (2x) aux-k8s-ctrl nodes.
Mon, Nov 4, 3:40 PM · vm-requests, SRE, Kubernetes

Fri, Nov 1

colewhite awarded T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514) a Party Time token.
Fri, Nov 1, 9:21 PM · SRE Observability, sre-alert-triage

Thu, Oct 31

herron closed T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514) as Resolved.

(Fixed in T377703)

Thu, Oct 31, 6:23 PM · Data-Platform-SRE, sre-alert-triage
herron closed T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514) as Resolved.
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Beginning probe" probe=tcp timeout_seconds=3
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolving target address" target=10.64.16.86 ip_protocol=ip4
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolved target address" target=10.64.16.86 ip=10.64.16.86
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.827Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Dialing TCP with TLS"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Successfully dialed"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Probe succeeded" duration_seconds=0.050861874
Thu, Oct 31, 6:22 PM · SRE Observability, sre-alert-triage
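
The blackbox exporter debug lines above are logfmt-style key=value pairs, so the probe result and duration can be pulled out mechanically. A minimal sketch, assuming the line layout shown in that excerpt; `parse_logfmt` is a hypothetical helper written for illustration, not part of Prometheus or the blackbox exporter:

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a logfmt-style line into a dict of key/value pairs.

    Tokens without '=' (for example a syslog prefix) are skipped.
    """
    fields = {}
    for token in shlex.split(line):
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields


# Last line of the excerpt above, without the syslog prefix.
sample = (
    "ts=2024-10-31T18:19:40.876Z caller=handler.go:184 "
    "module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 "
    'level=debug msg="Probe succeeded" duration_seconds=0.050861874'
)

fields = parse_logfmt(sample)
print(fields["msg"], float(fields["duration_seconds"]))
```

Using `shlex.split` keeps the quoted `msg="..."` value together as one token; a stricter logfmt parser may be preferable for lines containing escaped quotes.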
herron closed T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514), a subtask of T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514), as Resolved.
Thu, Oct 31, 6:22 PM · SRE Observability, sre-alert-triage
herron added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Change #1084199 merged by Herron:

[operations/puppet@production] profile::syslog::centralserver: use prometheus cert for blackbox check

https://gerrit.wikimedia.org/r/1084199

Thu, Oct 31, 4:55 PM · SRE Observability, sre-alert-triage

Wed, Oct 30

herron added a project to T376790: Split the permission to access Logstash from the cn=wmf and cn=nda groups: SRE Observability.
Wed, Oct 30, 1:09 PM · SRE Observability, Infrastructure-Foundations, SRE

Tue, Oct 29

herron added a comment to T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).

Looked into this a bit since the silence expired. The service being probed is up, but it looks like the related Prometheus blackbox exporter is failing to load the configured certificate

Tue, Oct 29, 5:38 PM · SRE Observability, sre-alert-triage
herron added a subtask for T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514): T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514).
Tue, Oct 29, 5:35 PM · SRE Observability, sre-alert-triage
herron added a parent task for T377703: Alert in need of triage: ProbeDown (instance centrallog2002:6514): T359293: Alert in need of triage: ProbeDown (instance centrallog1002:6514).
Tue, Oct 29, 5:35 PM · Data-Platform-SRE, sre-alert-triage

Fri, Oct 25

herron updated subscribers of T378190: Thanos: set up trace sampling.

Looked into this a bit and from what I can tell the feature for OTEL sampling within Thanos was introduced in a later version (v0.32.0)

Fri, Oct 25, 3:11 PM · Patch-For-Review, Observability-Tracing
herron created T378190: Thanos: set up trace sampling.
Fri, Oct 25, 3:02 PM · Patch-For-Review, Observability-Tracing

Mon, Oct 21

herron closed T376904: Upgrade to Jaeger v1.62.0 as Resolved.

https://trace.wikimedia.org is now running 1.62.0.

Mon, Oct 21, 4:50 PM · Patch-For-Review, Observability-Tracing
herron closed T376904: Upgrade to Jaeger v1.62.0, a subtask of T375123: Archiving a trace more than once will duplicate all spans, as Resolved.
Mon, Oct 21, 4:50 PM · Observability-Tracing
herron created T377756: jaeger-query: setup method for testing new/test versions.
Mon, Oct 21, 4:48 PM · Observability-Tracing
herron updated the task description for T376904: Upgrade to Jaeger v1.62.0.
Mon, Oct 21, 4:46 PM · Patch-For-Review, Observability-Tracing

Fri, Oct 18

herron closed T377502: loki: "failed to initialize table ... too many open files" as Resolved.

Open file limit raised to 16384, we should be good for now

Fri, Oct 18, 2:44 PM · Observability-Metrics, Observability-Logging
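
One way to confirm that a raised open-file limit took effect for a running service is to read the "Max open files" row of `/proc/<pid>/limits` on Linux. A minimal sketch; the sample text below is illustrative rather than captured from the Loki host, and `max_open_files` is a hypothetical helper:

```python
def max_open_files(limits_text: str) -> tuple[str, str]:
    """Return the (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Columns after the label are: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max open files' row found")


# Illustrative sample in the layout of /proc/<pid>/limits (assumed values).
sample_limits = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max processes             63432                63432                processes\n"
    "Max open files            16384                16384                files\n"
)

print(max_open_files(sample_limits))
```

For a real process, the same function can be fed the contents of `/proc/<pid>/limits`, where `<pid>` is the service's main PID.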

Thu, Oct 17

herron added a comment to T377502: loki: "failed to initialize table ... too many open files".

as a stopgap I manually increased max files for the grafana-loki unit and the service recovered, will propose something to persist that as well

Thu, Oct 17, 7:48 PM · Observability-Metrics, Observability-Logging
herron triaged T377502: loki: "failed to initialize table ... too many open files" as High priority.
Thu, Oct 17, 7:47 PM · Observability-Metrics, Observability-Logging

Oct 16 2024

herron triaged T376904: Upgrade to Jaeger v1.62.0 as Medium priority.
Oct 16 2024, 2:44 PM · Patch-For-Review, Observability-Tracing

Oct 15 2024

herron closed T376740: Pyrra: ErrorBudgetBurn: <no value> as Resolved.

optimistically resolving now that the template has been updated

Oct 15 2024, 4:18 PM · User-herron, Observability-Metrics
herron closed T376740: Pyrra: ErrorBudgetBurn: <no value>, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Resolved.
Oct 15 2024, 4:15 PM · Patch-For-Review, User-herron, Observability-Metrics

Oct 9 2024

herron created T376805: PrometheusRuleEvaluationFailures: Prometheus rule evaluation failures.
Oct 9 2024, 1:51 PM · Observability-Metrics

Oct 8 2024

herron created T376740: Pyrra: ErrorBudgetBurn: <no value>.
Oct 8 2024, 5:54 PM · User-herron, Observability-Metrics
herron closed T376638: Pyrra: logstash-availability move to bool_gauge as Resolved.

https://slo.wikimedia.org/?search=logstash-availability has been onboarded to pyrra as a bool_gauge. It'll take some time for the recording rule data to populate, but I think we're good here!

Oct 8 2024, 2:21 PM · User-herron, Observability-Metrics
herron closed T376638: Pyrra: logstash-availability move to bool_gauge, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Resolved.
Oct 8 2024, 2:19 PM · Patch-For-Review, User-herron, Observability-Metrics

Oct 7 2024

herron added a comment to T376638: Pyrra: logstash-availability move to bool_gauge.

At the moment I'm experimenting with something like this

Oct 7 2024, 5:26 PM · User-herron, Observability-Metrics
herron created T376638: Pyrra: logstash-availability move to bool_gauge.
Oct 7 2024, 5:24 PM · User-herron, Observability-Metrics
herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
Oct 7 2024, 5:15 PM · Patch-For-Review, User-herron, Observability-Metrics

Oct 4 2024

herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
Oct 4 2024, 2:08 PM · Patch-For-Review, User-herron, Observability-Metrics

Oct 1 2024

herron triaged T376179: Thanos: enable tracing as Medium priority.
Oct 1 2024, 4:28 PM · Observability-Tracing

Sep 25 2024

herron closed T375284: pyrra: liftwing-articlequery-latency codfw slo renders blank page as Resolved.

Closing as the workaround above solves this well enough

Sep 25 2024, 1:41 PM · Patch-For-Review, User-herron, Observability-Metrics
herron closed T375284: pyrra: liftwing-articlequery-latency codfw slo renders blank page, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Resolved.
Sep 25 2024, 1:38 PM · Patch-For-Review, User-herron, Observability-Metrics

Sep 20 2024

herron added a comment to T375284: pyrra: liftwing-articlequery-latency codfw slo renders blank page.

The full queries with empty responses are:

Sep 20 2024, 4:39 PM · Patch-For-Review, User-herron, Observability-Metrics
herron created T375284: pyrra: liftwing-articlequery-latency codfw slo renders blank page.
Sep 20 2024, 4:37 PM · Patch-For-Review, User-herron, Observability-Metrics

Sep 10 2024

herron claimed T373995: CPU thermal throttling: saturation panel isn't working as expected.

Yes I agree, in fact I don't see results for cpu_throttled in Thanos at all

Sep 10 2024, 2:39 PM · SRE Observability (FY2024/2025-Q2)

Sep 9 2024

herron closed T269333: Switch default Grafana datasource to Thanos as Resolved.
Sep 9 2024, 4:40 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics

Sep 3 2024

herron added a comment to T269333: Switch default Grafana datasource to Thanos.

The Grafana default datasource has been switched again to Thanos. If needed, the patch to revert is https://gerrit.wikimedia.org/r/1069230

Sep 3 2024, 3:16 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics
herron updated the task description for T269333: Switch default Grafana datasource to Thanos.
Sep 3 2024, 3:14 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics

Aug 30 2024

herron added a comment to T269333: Switch default Grafana datasource to Thanos.

Change #1057882 merged by Herron:

[operations/puppet@production] grafana: set thanos as default datasource

https://gerrit.wikimedia.org/r/1057882

FTR, this was just reverted due to the issue @fgiunchedi mentioned above.

Aug 30 2024, 3:13 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics
herron added a comment to T371520: graph for hits to Linked Data Endpoint (Special:EntityData) is broken.

I’ve just saved new versions of wikidata-edits (schema version 34→38), wikidata-query-service-ui (34→38) and wikidata-special-entitydata (16→38 – no, that’s not a typo! it was ancient). So hopefully those should now stay working if the default is changed again.

Aug 30 2024, 3:09 PM · Wikidata, Wikidata Analytics

Aug 1 2024

herron moved T371244: VictorOps paged batphone immediately rather than after 5m from Inbox to Radar on the SRE Observability board.

Comparing with an example after-hours incident like incident 4926, VO logs about 2 dozen contact attempts to the timeline when the incident is escalated to the batphone.

Aug 1 2024, 2:28 PM · SRE Observability, SRE-OnFire, SRE
herron awarded T366710: Switch k8s logs to their own kafka topics a Love token.
Aug 1 2024, 2:06 PM · SRE Observability (FY2024/2025-Q1), Observability-Logging

Jul 29 2024

lmata awarded T269333: Switch default Grafana datasource to Thanos a Party Time token.
Jul 29 2024, 5:35 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics
herron closed T269333: Switch default Grafana datasource to Thanos as Resolved.

The Grafana default datasource has been switched to Thanos. If any unexpected issues occur, the patch to revert is https://gerrit.wikimedia.org/r/1057882. There's also a sqlite db backup on the Grafana filesystem with today's date, just in case.

Jul 29 2024, 2:23 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics
herron updated the task description for T269333: Switch default Grafana datasource to Thanos.
Jul 29 2024, 2:22 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics

Jul 26 2024

herron updated the task description for T269333: Switch default Grafana datasource to Thanos.
Jul 26 2024, 4:51 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics
herron added a comment to T269333: Switch default Grafana datasource to Thanos.

I've updated all dashboards with null panel datasources as of today. At this point the only panels that remain with null datasource are dashboard rows which are expected as they are the row containers for other panels.

Jul 26 2024, 4:51 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics

Jul 25 2024

herron added a comment to T370772: Prometheus eqiad/codfw hw expansion architecture options.

Agree! +1 for option 2 here as well.

Jul 25 2024, 3:15 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics

Jul 24 2024

herron claimed T269333: Switch default Grafana datasource to Thanos.
Jul 24 2024, 1:40 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics

Jul 19 2024

herron added a project to T369854: Occasional SLOMetricAbsent alerts: Observability-Metrics.

Had a closer look into the threshold tunables and I don't actually see a way to change this natively within Pyrra. As-is the "for" duration of SLOMetricAbsent alerts is 6m. Pyrra has options to enable/disable the absent alert, but maybe we can configure something to put this alert in a silence/inhibit waiting room for 30-60m before it alerts (or recovers on its own)

Jul 19 2024, 5:40 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics

Jul 18 2024

herron renamed T369854: Occasional SLOMetricAbsent alerts from Occasional SLOMetricAbsent false positives to Occasional SLOMetricAbsent alerts.
Jul 18 2024, 2:37 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics
herron edited projects for T369854: Occasional SLOMetricAbsent alerts, added: SRE Observability; removed SRE Observability (FY2024/2025-Q1).

SLOMetricAbsent for dead_letters_hits, varnish_sli_bad, trafficserver_backend_sli_bad and haproxy_sli_bad occurred today, at two different times.

Jul 18 2024, 2:37 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics

Jul 15 2024

herron updated the task description for T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Jul 15 2024, 6:05 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations

Jul 11 2024

herron added a comment to T369854: Occasional SLOMetricAbsent alerts.

Reviewing thanos-rule logs I'm seeing related discards with err="out of order sample"

Jul 11 2024, 5:56 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics
herron triaged T369854: Occasional SLOMetricAbsent alerts as Medium priority.
Jul 11 2024, 5:54 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics
herron created P66323 (An Untitled Masterwork).
Jul 11 2024, 5:51 PM

Jul 8 2024

herron added a comment to T352756: Gap in metrics rendered from Thanos Rules.

It occurred to me that deploying absent metric alerts for the metrics where we're seeing gaps would be a reasonable next step. That'd let us troubleshoot gaps in/closer to their broken state which should help toward better understanding the issue and steps to resolve manually. Plus, it'd of course help us respond faster and shrink the gaps. I'll work on a patch.

Jul 8 2024, 2:40 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics, Machine-Learning-Team

Jul 3 2024

herron added a comment to T368168: Re-evaluate logging cluster watermark settings.

Proposal SGTM for the near-term. Thank you for organizing this!

Jul 3 2024, 2:59 PM · Observability-Logging
herron updated the task description for T368168: Re-evaluate logging cluster watermark settings.
Jul 3 2024, 2:36 PM · Observability-Logging

Jul 1 2024

herron triaged T368953: Thanos Cache Tuning as Medium priority.
Jul 1 2024, 5:15 PM · Observability-Metrics

Jun 25 2024

herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Buster is looking fine with this deb as well. So I've gone ahead and uploaded 1.8.0 to bookworm-wikimedia, bullseye-wikimedia, and buster-wikimedia. Up next is a small canary upgrade before fully rolling out.

Jun 25 2024, 6:53 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations
herron updated the task description for T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Jun 25 2024, 6:50 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations
herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Nice, thanks for the pointer! It looks like export CGO_ENABLED=0 does the right thing. At least, with this set the package builds and installs successfully on my bullseye test host.

Jun 25 2024, 2:47 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations

Jun 24 2024

herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Given that this is a Go static ELF, we can also simply build on bookworm and copy the deb over to bullseye-wikimedia; we're doing this for other exporters as well. buster might be tricky due to its old libc6, but we can also ignore it: there are fewer than 150 hosts left and they can simply live with the old IPMI monitoring.

Jun 24 2024, 5:32 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations

Jun 21 2024

herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Quick update: prometheus-ipmi-exporter-1.8.0 was a straightforward backport for bookworm https://gitlab.wikimedia.org/repos/sre/prometheus-ipmi-exporter/-/jobs/291532

Jun 21 2024, 6:33 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations
herron updated the task description for T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Jun 21 2024, 6:33 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations
herron added a comment to T352756: Gap in metrics rendered from Thanos Rules.

On the 6th in the thanos-rule logs I see a ton of errors while connecting to Prometheus nodes, and around 14 UTC a reload was issued on both titan active nodes.

Jun 21 2024, 3:07 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics, Machine-Learning-Team

Jun 20 2024

herron created T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Jun 20 2024, 4:41 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, Packaging, Infrastructure-Foundations
herron added a comment to T253810: Alert on ECC warnings in SEL.

ipmi_exporter now has support to collect generic SEL entries and export metrics from those: https://github.com/prometheus-community/ipmi_exporter/pull/179

Jun 20 2024, 4:36 PM · SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), User-MoritzMuehlenhoff
herron added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

Sounds like a plan!

Jun 20 2024, 2:54 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
herron added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

I think we may want to enable it deployment per deployment so SRE Observability can monitor the load on prometheus. @colewhite or @herron can we coordinate on this?

Jun 20 2024, 2:49 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics

Jun 18 2024

herron awarded T367466: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences a Love token.
Jun 18 2024, 2:03 PM · SRE-tools, Infrastructure-Foundations, Spicerack, Observability-Alerting
herron added a comment to T367790: Detect hardware failures/automatically create tickets for DC Ops.

Looking longer-term I think it'd be generally worthwhile to support more than baseline blackbox checks on the mgmt interfaces and I'm personally open to exploring something like the redfish exporter. But I think this would be a medium to large sized project since AIUI it will involve a decent sized chunk of setup effort on the hardware itself, and offhand I'm also not sure the percentage of hw in the fleet that will support the approach today. I'm also assuming we would run into some hw vendor oddities/bugs along the way.

Jun 18 2024, 1:46 PM · DC-Ops, Data-Platform

Jun 17 2024

herron added a comment to T359879: SLO dashboards for Lift Wing showing unexpected values.

@herron let's double check, maybe we can drop the secondary rules and keep going with the "regular" ones?

Example of the fix: https://grafana.wikimedia.org/d/slo-Lift_Wing_Revert_Risk_LA/lift-wing-revert-risk-la-slo-s?orgId=1&from=2024-03-01%2000:00:00&to=2024-05-31%2023:59:59

Jun 17 2024, 2:23 PM · Machine-Learning-Team, Observability-Metrics

Jun 14 2024

herron closed T367053: Grant Access to wmf for Gonyeahialam as Resolved.

done

Jun 14 2024, 2:06 PM · SRE, LDAP-Access-Requests
herron added a member for WMF-NDA: gonyeahialam.
Jun 14 2024, 2:04 PM

Jun 13 2024

herron closed T367053: Grant Access to wmf for Gonyeahialam as Resolved.

Group membership has been provisioned, thanks!

Jun 13 2024, 4:11 PM · SRE, LDAP-Access-Requests

Jun 12 2024

herron moved T367295: Requesting access to private data-based dashboards for Jsn.sherman from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Hi @odimitrijevic @Milimetric @WDoranWMF @Ahoelzl could one of you please approve this request for analytics-privatedata-users? Thanks in advance!

Jun 12 2024, 6:12 PM · Data-Engineering, SRE, SRE-Access-Requests
herron updated the task description for T367295: Requesting access to private data-based dashboards for Jsn.sherman.
Jun 12 2024, 6:07 PM · Data-Engineering, SRE, SRE-Access-Requests
herron closed T360895: Memory upgrade request for prometheus200[56] as Resolved.

This was completed yesterday (during stashbot outage, this task unfortunately missed the !log)

Jun 12 2024, 1:45 PM · DC-Ops, SRE, ops-codfw, Observability-Metrics

Jun 11 2024

herron closed T367173: Requesting access to Kubernetes deployment for ebysans as Resolved.

The patch to provision this access has been merged, and will be fully propagated within the next 30 minutes. I'll transition this to resolved now, please reopen if any followup is needed. Thanks!

Jun 11 2024, 5:48 PM · SRE, Data-Engineering, SRE-Access-Requests
herron closed T365832: Requesting access to analytics-privatedata-users for Rae Adimer as Resolved.

The patch to provision this access has been merged and will fully propagate within the next 30 minutes. I'll transition this to resolved now, please re-open if any followup is needed. Thanks!

Jun 11 2024, 4:22 PM · SRE, SRE-Access-Requests