User Details
- User Since
- May 30 2017, 5:25 PM (389 w, 3 d)
- Availability
- Available
- IRC Nick
- herron
- LDAP User
- Herron
- MediaWiki User
- Unknown
Yesterday
I think we're good here!
Wed, Nov 13
Hello! A few next-steps to move forward with this request:
Tue, Nov 12
membership to ldap group wmf has been provisioned, thanks!
uid=khantstop has been added to ldap group wmf
Fri, Nov 8
Thu, Nov 7
ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:C9d5dCz/suTrJQqv6NtW3q/R241NXTC1GL3JMLfaMY4.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
Offending DSA key in /var/lib/ganeti/known_hosts:2
remove with:
ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqiad.wmnet"
RSA host key for ganeti01.svc.eqiad.wmnet has changed and you have requested strict checking.
Host key verification failed.
Failure: command execution error: Connection to console of instance aux-k8s-worker1004.eqiad.wmnet failed, please check cluster configuration
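The warning itself names the remediation; a quick sketch of what that looks like on the ganeti master (commands are taken from the error output above, so the only assumption is that the stale known_hosts entry is the whole problem):

# Remove the stale host key gnt-instance complained about, then retry the console
ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti01.svc.eqiad.wmnet"
gnt-instance console aux-k8s-worker1004.eqiad.wmnet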
Mon, Nov 4
Fri, Nov 1
Thu, Oct 31
(Fixed in T377703)
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Beginning probe" probe=tcp timeout_seconds=3
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolving target address" target=10.64.16.86 ip_protocol=ip4
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.825Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Resolved target address" target=10.64.16.86 ip=10.64.16.86
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.827Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Dialing TCP with TLS"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Successfully dialed"
Oct 31 18:19:40 prometheus1005 prometheus-blackbox-exporter[2309163]: ts=2024-10-31T18:19:40.876Z caller=handler.go:184 module=tcp_rsyslog_receiver_ip4 target=[10.64.16.86]:6514 level=debug msg="Probe succeeded" duration_seconds=0.050861874
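For reference, a hedged sketch of how the same probe can be reproduced by hand against the blackbox exporter's /probe endpoint with debug output (the exporter listening locally on its default port 9115 is an assumption about this setup):

# Run the tcp_rsyslog_receiver_ip4 module against the same target and return the per-step debug log
curl -s 'http://localhost:9115/probe?module=tcp_rsyslog_receiver_ip4&target=10.64.16.86:6514&debug=true'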
Wed, Oct 30
Tue, Oct 29
Looked into this a bit since the silence expired: the service being probed is up, but it looks like the related Prometheus blackbox exporter is failing to load the configured certificate.
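A minimal sketch of the kind of check to run next, assuming the problem is with the certificate file itself (the path below is a placeholder, not the real config value):

# Confirm the configured certificate is readable and not expired
openssl x509 -in /etc/prometheus/ssl/blackbox.pem -noout -subject -dates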
Fri, Oct 25
Looked into this a bit and from what I can tell the feature for OTEL sampling within Thanos was introduced in a later version (v0.32.0)
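A quick way to double-check the deployed version, assuming the component exposes its metrics on the default Thanos HTTP port 10902 (the hostname below is a placeholder):

# thanos_build_info carries the running version as a label
curl -s http://thanos-query.example.wmnet:10902/metrics | grep thanos_build_info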
Mon, Oct 21
https://trace.wikimedia.org is now running 1.62.0.
Fri, Oct 18
Open file limit raised to 16384; we should be good for now
Thu, Oct 17
As a stopgap I manually increased the max open files for the grafana-loki unit and the service recovered; I'll propose something to persist that as well
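A rough sketch of the persistent fix as a systemd drop-in (in practice this would be managed via Puppet rather than applied by hand, and the drop-in file name is arbitrary):

# Raise the open-file limit for the grafana-loki unit and pick up the change
mkdir -p /etc/systemd/system/grafana-loki.service.d
printf '[Service]\nLimitNOFILE=16384\n' > /etc/systemd/system/grafana-loki.service.d/limit-nofile.conf
systemctl daemon-reload
systemctl restart grafana-loki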
Oct 16 2024
Oct 15 2024
optimistically resolving now that the template has been updated
Oct 9 2024
Oct 8 2024
https://slo.wikimedia.org/?search=logstash-availability has been onboarded to pyrra as a bool_gauge. It'll take some time for the recording rule data to populate, but I think we're good here!
Oct 7 2024
At the moment I'm experimenting with something like this
Oct 4 2024
Oct 1 2024
Sep 25 2024
Closing as the workaround above solves this well enough
Sep 20 2024
The full queries with empty responses are:
Sep 10 2024
Yes I agree, in fact I don't see results for cpu_throttled in Thanos at all
Sep 9 2024
Sep 3 2024
The Grafana default datasource has been switched again to Thanos. If needed, the patch to revert is https://gerrit.wikimedia.org/r/1069230
Aug 30 2024
Aug 1 2024
Comparing with an example after-hours incident like incident 4926, VO logs about two dozen contact attempts to the timeline when the incident is escalated to the batphone.
Jul 29 2024
The Grafana default datasource has been switched to Thanos. If any unexpected issues occur, the patch to revert is https://gerrit.wikimedia.org/r/1057882. There's also a sqlite db backup on the Grafana filesystem with today's date, just in case.
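For the record, a hedged sketch of how such a backup can be taken with sqlite's online backup command (the db path below is the Grafana default and an assumption about this host):

# Snapshot the Grafana sqlite database with today's date in the filename
sqlite3 /var/lib/grafana/grafana.db ".backup /var/lib/grafana/grafana-$(date +%F).db"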
Jul 26 2024
I've updated all dashboards with null panel datasources as of today. At this point the only panels that remain with a null datasource are dashboard rows, which is expected since they are just the row containers for other panels.
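One way to spot the remaining cases, sketched under the assumption of a dashboard exported as JSON (the filename is a placeholder):

# List panels whose datasource is literally null; row panels are expected to show up here
jq '[.panels[]? | select(.datasource == null) | {title, type}]' dashboard.json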
Jul 25 2024
Agree! +1 for option 2 here as well.
Jul 24 2024
Jul 19 2024
Had a closer look into the threshold tunables and I don't actually see a way to change this natively within Pyrra. As-is, the "for" duration of SLOMetricAbsent alerts is 6m. Pyrra has options to enable/disable the absent alert, but maybe we can configure something to put this alert in a silence/inhibit waiting room for 30-60m before it alerts (or recovers on its own).
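One possible shape for that waiting room, sketched as an Alertmanager sub-route with a long group_wait so short gaps resolve before anyone is notified (the receiver name and duration are assumptions, and this is a fragment to merge into the existing routing tree, not a complete config):

# Print the routing-tree fragment; 30m group_wait holds the first notification for SLOMetricAbsent
cat <<'EOF'
route:
  routes:
    - matchers:
        - alertname = SLOMetricAbsent
      group_wait: 30m
      receiver: observability-low-priority
EOF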
Jul 18 2024
SLOMetricAbsent for dead_letters_hits, varnish_sli_bad, trafficserver_backend_sli_bad and haproxy_sli_bad occurred today, at two different times.
Jul 15 2024
Jul 11 2024
Reviewing the thanos-rule logs, I'm seeing related discards with err="out of order sample"
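A quick way to quantify this, assuming the rule evaluator runs as the thanos-rule systemd unit on these hosts:

# Count out-of-order discards in the recent thanos-rule logs
journalctl -u thanos-rule --since=-6h | grep -c 'out of order sample'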
Jul 8 2024
It occurred to me that deploying absent metric alerts for the metrics where we're seeing gaps would be a reasonable next step. That'd let us troubleshoot gaps in/closer to their broken state which should help toward better understanding the issue and steps to resolve manually. Plus, it'd of course help us respond faster and shrink the gaps. I'll work on a patch.
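A minimal sketch of the kind of rule I have in mind, using one of the affected metrics as an example (the file name, "for" duration, and labels are assumptions to be refined in the actual patch, and in practice the alert may be generated via Pyrra rather than written by hand):

# Write an absent-metric alert rule and validate it with promtool before deploying
cat > slo_metric_absent.yaml <<'EOF'
groups:
  - name: slo_metric_absent
    rules:
      - alert: SLOMetricAbsent
        expr: absent(haproxy_sli_bad)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "haproxy_sli_bad has not been ingested recently"
EOF
promtool check rules slo_metric_absent.yaml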
Jul 3 2024
Proposal SGTM for the near-term. Thank you for organizing this!
Jul 1 2024
Jun 25 2024
Buster is looking fine with this deb as well. So I've gone ahead and uploaded 1.8.0 to bookworm-wikimedia, bullseye-wikimedia, and buster-wikimedia. Up next is a small canary upgrade before fully rolling out.
Nice, thanks for the pointer! It looks like export CGO_ENABLED=0 does the right thing. At least, with this set the package builds and installs successfully on my bullseye test host.
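For reference, a hedged sketch of how the cgo-disabled build can look (the exact invocation is an assumption; in the real package the variable would be exported from debian/rules rather than the shell):

# Build the package without cgo so the binary does not link against the host libc
export CGO_ENABLED=0
dpkg-buildpackage -us -uc -b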
Jun 24 2024
Jun 21 2024
Quick update: prometheus-ipmi-exporter-1.8.0 was a straightforward backport for bookworm https://gitlab.wikimedia.org/repos/sre/prometheus-ipmi-exporter/-/jobs/291532
Jun 20 2024
Sounds like a plan!
Jun 18 2024
Looking longer-term, I think it'd be generally worthwhile to support more than baseline blackbox checks on the mgmt interfaces, and I'm personally open to exploring something like the redfish exporter. But I think this would be a medium-to-large-sized project: AIUI it will involve a decent chunk of setup effort on the hardware itself, and offhand I'm also not sure what percentage of hardware in the fleet would support the approach today. I'm also assuming we would run into some hardware vendor oddities/bugs along the way.
Jun 17 2024
Jun 14 2024
done
Jun 13 2024
Group membership has been provisioned, thanks!
Jun 12 2024
Hi @odimitrijevic @Milimetric @WDoranWMF @Ahoelzl could one of you please approve this request for analytics-privatedata-users? Thanks in advance!
This was completed yesterday (during the stashbot outage, so this task unfortunately missed the !log)
Jun 11 2024
The patch to provision this access has been merged, and will be fully propagated within the next 30 minutes. I'll transition this to resolved now, please reopen if any followup is needed. Thanks!
The patch to provision this access has been merged and will fully propagate within the next 30 minutes. I'll transition this to resolved now, please re-open if any followup is needed. Thanks!