Sustainability (Incident Followup)
Milestone · Active · Public

Description

Action items that came out of the investigation and documentation of past Wikimedia production incidents. These action items reduce risk, shorten or reduce impact, or help prevent future incidents.

Recent Activity

Wed, Nov 13

BTullis added a comment to T379718: Research allowing read-only access to the superset api from requestctl's web UI.

I think we can do this, but we should try to avoid changing the production Superset instance if we can.

Wed, Nov 13, 6:12 PM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool
Joe created T379718: Research allowing read-only access to the superset api from requestctl's web UI.
Wed, Nov 13, 10:27 AM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool

Mon, Nov 11

Joe added a comment to T310009: Make it easier to create a new requestctl object.

As an alternative, which I might actually prefer, the whole process would remain server-side if we can grant the web UI access to the Superset API via an API token or something similar. I'll investigate whether that's possible.

Mon, Nov 11, 5:32 PM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool
Joe added a comment to T310009: Make it easier to create a new requestctl object.

We could do this roughly as follows:

  • Add the CORS headers to superset to allow making authenticated requests from requestctl.wikimedia.org
    • add Access-Control-Allow-Origin: requestctl.wikimedia.org
    • add Access-Control-Allow-Headers: Authorization, Cookie, User-Agent (this might get refined)
    • add Access-Control-Allow-Methods: GET
  • Add a page to hiddenparma that accepts a Superset URL as input. We use this input to fetch the URL via cross-subdomain AJAX and run the same logic the current requestctl generator uses. The result is submitted via POST.
  • Use the POSTed data to check whether any existing patterns/ipblocks match the filters created. If so, select them; otherwise offer to create them. Once the user has confirmed that all components of the expression are either available or newly created, an expression creation form opens with a pre-filled expression.
Mon, Nov 11, 5:31 PM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool
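
The header list in the comment above could be sketched server-side roughly as follows. This is only an illustration, not Superset's actual configuration: the function name `cors_headers` and the `ALLOWED_ORIGINS` set are hypothetical. Two details go beyond the comment and follow from how browsers handle CORS: the origin value has to include the scheme (assumed here to be `https://`), and authenticated (cookie-bearing) requests need `Access-Control-Allow-Credentials: true` rather than listing `Cookie` among the allowed headers.

```python
# Hypothetical sketch: compute CORS response headers for a single trusted origin.
# The origin below assumes HTTPS; the original comment gives only the hostname.
ALLOWED_ORIGINS = frozenset({"https://requestctl.wikimedia.org"})


def cors_headers(request_origin):
    """Return the CORS headers to attach, or an empty dict for unknown origins."""
    if request_origin not in ALLOWED_ORIGINS:
        # No CORS headers: the browser will refuse to expose the response.
        return {}
    return {
        # Must echo the exact origin; "*" is not allowed with credentials.
        "Access-Control-Allow-Origin": request_origin,
        # Lets the browser send cookies and expose the response to the page.
        "Access-Control-Allow-Credentials": "true",
        # Read-only access, as proposed in the comment.
        "Access-Control-Allow-Methods": "GET",
        "Access-Control-Allow-Headers": "Authorization",
        # Caches must not reuse a response across different origins.
        "Vary": "Origin",
    }


if __name__ == "__main__":
    print(cors_headers("https://requestctl.wikimedia.org"))
    print(cors_headers("https://example.org"))
```

On the browser side, the page would also have to opt in to sending credentials (for example `fetch(url, {credentials: "include"})`) for the cookie to be attached.
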
Volans added a comment to T310009: Make it easier to create a new requestctl object.

With the new requestctl web UI, I think it would be very useful to adapt the current requestctl generator ( https://superset.wikimedia.org/requestctl-generator?q= ) to work with the new web UI or, even better, to ditch it and embed the same functionality directly in the requestctl web UI, so that the user only has to provide the URL of a Superset dashboard with filters.

Mon, Nov 11, 1:36 PM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool

Wed, Nov 6

gerritbot added a comment to T317799: Rate limiting for hotlinked images.

Change #1087615 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] haproxy: bwlim-by-path: also roll out to eqiad

https://gerrit.wikimedia.org/r/1087615

Wed, Nov 6, 3:03 AM · Patch-For-Review, SRE-Sprint-Week-Sustainability-March2023, Traffic, Sustainability (Incident Followup)

Tue, Nov 5

lmata moved T356788: thanos-query probedown due to OOM of both eqiad titan frontends from Inbox to Done on the SRE Observability (FY2024/2025-Q1) board.
Tue, Nov 5, 5:03 PM · SRE Observability (FY2024/2025-Q1), Sustainability (Incident Followup), SRE, observability

Fri, Nov 1

taavi moved T273959: cloud: monitor/alert on health of TLS certs used on shared front proxy setup from Unsorted to Web proxy on the Cloud-VPS board.
Fri, Nov 1, 7:01 PM · Toolforge, Cloud-VPS, cloud-services-team, Sustainability (Incident Followup)
Daimona added a comment to T214552: Jenkins build for MediaWiki should fail when "PHP Warning" is emitted.

Here we go again in 2024, when you can submit a change that redeclares an existing class with PHP just being like "oh, you're declaring the same class twice, but surely this isn't serious enough for me to stop you, please go ahead" and CI not batting an eye if you merge it. I think the CI solution is fine, but even better would be catching all warnings from the autoloader and turning them into fatals. As I noted in T378774#10283126, doing it unconditionally may be bad for performance ({{cn}}, though); but maybe we could do it for tests only.

Fri, Nov 1, 12:56 AM · Wikimedia-production-error, Developer Productivity, Sustainability (Incident Followup), Platform Engineering, Quibble, MediaWiki-Core-Tests, Continuous-Integration-Config
Daimona merged T378774: Early PHP warnings do not make CI jobs fail into T214552: Jenkins build for MediaWiki should fail when "PHP Warning" is emitted.
Fri, Nov 1, 12:52 AM · Wikimedia-production-error, Developer Productivity, Sustainability (Incident Followup), Platform Engineering, Quibble, MediaWiki-Core-Tests, Continuous-Integration-Config

Wed, Oct 30

thcipriani edited projects for T255197: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency, added: Release-Engineering-Team (Seen); removed Release-Engineering-Team (Onboarding 🚀).
Wed, Oct 30, 3:52 PM · Release-Engineering-Team (Seen), User-brennen, Sustainability (Incident Followup), Scap

Fri, Oct 25

hnowlan edited projects for T378038: create a place (whiteboard) where SRE advertises current site status / things for awareness, added: SRE-OnFire; removed SRE.
Fri, Oct 25, 3:55 PM · SRE-OnFire, Sustainability (Incident Followup)

Thu, Oct 24

Reedy moved T303433: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits from Backlog to Feature Requests/Improvements on the ConfirmEdit (CAPTCHA extension) board.
Thu, Oct 24, 1:23 AM · MediaWiki-Platform-Team (Radar), MW-1.39-notes (1.39.0-wmf.25; 2022-08-15), Stewards-and-global-tools, MediaWiki-extensions-CentralAuth, SecTeam-Processed, Sustainability (Incident Followup), ConfirmEdit (CAPTCHA extension), Platform Engineering, Wikimedia-Site-requests, Security
Dzahn added a parent task for T333143: Move Gerrit data out of root partition: T372804: setup gerrit2003 with gerrit service (gerrit on bookworm).
Thu, Oct 24, 12:42 AM · Sustainability (Incident Followup), Release-Engineering-Team, collaboration-services, Gerrit

Wed, Oct 23

Dzahn added a comment to T378038: create a place (whiteboard) where SRE advertises current site status / things for awareness.

Any SRE can feel free to edit the ticket description if I missed something or to clarify. This was just a follow-up trying to remember from my personal meeting notes.

Wed, Oct 23, 10:58 PM · SRE-OnFire, Sustainability (Incident Followup)
Dzahn updated the task description for T378038: create a place (whiteboard) where SRE advertises current site status / things for awareness.
Wed, Oct 23, 10:56 PM · SRE-OnFire, Sustainability (Incident Followup)
Dzahn created T378038: create a place (whiteboard) where SRE advertises current site status / things for awareness.
Wed, Oct 23, 10:55 PM · SRE-OnFire, Sustainability (Incident Followup)
jijiki moved T376795: mwscript-k8s creates too many resources from Incoming 🐫 to Doing 😎 on the serviceops board.
Wed, Oct 23, 10:56 AM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s

Tue, Oct 22

gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1078989 abandoned by RLazarus:

[operations/puppet@production] do not submit: T376795 cleanups

Reason:

Incident is over, no longer needed

https://gerrit.wikimedia.org/r/1078989

Tue, Oct 22, 11:16 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s

Mon, Oct 21

CDanis closed T344171: Reverse DNS for k8s pods IPs, a subtask of T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes, as Resolved.
Mon, Oct 21, 1:43 PM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
Joe edited parent tasks for T310009: Make it easier to create a new requestctl object, added: T377699: [EPIC] FY 24/25 WE 4.3.7 Roll out a user-friendly web application that enables assisted editing and creation of requestctl rules; removed: T369480: [EPIC] FY 24/25 WE 4.3.4 Improve our existing tooling to allow quicker reaction times to ongoing attacks..
Mon, Oct 21, 9:31 AM · Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), conftool

Fri, Oct 18

akosiaris added a project to T376795: mwscript-k8s creates too many resources: SRE-OnFire.
Fri, Oct 18, 2:39 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s

Oct 15 2024

gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1079314 merged by RLazarus:

[operations/puppet@production] deployment_server: Read Helm secrets in `mwscript-cleanup`

https://gerrit.wikimedia.org/r/1079314

Oct 15 2024, 8:37 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
jhathaway closed T310836: Upgrade Exim to 4.96 as Invalid.

We have migrated to Postfix.

Oct 15 2024, 4:07 PM · SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), Infrastructure-Foundations, Mail
JMeybohm closed T376976: Remove memory limits from critical cluster components (calico) as Resolved.

Alert rules were deployed last week. The limit removal was rolled out today.

Oct 15 2024, 2:44 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
JMeybohm closed T376976: Remove memory limits from critical cluster components (calico), a subtask of T376795: mwscript-k8s creates too many resources, as Resolved.
Oct 15 2024, 2:43 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s

Oct 14 2024

joanna_borun assigned T376005: Juniper: regularly run `request system configuration rescue save` to ayounsi.
Oct 14 2024, 2:51 PM · SRE-OnFire, Sustainability (Incident Followup), Infrastructure-Foundations, netops
hnowlan placed T320398: Expand upon Kask/Sessionstore documentation up for grabs.
Oct 14 2024, 1:50 PM · SRE-Sprint-Week-Sustainability-March2023, serviceops, Sustainability (Incident Followup)
gerritbot added a comment to T376976: Remove memory limits from critical cluster components (calico).

Change #1079459 merged by JMeybohm:

[operations/deployment-charts@master] Remove memory limits from calico components in wikikube clusters

https://gerrit.wikimedia.org/r/1079459

Oct 14 2024, 9:49 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops

Oct 11 2024

JMeybohm added a comment to T376795: mwscript-k8s creates too many resources.

Adding 6k configmaps does not really do anything to calico, cert-manager, helm-state metrics. It might have an impact on the kubelet when the configmaps are actually used, though. I've not tested that but given the fact that this would spread across all kubelets I don't think it's an issue.

Oct 11 2024, 1:48 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
gerritbot added a comment to T376976: Remove memory limits from critical cluster components (calico).

Change #1079457 merged by jenkins-bot:

[operations/alerts@master] Add CalicoHighMemoryUsage alert

https://gerrit.wikimedia.org/r/1079457

Oct 11 2024, 11:30 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
JMeybohm added a comment to T376795: mwscript-k8s creates too many resources.

I've confirmed in the staging cluster that creating 1k network-policies (even if they don't apply to anything) bumps the avg calico-node memory usage from ~130 to ~390MB. CPU usage also increases quite significantly.

Oct 11 2024, 10:29 AM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
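
As a quick back-of-the-envelope reading of the staging numbers above, the short calculation below derives the increase per network policy. It uses only the approximate figures from the comment (~130 MB before, ~390 MB after, 1k policies) and assumes, purely for illustration, that memory grows linearly with the number of policies; that linearity was not stated in the task.

```python
# Approximate figures quoted in the comment above (staging cluster).
baseline_mb = 130   # avg calico-node memory before the test
after_mb = 390      # avg calico-node memory with the extra policies
policies = 1000     # network policies created for the test

delta_mb = after_mb - baseline_mb
# Assumption: linear growth, so the increase is spread evenly over the policies.
per_policy_kb = delta_mb * 1000 / policies  # decimal units: 1 MB = 1000 kB
ratio = after_mb / baseline_mb

print(f"increase: {delta_mb} MB, about {per_policy_kb:.0f} kB per policy")
print(f"memory usage grew by a factor of {ratio:.1f}")
```
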
gerritbot added a comment to T376976: Remove memory limits from critical cluster components (calico).

Change #1079459 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Remove memory limits from calico components in wikikube clusters

https://gerrit.wikimedia.org/r/1079459

Oct 11 2024, 10:12 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
gerritbot added a project to T376976: Remove memory limits from critical cluster components (calico): Patch-For-Review.
Oct 11 2024, 10:06 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
gerritbot added a comment to T376976: Remove memory limits from critical cluster components (calico).

Change #1079457 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Add CalicoHighMemoryUsage alert

https://gerrit.wikimedia.org/r/1079457

Oct 11 2024, 10:05 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
Joe added a comment to T376976: Remove memory limits from critical cluster components (calico).

We've already discussed this in a 1on1 and just for transparency's sake, this finds me in agreement.

Oct 11 2024, 8:11 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
JMeybohm claimed T376976: Remove memory limits from critical cluster components (calico).
Oct 11 2024, 8:05 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1079445 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-script: deduplicate resources

https://gerrit.wikimedia.org/r/1079445

Oct 11 2024, 7:53 AM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1079444 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mediawiki: deduplicate network policies and configmaps

https://gerrit.wikimedia.org/r/1079444

Oct 11 2024, 7:53 AM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
akosiaris added a comment to T376976: Remove memory limits from critical cluster components (calico).

We've already discussed this in a 1on1 and just for transparency's sake, this finds me in agreement. The outage we had this week carried a signal that we don't want to lose: memory usage exploded over the course of a few hours, which in itself is a sign that something else was amiss (which is what T376795: mwscript-k8s creates too many resources is about). At the same time, an outage is the worst possible messenger. So finding some other way to keep the signal, like the alert pointed out above, SGTM.

Oct 11 2024, 7:15 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
JMeybohm edited projects for T376976: Remove memory limits from critical cluster components (calico), added: Kubernetes, Prod-Kubernetes; removed MW-on-K8s.
Oct 11 2024, 7:01 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops
JMeybohm created T376976: Remove memory limits from critical cluster components (calico).
Oct 11 2024, 7:01 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Sustainability (Incident Followup), serviceops

Oct 10 2024

gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1079314 restored by RLazarus:

[operations/puppet@production] deployment_server: Tweak mwscript-cleanup `helm list` pagination

https://gerrit.wikimedia.org/r/1079314

Oct 10 2024, 8:38 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1079314 abandoned by RLazarus:

[operations/puppet@production] deployment_server: Tweak mwscript-cleanup `helm list` pagination

Reason:

Scott correctly points out I've got this the wrong way around -- the more likely issue is I'm iterating over the list while destroying releases, so the offset is invalid. 🤦 Another, better fix to follow.

https://gerrit.wikimedia.org/r/1079314

Oct 10 2024, 8:27 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
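
The failure mode described in the abandon reason (advancing an offset through a list while deleting from it) can be reproduced with a toy model. The sketch below is not the mwscript-cleanup code: an in-memory Python list stands in for the paginated `helm list` output, and the names `list_page`, `cleanup_buggy`, and `cleanup_snapshot` are hypothetical. Snapshotting all pages before deleting is one possible fix, not necessarily the one that was eventually merged.

```python
def list_page(releases, offset, limit):
    """Stand-in for a paginated listing call: one page of release names."""
    return releases[offset:offset + limit]


def cleanup_buggy(releases, limit):
    """Delete page by page while advancing the offset: skips entries."""
    deleted, offset = [], 0
    while True:
        page = list_page(releases, offset, limit)
        if not page:
            break
        for name in page:
            releases.remove(name)  # shrinks the list the offset refers to
            deleted.append(name)
        offset += limit  # now points past entries that shifted forward
    return deleted


def cleanup_snapshot(releases, limit):
    """Collect every page first, then delete: the offsets stay valid."""
    names, offset = [], 0
    while True:
        page = list_page(releases, offset, limit)
        if not page:
            break
        names.extend(page)
        offset += limit
    for name in names:
        releases.remove(name)
    return names


if __name__ == "__main__":
    a = [f"r{i}" for i in range(10)]
    cleanup_buggy(a, limit=3)
    print("left behind by buggy version:", a)

    b = [f"r{i}" for i in range(10)]
    cleanup_snapshot(b, limit=3)
    print("left behind by snapshot version:", b)
```

With ten releases and a page size of three, the first version leaves `r3`, `r4`, `r5`, and `r9` behind, because each deletion shifts the remaining entries toward the front while the offset keeps moving forward.
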
JMeybohm added a comment to T376795: mwscript-k8s creates too many resources.

Turns out the object counts are already in Prometheus. Here's a quick plot on a dashboard: https://grafana.wikimedia.org/goto/u3dyc3kHg?orgId=1

Oct 10 2024, 5:10 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
CDanis added a comment to T376795: mwscript-k8s creates too many resources.

Turns out the object counts are already in Prometheus. Here's a quick plot on a dashboard: https://grafana.wikimedia.org/goto/u3dyc3kHg?orgId=1

Oct 10 2024, 4:34 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1079314 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Tweak mwscript-cleanup `helm list` pagination

https://gerrit.wikimedia.org/r/1079314

Oct 10 2024, 4:06 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s

Oct 9 2024

RLazarus added a comment to T376776: mw-scripts SAL integration.

A mwscript-k8s flag to log to SAL is on my to-do list -- I hadn't gotten around to filing a task, thanks.

Oct 9 2024, 8:48 PM · Sustainability (Incident Followup), MW-on-K8s, serviceops
gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1078988 merged by RLazarus:

[operations/puppet@production] deployment_server: Add `helm list` pagination to mwscript-cleanup

https://gerrit.wikimedia.org/r/1078988

Oct 9 2024, 5:20 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
gerritbot added a comment to T376795: mwscript-k8s creates too many resources.

Change #1078989 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] do not submit: T376795 cleanups

https://gerrit.wikimedia.org/r/1078989

Oct 9 2024, 5:14 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s