Milestone for Data Platform SRE work
Details
Oct 18 2024
Oct 15 2024
Change #1080265 merged by Btullis:
[operations/puppet@production] Revert "Lower the number of slots that the enwiki dump uses"
Change #1080265 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Revert "Lower the number of slots that the enwiki dump uses"
Sep 27 2024
Sep 17 2024
Sep 16 2024
Sep 11 2024
Sep 10 2024
Sep 9 2024
We have replicated the original Airflow dashboard, but for instances running in Kubernetes.
Change #1071213 merged by Brouberol:
[operations/deployment-charts@master] airflow: broaden collected metrics and tag them correctly
Change #1071619 merged by Brouberol:
[operations/deployment-charts@master] airflow: add missing configuration allowing it to read connnections from disk
Change #1071619 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: add missing configuration allowing it to read connnections from disk
Sep 6 2024
Change #1071213 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: broaden collected metrics and tag them correctly
Change #1071153 merged by jenkins-bot:
[operations/deployment-charts@master] airflow: enable visualizing logs of DAG runs in the webserver UI
Change #1071153 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: enable visualizing logs of DAG runs in the webserver UI
The anycast VIPs have now been added to the five realservers by bird.
btullis@cumin1002:~$ sudo cumin --no-progress A:cephosd 'ip a sh | egrep "(10.3.0.8|2a02:ec80:ff00:101::8)"'
5 hosts will be targeted:
cephosd[1001-1005].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====
(5) cephosd[1001-1005].eqiad.wmnet
----- OUTPUT of 'ip a sh | egrep ...80:ff00:101::8)"' -----
    inet 10.3.0.8/32 scope global lo:anycast
    inet6 2a02:ec80:ff00:101::8/128 scope global
================
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'ip a sh | egrep ...80:ff00:101::8)"'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
btullis@cumin1002:~$
Puppet runs cleanly and bird is active, so I think we are ready to proceed to the next step, namely:
Once it's looking good on the Bird/server side we can set the "bgp" flag for those hosts to 'true' in Netbox, and run Homer against the switches which should cause the routes to be announced.
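For anyone who wants to script the VIP check shown in the cumin output earlier, here is a minimal sketch in Python. It assumes you already have the text of `ip a sh` (or the filtered lines) from a host. The function name `vips_present` and the `EXPECTED_VIPS` constant are hypothetical and not part of any existing tooling; only the two addresses come from the command above.

```python
import ipaddress
import re

# The two anycast VIPs from the cumin command above.
EXPECTED_VIPS = ("10.3.0.8", "2a02:ec80:ff00:101::8")


def vips_present(ip_output, expected=EXPECTED_VIPS):
    """Map each expected VIP to True/False depending on whether it is
    configured in the given `ip a sh` output (hypothetical helper)."""
    found = set()
    for line in ip_output.splitlines():
        # Address lines look like: "inet 10.3.0.8/32 scope global lo:anycast"
        match = re.match(r"\s*inet6?\s+(\S+)", line)
        if not match:
            continue
        try:
            found.add(ipaddress.ip_interface(match.group(1)).ip)
        except ValueError:
            # Not a parseable address/prefix; ignore the line.
            continue
    return {vip: ipaddress.ip_address(vip) in found for vip in expected}


sample = """\
    inet 10.3.0.8/32 scope global lo:anycast
    inet6 2a02:ec80:ff00:101::8/128 scope global
"""
result = vips_present(sample)
print(result)
```

Parsing the addresses rather than pattern-matching them means a neighbouring address such as 10.3.0.80 is not mistaken for the VIP, which an unanchored egrep pattern would allow.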
Change #1071138 merged by Brouberol:
[operations/deployment-charts@master] airflow: configure metrics collection
Change #1071138 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: configure metrics collection
Change #1070950 merged by Btullis:
[operations/puppet@production] Add the anycast VIP for radosgw to DPE Ceph servers
Change #1070949 merged by Btullis:
[operations/puppet@production] Enable IPv6 for the envoyproxy on DPE Ceph servers
Change #1071077 merged by Brouberol:
[operations/deployment-charts@master] airflow: fix badly formatted Deployment separation
Change #1071077 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: fix badly formatted Deployment separation
Change #1070619 merged by Brouberol:
[operations/deployment-charts@master] airflow: deploy the scheduler via a separate Deployment
Sep 5 2024
Mentioned in SAL (#wikimedia-operations) [2024-09-05T21:57:55Z] <inflatador> bking@grafana1002 apply grizzly SLO dashboard updates slo-Search added slo-apigw updated P68729 T328330
Change #1060150 merged by Bking:
[operations/grafana-grizzly@master] search: add search update lag SLO
Just an update as I've been doing quite a bit of research: