

Data-Platform-SRE (2024.08.17 - 2024.09.06) · Milestone
Archived · Public

Members (6)

Watchers

  • This project does not have any watchers.

Details

Description

Milestone for Data Platform SRE work

Recent Activity

Oct 18 2024

Maintenance_bot removed a project from T363001: Create a helm chart for airflow that is appropriate to our needs: Patch-For-Review.
Oct 18 2024, 4:30 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
brouberol added a subtask for T363001: Create a helm chart for airflow that is appropriate to our needs: T368738: Add pgbouncer to our airflow helm chart.
Oct 18 2024, 4:05 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
brouberol removed a subtask for T368737: Deploy airflow scheduler via helm chart: T368738: Add pgbouncer to our airflow helm chart.
Oct 18 2024, 4:05 PM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)

Oct 15 2024

Maintenance_bot removed a project from T373904: Lower the available slots for the dump of enwiki to lower pressure on databases: Patch-For-Review.
Oct 15 2024, 1:31 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06), Dumps-Generation
gerritbot added a comment to T373904: Lower the available slots for the dump of enwiki to lower pressure on databases.

Change #1080265 merged by Btullis:

[operations/puppet@production] Revert "Lower the number of slots that the enwiki dump uses"

https://gerrit.wikimedia.org/r/1080265

Oct 15 2024, 1:29 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06), Dumps-Generation
gerritbot added a project to T373904: Lower the available slots for the dump of enwiki to lower pressure on databases: Patch-For-Review.
Oct 15 2024, 11:18 AM · Data-Platform-SRE (2024.08.17 - 2024.09.06), Dumps-Generation
gerritbot added a comment to T373904: Lower the available slots for the dump of enwiki to lower pressure on databases.

Change #1080265 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Revert "Lower the number of slots that the enwiki dump uses"

https://gerrit.wikimedia.org/r/1080265

Oct 15 2024, 11:18 AM · Data-Platform-SRE (2024.08.17 - 2024.09.06), Dumps-Generation

Sep 27 2024

BTullis edited parent tasks for T368737: Deploy airflow scheduler via helm chart, added: T364389: Migrate the airflow scheduler components to Kubernetes; removed: T364387: Adapt Airflow auth and DAG deployment method.
Sep 27 2024, 3:39 PM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)

Sep 17 2024

RKemper closed T373145: Create new service catalog entries for wdqs-main and wdqs-scholarly, a subtask of T364368: Create separate pybal pools for wdqs graph split (main vs scholarly), as Resolved.
Sep 17 2024, 5:45 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06), Patch-For-Review, Discovery-Search, Wikidata-Query-Service, Wikidata

Sep 16 2024

brouberol closed T372281: Configure scheduled backups and WAL archiving to use our S3 endpoint, a subtask of T330152: Deploy ceph radosgw processes to data-engineering cluster, as Resolved.
Sep 16 2024, 12:13 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)

Sep 11 2024

brouberol closed T372787: Implement S3 based logging for Airflow tasks on dse-k8s, a subtask of T369634: Decide how to do DAG logging on dse-k8s, as Resolved.
Sep 11 2024, 12:06 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)

Sep 10 2024

Maintenance_bot removed a project from T330152: Deploy ceph radosgw processes to data-engineering cluster: Patch-For-Review.
Sep 10 2024, 2:30 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
BTullis added a parent task for T330152: Deploy ceph radosgw processes to data-engineering cluster: T324660: Install Ceph Cluster for Data Platform Engineering.
Sep 10 2024, 2:06 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)

Sep 9 2024

Maintenance_bot removed a project from T373026: Configure an Airflow/Datahub connection: Patch-For-Review.
Sep 9 2024, 2:31 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
brouberol added a comment to T369098: Enable metrics collection for airflow-k8s.

We have replicated the original Airflow dashboard, but for instances running in Kubernetes.

Sep 9 2024, 2:12 PM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T369098: Enable metrics collection for airflow-k8s.

Change #1071213 merged by Brouberol:

[operations/deployment-charts@master] airflow: broaden collected metrics and tag them correctly

https://gerrit.wikimedia.org/r/1071213

Sep 9 2024, 1:56 PM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T373026: Configure an Airflow/Datahub connection.

Change #1071619 merged by Brouberol:

[operations/deployment-charts@master] airflow: add missing configuration allowing it to read connections from disk

https://gerrit.wikimedia.org/r/1071619

Sep 9 2024, 1:50 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a project to T373026: Configure an Airflow/Datahub connection: Patch-For-Review.
Sep 9 2024, 1:48 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T373026: Configure an Airflow/Datahub connection.

Change #1071619 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: add missing configuration allowing it to read connections from disk

https://gerrit.wikimedia.org/r/1071619

Sep 9 2024, 1:48 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)

Sep 6 2024

gerritbot added a comment to T369098: Enable metrics collection for airflow-k8s.

Change #1071213 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: broaden collected metrics and tag them correctly

https://gerrit.wikimedia.org/r/1071213

Sep 6 2024, 2:04 PM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
Maintenance_bot removed a project from T373000: Suffix all our docker image tags with their sha256 checksum: Patch-For-Review.
Sep 6 2024, 1:31 PM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T368737: Deploy airflow scheduler via helm chart.

Change #1071153 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: enable visualizing logs of DAG runs in the webserver UI

https://gerrit.wikimedia.org/r/1071153

Sep 6 2024, 10:10 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
Gehel archived Data-Platform-SRE (2024.08.17 - 2024.09.06).
Sep 6 2024, 9:45 AM
gerritbot added a comment to T368737: Deploy airflow scheduler via helm chart.

Change #1071153 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: enable visualizing logs of DAG runs in the webserver UI

https://gerrit.wikimedia.org/r/1071153

Sep 6 2024, 9:37 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
BTullis added a comment to T330153: Configure Anycast load-balancing ceph radosgw services on the data-engineering cluster.

The anycast VIPs have now been added to the five realservers by bird.

btullis@cumin1002:~$ sudo cumin --no-progress A:cephosd 'ip a sh | egrep "(10.3.0.8|2a02:ec80:ff00:101::8)"'
5 hosts will be targeted:
cephosd[1001-1005].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====
(5) cephosd[1001-1005].eqiad.wmnet
----- OUTPUT of 'ip a sh | egrep ...80:ff00:101::8)"' -----
    inet 10.3.0.8/32 scope global lo:anycast
    inet6 2a02:ec80:ff00:101::8/128 scope global 
================
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'ip a sh | egrep ...80:ff00:101::8)"'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
btullis@cumin1002:~$
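
The egrep pattern in the check above uses unescaped dots and no anchoring, so it would also match a near-miss such as 10.3.0.80. As a minimal sketch only (not part of the original task comment), the same verification could be done with exact address comparison. The function name `vips_present` is hypothetical; the two VIP addresses are the ones shown in the output above, and the input is assumed to be the text printed by `ip a sh` on one host.

```python
import ipaddress
import re

# The two anycast VIPs from the cumin output above.
VIPS = ("10.3.0.8", "2a02:ec80:ff00:101::8")


def vips_present(ip_output: str, vips=VIPS) -> dict:
    """Return {vip: True/False} for each VIP, based on `ip a sh` output.

    Hypothetical helper: addresses are parsed and compared exactly,
    so 10.3.0.80 is not mistaken for 10.3.0.8.
    """
    found = set()
    for line in ip_output.splitlines():
        # Lines of interest look like "inet 10.3.0.8/32 scope ..." or
        # "inet6 2a02:.../128 scope ...".
        match = re.match(r"\s*inet6?\s+(\S+)", line)
        if not match:
            continue
        address = match.group(1).split("/")[0]  # drop the prefix length
        try:
            found.add(ipaddress.ip_address(address))
        except ValueError:
            continue  # not a parseable address, ignore the line
    return {vip: ipaddress.ip_address(vip) in found for vip in vips}
```

Feeding it the two `inet`/`inet6` lines from the output above reports both VIPs as present, while a line carrying only 10.3.0.80/32 reports both as absent.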

Puppet runs cleanly and bird is active, so I think that we are ready to proceed to the next step, namely:

Once it's looking good on the Bird/server side we can set the "bgp" flag for those hosts to 'true' in Netbox, and run Homer against the switches which should cause the routes to be announced.

Sep 6 2024, 9:27 AM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
BTullis renamed T372787: Implement S3 based logging for Airflow tasks on dse-k8s from Implement S3 based logging for Airflow on dse-k8s to Implement S3 based logging for Airflow tasks on dse-k8s.
Sep 6 2024, 9:13 AM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
brouberol moved T369098: Enable metrics collection for airflow-k8s from In Progress to Done on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board.
Sep 6 2024, 9:13 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
brouberol closed T369098: Enable metrics collection for airflow-k8s as Resolved.
Sep 6 2024, 9:13 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T369098: Enable metrics collection for airflow-k8s.

Change #1071138 merged by Brouberol:

[operations/deployment-charts@master] airflow: configure metrics collection

https://gerrit.wikimedia.org/r/1071138

Sep 6 2024, 9:10 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T369098: Enable metrics collection for airflow-k8s.

Change #1071138 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: configure metrics collection

https://gerrit.wikimedia.org/r/1071138

Sep 6 2024, 8:52 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T330153: Configure Anycast load-balancing ceph radosgw services on the data-engineering cluster.

Change #1070950 merged by Btullis:

[operations/puppet@production] Add the anycast VIP for radosgw to DPE Ceph servers

https://gerrit.wikimedia.org/r/1070950

Sep 6 2024, 8:50 AM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
brouberol moved T373837: Create airflow dags specific to data platform SRE to help us test our setup from In Progress to Done on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board.
Sep 6 2024, 8:46 AM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
brouberol closed T373837: Create airflow dags specific to data platform SRE to help us test our setup as Resolved.
Sep 6 2024, 8:46 AM · Data-Platform-SRE (2024.08.17 - 2024.09.06)
brouberol moved T368737: Deploy airflow scheduler via helm chart from In Progress to Done on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board.
Sep 6 2024, 8:46 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
brouberol closed T368737: Deploy airflow scheduler via helm chart as Resolved.
Sep 6 2024, 8:46 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T330153: Configure Anycast load-balancing ceph radosgw services on the data-engineering cluster.

Change #1070949 merged by Btullis:

[operations/puppet@production] Enable IPv6 for the envoyproxy on DPE Ceph servers

https://gerrit.wikimedia.org/r/1070949

Sep 6 2024, 8:33 AM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
gerritbot added a comment to T368737: Deploy airflow scheduler via helm chart.

Change #1071077 merged by Brouberol:

[operations/deployment-charts@master] airflow: fix badly formatted Deployment separation

https://gerrit.wikimedia.org/r/1071077

Sep 6 2024, 7:37 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T368737: Deploy airflow scheduler via helm chart.

Change #1071077 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: fix badly formatted Deployment separation

https://gerrit.wikimedia.org/r/1071077

Sep 6 2024, 7:36 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
gerritbot added a comment to T368737: Deploy airflow scheduler via helm chart.

Change #1070619 merged by Brouberol:

[operations/deployment-charts@master] airflow: deploy the scheduler via a separate Deployment

https://gerrit.wikimedia.org/r/1070619

Sep 6 2024, 7:20 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)
CodeReviewBot added a comment to T368737: Deploy airflow scheduler via helm chart.

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/821

Sep 6 2024, 7:17 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06)

Sep 5 2024

RKemper changed the status of T372818: Verify and correct WDQS Grafana dashboard with graph split from Open to In Progress.
Sep 5 2024, 10:03 PM · Data-Platform-SRE (2024.09.28 - 2024.10.18), Wikidata-Query-Service, Wikidata
Stashbot added a comment to T328330: Create SLI / SLO on Search update lag.

Mentioned in SAL (#wikimedia-operations) [2024-09-05T21:57:55Z] <inflatador> bking@grafana1002 apply grizzly SLO dashboard updates slo-Search added slo-apigw updated P68729 T328330

Sep 5 2024, 9:59 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27), Patch-For-Review, Discovery-Search (Current work)
gerritbot added a comment to T328330: Create SLI / SLO on Search update lag.

Change #1060150 merged by Bking:

[operations/grafana-grizzly@master] search: add search update lag SLO

https://gerrit.wikimedia.org/r/1060150

Sep 5 2024, 9:50 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27), Patch-For-Review, Discovery-Search (Current work)
bking updated the task description for T372416: Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers.
Sep 5 2024, 9:01 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
bking added a comment to T372416: Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers.

Just an update as I've been doing quite a bit of research:

Sep 5 2024, 8:24 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
bking claimed T372416: Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers.
Sep 5 2024, 7:28 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
bking moved T372416: Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers from Backlog - operations to In Progress on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board.
Sep 5 2024, 7:28 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
bking moved T373895: Reduce frequency of garbage collection alerts on cloudelastic from Backlog - operations to In Progress on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board.
Sep 5 2024, 7:28 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27)
bking moved T362922: Audit/consider enabling CPU performance governor on DPE SRE-owned hosts from In Progress to Backlog - operations on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board.
Sep 5 2024, 7:28 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Patch-For-Review
VirginiaPoundstone edited projects for T361347: Add documentation related to the kubernetes deployment to the MPIC service page, added: Data Products (Data Products Sprint 19); removed: Data Products (Data products Sprint 18).
Sep 5 2024, 7:08 PM · Data-Platform-SRE (2024.09.06 - 2024.09.27), Data Products (Data Products Sprint 19), Metrics Platform