Migrate Search Platform-owned helm charts to Calico Network Policies
Open, Medium, Public

Description

The following Helm charts (and their respective applications) need to be updated to use the new Calico network policies:

  • flink-app
    • cirrus-streaming-updater
    • mw-page-content-change-enrich
    • rdf-streaming-updater
  • flink-operator

Creating this ticket to:

  1. Update the charts
  2. Redeploy the applications

Event Timeline

bking updated the task description.
Gehel triaged this task as Medium priority. Mon, Aug 26, 3:40 PM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.

Change #1071648 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-app/rdf-streaming-updater: add calico network policies

https://gerrit.wikimedia.org/r/1071648

Change #1071648 merged by Bking:

[operations/deployment-charts@master] flink-app/rdf-streaming-updater: add calico network policies

https://gerrit.wikimedia.org/r/1071648

Change #1071706 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-app/rdf-streaming-updater: remove rdf-specific changes

https://gerrit.wikimedia.org/r/1071706

A few observations from pairing session with @RKemper:

  • Removing

        kafka:
          allowed_clusters:
            - main-eqiad

    from deployment-charts/helmfile.d/services/rdf-streaming-updater/values-staging-commons.yaml and running helmfile -i -e staging --selector name=commons apply wasn't enough; we also had to delete the existing pod to get the network policies to properly apply. However, we then flipped the change back and forth and re-applied twice, and the policies took effect immediately each time. So it's something to be aware of when we roll out this change in production, but not a huge deal.

Crossed out the above as inaccurate after another staging deployment today. Today's deployment failed due to incorrect network policies, but Balthazar's script helped me diagnose them quickly.

It appears we have a mismatch in labels between the calico policy, which uses selector: app == 'flink-app', and the pods created by flink-operator, which are labeled app=flink-app-${RELEASE}. I'm guessing it will be possible to work around this in the flink chart; will start looking now.
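To illustrate why the two never line up, here is a minimal Python sketch of an equality-only selector check. It handles only `key == 'value'` terms joined by `&&`, a small subset of Calico's selector syntax. The helper name `selector_matches` and the concrete label value `flink-app-commons` (release `commons` substituted into `flink-app-${RELEASE}`) are assumptions for illustration, not taken from the chart.

```python
import re

# One "key == 'value'" term. Real Calico selectors support more operators
# (!=, in, has(), ||, ...), which this sketch deliberately ignores.
_TERM = re.compile(r"^\s*([A-Za-z0-9_./-]+)\s*==\s*'([^']*)'\s*$")


def selector_matches(selector: str, labels: dict) -> bool:
    """Return True if every `key == 'value'` term matches the labels."""
    for term in selector.split("&&"):
        m = _TERM.match(term)
        if m is None:
            raise ValueError(f"unsupported selector term: {term!r}")
        key, value = m.groups()
        if labels.get(key) != value:
            return False
    return True


# Hypothetical labels for a pod created by flink-operator for release "commons".
pod_labels = {"app": "flink-app-commons"}

print(selector_matches("app == 'flink-app'", pod_labels))          # False
print(selector_matches("app == 'flink-app-commons'", pod_labels))  # True
```

Because the comparison is an exact string match, a policy selecting `app == 'flink-app'` silently selects no pods rather than failing loudly.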

Change #1071706 merged by Bking:

[operations/deployment-charts@master] flink-app/rdf-streaming-updater: remove rdf-specific changes

https://gerrit.wikimedia.org/r/1071706

Change #1071936 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: switch to calico-based network policies

https://gerrit.wikimedia.org/r/1071936

Change #1072236 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-app: create a new label for selecting Calico network policies

https://gerrit.wikimedia.org/r/1072236

Change #1072243 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: switch to calico-based network policies

https://gerrit.wikimedia.org/r/1072243

Change #1071936 abandoned by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: switch to calico-based network policies

Reason:

Superseded by 1072243

https://gerrit.wikimedia.org/r/1071936

Update: I deployed the latest patchset to staging and I can confirm that the selector is getting applied correctly to the Calico NetworkPolicy CRDs:

kubectl get networkpolicies.crd.projectcalico.org flink-app-commons-egress-external-services-kafka -o yaml | grep selector

  selector: chart-name == 'flink-app' && release == 'commons'

However, the expected label chart-name does not exist on the pods. Will try a few more things and update this ticket...
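A quick way to spot this kind of gap is to compare the label keys a selector relies on against what the pods actually carry. The Python sketch below does that over the JSON structure returned by `kubectl get pods -o json`. The function name `pods_missing_labels`, the pod name, and the sample labels are hypothetical; in particular, whether the real pods carry a `release` label is an assumption.

```python
import json


def pods_missing_labels(pod_list: dict, required: list) -> dict:
    """Map pod name -> required label keys that are absent from that pod."""
    missing = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        labels = meta.get("labels") or {}
        absent = [key for key in required if key not in labels]
        if absent:
            missing[meta.get("name", "<unnamed>")] = absent
    return missing


# Hypothetical, trimmed-down stand-in for `kubectl get pods -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "flink-app-commons-example",
                "labels": {"app": "flink-app-commons", "release": "commons"}}}
]}
""")

print(pods_missing_labels(sample, ["chart-name", "release"]))
# {'flink-app-commons-example': ['chart-name']}
```

Against a live cluster, the `sample` value could be replaced with `json.load(sys.stdin)` and the script fed from `kubectl get pods -o json`.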

Mentioned in SAL (#wikimedia-operations) [2024-09-11T21:48:19Z] <inflatador> bking@deploy1003 test deploying flink operator in staging T373195

Mentioned in SAL (#wikimedia-operations) [2024-09-11T21:56:20Z] <inflatador> bking@deploy1003 test deploy of flink operator in staging cancelled with no changes T373195

@RKemper and I tried to manually add the "chart-name" label to running pods. This seemed to properly link the network policy to the pod, but it didn't actually seem to open up the firewall.

Going back to @Ottomata's comment in the Flink Operator network policies, I think the problem is how the operator creates and labels pods when it creates a FlinkDeployment. I'm guessing we might have to change something in the flink operator chart as opposed to making changes to the pods within the flink-app chart.

Per pairing with @brouberol , we were able to change the calico policy selector *without* changing the flink operator chart. The change is confirmed working in staging. However, we will need to roll this out to production carefully (taking a savepoint/checkpoint before updating each environment).
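For the savepoint step, one possible approach is sketched below in Python. It assumes the Flink Kubernetes Operator's `spec.job.savepointTriggerNonce` field is used to request a manual savepoint (the operator acts when the value changes); this ticket does not say which mechanism was planned. The helper name `savepoint_trigger_patch` and the use of a timestamp as the nonce are assumptions.

```python
import json
import time


def savepoint_trigger_patch(nonce: int) -> str:
    """Build a JSON merge patch that sets a new savepoint trigger nonce."""
    return json.dumps({"spec": {"job": {"savepointTriggerNonce": nonce}}})


# Any value different from the previous nonce works; a timestamp is one
# simple (assumed) choice.
nonce = int(time.time())
print(savepoint_trigger_patch(nonce))
```

The printed patch could be passed to `kubectl patch` with `--type merge` against the relevant FlinkDeployment, or the same field could be set through the chart's values and applied with helmfile, which is closer to how the other changes in this task were deployed.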

Change #1072597 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: trigger a savepoint before firewall changes

https://gerrit.wikimedia.org/r/1072597

Change #1072236 merged by jenkins-bot:

[operations/deployment-charts@master] flink-app: customize calico label selector

https://gerrit.wikimedia.org/r/1072236

Change #1073842 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: remove references to old-style network policies

https://gerrit.wikimedia.org/r/1073842

Change #1072243 merged by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: switch to calico-based network policies

https://gerrit.wikimedia.org/r/1072243

Change #1074090 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-update: enable calico network policies

https://gerrit.wikimedia.org/r/1074090

Change #1074091 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-updater: disable legacy network policies for kafka

https://gerrit.wikimedia.org/r/1074091

Change #1073842 abandoned by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: remove references to old-style network policies

Reason:

we already did this in I6a040e53d9fb21b6d0f6cae6b3c9fa9ef64633c6

https://gerrit.wikimedia.org/r/1073842

Change #1072597 abandoned by Bking:

[operations/deployment-charts@master] rdf-streaming-updater: trigger a savepoint before firewall changes

Reason:

Successfully migrated without a savepoint

https://gerrit.wikimedia.org/r/1072597

Change #1074090 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-update: enable calico network policies

https://gerrit.wikimedia.org/r/1074090

Change #1074091 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: disable legacy network policies for kafka

https://gerrit.wikimedia.org/r/1074091