Kubernetes 1.26 upgrade #2458

tfriedel · 2023-06-13T12:47:42Z

This PR updates cortex to Kubernetes 1.26 and also updates most components to newer versions as described in versions.md.

An attempt to upgrade to Kubernetes 1.27 was made, but it was unsuccessful because of an open issue of the amazon-vpc-cni-k8s plugin with the Prometheus adapter.

Notes:

dockerd is no longer supported as of k8s 1.24 and was replaced with containerd. The only place Docker was used in the k8s cluster was to check whether images can be fetched. We disabled this check because it's not essential and porting it to containerd would be non-trivial. If someone wants to fix this, please feel free to submit a patch.
We created AMI mappings using go run build/generate_ami_mapping.go manager/manifests/ami.json public. However, our AWS account cannot access all regions, so we had to comment out the regions that were not accessible to us. Again, if someone wants to submit a patch for this, it would be appreciated.
New EC2 instance types were added, but servicequotas.go and validateInstanceType() were not touched. Anyone interested in using the newer instance types may have to look into this. To be on the safe side, don't use the new instance types for now.
cluster-autoscaler was forked and patches were applied to a recent version. We switched to the autoscaling/v2 API (from v2beta1 / v2beta2).
We had to install an EBS CSI driver.
AWS started using minimal base images that don't include shell commands like cat, tar, sh, etc. (e.g. kube-proxy:v1.26.2-minimal-eksbuild.1). We had to adapt some scripts because of that and get configuration directly from Kubernetes instead of the file system.
We ran into this issue: CustomResourceDefinition.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes. We fixed it by using kubectl apply --server-side.
In the linter script we had to disable looppointer, as it was giving us errors we couldn't resolve, such as:
looppointer: internal error: package "math" without types was imported
During creation of a cluster there are a few error messages that look like:
￮ configuring metrics E0610 20:44:44.408766 1054 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0610 20:44:44.563268
We assume it's a false alarm as metric collection seems to work, but if anyone has any insight into this, please let us know.
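For anyone hitting the same "Too long: must have at most 262144 bytes" error on a large CRD: as far as we understand it, client-side kubectl apply stores a full copy of the manifest in the kubectl.kubernetes.io/last-applied-configuration annotation, and Kubernetes caps the total size of all annotations at 256 KiB. Server-side apply does not write that annotation. The sketch below is only an estimate of whether a manifest would fit; the function names are made up for illustration and the JSON encoding is an approximation of what kubectl stores.

```python
import json

# Kubernetes rejects objects whose annotations exceed 256 KiB in total
# (sum of key and value lengths); this is the 262144 in the error message.
ANNOTATION_LIMIT_BYTES = 256 * 1024

# Annotation that client-side `kubectl apply` writes on the object.
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def annotations_size(annotations: dict) -> int:
    """Total size in bytes of all annotation keys and values."""
    return sum(len(k.encode()) + len(v.encode()) for k, v in annotations.items())


def client_side_apply_fits(manifest: dict) -> bool:
    """Estimate whether client-side apply would stay under the limit.

    Hypothetical helper: it approximates the stored annotation with a
    compact JSON dump of the manifest, which may differ slightly from
    the exact bytes kubectl writes.
    """
    annotations = dict(manifest.get("metadata", {}).get("annotations", {}))
    annotations[LAST_APPLIED] = json.dumps(manifest, separators=(",", ":"))
    return annotations_size(annotations) <= ANNOTATION_LIMIT_BYTES
```

If this returns False for a manifest (as it would for the Prometheus CRD above), applying it with kubectl apply --server-side avoids the oversized annotation.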

We ran 'make lint' and 'make test' and did manual testing with our model server over 2 days and have not noticed any issues yet. We did not run the e2e tests in the Makefile.

For some reason the CircleCI script doesn't find the linter, even though it's installed and the PATH modification also looks correct. If someone knows how to fix it, please let us know.

Please test this thoroughly yourself before using it in production.

To use this version you will have to build self-hosted images. Follow the steps in CONTRIBUTING.md up to "make images-all" and also read self-hosted-images.
Use go1.20.4 linux/amd64. A user tried go 1.21 and it didn't work with this version.
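Since the Go version matters here, a small pre-build check can save some time. This is just a sketch: it parses the text printed by go version and compares it with the 1.20 series mentioned above. The helper names and the idea of treating only 1.20.x as known-good are assumptions for illustration, not something in the repo.

```python
import re

# Minor release the PR description says was used; treating it as the only
# known-good series is an assumption made for this sketch.
KNOWN_GOOD = (1, 20)


def parse_go_version(output: str) -> tuple:
    """Extract (major, minor, patch) from `go version` output."""
    match = re.search(r"\bgo(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"unrecognized go version output: {output!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_known_good(output: str) -> bool:
    """True if the toolchain is in the 1.20 series."""
    return parse_go_version(output)[:2] == KNOWN_GOOD
```

The input would be the line printed by go version, e.g. "go version go1.20.4 linux/amd64".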

Edit: we have been running this version in production for a week and have not noticed any problems. We only use the realtime API.

checklist:

run make test and make lint
test manually (i.e. build/push all images, restart operator, and re-deploy APIs)
update examples
update docs and add any new files to summary.md (view in gitbook after merging)
cherry-pick into release branches if applicable
alert the dev team if the dev environment changed

CLAassistant · 2023-06-13T12:47:48Z

All committers have signed the CLA.

aleksandr-smechov · 2023-06-13T15:27:30Z

Awesome work 😎 thanks for keeping the project alive! I'll test this out on our setup this week.

tfriedel added 30 commits May 31, 2023 16:39

went through most of the steps in versions.md up to Non-versioned modules (included)

87f2e34

updated nvidia device plugin to 0.14.0

e7ae142

updated Neuron device plugin and scheduler

380a25c

updated prometheus dcgm exporter, statsd exporter, prometheus, prometheus operator / config reloader, fluentbit, cluster autoscaler, metrics server, neuron device plugin and scheduler

98d2ab3

incorporated changes in Prometheus Kubelet Exporter

ea5a449

updated Prometheus kube-state-metrics Exporter

80823c9

updated prometheus node exporter to 1.5.0

1aa8e19

update grafana from 8.0.4 to 9.5.2

98a0e52

updated event exporter

7a2c9e5

updated to alpine 3.18

02d5295

updated Python client dependencies to python 3.7 and newest ver of sentry sdk and pyyaml

1242e73

replaced Handler with ProbeHandler

f83a91c

changed cortex version to 0.42.2

9f2eedc

newest istio go-client changed to pointers

bd5402e

resolved some conflicting packages related to controller-runtime, istio and gnostic

7da2f36

changed to kubernetes 1.26 because of incompatibility of amazon-vpc-cni-k8s@v1.12.6 with controller-runtime

38e64a9

tests are passing now

c168ca1

formatting

4d5ca1a

update of go tools

16b652c

some fixes. golang version set to same as in other files

5633900

linter: disabled looppointer because of some errors I couldn't fix. disabled checking for license comments

774986d

set AMI Family to AmazonLinux2 (this was used in previous version per default) and set container-runtime to containerd, as dockerd is not supported after kubernetes 1.24

a097eda

downgraded cluster-autoscaler to a version for kubernetes 1.26

aca7d70

downgraded prometheus operator to 0.62 because of some issues

7963093

use autoscaling v2 instead of v2beta2 + other fixes

d639980

remove mounting of /var/run/docker.sock because we switched from dockerd to containerd. While the cluster now starts, we can't use cortex deploy because it requires docker. Need to find a way to give it access to docker.

5384483

k8s-device-plugin:v0.14.0-ubuntu20.04 -> k8s-device-plugin:v0.14.0

4f5b109

fix: nvidia device plugin was not installed

c594507

updated go version in CONTRIBUTING.md

f89cf45

disabled use of docker for validating docker images as it's not supported with containerd instead of dockerd

7d63258

get yaml, go-input and autoscaler from our fork at PEAT-AI

8858679

tfriedel added 3 commits June 13, 2023 15:32

formatting / make linter happy

1cf0e42

removed unnecessary file

6d56744

replaced all remaining references from k8s.gcr.io to registry.k8s.io

a1435ea

tfriedel marked this pull request as ready for review June 13, 2023 21:00

tfriedel added 4 commits June 16, 2023 20:03

fix for "failed to create iamserviceaccount(s)" in case two clusters are created in the same account. We now create roles per cluster.

81fb67b

fix for ListTagsLogGroup, which has been deprecated

a388d20

update python in manager to ver 3.10 because boto3 for 3.7 is deprecated

08c47ca

updated ami for gpu and updated nvidia device plugin from 0.14.0 to 0.15.0

98b59ef
