
Troubleshooting

How to address commonly encountered KEDA issues

Kubernetes control plane is unable to communicate with the Metrics server?

If, while setting up KEDA, you get an error: (v1beta1.external.metrics.k8s.io) status FailedDiscoveryCheck with a message: failing or missing response from https://POD-IP:6443/apis/external.metrics.k8s.io/v1beta1: Get "https://POD-IP:6443/apis/external.metrics.k8s.io/v1beta1": Address is not allowed.

One of the reasons for this can be the CNI in use, such as Cilium or another network plugin.

Before you start

  • Make sure no network policies are blocking traffic and that the required CIDRs are allowed
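
For example, you can list all NetworkPolicy objects in the cluster and review them:

kubectl get networkpolicy --all-namespaces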

Check the status:

Find the api service name for the service keda/keda-metrics-apiserver:

kubectl get apiservice --all-namespaces

Check the status of the api service found in the previous step:

kubectl get apiservice <apiservicename> -o yaml

Example:

kubectl get apiservice v1beta1.external.metrics.k8s.io -o yaml

If the status is False, there is an issue, and the network is likely the primary reason for it.
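
As a shortcut, the following prints only the Available condition status, which should be True on a healthy installation:

kubectl get apiservice v1beta1.external.metrics.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'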

Solution for managed Kubernetes services:

In managed Kubernetes services you might solve the issue by updating the deployment of the metrics API server as below:

    dnsPolicy: ClusterFirst
    hostNetwork: true

E.g. modify useHostNetwork in the values file.
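
A sketch of the same change via Helm; the exact value key depends on your chart version (check its values.yaml), and the release and repo names below are assumptions:

helm upgrade keda kedacore/keda --namespace keda --set useHostNetwork=true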

Why is Kubernetes unable to get metrics from KEDA?

If, while setting up KEDA, you get an error: (v1beta1.external.metrics.k8s.io) status FailedDiscoveryCheck with a message: no response from https://ip:443: Get https://ip:443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).

One of the reasons for this can be that you are behind a proxy network.

Before you start

  • Make sure no network policies are blocking traffic

Check the status

Find the api service name for the service keda/keda-metrics-apiserver:

kubectl get apiservice --all-namespaces

Check the status of the api service found in the previous step:

kubectl get apiservice <apiservicename> -o yaml

Example:

kubectl get apiservice v1beta1.external.metrics.k8s.io -o yaml

If the status is False, there is an issue, and the proxy network is likely the primary reason for it.

Solution for self-managed Kubernetes cluster

Find the cluster IP for the keda-metrics-apiserver and keda-operator-metrics:

kubectl get services --all-namespaces
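
For example, assuming KEDA is installed in the keda namespace, this prints just the two cluster IPs:

kubectl get svc keda-metrics-apiserver keda-operator-metrics --namespace keda \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.clusterIP}{"\n"}{end}'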

In /etc/kubernetes/manifests/kube-apiserver.yaml, add the cluster IPs found in the previous step to the no_proxy variable.

Reload systemd manager configuration:

sudo systemctl daemon-reload

Restart kubelet:

sudo systemctl restart kubelet

Check the API service status and the pods now. Should work!

Solution for managed Kubernetes services

In managed Kubernetes services you might solve the issue by updating firewall rules in your cluster.

Google Kubernetes Engine (GKE)

E.g. in a GKE private cluster, add port 6443 (kube-apiserver) to the allowed ports in the master node firewall rules.
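
A sketch of such a rule using gcloud; the rule name is hypothetical, and the network and master CIDR are placeholders for your cluster's values:

gcloud compute firewall-rules create allow-master-to-keda \
  --network <cluster-network> \
  --source-ranges <master-ipv4-cidr> \
  --allow tcp:6443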

Also, if you are using Network Policies in your kube-system namespace, make sure they don't block access for the konnectivity agent via port 6443. You can read more about the konnectivity service in the Kubernetes documentation.

In that case, you need to add a similar NetworkPolicy in the kube-system namespace:

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-from-konnectivity-agent-to-keda
  namespace: kube-system
spec:
  egress:
  - ports:
    - port: 6443
      protocol: TCP
    to:
    - ipBlock:
        cidr: ${KUBE_POD_IP_CIDR}
  podSelector:
    matchLabels:
      k8s-app: konnectivity-agent
  policyTypes:
  - Egress
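
One way to look up the pod CIDR to substitute for ${KUBE_POD_IP_CIDR} on GKE (a sketch; the cluster name and zone are placeholders):

gcloud container clusters describe <cluster-name> --zone <zone> \
  --format='value(clusterIpv4Cidr)'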

Amazon Elastic Kubernetes Service (EKS)

E.g. make sure the Cluster Security group can reach the Nodegroups on TCP 6443. For example, using the Terraform EKS module, this is achievable through the additional nodegroup rules:

module "eks" {
  source                               = "terraform-aws-modules/eks/aws"
  version                              = "19.5.1"
  ...
  create_node_security_group = true
  node_security_group_additional_rules = {
    keda_metrics_server_access = {
      description                   = "Cluster access to keda metrics"
      protocol                      = "tcp"
      from_port                     = 6443
      to_port                       = 6443
      type                          = "ingress"
      source_cluster_security_group = true
    }
  }
}

As of version 19.6.0 of the terraform-aws-modules/eks/aws module, it is enough to have the node_security_group_enable_recommended_rules option enabled (the default) to get the necessary security group ingress rule.

Why is my ScaledObject paused?

When KEDA encounters upstream errors while getting scaler source information, it keeps the current instance count of the workload unless the fallback section is defined.

This behavior might feel like the autoscaling is not happening, but in reality, it is because of problems related to the scaler source.

You can check if this is your case by reviewing the logs from the KEDA pods, where you should see errors in both our Operator and Metrics server. You can also check the status of the ScaledObject (the READY and ACTIVE conditions) by running the following command:

$ kubectl get scaledobject MY-SCALED-OBJECT
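
The KEDA pod logs mentioned above can be reviewed, for example, with the following commands (the deployment names and the keda namespace assume a default installation):

kubectl logs --namespace keda deployment/keda-operator
kubectl logs --namespace keda deployment/keda-operator-metrics-apiserver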

Troubleshoot KEDA errors using profiling

Go gives us the possibility to profile specific actions in order to determine what causes an issue. For example, if our keda-operator pod keeps getting OOMKilled after a specific time, we can profile the heap and see which operations are taking up all of that space.

Go supports many profiling options such as heap, CPU, goroutines and more (for more info check https://pkg.go.dev/net/http/pprof).

In KEDA we provide the option to enable profiling on each component separately by enabling it in the Helm chart and providing a port (if not enabled, it won't work):

profiling:
  operator:
    enabled: false
    port: 8082
  metricsServer:
    enabled: false
    port: 8083
  webhooks:
    enabled: false
    port: 8084
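
For example, enabling operator profiling with Helm could look like this (the release, repo, and namespace names are assumed):

helm upgrade keda kedacore/keda --namespace keda \
  --set profiling.operator.enabled=true \
  --set profiling.operator.port=8082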

If you are not using the Helm chart, you can enable profiling on each of the components by specifying the following argument on the respective container:

--profiling-bind-address=":8082"

and it will be exposed on the port you specified.

After enabling it, you can port-forward or expose the service and use a tool like go tool pprof to get profiling data.
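
For example, a sketch for the operator using the port from the example above (the deployment name assumes a default installation):

# Forward the operator's profiling port to localhost
kubectl port-forward --namespace keda deployment/keda-operator 8082:8082

# In a second shell, fetch a heap profile and open the interactive pprof viewer
go tool pprof http://localhost:8082/debug/pprof/heap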

For more info, see https://go.dev/blog/pprof.

Why does Google Kubernetes Engine (GKE) 1.16 fail to fetch external metrics?

If you are running Google Kubernetes Engine (GKE) version 1.16, and are receiving the following error:

unable to fetch metrics from external metrics API: <METRIC>.external.metrics.k8s.io is forbidden: User "system:vpa-recommender" cannot list resource "<METRIC>" in API group "external.metrics.k8s.io" in the namespace "<NAMESPACE>": RBAC: clusterrole.rbac.authorization.k8s.io "external-metrics-reader" not found

You are almost certainly running into a known issue.

The workaround is to recreate the external-metrics-reader role using the following YAML:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-metrics-reader
rules:
- apiGroups:
  - "external.metrics.k8s.io"
  resources:
  - "*"
  verbs:
  - list
  - get
  - watch
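
For example, save the manifest as external-metrics-reader.yaml (a hypothetical filename) and apply it:

kubectl apply -f external-metrics-reader.yaml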

The GKE team is currently working on a fix that they expect to have out in version >= 1.16.13.

Why does the KEDA operator error with NoCredentialProviders?

If you are running KEDA on AWS using IRSA or KIAM for pod identity and seeing the following error messages:

Events:
  Type     Reason                      Age                From           Message
  ----     ------                      ----               ----           -------
  Normal   KEDAScalersStarted          31s                keda-operator  Started scalers watch
  Normal   KEDAScaleTargetDeactivated  31s                keda-operator  Deactivated apps/v1.Deployment default/my-event-based-deployment from 1 to 0
  Normal   ScaledObjectReady           13s (x2 over 31s)  keda-operator  ScaledObject is ready for scaling
  Warning  KEDAScalerFailed            1s (x2 over 31s)   keda-operator  NoCredentialProviders: no valid providers in chain. Deprecated.
           For verbose messaging see aws.Config.CredentialsChainVerboseErrors

And the operator logs:

2021-11-02T23:50:29.688Z    ERROR    controller    Reconciler error    {"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "my-event-based-deployment-scaledobject", "namespace": "default"
, "error": "error getting scaler for trigger #0: error parsing SQS queue metadata: awsAccessKeyID not found"}

This means that the KEDA operator is not receiving valid credentials, even before attempting to assume the IAM role associated with the scaleTargetRef.

Some things to check:

  • Ensure the keda-operator deployment has the iam.amazonaws.com/role annotation under deployment.spec.template.metadata, not deployment.metadata - if using KIAM
  • Ensure the keda-operator serviceAccount is annotated with eks.amazonaws.com/role-arn - if using IRSA (see the example command after this list)
  • Check kiam-server logs, successful provisioning of credentials looks like: kube-system kiam-server-6bb67587bd-2f47p kiam-server {"level":"info","msg":"found role","pod.iam.role":"arn:aws:iam::1234567890:role/my-service-role","pod.ip":"100.64.7.52","time":"2021-11-05T03:13:34Z"}.
    • Use grep to filter the kiam-server logs, searching for the keda-operator pod ip.
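
For example, the IRSA annotation can be checked with the following command (the service account name and the keda namespace assume a default installation):

kubectl get serviceaccount keda-operator --namespace keda \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'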

Why is Helm not able to upgrade to v2.2.1 or above?

Our initial approach to manage CRDs through Helm was not ideal given it didn’t update existing CRDs.

This is a known limitation of Helm:

There is no support at this time for upgrading or deleting CRDs using Helm. This was an explicit decision after much community discussion due to the danger for unintentional data loss. Furthermore, there is currently no community consensus around how to handle CRDs and their lifecycle. As this evolves, Helm will add support for those use cases.

As of v2.2.1 of our Helm chart, we have changed our approach so that the CRDs are automatically managed through our Helm chart.

This transition can cause upgrade failures if you started using KEDA before v2.2.1, producing errors during upgrades such as the following:

Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: CustomResourceDefinition “scaledobjects.keda.sh” in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key “app.kubernetes.io/managed-by”: must be set to “Helm”; annotation validation error: missing key “meta.helm.sh/release-name”: must be set to “keda”; annotation validation error: missing key “meta.helm.sh/release-namespace”: must be set to “keda”

In order to fix this, you will need to manually add the following attributes to our CRDs:

  • app.kubernetes.io/managed-by: Helm label
  • meta.helm.sh/release-name: keda annotation
  • meta.helm.sh/release-namespace: keda annotation
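
A sketch of doing this with kubectl for one of the CRDs; repeat it for every keda.sh CRD, and note that the values below assume a release named keda installed in the keda namespace:

kubectl label crd scaledobjects.keda.sh app.kubernetes.io/managed-by=Helm --overwrite
kubectl annotate crd scaledobjects.keda.sh meta.helm.sh/release-name=keda --overwrite
kubectl annotate crd scaledobjects.keda.sh meta.helm.sh/release-namespace=keda --overwrite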

Future upgrades should work seamlessly.

Why is KEDA API metrics server failing when Istio is installed?

While setting up KEDA, you get an error: (v1beta1.external.metrics.k8s.io) status FailedDiscoveryCheck, and you have Istio installed as a service mesh in your cluster.

This can lead to side effects like not being able to delete namespaces in your cluster. You will see an error like:

NamespaceDeletionDiscoveryFailure - Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Check the setup

In the following we assume that KEDA is installed in the namespace keda.

Check the KEDA API service status

Find the api service name for the service keda/keda-metrics-apiserver:

kubectl get apiservice --all-namespaces

Check the status of the api service found in the previous step:

kubectl get apiservice <apiservicename> -o yaml

Example:

kubectl get apiservice v1beta1.external.metrics.k8s.io -o yaml

If the status is False, there is an issue with the KEDA installation.

Check Istio installation

Check if Istio is installed in your cluster:

kubectl get svc -n istio-system

If Istio is installed you get a result like:

NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                      AGE
istio-ingressgateway   LoadBalancer   100.65.18.245    34.159.50.243   15021:31585/TCP,80:31669/TCP,443:30464/TCP   3d
istiod                 ClusterIP      100.65.146.141   <none>          15010/TCP,15012/TCP,443/TCP,15014/TCP        3d

Check KEDA namespace labels

Check the KEDA namespace labels:

kubectl describe ns keda

If Istio injection is enabled, there is no label stating istio-injection=disabled on the namespace.
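
You can also print the label value directly; empty output means the label is not set:

kubectl get namespace keda -o jsonpath='{.metadata.labels.istio-injection}'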

In this setup, the sidecar injection of Istio prevents the KEDA api service from working properly.

Solution

To prevent the sidecar injection of Istio, we must label the namespace accordingly. This can be achieved by setting the label istio-injection=disabled on the namespace:

kubectl label namespace keda istio-injection=disabled

Check that the label is set via kubectl describe ns keda

Install KEDA into the namespace keda and re-check the status of the api service, which should now have the status True.