
DKP 2.5.0 Known Issues and Limitations

The following items are known issues with this release.

AWS additionalTags cannot contain spaces

Due to an upstream bug in the cluster-api-provider-aws component, it is not possible to specify tags whose names contain spaces in the additionalTags section of an AWSCluster. If any such tags are present during an upgrade of the capi-components, you may receive a validation error and will need to remove those tags. This issue will be corrected in a future DKP release.
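Before upgrading the capi-components, you can inspect the additionalTags of an existing AWSCluster and look for keys or values that contain spaces. A minimal sketch, assuming <CLUSTER_NAME> and <NAMESPACE> are placeholders for your environment:

CODE
# List the additionalTags of an AWSCluster; review the output for tag keys or values containing spaces.
kubectl get awscluster <CLUSTER_NAME> -n <NAMESPACE> -o jsonpath='{.spec.additionalTags}'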

Use Static Credentials to Provision an Azure Cluster

Only static credentials can be used when provisioning an Azure cluster.
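For reference, a minimal sketch of supplying static service principal credentials through environment variables before provisioning; the variable names follow Cluster API Provider Azure conventions and all values are placeholders:

CODE
# Static service principal credentials (placeholder values) used when provisioning an Azure cluster.
export AZURE_SUBSCRIPTION_ID="<subscription-id>"
export AZURE_TENANT_ID="<tenant-id>"
export AZURE_CLIENT_ID="<service-principal-client-id>"
export AZURE_CLIENT_SECRET="<service-principal-client-secret>"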

Containerd 1.4.13 File Limit Issue

In this version of DKP, we introduced containerd 1.6.17. The systemd unit for containerd 1.6.17 provided upstream removes all file number limits (LimitNOFILE=infinity). In our testing, we found that removing these limits broke some I/O-sensitive applications, such as Rook Ceph and HAProxy. Because of this, the KIB version included in this release sets the LimitNOFILE value in the containerd systemd unit to the value (1048576) that was used in previous releases with containerd 1.4.13.
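If you want to confirm which limit is in effect on a node, a quick check, assuming shell access to the node and a systemd-managed containerd service:

CODE
# Show the file descriptor limit applied to the containerd unit; nodes built with the KIB version in this release should report 1048576.
systemctl show containerd --property=LimitNOFILE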

Intermittent Error Status when Creating EKS Clusters in the UI

When provisioning an EKS cluster through the UI, you may see a brief error state because the EKS cluster can sporadically lose connectivity with the management cluster, which results in the following symptoms:

  • The UI shows the cluster is in an error state.

  • The kubeconfig generated and retrieved from Kommander ceases to work.

  • Applications created on the management cluster may not be immediately federated to managed EKS clusters.

After a few moments, the error resolves without any action on your part. A new kubeconfig generated and retrieved from Kommander then works properly, and the UI shows that the cluster is healthy again. In the meantime, you can continue to use the UI to work on the cluster, such as deploying applications, creating projects, and adding roles.
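If the previously retrieved kubeconfig has stopped working, you can generate a fresh one once the error clears. A sketch, assuming the DKP CLI is pointed at the management cluster and <CLUSTER_NAME> is a placeholder:

CODE
# Generate a new kubeconfig for the managed EKS cluster and verify connectivity.
dkp get kubeconfig -c <CLUSTER_NAME> > <CLUSTER_NAME>.conf
kubectl --kubeconfig <CLUSTER_NAME>.conf get nodes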

Installation Issue in Pre-provisioned Environments

An issue with Rook Ceph’s deployment prevents pre-provisioned environments from installing this DKP version. To solve this issue, you must set up a minimum of 40 GB of raw storage for your worker nodes and customize your Rook Ceph installation as indicated in Install Kommander in a Pre-provisioned Environment.
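To verify that a worker node exposes enough unformatted storage for Rook Ceph, you can list its block devices. A sketch, assuming SSH access to the node:

CODE
# List block devices on the worker node; the disk reserved for Rook Ceph should be at least 40 GB
# and show no filesystem or mountpoint (raw).
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT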

Resolve issues with failed HelmReleases

An issue with the Flux helm-controller can cause HelmReleases to fail with the error message Helm upgrade failed: another operation (install/upgrade/rollback) is in progress. This can happen when the helm-controller is restarted while a HelmRelease is still upgrading or installing.

Workaround

To determine whether the HelmRelease error was caused by the helm-controller restarting, first suspend and then resume the HelmRelease:

CODE
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'

After resuming, you should see the HelmRelease attempt to reconcile, and it either succeeds (with the status Release reconciliation succeeded) or fails again with the same error; you can check the outcome as shown below. If the reconciliation succeeds, no further action is needed. If the HelmRelease remains in the failed state, the error is likely related to the helm-controller restarting, and you need to remove the most recent Helm release secret. The following steps use the reloader HelmRelease as an example of a stuck release.
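To check the reconciliation outcome, you can inspect the HelmRelease status. A brief sketch with placeholder names:

CODE
# Show the Ready condition and status message of the HelmRelease.
kubectl -n <namespace> get helmrelease <HELMRELEASE_NAME>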

To resolve the issue, follow these steps:

  1. List secrets containing the affected HelmRelease name:

    CODE
    kubectl get secrets -n ${NAMESPACE} | grep reloader

    The output should look like this:

    CODE
    kommander-reloader-reloader-token-9qd8b                        kubernetes.io/service-account-token   3      171m
    sh.helm.release.v1.kommander-reloader.v1                       helm.sh/release.v1                    1      171m
    sh.helm.release.v1.kommander-reloader.v2                       helm.sh/release.v1                    1      117m           

    In this example, sh.helm.release.v1.kommander-reloader.v2 is the most recent revision.

  2. Delete the secret for the most recent revision, which has the form sh.helm.release.v1.<name>.<revision> (in this example, sh.helm.release.v1.kommander-reloader.v2):

    CODE
    kubectl delete secret -n <namespace> <most recent helm revision secret name>
  3. Suspend and resume the HelmRelease to trigger a reconciliation:

    CODE
    kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
    kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'

You should see the HelmRelease reconcile, and the upgrade or install eventually succeeds.

Limitations to Disk Resizing in vSphere

The DKP CLI flags --control-plane-disk-size and --worker-disk-size are unable to resize the root file system of VMs created using OS images. The flags work by resizing the primary disk of the VM. When the VM boots, the root file system is expanded to fill the disk, but that expansion does not work for some file systems, for example, for file systems contained in an LVM Logical Volume. To ensure your root file system has the size you expect, please see Create a vSphere Base OS Image | Disk-Size.
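To confirm whether a node's root file system actually received the requested size, you can compare the disk size with the root file system size on the node. A sketch, assuming SSH access to the VM:

CODE
# Compare the size of the primary disk with the size of the root file system.
# If the root file system lives in an LVM Logical Volume, it may be smaller than the disk.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
df -h /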

Error Status in Grafana Logging Dashboard with EKS Clusters

Currently, it is not possible to use FluentBit to collect Admin-level logs on a managed EKS cluster.

If you have these logs enabled, the following message appears when you access the Kubernetes Audit Dashboard in the Grafana Logging Dashboard:

CODE
Cannot read properties of undefined (reading '0')

Logging Operator Upgrade Error

There is a race condition that can result in the logging-operator-logging-fluentd StatefulSet using an incorrect image tag during the upgrade from DKP 2.4.0 to DKP 2.5.0.

The image tag is eventually corrected by the logging-operator; however, due to the nature of StatefulSets, the failing pod needs to be removed so that the StatefulSet can continue rolling out the required updates.

  1. Run this command to check whether any Fluentd pods are in the ImagePullBackOff state after the upgrade:

    CODE
    kubectl get pod -l app.kubernetes.io/name=fluentd,app.kubernetes.io/managed-by=logging-operator-logging,app.kubernetes.io/component=fluentd -n kommander
  2. If ImagePullBackOff appears in the output, as in the example below, continue with these steps to resolve the issue.

    CODE
    NAME                                 READY   STATUS             RESTARTS   AGE
    logging-operator-logging-fluentd-0   3/3     Running            0          6m41s
    logging-operator-logging-fluentd-1   2/3     ImagePullBackOff   0          2m33s
  3. Delete the Fluentd pod that is in an ImagePullBackOff state. In this case, it is logging-operator-logging-fluentd-1:

    CODE
    kubectl delete pod -n kommander logging-operator-logging-fluentd-1
  4. The upgrade of the logging-operator-logging-fluentd StatefulSet now proceeds as normal. 
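To follow the remaining rollout after deleting the pod, you can watch the StatefulSet status. A brief sketch:

CODE
# Watch the StatefulSet finish rolling out the corrected image tag.
kubectl rollout status statefulset -n kommander logging-operator-logging-fluentd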

Nodepools Update Error with Knative

The following only applies to your environment if you have Knative installed and its deployment is scaled to fewer than 5 pods.

An issue with the PodDisruptionBudget resource blocks the deletion of old nodes when upgrading from DKP 2.4.0 to DKP 2.5.0, which results in a failure of the DKP nodepools upgrade.

  1. If the dkp update nodepool command fails, check whether any PodDisruptionBudget has ALLOWED DISRUPTIONS equal to 0 using the following command:

    CODE
    kubectl get pdb -n knative-serving
    NAME                             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    activator-pdb                    80%             N/A               0                     22h
    webhook-pdb                      80%             N/A               0                     22h
    
  2. Obtain the list of pods in the knative-serving namespace that are covered by these PodDisruptionBudgets with the following command:

    CODE
    kubectl get pods -n knative-serving -l 'app in (webhook, activator)'
  3. The output should look similar to the following:

    CODE
    NAME                       READY   STATUS    RESTARTS   AGE
    webhook-XXXXXXXXX-XXXX     2/2     Running   0          5d21h
    activator-XXXXXXXXX-XXXX   2/2     Running   0          4d23h
  4. Delete the pods covered by the PodDisruptionBudget resources:

    CODE
    kubectl delete pod -n knative-serving activator-XXXX-XXX webhook-XXXXX-XXXX
  5. The upgrade of DKP and Knative now proceeds as normal. Re-run the dkp update nodepool command.
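Before you re-run the dkp update nodepool command, you can optionally confirm that the recreated pods have restored the disruption budget. A quick check, reusing the commands from the steps above:

CODE
# ALLOWED DISRUPTIONS should no longer be 0 once the new webhook and activator pods are Running.
kubectl get pdb -n knative-serving
kubectl get pods -n knative-serving -l 'app in (webhook, activator)'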

Rook Ceph Install Error

An issue may emerge when installing rook-ceph on vSphere clusters that use RHEL operating systems.

This issue occurs during the initial installation of rook-ceph and causes the object store used by Velero and Grafana Loki to be unavailable. If the installation of the Kommander component of DKP is unsuccessful because rook-ceph fails, you might need to apply the following workaround.

  1. Run the following command to see if the cluster is affected by this issue.

    CODE
    kubectl describe CephObjectStores dkp-object-store -n kommander
  2. If output similar to the following appears, the workaround needs to be applied:

    CODE
    Name:         dkp-object-store
    Namespace:    kommander
    ...
      Warning  ReconcileFailed     7m55s (x19 over 52m)
      rook-ceph-object-controller  failed to reconcile CephObjectStore
      "kommander/dkp-object-store". failed to create object store deployments: failed
      to configure multisite for object store: failed create ceph multisite for
      object-store ["dkp-object-store"]: failed to commit config changes after
      creating multisite config for CephObjectStore "kommander/dkp-object-store":
      failed to commit RGW configuration period changes%!(EXTRA []string=[]): signal: interrupt
  3. Use kubectl exec to open a shell in the rook-ceph-tools pod:

    CODE
    export WORKSPACE_NAMESPACE=<workspace namespace>
    CEPH_TOOLS_POD=$(kubectl get pods -l app=rook-ceph-tools -n ${WORKSPACE_NAMESPACE} -o name)
    kubectl exec -it -n ${WORKSPACE_NAMESPACE} $CEPH_TOOLS_POD -- bash
  4. Run the following commands to set dkp-object-store as the default zonegroup.
    NOTE: The period update command may take a few minutes to complete.

    CODE
    radosgw-admin zonegroup default --rgw-zonegroup=dkp-object-store
    radosgw-admin period update --commit
  5. Next, restart the rook-ceph-operator deployment so that the CephObjectStore is reconciled.

    CODE
    kubectl rollout restart deploy -n ${WORKSPACE_NAMESPACE} rook-ceph-operator
  6. After running the commands above, the CephObjectStore should be Connected once the rook-ceph operator reconciles the object (this may take some time).

    CODE
    kubectl wait CephObjectStore --for=jsonpath='{.status.phase}'=Connected dkp-object-store -n ${WORKSPACE_NAMESPACE} --timeout 10m

Post Upgrade, Volume Cannot Attach to a Node Already Attached to Another Node

Due to an upstream issue, when you bring a new node up during a Kubernetes version upgrade and then delete the old node, an existing volume might not attach to the new node.

You will see this when a new pod that uses a volume does not become ready on the new node, and an event appears that says something such as Volume <pvc/pv-id> is already exclusively attached to one node and can’t be attached to another.

This will be fixed in a future Kubernetes release; for example, the issue is described for vSphere here. Different methods might be needed to resolve it manually, including this method to resolve it on vSphere.
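To confirm that a cluster is hitting this issue, you can inspect the pending pod's events and the VolumeAttachment objects that still reference the old node. A diagnostic sketch with placeholder names:

CODE
# Look for the "already exclusively attached" event on the pending pod.
kubectl describe pod <POD_NAME> -n <NAMESPACE>
# List VolumeAttachments and the nodes they are bound to.
kubectl get volumeattachments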

DKP 2.4.0 to DKP 2.5.0 rook-ceph-cluster Helm Release Upgrade Error

  1. If you see the rook-ceph-cluster HelmRelease with an error similar to this:

    CODE
    status:
      conditions:
      - lastTransitionTime: "2023-05-10T13:56:28Z"
        message: "Helm rollback failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster:
          Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\"
          with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\n\nLast Helm logs:\n\nPatch
          CephCluster \"dkp-ceph-cluster\" in namespace kommander\nerror updating the
          resource \"dkp-ceph-cluster\":\n\t cannot patch \"dkp-ceph-cluster\" with kind
          CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\nPatch CephObjectStore
          \"dkp-object-store\" in namespace kommander\nerror updating the resource \"dkp-object-store\":\n\t
          cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error
          occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\nwarning: Rollback \"rook-ceph-cluster\"
          failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error
          occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\"
          with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused"
        reason: RollbackFailed
        status: "False"
        type: Ready
      - lastTransitionTime: "2023-05-10T13:56:26Z"
        message: "Helm upgrade failed: cannot patch \"dkp-object-store\" with kind CephObjectStore:
          Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\n\nLast Helm logs:\n\nPatch
          Ingress \"dkp-ceph-cluster-dashboard\" in namespace kommander\nPatch CephCluster
          \"dkp-ceph-cluster\" in namespace kommander\nPatch CephObjectStore \"dkp-object-store\"
          in namespace kommander\nerror updating the resource \"dkp-object-store\":\n\t
          cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error
          occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\nwarning: Upgrade \"rook-ceph-cluster\"
          failed: cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal
          error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused"
        reason: UpgradeFailed
        status: "False"
        type: Released
      - lastTransitionTime: "2023-05-10T13:56:28Z"
        message: "Helm rollback failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster:
          Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\"
          with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\n\nLast Helm logs:\n\nPatch
          CephCluster \"dkp-ceph-cluster\" in namespace kommander\nerror updating the
          resource \"dkp-ceph-cluster\":\n\t cannot patch \"dkp-ceph-cluster\" with kind
          CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\nPatch CephObjectStore
          \"dkp-object-store\" in namespace kommander\nerror updating the resource \"dkp-object-store\":\n\t
          cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error
          occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused\nwarning: Rollback \"rook-ceph-cluster\"
          failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error
          occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\"
          with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\":
          failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\":
          dial tcp 10.99.16.90:443: connect: connection refused"
        reason: RollbackFailed
        status: "False"
        type: Remediated

  2. Once the rook-ceph HelmRelease becomes ready, reconcile the rook-ceph-cluster HelmRelease using the commands below. The upgrade then proceeds as normal.

    CODE
    export WORKSPACE_NAMESPACE=<workspace namespace>
    kubectl -n ${WORKSPACE_NAMESPACE} patch helmrelease rook-ceph-cluster --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
    kubectl -n ${WORKSPACE_NAMESPACE} patch helmrelease rook-ceph-cluster --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
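To confirm that the rook-ceph HelmRelease has become ready before suspending and resuming rook-ceph-cluster, a brief sketch:

CODE
# Wait for the rook-ceph HelmRelease to report Ready before reconciling rook-ceph-cluster.
kubectl -n ${WORKSPACE_NAMESPACE} wait helmrelease rook-ceph --for=condition=Ready --timeout=10m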