DKP 2.6.2 Known Issues and Limitations
The following items are known issues with this release.
Known Issues
AWS additionalTags cannot contain spaces
Due to an upstream bug in the cluster-api-provider-aws component, it is not possible to specify tags with spaces in their name in the additionalTags section of an AWSCluster. If you have any tags like this during an upgrade of the capi-components, you may receive a validation error, and will need to remove any such tags. This issue will be corrected in a future DKP release.
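Before upgrading, it may help to list the tag keys on an AWSCluster and flag any that contain spaces. The following is a minimal sketch rather than an official DKP procedure: it assumes kubectl access to the cluster that holds the AWSCluster object, and the helper names and their arguments are hypothetical.

```shell
# Sketch only: helper names and arguments are illustrative, not part of DKP.

# Print the additionalTags keys of one AWSCluster, one key per line.
# $1 = AWSCluster name, $2 = namespace (placeholders you supply).
list_additional_tag_keys() {
  kubectl get awscluster "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.additionalTags}}{{$k}}{{"\n"}}{{end}}'
}

# Read keys on stdin and print only those that contain a space.
keys_with_spaces() {
  grep ' ' || true   # an empty result is not treated as a failure
}

# Example:
#   list_additional_tag_keys my-cluster default | keys_with_spaces
```

Any key printed by the example would need to be removed or renamed before the upgrade.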
AWS Custom AMI Required for Kubernetes Version
Previous versions of DKP would default to using upstream AMIs published by the CAPA (Cluster API AWS) project when building AWS clusters if you did not specify your own AMI. However, those images are not currently available for the Kubernetes version used in the 2.6.2 patch release.
As a result, starting with this release of DKP, the behavior of the dkp create cluster aws command has changed. It no longer defaults to the upstream AMIs. Instead, you must either specify an AMI built using Konvoy Image Builder (KIB) or explicitly request the upstream images.
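As a rough illustration of passing a KIB-built AMI at cluster creation, consider the sketch below. The function name is hypothetical, and the --cluster-name and --ami flags are assumptions about your version of the dkp CLI; confirm them against the CLI help before use.

```shell
# Sketch only: the function name is hypothetical; --cluster-name and --ami are
# assumed to be accepted by your version of the dkp CLI.
# $1 = cluster name, $2 = ID of an AMI built with Konvoy Image Builder.
create_aws_cluster_with_ami() {
  dkp create cluster aws --cluster-name "$1" --ami "$2"
}

# Example:
#   create_aws_cluster_with_ami my-cluster ami-0123456789abcdef0
```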
For more information on using a custom AMI in cluster creation or during the upgrade process, refer to these topics:
Use Static Credentials to Provision an Azure Cluster
Only static credentials can be used when provisioning an Azure cluster.
Containerd File Limit Issue
This version of DKP uses containerd 1.6.17. The systemd unit for containerd 1.6.17 provided upstream removes all file number limits (LimitNOFILE=infinity). In our testing, we found that removing these limits broke some IO-sensitive applications such as Rook Ceph and HAProxy. Because of this, the KIB version included in this release sets the LimitNOFILE value in the containerd systemd unit to 1048576, the value used in previous releases that shipped containerd 1.4.13.
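To confirm which limit is in effect on a node, you can query the containerd unit with systemctl. The helper below is a small sketch (its name is hypothetical) that extracts the value from the systemctl show output.

```shell
# Sketch only: the helper name is illustrative.
# Read "systemctl show" output on stdin and print the LimitNOFILE value.
limit_nofile_value() {
  sed -n 's/^LimitNOFILE=//p'
}

# Example, run on a cluster node:
#   systemctl show containerd --property=LimitNOFILE | limit_nofile_value
```

On a node built with the KIB version in this release, the value is expected to be 1048576 rather than infinity.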
Intermittent Error Status when Creating EKS Clusters in the UI
When provisioning an EKS cluster through the UI, you might receive a brief error state because the EKS cluster can sporadically lose connectivity with the management cluster, which results in the following symptoms:
The UI shows the cluster is in an error state.
The kubeconfig generated and retrieved from Kommander ceases to work.
Applications created on the management cluster may not be immediately federated to managed EKS clusters.
After a few moments, the error resolves itself without any action on your part. A new kubeconfig generated and retrieved from Kommander then works properly, and the UI shows that the cluster is working again. In the meantime, you can continue to use the UI to work on the cluster, for example to deploy applications, create projects, and add roles.
Limitations to Disk Resizing in vSphere
The DKP CLI flags --control-plane-disk-size and --worker-disk-size may be unable to resize the root file system of VMs that are created using OS images. The flags work by resizing the primary disk of the VM. When the VM boots, the root file system expands to fill the disk, but that expansion does not work for some file systems, for example, file systems contained in an LVM Logical Volume. To ensure your root file system has the size you expect, see Create a vSphere Base OS Image | Disk-Size.
Error Status in Grafana Logging Dashboard with EKS Clusters
Currently, it is not possible to use FluentBit to collect Admin-level logs on a managed EKS cluster.
If you have these logs enabled, the following message appears when you access the Kubernetes Audit Dashboard in the Grafana Logging Dashboard:
Cannot read properties of undefined (reading '0')
Spark Image Upstream Removal
Google has removed the images from its upstream Spark registry (gcr.io/spark-operator/spark-py:v3.1.1). Therefore, you cannot pull these images in non-air-gapped environments. Refer also to our Deprecations notice for more information.
Rook Ceph Install Error
An issue might emerge when installing rook-ceph on vSphere clusters using RHEL operating systems.
This issue occurs during the initial installation of rook-ceph and causes the object store used by Velero and Grafana Loki to be unavailable. If the installation of the Kommander component of DKP is unsuccessful due to rook-ceph failing, you might need to apply the following workaround:
Run this command to check if the cluster is affected by this issue.
CODE
kubectl describe CephObjectStores dkp-object-store -n kommander
If this output appears, the workaround needs to be applied, so continue with the next step. If you do not see this output, you can stop at this step.
CODE
Name: dkp-object-store
Namespace: kommander
...
Warning ReconcileFailed 7m55s (x19 over 52m) rook-ceph-object-controller failed to reconcile CephObjectStore "kommander/dkp-object-store". failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["dkp-object-store"]: failed to commit config changes after creating multisite config for CephObjectStore "kommander/dkp-object-store": failed to commit RGW configuration period changes%!(EXTRA []string=[]): signal: interrupt
Use kubectl exec to open a shell in the rook-ceph-tools pod.
CODE
export WORKSPACE_NAMESPACE=<workspace namespace>
CEPH_TOOLS_POD=$(kubectl get pods -l app=rook-ceph-tools -n ${WORKSPACE_NAMESPACE} -o name)
kubectl exec -it -n ${WORKSPACE_NAMESPACE} $CEPH_TOOLS_POD -- bash
Run these commands to set dkp-object-store as the default zonegroup.
NOTE: The period update command may take a few minutes to complete.
CODE
radosgw-admin zonegroup default --rgw-zonegroup=dkp-object-store
radosgw-admin period update --commit
Next, restart the rook-ceph-operator deployment for the CephObjectStore to be reconciled.
CODE
kubectl rollout restart deploy -n ${WORKSPACE_NAMESPACE} rook-ceph-operator
After running the commands above, the CephObjectStore should be Connected once the rook-ceph operator reconciles the object (this may take some time).
CODE
kubectl wait CephObjectStore --for=jsonpath='{.status.phase}'=Connected dkp-object-store -n ${WORKSPACE_NAMESPACE} --timeout 10m
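The check in the first step of this workaround can also be scripted. The sketch below, with a hypothetical helper name, looks for the reconcile failure message quoted above in the kubectl describe output and reports whether the workaround applies.

```shell
# Sketch only: the helper name is illustrative.
# Succeeds if the "kubectl describe" output on stdin contains the
# RGW period commit failure shown in the sample output above.
has_rgw_period_failure() {
  grep -q 'failed to commit RGW configuration period changes'
}

# Example:
#   kubectl describe CephObjectStores dkp-object-store -n kommander \
#     | has_rgw_period_failure && echo "workaround needed"
```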
Kommander Installation Configuration File Changes
In this release, the kommander-ui, which provides the DKP Dashboard, is now deployed in the same manner as the other Platform Applications. Also, the ai-navigator-app, which provides the AI Navigator, is deployed in the same manner. Therefore, the contents of the default Kommander installation configuration file, which is the file produced by the dkp install kommander --init command, have changed. If you installed previous versions of DKP using a customized Kommander installation configuration file, we recommend that you:
Generate a new template for this release.
Reapply your customizations rather than reusing a file created by older DKP versions.
The ai-navigator-app is enabled by default in the new configuration file.
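One way to carry out the recommendation above is to generate the new template and compare it with your existing file before reapplying customizations. This is a sketch under stated assumptions: the function name and file paths are hypothetical, and only the dkp install kommander --init command comes from this page.

```shell
# Sketch only: the function name and file paths are illustrative.
# $1 = existing customized configuration file
# $2 = path where the new default template is written
new_kommander_template_diff() {
  dkp install kommander --init > "$2" || return 1
  # diff exits non-zero when the files differ, which is expected here.
  diff -u "$1" "$2" || true
}

# Example:
#   new_kommander_template_diff kommander-old.yaml kommander-new.yaml
```

Review the diff, then copy your customizations into the new file rather than editing the old one.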
FluentD Logging Operator Error
An error may occur with the FluentD Operator not appearing when installing Kommander on a GCP Environment.
This is due to a limitation of the GCP API. As a result, for Fluentd to operate properly, you need to rename the Fluentd buffer, because the default name exceeds 63 characters, and disable buffer volume metrics for Fluentd.
Follow these procedures to resolve this issue for the Management Cluster, and Managed or Attached Clusters.
Management Cluster on GCP with the Logging Stack enabled
Create the kommander namespace:
CODE
kubectl create namespace kommander
Create the logging-operator-logging-overrides ConfigMap:
CODE
cat <<EOF | kubectl apply -n kommander -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-operator-logging-overrides
data:
  values.yaml: |
    fluentd:
      # disable buffer metrics
      bufferVolumeMetrics: null
      bufferStorageVolume:
        pvc:
          source:
            # update this name to make the PV <= 63 chars long
            # logging-operator-logging-buf-logging-operator-logging-fluentd-0
            claimName: buf
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 10Gi
            volumeMode: Filesystem
EOF
Continue installing Kommander as normal.
Managed/Attached Cluster in GCP with Logging Stack enabled
Set the WORKSPACE_NAMESPACE environment variable to the name of your workspace’s namespace:
CODE
export WORKSPACE_NAMESPACE=<workspace namespace>
Create the logging-operator-logging-overrides ConfigMap on the managed or attached cluster prior to enabling the logging-operator application:
CODE
cat <<EOF | kubectl apply -n ${WORKSPACE_NAMESPACE} -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-operator-logging-overrides
data:
  values.yaml: |
    fluentd:
      # disable buffer metrics
      bufferVolumeMetrics: null
      bufferStorageVolume:
        pvc:
          source:
            # update this name to make the PV <= 63 chars long
            # logging-operator-logging-buf-logging-operator-logging-fluentd-0
            claimName: buf
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 10Gi
            volumeMode: Filesystem
EOF
Proceed with installing the logging-operator as normal.
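After applying the ConfigMap in either procedure, you may want to confirm that the override was stored as intended. The following sketch reads the values back and extracts the claim name; the helper names are hypothetical, and the namespace argument is kommander on the management cluster or your workspace namespace on a managed or attached cluster.

```shell
# Sketch only: helper names are illustrative.

# Print the values.yaml stored in the overrides ConfigMap.
# $1 = namespace that holds the ConfigMap.
show_logging_overrides() {
  kubectl get configmap logging-operator-logging-overrides -n "$1" \
    -o jsonpath='{.data.values\.yaml}'
}

# Read values.yaml on stdin and print the configured claimName.
claim_name_from_values() {
  sed -n 's/^[[:space:]]*claimName:[[:space:]]*//p'
}

# Example:
#   show_logging_overrides kommander | claim_name_from_values
```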
Generate a Token or Enable the Konvoy Credentials Plugin for Proxied Clusters
If you have attached a network-restricted cluster and enabled proxied access to make its resources available through the Management cluster, the “Generate Token” and “Konvoy credentials plugin instructions” options required for user authentication in the DKP UI require further steps.
Expand the set of instructions depending on the authentication configuration you want to configure:
A. Generate a Token
B. Konvoy credentials plugin instructions
Known Common Vulnerabilities and Exposures (CVE)
Starting with DKP 2.6, Catalog apps are scanned for CVEs. However, only the CVEs for app versions that are compatible with the default Kubernetes version, currently 1.26.6, are mitigated. For more information about the known CVEs for compatible catalog apps, see D2iQ Security Updates.
Potential Logging Interruption during Upgrade
The configuration setting enableRecreateWorkloadOnImmutableFieldChange is enabled by default on the logging-operator. This means the operator automatically triggers a recreation of the resource if any immutable fields are changed on the underlying fluent-bit/fluentd resources. Previously, any changes to immutable fields would fail silently and the failures would only be observable in the logging-operator logs. This was done to handle a breaking change introduced in the logging-operator, which added an additional selector label on the fluent-bit DaemonSet.
Because of this breaking change, during an upgrade to 2.6, fluent-bit is restarted, and some log data in the fluent-bit buffer might be lost if fluentd is unavailable while fluent-bit attempts to flush its buffer prior to the pods being terminated. To prevent issues related to data loss, we recommend you run more than one fluentd pod. Alternatively, Fluent Bit can be configured to use a hostPath volume to store the buffer information, so it can be picked up again when Fluent Bit restarts. For more information about how to change the host path, see Fluent Bit log collector.
Gatekeeper Uninstallation Error
If you choose to disable Gatekeeper, you can run into an error where the app is still present on your clusters.
For pre-existing attached or managed clusters that still have Gatekeeper installed after you disabled it, you need to manually clean up Gatekeeper on those clusters after the upgrade.
Follow these steps to manually remove Gatekeeper:
For every attached or managed cluster that still has Gatekeeper installed even though you disabled it (through the AppDeployment), run the following commands to remove it:
NOTE: These commands need to be run on the attached/managed cluster using the correct kubeconfig.
Set the WORKSPACE_NAMESPACE environment variable to the namespace of your attached cluster's workspace on your attached cluster:
CODE
export WORKSPACE_NAMESPACE=<workspace-name>
Delete the kustomizations:
CODE
kubectl --kubeconfig <attached-cluster-kubeconfig> delete kustomizations -n ${WORKSPACE_NAMESPACE} gatekeeper-constraint-templates gatekeeper-constraints gatekeeper-release
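To confirm the cleanup, you can list any Gatekeeper kustomizations that remain in the workspace namespace. This is a sketch with a hypothetical helper name; it assumes the same kubeconfig and namespace used in the steps above.

```shell
# Sketch only: the helper name is illustrative.
# $1 = kubeconfig of the attached or managed cluster
# $2 = workspace namespace on that cluster
# Prints the names of any kustomizations that still start with "gatekeeper-".
remaining_gatekeeper_kustomizations() {
  kubectl --kubeconfig "$1" get kustomizations -n "$2" -o name \
    | grep '/gatekeeper-' || true
}

# Example:
#   remaining_gatekeeper_kustomizations ./attached.kubeconfig "${WORKSPACE_NAMESPACE}"
```

An empty result suggests the three kustomizations were removed.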