Scaling the Logging Stack
Introduction
Depending on the application workloads you run on your clusters, the default settings for the DKP logging stack may not meet your needs. In particular, if your workloads produce a large volume of log traffic, you may need to adjust the logging stack components to capture all of it reliably. Follow the suggestions below to tune the logging stack components as needed.
Logging Operator
In a high log traffic environment, fluentd usually becomes the bottleneck of the logging stack. According to https://banzaicloud.com/docs/one-eye/logging-operator/operation/scaling/:
The typical sign of this is when fluentd cannot handle its buffer directory size growth for more than the configured or calculated (timekey + timekey_wait) flush interval.
For metrics to monitor, refer to https://docs.fluentd.org/monitoring-fluentd/monitoring-prometheus#metrics-to-monitor.
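If fluentd does become the bottleneck, scaling it up usually means more replicas, more CPU and memory, and a larger buffer volume. The fragment below is a minimal sketch of the relevant Logging Operator fields (spec.fluentd.scaling, spec.fluentd.resources, spec.fluentd.bufferStorageVolume); the resource name, namespace, and all numbers are illustrative, and in DKP you would typically apply equivalent values through an override configmap for the relevant logging platform application rather than editing the Logging resource directly.

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-operator-logging   # illustrative name
spec:
  controlNamespace: kommander      # illustrative namespace
  fluentd:
    # Run more fluentd replicas to spread the log traffic.
    scaling:
      replicas: 3
    # Give each replica more headroom for parsing and buffering.
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 2Gi
    # Enlarge the buffer volume so traffic spikes do not fill the buffer directory.
    bufferStorageVolume:
      pvc:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi
```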
Grafana dashboard
In DKP, if the Prometheus Monitoring (kube-prometheus-stack) platform application is enabled, you can view the Logging Operator dashboard in the Grafana UI.
You can also improve fluentd throughput by disabling buffering on the loki clusterOutput.
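The sketch below shows one way this could look, assuming the Logging Operator's loki output and its standard buffer fields: flushing chunks immediately from a memory buffer rather than accumulating them on disk. The ClusterOutput name, namespace, and Loki URL are illustrative and should match what is already deployed in your cluster.

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: loki              # illustrative; use the name of your existing loki clusterOutput
  namespace: kommander    # illustrative namespace
spec:
  loki:
    url: http://grafana-loki-loki-distributed-gateway   # illustrative URL
    # Flush records as soon as they arrive instead of batching them on disk,
    # trading some batching efficiency for less buffering pressure on fluentd.
    buffer:
      type: memory
      flush_mode: immediate
```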
Example Configuration
You can see an example configuration of the logging operator in Logging Stack Application Sizing Recommendations.
Grafana Loki
DKP deploys Loki in microservices mode, which gives you the highest flexibility for scaling.
In a high log traffic environment, we recommend the following (a sketch of a matching override appears after these recommendations):
The Ingester should be the first component you consider scaling up.
The Distributor should be scaled up only when the existing Distributors are under stress from high compute resource usage.
The number of Distributor pods should usually be much lower than the number of Ingester pods.
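As a rough illustration, the override below scales the write path first. It assumes that the grafana-loki platform application exposes the values of the underlying Loki microservices (loki-distributed style) Helm chart and that overrides are delivered through a configmap referenced by the AppDeployment; the configmap name, namespace, replica counts, and resource figures are all illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-loki-overrides   # illustrative name
  namespace: kommander           # illustrative namespace
data:
  values.yaml: |
    # Scale ingesters first; they absorb most of the write-path load.
    ingester:
      replicas: 5
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
    # Scale distributors only if they show sustained CPU or memory pressure,
    # and keep their count well below the ingester count.
    distributor:
      replicas: 2
```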
Grafana dashboard
In DKP, if the Prometheus Monitoring (kube-prometheus-stack) platform app is enabled, you can view the Loki dashboards in the Grafana UI.
Example Configuration
You can see an example configuration of Loki in Logging Stack Application Sizing Recommendations.
For more information, refer to:
https://grafana.com/docs/loki/latest/fundamentals/architecture/components/
https://grafana.com/docs/loki/latest/operations/scalability/
Rook Ceph
Ceph is the default S3 storage provider. In DKP, a Rook Ceph Operator and a Rook Ceph Cluster are deployed together to provide a working Ceph cluster.
Storage
The default configuration of the Rook Ceph Cluster in DKP has a 33% storage overhead for data redundancy: the redundancy data adds roughly a third on top of the usable capacity, so if the data disks allocated to your Rook Ceph Cluster total 1000 GB, only 750 GB is available to store your data. It is important to account for this when planning the capacity of your data disks to prevent issues.
ObjectBucketClaim storage limit
ObjectBucketClaim has a storage limit option that prevents an S3 bucket from growing past a set size. In DKP, this is enabled by default.
Therefore, after you size up your Rook Ceph Cluster for more storage, it is important to also increase the storage limit on the ObjectBucketClaims of your grafana-loki and/or project-grafana-loki deployments.
To change it for grafana-loki, provide an override configmap for the rook-ceph-cluster platform app that overrides dkp.grafana-loki.maxSize.
To change it for project-grafana-loki, provide an override configmap for the project-grafana-loki platform app that overrides dkp.project-grafana-loki.maxSize.
A sketch of such an override appears below.
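As an illustration, an override for the grafana-loki bucket limit could look like the following. It assumes the DKP convention of supplying Helm value overrides through a configmap with a values.yaml key; the configmap name, namespace, and the 1Ti figure are illustrative, and the project-grafana-loki case follows the same pattern with dkp.project-grafana-loki.maxSize.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-cluster-overrides   # illustrative name
  namespace: kommander                # illustrative namespace
data:
  values.yaml: |
    # Raise the ObjectBucketClaim size limit for the grafana-loki bucket
    # after the Ceph cluster itself has been given more raw capacity.
    dkp:
      grafana-loki:
        maxSize: 1Ti                  # illustrative value; match your retention needs
```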
Example Configuration
You can see an example config at Rook Ceph Cluster Sizing Recommendations.
Ceph OSD CPU considerations
ceph-osd is the object storage daemon for the Ceph distributed file system. It is responsible for storing objects on a local file system and providing access to them over the network.
If you determine that the Ceph OSD component is the bottleneck, consider increasing the CPU allocated to it. For more information, see https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/.
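For example, CPU (and memory) for the OSDs could be raised through the same kind of override configmap, assuming the rook-ceph-cluster platform app exposes the upstream chart's cephClusterSpec.resources values; the names and figures below are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-cluster-overrides   # illustrative name
  namespace: kommander                # illustrative namespace
data:
  values.yaml: |
    cephClusterSpec:
      resources:
        # Give each OSD more CPU headroom; OSD throughput generally scales
        # with available cores up to a point.
        osd:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi
```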
Grafana dashboard
In DKP, if the Prometheus Monitoring (kube-prometheus-stack) platform app is enabled, you can view the Ceph dashboards in the Grafana UI.
Audit Log
Overhead
Enabling audit logging requires additional computing and storage resources.
When you enable audit logging by enabling the kommander-fluent-bit AppDeployment, inbound log traffic to the logging stack increases by approximately 3-4 times.
Thus, when enabling the audit log, consider scaling up all components in the logging stack mentioned above.
Fine-tuning audit log Fluent Bit
If you are certain that you only need to collect a subset of the logs that the default configuration makes the kommander-fluent-bit pods collect, you can add your own override configmap to kommander-fluent-bit with appropriate Fluent Bit INPUT, FILTER, and OUTPUT settings. This helps reduce the audit log traffic.
For the default Fluent Bit configuration, see Release Notes > Components and Applications.
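The sketch below shows what such an override could look like, assuming kommander-fluent-bit is configured through the standard Fluent Bit Helm chart values (config.inputs, config.filters, config.outputs). The file paths, tag names, and the grep filter key are hypothetical placeholders; base your override on the actual default configuration referenced above, and keep the default OUTPUT section unless you also need to change where the logs are sent.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kommander-fluent-bit-overrides   # illustrative name
  namespace: kommander                   # illustrative namespace
data:
  values.yaml: |
    config:
      # Hypothetical narrowed INPUT: tail only the audit files you need.
      inputs: |
        [INPUT]
            Name            tail
            Path            /var/log/kubernetes/audit/*.log
            Tag             audit.*
            Mem_Buf_Limit   5MB
      # Hypothetical FILTER: keep only the audit events you care about.
      filters: |
        [FILTER]
            Name    grep
            Match   audit.*
            Regex   objectRef_resource secrets
```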