Kaptain

To print the status of the Kaptain Helm installation:

helm status kaptain
CODE

To show deployments and pods in the Kaptain operator instance:

kubectl get deployments -n kubeflow

kubectl get pods -n kubeflow

kubectl describe pod <pod_name> -n kubeflow
CODE
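
To print the logs of a specific pod from the listing above (plain kubectl; substitute the pod name from the previous command):

kubectl logs <pod_name> -n kubeflow
CODE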

Konvoy

To create the Konvoy diagnostics bundle, use:

konvoy diagnose --logs-all-namespaces --yes
CODE

Afterwards, consult the Konvoy troubleshooting documentation.

ML workloads

TFJobs

List all TFJob runs in the user namespace:

kubectl get tfjobs -n <namespace>
CODE

Get the details of a TFJob run:

kubectl describe tfjob <job_name> -n <namespace>
CODE

List all of a TFJob's pods:

kubectl get pods -l job-name=<job_name> -n <namespace>
CODE

Print the logs from a TFJob's pods:

kubectl logs -l job-name=<job_name> --prefix=true -n <namespace>
CODE

List the Kubernetes Events associated with a TFJob:

kubectl get events --field-selector involvedObject.kind=TFJob,involvedObject.name=<job_name> -n <namespace>
CODE

Delete a TFJob:

kubectl delete tfjob <job_name> -n <namespace>
CODE

PyTorchJobs

List all PyTorchJob runs in the user namespace:

kubectl get pytorchjob -n <namespace>
CODE

Get the details of a PyTorchJob run:

kubectl describe pytorchjob <job_name> -n <namespace>
CODE

List all of a PyTorchJob's pods:

kubectl get pods -l job-name=<job_name> -n <namespace>
CODE

Print the logs from a PyTorchJob's pods:

kubectl logs -l pytorch-job-name=<job_name> --prefix=true -n <namespace>
CODE

List the Kubernetes Events associated with a PyTorchJob:

kubectl get events --field-selector involvedObject.kind=PyTorchJob,involvedObject.name=<job_name> -n <namespace>
CODE

Delete a PyTorchJob:

kubectl delete pytorchjob <job_name> -n <namespace>
CODE

Experiments

Get the details of an Experiment:

kubectl describe experiment <experiment_name> -n <namespace>
CODE

List all Experiment runs in the user namespace:

kubectl get experiments -n <namespace>
CODE

Get the Experiment’s trials:

kubectl get trials -l experiment=<experiment_name> -n <namespace>
CODE
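
To inspect an individual trial in more detail, you can also describe it (substitute a trial name from the listing above):

kubectl describe trial <trial_name> -n <namespace>
CODE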

Inference Services

Get the details of an Inference Service:

kubectl describe inferenceservice <inference_service_name> -n <namespace>
CODE

Get the Inference Service’s pods:

kubectl get pods -l serving.kubeflow.org/inferenceservice=<inference_service_name> -n <namespace>
CODE

Print the Inference Service pods' logs:

kubectl logs -l serving.kubeflow.org/inferenceservice=<inference_service_name> --prefix=true -c storage-initializer -n <namespace>
kubectl logs -l serving.kubeflow.org/inferenceservice=<inference_service_name> --prefix=true -c kfserving-container -n <namespace>
CODE

To find Knative revisions that are no longer being used, run this:

kubectl get revisions -l serving.kubeflow.org/inferenceservice=<inference_service_name> -l serving.knative.dev/routingState=reserve -n <namespace>
CODE

You can clean up revisions with:

kubectl delete revision <revision_name> -n <namespace>
CODE

This also removes the deployments associated with those revisions, which have already been scaled down to 0.
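
If you want to remove all unused revisions of an Inference Service at once, the two label selectors from the listing above can be combined into a single delete command (a sketch; verify the selection with kubectl get revisions first):

kubectl delete revisions -l serving.kubeflow.org/inferenceservice=<inference_service_name>,serving.knative.dev/routingState=reserve -n <namespace>
CODE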

Kubeflow Pipelines

List all pipeline runs in the user namespace:

kubectl get workflows.argoproj.io -n <namespace>
CODE

Print the logs from all pipeline steps:

kubectl logs -l workflows.argoproj.io/workflow=<workflow_name> -c main --prefix=true -n <namespace>
CODE

Delete all completed pipeline runs:

kubectl delete workflows.argoproj.io -l workflows.argoproj.io/completed=true -n <namespace>
CODE

Delete all pipeline runs with a given final status (for example, Succeeded or Failed):

kubectl delete workflows.argoproj.io -l workflows.argoproj.io/completed=true,workflows.argoproj.io/phase=Succeeded -n <namespace>
CODE

Common issues

Most problems can be identified by checking the following, going from the high-level description to low-level details (an example command sequence follows this list):

  • the description of the workload (TFJob, PyTorchJob, InferenceService, etc.)

    • Events associated with the workload

    • Status of the workload

  • pod logs
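
For example, for a hypothetical TFJob named my-job in the namespace my-ns, that sequence of checks could look like this (substitute your own workload kind, name, and namespace):

kubectl describe tfjob my-job -n my-ns
kubectl get events --field-selector involvedObject.kind=TFJob,involvedObject.name=my-job -n my-ns
kubectl logs -l job-name=my-job --prefix=true -n my-ns
CODE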

Errors in job code

Check if the job is running:

$ kubectl get tfjob tfjob-sample
NAME           STATE     AGE
tfjob-sample   Running   6m47s
CODE

Check the pod status of the job:

$ kubectl get pods -l job-name=tfjob-sample
NAME                    READY   STATUS             RESTARTS   AGE
tfjob-sample-chief-0    0/1     CrashLoopBackOff   6          7m10s
tfjob-sample-worker-0   0/1     CrashLoopBackOff   6          7m10s
tfjob-sample-worker-1   0/1     CrashLoopBackOff   6          7m10s
CODE

In this example, the job is in the Running state, but the pods are in the Error or CrashLoopBackOff state. Start with the describe command, which provides a high-level view of the workload. The Status and Events fields give an overview of the workload's current status along with the associated cluster events.

$ kubectl describe tfjob tfjob-sample
Name:         tfjob-sample
Namespace:    mynamespace
...
Status:
  Conditions:
    Last Transition Time:  2021-11-14T22:08:40Z
    Last Update Time:      2021-11-14T22:08:40Z
    Message:               TFJob tfjob-sample is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-11-14T22:08:49Z
    Last Update Time:      2021-11-14T22:08:49Z
    Message:               TFJob mynamespace/tfjob-sample is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Chief:
      Active:  1
    Worker:
      Active:  2
  Start Time:  2021-11-14T22:08:41Z
Events:
  Type     Reason                   Age                   From         Message
  ----     ------                   ----                  ----         -------
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-chief-0
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-chief-0
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-worker-0
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-worker-1
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-worker-0
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-worker-1
  Warning  Error                    14m (x2 over 14m)     tf-operator  Error pod tfjob-sample-worker-1 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-1 exited with code 1
  Warning  Error                    14m (x5 over 14m)     tf-operator  Error pod tfjob-sample-chief-0 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-0 exited with code 1
  Normal   ExitedWithCode           14m (x6 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-chief-0 exited with code 1
  Warning  Error                    9m54s (x13 over 14m)  tf-operator  Error pod tfjob-sample-worker-0 container tensorflow exitCode: 1 terminated message:
  Warning  CrashLoopBackOff         118s                  tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: back-off 5m0s restarting failed container=tensorflow pod=tfjob-sample-worker-1_mynamespace(13030281-b595-4486-89eb-1025455ec091)
CODE

From the events list, it is clear that all pods were created successfully and that the tensorflow container started but terminated with exit code 1:

Events:
  Type     Reason                   Age                   From         Message
...
  Warning  Error                    14m (x2 over 14m)     tf-operator  Error pod tfjob-sample-worker-1 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-1 exited with code 1
  Warning  Error                    14m (x5 over 14m)     tf-operator  Error pod tfjob-sample-chief-0 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-0 exited with code 1
  Normal   ExitedWithCode           14m (x6 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-chief-0 exited with code 1
  Warning  Error                    9m54s (x13 over 14m)  tf-operator  Error pod tfjob-sample-worker-0 container tensorflow exitCode: 1 terminated message:
  Warning  CrashLoopBackOff         118s                  tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: back-off 5m0s restarting failed container=tensorflow pod=tfjob-sample-worker-1_mynamespace(13030281-b595-4486-89eb-1025455ec091)
CODE

This implies that scheduling was successful - the scheduler was able to find enough cluster resources to schedule the pods, and the image was pulled successfully.

The next step is to check the logs for the pod:

$ kubectl logs -l job-name=tfjob-sample --prefix=true
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:Using MirroredStrategy with devices ('/job:chief/task:0',)
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:Waiting for the cluster, timeout = inf
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:Cluster is ready.
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = ...
[pod/tfjob-sample-chief-0/tensorflow] Traceback (most recent call last):
[pod/tfjob-sample-chief-0/tensorflow]   File "trainer.py", line 131, in <module>
[pod/tfjob-sample-chief-0/tensorflow]     main()
[pod/tfjob-sample-chief-0/tensorflow]   File "trainer.py", line 54, in main
[pod/tfjob-sample-chief-0/tensorflow]     tf.reshape([1, 2, 3], [2, 2])
[pod/tfjob-sample-chief-0/tensorflow]  InvalidArgumentError: Input to reshape is a tensor with 3 values, but the requested shape has 4
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:0',)
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:Waiting for the cluster, timeout = inf
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:Cluster is ready.
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = ...
[pod/tfjob-sample-worker-0/tensorflow] Traceback (most recent call last):
[pod/tfjob-sample-worker-0/tensorflow]   File "trainer.py", line 131, in <module>
[pod/tfjob-sample-worker-0/tensorflow]     main()
[pod/tfjob-sample-worker-0/tensorflow]   File "trainer.py", line 54, in main
[pod/tfjob-sample-worker-0/tensorflow]     tf.reshape([1, 2, 3], [2, 2])
[pod/tfjob-sample-worker-0/tensorflow] InvalidArgumentError: Input to reshape is a tensor with 3 values, but the requested shape has 4
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:1',)
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:Waiting for the cluster, timeout = inf
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:Cluster is ready.
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = ...
[pod/tfjob-sample-worker-1/tensorflow] Traceback (most recent call last):
[pod/tfjob-sample-worker-1/tensorflow]   File "trainer.py", line 131, in <module>
[pod/tfjob-sample-worker-1/tensorflow]     main()
[pod/tfjob-sample-worker-1/tensorflow]   File "trainer.py", line 54, in main
[pod/tfjob-sample-worker-1/tensorflow]     tf.reshape([1, 2, 3], [2, 2])
[pod/tfjob-sample-worker-1/tensorflow] InvalidArgumentError: Input to reshape is a tensor with 3 values, but the requested shape has 4
CODE

In the logs above, we can see the source of the issue: an exception was raised while running the trainer code.

Missing Image

Check if the job is running:

$ kubectl get tfjob tfjob-sample
NAME           STATE     AGE
tfjob-sample   Running   1m47s
CODE

Check the pod status of the job:

$ kubectl get pods -l job-name=tfjob-sample
NAME                    READY   STATUS             RESTARTS   AGE
tfjob-sample-chief-0    1/1     Running            0          2m32s
tfjob-sample-worker-0   0/1     ImagePullBackOff   0          2m32s
tfjob-sample-worker-1   0/1     ImagePullBackOff   0          2m32s
CODE

The worker pods of the job are in the ImagePullBackOff state.

We can get more details by describing the TFJob:

$ kubectl describe tfjob tfjob-sample
Name:         tfjob-sample
Namespace:    mynamespace
...
API Version:  kubeflow.org/v1
Kind:         TFJob
Spec:
  Tf Replica Specs:
    Chief:
    ...
        Spec:
          Containers:
           ...
            Image:              mesosphere/kubeflow:mnist-sdk-example
            Image Pull Policy:  Always
            Name:               tensorflow
            ...
    Worker:
      Replicas:        2
      ...
        Spec:
          Containers:
          ...
            Image:  mesosphere/kubeflow:mnist-sdk-sample
            Name:   tensorflow
Status:
  Conditions:
    Last Transition Time:  2021-11-14T23:07:19Z
    Last Update Time:      2021-11-14T23:07:19Z
    Message:               TFJob tfjob-sample is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-11-14T23:07:25Z
    Last Update Time:      2021-11-14T23:07:25Z
    Message:               TFJob mynamespace/tfjob-sample is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Chief:
      Active:  1
    Worker:
  Start Time:  2021-11-14T23:07:20Z
Events:
  Type     Reason                   Age                    From         Message
  ----     ------                   ----                   ----         -------
  Normal   SuccessfulCreatePod      10m                    tf-operator  Created pod: tfjob-sample-worker-0
  Normal   SuccessfulCreatePod      10m                    tf-operator  Created pod: tfjob-sample-worker-1
  Normal   SuccessfulCreateService  10m                    tf-operator  Created service: tfjob-sample-worker-0
  Normal   SuccessfulCreateService  10m                    tf-operator  Created service: tfjob-sample-worker-1
  Normal   SuccessfulCreatePod      10m                    tf-operator  Created pod: tfjob-sample-chief-0
  Normal   SuccessfulCreateService  10m                    tf-operator  Created service: tfjob-sample-chief-0
  Warning  ImagePullBackOff         9m32s (x2 over 9m47s)  tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: Back-off pulling image "mesosphere/kubeflow:mnist-sdk-sample"
  Warning  ErrImagePull             9m31s (x7 over 10m)    tf-operator  Error pod tfjob-sample-worker-0 container tensorflow waiting message: rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/mesosphere/kubeflow:mnist-sdk-sample": failed to resolve reference "docker.io/mesosphere/kubeflow:mnist-sdk-sample": docker.io/mesosphere/kubeflow:mnist-sdk-sample: not found
  Warning  ErrImagePull             9m21s (x6 over 10m)    tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/mesosphere/kubeflow:mnist-sdk-sample": failed to resolve reference "docker.io/mesosphere/kubeflow:mnist-sdk-sample": docker.io/mesosphere/kubeflow:mnist-sdk-sample: not found
  Warning  ImagePullBackOff         4m2s (x11 over 9m47s)  tf-operator  Error pod tfjob-sample-worker-0 container tensorflow waiting message: Back-off pulling image "mesosphere/kubeflow:mnist-sdk-sample"
CODE

In the Events, we can find the message that explains the cause of the issue - the image is not found in the registry:

Events:
  Type     Reason                   Age                    From         Message
...
  Warning  ErrImagePull             9m31s (x7 over 10m)    tf-operator  Error pod tfjob-sample-worker-0 container tensorflow waiting message: rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/mesosphere/kubeflow:mnist-sdk-sample": failed to resolve reference "docker.io/mesosphere/kubeflow:mnist-sdk-sample": docker.io/mesosphere/kubeflow:mnist-sdk-sample: not found
CODE
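
To fix this, correct the image reference in the TFJob manifest and re-submit the job. In this example, the Worker image mesosphere/kubeflow:mnist-sdk-sample appears to be a typo of the Chief image mesosphere/kubeflow:mnist-sdk-example. A sketch, assuming the job was created from a local manifest file named tfjob-sample.yaml (a hypothetical file name):

kubectl delete tfjob tfjob-sample -n mynamespace
kubectl apply -n mynamespace -f tfjob-sample.yaml
CODE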

Insufficient resources

Check if the job is running:

$ kubectl get tfjob tfjob-sample
NAME           STATE     AGE
tfjob-sample   Created   90s
CODE

Check the pod status of the job:

$ kubectl get pods -l job-name=tfjob-sample
NAME                    READY   STATUS    RESTARTS   AGE
tfjob-sample-chief-0    0/1     Pending   0          109s
tfjob-sample-worker-0   0/1     Pending   0          109s
tfjob-sample-worker-1   0/1     Pending   0          109s
CODE

All of the pods are in the Pending state. We can get more details by describing the TFJob:

$ kubectl describe tfjob tfjob-sample
Name:         tfjob-sample
Namespace:    mynamespace
API Version:  kubeflow.org/v1
Kind:         TFJob
...
Spec:
  Tf Replica Specs:
    Chief:
      Replicas:        1
...
        Spec:
          Containers:
...
            Image:  mesosphere/kubeflow:mnist-sdk-example
            Name:   tensorflow
            Resources:
              Requests:
                Memory:  128G
...
    Worker:
      Replicas:        2
...
        Spec:
          Containers:
...
            Image:  mesosphere/kubeflow:mnist-sdk-example
            Name:   tensorflow
            Resources:
              Requests:
                Memory:  128G
Status:
  Conditions:
    Last Transition Time:  2021-11-14T23:26:56Z
    Last Update Time:      2021-11-14T23:26:56Z
    Message:               TFJob tfjob-sample is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
  Replica Statuses:
    Chief:
    Worker:
  Start Time:  2021-11-14T23:26:57Z
Events:
  Type     Reason                   Age    From         Message
  ----     ------                   ----   ----         -------
  Normal   SuccessfulCreatePod      2m20s  tf-operator  Created pod: tfjob-sample-chief-0
  Normal   SuccessfulCreateService  2m20s  tf-operator  Created service: tfjob-sample-chief-0
  Normal   SuccessfulCreatePod      2m20s  tf-operator  Created pod: tfjob-sample-worker-0
  Normal   SuccessfulCreatePod      2m20s  tf-operator  Created pod: tfjob-sample-worker-1
  Normal   SuccessfulCreateService  2m19s  tf-operator  Created service: tfjob-sample-worker-0
  Normal   SuccessfulCreateService  2m19s  tf-operator  Created service: tfjob-sample-worker-1
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-worker-1 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-chief-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-worker-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
CODE

In the Events, we can find the message that explains the cause of the issue - in this case, insufficient memory to schedule the workload:

Events:
  Type     Reason                   Age                    From         Message
...
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-worker-1 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-chief-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-worker-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
CODE
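
To decide whether to lower the job's memory requests or add cluster capacity, compare the requested 128G against what the nodes can actually provide. The Allocatable and Allocated resources sections of the node description show this (standard kubectl; substitute a node name from kubectl get nodes):

kubectl describe node <node_name>
CODE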

Limitations

Kubeflow Pipelines

Kubeflow Pipelines steps can fail if the main container exits too quickly and the Argo sidecar fails to collect the artifacts. This can happen when the container image is not available on a node and needs to be pulled from the registry first. Retry the pipeline run, or pre-download the container image to the relevant nodes.
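
One way to pre-download an image to every node is a small DaemonSet that does nothing but pull the image and idle (a sketch; prepull-pipeline-image and <pipeline_step_image> are placeholders, and the image is assumed to contain a shell):

cat <<EOF | kubectl apply -n <namespace> -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-pipeline-image
spec:
  selector:
    matchLabels:
      app: prepull-pipeline-image
  template:
    metadata:
      labels:
        app: prepull-pipeline-image
    spec:
      containers:
      - name: prepull
        image: <pipeline_step_image>
        command: ["sh", "-c", "while true; do sleep 3600; done"]
EOF
CODE

Delete the DaemonSet again once the images have been pulled and the pipeline run has completed.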

Using Kubeflow Fairing with Private Docker Registries

Kubeflow Fairing does not currently support Docker registries using self-signed TLS certificates, certificate chaining, or insecure (plaintext HTTP) registries. It is recommended to use the Kaptain SDK for building and pushing Docker images as part of the model development process.

Spark

Running Spark in client mode from a notebook with Istio enabled is not supported. It is recommended to use the Spark Operator for running Spark applications.

Pocket Chrome Extension

Users who have the Google Chrome extension for Pocket installed may not be able to see large portions of the Kaptain UI. Disable the Pocket extension to ensure the Kaptain UI is completely visible.