Hyperparameter Tuning with Katib
Introduction
Hyperparameter tuning is the process of optimizing a model’s hyperparameter values in order to maximize the predictive quality of the model. Examples of such hyperparameters are the learning rate, neural architecture depth (layers) and width (nodes), epochs, batch size, dropout rate, and activation functions. These are the parameters that are set prior to training; unlike the model parameters (weights and biases), these do not change during the process of training the model.
Katib automates the process of hyperparameter tuning by running a pre-configured number of training jobs (known as trials) in parallel. Each trial evaluates a different set of hyperparameter configurations. Within each experiment it automatically adjusts the hyperparameters to find their optimal values with regard to the objective function, which is typically the model’s metric (e.g. accuracy, AUC, F1, precision). An experiment therefore consists of an objective, a search space for the hyperparameters, and a search algorithm. At the end of the experiment, Katib outputs the optimized values, which are also known as suggestions.
Whereas it is common to have training and test data sets in traditional (supervised) machine learning, in deep learning (esp. when combined with hyperparameter tuning), it is recommended to have a three-way split: training, validation (a.k.a. as development), and test. The training data set is, as always, to learn parameters (weights and biases) from data. The test data set is also known as the hold-out set and its sole purpose is to check the model's hypothesis of parameter values in terms of how well it generalizes to data it has never come across. The point of the validation data set is to cross-validate the model and tweak the hyperparameters. Since information from this data set is used to adjust the model, it is not an objective test of the model's generalizability. It is not unlike a teacher checking up on students:
- The training data set is the text book to learn the theory from
- The validation data set comprises the exercises to practice the theory
- The test data set is exam to assess the degree of learning vs lookup
What You’ll Learn
This notebook shows how you can create and configure an Experiment
for both TensorFlow
and PyTorch
training jobs.
In terms of Kubernetes, such an experiment is a custom resource handled by the Katib operator.
What You’ll Need
A Docker image with either a TensorFlow or PyTorch model that accepts hyperparameters as arguments. Please click on the links to see such models.
That’s it, so let’s get started!
How to Specify Hyperparameters in Your Models
In order for Katib to be able to tweak hyperparameters it needs to know what these are called in the model.
Beyond that, the model must specify these hyperparameters either as regular (command line) parameters or as environment variables.
Since the model needs to be containerized, any command line parameters or environment variables must to be passed to the container that holds your model.
By far the most common and also the recommended way is to use command line parameters that are captured with argparse
or similar; the trainer (function) then uses their values internally.
How to Expose Model Metrics as Objective Functions
By default, Katib collects metrics from the standard output of a job container by using a sidecar container.
In order to make the metrics available to Katib, they must be logged to stdout in the key=value
format.
The job output will be redirected to /var/log/katib/metrics.log
file.
This means that the objective function (for Katib) must match the metric’s key
in the models output.
It’s therefore possible to define custom model metrics for your use case.
In the sidecar (a.k.a. sidekick or decomposition) pattern, if you are not already familiar with it, a secondary (sidecar) container is attached to the primary workload inside a pod in Kubernetes. In many cases, pods run a single container, but peripheral services, such as networking services, monitoring, and logging, are required in all applications and services. With sidecars there is no need to re-implement basic but secondary tasks in each service or application. The sidecar has the same lifecycle as the primary application and it has access to the same resources. The sidecar is, however, isolated from the main container, which means it does not have to be implemented in the same technology. This means it can easily be reused across various workloads.
Katib does not care whether you use TensorFlow, PyTorch, MXNet, or any other framework for that matter. All it needs to do its job is a (parameterized) trainer container and the logs to grab the model's metrics from.
How to Create Experiments
Before we proceed, let’s set up a few basic definitions that we can re-use.
Note that you typically use (YAML) resource definitions for Kubernetes from the command line, but we shall show you how to do everything from a notebook, so that you do not have to exit your favourite environment at all!
Of course, if you are more familiar or comfortable with kubectl
and the command line, feel free to use a local CLI or the embedded terminals from the Jupyter Lab launch screen.
TF_EXPERIMENT_FILE = "katib-tfjob-experiment.yaml"
PYTORCH_EXPERIMENT_FILE = "katib-pytorchjob-experiment.yaml"
We also want to capture output from a cell with %%capture
that usually looks like some-resource created
.
To that end, let’s define a helper function:
import re
from IPython.utils.capture import CapturedIO
def get_resource(captured_io: CapturedIO) -> str:
"""
Gets a resource name from `kubectl apply -f <configuration.yaml>`.
:param str captured_io: Output captured by using `%%capture` cell magic
:return: Name of the Kubernetes resource
:rtype: str
:raises Exception: if the resource could not be created
"""
out = captured_io.stdout
matches = re.search(r"^(.+)\s+created", out)
if matches is not None:
return matches.group(1)
else:
raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
TensorFlow: a TFJob Experiment
The TFJob
definition for this example is based on the MNIST with TensorFlow notebook.
To refresh our memory, the model accepts several arguments:
--batch-size
--buffer-size
--epochs
--steps
--learning-rate
--momentum
For our experiment, we want to focus on the learning rate and momentum of the SGD algorithm. You can add the other hyperparameters in a similar manner. Please note that discrete values (e.g. epochs) and categorical values (e.g. optimization algorithms) are supported, too.
The following YAML file describes an Experiment
object:
%%writefile $TF_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
name: katib-tfjob-experiment
spec:
parallelTrialCount: 3
maxTrialCount: 12
maxFailedTrialCount: 3
objective:
type: maximize
goal: 0.99
objectiveMetricName: accuracy
algorithm:
algorithmName: random
parameters:
- name: --learning-rate
parameterType: double
feasibleSpace:
min: "0.3"
max: "0.4"
- name: --momentum
parameterType: double
feasibleSpace:
min: "0.6"
max: "0.7"
trialTemplate:
goTemplate:
rawTemplate: |-
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
name: {{.Trial}}
namespace: {{.NameSpace}}
spec:
tfReplicaSpecs:
Worker:
replicas: 2
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: tensorflow
# modify this property if you would like to use a custom image
image: mesosphere/kubeflow:mnist-tensorflow-2.2-1.0.1-0.5.0
imagePullPolicy: Always
command: ["python", "-u", "/mnist.py"]
args:
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}"
- "{{.Value}}"
{{- end}}
{{- end}}
# Comment out these resources when using only CPUs
resources:
limits:
nvidia.com/gpu: 1
Writing katib-tfjob-experiment.yaml
Please note that the Docker image that contains the model has to be set for the trialTemplate
configuration.
This experiment will create 12 trials with different sets of hyperparameter values passed to each training job.
It uses a random search to maximize the accuracy on the test data set.
A comment has been added where you can change the Docker image. The one listed should work, but you may want to try it with your own container registry.
- Split the data three ways
- Evaluate the model on both the validation and test data sets
- Add the model evaluation metric for the test data set as an entry in
additionalMetricNames
, which may contain multiple metrics of interest - Let Katib find the best hyperparameter values based on the validation data set (i.e.
objectiveMetricName
) - Extract the corresponding model evaluation metric for the test data set from the results
The Experiment
specification has the following sections to configure experiments:
spec.parameters
contains the list of hyperparameters that are used to tune the modelspec.objective
defines the metric to optimizespec.algorithm
defines which search algorithm to use for the tuning process
There are many more configuration options, but they are too numerous to go through here. Please have a look at the official documentation for more details.
PyTorch: a PyTorchJob experiment
We shall repeat the procedure for a PyTorchJob
.
This example is based on the MNIST with PyTorch notebook.
It accepts the following parameters relevant to training the model:
--batch-size
--epochs
--lr
(i.e. the learning rate)--gamma
For our experiment we wish to find the optimal learning rate in the range of [0.1, 0.7] with regard to the accuracy on the test data set.
This is logged as accuracy=<value>
, as can be seen in the original notebook for distributed training.
We run up to 12 trials with three such trials in parallel.
Again, we use a random search.
%%writefile $PYTORCH_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
name: katib-pytorchjob-experiment
spec:
parallelTrialCount: 3
maxTrialCount: 12
maxFailedTrialCount: 3
objective:
type: maximize
goal: 0.99
objectiveMetricName: accuracy
algorithm:
algorithmName: random
parameters:
- name: --lr
parameterType: double
feasibleSpace:
min: "0.1"
max: "0.7"
trialTemplate:
goTemplate:
rawTemplate: |-
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
name: {{.Trial}}
namespace: {{.NameSpace}}
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
# modify this property if you would like to use a custom image
image: mesosphere/kubeflow:mnist-pytorch-1.0.1-0.5.0
imagePullPolicy: Always
command: ["python", "-u", "/mnist.py"]
args:
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}"
- "{{.Value}}"
{{- end}}
{{- end}}
# Comment out these resources when using only CPUs
resources:
limits:
nvidia.com/gpu: 1
Worker:
replicas: 2
restartPolicy: OnFailure
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
# modify this property if you would like to use a custom image
image: mesosphere/kubeflow:mnist-pytorch-1.0.1-0.5.0
imagePullPolicy: Always
args:
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}"
- "{{.Value}}"
{{- end}}
{{- end}}
# Comment out these resources when using only CPUs
resources:
limits:
nvidia.com/gpu: 1
Writing katib-pytorchjob-experiment.yaml
Please note the subtle differences in the trialTemplate
: the kind
is either TFJob
or PyTorchJob
and the Docker images are obviously different.
How to Run and Monitor Experiments
Now we have specified our experiments, let’s run them!
You can either execute these commands on your local machine with kubectl
or you can run them from the notebook.
If you do run these locally, you cannot rely on cell magic, so you have to manually copy-paste the experiment name wherever you see $EXPERIMENT
.
If you intend to run the following command locally, you have to set the user namespace for all subsequent commands:
kubectl config set-context --current --namespace=<insert-namespace>
Please change the namespace to whatever has been set up by your administrator.
To submit our experiment, we execute:
%%capture kubectl_output --no-stderr
! kubectl apply -f $PYTORCH_EXPERIMENT_FILE
The cell magic grabs the output of the kubectl
command and stores it in an object named kubectl_output
.
From there we can use the utility function we defined earlier:
EXPERIMENT = get_resource(kubectl_output)
To see the status, we can then run:
! kubectl describe $EXPERIMENT
To get the list of created trials, use the following command:
! kubectl get trials.kubeflow.org -l experiment=katib-pytorchjob-experiment
NAME TYPE STATUS AGE
katib-pytorchjob-experiment-62b9lr7k Created True 2s
katib-pytorchjob-experiment-qcl4jkc6 Created True 2s
katib-pytorchjob-experiment-vnzgj7q6 Created True 2s
After the experiment is completed, use describe
to get the best trial results:
! kubectl describe $EXPERIMENT
The relevant section of the output looks like this:
Name: katib-pytorchjob-experiment
...
Status:
...
Current Optimal Trial:
Best Trial Name: katib-pytorchjob-experiment-jv4sc9q7
Observation:
Metrics:
Name: accuracy
Value: 0.9902
Parameter Assignments:
Name: --lr
Value: 0.5512569257804198
...
Trials: 6
Trials Succeeded: 6
...
Katib UI
So far, we have seen how to create and submit experiments via the command line or from within Jupyter notebooks. Katib provides a user interface, which allows you to create, configure, and monitor experiments in a browser. The Katib UI can be launched from Kubeflow’s central dashboard. Just click on “Katib” in the navigation menu on the left of the dashboard.
To access the experiment monitor, go to Menu → HP → Monitor in the hamburger menu next to “Katib” (black navigation bar at the top).
To see detailed information, such as trial results, metrics, and a plot, click on the experiment itself.