Using Kerberos with DC/OS Data Science Engine to retrieve and write data securely

Kerberos is an authentication system that allows DC/OS Data Science Engine to retrieve and write data securely to a Kerberos-enabled HDFS cluster. Long-running jobs will renew their delegation tokens (authentication credentials).

This guide assumes you have already set up a Kerberos-enabled HDFS cluster.

Configuring Kerberos with DC/OS Data Science Engine

DC/OS Data Science Engine and all Kerberos-enabled components need a valid krb5.conf configuration file. The krb5.conf file tells data-science-engine how to connect to your Kerberos key distribution center (KDC). You can specify the properties for the krb5.conf file with the following options:

  "security": {
    "kerberos": {
      "enabled": true,
      "kdc": {
        "hostname": "<kdc_hostname>",
        "port": <kdc_port>
      },
      "primary": "<primary_for_principal>",
      "realm": "<kdc_realm>",
      "keytab_secret": "<path_to_keytab_secret>"
    }
  }

Make sure your keytab file is in the DC/OS secret store, under a path that is accessible by the data-science-engine service.
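As a rough sketch of how the keytab might be placed in the secret store, the script below wraps two DC/OS Enterprise CLI calls in a dry-run helper. The keytab file name and the secret path are hypothetical placeholders, and the use of `dcos security secrets create --file` is an assumption about the Enterprise CLI available on your cluster; check `dcos security secrets create --help` before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: upload a keytab to the DC/OS secret store.
# KEYTAB_FILE and SECRET_PATH are hypothetical example values.
KEYTAB_FILE="${KEYTAB_FILE:-jupyter.keytab}"
SECRET_PATH="${SECRET_PATH:-data-science-engine/keytab}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them

run() {
  # Print the command in dry-run mode, otherwise execute it unchanged.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumption: --file stores the raw file contents as the secret value.
run dcos security secrets create --file "$KEYTAB_FILE" "$SECRET_PATH"

# List the secrets under the service's path to confirm the upload.
run dcos security secrets list data-science-engine
```

With the defaults, the script only prints the commands it would run. Set DRY_RUN=0 to execute them, and use the same secret path as the value of the keytab_secret option.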

Example: Using HDFS with Spark in a Kerberized Environment

Here is an example notebook that runs TensorFlow on Spark, using HDFS as the storage backend in a Kerberized environment.

Before you start, make sure that the HDFS service is installed and that DC/OS Data Science Engine is configured with its endpoint. For details on configuring the HDFS integration of DC/OS Data Science Engine, see the Using HDFS with DC/OS Data Science Engine section.

  1. Make sure the HDFS Client service is installed and running with the “Kerberos enabled” option.

  2. Run the following commands to set up a directory on HDFS with proper permissions:

    # Assuming the HDFS Client version you are running is "2.6.0-cdh5.9.1", the commands are:
    dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -mkdir -p /data-science-engine'
    # Assuming the primary configured above is "jupyter":
    dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chown jupyter:jupyter /data-science-engine'
    dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chmod 700 /data-science-engine'

  3. Launch Terminal from the Notebook UI.

  4. Clone the TensorFlowOnSpark repository and download a sample dataset:

    rm -rf TensorFlowOnSpark && git clone
    rm -rf mnist && mkdir mnist
    curl -fsSL -O
    unzip -d mnist/

  5. List the files in the target HDFS directory and remove the directory if it already exists:

    hdfs dfs -ls -R /data-science-engine/mnist_kerberos && hdfs dfs -rm -R /data-science-engine/mnist_kerberos

  6. Generate sample data and save it to HDFS:

    spark-submit \
      --verbose \
      $(pwd)/TensorFlowOnSpark/examples/mnist/ \
      --output /data-science-engine/mnist_kerberos/csv \
      --format csv
    hdfs dfs -ls -R /data-science-engine/mnist_kerberos

  7. Train the model and checkpoint it to the target directory in HDFS.

    To distribute the Kerberos ticket cache file to the executors, specify two additional options: --files <Kerberos ticket cache file> and --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99". The executors use the ticket cache file to authenticate with Kerberized HDFS:

    spark-submit \
      --files /tmp/krb5cc_99 --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99" \
      --verbose \
      --py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/ \
      $(pwd)/TensorFlowOnSpark/examples/mnist/spark/ \
      --cluster_size 4 \
      --images /data-science-engine/mnist_kerberos/csv/train/images \
      --labels /data-science-engine/mnist_kerberos/csv/train/labels \
      --format csv \
      --mode train \
      --model /data-science-engine/mnist_kerberos/mnist_csv_model

  8. Verify that the model has been saved:

    hdfs dfs -ls -R /data-science-engine/mnist_kerberos/mnist_csv_model
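
The two Kerberos options in step 7 both depend on the ticket cache path, so they can drift apart when the cache file name changes. The following is a minimal sketch that derives both options from a single path. The function names and the KRB5_CACHE variable are hypothetical, and the sandbox directory and default cache path are simply the values used in the example above; the sketch assumes, as that example does, that a file passed with --files appears in the executor sandbox under the same base name.

```shell
#!/usr/bin/env bash
# Sketch only: derive the two spark-submit options from one ticket cache path.
# SANDBOX_DIR and the default cache path are the values used in step 7.
SANDBOX_DIR="/mnt/mesos/sandbox"

executor_cache_path() {
  # Assumption: the file keeps its base name inside the executor sandbox.
  printf '%s/%s\n' "$SANDBOX_DIR" "$(basename "$1")"
}

kerberos_submit_args() {
  # Print one option per line, or fail if the cache file cannot be read.
  local cache="$1"
  if [ ! -r "$cache" ]; then
    echo "ticket cache not readable: $cache" >&2
    return 1
  fi
  printf '%s\n' "--files" "$cache" \
    "--conf" "spark.executorEnv.KRB5CCNAME=$(executor_cache_path "$cache")"
}

CACHE="${KRB5_CACHE:-/tmp/krb5cc_99}"
if args="$(kerberos_submit_args "$CACHE")"; then
  echo "spark-submit options:"
  echo "$args"
fi
```

Run it in the notebook terminal before submitting the training job and copy the printed options into the spark-submit command. The helper only checks that the cache file is readable; running klist -s against the cache file is one way to also confirm that the ticket has not expired.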