Deploying Spark in a project

Getting started

Before you can create and manage Spark workloads in a project, you must deploy the Spark Operator in the workspace that contains the project.

After deploying the Spark Operator, apply the Spark Operator-specific custom resources. The Spark Operator works with the following kinds of custom resources:

  • SparkApplication

  • ScheduledSparkApplication

See the Spark Operator API documentation for more details.
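
A ScheduledSparkApplication wraps an ordinary SparkApplication spec in a cron schedule. The following is a minimal sketch of what such a resource might look like, not an official example. The helper name scheduled_spark_manifest is hypothetical, the template values are borrowed from the pyspark-pi example later on this page, and the concurrencyPolicy and restartPolicy values are assumptions you should adjust for your workload.

```shell
# Hypothetical helper: print a ScheduledSparkApplication manifest to stdout.
# $1 = resource name, $2 = namespace, $3 = service account, $4 = cron schedule
scheduled_spark_manifest() {
  cat <<EOF
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: $1
  namespace: $2
spec:
  schedule: "$4"
  # Assumption: skip a run while the previous one is still active.
  concurrencyPolicy: Forbid
  template:
    type: Python
    pythonVersion: "3"
    mode: cluster
    image: "gcr.io/spark-operator/spark-py:v3.1.1"
    mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
    sparkVersion: "3.1.1"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: $3
    executor:
      cores: 1
      instances: 1
      memory: "512m"
EOF
}

# Example usage (pipe the manifest to kubectl once the variables are set):
#   scheduled_spark_manifest pyspark-pi-hourly "${PROJECT_NAMESPACE}" \
#     "${SPARK_SERVICE_ACCOUNT}" "0 * * * *" | kubectl apply -f -
```

Printing the manifest instead of applying it directly lets you review the rendered YAML, or commit it to a Git repository, before it reaches a cluster.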

Example deployment

To manage these custom resources and RBAC resources across all clusters in a project, we recommend using Project Deployments, which lets you deploy the resources through GitOps. Otherwise, you must create the resources manually in each cluster.

This example deployment walks you through deploying a Spark application in a project namespace. The result of this procedure is a running Spark application ready for use in your project’s namespace.

  1. Create your Project if you don’t already have one.

  2. Set the PROJECT_NAMESPACE environment variable to the name of your project’s namespace:

    export PROJECT_NAMESPACE=<project namespace>
  3. Ensure that the RBAC resources referenced in your custom resources exist; otherwise, the custom resources can fail. See the Spark Operator documentation for details.

    • The following example command creates the RBAC resources needed in your project namespace:

      kubectl apply -f - <<EOF
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: spark-service-account
        namespace: ${PROJECT_NAMESPACE}
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        namespace: ${PROJECT_NAMESPACE}
        name: spark-role
      rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["*"]
      - apiGroups: [""]
        resources: ["services"]
        verbs: ["*"]
      - apiGroups: [""]
        resources: ["configmaps"]
        verbs: ["*"]
      - apiGroups: [""]
        resources: ["persistentvolumeclaims"]
        verbs: ["*"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: spark-role-binding
        namespace: ${PROJECT_NAMESPACE}
      subjects:
      - kind: ServiceAccount
        name: spark-service-account
        namespace: ${PROJECT_NAMESPACE}
      roleRef:
        kind: Role
        name: spark-role
        apiGroup: rbac.authorization.k8s.io
      EOF
  4. Set the SPARK_SERVICE_ACCOUNT environment variable to one of the following:

    1. ${PROJECT_NAMESPACE}, if you skipped the previous step and did not create the RBAC resources:

      # This service account is automatically created when you create a project and has access to everything in the project namespace. 
      export SPARK_SERVICE_ACCOUNT=${PROJECT_NAMESPACE}
    2. spark-service-account, if you created the RBAC resources in the previous step:

      export SPARK_SERVICE_ACCOUNT=spark-service-account
  5. Apply the SparkApplication custom resource in your project namespace:

    kubectl apply -f - <<EOF
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: pyspark-pi
      namespace: ${PROJECT_NAMESPACE}
    spec:
      type: Python
      pythonVersion: "3"
      mode: cluster
      image: "gcr.io/spark-operator/spark-py:v3.1.1"
      imagePullPolicy: Always
      mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
      sparkVersion: "3.1.1"
      restartPolicy:
        type: OnFailure
        onFailureRetries: 3
        onFailureRetryInterval: 10
        onSubmissionFailureRetries: 5
        onSubmissionFailureRetryInterval: 20
      driver:
        cores: 1
        coreLimit: "1200m"
        memory: "512m"
        labels:
          version: 3.1.1
        serviceAccount: ${SPARK_SERVICE_ACCOUNT}
      executor:
        cores: 1
        instances: 1
        memory: "512m"
        labels:
          version: 3.1.1
    EOF
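
After the procedure above, it can help to confirm that the service account has the permissions Spark needs and that the operator has picked up the application. The following is a minimal sketch rather than part of the official procedure. The helper names spark_sa_can_create_pods and spark_app_state are hypothetical, and the sketch assumes that the operator reports progress in the .status.applicationState.state field and names the driver pod <application name>-driver.

```shell
# Hypothetical helper: ask the API server whether a service account may create pods.
# $1 = namespace, $2 = service account name
spark_sa_can_create_pods() {
  kubectl auth can-i create pods -n "$1" \
    --as "system:serviceaccount:$1:$2"
}

# Hypothetical helper: print the state the operator reports for one application.
# $1 = namespace, $2 = SparkApplication name
spark_app_state() {
  kubectl -n "$1" get sparkapplication "$2" \
    -o jsonpath='{.status.applicationState.state}'
}

# Example usage (assumes the variables set in the steps above):
#   spark_sa_can_create_pods "${PROJECT_NAMESPACE}" "${SPARK_SERVICE_ACCOUNT}"
#   spark_app_state "${PROJECT_NAMESPACE}" pyspark-pi
#   kubectl -n "${PROJECT_NAMESPACE}" logs pyspark-pi-driver
```

The first helper prints yes or no. If the second helper prints nothing, the operator may not have processed the resource yet, and kubectl describe on the SparkApplication usually shows the related events.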

Delete Spark custom resources

Follow these steps to delete the Spark custom resources:

  1. View SparkApplications in all namespaces:

    kubectl get sparkapp -A
  2. Delete a specific SparkApplication:

    kubectl -n ${PROJECT_NAMESPACE} delete sparkapp <name of sparkapplication>
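
If you created the example RBAC resources during the deployment and no other Spark application in the namespace uses them, you may want to remove them as well. The following sketch assumes the resource names from the example deployment (spark-role-binding, spark-role, and spark-service-account). The helper name cleanup_spark_rbac is hypothetical.

```shell
# Hypothetical helper: delete the example RBAC resources from one namespace.
# $1 = project namespace
cleanup_spark_rbac() {
  # Remove the binding first so the role is no longer referenced.
  kubectl -n "$1" delete rolebinding spark-role-binding --ignore-not-found
  kubectl -n "$1" delete role spark-role --ignore-not-found
  kubectl -n "$1" delete serviceaccount spark-service-account --ignore-not-found
}

# Example usage:
#   cleanup_spark_rbac "${PROJECT_NAMESPACE}"
```

The --ignore-not-found flag keeps the helper safe to rerun when some of the resources are already gone.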

Resources