
Connecting to S3 storage

To connect to S3 storage, the credential keys must be passed to the Spark pods. This could be done by hard-coding them into the application's manifest; however, to have more control over how and where the secrets are used we have chosen to store the keys in Kubernetes secrets.

Kubernetes Secrets

Kubernetes secrets are a way to hold small amounts of sensitive data for use on a Kubernetes cluster. A Kubernetes user creates the secret, which is then stored on the cluster. Pods running on the cluster can be configured to mount the secret so that its data is accessible inside the pod, or to reference the data in the secret through environment variables. Access to secrets can be controlled through service accounts, and a secret is not accessible from outside the cluster without kubectl privileges under the correct service account. For more details, see the Kubernetes documentation on secrets.
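As a generic illustration of the environment variable approach (not the mechanism Piezo itself uses, which is shown later), a pod could reference a secret such as the one defined in the next section like this; the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: s3-client-example           # hypothetical pod name
spec:
  containers:
    - name: app
      image: your-image:latest      # placeholder image
      env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: name-for-your-secrets   # the secret created in the next section
              key: accessKey
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: name-for-your-secrets
              key: secretKey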

Creating a secret

Before a secret can be used it must first be loaded onto the Kubernetes cluster. For connecting to S3 storage we use the following secret definition:

apiVersion: v1
kind: Secret
metadata:
  name: name-for-your-secrets
type: Opaque
data:
  accessKey: base64 encoded access key to s3 # echo -n "AKIAIOSFODNN7EXAMPLE" | base64
  secretKey: base64 encoded secret key to s3

Note, you can encode your keys using echo -n "YOURKEY" | base64
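As a worked example using the sample key from the comment above, the command produces the base64 string to paste into the accessKey field (your real keys will of course give different values):

echo -n "AKIAIOSFODNN7EXAMPLE" | base64
# QUtJQUlPU0ZPRE5ON0VYQU1QTEU=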

Once the definition has been written, navigate to the file in a terminal and run kubectl apply -f name_of_secret_file.yaml
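To confirm that the secret has been loaded, standard kubectl commands such as the following can be used; they list the secret and its keys without printing the values:

kubectl get secret name-for-your-secrets
kubectl describe secret name-for-your-secrets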

Using the secret

Our aim in using secrets is to provide a secure way to access S3 storage when running a Spark job with the Piezo web app.

When running a Spark job that requires S3 storage, the driver and executor pods try to create a connection with the S3 interface and thus require knowledge of the keys. When the connection is created, the Spark pods use the Hadoop configuration settings. In particular they look for the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

The simplest way to use the data from our secret to form a connection is to set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY within the Spark pods directly from the secret. This hides any knowledge of the secrets from the user. To set the environment variables, the following is required in the spec.driver and spec.executor sections of the manifest defining the Spark application:

    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: name_of_your_secret
        key: accessKey
      AWS_SECRET_ACCESS_KEY:
        name: name_of_your_secret
        key: secretKey  

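For context, a sketch of where these sections sit in a SparkApplication manifest for the Kubernetes spark operator is shown below. The application name, image and resource values are placeholders rather than the exact values used by Piezo, and the apiVersion may differ depending on your operator version:

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: example-s3-job               # placeholder name
spec:
  type: Scala                        # placeholder application type
  mode: cluster
  image: "your-spark-image:latest"   # placeholder image
  driver:
    cores: 1
    memory: "512m"
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: name-for-your-secrets
        key: accessKey
      AWS_SECRET_ACCESS_KEY:
        name: name-for-your-secrets
        key: secretKey
  executor:
    instances: 1
    cores: 1
    memory: "512m"
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: name-for-your-secrets
        key: accessKey
      AWS_SECRET_ACCESS_KEY:
        name: name-for-your-secrets
        key: secretKey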
In the Piezo web app this is all taken care of behind the scenes, and the user requires no knowledge of the keys or the secret to run their application.