There are several types of encryption, and each can use different methods. Encryption can be used in the following scenarios:
- Encrypt credentials, like usernames, passwords, keys, or refresh tokens, in configurations
- Encrypt confidential data, like access tokens, in data files
- Encrypt credentials that are to be sent over the network for authentication
- Encrypt data files that are to be sent out to cloud storage
For each type of encryption, there is a need for corresponding decryption.
Secrets in configuration are encrypted using the Gobblin encryption utility.
- This encryption requires a master key that is stored in the location specified by the encrypt.key.loc job property.
- The encrypted secrets are enclosed within "ENC()".
- At runtime, DIL calls the encryption utility to decrypt the secrets enclosed within "ENC()", which it finds by pattern matching.
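The pattern matching on "ENC()" can be sketched in Python as follows. This is only an illustration, not DIL's implementation: the regular expression, the function name `resolve_secrets`, and the `decrypt` callback are assumptions. In DIL the actual decryption is performed by the Gobblin encryption utility using the master key.

```python
import re
from typing import Callable

# Hypothetical pattern: anything enclosed in ENC(...), without nested parentheses.
ENC_PATTERN = re.compile(r"ENC\(([^()]*)\)")


def resolve_secrets(value: str, decrypt: Callable[[str], str]) -> str:
    """Replace every ENC(...) occurrence in value with its decrypted text."""
    return ENC_PATTERN.sub(lambda match: decrypt(match.group(1)), value)


if __name__ == "__main__":
    # Stand-in for the real decryption call; it only reverses the string.
    def fake_decrypt(ciphertext: str) -> str:
        return ciphertext[::-1]

    print(resolve_secrets('{"token": "ENC(terces)"}', fake_decrypt))
```

Values that contain no "ENC()" wrapper pass through unchanged, so the same function can be applied to every property that might hold a secret.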
Currently, encryption is a manual process: the pipeline developer needs to encrypt the secrets manually before putting them into the configuration.
Decryption is automatic, but the detection of encrypted strings is limited to the following job properties.
For example:
ms.authentication={"method": "custom", "encryption": "", "header": "x-apikey", "token": "ENC(xxx)"}
source.conn.username=ENC(xxx)
source.conn.password=ENC(xxx)
state.store.db.password=ENC(xxx)
Confidential data can be encrypted at the field level. This is an automatic process that uses the Gobblin encryption utility API.
To encrypt a field before it is written to storage, include the field name in ms.encryption.fields.
For example:
ms.encryption.fields=["access_token"]
The field "access_token" will be encrypted on storage, and its value will look like "ENC(xxx)". The value can only be decrypted using the master key stored in the location specified by encrypt.key.loc.
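To illustrate what field-level encryption does to a record, here is a minimal Python sketch. The function name `encrypt_fields` and the `encrypt` callback are hypothetical; in DIL the ciphertext comes from the Gobblin encryption utility, not from this code.

```python
from typing import Any, Callable, Dict, Iterable


def encrypt_fields(
    record: Dict[str, Any],
    fields: Iterable[str],
    encrypt: Callable[[str], str],
) -> Dict[str, Any]:
    """Return a copy of record with the named fields wrapped as ENC(ciphertext)."""
    protected = dict(record)
    for name in fields:
        # Only fields that are present and non-null are encrypted.
        if name in protected and protected[name] is not None:
            protected[name] = "ENC(" + encrypt(str(protected[name])) + ")"
    return protected
```

Fields that are not listed keep their original values, which matches the idea that only the names in ms.encryption.fields are protected.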
When sending credentials to a data system for authentication, the secrets can be encrypted using encryption methods acceptable to the data system. Currently, DIL only supports "base64" when sending username and password for authentication. Because usernames and passwords might contain special characters, most data systems that use username/password authentication, including those that use a user key and secret, require BASE64 encoding. See ms.authentication
BASE64 is a reversible encoding rather than true encryption, so credentials encoded this way have to be sent over a secure network to be safe.
For example:
ms.authentication={"method": "custom", "encryption": "base64", "header": "x-apikey", "token":"ENC(xxx)"}
At runtime, DIL will decrypt the "token" using the Gobblin utility, and then encode it using BASE64.
ms.authentication={"method": "basic", "encryption": "base64", "header": "Authorization"}
At runtime, DIL will take credentials from source.conn.username and source.conn.password, decrypt them if they are encrypted, concatenate them into one string separated by ":", and then encode the concatenated string using BASE64.
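The concatenate-then-encode step can be sketched in Python with the standard `base64` module. The function names are hypothetical, and the sketch assumes the credentials have already been decrypted. It also shows why BASE64 alone gives no secrecy: the encoded string can be decoded by anyone.

```python
import base64


def basic_credentials(username: str, password: str) -> str:
    """Join username and password with ':' and Base64-encode the result."""
    joined = f"{username}:{password}"
    return base64.b64encode(joined.encode("utf-8")).decode("ascii")


def decode_credentials(encoded: str) -> str:
    """Reverse the encoding; no key or secret is needed."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    token = basic_credentials("user", "pass")
    print(token)
    print(decode_credentials(token))
```

In HTTP Basic authentication, a string like this is conventionally sent in the Authorization header after the word "Basic".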
The data systems that DIL ingests from may encrypt their data using GPG. Encrypted data can be decrypted by preprocessors before it is parsed by the extractor.
GPG encryption and decryption can be symmetric or asymmetric.
- Symmetric decryption uses a password only; no private key is used.
- In asymmetric decryption, the source was encrypted using a public key and, optionally, a password. Decryption needs the private key, plus the password if one was used.
DIL is able to decrypt a source stream if it uses one of the following GPG-supported algorithms:
- 3DES
- IDEA (since versions 1.4.13 and 2.0.20)
- CAST5
- Blowfish
- Twofish
- AES-128, AES-192, AES-256
- Camellia-128, -192 and -256 (since versions 1.4.10 and 2.0.12)
The job configuration must use one of these ciphers to be accepted. If no cipher is specified, CAST5 is used.
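The cipher rule can be expressed as a small validation routine. The Python sketch below is illustrative only: the function name `resolve_cipher` and the exact spelling of the cipher names are assumptions, not DIL's configuration syntax.

```python
from typing import Optional

# Cipher names taken from the list above; the exact spelling is an assumption.
SUPPORTED_CIPHERS = {
    "3DES", "IDEA", "CAST5", "BLOWFISH", "TWOFISH",
    "AES-128", "AES-192", "AES-256",
    "CAMELLIA-128", "CAMELLIA-192", "CAMELLIA-256",
}
DEFAULT_CIPHER = "CAST5"


def resolve_cipher(configured: Optional[str]) -> str:
    """Return the cipher to use, falling back to CAST5 when none is configured."""
    if configured is None or not configured.strip():
        return DEFAULT_CIPHER
    name = configured.strip().upper()
    if name not in SUPPORTED_CIPHERS:
        raise ValueError(f"Unsupported cipher: {configured}")
    return name
```

Rejecting unknown names early, instead of silently falling back to the default, keeps a misspelled cipher from going unnoticed.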
For example:
ms.extract.preprocessors=org.apache.gobblin.multistage.preprocessor.GpgEncryptProcessor
ms.extract.preprocessor.parameters={"com.linkedin.cdi.preprocessor.GpgEncryptProcessor": {"keystore_path" :"/path/secret.gpg", "key_name": "999ABC", "keystore_password" : "ENC(password)"}}
In the above example, key_name is required for encryption. It is a long-type ID and should be formatted as a HEX string.
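The relationship between the numeric key ID and its HEX form can be checked with a few lines of Python. The helper names are hypothetical, and the 64-bit limit is an assumption based on the ID being a long.

```python
def parse_key_name(key_name: str) -> int:
    """Convert a hexadecimal key_name such as "999ABC" into its integer key ID."""
    key_id = int(key_name, 16)
    # Assumption: a long-type ID must fit in 64 bits and cannot be negative.
    if key_id < 0 or key_id >= 2 ** 64:
        raise ValueError("key ID does not fit in 64 bits")
    return key_id


def format_key_name(key_id: int) -> str:
    """Format an integer key ID as an uppercase HEX string."""
    return format(key_id, "X")


if __name__ == "__main__":
    key_id = parse_key_name("999ABC")
    print(key_id, format_key_name(key_id))
```

A value that fails to parse as hexadecimal raises ValueError, which is a cheap way to catch a key_name that was entered in decimal with stray characters or copied incorrectly.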
Encryption accepts the same set of algorithms as decryption. Currently, encryption only works in FileDumpExtractor, and it encrypts the whole file at once.