
Configuration Properties Glossary

  • Author: Sahil
  • Reviewer: Ziyang

Configuration properties are key-value pairs set in text files. They include system properties that control how Gobblin pulls data, and that specify which source Gobblin pulls the data from. Configuration files end in a user-specified suffix (by default, text files ending in ".pull" or ".job" are recognized as config files, although this is configurable). Each file represents some unit of work that needs to be done in Gobblin. For example, there will typically be a separate configuration file for each table that needs to be pulled from a database.

The first section of this document contains all the required properties needed to run a basic Gobblin job. The rest of the document is dedicated to other properties that can be used to configure Gobblin jobs. The GitHub repo also contains sample config files for specific sources. For example, there are sample config files to connect to MySQL databases and SFTP servers.

Creating a Basic Properties File

To create a basic properties file, only a small set of required properties needs to be set. The following properties are required to run any Gobblin job:

  • job.name - Name of the job
  • source.class - Fully qualified path to the Source class responsible for connecting to the data source
  • writer.staging.dir - The directory each task will write staging data to
  • writer.output.dir - The directory each task will commit data to
  • data.publisher.final.dir - The final directory where all the data will be published
  • state.store.dir - The directory where state-store files will be written

For more information on each property, check out the comprehensive list below.

If only these properties are set, then by default, Gobblin will write data in Avro format to the local filesystem. In order to write to HDFS, set the writer.fs.uri property to the URI of the HDFS NameNode that data should be written to. Since the default version of Gobblin writes data in Avro format, the writer expects Avro records to be passed to it. Thus, any data pulled from an external source must be converted to Avro records before it can be written out to the filesystem.
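
For instance, a minimal job file might look like the following sketch. The job name, directory paths, and NameNode host are placeholders, and the Source class is assumed to be the WikipediaSource mentioned above:

    # Minimal Gobblin job file (all names and paths are illustrative)
    job.name=ExampleWikipediaPull
    source.class=com.linkedin.gobblin.example.wikipedia.WikipediaSource
    writer.staging.dir=/gobblin/task-staging
    writer.output.dir=/gobblin/task-output
    data.publisher.final.dir=/gobblin/published
    state.store.dir=/gobblin/state-store
    # Uncomment to write to HDFS instead of the local filesystem
    # writer.fs.uri=hdfs://namenode-host:8020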

The source.class property is one of the most important properties in Gobblin. It specifies what Source class to use. The Source class is responsible for determining what work needs to be done during each run of the job, and specifies what Extractor to use in order to read over each sub-unit of data. Examples of Source classes are WikipediaSource and SimpleJsonSource, which can be found in the GitHub repository. For more information on Sources and Extractors, check out the Architecture page.

Job Launcher Properties

Gobblin jobs can be launched and scheduled in a variety of ways. They can be scheduled via a Quartz scheduler or through Azkaban. Jobs can also be run without a scheduler via the Command Line. For more information on launching Gobblin jobs, check out the Deployment page.

Common Job Launcher Properties

These properties are common to both the Job Launcher and the Command Line.

job.name

Description

The name of the job to run. This name must be unique within a single Gobblin instance.

Default Value

None

Required

Yes

job.group

Description

A way to group logically similar jobs together.

Default Value

None

Required

No

job.description

Description

A description of what the job does.

Default Value

None

Required

No

job.lock.dir

Description

Directory where job locks are stored. Job locks are used by the scheduler to ensure that two executions of a job do not run at the same time. If a job is scheduled to run, Gobblin first checks this directory for a lock file for the job; if one exists, the job is skipped, otherwise the job is run.

Default Value

None

Required

No

job.lock.enabled

Description

If set to true, job locks are enabled; if set to false, they are disabled.

Default Value

True

Required

No

job.runonce

Description

A boolean specifying whether the job will be run only once or multiple times. If set to true, the job will run only once even if a job.schedule is specified. If set to false and a job.schedule is specified, the job will run according to the schedule. If set to false and no job.schedule is specified, it will run only once.

Default Value

False

Required

No

job.disabled

Description

Whether the job is disabled or not. If set to true, then Gobblin will not run this job.

Default Value

False

Required

No

SchedulerDaemon Properties

This class is used to schedule Gobblin jobs on Quartz. For more information on how to set the configuration parameters for jobs launched through the SchedulerDaemon, check out the Deployment page.

job.schedule

Description

Cron-based job schedule. This schedule only applies to jobs that run using Quartz.

Default Value

None

Required

No
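
For example, to fire a job at the top of every hour, the schedule could be set with a Quartz cron expression like the following (the expression itself is illustrative):

    # Quartz cron fields: seconds minutes hours day-of-month month day-of-week
    job.schedule=0 0 * * * ?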

jobconf.dir

Description

When running in local mode, Gobblin will check this directory for any configuration files. Each configuration file should correspond to a separate Gobblin job, and each one should end in a suffix specified by the jobconf.extensions parameter.

Default Value

None

Required

No

jobconf.extensions

Description

Comma-separated list of supported job configuration file extensions. When running in local mode, Gobblin will only pick up job files ending in these suffixes.

Default Value

pull,job

Required

No

jobconf.monitor.interval

Description

Controls how often Gobblin checks the jobconf.dir for new configuration files, or for configuration file updates. The parameter is measured in milliseconds.

Default Value

300000

Required

No
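
Putting the three SchedulerDaemon properties together, a scheduler configuration might look like this sketch (the directory path is a placeholder):

    jobconf.dir=/gobblin/job-conf
    jobconf.extensions=pull,job
    # Check for new or updated job files every 5 minutes (300000 ms)
    jobconf.monitor.interval=300000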

CliMRJobLauncher Properties

There are no configuration parameters specific to CliMRJobLauncher. This class is used to launch Gobblin jobs on Hadoop from the command line; the jobs are not scheduled. For more information on how to set the configuration parameters for jobs launched through the command line, check out the Deployment page.

Job Type Properties

Common Job Type Properties

launcher.type

Description

Job launcher type; one of LOCAL, MAPREDUCE, YARN. LOCAL mode runs on a single machine (LocalJobLauncher), MAPREDUCE runs on a Hadoop cluster (MRJobLauncher), and YARN runs on a YARN cluster (not implemented yet).

Default Value

LOCAL

Required

No

LocalJobLauncher Properties

There are no configuration parameters specific to LocalJobLauncher. The LocalJobLauncher launches a Gobblin job on a single machine. If launcher.type is set to LOCAL, this class will be used to launch the job.

MRJobLauncher Properties

The following properties are used by the MRJobLauncher class, which launches Gobblin jobs on a Hadoop cluster when launcher.type is set to MAPREDUCE.

framework.jars

Description

Comma-separated list of jars the Gobblin framework depends on. These jars will be added to the classpath of the job, and to the classpath of any containers the job launches.

Default Value

None

Required

No

job.jars

Description

Comma-separated list of jar files the job depends on. These jars will be added to the classpath of the job, and to the classpath of any containers the job launches.

Default Value

None

Required

No

job.local.files

Description

Comma-separated list of local files the job depends on. These files will be available to any map tasks that get launched via the DistributedCache.

Default Value

None

Required

No

job.hdfs.files

Description

Comma-separated list of files on HDFS the job depends on. These files will be available to any map tasks that get launched via the DistributedCache.

Default Value

None

Required

No

mr.job.root.dir

Description

Working directory for a Gobblin Hadoop MR job. Gobblin uses this to write intermediate data, such as the workunit state files that are used by each map task. This has to be a path on HDFS.

Default Value

None

Required

Yes

mr.job.max.mappers

Description

Maximum number of mappers to use in a Gobblin Hadoop MR job. If no explicit limit is set, a map task is launched for each workunit. If the value of this property is less than the number of workunits created, then each map task will run multiple tasks.

Default Value

None

Required

No
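
As an illustration, a job running in MR mode might combine these properties as follows; the jar names, HDFS paths, and mapper count are placeholders:

    launcher.type=MAPREDUCE
    # Must be a path on HDFS
    mr.job.root.dir=/gobblin/working-dir
    job.jars=lib/my-job-dep.jar
    job.hdfs.files=/data/lookup/mapping.avro
    # With more workunits than mappers, each map task runs multiple workunits
    mr.job.max.mappers=8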

Retry Properties

Properties that control how tasks and jobs get retried on failure.

workunit.retry.enabled

Description

Whether retries of failed work units across job runs are enabled or not.

Default Value

True

Required

No

workunit.retry.policy

Description

Work unit retry policy, can be one of {always, never, onfull, onpartial}.

Default Value

always

Required

No

task.maxretries

Description

Maximum number of task retries. A task will be retried this many times before it is considered a failure.

Default Value

5

Required

No

task.retry.intervalinsec

Description

Interval in seconds between task retries. The interval increases linearly with each retry. For example, if the first interval is 300 seconds, then the second one is 600 seconds, etc.

Default Value

300

Required

No

job.max.failures

Description

Maximum number of failures before an alert email is triggered.

Default Value

1

Required

No
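
A sketch of a retry configuration using the properties above; the values are illustrative:

    # Retry failed workunits across runs, but only for full extracts
    workunit.retry.enabled=true
    workunit.retry.policy=onfull
    # Retry each task up to 3 times, waiting 300s, then 600s, then 900s
    task.maxretries=3
    task.retry.intervalinsec=300
    # Alert after 2 consecutive failed runs (requires email alerts, see below)
    job.max.failures=2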

Task Execution Properties

These properties control how tasks get executed for a job. Gobblin uses thread pools in order to execute the tasks for a specific job. In local mode there is a single thread pool per job that executes all the tasks for a job. In MR mode there is a thread pool for each map task (or container), and all Gobblin tasks assigned to that mapper are executed in that thread pool.

taskexecutor.threadpool.size

Description

Size of the thread pool used by the task executor for task execution. Each task executor will spawn this many threads to execute any Tasks that it has been allocated.

Default Value

10

Required

No

tasktracker.threadpool.coresize

Description

Core size of the thread pool used by task tracker for task state tracking and reporting.

Default Value

10

Required

No

tasktracker.threadpool.maxsize

Description

Maximum size of the thread pool used by task tracker for task state tracking and reporting.

Default Value

10

Required

No

taskretry.threadpool.coresize

Description

Core size of the thread pool used by the task executor for task retries.

Default Value

2

Required

No

taskretry.threadpool.maxsize

Description

Maximum size of the thread pool used by the task executor for task retries.

Default Value

2

Required

No

task.status.reportintervalinms

Description

Task status reporting interval in milliseconds.

Default Value

30000

Required

No
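
For example, to run more tasks concurrently and report status more often, the defaults could be overridden like this (the values are illustrative):

    # Run up to 20 tasks concurrently per job (or per mapper in MR mode)
    taskexecutor.threadpool.size=20
    tasktracker.threadpool.coresize=10
    tasktracker.threadpool.maxsize=20
    # Report task status every 10 seconds
    task.status.reportintervalinms=10000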

State Store Properties

state.store.dir

Description

Root directory where job and task state files are stored. The state-store is used by Gobblin to track state between different executions of a job. All state-store files will be written to this directory.

Default Value

None

Required

Yes

state.store.fs.uri

Description

File system URI for file-system-based state stores.

Default Value

file:///

Required

No
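
For example, to keep the state store on HDFS rather than the local filesystem (the host and path are placeholders):

    state.store.fs.uri=hdfs://namenode-host:8020
    state.store.dir=/gobblin/state-store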

Metrics Properties

metrics.enabled

Description

Whether metrics collecting and reporting are enabled or not.

Default Value

True

Required

No

metrics.report.interval

Description

Metrics reporting interval in milliseconds.

Default Value

60000

Required

No

metrics.log.dir

Description

If this parameter is not present, metrics will not be written to a file. If it is present, metrics will be written to the directory specified by this key.

Default Value

None

Required

No
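
A sketch enabling metrics collection with file-based reporting; the directory is a placeholder:

    metrics.enabled=true
    # Report metrics once a minute
    metrics.report.interval=60000
    metrics.log.dir=/gobblin/metrics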

rest.server.host

Description
Default Value
Required

metrics.reporting.file.enabled

Description
Default Value
Required

metrics.reporting.jmx.enabled

Description
Default Value
Required

job.execinfo.server.enabled

Description
Default Value
Required

job.history.store.enabled

Description
Default Value
Required

job.history.store.url

Description
Default Value
Required

job.history.store.jdbc.driver

Description
Default Value
Required

job.history.store.user

Description
Default Value
Required

job.history.store.password

Description
Default Value
Required

Email Alert Properties

email.alert.enabled

Description

Whether alert emails are enabled or not. Alert emails are only sent out when a job fails job.max.failures times in a row.

Default Value

False

Required

No

email.notification.enabled

Description

Whether job completion notification emails are enabled or not. Notification emails are sent whenever the job completes, regardless of whether it failed or not.

Default Value

False

Required

No

email.host

Description

Host name of the email server.

Default Value

None

Required

Yes, if email notifications or alerts are enabled.

email.smtp.port

Description

SMTP port number.

Default Value

None

Required

Yes, if email notifications or alerts are enabled.

email.user

Description

User name of the sender email account.

Default Value

None

Required

No

email.password

Description

User password of the sender email account.

Default Value

None

Required

No

email.from

Description

Sender email address.

Default Value

None

Required

Yes, if email notifications or alerts are enabled.

email.tos

Description

Comma-separated list of recipient email addresses.

Default Value

None

Required

Yes, if email notifications or alerts are enabled.
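
Putting the email properties together, an alerting setup might look like this sketch; the host, port, and addresses are placeholders:

    email.alert.enabled=true
    email.host=smtp.example.com
    email.smtp.port=587
    # Credentials are shown inline for illustration only
    email.user=gobblin-alerts
    email.password=changeme
    email.from=gobblin-alerts@example.com
    email.tos=oncall@example.com,data-team@example.com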

Source Properties

Common Source Properties

These are common properties used among different Source implementations. Depending on which source class is being used, these parameters may or may not be necessary. They are not tied to a specific source, and thus can be used in new source classes.

source.class

Description

Fully qualified name of the Source class. For example, com.linkedin.gobblin.example.wikipedia.WikipediaSource.

Default Value

None

Required

Yes

source.entity

Description

Name of the source entity that needs to be pulled from the source. The parameter represents a logical grouping of data that needs to be pulled from the source. Often this logical grouping comes in the form of a database table, a source topic, etc. In many situations, such as when using the QueryBasedExtractor, it will be the name of the table that needs to be pulled from the source.

Default Value

None

Required

Required for QueryBasedExtractors and FileBasedExtractors.

source.timezone

Description

Timezone of the data being pulled in by the extractor. Examples include "PST" or "UTC".

Default Value

None

Required

Required for QueryBasedExtractors

source.max.number.of.partitions

Description

Maximum number of partitions to split this current run across. Only used by the QueryBasedSource and FileBasedSource.

Default Value

20

Required

No

source.skip.first.record

Description

True if you want to skip the first record of each data partition. Only used by the FileBasedExtractor.

Default Value

False

Required

No

extract.namespace

Description

Namespace for the extract data. The namespace will be included in the default file name of the outputted data.

Default Value

None

Required

No

source.conn.use.proxy.url

Description

The URL of the proxy to connect to when connecting to the source. This parameter is only used for SFTP and REST sources.

Default Value

None

Required

No

source.conn.use.proxy.port

Description

The port of the proxy to connect to when connecting to the source. This parameter is only used for SFTP and REST sources.

Default Value

None

Required

No

source.conn.username

Description

The username to authenticate with the source. This parameter is only used for SFTP and JDBC sources.

Default Value

None

Required

No

source.conn.password

Description

The password to use when authenticating with the source. This parameter is only used for JDBC sources.

Default Value

None

Required

No

source.conn.host

Description

The host URL to connect to.

Default Value

None

Required

Required for SftpExtractor, MySQLExtractor, and SQLServerExtractor.

source.conn.rest.url

Description

URL to connect to for REST requests. This parameter is only used for the Salesforce source.

Default Value

None

Required

No

source.conn.version

Description

Version number of communication protocol. This parameter is only used for the Salesforce source.

Default Value

None

Required

No

source.conn.timeout

Description

The timeout set for connecting to the source in milliseconds.

Default Value

500000

Required

No

source.conn.port

Description

The value of the port to connect to.

Default Value

None

Required

Required for SftpExtractor, MySQLExtractor, SqlServerExtractor.
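
As an example, a JDBC source such as MySQL might use the following connection properties; the host, port, and credentials are placeholders:

    source.conn.host=mysql.example.com
    source.conn.port=3306
    source.conn.username=gobblin
    source.conn.password=changeme
    # Connection timeout in milliseconds
    source.conn.timeout=500000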

extract.table.name

Description

The table name to use in Hadoop, which may differ from the table name in the source.

Default Value

Source table name

Required

No

extract.is.full

Description

True if this pull should treat the data as a full dump of table from the source, false otherwise

Default Value

False

Required

No

extract.delta.fields

Description

List of columns that will be used as the delta field for the data.

Default Value

None

Required

No

extract.primary.key.fields

Description

List of columns that will be used as the primary key for the data.

Default Value

None

Required

No

extract.pull.limit

Description

This limits the number of records read by Gobblin. In Gobblin's extractor, the readRecord() method is expected to return records until there are no more to pull, in which case it returns null. This parameter limits the number of times readRecord() is executed. It is useful for pulling a limited sample of the source data for testing purposes.

Default Value

Unbounded

Required

No
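
For instance, an incremental pull of a database table might use the extract properties like this; the table and column names are placeholders:

    source.entity=user_profiles
    extract.namespace=example_db
    extract.table.name=user_profiles_snapshot
    extract.is.full=false
    extract.delta.fields=last_modified_ts
    extract.primary.key.fields=user_id
    # Pull only 1000 records while testing
    extract.pull.limit=1000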

extract.full.run.time

Description
Default Value
Required

QueryBasedExtractor Properties

The following are the query-based extractor configuration properties.

source.querybased.watermark.type

Description

The format of the watermark that is used when extracting data from the source. Possible types are timestamp, date, hour, simple.

Default Value

timestamp

Required

Yes

source.querybased.start.value

Description

Value for the watermark to start pulling data from, also the default watermark if the previous watermark cannot be found in the old task states.

Default Value

None

Required

Yes

source.querybased.partition.interval

Description

Number of hours to pull in each partition.

Default Value

1

Required

No

source.querybased.hour.column

Description

Delta column containing the hour, for hourly extracts (e.g., hour_sk).

Default Value

None

Required

No

source.querybased.skip.high.watermark.calc

Description

If true, skips the high watermark calculation in the source and uses the upper bound of the partition range as the high watermark instead of getting it from the source.

Default Value

False

Required

No

source.querybased.query

Description

The query that the extractor should execute to pull data.

Default Value

None

Required

No

source.querybased.hourly.extract

Description

True if hourly extract is required.

Default Value

False

Required

No

source.querybased.extract.type

Description

"snapshot" for the incremental dimension pulls. "append_daily", "append_hourly" and "append_batch" for the append data append_batch for the data with sequence numbers as watermarks

Default Value

None

Required

No

source.querybased.end.value

Description

The high watermark up to which this entire job should pull. If this is not specified, the entire table is pulled.

Default Value

None

Required

No

source.querybased.append.max.watermark.limit

Description

Maximum limit of the high watermark for append data, in the form CURRENT_DATE - X or CURRENT_HOUR - X where X >= 1.

Default Value

CURRENT_DATE for daily extracts, CURRENT_HOUR for hourly extracts.

Required

No

source.querybased.is.watermark.override

Description

True if this pull should override the previous watermark with start.value and end.value; false otherwise.

Default Value

False

Required

No

source.querybased.low.watermark.backup.secs

Description

Number of seconds to back up from the previous high watermark, in order to cover late-arriving data. For example, set this to 3600 to cover one hour of late data.

Default Value

0

Required

No
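
Putting several of the watermark properties above together, an incremental timestamp-based pull might be configured as follows; the start value shown assumes a yyyyMMddHHmmss format and is illustrative:

    source.querybased.watermark.type=timestamp
    source.querybased.start.value=20150101000000
    # Pull 24 hours of data per partition
    source.querybased.partition.interval=24
    # Re-read one hour before the previous high watermark to cover late data
    source.querybased.low.watermark.backup.secs=3600
    source.querybased.extract.type=snapshot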

source.querybased.schema

Description

Database name

Default Value

None

Required

No

source.querybased.is.specific.api.active

Description

True if this pull needs to use source-specific APIs instead of standard protocols, e.g., the Salesforce Bulk API instead of the REST API.

Default Value

False

Required

No

source.querybased.skip.count.calc

Description

A boolean, if true then the QueryBasedExtractor will skip the source count calculation.

Default Value

False

Required

No

source.querybased.fetch.size

Description
Default Value
Required

source.querybased.is.metadata.column.check.enabled

Description
Default Value
Required

source.querybased.is.compression.enabled

Description
Default Value
Required

source.querybased.jdbc.resultset.fetch.size

Description
Default Value
Required

JdbcExtractor Properties

The following are the JDBC-based extractor configuration properties.

source.conn.driver

Description

The fully qualified class name of the JDBC driver used to connect to the external source.

Default Value

None

Required

Yes

source.column.name.case

Description

An enum specifying whether to convert the column names to a specific case before performing a query. Possible values are TOUPPER and TOLOWER.

Default Value

NOCHANGE

Required

No
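
A sketch for a MySQL connection; com.mysql.jdbc.Driver is the standard MySQL JDBC driver class, and the case conversion is optional:

    source.conn.driver=com.mysql.jdbc.Driver
    source.column.name.case=TOLOWER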

FileBasedExtractor Properties

The following are the file-based extractor configuration properties.

source.filebased.data.directory

Description

The directory from which to pull data.

Default Value

None

Required

Yes

source.filebased.files.to.pull

Description

A list of files to pull - this should be set in the Source class and the extractor will pull the specified files.

Default Value

None

Required

Yes

filebased.report.status.on.count

Description

The FileBasedExtractor will report its status every time it processes the number of records specified by this parameter. It reports status by logging how many records it has seen.

Default Value

10000

Required

No

source.filebased.fs.uri

Description

The URI of the filesystem to connect to.

Default Value

None

Required

Required for HadoopExtractor.

source.filebased.preserve.file.name

Description

A boolean; if true, the original file names will be preserved when the files are written out.

Default Value

False

Required

No

source.schema

Description

The schema of the data that will be pulled by the source.

Default Value

None

Required

Yes
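
For example, pulling files from a directory on HDFS while keeping their original names might look like this; the host and path are placeholders:

    source.filebased.fs.uri=hdfs://namenode-host:8020
    source.filebased.data.directory=/data/incoming
    source.filebased.preserve.file.name=true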

SftpExtractor Properties

source.conn.private.key

Description

File location of the private key used for key based authentication. This parameter is only used for the SFTP source.

Default Value

None

Required

Yes

source.conn.known.hosts

Description

File location of the known hosts file used for key based authentication.

Default Value

None

Required

Yes
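
A sketch of key-based SFTP authentication; the host and file locations are placeholders:

    source.conn.host=sftp.example.com
    source.conn.username=gobblin
    source.conn.private.key=/home/gobblin/.ssh/id_rsa
    source.conn.known.hosts=/home/gobblin/.ssh/known_hosts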

Converter Properties

Properties for Gobblin converters.

converter.classes

Description

Comma-separated list of fully qualified names of the Converter classes. The order is important as the converters will be applied in this order.

Default Value

None

Required

No
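
For example, a job pulling CSV data might chain the two converters described in the following sections. The package names are assumptions based on the com.linkedin.uif prefix used elsewhere on this page:

    # Converters are applied in order: CSV text -> JSON -> Avro
    converter.classes=com.linkedin.uif.converter.csv.CsvToJsonConverter,com.linkedin.uif.converter.avro.JsonIntermediateToAvroConverter
    converter.csv.json.delimiter=,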

CsvToJsonConverter Properties

This converter takes in text data separated by a delimiter (converter.csv.json.delimiter), and splits the data into a JSON format recognized by JsonIntermediateToAvroConverter.

converter.csv.json.delimiter

Description

The regex delimiter between fields of CSV-based files, only necessary when using the CsvToJsonConverter - e.g. ",", "\t", or some other regex.

Default Value

None

Required

Yes

JsonIntermediateToAvroConverter Properties

This converter takes in JSON data in a specific schema, and converts it to Avro data.

converter.avro.date.format

Description

Source format of the date columns for Avro-related converters.

Default Value

None

Required

No

converter.avro.timestamp.format

Description

Source format of the timestamp columns for Avro-related converters.

Default Value

None

Required

No

converter.avro.time.format

Description

Source format of the time columns for Avro-related converters.

Default Value

None

Required

No

converter.avro.binary.charset

Description

Charset of the binary columns for Avro-related converters.

Default Value

UTF-8

Required

No

converter.is.epoch.time.in.seconds

Description

A boolean specifying whether an epoch time field in the JSON object is in seconds or not.

Default Value

None

Required

Yes

converter.avro.max.conversion.failures

Description

The number of record conversion failures this converter will tolerate before throwing an exception.

Default Value

0

Required

No

AvroFilterConverter Properties

This converter takes in an Avro record, and filters out records by performing an equality operation on the value of the field specified by converter.filter.field and the value specified in converter.filter.value. It returns the record unmodified if the equality operation evaluates to true, false otherwise.

converter.filter.field

Description

The name of the field in the Avro record that the converter will filter records on.

Default Value

None

Required

Yes

converter.filter.value

Description

The value that will be used in the equality operation to filter out records.

Default Value

None

Required

Yes

AvroFieldRetrieverConverter Properties

This converter takes a specific field from an Avro record and returns its value.

converter.avro.extractor.field.path

Description

The field in the Avro record to retrieve. If it is a nested field, then each level must be separated by a period.

Default Value

None

Required

Yes

Fork Properties

Properties for Gobblin's fork operator.

fork.operator.class

Description

Fully qualified name of the ForkOperator class.

Default Value

com.linkedin.uif.fork.IdentityForkOperator

Required

No

fork.branches

Description

Number of fork branches.

Default Value

1

Required

No

fork.branch.name.${branch index}

Description

Name of a fork branch with the given index, e.g., 0 and 1.

Default Value

fork_${branch index}, e.g., fork_0 and fork_1.

Required

No
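
A minimal sketch that forks each record into two named branches; the branch names are arbitrary:

    fork.operator.class=com.linkedin.uif.fork.IdentityForkOperator
    fork.branches=2
    fork.branch.name.0=avro_branch
    fork.branch.name.1=backup_branch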

Quality Checker Properties

qualitychecker.task.policies

Description

Comma-separated list of fully qualified names of the TaskLevelPolicy classes that will run at the end of each Task.

Default Value

None

Required

No

qualitychecker.task.policy.types

Description

OPTIONAL implies the corresponding class in qualitychecker.task.policies is optional, and if it fails, the Task will still succeed; FAIL implies that if the corresponding class fails, then the Task will fail too.

Default Value

OPTIONAL

Required

No

qualitychecker.row.policies

Description

Comma-separated list of fully qualified names of the RowLevelPolicy classes that will run on each record.

Default Value

None

Required

No

qualitychecker.row.policy.types

Description

OPTIONAL implies the corresponding class in qualitychecker.row.policies is optional, and if it fails, the Task will still succeed; FAIL implies that if the corresponding class fails, then the Task will fail too; ERR_FILE implies that if the record does not pass the check, then the record will be written to an error file.

Default Value

OPTIONAL

Required

No

qualitychecker.row.err.file

Description

If the current record fails the quality checks specified by qualitychecker.row.policies, the quality checker will write it to the location specified by this parameter; this file is only written to if the quality checker policy type is ERR_FILE.

Default Value

None

Required

No
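
For example, a job might fail tasks on a schema check while routing bad rows to an error file; the policy class names are hypothetical placeholders:

    qualitychecker.task.policies=com.example.policies.SchemaCompatibilityPolicy
    qualitychecker.task.policy.types=FAIL
    qualitychecker.row.policies=com.example.policies.NotNullPrimaryKeyPolicy
    qualitychecker.row.policy.types=ERR_FILE
    qualitychecker.row.err.file=/gobblin/err/user_profiles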

Writer Properties

writer.destination.type

Description

Writer destination type; currently only writing to HDFS is supported.

Default Value

HDFS

Required

No

writer.output.format

Description

Writer output format; currently only Avro is supported.

Default Value

AVRO

Required

No

writer.fs.uri

Description

File system URI for writer output.

Default Value

file:///

Required

No

writer.staging.dir

Description

Staging directory of writer output. All staging data that the writer produces will be placed in this directory, and will eventually be moved to the writer.output.dir.

Default Value

None

Required

Yes

writer.output.dir

Description

Output directory of writer output. All output data that the writer produces will be placed in this directory, and will eventually be moved to the final directory by the publisher.

Default Value

None

Required

Yes

writer.builder.class

Description

Fully qualified name of the writer builder class.

Default Value

com.linkedin.uif.writer.AvroDataWriterBuilder

Required

No

writer.file.path

Description

The path where the writer will write its data. Data in this directory will be copied to its final output directory by the DataPublisher.

Default Value

None

Required

Yes

writer.file.name

Description

The name of the file the writer writes to.

Default Value

part

Required

Yes

writer.buffer.size

Description

Writer buffer size in bytes. This parameter is only applicable for the AvroHdfsDataWriter.

Default Value

4096

Required

No

writer.deflate.level

Description

Writer deflate level. Deflate is a type of compression for Avro data.

Default Value

9

Required

No

writer.codec.type

Description

This is used to specify the type of compression used when writing data out. Possible values are NOCOMPRESSION, DEFLATE, SNAPPY.

Default Value

DEFLATE

Required

No
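
Combining the writer properties, a typical HDFS setup might look like this sketch; the host and paths are placeholders:

    writer.fs.uri=hdfs://namenode-host:8020
    writer.staging.dir=/gobblin/task-staging
    writer.output.dir=/gobblin/task-output
    writer.output.format=AVRO
    # Trade compression ratio for speed with Snappy instead of the default deflate
    writer.codec.type=SNAPPY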

Data Publisher Properties

data.publisher.type

Description

The fully qualified name of the DataPublisher class to run. The DataPublisher is responsible for publishing task data once all Tasks have been completed.

Default Value

None

Required

Yes

data.publisher.final.dir

Description

The final output directory where the data should be published.

Default Value

None

Required

Yes

data.publisher.replace.final.dir

Description

A boolean; if true and the final output directory already exists, then the data will not be committed. If false and the final output directory already exists, then it will be overwritten.

Default Value

None

Required

Yes

data.publisher.final.name

Description

The final name of the file that is produced by Gobblin. By default, Gobblin already assigns a unique name to each file it produces. If that default name needs to be overridden, this parameter can be used. Typically, this parameter should be set on a per-workunit basis so that file names don't collide.

Default Value

None

Required

No
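
A publisher configuration might look like the following; the class name is an assumption based on the com.linkedin.uif prefix used elsewhere on this page, and the path is a placeholder:

    data.publisher.type=com.linkedin.uif.publisher.BaseDataPublisher
    data.publisher.final.dir=/gobblin/published
    data.publisher.replace.final.dir=false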

Generic Properties

These properties are used throughout multiple Gobblin components.

fs.uri

Description

Default file system URI for all file storage; overridable by more specific configuration properties.

Default Value

file:///

Required

No
