Source classes represent the protocols used to connect to data systems. In the Gobblin framework, a source class plays two roles:
- As the planner at job start, generating work units and initiating extractors
- As an agent in each work unit, acting on behalf of the job when the work unit is picked up by a task executor
In DIL, work unit generation is identical across all protocols, so it is handled by MultistageSource. Each extractor is initiated with a connection object, and because connection objects are protocol-specific, extractor initiation is handled by separate subclasses:
- For HTTP protocol, it is HttpSource
- For HDFS protocol, it is HdfsSource
- For JDBC protocol, it is JdbcSource
- For SFTP protocol, it is SftpSource
- For S3 protocol, it is S3SourceV2
Each subclass holds a set of job keys so that its extractors have the proper execution context; the agent function is therefore handled in the subclasses.
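
The division of labor described above can be illustrated with a minimal Java sketch. It is not the DIL or Gobblin API: `BaseSource`, `HttpLikeSource`, `JdbcLikeSource`, `Connection`, `Extractor`, and the job key names are hypothetical stand-ins chosen for illustration. The base class owns the shared planner logic (work unit generation), while each subclass owns its job keys and creates extractors bound to a protocol-specific connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SourceRolesSketch {

    // Hypothetical stand-in for a protocol-specific connection object.
    interface Connection {
        String protocol();
    }

    // Hypothetical stand-in for an extractor bound to a work unit,
    // a connection, and the job keys that form its execution context.
    static final class Extractor {
        final String workUnit;
        final Connection connection;
        final Map<String, String> jobKeys;

        Extractor(String workUnit, Connection connection, Map<String, String> jobKeys) {
            this.workUnit = workUnit;
            this.connection = connection;
            this.jobKeys = jobKeys;
        }

        String describe() {
            return connection.protocol() + ":" + workUnit;
        }
    }

    // Planner role: work unit generation is the same for every protocol,
    // so it lives in the shared base class (the role MultistageSource plays).
    abstract static class BaseSource {
        List<String> getWorkunits(int partitions) {
            List<String> units = new ArrayList<>();
            for (int i = 0; i < partitions; i++) {
                units.add("workunit-" + i);
            }
            return units;
        }

        // Agent role: extractor creation depends on the protocol,
        // so each subclass supplies its own implementation.
        abstract Extractor getExtractor(String workUnit);
    }

    // Hypothetical HTTP-flavored subclass holding its own job keys.
    static final class HttpLikeSource extends BaseSource {
        private final Map<String, String> jobKeys = Map.of("example.http.key", "value");

        @Override
        Extractor getExtractor(String workUnit) {
            return new Extractor(workUnit, () -> "http", jobKeys);
        }
    }

    // Hypothetical JDBC-flavored subclass holding its own job keys.
    static final class JdbcLikeSource extends BaseSource {
        private final Map<String, String> jobKeys = Map.of("example.jdbc.key", "value");

        @Override
        Extractor getExtractor(String workUnit) {
            return new Extractor(workUnit, () -> "jdbc", jobKeys);
        }
    }

    public static void main(String[] args) {
        BaseSource source = new HttpLikeSource();
        // Planning happens once; each work unit later gets its own extractor.
        for (String workUnit : source.getWorkunits(2)) {
            System.out.println(source.getExtractor(workUnit).describe());
        }
    }
}
```

Keeping work unit generation in the base class means a new protocol only needs a subclass that supplies its job keys and its connection; the planning logic is inherited unchanged.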