Henry Haiying Cai edited this page Jan 13, 2015 · 22 revisions


Gobblin is a unified data ingestion framework for bringing significant amounts of data from internal data sources (data generated on premise) and external data sources (data sourced from external web sites) into one central repository (HDFS) for analysis.

As companies move toward a data-driven decision-making business model, an increasing number of business products are driven by insights from data generated internally on premise or sourced externally from public web sites and web services. Gobblin was developed to make ingesting such big data easy by providing:

  • Centralized data lake: standardized data formats, directory layouts;
  • Standardized catalog of lightweight transformations: security filters, schema evolution, type conversion, etc;
  • Data quality measurements and enforcement: schema validation, data audits, etc;
  • Scalable ingest: auto-scaling, fault-tolerance, etc;

Support Matrix

Gobblin supports the following combinations of data sources and protocols:

  • The types of data sources: RDBMS (JDBC), files (HDFS/SFTP/local FS), REST (Salesforce), etc.;
  • The semantics of the data bundles: increments, appends, full dumps, etc.;
  • Deployment: standalone, Hadoop 1.2.1+, Hadoop 2.3.0+;
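A Gobblin job is described by a property file naming the source, converters, and writer for one source/protocol combination above. The sketch below is modeled on the Wikipedia example job that ships with Gobblin; the exact class and property names may differ in your Gobblin version, so treat it as illustrative and verify against the configuration glossary:

```properties
# Illustrative Gobblin job configuration (a .pull file).
# Class names below follow the bundled Wikipedia example and are
# assumptions -- substitute the source/converter for your own data.
job.name=PullFromWikipedia
job.group=Wikipedia

# Where the records come from and how they are transformed
source.class=gobblin.example.wikipedia.WikipediaSource
converter.classes=gobblin.example.wikipedia.WikipediaConverter

# How the records are written out and published to the data lake
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
```

In standalone deployment the job file is dropped into the job configuration directory and picked up by the Gobblin daemon; on Hadoop the same file drives a MapReduce-based ingest.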

Quickstart Guides and Examples

Community

Useful Links
