layout | title |
---|---|
page | Frequently Asked Questions |
This software is licensed under the Apache Software License 2.0. A file named LICENSE.txt should have been included with the software.
Data is stored using either Avro, for record-oriented storage, or Parquet, for column-oriented storage.
Avro files are Snappy-compressed and encoded using Avro's binary encoder, according to Avro's object container file specification. Avro meets the criteria for sane storage and processing of data. Specifically, Avro:
- has a binary representation that is compact.
- is language agnostic.
- supports compression of data.
- is splittable by MapReduce jobs, including when compressed.
- is self-describing.
- is fast to serialize/deserialize.
- is well-supported within the Hadoop ecosystem.
- is open source under a permissive license.
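As a rough illustration of why the binary representation is compact, the sketch below mirrors how the Avro specification encodes `long` values: a zig-zag mapping followed by a variable-length base-128 encoding, so small magnitudes take a single byte. The function names `encode_long` and `decode_long` are hypothetical helpers written for this example; they are not part of Avro's or Kite's API.

```python
def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag varint (hypothetical helper)."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit long")
    # Zig-zag: interleave negatives and positives so small magnitudes stay small.
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    # Emit 7 bits per byte, low-order group first; the high bit marks "more follows".
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data: bytes) -> int:
    """Decode a zig-zag varint produced by encode_long (hypothetical helper)."""
    z = 0
    shift = 0
    for b in data:
        z |= (b & 0x7F) << shift
        if not b & 0x80:  # high bit clear: this was the last byte
            break
        shift += 7
    # Undo the zig-zag mapping.
    return (z >> 1) ^ -(z & 1)
```

With this scheme, values between -64 and 63 occupy one byte instead of a fixed eight, which is one reason record files with many small numeric fields stay small even before compression is applied.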
Parquet files are also compressed, binary-encoded files, designed for efficient column-oriented access patterns and defined by the Parquet file specification.
Protocol buffers do not define a standard for storing a set of protocol-buffer-encoded records in a file that supports compression and is also splittable by MapReduce.
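To make "splittable" concrete, here is a minimal sketch of the idea behind Avro's object container files: blocks of records are separated by a 16-byte sync marker, so a worker that is handed an arbitrary byte range can scan forward to the next marker and start reading whole blocks from there. The layout below is a deliberately simplified toy, not the real Avro format (which also carries a header with the schema and uses variable-length counts), and `write_container` and `read_split` are hypothetical names. It also assumes that payloads never contain the marker bytes.

```python
import struct

SYNC_SIZE = 16  # Avro also uses a 16-byte marker; everything else here is a toy layout


def write_container(blocks, sync):
    """Toy layout: sync marker, then (4-byte length, payload, sync marker) per block."""
    out = bytearray(sync)
    for payload in blocks:
        out += struct.pack(">I", len(payload)) + payload + sync
    return bytes(out)


def read_split(data, sync, start, end):
    """Return payloads of the blocks whose preceding sync marker begins in [start, end)."""
    pos = data.find(sync, start)  # scan forward to the next block boundary
    payloads = []
    while pos != -1 and pos < end:
        body = pos + SYNC_SIZE
        if body >= len(data):  # trailing marker: no block follows it
            break
        (length,) = struct.unpack_from(">I", data, body)
        payloads.append(data[body + 4 : body + 4 + length])
        pos = body + 4 + length  # the marker that closes this block opens the next
    return payloads
```

Because each block is owned by the split that contains the start of its preceding marker, two workers reading `[0, cut)` and `[cut, len(data))` see every block exactly once, wherever `cut` falls. A plain stream of length-prefixed records has no such resynchronization point, which is the gap the paragraph above describes.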
See Why not protocol buffers?
See the serializer benchmarks at https://github.com/eishay/jvm-serializers/wiki. In other words, because it's terrible.
Absolutely! To get started, you're encouraged to read How to Contribute. In short, you must:
- Be able to (legally) complete, sign, and return a contributor license agreement.
- Follow the existing style and standards.
CDK has been renamed to Kite, which is this project. The main goal of Kite is to increase the accessibility of Apache Hadoop as a platform. That goal isn't specific to Cloudera, so we updated the name to correctly represent the project as an open, community-driven set of tools.