v1.0.0

matthayes released this 04 Sep 19:41

This is not a backwards compatible release.

Additions:

  • Added SampleByKey, which provides a way to sample tuples based on certain fields.
  • Added Coalesce, which returns the first non-null value from a list of arguments like SQL's COALESCE.
  • Added BagGroup, which performs an in-memory group operation on a bag.
  • Added ReservoirSample
  • Added the In filter function, which behaves like SQL's IN
  • Added EmptyBagToNullFields, which enables multi-relation left joins using COGROUP
  • Sessionize now supports long values for timestamp, in addition to string representation of time.
  • BagConcat can now operate on a bag of bags, in addition to a tuple of bags
  • Created TransposeTupleToBag, which creates a bag of key-value pairs from a tuple
  • SessionCount now implements Accumulator interface
  • DistinctBy now implements Accumulator interface
  • Now using PigUnit from Maven for testing, instead of a checked-in JAR
  • Added many more test cases to improve coverage
  • Improved documentation
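
To illustrate the semantics described for Coalesce above (return the first non-null argument, as SQL's COALESCE does), here is a minimal standalone Java sketch. It is not DataFu's implementation and does not use the Pig UDF API; the class name `CoalesceSketch` and the method `coalesce` are hypothetical names chosen for this example.

```java
// Hypothetical illustration of COALESCE semantics; not the DataFu UDF itself.
final class CoalesceSketch {
    private CoalesceSketch() {}

    // Returns the first non-null argument, or null if every argument is null
    // (or if no arguments are given).
    @SafeVarargs
    static <T> T coalesce(T... values) {
        for (T value : values) {
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

With this sketch, `CoalesceSketch.coalesce(null, "b", "c")` evaluates to `"b"`, and a call where every argument is null evaluates to null.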

Changes:

  • Moved WeightedSample to datafu.pig.sampling
  • Using Pig 0.11.1 for testing.
  • Renamed package datafu.pig.numbers to datafu.pig.random
  • Renamed package datafu.pig.bag.sets to datafu.pig.sets
  • Renamed TimeCount to SessionCount, moved to datafu.pig.sessions
  • ASSERT renamed to Assert
  • MD5Base64 merged into the MD5 implementation; a constructor argument selects the output encoding, with hex as the default
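
The MD5 change above describes a single implementation whose constructor argument chooses between hex and Base64 output. A rough standalone Java sketch of that design follows. It is an assumption-laden illustration, not DataFu's code: the class name `Md5Sketch`, the method `digest`, and the argument values `"hex"` and `"base64"` are hypothetical, and UTF-8 input encoding is assumed. Only standard JDK classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical sketch: one MD5 class, output encoding chosen in the constructor.
final class Md5Sketch {
    private final boolean base64;

    // Default constructor: hex output, matching the default named in the notes.
    Md5Sketch() {
        this("hex");
    }

    // "hex" and "base64" are assumed argument values for this example only.
    Md5Sketch(String encoding) {
        if ("base64".equals(encoding)) {
            base64 = true;
        } else if ("hex".equals(encoding)) {
            base64 = false;
        } else {
            throw new IllegalArgumentException("Unknown encoding: " + encoding);
        }
    }

    // Hashes the UTF-8 bytes of the input and encodes the 16-byte digest.
    String digest(String input) {
        byte[] hash;
        try {
            hash = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        if (base64) {
            return Base64.getEncoder().encodeToString(hash);
        }
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Folding both encodings into one class keeps the hashing logic in a single place; callers that previously relied on a separate Base64 variant would pass the encoding argument instead.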

Removals:

  • Removed ApplyQuantiles
  • Removed AliasBagFields, since the same result can now be achieved with a nested foreach

Fixes:

  • Quantile now outputs schemas consistent with StreamingQuantile
  • Necessary fastutil classes are now packaged in the datafu JAR, so the fastutil JAR is no longer needed as a dependency
  • Non-deterministic UDFs are now marked as such