v1.2.2 tsv-sample: New features and performance improvements

jondegenhardt released this 07 Oct 18:30
v1.2.2
4532610

This release adds new capabilities and performance improvements to tsv-sample. Documentation was also updated to improve clarity. Key changes:

  • New feature: Simple random sampling with replacement - All lines from the input sources are read in, then lines are repeatedly selected at random and written out. A line can be output multiple times. The process continues until the specified number of samples has been written. Invoke using the -r|--replace and -n|--num NUM options.
  • New feature: Random value printing - Generates a random value for every input line and prints it with the line. In the default case it shows the values used for Bernoulli sampling trials. It can also be used with 'distinct' sampling to show the sampling bucket a line is placed in, based on the key fields specified. This feature is invoked with the --gen-random-inorder option. A related feature, --print-random, was updated so that it is now supported by all applicable sampling modes.
  • Line order randomization performance improvements - One of the basic tsv-sample use cases is line order randomization. The case where all input lines are permuted was rewritten and is now quite a bit faster and uses less memory. This applies to both weighted and unweighted sampling. (The case where subsampling is done via the -n|--num option uses reservoir sampling and was already fast.)
  • Command line option change - The option for specifying the probability used for Bernoulli sampling was changed from -r|--rate to -p|--prob. This was done to create a more consistent set of option names for new features and features that may be added in the future.
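
To make the sampling modes named in the list above concrete, here is a minimal Python sketch of the three techniques: sampling with replacement, Bernoulli sampling with a per-line random value, and reservoir sampling. It is not tsv-sample's implementation. The function names are hypothetical, the `value < prob` selection rule is an assumption, and the reservoir routine is the textbook "Algorithm R", which may differ from the variant tsv-sample uses.

```python
import random


def sample_with_replacement(lines, num, rng):
    """Pick `num` lines uniformly at random; the same line may repeat.

    All lines must already be in memory, mirroring the "read everything,
    then select repeatedly" behavior described for -r|--replace.
    """
    if not lines:
        return []
    return [rng.choice(lines) for _ in range(num)]


def bernoulli_sample(lines, prob, rng):
    """Run one independent trial per line, preserving input order.

    Returns (random_value, line, selected) triples. Exposing the random
    value is loosely analogous to what random value printing shows.
    Assumption: a line is selected when its value is below `prob`.
    """
    results = []
    for line in lines:
        value = rng.random()  # uniform in [0.0, 1.0)
        results.append((value, line, value < prob))
    return results


def reservoir_sample(lines, num, rng):
    """Textbook Algorithm R: a uniform sample of `num` lines in one pass.

    Only `num` lines are held in memory at any time, which is why
    subsampling scales to inputs much larger than the sample.
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < num:
            reservoir.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < num:
                reservoir[j] = line
    return reservoir


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    lines = [f"line{i}" for i in range(1, 11)]

    print(sample_with_replacement(lines, 5, rng))
    for value, line, selected in bernoulli_sample(lines, 0.3, rng):
        print(f"{value:.6f}\t{line}\t{'keep' if selected else 'skip'}")
    print(reservoir_sample(lines, 3, rng))
```

Sampling with replacement and full line order randomization both need every input line in memory, while Bernoulli and reservoir sampling work in a single streaming pass. That difference is consistent with the note above that subsampling via -n|--num was already fast.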

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.2/tsv-utils-v1.2.2_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.2/tsv-utils-v1.2.2_osx-x86_64_ldc2.tar.gz | tar xz