v1.2.2 tsv-sample: New features and performance improvements
This release adds new capabilities and performance improvements to tsv-sample. Documentation was also updated to improve clarity. Key changes:
- New feature: Simple random sampling with replacement - All lines from the input sources are read in, then lines are repeatedly selected at random and written out. A line can be output multiple times. The process continues until the specified number of samples has been written. Invoke using the -r|--replace and -n|--num NUM options.
- New feature: Random value printing - A new feature was added for generating random values for all input lines. In the default case it shows the values used for Bernoulli sampling trials. It can also be used with 'distinct' sampling to show the sampling bucket a line is placed in based on the key fields specified. This feature is invoked with the --gen-random-inorder option. A related feature, --print-random, was updated so that it is now supported by all applicable sampling modes.
- Line order randomization performance improvements - One of the basic tsv-sample use cases is line order randomization. The case where all input lines are permuted was rewritten and is now quite a bit faster and uses less memory. This applies to both weighted and unweighted sampling. (The case where a subsample is selected via the -n|--num option uses reservoir sampling and was already fast.)
- Command line option change - The option for specifying the probability used for Bernoulli sampling was changed from -r|--rate to -p|--prob. This was done to create a more consistent set of option names for new features and features that may be added in the future.
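For readers unfamiliar with the two sampling styles mentioned above, the following Python sketch illustrates the general ideas. It is not tsv-sample's implementation (tsv-sample is written in D, and its internals are not described in these notes). The function names, the `seed` parameter, and the choice of the textbook "Algorithm R" for reservoir sampling are assumptions made for illustration only.

```python
import random
from typing import Iterable, Iterator, List, Optional


def sample_with_replacement(
    lines: Iterable[str], num: int, seed: Optional[int] = None
) -> Iterator[str]:
    """Read all lines, then yield `num` lines chosen uniformly at random.

    The same line may be yielded more than once.
    (Hypothetical helper, not part of tsv-sample.)
    """
    rng = random.Random(seed)
    pool = list(lines)  # all input is held in memory before sampling starts
    if not pool:
        return
    for _ in range(num):
        yield rng.choice(pool)


def reservoir_sample(
    lines: Iterable[str], num: int, seed: Optional[int] = None
) -> List[str]:
    """Keep a uniform random subset of at most `num` lines in a single pass.

    This is the textbook Algorithm R; tsv-sample may use a different
    reservoir sampling variant.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < num:
            # Fill the reservoir with the first `num` lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability num / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < num:
                reservoir[j] = line
    return reservoir
```

With replacement, the output can be longer than the input and can contain repeats, which is why the whole input must be read first. Reservoir sampling, by contrast, never holds more than `num` lines, so memory use stays bounded regardless of input size.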
To download and unpack the prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.2/tsv-utils-v1.2.2_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.2/tsv-utils-v1.2.2_osx-x86_64_ldc2.tar.gz | tar xz