Add --removeDuplicatesProb and --removeDuplicatesByShortId to filter-fasta.py #626

terrycojones · 2018-09-19T15:19:26Z

I would like two more methods for removing duplicates.

--removeDuplicatesProb will remove duplicates by sequence, but will store the MD5 sum of each sequence rather than the sequence itself. That makes it only probabilistic (two different sequences could, in principle, have the same MD5 sum), but it helps to avoid running out of memory.
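A minimal sketch of the MD5-based idea, assuming reads arrive as `(id, sequence)` string pairs. The function name `remove_duplicates_by_md5` and the input shape are hypothetical, chosen for illustration, and are not the existing filter-fasta.py code:

```python
from hashlib import md5


def remove_duplicates_by_md5(reads):
    """Yield (id, sequence) pairs, skipping any sequence whose MD5 digest
    has already been seen. (Hypothetical helper, for illustration only.)"""
    seen = set()
    for read_id, sequence in reads:
        # Keep only the fixed-size 16-byte digest, not the sequence itself.
        digest = md5(sequence.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield read_id, sequence


reads = [("id1", "ACGT"), ("id2", "ACGT"), ("id3", "GGCC")]
unique = list(remove_duplicates_by_md5(reads))
```

Each stored entry is a 16-byte digest regardless of read length, which is where the memory saving would come from. The cost is that a digest collision would wrongly drop a non-duplicate read.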

--removeDuplicatesByShortId de-duplicates based on the first part of the read id (up to the first space, if any). This is needed because if you combine output from (say) BLAST or DIAMOND with that from an aligner that produces SAM/BAM, the read ids won't match. That's because in a SAM/BAM file the reads have ids only up to the first space. So we need this option to be able to de-duplicate on combined reads from these different matchers.
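Roughly, the short-id approach could look like the following. This is a sketch under the same assumption of `(id, sequence)` string pairs, and `remove_duplicates_by_short_id` is a hypothetical name, not an existing function:

```python
def remove_duplicates_by_short_id(reads):
    """Yield (id, sequence) pairs, keeping only the first read seen for
    each short id. (Hypothetical helper, for illustration only.)"""
    seen = set()
    for read_id, sequence in reads:
        # The short id is everything up to the first space, if any.
        short_id = read_id.split(" ", 1)[0]
        if short_id not in seen:
            seen.add(short_id)
            yield read_id, sequence


combined = [
    ("read1 length=100", "ACGT"),  # full id, e.g. from BLAST or DIAMOND output
    ("read1", "ACGT"),             # truncated id, e.g. from a SAM/BAM file
    ("read2 length=50", "GGCC"),
]
unique_by_id = list(remove_duplicates_by_short_id(combined))
```

The first read with a given short id is kept with its full original id. Whether the full or the truncated id should be written out is a design choice this sketch does not settle.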

terrycojones added the enhancement label Sep 19, 2018
