Deduplication issue: failure of one round deduplication + accuracy level issue. [fastp v0.23.4] #580

Open
nightkid03 opened this issue Oct 3, 2024 · 0 comments


Hi there,

We tried to use fastp for de-duplication and found two issues. We look forward to your reply.

  1. One round of de-duplication is ineffective.
    Running level 1 de-duplication on the raw input reported "Duplication rate: 0.498141%". Running level 6 de-duplication on the same input reported "Duplication rate: 0.312492%". However, when we ran a second round of de-duplication on the output of the first run, fastp still found duplicates, although the reported duplication rate dropped below 0.1% (see the logs below).

  2. Accuracy level issue.
    We ran level 1 de-duplication first and then used its output as the input for de-duplication at different accuracy levels. The reported rates were: level 1 + level 1 -> 0.00744113%, level 1 + level 3 -> 0.088817%, level 1 + level 6 -> 0.0237203%. This does not make sense to us, because the rate does not change consistently with the accuracy level.

Read1 before filtering:
total reads: 15180846
total bases: 2277126900
Q20 bases: 2199749620(96.602%)
Q30 bases: 2075324182(91.1378%)

Read2 before filtering:
total reads: 15180846
total bases: 2277126900
Q20 bases: 2209710343(97.0394%)
Q30 bases: 2098006573(92.1339%)

Read1 after filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 after filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Filtering result:
reads passed filter: 30361692
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 623982
bases trimmed due to adapters: 2636132

Duplication rate: 0.498141%

Insert size peak (evaluated by paired-end reads): 226

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i /projects/f_lz332_1/DataBase/MetaGenomeData/Li_FrontMicro_2021_COVID/0.rawdata/ERR5445742_1.fastq.gz -I /projects/f_lz332_1/DataBase/MetaGenomeData/Li_FrontMicro_2021_COVID/0.rawdata/ERR5445742_2.fastq.gz -o ERR5445742_l1R1.fastq.gz -O ERR5445742_l1R2.fastq.gz --dedup --dup_calc_accuracy 1 --thread 16
fastp v0.23.4, time used: 80 seconds
Read1 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Read1 after filtering:
total reads: 15104100
total bases: 2264308814
Q20 bases: 2187263387(96.5974%)
Q30 bases: 2063424187(91.1282%)

Read2 after filtering:
total reads: 15104100
total bases: 2264308814
Q20 bases: 2197157837(97.0344%)
Q30 bases: 2085895844(92.1206%)

Filtering result:
reads passed filter: 30210448
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate: 0.00744113%

Insert size peak (evaluated by paired-end reads): 214

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i ERR5445742_l1R1.fastq.gz -I ERR5445742_l1R2.fastq.gz -o ERR5445742_l1l1R1.fastq.gz -O ERR5445742_l1l1R2.fastq.gz --dedup --dup_calc_accuracy 1 --thread 16
fastp v0.23.4, time used: 79 seconds
Read1 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Read1 after filtering:
total reads: 15091808
total bases: 2262463985
Q20 bases: 2185485043(96.5976%)
Q30 bases: 2061749882(91.1285%)

Read2 after filtering:
total reads: 15091808
total bases: 2262463985
Q20 bases: 2195369494(97.0345%)
Q30 bases: 2084200083(92.1208%)

Filtering result:
reads passed filter: 30210448
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate: 0.088817%

Insert size peak (evaluated by paired-end reads): 214

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i ERR5445742_l1R1.fastq.gz -I ERR5445742_l1R2.fastq.gz -o ERR5445742_l1l3R1.fastq.gz -O ERR5445742_l1l3R2.fastq.gz --dedup --dup_calc_accuracy 3 --thread 16
fastp v0.23.4, time used: 80 seconds
Read1 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2187424528(96.5975%)
Q30 bases: 2063578514(91.1284%)

Read2 before filtering:
total reads: 15105224
total bases: 2264474181
Q20 bases: 2197319205(97.0344%)
Q30 bases: 2086050677(92.1208%)

Read1 after filtering:
total reads: 15101641
total bases: 2263938008
Q20 bases: 2186907014(96.5975%)
Q30 bases: 2063090824(91.1284%)

Read2 after filtering:
total reads: 15101641
total bases: 2263938008
Q20 bases: 2196799311(97.0344%)
Q30 bases: 2085557436(92.1208%)

Filtering result:
reads passed filter: 30210448
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate: 0.0237203%

Insert size peak (evaluated by paired-end reads): 214

JSON report: fastp.json
HTML report: fastp.html

/projects/f_lz332_1/software/fastp -i ERR5445742_l1R1.fastq.gz -I ERR5445742_l1R2.fastq.gz -o ERR5445742_l1l6R1.fastq.gz -O ERR5445742_l1l6R2.fastq.gz --dedup --dup_calc_accuracy 6 --thread 16
fastp v0.23.4, time used: 85 seconds
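
The read counts in the logs above can be cross-checked against the reported duplication rates. The following is a minimal Python sketch, not part of fastp: the numbers are copied by hand from the four logs, and the names `runs` and `removed_fraction` are made up for illustration. It assumes that the difference between "total reads" before and after filtering is entirely due to `--dedup`, which is consistent with the logs reporting 0 reads failed for quality, N content, or length.

```python
# Read-pair counts copied from the fastp logs above (Read1 "total reads"
# before and after filtering), plus the duplication rate fastp reported.
runs = [
    # (label, pairs_before, pairs_after, reported_rate_percent)
    ("raw -> level 1", 15180846, 15105224, 0.498141),
    ("level 1 -> level 1", 15105224, 15104100, 0.00744113),
    ("level 1 -> level 3", 15105224, 15091808, 0.088817),
    ("level 1 -> level 6", 15105224, 15101641, 0.0237203),
]


def removed_fraction(before, after):
    """Return (pairs removed, pairs removed as a percentage of the input)."""
    removed = before - after
    return removed, 100.0 * removed / before


for label, before, after, reported in runs:
    removed, pct = removed_fraction(before, after)
    print(f"{label}: removed {removed} pairs ({pct:.6f}%), reported {reported}%")
```

With these numbers, the first pass removes 75622 read pairs, and the second pass removes a further 1124, 13416, and 3583 pairs at levels 1, 3, and 6 respectively. In all four runs the fraction of pairs removed matches the reported duplication rate to the printed precision.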
