Bisulfite sequencing

WHAT?

to determine which C in a DNA sequence is methylated
unmethylated C will be converted to U(Uracil), which will pair with A. C -> U:A
methylated C will remain intact.
- before: C C C C G G G C G G A A G C T G C G G G C G G
- after : T T T T G G G T G G A A G T T G C G G G C G G

Limitations
- 5-Hydroxymethylcytosine a new mammalian DNA modification 5-hydroxymethylcytosine converts to cytosine-5-methylsulfonate upon bisulfite treatment, which then reads as a C when sequenced. Therefore, bisulfite sequencing cannot discriminate between 5-methylcytosine and 5-hydroxymethylcytosine.
- Incomplete conversion Bisulfite sequencing relies on the conversion of every single unmethylated cytosine residue to uracil. If conversion is incomplete, the subsequent analysis will incorrectly interpret the unconverted unmethylated cytosines as methylated cytosines, resulting in false positive results for methylation.
- Degradation of DNA during bisulfite treatment A major challenge in bisulfite sequencing is the degradation of DNA that takes place concurrently with the conversion. The conditions necessary for complete conversion, such as long incubation times, elevated temperature, and high bisulfite concentration, can lead to the degradation of about 90% of the incubated DNA. Given that the starting amount of DNA is often limited, such extensive degradation can be problematic.

Mapping

GEM(7), BSMAP(488) VAliBS(1), BS-Seeker2(155) BISMARK(1363)

Bismark Bisulfite-converted duplexes for the strand-specific detection and quantification of rare mutations

2011 first published, latest doc is 2016.
This is a perl program using bowtie2

VAliBS: a visual aligner for bisulfite sequences

2017 This is a C GUI program
in order to obtain methylation information, the DNA was dissolved into two single strands, where the underlined letter C marked the methylated cytosine. After bisulfite treated, non-methylated cytosine (C) will convert into uracil (U). Then PCR makes U converted into thymine (T), at the same time a double strand is synthesized based on each single strand (as shown in step 2 of Fig. 1). Different from normal mapping, the bisulfite mapping allows T to match C and A to match G in the reference. By comparing un-bisulfite-treated to bisulfite-treated sequences, we can identify where cytosine is methylated.
The existing mapping tools for bisulfite-treated sequences can be categorized into two groups: wild-card aligners and three-letter aligners. The common character of wild-card aligners is to replace cytosines in the sequenced reads with wild-card Y nucleotides to allow bisulfite mismatches.
On the other hand, three-letter aligners, convert C to T in both sequenced reads and genome reference prior to performing the reads mapping by using modified conventional aligners. Three-letter strategy makes it easier to reuse non-bisulfite aligner as an internal module, with these non-bisulfite aligners improved, it is convenient to replace the internal module.

2013 BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data

Ref

Genestack tutorial:

Setting up a WGBS experiment
Quality control of bisulfite sequencing reads
Preprocessing of the raw reads: trimming adaptors, contaminants and low quality bases
Bisulfite sequencing mapping of the preprocessed reads onto a reference genome
Merging the mapped reads
Quality control of the mapped reads
Methylation ratio analysis
- Trim N end-repairing fill-in bases set to “3”. This option allows to trim 3 bases from the read end to remove the DNA overhangs created during read end-repair in library preparation. It is important because this end repair procedure may introduce artefacts if the repaired bases contain methylated cytosines.
- Report only unique mappings
- Discard duplicated reads option to remove duplicated reads which have identical sequences and could be the result of library preparation. These reads could be mapped to the same position and distort results of downstream analysis.
Exploring the genome methylation levels in Genome Browse

Optimized Workflow for Bisulfite Sequencing Data Analysis

Bismark toolkit which deploys three-letter mapping algorithm and uses Bowtie as an internal aligner.

Tools

Checking bsmap

paper
- In the mammalian genome, although ~19% of the bases are Cs and another 19% are Gs, only ~1.8% of dinucleotides are CpG dinucleotides.
- As a result, we expect the overall C content of bisulfite reads to be reduced by ~50%.
- We used the general premise that all the C positions in the genome, where the asymmetric C/T transition can occur, are already known and can be used to guide the mapping of bisulfite reads. BSMAP masks Ts in the bisulfite reads as Cs (i.e., reverse bisulfite conversion) only at C positions in the original reference while keeping all other Ts in the bisulfite reads unchanged. BSMAP then maps the masked BS read directly to the reference. The asymmetric C/T conversion is achieved through position-specific bitwise masking of the bisulfite reads;
- if REF is C, T/C in reads does not count as a mismatch.
- BSMAP can also report non-unique multiple hits with a user-defined maximum number of mismatches.

docker run -it --mount src=/srv,target=/srv,type=bind,readonly --name test_bsmap  sidb_base:1.4.0 bash
cd /tmp
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/bsmap/bsmap-2.74.tgz -O /tmp/bsmap.tar.gz

tar -xf /tmp/bsmap.tar.gz
cp /srv/data/genome/human/ensembl-95/human_primary_assembly.fa .

cd /tmp/bsmap-2.74 
sed -i '1 a #include<unistd.h>' param.cpp

make bsmap
ln -s /tmp/bsmap-2.74/bsmap /usr/local/bin/

docker cp /home/xinhe/Projects/Sargasso/tests/data/raw_reads/bisulfite_human_pe_R1.fastq.gz test_bsmap:/tmp/
docker cp /home/xinhe/Projects/Sargasso/tests/data/raw_reads/bisulfite_human_pe_R2.fastq.gz test_bsmap:/tmp/
docker cp /home/xinhe/Projects/Sargasso/tests/data/raw_reads/bisulfite_human_se.fastq.gz test_bsmap:/tmp/

bsmap -H -p 20 -a bisulfite_human_pe_R1_clean.fastq.gz -b bisulfite_human_pe_R2_clean.fastq.gz -d human_primary_assembly.fa -o pe.sam > pe.log 2>&1 &
bsmap -H -p 20 -a bisulfite_human_se_clean.fastq.gz -d human_primary_assembly.fa -o pe.sam > pe.log 2>&1

pair-end reads are not processed correctly. Got a <terminate called after throwing an instance of 'std::out_of_range'> error.
The read ids are truncated. Similar problem has been ask here and here
After removing the space from read_name, bsmap works!

Checking BSMAPz

code

docker run -it --mount src=/srv,target=/srv,type=bind,readonly --name test_bsmapz  sidb_base:1.4.0 bash
cd /tmp && git clone https://github.com/zyndagj/BSMAPz.git
cd BSMAPz && make bsmapz

~~works but all the read ids are removed in the output file.~~ see above

checking

paper
code

Checking bsmooth

paper
R code
BSmooth aligner suit
- The R code is post-aligner methylation measurements
- The aligner suit provides two modes, three-letter and wildcard. three-letter use bowtie2 while wildcard (following BSMAP, but does not support gapped alignment or paired-end alignment) use Merman.
- A disadvantage is that it does not straightforwardly handle 'colorspace' reads from the SOLiD sequencing instrument. For this reason, BSmooth implements two alignment algorithms...
- developed for SOLiD colorspace bisulfite reads

docker run -it --mount src=/srv,target=/srv,type=bind,readonly --name test_bsmooth sidb_base:1.4.0 bash
cd /tmp
git clone https://github.com/BenLangmead/bsmooth-align.git
export BSMOOTH_HOME='/tmp/bsmooth-align'
cp /srv/data/genome/human/ensembl-95/human_primary_assembly.fa .

perl ./bsmooth-align/bin/bswc_bowtie2_index.pl --name=human_index human_primary_assembly.fa

Checking bismark

We decided to try the bismark program. It is one of the most cited tools for bisulfite sequencing analysis, and it is still actively being maintained.

docker run -it --mount src=/srv,target=/srv,type=bind,readonly --name test_bismark sidb_base:1.4.0 bash
cd /tmp
wget https://www.bioinformatics.babraham.ac.uk/projects/bismark/bismark_v0.21.0.tar.gz -O /tmp/bismark.tar.gz
tar -xf /tmp/bismark.tar.gz -C /opt && rm /tmp/bismark.tar.gz
ln -s /opt/bismark_v0.21.0/bismark /usr/local/bin/
ln -s /opt/bismark_v0.21.0/bismark_genome_preparation /usr/local/bin/
ln -s /opt/bismark_v0.21.0/bismark_methylation_extractor /usr/local/bin/
cp /srv/data/genome/human/ensembl-95/human_primary_assembly.fa .
wget https://www.bioinformatics.babraham.ac.uk/projects/bismark/test_data.fastq -O /tmp/test_data.fastq


bismark_genome_preparation --bowtie2 --parallel 10 /tmp
# cp -R /srv/data/genome/human/ensembl-95/Bisulfite_Genome .
bismark --bowtie2 -p 4 /tmp test_data.fastq
bismark_methylation_extractor -s --comprehensive test_data_bismark_bt2.sam

bismark_genome_preparation --bowtie2 --parallel 10 BISULFITE_indices/

It turns out the Bismark only include unique mapping due to this reason. This seems to be a bit problematic because Sargasso uses multi-map during its separation decesion making.

We can try implement the mapping ourself.

bs_genome
bs_read
bowtie2
use read_id to map bs_read back to the origin read

bismark_genome_preparation script from bismark creates two bowtie2 genome indexed, C->T and G->A. We need to write scripts to:

convert original reads to bs_reads
convert mapped bs_read back to original reads.

questions:

bisulfite treatment reduces the sequence complexity, which makes the read more likely to map to multiple locations. How is this affect the separation?

todos:

check bowtie2 default behaviour RE multimap (-a -k)

Conclusion: bowtie2 is doing as documented

nohup bowtie2 -no-discordant --no-mixed --no-head --no-unal -1 /tmp/bisulfite_human_pe_R1_clean.fastq -2 /tmp/bisulfite_human_pe_R2_clean.fastq -x bt2index -p 10 -S bowtiw2_pe_default.sam &

#5000 reads; of these:
#  5000 (100.00%) were paired; of these:
#    4980 (99.60%) aligned concordantly 0 times
#    16 (0.32%) aligned concordantly exactly 1 time
#    4 (0.08%) aligned concordantly >1 times
#0.40% overall alignment rate

# when examing the sam output file, I noticed that the 20 (16+4) pairs only appear twich(once per read) in the sam file. 
# This indicates that bowtie2 is only reporting one alignment, not all. This is the correct behaviour 
# described in bowtie2 documentation. 
peek_file bowtiw2_pe_default.sam 9999 | cut -f 1 | sort| uniq -c
#      2 SRR7461526.1104-1104-length=150
#      2 SRR7461526.1189-1189-length=150
#      2 SRR7461526.1411-1411-length=150
#      2 SRR7461526.1630-1630-length=150
#      2 SRR7461526.1781-1781-length=150
#      2 SRR7461526.2169-2169-length=150
#      2 SRR7461526.2175-2175-length=150
#      2 SRR7461526.2180-2180-length=150
#      2 SRR7461526.2406-2406-length=150
#      2 SRR7461526.3033-3033-length=150
#      2 SRR7461526.3087-3087-length=150
#      2 SRR7461526.3160-3160-length=150
#      2 SRR7461526.3447-3447-length=150
#      2 SRR7461526.3862-3862-length=150
#      2 SRR7461526.4111-4111-length=150
#      2 SRR7461526.4211-4211-length=150
#      2 SRR7461526.4495-4495-length=150
#      2 SRR7461526.4997-4997-length=150
#      2 SRR7461526.786-786-length=150
#      2 SRR7461526.876-876-length=150


# I go on to run the alignment with -k.
nohup bowtie2 -k 10 --no-discordant --no-mixed --no-head --no-unal -1 /tmp/bisulfite_human_pe_R1_clean.fastq -2 /tmp/bisulfite_human_pe_R2_clean.fastq -x bt2index -p 10 -S bowtiw2_pe_k10.sam &

#5000 reads; of these:
#  5000 (100.00%) were paired; of these:
#    4976 (99.52%) aligned concordantly 0 times
#    16 (0.32%) aligned concordantly exactly 1 time
#    8 (0.16%) aligned concordantly >1 times
#0.48% overall alignment rate

# we found 24 aligned read pairs, which is 4 pairs more than without -k option!?
peek_file bowtiw2_pe_default.sam 9999 | cut -f 1 | sort| uniq -c
#      2 SRR7461526.1104-1104-length=150
#      2 SRR7461526.1189-1189-length=150
#      2 SRR7461526.1411-1411-length=150
#      2 SRR7461526.1630-1630-length=150
#      2 SRR7461526.1781-1781-length=150
#      2 SRR7461526.2169-2169-length=150
#     20 SRR7461526.2175-2175-length=150
#      2 SRR7461526.2180-2180-length=150
#      4 SRR7461526.2406-2406-length=150
#     20 SRR7461526.2821-2821-length=150
#     20 SRR7461526.2933-2933-length=150
#      4 SRR7461526.3033-3033-length=150
#      2 SRR7461526.3087-3087-length=150
#      2 SRR7461526.3160-3160-length=150
#      2 SRR7461526.3447-3447-length=150
#     20 SRR7461526.3502-3502-length=150
#      2 SRR7461526.3862-3862-length=150
#      2 SRR7461526.4111-4111-length=150
#     20 SRR7461526.4211-4211-length=150
#      2 SRR7461526.4495-4495-length=150
#     20 SRR7461526.4796-4796-length=150
#      2 SRR7461526.4997-4997-length=150
#      2 SRR7461526.786-786-length=150
#      2 SRR7461526.876-876-length=150





#      2 SRR7461526.1104-1104-length=150	#      2 SRR7461526.1104-1104-length=150    	
#      2 SRR7461526.1189-1189-length=150	#      2 SRR7461526.1189-1189-length=150   	
#      2 SRR7461526.1411-1411-length=150	#      2 SRR7461526.1411-1411-length=150   	
#      2 SRR7461526.1630-1630-length=150	#      2 SRR7461526.1630-1630-length=150   	
#      2 SRR7461526.1781-1781-length=150	#      2 SRR7461526.1781-1781-length=150   	
#      2 SRR7461526.2169-2169-length=150	#      2 SRR7461526.2169-2169-length=150   	
#      2 SRR7461526.2175-2175-length=150	#     20 SRR7461526.2175-2175-length=150   	
#      2 SRR7461526.2180-2180-length=150	#      2 SRR7461526.2180-2180-length=150   	
#      2 SRR7461526.2406-2406-length=150	#      4 SRR7461526.2406-2406-length=150   
						#     20 SRR7461526.2821-2821-length=150
						#     20 SRR7461526.2933-2933-length=150
#      2 SRR7461526.3033-3033-length=150	#      4 SRR7461526.3033-3033-length=150   	
#      2 SRR7461526.3087-3087-length=150	#      2 SRR7461526.3087-3087-length=150   	
#      2 SRR7461526.3160-3160-length=150	#      2 SRR7461526.3160-3160-length=150   	
#      2 SRR7461526.3447-3447-length=150	#      2 SRR7461526.3447-3447-length=150 
						#     20 SRR7461526.3502-3502-length=150
#      2 SRR7461526.3862-3862-length=150	#      2 SRR7461526.3862-3862-length=150   	
#      2 SRR7461526.4111-4111-length=150	#      2 SRR7461526.4111-4111-length=150   	
#      2 SRR7461526.4211-4211-length=150	#     20 SRR7461526.4211-4211-length=150   	
#      2 SRR7461526.4495-4495-length=150	#      2 SRR7461526.4495-4495-length=150  
						#     20 SRR7461526.4796-4796-length=150
#      2 SRR7461526.4997-4997-length=150	#      2 SRR7461526.4997-4997-length=150   	
#      2 SRR7461526.786-786-length=150  	#      2 SRR7461526.786-786-length=150 	
#      2 SRR7461526.876-876-length=150  	#      2 SRR7461526.876-876-length=150

test bismark multimap behaviour

Conclusion: There are two sources of ambigty,

the read has multi-map in any thread (each thread run's a bowtie2 instance against one converted genome.)
the read has multi/single map from multiple threads In both cases, the read will be output to ambig file.

Bowtie2 is always returning A best hit, which could be a random one from the best hits. bismark will take that hit. Thus, each hits end up in the ambig file is the best hit.

Base on this， we can work out a preliminary strategy for separation:

Rat		Mouse
ambig_out	nomal_out	ambig_out	nomal_out
Read_1		Read_1		ingore
Read_1			Read_1	keep only if mouse > rat
	Read_1		Read_1	keep the best of the two
	Read_1	Read_1		keep only if rat > mouse

In the nomal_out: (reads are chemical converted DNA reads)

NM is the edit distance, which is the number of different BP of the chemical-converted-read to the original-ref. This includes the number of mismatches and the number of C->T conversion.
XA/XB is the number of non-bisulfite mismatches and ignores potential insertions or deletions.

In the ambig_out:(reads are chemical converted+ software converted DNA reads)

XM is the number of mismatches in the alignment, which is the number of different BP of the chemical converted+ software converted DNA reads to the software-converted-ref.

ref: CcTT
Sref:TTTT

B-READ: CTTT
S-B-READ: TTTT

nomal: 
B-READ_vs_ref NM=1, XA=0

ambig:
S-B-READ_vs_Sref XM: 0

ref: CcTT
Sref:TTTT

B-READ: CTAT
S-B-READ: TTAT

nomal: 
B-READ_vs_ref NM=2, XA=1

ambig:
S-B-READ_vs_Sref XM: 1

bismark --ambig_bam --bowtie2 -p 4 /tmp test_data.fastq
#Single-end alignments will be performed
#=======================================
#
#Input file is in FastQ format
#Writing a C -> T converted version of the input file test_data.fastq to test_data.fastq_C_to_T.fastq
#
#Created C -> T converted version of the FastQ file test_data.fastq (10000 sequences in total)
#
#Input file is test_data.fastq_C_to_T.fastq (FastQ)
#
#Now running 2 instances of Bowtie 2 against the bisulfite genome of /tmp/ with the specified options: -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals
#
#Now starting the Bowtie 2 aligner for CTreadCTgenome (reading in sequences from test_data.fastq_C_to_T.fastq with options -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --norc)
#Using Bowtie 2 index: /tmp/Bisulfite_Genome/CT_conversion/BS_CT
#
#Found first alignment:	SRR020138.15024316_SALK_2029:7:100:1672:1790_length=86	4	*	0	0	*	*	0	0	TTTTTAGAAAGTTTATATGGAAATTAAGTTTTTTGTATATGTTTGTAAAG	BCB?).4'324&1;B###################################	YT:Z:UU
#Now starting the Bowtie 2 aligner for CTreadGAgenome (reading in sequences from test_data.fastq_C_to_T.fastq with options -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --nofw)
#Using Bowtie 2 index: /tmp/Bisulfite_Genome/GA_conversion/BS_GA
#
#Found first alignment:	SRR020138.15024316_SALK_2029:7:100:1672:1790_length=86	4	*	0	0	*	*	0	0	TTTTTAGAAAGTTTATATGGAAATTAAGTTTTTTGTATATGTTTGTAAAG	BCB?).4'324&1;B###################################	YT:Z:UU
#Ambiguous BAM output: test_data_bismark_bt2.ambig.bam
#
#>>> Writing bisulfite mapping results to test_data_bismark_bt2.bam <<<
#
#
#Reading in the sequence file test_data.fastq
#10000 reads; of these:
#  10000 (100.00%) were unpaired; of these:
#    6631 (66.31%) aligned 0 times
#    2497 (24.97%) aligned exactly 1 time
#    872 (8.72%) aligned >1 times
#33.69% overall alignment rate
#10000 reads; of these:
#  10000 (100.00%) were unpaired; of these:
#    6607 (66.07%) aligned 0 times
#    2487 (24.87%) aligned exactly 1 time
#    906 (9.06%) aligned >1 times
#33.93% overall alignment rate
#Processed 10000 sequences in total
#
#
#Successfully deleted the temporary file test_data.fastq_C_to_T.fastq
#
#Final Alignment report
#======================
#Sequences analysed in total:	10000
#Number of alignments with a unique best hit from the different alignments:	4912
#Mapping efficiency:	49.1%
#
#Sequences with no alignments under any condition:	4176
#Sequences did not map uniquely:	912
#Sequences which were discarded because genomic sequence could not be extracted:	0
#
#Number of sequences with unique best (first) alignment came from the bowtie output:
#CT/CT:	2462	((converted) top strand)
#CT/GA:	2450	((converted) bottom strand)
#GA/CT:	0	(complementary to (converted) top strand)
#GA/GA:	0	(complementary to (converted) bottom strand)
#
#Number of alignments to (merely theoretical) complementary strands being rejected in total:	0
#
#Final Cytosine Methylation Report
#=================================
#Total number of C's analysed:	40347
#
#Total methylated C's in CpG context:	1369
#Total methylated C's in CHG context:	21
#Total methylated C's in CHH context:	103
#Total methylated C's in Unknown context:	0
#
#Total unmethylated C's in CpG context:	684
#Total unmethylated C's in CHG context:	10053
#Total unmethylated C's in CHH context:	28117
#Total unmethylated C's in Unknown context:	1
#
#C methylated in CpG context:	66.7%
#C methylated in CHG context:	0.2%
#C methylated in CHH context:	0.4%
#C methylated in Unknown context (CN or CHN):	0.0%
#
#
#Bismark completed in 0d 0h 2m 55s
#
#====================
#Bismark run complete
#====================

In the output file, there are 4912 reads, which is what's being reported:

peek_file test_data_bismark_bt2.bam 99999999 | cut -f 1 | sort| uniq -c  | wc -l
4912

However, the number of ambiguous reads are different:

peek_file test_data_bismark_bt2.ambig.bam 999999 | cut -f 1 | sort | wc -l
831    
> #Sequences did not map uniquely:	912

cd /tmp/Bisulfite_Genome/CT_conversion 
nohup bowtie2 -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --norc --no-head --no-unal /tmp/test_data.fastq -x BS_CT -S bowtiw2_se_default.sam &

#10000 reads; of these:
#  10000 (100.00%) were unpaired; of these:
#    7101 (71.01%) aligned 0 times
#    2250 (22.50%) aligned exactly 1 time
#    649 (6.49%) aligned >1 times
#28.99% overall alignment rate

nohup bowtie2 -k 10 -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --norc --no-head --no-unal /tmp/test_data.fastq -x BS_CT -S bowtiw2_se_k10.sam &

#10000 reads; of these:
#  10000 (100.00%) were unpaired; of these:
#    7094 (70.94%) aligned 0 times
#    2254 (22.54%) aligned exactly 1 time
#    652 (6.52%) aligned >1 times
#29.06% overall alignment rate

cd /tmp/Bisulfite_Genome/GA_conversion
nohup bowtie2 -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --nofw --no-head --no-unal /tmp/test_data.fastq -x BS_GA -S bowtiw2_se_default.sam &

#10000 reads; of these:
#  10000 (100.00%) were unpaired; of these:
#    7121 (71.21%) aligned 0 times
#    2245 (22.45%) aligned exactly 1 time
#    634 (6.34%) aligned >1 times
#28.79% overall alignment rate

nohup bowtie2 -k 10 -q --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --norc --no-head --no-unal /tmp/test_data.fastq -x BS_GA -S bowtiw2_se_k10.sam &

find out why PE alignment rate is so low in bsmap , check --unmapped reads
go ahead with bismark in Sargasso for a test case
go ahead with bsmap in Sargasso for a test case

I run test on two human data, using human/mouse/rat.

It seems that the SE sample % drop was caused by mismatches.
I run a small test on a subset of the SE sample with best and conservative and this further confirm the big number of reject in human:

  - best
2019-05-20 16:24:28 INFO: Starting species separation.
2019-05-20 16:24:32 INFO: Species 1: wrote 1 filtered hits for 1 reads; 31 hits for 31 reads were rejected outright, and 18 hits for 18 reads were rejected as ambiguous.
2019-05-20 16:24:32 INFO: Species 2: wrote 26712 filtered hits for 26712 reads; 6150 hits for 6150 reads were rejected outright, and 20 hits for 20 reads were rejected as ambiguous.
2019-05-20 16:24:32 INFO: Species 3: wrote 1 filtered hits for 1 reads; 126 hits for 126 reads were rejected outright, and 21 hits for 21 reads were rejected as ambiguous.

  - con
2019-05-20 16:22:55 INFO: Starting species separation.
2019-05-20 16:22:56 INFO: Species 1: wrote 1 filtered hits for 1 reads; 40 hits for 40 reads were rejected outright, and 9 hits for 9 reads were rejected as ambiguous.
2019-05-20 16:22:56 INFO: Species 2: wrote 17977 filtered hits for 17977 reads; 14898 hits for 14898 reads were rejected outright, and 7 hits for 7 reads were rejected as ambiguous.
2019-05-20 16:22:56 INFO: Species 3: wrote 1 filtered hits for 1 reads; 139 hits for 139 reads were rejected outright, and 8 hits for 8 reads were rejected as ambiguous.

The (17) XA/XB-tag (non-bisulfite mismatches) for the SE human sample mapped read to human:

4275765 XA:Z:0
2272887 XA:Z:1
 986773 XA:Z:2
 398091 XA:Z:3
 213112 XA:Z:4
    536 XA:Z:5
     78 XA:Z:6
     30 XA:Z:7
     10 XA:Z:8

So in conservative, we will throw away half of the reads strict away! This might due to the SE reads in this particular case, is not good quality, because the PE reads looks fine. I will try with another SE reads to see if I can repeat this behaviour.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly