TRUST4 on long read #166
-
I tested TRUST4 on raw PacBio HiFi data. I would suggest running "fastq-extractor" and "annotator" directly, because for long reads there is no need for assembly. This thread could be useful: #63 . Although this simple pipeline cannot handle barcodes by itself, I think renaming the read IDs, e.g. to "@[barcode]_[number]", and adding the option "--barcode" to "annotator" could get around this. Then you can use the other downstream scripts like "trust-barcoderep.pl" and "trust-airr.pl" normally. As for the input to TRUST4, a BAM with corrected BC and UMI would be nice. As you mentioned, the BC and UMI positions can vary and also carry a lot of sequencing errors, so if some pipeline has already cleaned them up and put them in the BAM file, I would recommend using the BAM input. Hope this helps.
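A minimal sketch of this assembly-free route, assuming the per-read barcodes are already known (BARCODES.txt, hifi.fastq and the reference/output names below are placeholders, and the exact options should be double-checked with ./fastq-extractor --help and ./annotator --help):

```bash
# Hypothetical assembly-free pipeline for PacBio HiFi reads (a sketch, not
# the exact commands used in this thread).

# 1) Rename read IDs to "@[barcode]_[number]" so the barcode travels with
#    each read. BARCODES.txt is a hypothetical file with one barcode per
#    read, in the same order as hifi.fastq.
paste - - - - < hifi.fastq | paste BARCODES.txt - \
  | awk -F'\t' '{ printf "@%s_%d\n%s\n+\n%s\n", $1, NR, $3, $5 }' \
  > hifi.renamed.fastq

# 2) Extract candidate receptor reads; no assembly step for long reads.
./fastq-extractor -t 8 -f hg38_bcrtcr.fa -o hifi_toassemble -u hifi.renamed.fastq

# 3) Convert the extracted fastq to fasta in case the annotator input needs it.
awk 'NR % 4 == 1 { sub(/^@/, ">"); print } NR % 4 == 2 { print }' \
  hifi_toassemble.fq > hifi_toassemble.fa

# 4) Annotate the extracted reads directly; --barcode tells annotator to
#    parse the barcode back out of the read ID (verify the exact options
#    with ./annotator --help).
./annotator -f human_IMGT+C.fa -a hifi_toassemble.fa -o hifi --barcode

# 5) The usual downstream scripts (trust-barcoderep.pl, trust-airr.pl) can
#    then be run on the annotator outputs.
```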
-
Hi mourisl, here is the BAM file with corrected BC and UMI.
e.g. 2 reads: all the reads are tagged as above. Can run-trust4 handle my case? I would like to pass the CB & UB tags as arguments and start from the annotation step, with a pseudo command like:
Can you correct me? Maybe I also need BamExtractor.cpp. I have tried some test commands.
Error message if I just run a simple command:
Error message:
Why chr1? I don't have chr1 in my BAM.
Here is your example.bam. Is your demo BAM a special kind of BAM? I don't know whether I need these parameters in Annotator.cpp. Thank you for your help.
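For reference, a hypothetical sketch of the kind of call I have in mind (the reference file names are placeholders, and whether --barcode/--UMI accept BAM tag names like CB/UB should be confirmed with ./run-trust4 --help):

```bash
# Hypothetical run-trust4 call for a BAM whose corrected barcode and UMI are
# stored in the CB and UB tags (a sketch only; the reference file names are
# placeholders, and the options should be verified with ./run-trust4 --help).
./run-trust4 -b tagged_sorted.bam \
  -f hg38_bcrtcr.fa --ref human_IMGT+C.fa \
  --barcode CB --UMI UB \
  -t 8 -o sample
```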
-
Hi mourisl, TRUST4 worked fine with my test dataset (the first 4k reads), but I have some issues with my real dataset. Command:
I have 4 libraries, named A, B, C, D. Question 1: In my opinion, TRUST4 doesn't process all the reads, since the log says "Found 3137630 reads" but "Processed 1400000 reads".
e.g. the output for C, assembled_reads.fa:
Question 2: Line 1361 in 5cdff13. Now every 100,000 reads take over 3 hours; for example, library A has "Found 5,412,778 reads".
Is it possible to speed this up? I've added the arguments --repseq and -k 53; a sketch of the sort of command I mean is shown below. Question 3: Thank you,
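(For context, and not the exact command used for these libraries, something along these lines, with placeholder reference and input names:)

```bash
# Placeholder invocation showing where the two options mentioned above
# (--repseq and -k 53) were added; reference and input file names are
# placeholders, not the actual command from this run.
./run-trust4 -u A_longread.fastq \
  -f bcrtcr.fa --ref IMGT+C.fa \
  --repseq -k 53 \
  -t 8 -o A
```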
-
Q1: The candidate reads are extracted by fastq-extractor or bam-extractor. The process checks the alignment coordinates or maps the reads against the reference genes, e.g. GRCm38_bcrtcr.fa. The mapping is done by chaining concordant k-mer matches. The k-mer size is quite small at this stage, and long reads have a higher chance of getting a valid chain just by chance. As a result, many candidate reads from long-read data might not really be from the VDJ region. There is no longer a limit on the maximum number of characters in the recent updates (in the master branch, but not yet in the release).
Q2: --minHitLen is the minimal overlap size to add a read to a contig, i.e. the minimum hit length for considering a read aligned to a contig.
Q3: The VJ gene similarity was indeed an issue in v1.0.7 and was fixed later. I think the current master branch works well, and I will draft a new release (v1.0.9) later.
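If you want these fixes before v1.0.9 is out, building from the master branch should work; a minimal sketch, assuming a standard git and make setup:

```bash
# Build TRUST4 from the current master branch to pick up the fixes mentioned
# above, since v1.0.9 is not released yet.
git clone https://github.com/liulab-dfci/TRUST4.git
cd TRUST4
make
# The compiled binaries (fastq-extractor, bam-extractor, annotator, trust4)
# and the run-trust4 script are then available in this directory.
```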
-
Regarding "There is no longer a limit on the maximum number of characters in the recent updates (in the master branch, but not yet in the release)":
The log says "Giving kmers for 3500000 reads" and, as expected, "processed 3500000 reads (* are used for assembly)". I have checked these 2 cases by file size, and they are not the same. Could you tell me where the limitation on the maximum characters is, i.e. in which script and at which line, please? Chuang
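(Side note: a record count might be a more direct check than file size; a small sketch, where the file names are assumptions based on the prefixes used in this thread:)

```bash
# Count records instead of comparing file sizes. PREFIX is a placeholder for
# the run's output prefix, and PREFIX_toassemble.fq is an assumed name for
# the extractor output.
echo "candidate reads:        $(( $(wc -l < PREFIX_toassemble.fq) / 4 ))"
echo "reads used in assembly: $(grep -c '^>' PREFIX_assembled_reads.fa)"
```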
-
One of my runs has a segmentation fault similar to #29, #116 or #22.
I found one strange read in the last line of B_102_0023_longRead_annot.fa. Here is the context of this read in B_102_0023_longRead_assembled_reads.fa (10 lines before/after this read). Command:
Thank you very much for your help.
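(In case it helps to reproduce the check, the file names above can be inspected with standard tools like this:)

```bash
# Pull the ID of the last annotated read and show its surrounding context
# (10 lines before/after) in the assembled reads file; the file names are
# the ones from this run.
last_id=$(grep '^>' B_102_0023_longRead_annot.fa | tail -n 1 | awk '{print $1}' | tr -d '>')
grep -F -B 10 -A 10 -- "$last_id" B_102_0023_longRead_assembled_reads.fa
```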
-
Hi mourisl,
I need your personal point of view, please.
This discussion refers to my previous question #141. The custom 5' spatial BCR pipeline performs well, thanks to TRUST4, but it is not very sensitive; maybe the dataset doesn't have enough sequencing depth.
Now I have a new version of the dataset, sequenced on a PromethION (ONT).
The long-read library looks like this:
So from now on I have short and long reads from the same tissue/slide, and I have the true spatial barcodes (barcodeWhitelist).
This is my first time doing long-read analysis.
You mentioned PacBio HiFi data in #39 (comment).
Here, I want to know which parts of this can fit into TRUST4.
e.g. can you tell me which tools you used for PacBio HiFi?
Which format should the input to TRUST4 be? A BAM with corrected BC and UMI tags?
I only know that the UMI and BC come after the polyA, but their positions are dynamic and change from read to read, so I don't have any idea how to process this data.
Should I split each single-end read (cDNA + BC&UMI) into 2 files, a cDNA read file and a BC&UMI read file, like the short-read input to TRUST4? A rough sketch of what I mean is below.
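This sketch is unvalidated and assumes the layout cDNA + polyA + BC&UMI with at least 15 A's in the polyA, reads already in the forward orientation, and that run-trust4 can take a separate barcode fastq plus --barcodeWhitelist; all of these are assumptions, and the file names are placeholders.

```bash
# Rough sketch: split each single-end read at the LAST run of >= 15 A's
# (assumed to be the polyA between the cDNA and the BC+UMI tail) into a
# cDNA fastq and a BC+UMI fastq. Strand orientation and sequencing errors
# in the polyA are NOT handled here.
paste - - - - < ont.fastq | awk -F'\t' -v OFS='\n' '{
  id = $1; seq = $2; qual = $4
  cut = 0; off = 0; s = seq
  while (match(s, /AAAAAAAAAAAAAAAA*/)) {  # a run of >= 15 consecutive A
    cut = off + RSTART + RLENGTH - 1       # end of this run within seq
    off = cut
    s = substr(seq, off + 1)
  }
  if (cut > 0 && cut < length(seq)) {
    print id, substr(seq, 1, cut), "+", substr(qual, 1, cut) > "cdna.fq"
    print id, substr(seq, cut + 1), "+", substr(qual, cut + 1) > "bcumi.fq"
  }                                        # reads without a clear polyA are skipped
}'

# Then, hypothetically, feed the pair to run-trust4 the way barcoded
# short-read data is handled (reference names are placeholders; separating
# the BC from the UMI inside bcumi.fq is not shown, and the options should
# be checked with ./run-trust4 --help).
./run-trust4 -u cdna.fq --barcode bcumi.fq --barcodeWhitelist barcodeWhitelist.txt \
  -f bcrtcr.fa --ref IMGT+C.fa -t 8 -o ont_sample
```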
Thanks a lot,
Chuang