java.lang.StringIndexOutOfBoundsException when running feature extraction #5
Hello, to me this looks like the input peak file might have an error at this step. There should be peak files generated in the output directory that you can check. Essentially, there is one peak file that represents the original input peaks and another that converts those peaks into 600bp windows. Problems at this step would also cause problems when reading with the readTensors() function.
Would using an unedited bigBed file cause this?
Do you know what kind of error to look for? It seems like a normal bed file to me.
I actually didn't use a bigBed file; I used a bed narrowPeak file.
The feature extraction method should make two peak files, ending in "_peaks.txt" and "_original.txt". It uses the _peaks.txt file, since those peaks will all be 600bp in length. It should look something like this:
Each line has the reformatted peak positions fitted to the 600bp windows, then the peak index, followed by the original peak positions. When I look at the code for the error (ForwardFASTAReader.java:61):
This suggests to me that the start positions used to get the reference sequence are negative. This could mean that there is a negative start position in the file, or that a start position was so large that the integer overflowed to a negative number. That's my best guess at the moment of what's happening, anyway. What reference genome are you using?
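To illustrate the overflow scenario, here is a minimal Java sketch (not the project's actual code; the coordinate value is hypothetical) showing how casting an oversized long position to int silently wraps to a negative number, and how passing that to String.substring raises exactly the reported exception:

```java
public class OverflowDemo {
    // Casting a too-large long coordinate to int wraps to a negative value
    static int toIntCoordinate(long pos) {
        return (int) pos;  // overflows silently for pos > Integer.MAX_VALUE
    }

    public static void main(String[] args) {
        int start = toIntCoordinate(2_500_000_000L);  // hypothetical oversized start
        System.out.println(start);                    // negative after overflow
        try {
            // substring with a negative index throws the reported exception
            "ACGTACGT".substring(start);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("threw StringIndexOutOfBoundsException");
        }
    }
}
```

Note that hg38 coordinates top out well below Integer.MAX_VALUE (about 2.1 billion), so an overflow would require a malformed position rather than a legitimate one.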
I am using hg38.
There are no negative values, and the largest start position is around 200,000,000, so I am not sure.
Does your _peaks.txt file look similar to the example I posted above? The difference between the start and end positions should be 600 for each one.
This is a sample of the _peaks.txt file. I'm not sure why the first 5 rows are identical.
I put the \n in myself.
It's possible that the duplicates are causing the issue. It looks like the last three are identical too. Does your input file also contain these duplicates? I'd try removing the duplicates and running the feature extraction again.
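If it helps, dropping exact duplicate lines while preserving their original order can be sketched in Java like this (file I/O omitted; the sample coordinates here are illustrative only):

```java
import java.util.*;

public class DedupPeaks {
    // Remove exact duplicate lines while preserving first-seen order
    static List<String> dedup(List<String> lines) {
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> peaks = Arrays.asList(
            "chr1\t100038019\t100038618",
            "chr1\t100038019\t100038618",   // exact duplicate
            "chr1\t100075626\t100076225");
        System.out.println(dedup(peaks).size()); // prints 2
    }
}
```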
There were a significant number of duplicates, but the same issue arose even after removing them.
```
chr1  100038019  100038618  0  100037635  100039004
chr1  100075626  100076225  1  100075800  100076052
chr1  100132683  100133282  2  100132510  100133456
chr1  1001673    1002272    3  1001831    1002115
chr1  100188582  100189181  4  100188751  100189014
chr1  100210662  100211261  5  100210857  100211068
chr1  100210934  100211533  6  100211134  100211335
chr1  100213069  100213668  7  100213216  100213522
chr1  100217697  100218296  8  100217852  100218143
```
I am not sure how it would be possible for the peak start to be below 0; the range of the peak start positions is around 10,000-200,000,000. Is there any type of normalization applied to the peaks?
The only functions applied are 1) identifying the center position and 2) extending by ~300bp upstream/downstream. That seems to be working fine, though. The feature extractor sorts these peaks by chromosome and then by start position to incorporate the read and reference sequence information. The reference FASTA files should be separated by chromosome (i.e., chr1.fa, chr2.fa, etc.). I used the chromosome FASTA files available from UCSC. This is a strange error, and after reviewing the code several times, I still don't know why it would be doing this. Does it still fail if you take just a subset of the peaks? All on one chromosome, peaks with positions < 100,000bp, etc.? Hopefully this will narrow down what the problem is.
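The centering-and-extending step described above can be sketched as follows. This is a hypothetical reconstruction, not the extractor's actual code; whether the window end is start+600 or start+599 depends on coordinate inclusivity, which the thread doesn't settle:

```java
public class PeakWindow {
    // Sketch of the described windowing: take the peak center and
    // extend ~300bp in each direction, yielding a 600bp window.
    static long[] window600(long start, long end) {
        long center = (start + end) / 2;
        long winStart = Math.max(0, center - 300);  // clamp at chromosome start
        return new long[] { winStart, winStart + 600 };
    }

    public static void main(String[] args) {
        long[] w = window600(100037635, 100039004);
        System.out.println(w[0] + "\t" + w[1]);
    }
}
```

On the first peak in the sample posted earlier (original positions 100037635-100039004), this yields a window starting at 100038019, which matches that _peaks.txt row.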
I tried running it on a smaller subset (just the first few thousand in the dataset) and it worked fine... |
Were there any preprocessing steps your team needed to take when using data from the ENCODE database? That is where I sourced all of my data from. |
Which files from ENCODE did you use exactly? |
I used the ATAC-seq BAM files available from ENCODE. The full list of the ENCODE data used can be found in the publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009670 For processing, reads were aligned using BWA-MEM, shifted, and duplicates removed. Peaks were then called using MACS2 with the BAMPE option.
So would using the bed narrowPeak files available with the dataset cause problems? |
Also, I am not sure what you mean by "shifted"; is that an option in BWA-MEM?
The narrowPeak files should not cause any problems. I wish the bug were clearer, but it's difficult to diagnose. The shifting essentially accounts for the 9bp overhang from transposition of the fragment. The assumption is that the Tn5 transposase sits at the center of the cut site, so forward fragment cut sites are shifted by +5bp and reverse fragment cut sites by -4bp.
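The shift convention described above can be sketched as (a hypothetical illustration of the stated +5/-4 rule, not the pipeline's actual code):

```java
public class Tn5Shift {
    // Standard ATAC-seq Tn5 adjustment as described: forward-strand cut
    // sites move +5bp, reverse-strand cut sites move -4bp, centering the
    // site within the 9bp transposition overhang.
    static long shiftCutSite(long pos, boolean forwardStrand) {
        return forwardStrand ? pos + 5 : pos - 4;
    }

    public static void main(String[] args) {
        System.out.println(shiftCutSite(1000, true));   // prints 1005
        System.out.println(shiftCutSite(2000, false));  // prints 1996
    }
}
```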
Hi, I am trying to run feature extraction on bulk-ATAC-seq data, but I am met with this error:
The extraction continues to run, but I am unable to use tl.readTensors() on the resulting data. However, this may have been an error in my own code that I have since fixed.
Is this error a problem?