
java.lang.StringIndexOutOfBoundsException when running feature extraction #5

Open
dylan6thomas opened this issue Jul 13, 2022 · 22 comments

Comments

@dylan6thomas

Hi, I am trying to run feature extraction on bulk-ATAC-seq data, but I am met with this error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -23772
        at java.base/java.lang.String.substring(String.java:1840)
        at org.jax.FASTAReader.ForwardFASTAReader.setRegion(ForwardFASTAReader.java:61)
        at org.jax.bamextractor.ReadExtractor.extractFeatures(ReadExtractor.java:245)
        at org.jax.bamextractor.ReadExtractor.main(ReadExtractor.java:89)
        ... 5 more

The extraction continues to run, but I am unable to use tl.readTensors() on the resulting data. However, this may have been an error in my code that I have since fixed.

Is this error a problem?

@ajt986
Contributor

ajt986 commented Jul 13, 2022

Hello, to me this looks like the input peak file might have an error at this step. There should be peak files generated in the output directory that you can check. Essentially, one peak file represents the original input peaks and another uses those peaks to make 600bp windows. Problems at this step would also cause problems when reading with the readTensors() function.

@dylan6thomas
Author

Would using an unedited bigBed file cause this?
I edited the file to only include the peaks and am running it again currently.

@dylan6thomas
Author

Do you know what kind of error to look for? It seems like a normal bed file to me.

@dylan6thomas
Author

> Would using an unedited bigBed file cause this?
> I edited the file to only include the peaks and am running it again currently.

I actually didn't use a bigBed file, I used a bed narrowPeak file

@ajt986
Contributor

ajt986 commented Jul 13, 2022

The feature extraction method should make two peak files, ending in "_peaks.txt" and "_original.txt". It uses the _peaks.txt file, since those peaks will all be 600bp in length. It should look something like this:

chr1	29062	29661	1	29305	29420
chr1	237452	238051	2	237696	237809
chr1	521268	521867	3	521531	521605
chr1	662365	662964	4	662623	662708

Each row has the peak positions reformatted to fit the 600bp windows, then the peak index, followed by the original peak positions.

When I look at the code for the error:

ForwardFASTAReader.java:61

		if (start > _start){ 
			_fasta = _fasta.substring(start-_start-1); // line 61
			_start = start-1;
		}

This suggests to me that these start positions used to get the reference sequence are negative. This could mean that there is a negative start position in the file or somehow the start position was so large that the integer overflowed to a negative number. This is my best guess at the moment of what's happening anyway. What reference genome are you using?
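For what it's worth, substring can report a negative number in that exact exception message in two different ways. Here's a minimal standalone demo (hypothetical values, not the extractor's code):

```java
public class SubstringDemo {
    public static void main(String[] args) {
        String fasta = "ACGTACGT"; // stand-in for a loaded chromosome sequence

        // Case 1: the computed begin index is negative
        // (e.g. a negative start position, or integer overflow of a huge coordinate).
        try {
            fasta.substring(-3);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }

        // Case 2: the begin index is past the end of the loaded sequence
        // (e.g. a peak coordinate beyond the end of the chromosome FASTA).
        // On the JDK versions that print this message format, the reported
        // value is length() - beginIndex, which is also negative.
        try {
            fasta.substring(20);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So besides a genuinely negative start, a start position past the end of the loaded chromosome sequence could also produce a negative number in that message.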

@dylan6thomas
Author

I am using Hg38

@dylan6thomas
Author

There are no negative values, and the largest coordinate is around 200,000,000, so I am not sure.

@ajt986
Contributor

ajt986 commented Jul 13, 2022

Does your _peaks.txt file look similar to the example I posted above? The difference between the start and end positions should be 600 for each one.

@dylan6thomas
Author

This is a sample of the _peaks.txt file
chr1	100038019	100038618	0	100037635	100039004
chr1	100038019	100038618	1	100037635	100039004
chr1	100038019	100038618	2	100037635	100039004
chr1	100038019	100038618	3	100037635	100039004
chr1	100038019	100038618	4	100037635	100039004
chr1	100075626	100076225	5	100075800	100076052
chr1	100132683	100133282	6	100132510	100133456
chr1	100132683	100133282	7	100132510	100133456
chr1	100132683	100133282	8	100132510	100133456
chr1	1001673	1002272	9	1001831	1002115

I'm not sure why the first 5 rows are identical.

@dylan6thomas
Author

dylan6thomas commented Jul 13, 2022

chr1	100038019	100038618	0	100037635	100039004 \n
chr1	100038019	100038618	1	100037635	100039004 \n
chr1	100038019	100038618	2	100037635	100039004 \n
chr1	100038019	100038618	3	100037635	100039004 \n
chr1	100038019	100038618	4	100037635	100039004 \n
chr1	100075626	100076225	5	100075800	100076052 \n
chr1	100132683	100133282	6	100132510	100133456 \n
chr1	100132683	100133282	7	100132510	100133456 \n
chr1	100132683	100133282	8	100132510	100133456 \n
chr1	1001673	1002272	9	1001831	1002115

I put the \n in myself

@ajt986
Contributor

ajt986 commented Jul 13, 2022

It's possible that the duplicates are causing the issue. It looks like the last three are identical too. Does your input file also contain these duplicates? I'd try removing the duplicates and running the feature extraction again.
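A quick sanity check along those lines might look like this. It's a hypothetical sketch: the six-column layout and the 599 interval width (a 600bp inclusive window) are inferred from the sample rows above, not taken from the extractor's code:

```java
import java.util.HashSet;
import java.util.Set;

public class PeakSanityCheck {
    // Flags a _peaks.txt row that could break downstream steps, or returns null if it looks fine.
    // Assumed column layout: chrom, winStart, winEnd, index, origStart, origEnd.
    static String check(String[] f) {
        long start = Long.parseLong(f[1]);
        long end = Long.parseLong(f[2]);
        if (start < 0) return "negative start";
        if (end - start != 599) return "unexpected width " + (end - start);
        return null;
    }

    public static void main(String[] args) {
        String[] rows = {
            "chr1\t100038019\t100038618\t0\t100037635\t100039004",
            "chr1\t100038019\t100038618\t1\t100037635\t100039004", // same interval again
            "chr1\t100132683\t100133282\t2\t100132510\t100133456"
        };
        Set<String> seen = new HashSet<>();
        for (String row : rows) {
            String[] f = row.split("\\s+");
            String key = f[0] + ":" + f[1] + "-" + f[2];
            boolean dup = !seen.add(key);
            String problem = check(f);
            if (problem == null && dup) problem = "duplicate interval";
            System.out.println(key + " -> " + (problem == null ? "ok" : problem));
            // prints:
            // chr1:100038019-100038618 -> ok
            // chr1:100038019-100038618 -> duplicate interval
            // chr1:100132683-100133282 -> ok
        }
    }
}
```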

@dylan6thomas
Author

There were a significant amount of duplicates, but the same issue arose even after removing them

@dylan6thomas
Author

dylan6thomas commented Jul 13, 2022

chr1	100038019	100038618	0	100037635	100039004
chr1	100075626	100076225	1	100075800	100076052
chr1	100132683	100133282	2	100132510	100133456
chr1	1001673	1002272	3	1001831	1002115
chr1	100188582	100189181	4	100188751	100189014
chr1	100210662	100211261	5	100210857	100211068
chr1	100210934	100211533	6	100211134	100211335
chr1	100213069	100213668	7	100213216	100213522
chr1	100217697	100218296	8	100217852	100218143

@dylan6thomas
Author

I am not sure how it would be possible for the peak start to be below 0; the peak start positions range from around 10,000 to 200,000,000.

Is there any type of normalization on the peaks?

@ajt986
Contributor

ajt986 commented Jul 13, 2022

The only functions applied are 1) identifying the center position and 2) extending by ~300bp upstream/downstream. That seems to be working fine though.
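Judging from the example rows earlier in the thread, that transform appears to take the midpoint of the original peak and extend it into a 600bp window. A reconstruction from the numbers (not the actual extractor code):

```java
public class PeakWindow {
    // Midpoint of the original peak, extended to a 600bp (inclusive) window.
    // The -300/+299 offsets are inferred from the sample _peaks.txt rows.
    static long[] window(long origStart, long origEnd) {
        long center = (origStart + origEnd) / 2;
        return new long[]{center - 300, center + 299};
    }

    public static void main(String[] args) {
        long[] w = window(29305, 29420); // original peak from the example rows
        System.out.println(w[0] + "\t" + w[1]); // prints 29062	29661
    }
}
```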

The feature extractor sorts these peaks by chromosome and then start position to incorporate the read and reference sequence information. The reference FASTA files should be separated for each chromosome (i.e., chr1.fa, chr2.fa etc). I used the chromosome FASTA files available from UCSC.

This is a strange error, and after reviewing the code several times I still don't know why it would be doing this. Does it still fail if you take just a subset of the peaks? All one chromosome, peaks with locations < 100,000bp, etc.? Hopefully this will narrow down what the problem is.

@dylan6thomas
Author

I tried running it on a smaller subset (just the first few thousand in the dataset) and it worked fine...

@dylan6thomas
Author

Were there any preprocessing steps your team needed to take when using data from the ENCODE database? That is where I sourced all of my data from.

@dylan6thomas
Author

Which files from ENCODE did you use exactly?

@ajt986
Contributor

ajt986 commented Jul 13, 2022

I used the ATAC-seq BAM files available from ENCODE. The full list of the ENCODE data used can be found in the publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009670

For processing, reads were aligned using BWA mem, shifted, and duplicates removed. Peaks were then called using MACS2 with the BAMPE option.

@dylan6thomas
Author

So would using the bed narrowPeak files available with the dataset cause problems?

@dylan6thomas
Author

Also, I am not sure what you mean by shifted; is that an option in BWA mem?

@ajt986
Contributor

ajt986 commented Jul 14, 2022

The narrowPeak files should not cause problems. I wish the bug were clearer, but it's difficult to diagnose.

The shifting essentially just accounts for the 9bp overhang from transposition of the fragment. The assumption is that the Tn5 transposase sits at the center of the cut site, so shifting just means moving forward-strand fragment cut sites by +5bp and reverse-strand fragment cut sites by -4bp.
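In code, that rule amounts to something like this (a sketch of the convention described above, not the actual pipeline code):

```java
public class Tn5Shift {
    // Move a cut site to the estimated transposition center, per the rule above:
    // +5bp on the forward strand, -4bp on the reverse strand.
    static long shift(long cutSite, boolean forwardStrand) {
        return forwardStrand ? cutSite + 5 : cutSite - 4;
    }

    public static void main(String[] args) {
        System.out.println(shift(1000, true));  // prints 1005
        System.out.println(shift(2000, false)); // prints 1996
    }
}
```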
