
java.lang.StringIndexOutOfBoundsException when running feature extraction #5

Open
dylan6thomas opened this issue Jul 13, 2022 · 22 comments

Comments

@dylan6thomas

Hi, I am trying to run feature extraction on bulk-ATAC-seq data, but I am met with this error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -23772
        at java.base/java.lang.String.substring(String.java:1840)
        at org.jax.FASTAReader.ForwardFASTAReader.setRegion(ForwardFASTAReader.java:61)
        at org.jax.bamextractor.ReadExtractor.extractFeatures(ReadExtractor.java:245)
        at org.jax.bamextractor.ReadExtractor.main(ReadExtractor.java:89)
        ... 5 more

The extraction continues to run, but I am unable to use tl.readTensors() on the resulting data. However, this may have been an error in my code that I have since fixed.

Is this error a problem?

@ajt986
Contributor

ajt986 commented Jul 13, 2022

Hello, to me this looks like the input peak file might have an error at this step. There should be peak files generated in the output directory that you can check. Essentially, one peak file represents the original input peaks and another uses those peaks to make 600bp windows. Problems at this step would also cause problems when reading with the readTensors() function.

@dylan6thomas
Author

Would using an unedited bigBed file cause this?
I edited the file to only include the peaks and am running it again currently.

@dylan6thomas
Author

Do you know what kind of error to look for? It seems like a normal bed file to me.

@dylan6thomas
Author

> Would using an unedited bigBed file cause this?
> I edited the file to only include the peaks and am running it again currently.

I actually didn't use a bigBed file, I used a bed narrowPeak file

@ajt986
Contributor

ajt986 commented Jul 13, 2022

The feature extraction method should make two peak files, ending in "_peaks.txt" and "_original.txt". It uses the _peaks.txt file, since those peaks will all be 600bp in length. It should look something like this:

chr1	29062	29661	1	29305	29420
chr1	237452	238051	2	237696	237809
chr1	521268	521867	3	521531	521605
chr1	662365	662964	4	662623	662708

Each row has the peak positions reformatted to fit the 600bp windows, then the peak index, followed by the original peak positions.

When I look at the code for the error:

ForwardFASTAReader.java:61

		if (start > _start){ 
			_fasta = _fasta.substring(start-_start-1); // line 61
			_start = start-1;
		}

This suggests to me that these start positions used to get the reference sequence are negative. This could mean that there is a negative start position in the file or somehow the start position was so large that the integer overflowed to a negative number. This is my best guess at the moment of what's happening anyway. What reference genome are you using?
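For what it's worth, substring can report a negative number in that exact exception message in two different ways. Here's a minimal standalone demo (hypothetical values, not the extractor's code):

```java
public class SubstringDemo {
    public static void main(String[] args) {
        String fasta = "ACGTACGT"; // stand-in for a loaded chromosome sequence

        // Case 1: the computed begin index is negative
        // (e.g. a negative start position, or integer overflow of a huge coordinate).
        try {
            fasta.substring(-3);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }

        // Case 2: the begin index is past the end of the loaded sequence
        // (e.g. a peak coordinate beyond the end of the chromosome FASTA).
        // On the JDK versions that print this message format, the reported
        // value is length() - beginIndex, which is also negative.
        try {
            fasta.substring(20);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So besides a genuinely negative start, a start position past the end of the loaded chromosome sequence could also produce a negative number in that message.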

@dylan6thomas
Author

I am using Hg38

@dylan6thomas
Author

There are no negative values, and the largest coordinate is around 200,000,000, so I am not sure.

@ajt986
Contributor

ajt986 commented Jul 13, 2022

Does your _peaks.txt file look similar to the example I posted above? The difference between the start and end positions should be 600 for each one.

@dylan6thomas
Author

This is a sample of the _peaks.txt file
chr1	100038019	100038618	0	100037635	100039004
chr1	100038019	100038618	1	100037635	100039004
chr1	100038019	100038618	2	100037635	100039004
chr1	100038019	100038618	3	100037635	100039004
chr1	100038019	100038618	4	100037635	100039004
chr1	100075626	100076225	5	100075800	100076052
chr1	100132683	100133282	6	100132510	100133456
chr1	100132683	100133282	7	100132510	100133456
chr1	100132683	100133282	8	100132510	100133456
chr1	1001673	1002272	9	1001831	1002115

I'm not sure why the first 5 rows are identical.

@dylan6thomas
Author

dylan6thomas commented Jul 13, 2022

chr1	100038019	100038618	0	100037635	100039004 \n
chr1	100038019	100038618	1	100037635	100039004 \n
chr1	100038019	100038618	2	100037635	100039004 \n
chr1	100038019	100038618	3	100037635	100039004 \n
chr1	100038019	100038618	4	100037635	100039004 \n
chr1	100075626	100076225	5	100075800	100076052 \n
chr1	100132683	100133282	6	100132510	100133456 \n
chr1	100132683	100133282	7	100132510	100133456 \n
chr1	100132683	100133282	8	100132510	100133456 \n
chr1	1001673	1002272	9	1001831	1002115

I put the \n in myself

@ajt986
Contributor

ajt986 commented Jul 13, 2022

It's possible that the duplicates are causing the issue. It looks like the last three are identical too. Does your input file also contain these duplicates? I'd try removing the duplicates and running the feature extraction again.
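A quick sanity check along those lines might look like this. It's a hypothetical sketch: the six-column layout and the 599 interval width (a 600bp inclusive window) are inferred from the sample rows above, not taken from the extractor's code:

```java
import java.util.HashSet;
import java.util.Set;

public class PeakSanityCheck {
    // Flags a _peaks.txt row that could break downstream steps, or returns null if it looks fine.
    // Assumed column layout: chrom, winStart, winEnd, index, origStart, origEnd.
    static String check(String[] f) {
        long start = Long.parseLong(f[1]);
        long end = Long.parseLong(f[2]);
        if (start < 0) return "negative start";
        if (end - start != 599) return "unexpected width " + (end - start);
        return null;
    }

    public static void main(String[] args) {
        String[] rows = {
            "chr1\t100038019\t100038618\t0\t100037635\t100039004",
            "chr1\t100038019\t100038618\t1\t100037635\t100039004", // same interval again
            "chr1\t100132683\t100133282\t2\t100132510\t100133456"
        };
        Set<String> seen = new HashSet<>();
        for (String row : rows) {
            String[] f = row.split("\\s+");
            String key = f[0] + ":" + f[1] + "-" + f[2];
            boolean dup = !seen.add(key);
            String problem = check(f);
            if (problem == null && dup) problem = "duplicate interval";
            System.out.println(key + " -> " + (problem == null ? "ok" : problem));
            // prints:
            // chr1:100038019-100038618 -> ok
            // chr1:100038019-100038618 -> duplicate interval
            // chr1:100132683-100133282 -> ok
        }
    }
}
```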

@dylan6thomas
Author

There were a significant amount of duplicates, but the same issue arose even after removing them

@dylan6thomas
Author

dylan6thomas commented Jul 13, 2022

chr1	100038019	100038618	0	100037635	100039004
chr1	100075626	100076225	1	100075800	100076052
chr1	100132683	100133282	2	100132510	100133456
chr1	1001673	1002272	3	1001831	1002115
chr1	100188582	100189181	4	100188751	100189014
chr1	100210662	100211261	5	100210857	100211068
chr1	100210934	100211533	6	100211134	100211335
chr1	100213069	100213668	7	100213216	100213522
chr1	100217697	100218296	8	100217852	100218143

@dylan6thomas
Author

I am not sure how it would be possible for the peak start to be below 0; the peak start positions range from around 10,000 to 200,000,000.

Is there any type of normalization on the peaks?

@ajt986
Contributor

ajt986 commented Jul 13, 2022

The only functions applied are 1) identifying the center position and 2) extending by ~300bp upstream/downstream. That seems to be working fine though.
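Judging from the example rows earlier in the thread, that transform appears to take the midpoint of the original peak and extend it into a 600bp window. A reconstruction from the numbers (not the actual extractor code):

```java
public class PeakWindow {
    // Midpoint of the original peak, extended to a 600bp (inclusive) window.
    // The -300/+299 offsets are inferred from the sample _peaks.txt rows.
    static long[] window(long origStart, long origEnd) {
        long center = (origStart + origEnd) / 2;
        return new long[]{center - 300, center + 299};
    }

    public static void main(String[] args) {
        long[] w = window(29305, 29420); // original peak from the example rows
        System.out.println(w[0] + "\t" + w[1]); // prints 29062	29661
    }
}
```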

The feature extractor sorts these peaks by chromosome and then start position to incorporate the read and reference sequence information. The reference FASTA files should be separated for each chromosome (i.e., chr1.fa, chr2.fa etc). I used the chromosome FASTA files available from UCSC.

This is a strange error, and after reviewing the code several times I still don't know why it would be doing this. Does it still fail if you take just a subset of the peaks? All one chromosome, peaks with locations < 100,000bp, etc.? Hopefully this will narrow down what the problem is.

@dylan6thomas
Author

I tried running it on a smaller subset (just the first few thousand in the dataset) and it worked fine...

@dylan6thomas
Author

Were there any preprocessing steps your team needed to take when using data from the ENCODE database? That is where I sourced all of my data from.

@dylan6thomas
Author

Which files from ENCODE did you use exactly?

@ajt986
Contributor

ajt986 commented Jul 13, 2022

I used the ATAC-seq BAM files available from ENCODE. The full list of the ENCODE data used can be found in the publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009670

For processing, reads were aligned using BWA mem, shifted, and duplicates removed. Peaks were then called using MACS2 with the BAMPE option.

@dylan6thomas
Author

So would using the bed narrowPeak files available with the dataset cause problems?

@dylan6thomas
Author

Also, I am not sure what you mean by shifted; is that an option in BWA mem?

@ajt986
Contributor

ajt986 commented Jul 14, 2022

The narrowPeak files should not cause problems. I wish the bug were clearer, but it's difficult to diagnose.

The shifting essentially just accounts for the 9bp overhang from transposition of the fragment. The assumption is that the Tn5 transposase sits at the center of the cut site, so shifting just means moving forward-strand fragment cut sites by +5bp and reverse-strand fragment cut sites by -4bp.
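In code, that rule amounts to something like this (a sketch of the convention described above, not the actual pipeline code):

```java
public class Tn5Shift {
    // Move a cut site to the estimated transposition center, per the rule above:
    // +5bp on the forward strand, -4bp on the reverse strand.
    static long shift(long cutSite, boolean forwardStrand) {
        return forwardStrand ? cutSite + 5 : cutSite - 4;
    }

    public static void main(String[] args) {
        System.out.println(shift(1000, true));  // prints 1005
        System.out.println(shift(2000, false)); // prints 1996
    }
}
```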
