Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to merge mutiple GFF3 files? #20

Open
bioinformaticspcj opened this issue Nov 27, 2020 · 11 comments · May be fixed by #23
Open

How to merge mutiple GFF3 files? #20

bioinformaticspcj opened this issue Nov 27, 2020 · 11 comments · May be fixed by #23

Comments

@bioinformaticspcj
Copy link

Dear the authors:
Thanks for devoloping such an useful program. Now I have a question that if I have mutiple GFF3 files to train bssm, how can I merge these files for the gthbssmtrain program? I do not know how to sort the file as the program put an error:

"error: the file testPB.gff is not sorted (example: line 9 and 14)"

Thanks a gain and looking forward to your valuable help.

@satta
Copy link
Member

satta commented Nov 27, 2020

Without being too familiar with the BSSM input requirements, I can say that you can use GenomeTools to preprocess your input. In general, you can use gt gff3 sort to sort individual GFF3 files. Use the -tidy option when it still complains about some format quirks. Then, you can use gt merge to merge all sorted files into one. The -help option can tell you more about how to use each individual tool.

@bioinformaticspcj
Copy link
Author

bioinformaticspcj commented Nov 29, 2020 via email

@satta
Copy link
Member

satta commented Nov 29, 2020

Looks like the problem seems to be related by the relation of sequences and annotations in your input. This would probably be difficult to debug without your input files (or a minimal reduced version triggering the problem).

@bioinformaticspcj
Copy link
Author

bioinformaticspcj commented Nov 30, 2020 via email

@satta
Copy link
Member

satta commented Dec 3, 2020

I have taken some free time to dig into this today. I was able to reduce the test case to this minimal example:
test.gff3.gz

Apparently, when calculating sequence ranges for extraction, we are passed a start coordinate of 0, which then, when subtracted the offset of 1, underflows later during coordinate munging before extraction in GenomeTools:

[...]
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7ca455b in __GI_abort () at abort.c:79
#2  0x00005555555e09aa in gt_bioseq_get_sequence_range (bs=<optimized out>, idx=<optimized out>, start=start@entry=18446744073709551615, end=end@entry=100)
    at src/core/bioseq.c:388
#3  0x00005555555e226e in gt_bioseq_col_md5_to_seq (sc=<optimized out>, seq=0x7fffffffdaa0, start=18446744073709551615, end=100, md5_seqid=<optimized out>, 
    err=0x5555556827b0) at src/core/bioseq_col.c:287
#4  0x0000555555599673 in gt_region_mapping_get_sequence (rm=0x555555687980, seq=0x7fffffffdaa0, seqid=0x555555688ee0, start=0, end=101, err=0x5555556827b0)
    at src/extended/region_mapping.c:240
#5  0x000055555556123e in find_false_sites (false_don_0_gt=0x55555569c8e0, false_don_0_gc=0x555555690b40, false_acc_0=0x55555569c910, false_don_1_gt=0x5555556940b0, 
    false_don_1_gc=0x555555690b70, false_acc_1=0x5555556940e0, false_don_2_gt=0x555555694110, false_don_2_gc=0x555555690ba0, false_acc_2=0x555555690ab0, 
    seqs=0x555555687bb0, proc_exons=true, region_mapping=0x555555687980, gcdonor=true, err=0x5555556827b0) at src/gth/bssm_seq_processor.c:1089
#6  0x000055555556198f in gth_bssm_seq_processor_find_false_sites (bsp=0x5555556879f0, region_mapping=0x555555687980, err=0x5555556827b0)
    at src/gth/bssm_seq_processor.c:1153
#7  0x000055555555ce4a in gt_gthbssmtrain_runner (argc=7, argv=0x7fffffffdde8, parsed_args=6, tool_arguments=0x555555684870, err=0x5555556827b0)
    at src/gth/gt_gthbssmtrain.c:319
#8  0x000055555558727b in gt_tool_run (tool=tool@entry=0x555555684810, argc=argc@entry=7, argv=argv@entry=0x7fffffffdde8, err=err@entry=0x5555556827b0)
    at src/core/tool.c:107
#9  0x0000555555587e0c in gt_toolobjdriver_with_license (tool_constructor=<optimized out>, version_func=0x55555555c22b <gthversionfunc>, argc=7, argv=0x7fffffffdde8, 
    license_out=license_out@entry=0x0, license_constructor=license_constructor@entry=0x0, license_destructor=0x0) at src/core/tooldriver.c:96
#10 0x0000555555587f4f in gt_toolobjdriver (tool_constructor=<optimized out>, version_func=<optimized out>, argc=<optimized out>, argv=<optimized out>)
    at src/core/tooldriver.c:67
#11 0x000055555555c229 in main (argc=7, argv=0x7fffffffdde8) at src/gthbssmtrain.c:8
(gdb) f 5
#5  0x000055555556123e in find_false_sites (false_don_0_gt=0x55555569c8e0, false_don_0_gc=0x555555690b40, false_acc_0=0x55555569c910, false_don_1_gt=0x5555556940b0, 
    false_don_1_gc=0x555555690b70, false_acc_1=0x5555556940e0, false_don_2_gt=0x555555694110, false_don_2_gc=0x555555690ba0, false_acc_2=0x555555690ab0, 
    seqs=0x555555687bb0, proc_exons=true, region_mapping=0x555555687980, gcdonor=true, err=0x5555556827b0) at src/gth/bssm_seq_processor.c:1089
1089	              had_err = gt_region_mapping_get_sequence(region_mapping,
(gdb) p *intron
$1 = {seqid = 0x555555688ee0, seq = 0x555555687f70, desc = 0x555555690a80, range = {start = 11, end = 77}, reverse = false, phase = 0}
(gdb) p acc_range
$2 = {start = 0, end = 101}
(gdb) f 4
#4  0x0000555555599673 in gt_region_mapping_get_sequence (rm=0x555555687980, seq=0x7fffffffdaa0, seqid=0x555555688ee0, start=0, end=101, err=0x5555556827b0)
    at src/extended/region_mapping.c:240
240	      had_err = gt_seq_col_md5_to_seq(rm->seq_col, seq, start - offset,
(gdb) p offset
$3 = 1
(gdb) p start
$4 = 0
(gdb) p start-offset
$5 = 18446744073709551615
(gdb) f 2
#2  0x00005555555e09aa in gt_bioseq_get_sequence_range (bs=<optimized out>, idx=<optimized out>, start=start@entry=18446744073709551615, end=end@entry=100)
    at src/core/bioseq.c:388
388	  gt_assert(idx < gt_encseq_num_of_sequences(bs->encseq) && end >= start);
(gdb) p start
$6 = 18446744073709551615

It looks like these function expect GFF3-style coordinates, starting at 1. However, the 0 was calculated in src/gth/bssm_seq_processor.c:1067:

if (!intron->reverse) {
  if (intron->range.start + j >= 50) {
    acc_range.start = intron->range.start + j - 50;
    acc_underflow = false;
  }
  else
    acc_underflow = true;
  acc_range.end = intron->range.start + j + 51;
}

which -- here in the case of j = 39 (and intron->range.start = 11, see above) -- will lead to a acc_range.start of 0.

Simply replacing the >= with > fixes the problem and allows the training to complete on OP's whole data set.

diff --git a/src/gth/bssm_seq_processor.c b/src/gth/bssm_seq_processor.c
index 62bb279..55e6840 100644
--- a/src/gth/bssm_seq_processor.c
+++ b/src/gth/bssm_seq_processor.c
@@ -1021,7 +1021,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
              (gcdonor &&
              (cseq[j+1] == 'C' || cseq[j+1] == 'c')))) {
           if (!intron->reverse) {
-            if (intron->range.start + j >= 50) {
+            if (intron->range.start + j > 50) {
               don_range.start = intron->range.start + j - 50;
               don_underflow = false;
             }
@@ -1030,7 +1030,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
             don_range.end = intron->range.start + j + 51;
           }
           else {
-            if (intron->range.end >= j + 51) {
+            if (intron->range.end > j + 51) {
               don_range.start = intron->range.end - j - 51;
               don_underflow = false;
             }
@@ -1063,7 +1063,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
                   (cseq[j]   == 'A' || cseq[j]   == 'a') &&
                   (cseq[j+1] == 'G' || cseq[j+1] == 'g')) {
           if (!intron->reverse) {
-            if (intron->range.start + j >= 50) {
+            if (intron->range.start + j > 50) {
               acc_range.start = intron->range.start + j - 50;
               acc_underflow = false;
             }
@@ -1072,7 +1072,7 @@ static int find_false_sites(GtArray *false_don_0_gt, GtArray *false_don_0_gc,
             acc_range.end = intron->range.start + j + 51;
           }
           else {
-            if (intron->range.end >= j + 51) {
+            if (intron->range.end > j + 51) {
               acc_range.start = intron->range.end - j - 51;
               acc_underflow = false;
             }

@gordon do you have any comments or objections? I'm not too familiar with the intricacies of the code in this spot. If it's OK I can prepare a PR.

@bioinformaticspcj
Copy link
Author

bioinformaticspcj commented Dec 7, 2020 via email

@satta
Copy link
Member

satta commented Dec 7, 2020

I'd leave the actual merge & release to someone with a better idea of how these changes might impact the output.

@gordon
Copy link
Member

gordon commented Dec 7, 2020

@satta: Many thanks for looking into this! Could you prepare a PR for your change? It looks good, but I want to test it on the old testsuite to see if it changes any previous results.

@satta satta linked a pull request Dec 7, 2020 that will close this issue
@satta
Copy link
Member

satta commented Dec 7, 2020

@satta: Many thanks for looking into this! Could you prepare a PR for your change? It looks good, but I want to test it on the old testsuite to see if it changes any previous results.

Sure, see #23

@gordon
Copy link
Member

gordon commented Dec 7, 2020

Great, thanks. Testing...

@bioinformaticspcj
Copy link
Author

Hi, has the test finished?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants