[MRG] allow index from a signature zipfile #74

Merged
bluegenes merged 5 commits into try-mastiff from hack-zip on Aug 29, 2023

bluegenes
Contributor

@bluegenes bluegenes commented Aug 28, 2023

If I'm not mistaken, we can't yet robustly read signatures from sig zip files in Rust. This is an experimental workaround: extract the signature files from the zip into a temp dir, then pass the new file paths to the branchwater/mastiff index function.

ChatGPT helped me a lot with the Rust, so I make no guarantees about it ... just that it seems to work for me.
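The extract-then-index idea described above can be sketched in a few lines. This is a Python illustration only, not the Rust code in this PR. It assumes signature members in the zip end in `.sig` or `.sig.gz` and that anything else (such as a manifest) can be skipped; `extract_sigs_to_tempdir` and `write_siglist` are hypothetical helper names.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_sigs_to_tempdir(zip_path, temp_dir):
    """Extract signature files from a zip and return the extracted paths.

    Assumption: signature members end in '.sig' or '.sig.gz'; other
    members (for example a manifest) are skipped.
    """
    sig_paths = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if not info.filename.endswith((".sig", ".sig.gz")):
                continue
            # ZipFile.extract returns the path of the file it wrote.
            sig_paths.append(Path(zf.extract(info, path=temp_dir)))
    return sig_paths


def write_siglist(sig_paths, siglist_path):
    """Write one extracted path per line, ready to hand to an indexer."""
    Path(siglist_path).write_text("".join(f"{p}\n" for p in sig_paths))
    return siglist_path
```

Because members are written under their stored names, two members with the same name would overwrite each other in the temp dir, which is one way entries can be lost with this approach.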

indexing gtdb:

time sourmash scripts index gtdb-rs214-reps.k31.zip -o gtdb-rs214-reps.k31.rdb

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000 / threshold: 0.01
indexing all sketches in 'gtdb-rs214-reps.k31.zip'
wrote 85205 signatures to temp dir
Loaded 85205 sig paths in siglist
...index is done! results in 'gtdb-rs214-reps.k31.rdb'

real    39m56.399s
user    33m57.385s
sys     5m51.034s

db is 6.7G.

and running gather with podar-ref set:

time sourmash scripts gather podar-ref-list.txt /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.rdb

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000 / threshold_bp: 50000
gathering all sketches in 'podar-ref-list.txt' against '/group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.rdb' using 28 threads

DONE. Processed 64 search sigs
...gather is done!

real    1m26.296s
user    1m17.609s
sys     0m7.943s

@bluegenes
Contributor Author

bluegenes commented Aug 29, 2023

With full GTDB

Indexing (390k sigs): 3h 40min

time sourmash scripts index gtdb-rs214-k31.zip -o gtdb-rs214.k31.rdb

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000
indexing all sketches in 'gtdb-rs214-k31.zip'

wrote 392058 signatures to temp dir
Loaded 392058 sig paths in siglist
...index is done! results in 'gtdb-rs214.k31.rdb'

real    217m24.231s
user    183m19.754s
sys     32m58.711s

Note that there are 402,709 sigs in the GTDB-rs214 zip manifest, so the 392,058 written here is the number of unique sigs by md5sum. This means we lose the identifying information for sigs with duplicated md5sums with this (admittedly hacky) method. But I think we don't deal with this well in regular sourmash gather at the moment, either.
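To put a number on the loss, here is a small Python sketch of the arithmetic, plus a way to count collapsing entries in a manifest. The two totals come from the log and manifest count above; the `(md5, name)` row structure and the demo rows are hypothetical, made up for illustration.

```python
from collections import Counter

manifest_entries = 402_709   # sigs listed in the GTDB-rs214 zip manifest
unique_md5_files = 392_058   # signatures written to the temp dir

# Entries that share an md5sum with another entry and so collapse
# onto a file that was already written.
collapsed = manifest_entries - unique_md5_files


def count_collapsed(rows):
    """Count manifest rows whose md5 duplicates an earlier row.

    rows: iterable of (md5, name) pairs (hypothetical structure).
    """
    counts = Counter(md5 for md5, _name in rows)
    return sum(n - 1 for n in counts.values())


# Hypothetical rows: two names share one md5, so one entry collapses.
demo_rows = [("md5-a", "genome 1"), ("md5-a", "genome 2"), ("md5-b", "genome 3")]
print(collapsed, count_collapsed(demo_rows))
```

With the numbers above, `collapsed` is 10,651, or a bit under 3% of the manifest entries.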

Gather (28 threads, 64 queries): 2-3min

time sourmash scripts gather podar-ref-list.txt gtdb-rs214.k31.rdb

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000 / threshold_bp: 50000
DONE. Processed 64 search sigs
...gather is done!

real    3m15.715s
user    1m56.744s
sys     0m16.514s

And with threshold_bp 0:

time sourmash scripts gather podar-ref-list.txt gtdb-rs214.k31.rdb -t 0

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000 / threshold_bp: 0.0
gathering all sketches in 'podar-ref-list.txt' against 'gtdb-rs214.k31.rdb' using 28 threads

DONE. Processed 64 search sigs
...gather is done!

real    1m56.412s
user    1m47.068s
sys     0m9.049s

Using 1 thread ... is just as fast? Hmm...

time sourmash scripts gather podar-ref-list.txt gtdb-rs214.k31.rdb -t 0 -c 1 -o gtest.csv

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000 / threshold_bp: 0.0
gathering all sketches in 'podar-ref-list.txt' against 'gtdb-rs214.k31.rdb' using 1 threads
DONE. Processed 64 search sigs
...gather is done! results in 'gtest.csv'

real    1m55.545s
user    1m45.736s
sys     0m9.274s

On an srun with 4 threads: 20min

time sourmash scripts gather podar-ref-list.txt gtdb-rs214.k31.rdb -t 0 -c 1 -o gtest.csv

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000 / threshold_bp: 0.0
gathering all sketches in 'podar-ref-list.txt' against 'gtdb-rs214.k31.rdb' using 1 threads

DONE. Processed 64 search sigs
...gather is done! results in 'gtest.csv'

real    20m2.326s
user    1m41.909s
sys     0m19.415s

Is the global thread pool being limited properly for gather, despite using the code from #57?
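One way to check this from the `time` output alone is to compare CPU time (user + sys) against wall-clock time. A ratio near 1 means the process kept roughly one core busy regardless of the requested thread count. The sketch below is Python, uses the real/user/sys values from the runs above, and the helper names are hypothetical.

```python
import re


def parse_time_field(value):
    """Convert a bash `time` value such as '20m2.326s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_utilization(real, user, sys):
    """Return (user + sys) / real, i.e. average number of busy cores."""
    cpu_seconds = parse_time_field(user) + parse_time_field(sys)
    return cpu_seconds / parse_time_field(real)


# (real, user, sys) copied from the three `-t 0` runs above.
runs = {
    "28 threads": ("1m56.412s", "1m47.068s", "0m9.049s"),
    "-c 1": ("1m55.545s", "1m45.736s", "0m9.274s"),
    "-c 1, srun with 4 threads": ("20m2.326s", "1m41.909s", "0m19.415s"),
}

for label, (real, user, sys) in runs.items():
    print(f"{label}: {cpu_utilization(real, user, sys):.2f} cores busy on average")
```

The first two runs come out very close to 1.0, which suggests the 28-thread run was effectively using a single core. The 20-minute run comes out around 0.1: CPU time is about the same as the others, so the extra wall time was spent waiting rather than computing.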

@bluegenes
Contributor Author

bluegenes commented Aug 29, 2023

Or, is rdb gather just not very thread-efficient, and variation is a system issue?

Running again, this time with /usr/bin/time -v:

On srun with 28 threads: 1min

/usr/bin/time -v sourmash scripts gather podar-ref-list.txt gtdb-rs214.k31.rdb -o test2.csv -t 0 -c 28
ksize: 31 / scaled: 1000 / threshold_bp: 0.0
gathering all sketches in 'podar-ref-list.txt' against 'gtdb-rs214.k31.rdb' using 28 threads
DONE. Processed 64 search sigs
...gather is done! results in 'test2.csv'
        
        Command being timed: "sourmash scripts gather podar-ref-list.txt /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214.k31.rdb -o test2.csv -t 0 -c 28"
        User time (seconds): 121.27
        System time (seconds): 15.43
        Percent of CPU this job got: 215%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:03.57
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1123804
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 805101
        Voluntary context switches: 176895
        Involuntary context switches: 1301
        Swaps: 0
        File system inputs: 0
        File system outputs: 64
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

On srun with 2 threads (and trying to limit to 1 via -c 1): 2min

/usr/bin/time -v sourmash scripts gather podar-ref-list.txt /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214.k31.rdb -o test2.csv -t 0 -c 1

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1000 / threshold_bp: 0.0
gathering all sketches in 'podar-ref-list.txt' against '/group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214.k31.rdb' using 1 threads
DONE. Processed 64 search sigs
...gather is done! results in 'test2.csv'
        Command being timed: "sourmash scripts gather podar-ref-list.txt /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214.k31.rdb -o test2.csv -t 0 -c 1"
        User time (seconds): 111.48
        System time (seconds): 9.53
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:01.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1008332
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1396431
        Voluntary context switches: 3777
        Involuntary context switches: 709
        Swaps: 0
        File system inputs: 0
        File system outputs: 80
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

on another run, 1 thread: 5min

        Command being timed: "sourmash scripts gather podar-ref-list.txt /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214.k31.rdb -o test2.csv -t 0 -c 1"
        User time (seconds): 106.78
        System time (seconds): 25.00
        Percent of CPU this job got: 44%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:55.57
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1008240
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 782
        Minor (reclaiming a frame) page faults: 1548353
        Voluntary context switches: 881521
        Involuntary context switches: 2199
        Swaps: 0
        File system inputs: 0
        File system outputs: 80
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

so yes, seems threads aren't doing much here.
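For a rough sense of scale, the speedup and parallel efficiency can be computed from the two `/usr/bin/time -v` elapsed times above (`-c 28` vs `-c 1`). This is a Python sketch; `elapsed_to_seconds` is a hypothetical helper, and the comparison assumes the two runs are otherwise comparable, which the 5-minute rerun calls into question.

```python
def elapsed_to_seconds(value):
    """Convert a '/usr/bin/time -v' elapsed value (m:ss.ss or h:mm:ss) to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


wall_28 = elapsed_to_seconds("1:03.57")  # run with -c 28
wall_1 = elapsed_to_seconds("2:01.68")   # run with -c 1

speedup = wall_1 / wall_28        # how much faster 28 threads ran
efficiency = speedup / 28         # fraction of ideal 28x scaling

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.1%}")
```

That works out to a speedup of about 1.9x, or roughly 7% of ideal scaling for 28 threads. The second `-c 1` run (4:55.57 at 44% CPU) differs from the first `-c 1` run by more than the 28-thread run does, so system variation looks at least as large as any threading effect.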

@bluegenes bluegenes changed the title [EXP] allow index from a signature zipfile [MRG] allow index from a signature zipfile Aug 29, 2023
@bluegenes bluegenes merged commit 309f01a into try-mastiff Aug 29, 2023
2 of 12 checks passed
@bluegenes bluegenes deleted the hack-zip branch August 29, 2023 18:34
@mr-eyes
Member

mr-eyes commented Aug 29, 2023

That's awesome! Great job, @bluegenes!

ctb added a commit that referenced this pull request Aug 31, 2023
* try adding rdb index and manysearch

* init testing

* use pathlist loading for better errs; more tests

* also check intersect_hashes

* add test for index check

* add multiquery mastiff gather

* init mastiff gather testing

* remove original single-query mastiff search, gather

* more cleanup

* MRG: fix `if let` warnings (#63)

* fix threads for changes from main

* rm threshold

* [MRG] allow index from a signature zipfile (#74)

* zipfile hackaround

* fix

* fix tests

* clean up; unify search testing; pin core to commit

* upd py toml

* test index zip

* add some indexed fastmultigather testing

* add cargo lock

* more index tests

* indexed multigather tests

* revert to branch while trying upds

* better help; avoid recalc threshold

* EXP: try fix CI for rocksdb (#80)

* Add trial workflow

* ok weird removing sourmash

* try again

* do the test

* remove maturin CI for the moment

* try caching rust build stuff

* fix yaml syntax

* test actions

---------

Co-authored-by: C. Titus Brown
ctb added a commit that referenced this pull request Sep 1, 2023
* [MRG] add mastiff interface functions (#58)

* improve gather output

* cargo lock

* re add jaccard

* added test for max cont

* version and cite

* rm warning

* bump versions

---------

Co-authored-by: Tessa Pierce Ward
ctb added a commit that referenced this pull request Sep 1, 2023
* add max_containment column

* change variable name

* another variable rename

* bump version

* MRG: re-add Jaccard; many UX output improvements (#85)

* cleanup

* wat

* dup simple test

* test fix

---------

Co-authored-by: C. Titus Brown
Co-authored-by: Tessa Pierce Ward