fastmultigather x rocksdb doesn't actually use multiple cores #503

Open
ctb opened this issue Nov 8, 2024 · 6 comments

@ctb (Collaborator) commented Nov 8, 2024

one of the upshots from the benchmarks in #479 is that fmg x rocksdb doesn't use multiple cores and is also I/O bound. This is presumably because it is a serial algorithm that blocks on RocksDB queries!

I was able to get decent parallelism by using snakemake with multiple processes (see benchmark).

This is not inherently a problem, but I think we should at the least mention it in the docs.

A few thoughts:

  • it would also be nice to get rid of the -o behavior in the fastmultigather command - it is an enduring source of confusion for me, at least. (ref: Can't get fastmultigather to write to a specific output file #299)
  • we might consider splitting off a new command, or supporting fastgather against rocksdb? It doesn't inherently make sense to have fmg support multiple threads that it's not going to use ;).
  • In order to get better in-application parallelism against RocksDB, I think we would need to explore having multiple RocksDB handles open to the same database (see the sketch below). That doesn't seem like a problem to me and is probably the right solution, if it actually works ;).
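A rough sketch of what multiple readers against one RocksDB database could look like, using the rocksdb crate directly - this is not the plugin's actual code, and the path/key layout below is made up for illustration. The crate's DB handle is Send + Sync, so a single read-only handle can be shared across threads via Arc (which is what this sketch does); opening several independent read-only handles on the same path should also work.

// Sketch only (assumed setup, not the branchwater plugin code): share one
// read-only RocksDB handle across threads so lookups can proceed in parallel.
use std::sync::Arc;
use std::thread;

use rocksdb::{Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    let opts = Options::default();
    // Open the index read-only; `false` = don't error if a log file exists.
    let db = Arc::new(DB::open_for_read_only(&opts, "gtdb-rs214-k31.rocksdb", false)?);

    // One thread per worker; each issues its own point lookups, so no query
    // blocks another on a shared, serial loop.
    let workers: Vec<_> = (0..4)
        .map(|i| {
            let db = Arc::clone(&db);
            thread::spawn(move || {
                // Hypothetical key; the real index layout is not modeled here.
                let key = format!("example-key-{i}");
                match db.get(key.as_bytes()) {
                    Ok(Some(v)) => println!("worker {i}: {} bytes", v.len()),
                    Ok(None) => println!("worker {i}: not found"),
                    Err(e) => eprintln!("worker {i}: {e}"),
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
    Ok(())
}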
@ctb (Collaborator, Author) commented Nov 8, 2024

for posterity, here is the Snakefile I am using:

# note: ACC (the list of accession IDs to process) must be defined above this rule;
# it is not shown in this excerpt.
rule all:
    input:
        expand("fmg-smk/{acc}.fmg.csv", acc=ACC)

rule fmg:
    input:
        "/group/ctbrowngrp/irber/data/wort-data/wort-sra/sigs/{acc}.sig",
    output:
        "fmg-smk/{acc}.fmg.csv",
    shell: """
        sourmash scripts fastmultigather {input} gtdb-rs214-k31.rocksdb \
           -c 1 -s 10000 -o {output} -t 50_000
    """

@luizirber (Member) commented Nov 8, 2024

Adding py-spy and debug symbols to help generate a flamegraph in https://github.com/sourmash-bio/sourmash_plugin_branchwater/compare/lirber/gather_perf

github strips the interactive parts of the flamegraph SVG, but posting it below anyway. Generated by running:

py-spy record -o flamegraph2.svg -n -- \
   sourmash scripts fastmultigather ERR2241825.sig gtdb-rs214-k31.rocksdb -c 1 -s 10000 -o out.csv -t 50000

[flamegraph2.svg: flamegraph of the fastmultigather x rocksdb run]

Some initial comments:

  • rocksdb is the call stack on the left (6.69%), and about half of the really tall one on the right (0.61%). While that's not trivial, it is far from what takes most of the time
  • there seems to be a lot of conversion between KmerMinHash and KmerMinHashBTree in the middle, including duplicating data with clone, which can likely be avoided
  • this took ~48 minutes to run

@ctb (Collaborator, Author) commented Nov 9, 2024

looks like this is all happening in disk_revindex.rs, in gather() under impl RevIndexOps for RevIndex (permalink)

The clone is concerning - there are two clones of the metagenome query, one at the beginning,

        let mut query = KmerMinHashBTree::from(orig_query.clone());

which is needed because orig_query is used later on in calculate_gather_stats to calculate an intersection.

The second metagenome clone is done on the modified query, each iteration, when query is converted back into a KmerMinHash:

            let query_mh = KmerMinHash::from(query.clone());

and this is used in calculate_gather_stats as remaining_query. It is never modified, so we could avoid the clone, but I guess the thinking behind converting it from a KmerMinHashBTree back to a KmerMinHash is that the various operations it's used in will be faster on a KmerMinHash? Specifically, it is used as an argument for intersection and inflated_abundances.
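To make the "we could avoid the clone" idea concrete, here is a standalone sketch with made-up types - FlatSketch / BTreeSketch standing in for KmerMinHash / KmerMinHashBTree, not the actual sourmash API. The consuming conversion forces a clone of the whole B-tree sketch each iteration; a borrowing From<&...> impl copies the hashes once and leaves the query usable.

// Standalone illustration, not the sourmash types: compare a consuming
// conversion (which needs `query.clone()` every iteration) with a borrowing
// conversion that builds the flat sketch directly from a reference.
use std::collections::BTreeSet;

#[derive(Clone)]
struct BTreeSketch {
    hashes: BTreeSet<u64>,
}

struct FlatSketch {
    hashes: Vec<u64>, // kept sorted, KmerMinHash-style
}

// Consuming conversion: roughly what `KmerMinHash::from(query.clone())` costs.
impl From<BTreeSketch> for FlatSketch {
    fn from(s: BTreeSketch) -> Self {
        FlatSketch {
            hashes: s.hashes.into_iter().collect(),
        }
    }
}

// Borrowing conversion: no up-front clone of the B-tree needed.
impl From<&BTreeSketch> for FlatSketch {
    fn from(s: &BTreeSketch) -> Self {
        FlatSketch {
            hashes: s.hashes.iter().copied().collect(),
        }
    }
}

fn main() {
    let query = BTreeSketch {
        hashes: (0..10u64).collect(),
    };

    // clone + consume: two passes over the data, every iteration
    let a = FlatSketch::from(query.clone());

    // borrow: one copy of the hashes, and `query` is still usable afterwards
    let b = FlatSketch::from(&query);

    assert_eq!(a.hashes, b.hashes);
}

Whether this helps in practice depends on whether calculate_gather_stats needs to own its remaining_query, of course.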

We can always make gather faster, I guess 😆

@luizirber (Member) commented:

After sourmash-bio/sourmash#3385:

[flamegraph_md5sum.svg: flamegraph after the MD5 fix in sourmash#3385]

Down to 22m (was 48m before).

@luizirber (Member) commented:

Tried using KmerMinHashBTree all the way, instead of converting back to KmerMinHash for calculations, in https://github.com/sourmash-bio/sourmash/compare/lirber/gather_downsampled_iter (misleading branch name 😓 ):

[flamegraph_btree.svg: flamegraph using KmerMinHashBTree throughout]

This took 25m vs 22m with KmerMinHash, and didn't eliminate the clones. So +0 on changing it for now. Time is mostly spent on calculating intersections now.

luizirber added a commit to sourmash-bio/sourmash that referenced this issue Nov 11, 2024
…shBTree (#3385)

While debugging
sourmash-bio/sourmash_plugin_branchwater#503
the flamegraph showed ~26% of the time was spent on calculating MD5.

WHY????

Turns out cloning and converting `KmerMinHash` to `KmerMinHashBTree`
triggered recalculation of the MD5 sum, even if it was already present
(or... not needed). Oops!
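For reference, a standalone sketch of the pattern behind that fix, using made-up Sketch/SketchBTree types rather than the actual sourmash code: keep the digest in a lazily-initialized cache, and carry the cached value across conversions instead of recomputing it.

// Hypothetical types illustrating the bug class fixed in sourmash#3385:
// cache an expensive digest lazily, and copy the cache across a conversion
// so the conversion never forces a recomputation.
use std::cell::OnceCell;
use std::collections::BTreeSet;

struct Sketch {
    hashes: Vec<u64>,
    md5: OnceCell<String>, // computed at most once, on demand
}

impl Sketch {
    fn new(hashes: Vec<u64>) -> Self {
        Sketch { hashes, md5: OnceCell::new() }
    }

    // Stand-in for the real MD5: expensive, so we only want to run it once.
    fn md5sum(&self) -> &str {
        self.md5.get_or_init(|| {
            println!("(computing digest...)");
            format!("digest-of-{}-hashes", self.hashes.len())
        })
    }
}

struct SketchBTree {
    hashes: BTreeSet<u64>,
    md5: OnceCell<String>,
}

impl From<&Sketch> for SketchBTree {
    fn from(s: &Sketch) -> Self {
        SketchBTree {
            hashes: s.hashes.iter().copied().collect(),
            // Carry over the cached digest if it exists; don't recompute it.
            md5: s.md5.clone(),
        }
    }
}

fn main() {
    let s = Sketch::new((0..100).collect());
    let _ = s.md5sum();            // digest computed once, here
    let b = SketchBTree::from(&s); // conversion reuses the cached digest
    assert_eq!(b.md5.get(), Some(&s.md5sum().to_string()));
}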
@ctb (Collaborator, Author) commented Nov 11, 2024

From a ~two-day run of fastmultigather against rocksdb, with an early version of #504 and NOT using the performance fix above, we get good multicore usage on fastmultigather via snakemake/process-level parallelism. Only 6 GB of RAM used, too!

        Command being timed: "snakemake -c 10"
        User time (seconds): 1419707.49
        System time (seconds): 39212.19
        Percent of CPU this job got: 972%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 41:39:12
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 6060108
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 53236
        Minor (reclaiming a frame) page faults: 3151089386
        Voluntary context switches: 79502408
        Involuntary context switches: 2974025
        Swaps: 0
        File system inputs: 116798016
        File system outputs: 33824280
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
