Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: integrate sketching utils from directsketch #516

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Nov 15, 2024

In directsketch, I tried to come up with a set of sketching utils that would make reading parameters + building sketches easier and more standardized. I introduced a series of buildutils -- BuildRecord,BuildManifest, BuildCollection, etc that are similar to their sourmash counterparts, but enable building and updating.

Here, I copy the utils over from directsketch and replace the two async functions with sync counterparts that enable writing the sigs and manifest from the buildutils. Then, I use BuildCollection in manysketch to manage parameter string handling, sig template generation, and sig writing.

To do:

  • benchmark as-is
  • add/use par_iter across sig templates and benchmark

Future:

  • allow translation

@bluegenes
Copy link
Contributor Author

bluegenes commented Nov 15, 2024

status:
zip files are getting written, but can't (yet) be read with sourmash. Running into this error again:

sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidArchive("Extra data field contains disk number")' at src/core/src/storage/mod.rs:390

Apparently the zip that is installed in my environment is v2.2, which has the bug mentioned here, and not v2.0, which is in Cargo.toml.

...why/how?

Update: I'm not sure why/how that was happening, but I did some cleaning + updating, and now the right versions seem to be installing. Hopefully this will stay fixed.

@bluegenes
Copy link
Contributor Author

bluegenes commented Nov 16, 2024

The only remaining failing tests are the ones with "incomplete" param strings -- i.e. user doesn't provide a moltype or doesn't provide a ksize. Temporarily commented out.

The current parameter strategy = DNA default (k=31,scaled=1000,noabund), then modify any parameters that are provided. Previously, we required a moltype and ksize to be present in the parameter string.

What do we want to require? I think maybe the right solution may be to require moltype, but not necessarily scaled or ksize? If a moltype is provided, we can use the default scaled/ksize. However, we don't necessarily want to assume DNA if the moltype is not provided?

thoughts @ctb?

Update: I've changed the strategy to require moltype. I think this matches our sketch fromfile strategy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant