
Use intake-solr with larger than memory datasets #7

Open · sodre opened this issue Jul 15, 2020 · 2 comments
sodre commented Jul 15, 2020

As a user of intake-solr, I would like to access datasets/queries that are larger than memory.

I believe the intake way to solve this is by creating a Solr driver that has partitioned access and that implements the to_dask() method. Is that correct?
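For concreteness, this is roughly the shape I have in mind - just a sketch of the generic intake partitioned-source pattern, with made-up class and parameter names and placeholder Solr logic, not anything that exists in intake-solr today:

```python
import pandas as pd
from intake.source.base import DataSource, Schema


class PartitionedSolrSource(DataSource):
    """Hypothetical partitioned Solr source; names and logic are illustrative."""

    container = 'dataframe'
    name = 'solr_partitioned'
    version = '0.0.1'
    partition_access = True  # tell intake that partitions can be read independently

    def __init__(self, query, url, rows_per_partition=10000, metadata=None):
        self._query = query
        self._url = url
        self._rows = rows_per_partition
        super().__init__(metadata=metadata)

    def _get_schema(self):
        # A real driver would run a rows=0 query to learn numFound and
        # sample a few documents to infer dtypes; placeholders here.
        npartitions = 4  # placeholder for ceil(numFound / rows_per_partition)
        return Schema(datashape=None, dtype=None, shape=(None,),
                      npartitions=npartitions, extra_metadata={})

    def _get_partition(self, i):
        # Placeholder: fetch documents [i * rows, (i + 1) * rows) from Solr
        # and return them as a pandas DataFrame.
        return pd.DataFrame()

    def to_dask(self):
        import dask.dataframe as dd
        from dask import delayed
        self._load_metadata()  # populates self.npartitions via _get_schema
        parts = [delayed(self.read_partition)(i)
                 for i in range(self.npartitions)]
        return dd.from_delayed(parts)
```

The point being that to_dask() would build one delayed task per partition, so a downstream computation only materialises the partitions it actually touches.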

martindurant (Member) commented

I'm afraid not - to_dask will produce a dask dataframe with a single partition containing all the data, which is the default behaviour in the absence of any dask-specific code. The pysolr package which executes the query has no way to split the output into partitions in a way that would be useful. I might be out of date, though - if you know pysolr better or there is a more recent executor, I'd be happy to point you towards implementing it for intake.
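In other words, the fallback for such sources amounts to roughly this (a paraphrase for illustration, not intake's literal code):

```python
import dask.dataframe as dd

def to_dask(self):
    # Without driver-specific dask support, everything is read eagerly
    # into one pandas DataFrame and wrapped as a single partition,
    # so nothing is actually out-of-core.
    return dd.from_pandas(self.read(), npartitions=1)
```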

sodre commented Jul 16, 2020

Okay, I have a sample implementation that I hacked up this past week. Let me know what you think.

Caveat: I don't have any unit tests written yet, but it is working in our environment.
