
Use intake-solr with larger than memory datasets #7

Open · sodre opened this issue Jul 15, 2020 · 2 comments
sodre commented Jul 15, 2020

As a user of intake-solr, I would like to access datasets/queries that are larger than memory.

I believe the intake way to solve this is by creating a Solr driver that has partitioned access and that implements the to_dask() method. Is that correct?
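For concreteness, this is roughly the shape I have in mind - just a sketch of the generic intake partitioned-source pattern, with made-up class and parameter names and placeholder Solr logic, not anything that exists in intake-solr today:

```python
import pandas as pd
from intake.source.base import DataSource, Schema


class PartitionedSolrSource(DataSource):
    """Hypothetical partitioned Solr source; names and logic are illustrative."""

    container = 'dataframe'
    name = 'solr_partitioned'
    version = '0.0.1'
    partition_access = True  # tell intake that partitions can be read independently

    def __init__(self, query, url, rows_per_partition=10000, metadata=None):
        self._query = query
        self._url = url
        self._rows = rows_per_partition
        super().__init__(metadata=metadata)

    def _get_schema(self):
        # A real driver would run a rows=0 query to learn numFound and
        # sample a few documents to infer dtypes; placeholders here.
        npartitions = 4  # placeholder for ceil(numFound / rows_per_partition)
        return Schema(datashape=None, dtype=None, shape=(None,),
                      npartitions=npartitions, extra_metadata={})

    def _get_partition(self, i):
        # Placeholder: fetch documents [i * rows, (i + 1) * rows) from Solr
        # and return them as a pandas DataFrame.
        return pd.DataFrame()

    def to_dask(self):
        import dask.dataframe as dd
        from dask import delayed
        self._load_metadata()  # populates self.npartitions via _get_schema
        parts = [delayed(self.read_partition)(i)
                 for i in range(self.npartitions)]
        return dd.from_delayed(parts)
```

The point being that to_dask() would build one delayed task per partition, so a downstream computation only materialises the partitions it actually touches.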

martindurant (Member) commented

I'm afraid not - to_dask will produce a dask dataframe with a single partition containing all the data, which is the default behaviour in the absence of any dask-specific code. The pysolr package which executes the query has no way to split the output into partitions in a way that would be useful. I might be out of date, though - if you know pysolr better or there is a more recent executor, I'd be happy to point you towards implementing it for intake.
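In other words, the fallback for such sources amounts to roughly this (a paraphrase for illustration, not intake's literal code):

```python
import dask.dataframe as dd

def to_dask(self):
    # Without driver-specific dask support, everything is read eagerly
    # into one pandas DataFrame and wrapped as a single partition,
    # so nothing is actually out-of-core.
    return dd.from_pandas(self.read(), npartitions=1)
```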

sodre commented Jul 16, 2020

Okay, I have a sample implementation that I hacked up this past week. Let me know what you think.

Caveat: I don't have any unit tests written yet, but it is working in our environment.
