Sequential selection WIP #81

amy-langley · 2019-03-05T17:44:01Z

Currently is sorting by whatever order the subject ids are in the cache, not by priority.

Fixed #77

marten · 2019-03-05T17:47:09Z

@camallen easiest solution would just to tell the DB to ORDER BY set_member_subjects.priority ASC. There's no index on that for now. Can we just add that or do you expect issues if we do?

camallen · 2019-03-05T18:19:04Z

@camallen easiest solution would just to tell the DB to ORDER BY set_member_subjects.priority ASC. There's no index on that for now. Can we just add that or do you expect issues if we do?

I'd be hesitant to add more load to that database, though could be convinced with some quality benchmarking of variable data set sizes that shows minimal impact on the db. Asthe number of linked subjects gets large then the order by clause can be slow in my experience. We worked / benchmarked this a bit before the panoptes rebuild and cellect implementation.

Is there a way to load the data without ordering in the db and then re-order in memory or use a datastructure that orders based on the priority field?

For ref - Cellect uses some internal data structures to handle this
It load the data via https://github.com/zooniverse/cellect_panoptes/blob/9983ed67b463e022b93f00f7390b0c504a565589/lib/cellect/server/adapters/panoptes.rb#L84

then adds the data to a Fibonacci heap for ordered priority access
https://github.com/parrish/diff_set/blob/a37aaf87f5761c5b3ff0d814df9193f0d5ba1c3c/ext/diff_set/priority_set.h#L51

using this comparator function https://github.com/parrish/diff_set/blob/a37aaf87f5761c5b3ff0d814df9193f0d5ba1c3c/ext/diff_set/priority_set.h#L36

And then access the data in order
https://github.com/parrish/diff_set/blob/a37aaf87f5761c5b3ff0d814df9193f0d5ba1c3c/ext/diff_set/priority_set.cpp#L40

marten · 2019-03-05T19:45:04Z

Yeah we can do that, but currently at that point in the code there's no knowledge of how the workflow is configured, so it can't make that decision there. We'll have to thread the data through in that case.

zwolf · 2019-06-17T16:27:22Z

Circling waaaaay back to this now: the sorting needs to occur either on the DB side or in the Elixir process. The implication a while back was that it was in Designator's Subject model that this data was pulled from the DB and cached, but in revisiting this I can't find anywhere where the Subject#unretired_ids method is used. Rather, the Workflow model's subject_ids method seems the spot. I've started there and will pull the SMS priority field in addition to the id and sort by it before returning the ids in that order.

marten · 2019-06-17T16:54:11Z

Yeah that appears to be dead code. Workflow is where it's at.

marten · 2019-06-17T16:57:45Z

FWIW, after Cam's comments and the ensuing delay, I would suggest doing the sorting in Elixir (at the point just after we retrieve from the DB, and before it gets written to the SubjectSetCache) until we have evidence that it's not feasible. Easy to implement, and it won't cause the entire database to fall over, at most it'll make reloading slow.

zwolf · 2019-06-20T22:47:40Z

Above includes the priority when pulling from SMS, then sorts by it and returns only the ids in that order. This should, when called from the sync or async reloader, push them into the cache in that order. Selection basically assumes that the order of the Array as it exists in the cache will be the order that they're selected. So any stream involved in a request for subjects for a prioritized workflow will be served starting with the lowest priority, accounting for the various other filters involved.

Questions this raised:

How do multiple subject sets factor in here? convert_to_streams builds as many streams as subject sets, which are then interleaved by way of do_select and some alleged sorcery. The subjects themselves will be returned in priority order, but the streams will have been combined in such a way that makes that irrelevant.

Might be misunderstanding the composition, but a single workflow (with a nil subject_set_id if a param isn't included) is sent to the stream builder:

designator/lib/designator/selection.ex

Lines 50 to 54 in 3429052

    
           def get_streams(workflow, user, subject_set_id) do 
        
             streams = selection_subject_set_ids(workflow.subject_set_ids, subject_set_id) 
        
             |> get_subject_set_from_cache(workflow) 
        
             |> reject_empty_sets 
        
             |> convert_to_streams(workflow)

But get_subject_set_from_cache and convert_to_streams both take two args and don't seem to be getting the all the workflow's subject_set_ids.

What is an Array wrapper library doing for us that built-in lists don't? There's a lot of filtering of the subject array, but that's all a stream by that point, and it's not obvious what the extended library functionality defined in deps is vs. base elixir.

Going to keep at this, happy to talk it through if anyone has any insights.

camallen · 2019-06-21T09:53:32Z

any stream involved in a request for subjects for a prioritized workflow will be served starting with the lowest priority

Excellent - We need to ensure consistency with the internal API selector, that is Ascending order and was mapped to the page numbers use case from books. https://github.com/zooniverse/Panoptes/blob/e9cea8ebd02c413e3304ad98da8ce5aeea9fcc38/lib/subjects/postgresql_in_order_selection.rb#L15

zwolf · 2020-03-17T17:31:40Z

Closing because the remote repo has been deleted.

amy-langley and others added 7 commits June 21, 2019 11:09

add sequential stream class

c30f714

reorganize iteration strategies

166e349

use workflow.prioritized to specify iteration strategy

7c0dd0b

Sort by SMSes by priority when retrieving from db

c2998ab

Delete unused subject model

3e711fa

Remove IEx require

8ed2fb2

Un-order ids in sequential test

1a12958

zwolf closed this Mar 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sequential selection WIP #81

Sequential selection WIP #81

amy-langley commented Mar 5, 2019 •

edited by marten

Loading

marten commented Mar 5, 2019

camallen commented Mar 5, 2019

marten commented Mar 5, 2019

zwolf commented Jun 17, 2019

marten commented Jun 17, 2019

marten commented Jun 17, 2019

zwolf commented Jun 20, 2019

camallen commented Jun 21, 2019

zwolf commented Mar 17, 2020

Sequential selection WIP #81

Sequential selection WIP #81

Conversation

amy-langley commented Mar 5, 2019 • edited by marten Loading

marten commented Mar 5, 2019

camallen commented Mar 5, 2019

marten commented Mar 5, 2019

zwolf commented Jun 17, 2019

marten commented Jun 17, 2019

marten commented Jun 17, 2019

zwolf commented Jun 20, 2019

camallen commented Jun 21, 2019

zwolf commented Mar 17, 2020

amy-langley commented Mar 5, 2019 •

edited by marten

Loading