Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sequential selection WIP #81

Closed
wants to merge 7 commits into from
Closed

Sequential selection WIP #81

wants to merge 7 commits into from

Conversation

amy-langley
Copy link

@amy-langley amy-langley commented Mar 5, 2019

Currently is sorting by whatever order the subject ids are in the cache, not by priority.

Fixed #77

@marten
Copy link
Contributor

marten commented Mar 5, 2019

@camallen easiest solution would just to tell the DB to ORDER BY set_member_subjects.priority ASC. There's no index on that for now. Can we just add that or do you expect issues if we do?

@camallen
Copy link
Contributor

camallen commented Mar 5, 2019

@camallen easiest solution would just to tell the DB to ORDER BY set_member_subjects.priority ASC. There's no index on that for now. Can we just add that or do you expect issues if we do?

I'd be hesitant to add more load to that database, though could be convinced with some quality benchmarking of variable data set sizes that shows minimal impact on the db. Asthe number of linked subjects gets large then the order by clause can be slow in my experience. We worked / benchmarked this a bit before the panoptes rebuild and cellect implementation.

Is there a way to load the data without ordering in the db and then re-order in memory or use a datastructure that orders based on the priority field?

For ref - Cellect uses some internal data structures to handle this
It load the data via https://github.com/zooniverse/cellect_panoptes/blob/9983ed67b463e022b93f00f7390b0c504a565589/lib/cellect/server/adapters/panoptes.rb#L84

then adds the data to a Fibonacci heap for ordered priority access
https://github.com/parrish/diff_set/blob/a37aaf87f5761c5b3ff0d814df9193f0d5ba1c3c/ext/diff_set/priority_set.h#L51

using this comparator function https://github.com/parrish/diff_set/blob/a37aaf87f5761c5b3ff0d814df9193f0d5ba1c3c/ext/diff_set/priority_set.h#L36

And then access the data in order
https://github.com/parrish/diff_set/blob/a37aaf87f5761c5b3ff0d814df9193f0d5ba1c3c/ext/diff_set/priority_set.cpp#L40

@marten
Copy link
Contributor

marten commented Mar 5, 2019

Yeah we can do that, but currently at that point in the code there's no knowledge of how the workflow is configured, so it can't make that decision there. We'll have to thread the data through in that case.

@zwolf
Copy link
Member

zwolf commented Jun 17, 2019

Circling waaaaay back to this now: the sorting needs to occur either on the DB side or in the Elixir process. The implication a while back was that it was in Designator's Subject model that this data was pulled from the DB and cached, but in revisiting this I can't find anywhere where the Subject#unretired_ids method is used. Rather, the Workflow model's subject_ids method seems the spot. I've started there and will pull the SMS priority field in addition to the id and sort by it before returning the ids in that order.

@marten
Copy link
Contributor

marten commented Jun 17, 2019

Yeah that appears to be dead code. Workflow is where it's at.

@marten
Copy link
Contributor

marten commented Jun 17, 2019

FWIW, after Cam's comments and the ensuing delay, I would suggest doing the sorting in Elixir (at the point just after we retrieve from the DB, and before it gets written to the SubjectSetCache) until we have evidence that it's not feasible. Easy to implement, and it won't cause the entire database to fall over, at most it'll make reloading slow.

@zwolf
Copy link
Member

zwolf commented Jun 20, 2019

Above includes the priority when pulling from SMS, then sorts by it and returns only the ids in that order. This should, when called from the sync or async reloader, push them into the cache in that order. Selection basically assumes that the order of the Array as it exists in the cache will be the order that they're selected. So any stream involved in a request for subjects for a prioritized workflow will be served starting with the lowest priority, accounting for the various other filters involved.

Questions this raised:

  • How do multiple subject sets factor in here? convert_to_streams builds as many streams as subject sets, which are then interleaved by way of do_select and some alleged sorcery. The subjects themselves will be returned in priority order, but the streams will have been combined in such a way that makes that irrelevant.
  • Might be misunderstanding the composition, but a single workflow (with a nil subject_set_id if a param isn't included) is sent to the stream builder:
    def get_streams(workflow, user, subject_set_id) do
    streams = selection_subject_set_ids(workflow.subject_set_ids, subject_set_id)
    |> get_subject_set_from_cache(workflow)
    |> reject_empty_sets
    |> convert_to_streams(workflow)

    But get_subject_set_from_cache and convert_to_streams both take two args and don't seem to be getting the all the workflow's subject_set_ids.
  • What is an Array wrapper library doing for us that built-in lists don't? There's a lot of filtering of the subject array, but that's all a stream by that point, and it's not obvious what the extended library functionality defined in deps is vs. base elixir.

Going to keep at this, happy to talk it through if anyone has any insights.

@camallen
Copy link
Contributor

any stream involved in a request for subjects for a prioritized workflow will be served starting with the lowest priority

Excellent - We need to ensure consistency with the internal API selector, that is Ascending order and was mapped to the page numbers use case from books. https://github.com/zooniverse/Panoptes/blob/e9cea8ebd02c413e3304ad98da8ce5aeea9fcc38/lib/subjects/postgresql_in_order_selection.rb#L15

@zwolf
Copy link
Member

zwolf commented Mar 17, 2020

Closing because the remote repo has been deleted.

@zwolf zwolf closed this Mar 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How would we add a different selection strategy?
4 participants