PERF: Speed up distance computations #169

Draft: wants to merge 2 commits into dev

Conversation

@ElDeveloper (Member)

Use Pandas vectorized operations to speed up the validation and
computation of first distances and distances to baseline.

In a dataset with ~6K samples runtime went from >1 minute to under a
second.


All tests pass except for test_first_distances_ecam, which fails because of replicate_handling='drop'. As far as I can tell, when there's a repeated state, both the repeated state and the next state are dropped. That seems odd, but I wanted to check whether this is expected or whether I am misunderstanding something. It seems to me that only the repeated state should be dropped.

For example, when I go to ecam_map_maturity.txt, I noticed that sample 10249.C001.07SS is not in ecam-first-distances even though the state for that sample isn't repeated. The same is true for 10249.C001.17SS, which should be included since state 10 is repeated but state 11 isn't. Any help here would be great.

@nbokulich (Member)

thanks @ElDeveloper! this sounds great.

Re: the test failure with drop, this is the intended behavior:

'replicate_handling': (
'Choose how replicate samples are handled. If replicates are '
'detected, "error" causes method to fail; "drop" will discard all '
'replicated samples; "random" chooses one representative at random '
'from among replicates.')

As the first-distances are calculated across each interval, missing timepoints (including if a timepoint is dropped) effectively cause two first distances to drop (at time t and time t+1). So the examples you gave are correct; these are cases where the first distance at time t+1 cannot be calculated because the preceding timepoint t is missing (because all samples were dropped). So the first distances cannot be calculated at either interval t-1 <--> t or t <--> t+1.

For these reasons, the "drop" option is rather specific and might only be valuable in some situations. "random" is the more suitable option in most real situations, as it will lead to a random sample being selected at those timepoints so that the first distances at both intervals can be calculated. But it sounds like the test is working as originally intended.
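A minimal pandas sketch of this behavior, using made-up sample IDs and monthly states (not the actual ECAM data):

```python
import pandas as pd

# Hypothetical per-subject metadata: one subject sampled at months 0-3,
# with a replicate at month 2.
md = pd.DataFrame({
    'sample-id': ['s0', 's1', 's2a', 's2b', 's3'],
    'month': [0, 1, 2, 2, 3],
}).set_index('sample-id')

# replicate_handling='drop': keep=False discards *all* copies of a
# replicated state, so month 2 vanishes entirely.
kept = md[~md.duplicated(subset='month', keep=False)]
assert list(kept['month']) == [0, 1, 3]

# A first distance needs the immediately preceding timepoint, so both
# the 1->2 and 2->3 intervals are lost, not just the replicated samples.
months = list(kept['month'])
intervals = [(a, b) for a, b in zip(months, months[1:]) if b - a == 1]
# only the 0->1 interval survives
```

(The `b - a == 1` check assumes evenly spaced monthly sampling in this toy example; the plugin derives intervals from the states actually present.)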

@ElDeveloper (Member, Author)

Fair enough, thanks for the explanation! I'll fix the code in that case. Would you also be OK if I changed the documentation in plugin_setup.py? Seems like "will discard all replicated samples" is ambiguous about the fact that both t and t+1 are dropped.

@nbokulich (Member)

> Would you also be OK if I changed the documentation in plugin_setup.py? Seems like "will discard all replicated samples" is ambiguous about the fact that both t and t+1 are dropped.

documentation updates are very welcome 😄

but not that line: a bunch of actions in this plugin share the same parameter descriptions. So this action should either get a unique parameter description, or better describe this behavior in the action description (which IMO would be best, since this is the behavior for any missing samples, not specifically replicate_handling='drop')

@ElDeveloper (Member, Author)

Cheers, thanks so much.

Would you be able to help me diagnose one more case? I've updated my code, but now I am seeing that other samples are being excluded from the expected output. For example: 10249.C002.09SS, 10249.C002.15SS, 10249.C002.18SS, 10249.C005.07SS, 10249.C008.04SS, 10249.C008.12SS, 10249.C009.16SS, 10249.C011.04SS, 10249.C014.12SS, 10249.C018.03SS, 10249.C018.17SS, 10249.C021.04SS, 10249.C021.05SS, 10249.C021.07SS, 10249.C021.08SS, etc. When I look at the mapping file for a few of these, I don't see them associated with any of the repeated sample pairs that are dropped. Any hints as to why these are dropped from the expected output?

I've highlighted the first few in the context of the mapping file here:

10249.C002.09SS, 10249.C002.15SS, 10249.C002.18SS

[Screenshot: mapping file rows around these samples]

10249.C005.07SS

[Screenshot: mapping file rows around this sample]

10249.C008.04SS, 10249.C008.12SS

[Screenshot: mapping file rows around these samples]

@nbokulich (Member) commented Feb 9, 2022

Hey @ElDeveloper ,
Those first distances are not being calculated because they are missing the immediately preceding timepoint (i.e., it's not being dropped; rather, that timepoint is just missing to begin with).

The first distances intervals are meant to be measured across the same time intervals in each subject. So if you have a table like this (rows are samples, cols are months, cells indicate if a sample was collected at that timepoint):

subject-id  0  1  2  3  4  5  10
A           +  +  +  +  +  +  +
B           +        +  +  +  +

A has all first distances calculated, but B only has distances starting at timepoint 4 (i.e., samples were collected at 0 and 3, but the immediately preceding timepoint is missing). On the other hand, one could drop those columns and all distances would be calculated for each sample in a table that looks like this:

subject-id  0  3  4  5  10
A           +  +  +  +  +
B           +  +  +  +  +

This is the case above, where 10249.C002.09SS is a sample at month 8, but month 7 is missing. If month 7 were missing in all subjects, the distance interval would be from months 6->8. But that's not the case here, as the interval is based on all samples in the table.
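A small pandas sketch of this rule, with made-up data shaped like the tables above (the helper name and frame are illustrative, not the plugin's actual code):

```python
import pandas as pd

# Hypothetical data: subject B is missing months 1 and 2, so its first
# distances only start once the preceding global timepoint is present.
md = pd.DataFrame({
    'subject': ['A'] * 7 + ['B'] * 5,
    'month':   [0, 1, 2, 3, 4, 5, 10] + [0, 3, 4, 5, 10],
})

# The interval structure comes from *all* samples in the table.
states = sorted(md['month'].unique())
prev = dict(zip(states[1:], states[:-1]))  # month -> preceding month

def has_first_distance(group, month):
    # a first distance at `month` needs the preceding global timepoint
    return prev.get(month) in set(group['month'])

b = md[md['subject'] == 'B']
assert not has_first_distance(b, 3)  # month 2 is missing for B
assert has_first_distance(b, 4)      # month 3 is present for B
```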

Does that make sense?

Note: the intervals are measured consistently across samples, but they are not necessarily even across time. One big enhancement here could be to also expose an interval parameter, allowing "auto" (status quo, variable spacing possible) or explicit intervals (e.g., to ensure that all distances represent 1-month intervals, to set an explicit lag window, or even to skip shorter intervals). I see this as an ENH not needed now, but maybe based on how your code is set up you already have this implemented and it would be easier to expose now instead of totally refactoring later.

@ElDeveloper (Member, Author)

Thanks so much for the explanation. I didn't realize all this was going on under the hood. I like the idea of auto vs shared? or something like that. Let me rework this current version and I'll see where I get. 👍🏽

@ElDeveloper (Member, Author)

@nbokulich all tests are passing and this should be ready for review now. I've left interval selection out for this pass, but it's something we could naturally add in a follow-up PR. Looking forward to hearing your thoughts; I know this sort of code can sometimes be hard to read.

@nbokulich (Member) left a comment


thanks @ElDeveloper ! looks like a vast improvement 😁

I have not done a test run yet, but just finished reviewing the code. I have a few comments and questions throughout. Pls let me know what you think!

metadata.index[1:], state_column].values

states = metadata[state_column].unique()
states.sort()

why sort then flip? why not just sort descending?

]['Distance']

output.index.name = '#SampleID'
output = output.iloc[::-1]

can you pls drop in a comment here — why are you flipping the dataframe again?


output = metadata.groupby(individual_id_column).apply(column_getter)

# When the output of groupby is a single series i.e. when there's only one

would this be better caught with a conditional, e.g., by checking the number of unique individuals?

distance_matrix: pd.DataFrame, metadata: pd.DataFrame,
state_column: str, individual_id_column: str) -> pd.Series:

# with the sorted table, we add an extra column with indices so we make

I am not really following this idea... adding, renaming, and handling this extra column seems to take quite a few lines. Wouldn't it just be easier to reset_index? instead of indexing on an artificial new column?
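The reset_index suggestion in miniature, with a hypothetical frame (not the plugin's actual variables): resetting the index exposes positional row numbers directly, with no artificial column to add, rename, and clean up later.

```python
import pandas as pd

# Hypothetical sorted metadata: sample IDs in the index, states in a column.
md = pd.DataFrame({'month': [3, 1, 2]}, index=['s3', 's1', 's2'])

# reset_index() moves the old index into a column and leaves a clean
# 0..n-1 positional index, which can then be used for positional slicing.
flat = md.sort_values('month').reset_index()
assert list(flat['index']) == ['s1', 's2', 's3']
assert list(flat.index) == [0, 1, 2]
```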

return output


def _vectorized_distance_to_baseline(

I like the idea of merging metadata + distance matrix, then using a groupby by individual_id

but some of the logic involved here seems a bit convoluted. It might be easier + more transparent to drop the slicing by index etc. In the groupby operation you could drop the first row (the baseline) and grab the column name matching that index. The result would be a vector of distances to baseline for that individual.
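A rough sketch of that suggestion with a made-up merged frame (sample IDs, column names, and the random distance matrix below are all illustrative, not the plugin's actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric distance matrix over five samples.
ids = ['a0', 'a1', 'a2', 'b0', 'b1']
rng = np.random.default_rng(0)
sq = rng.random((5, 5))
sq = (sq + sq.T) / 2
np.fill_diagonal(sq, 0)
dm = pd.DataFrame(sq, index=ids, columns=ids)

# Metadata merged with the distance matrix columns.
meta = pd.DataFrame({'subject': ['A', 'A', 'A', 'B', 'B'],
                     'month': [0, 1, 2, 0, 1]}, index=ids)
merged = meta.join(dm)

def to_baseline(group):
    # sort by state, take the first row as the baseline, then grab the
    # distance-matrix column named after that baseline sample and drop
    # the baseline row itself
    group = group.sort_values('month')
    baseline = group.index[0]
    return group[baseline].iloc[1:]

# one vector of distances-to-baseline per individual
result = pd.concat(to_baseline(g) for _, g in merged.groupby('subject'))
```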


# there's four "utility" columns that need to be ignored:
# individual, state, combo, and index
def column_getter(frame):

this part is a bit difficult to follow — see my comment above.

else:
# this way of finding duplicates is relevant for "random" too
duplicated = metadata.duplicated(subset=combo_column)
if replicate_handling == 'error':

change to:

if duplicated.any() and replicate_handling == 'error':

???

if replicate_handling == 'drop':
duplicated = metadata.duplicated(subset=combo_column, keep=False)
else:
# this way of finding duplicates is relevant for "random" too

where is randomization done? as far as I can tell, this rather takes the first occurrence

taking first is not all that bad either 😁 and could be exposed as a separate option. random could just randomly shuffle before duplicated(keep='first')
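That shuffle-then-dedupe idea sketched with made-up data (column names are illustrative):

```python
import pandas as pd

# Hypothetical metadata with replicated (subject, month) states.
md = pd.DataFrame({
    'subject': ['A', 'A', 'A', 'B'],
    'month':   [1, 1, 2, 1],
    'sample':  ['a1x', 'a1y', 'a2', 'b1'],
})

# "random" replicate handling as suggested: shuffle the rows first, then
# the usual duplicated(keep='first') picks a random representative.
shuffled = md.sample(frac=1, random_state=42)
picked = shuffled[~shuffled.duplicated(subset=['subject', 'month'],
                                       keep='first')]

# exactly one sample survives per (subject, month) state
assert not picked.duplicated(subset=['subject', 'month']).any()
```

Without the shuffle, the same expression implements "first", so both options could share one code path.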

f' state value of "{baseline}": {missing}')

# use -np.inf to always make the baseline the first in the list
metadata.replace({baseline: -np.inf}, inplace=True)

this will replace all matching values in the entire df, correct? instead of just the state_column column?
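The concern in miniature (hypothetical columns; the plugin's real frame has different names): a dict passed to `DataFrame.replace` matches values in every column, while restricting the call to the state column avoids collateral changes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the baseline value (0) also appears in a
# column other than the state column.
md = pd.DataFrame({'month': [0, 1, 2], 'score': [5, 0, 7]})

# DataFrame.replace touches *every* column...
whole = md.replace({0: -np.inf})
assert whole.loc[1, 'score'] == -np.inf  # unintended change

# ...so restricting the replacement to the state column is safer:
md['month'] = md['month'].replace({0: -np.inf})
assert md.loc[1, 'score'] == 0           # other columns untouched
assert md.loc[0, 'month'] == -np.inf
```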


this also relates to my question about _vectorized_distance_to_baseline... instead of doing a pass over the df to replace baseline values now, maybe this could just be done inside that function when grabbing the baseline row?

@ElDeveloper (Member, Author)

@nbokulich thanks for the detailed review, and apologies for the delay getting back to you on these comments. I'll be addressing them in the near future. Thanks so much for your patience! ⏳ 🚤

@lizgehret (Member)

Hey @ElDeveloper, thanks for all of your hard work here! We are doing some PR triage and review right now - are you still in the process of addressing @nbokulich's review comments?

@ElDeveloper (Member, Author)

Hi @lizgehret, thanks for the heads up. I have replied to almost all of the comments (although I haven't posted the review yet), but there are a few where I still need to do some additional work. I'll respond ASAP.

@gregcaporaso gregcaporaso marked this pull request as draft December 11, 2023 22:10