PERF: Speed up distance computations #169
base: dev
Conversation
Use Pandas vectorized operations to speed up the validation and computation of first distances and distances to baseline. In a dataset with ~6K samples, runtime went from over a minute to under a second.
All tests pass except for test_first_distances_ecam, which fails because of replicate_handling='drop'. As far as I can tell, when there's a repeated state, the repeated state and the next state are both dropped. That seems odd, but I wanted to check whether this is expected or whether I am misunderstanding something. It seems to me that only the repeated state should be dropped.
For example, when I go to ecam_map_maturity.txt, I noticed that sample 10249.C001.07SS is not in ecam-first-distances even though the state for that sample isn't repeated. The same is true for 10249.C001.17SS, which should be included since state 10 is repeated but state 11 isn't. Any help would be great here.
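As a rough illustration of the vectorization idea being described (toy data, and the column names 'individual' and 'state' are hypothetical; this is not the PR's actual code): pair each sample with the previous sample from the same individual, then do a single fancy-indexed lookup into the distance matrix instead of looping over samples in Python.

```python
import numpy as np
import pandas as pd

# Toy inputs; names and values are made up for illustration.
samples = ['a0', 'a1', 'a2', 'b0', 'b1', 'b2']
metadata = pd.DataFrame({
    'individual': ['A', 'A', 'A', 'B', 'B', 'B'],
    'state': [0, 1, 2, 0, 1, 2],
}, index=pd.Index(samples, name='#SampleID'))

rng = np.random.default_rng(0)
m = rng.random((6, 6))
dm = pd.DataFrame((m + m.T) / 2, index=samples, columns=samples)
np.fill_diagonal(dm.values, 0.0)

# Pair every sample with the previous sample of the same individual.
md = metadata.reset_index().sort_values(['individual', 'state'])
md['prev'] = md.groupby('individual')['#SampleID'].shift()
md = md.dropna(subset=['prev'])

# One vectorized lookup into the matrix instead of a per-sample loop.
first_distances = pd.Series(
    dm.to_numpy()[dm.index.get_indexer(md['prev']),
                  dm.columns.get_indexer(md['#SampleID'])],
    index=md['#SampleID'], name='Distance')
print(first_distances)
```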
thanks @ElDeveloper ! this sounds great. Re: the test failure with q2-longitudinal/q2_longitudinal/plugin_setup.py (lines 91 to 95 in 639ced8): as the linked parameter description explains, the "drop" option is rather specific and might only be valuable in some situations. "random" is the more suitable option in most real situations, as it will lead to a random sample being selected at those timepoints so that the first distances at both intervals can be calculated. But it sounds like the test is working as originally intended.
Fair enough, thanks for the explanation! I'll fix the code in that case. Would you also be OK if I changed the documentation there?
documentation updates are very welcome 😄 but not that line: a bunch of actions in this plugin share the same parameter descriptions. So this action should either get a unique parameter description, or better describe this behavior in the action description (which IMO would be best, since this is the behavior for any missing samples, not specifically from the "drop" option).
Cheers, thanks so much. Would you be able to help me diagnose one more case? I've updated my code, but now I am seeing that other samples are being excluded from the expected output. For example, I've highlighted the first few in the context of the mapping file.
Hey @ElDeveloper , the first distances are meant to be measured across the same time intervals in each subject. So suppose you have a table where rows are samples, columns are months, and cells indicate whether a sample was collected at that timepoint, with subject A sampled at every month and subject B sampled only at months 0, 3, and 4.
A has all first distances calculated, but B only has distances starting at timepoint 4 (i.e., samples were collected at 0 and 3, but the immediately preceding timepoint is missing). On the other hand, one could drop the columns with missing samples, and all distances would be calculated for each sample in the resulting table.
This relates to the case you describe above: the excluded samples are the ones whose immediately preceding timepoint is missing for that subject. Does that make sense? Note: the intervals are measured consistently across samples, but the intervals are not necessarily even across time. One big enhancement here could be to also expose an interval parameter, to allow the interval to be specified explicitly.
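To make that rule concrete, here is a small sketch (column names and the exact months are assumptions mirroring the example above): a sample only gets a first distance if the same subject also has a sample at the immediately preceding observed timepoint.

```python
import pandas as pd

# Subject A sampled at every month, subject B missing months 1 and 2.
md = pd.DataFrame({
    'individual': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'month':      [0, 1, 2, 3, 4, 0, 3, 4],
})

# The preceding timepoint is defined over all observed states, not per subject.
states = sorted(md['month'].unique())
prev_state = {curr: prev for prev, curr in zip(states, states[1:])}

collected = set(zip(md['individual'], md['month']))
md['has_first_distance'] = [
    (ind, prev_state.get(m)) in collected
    for ind, m in zip(md['individual'], md['month'])
]
print(md)  # B only gets a first distance at month 4, as described above
```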
Thanks so much for the explanation. I didn't realize all this was going on under the hood. I like the idea of the interval parameter.
@nbokulich all tests are passing and this should be ready for review now. I've left the interval selection out for this pass, but it is something we could naturally add in a follow-up PR. Looking forward to hearing your thoughts; I know this sort of code can sometimes be hard to read.
thanks @ElDeveloper ! looks like a vast improvement 😁
I have not done a test run yet, but just finished reviewing the code. I have a few comments and questions throughout. Pls let me know what you think!
    metadata.index[1:], state_column].values
    states = metadata[state_column].unique()
    states.sort()
why sort then flip? why not just sort descending?
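A sketch of what that suggestion could look like (toy column name standing in for state_column; not the PR's code): produce the states in descending order in one step rather than sorting ascending and reversing.

```python
import numpy as np
import pandas as pd

metadata = pd.DataFrame({'month': [3, 0, 3, 1, 0, 2]})

# Pattern in the diff: ascending sort, then flip.
states = metadata['month'].unique()
states.sort()
states = states[::-1]

# Alternative: sort descending directly.
states_desc = metadata['month'].drop_duplicates().sort_values(
    ascending=False).to_numpy()
assert np.array_equal(states, states_desc)
```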
    ]['Distance']
    output.index.name = '#SampleID'
    output = output.iloc[::-1]
can you pls drop in a comment here — why are you flipping the dataframe again?
    output = metadata.groupby(individual_id_column).apply(column_getter)
    # When the output of groupby is a single series i.e. when there's only one
would this be better caught with a conditional, e.g., by checking the number of unique individuals?
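A sketch of that suggestion with hypothetical names (column_getter here is only a stand-in for the PR's per-individual computation): branch on the number of unique individuals up front so the shape of the groupby result is predictable.

```python
import pandas as pd

metadata = pd.DataFrame(
    {'individual': ['A', 'A', 'A'], 'month': [0, 1, 2], 'value': [1.0, 2.0, 4.0]},
    index=['s0', 's1', 's2'])

def column_getter(frame):
    # stand-in for the real per-individual computation
    return frame['value'].diff().dropna()

if metadata['individual'].nunique() == 1:
    # a single group would make groupby(...).apply(...) collapse into one
    # Series, so compute it directly instead of special-casing afterwards
    output = column_getter(metadata)
else:
    output = metadata.groupby('individual').apply(column_getter)
print(output)
```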
    distance_matrix: pd.DataFrame, metadata: pd.DataFrame,
    state_column: str, individual_id_column: str) -> pd.Series:
    # with the sorted table, we add an extra column with indices so we make
I am not really following this idea... adding, renaming, and handling this extra column seems to take quite a few lines. Wouldn't it just be easier to reset_index() instead of indexing on an artificial new column?
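A sketch of the reset_index() idea with hypothetical column names: moving the sample IDs from the index into a regular column avoids maintaining a separate, artificial index column.

```python
import pandas as pd

metadata = pd.DataFrame(
    {'individual': ['A', 'A', 'B'], 'month': [0, 1, 0]},
    index=pd.Index(['s0', 's1', 's2'], name='#SampleID'))

md = metadata.reset_index()  # '#SampleID' is now an ordinary column
# positional and label-based access both stay simple, e.g. the first sample
# collected for each individual:
first_samples = md.sort_values('month').groupby('individual').first()
print(first_samples['#SampleID'])
```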
    return output

    def _vectorized_distance_to_baseline(
I like the idea of merging metadata + distance matrix, then using a groupby by individual_id, but some of the logic involved here seems a bit convoluted. It might be easier and more transparent to drop the slicing by index etc. In the groupby operation you could drop the first row (the baseline) and grab the column name matching that index. The result would be a vector of distances to baseline for that individual.
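A sketch of that approach with toy data (the names and layout are assumptions, not the PR's exact code): inside each individual's group, the baseline row's sample ID picks out a column of the distance matrix and the remaining rows pick out the distances.

```python
import pandas as pd

samples = ['a0', 'a1', 'a2', 'b0', 'b1']
dm = pd.DataFrame(
    [[0., 1., 2., 5., 5.],
     [1., 0., 1., 5., 5.],
     [2., 1., 0., 5., 5.],
     [5., 5., 5., 0., 3.],
     [5., 5., 5., 3., 0.]],
    index=samples, columns=samples)
metadata = pd.DataFrame(
    {'individual': ['A', 'A', 'A', 'B', 'B'], 'month': [0, 1, 2, 0, 1]},
    index=samples)

def distances_to_baseline(group):
    group = group.sort_values('month')
    baseline_id = group.index[0]      # first timepoint = baseline for this subject
    rest = group.index[1:]            # drop the baseline row itself
    return dm.loc[rest, baseline_id]  # one column lookup per individual

output = metadata.groupby('individual', group_keys=False).apply(distances_to_baseline)
print(output)
```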
    # there's four "utility" columns that need to be ignored:
    # individual, state, combo, and index
    def column_getter(frame):
this part is a bit difficult to follow — see my comment above.
    else:
        # this way of finding duplicates is relevant for "random" too
        duplicated = metadata.duplicated(subset=combo_column)
    if replicate_handling == 'error':
change to:

    if duplicated.any() and replicate_handling == 'error':

???
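The suggested guard, sketched with toy values (variable names follow the diff; the message text is illustrative only):

```python
import pandas as pd

metadata = pd.DataFrame({'combo': ['A-0', 'A-0', 'A-1']}, index=['s0', 's1', 's2'])
combo_column = 'combo'
replicate_handling = 'error'

duplicated = metadata.duplicated(subset=combo_column)
if duplicated.any() and replicate_handling == 'error':
    # only fail when replicates are actually present
    raise ValueError('Replicate samples were detected for the same individual '
                     'and state; choose "drop" or "random" to proceed.')
```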
    if replicate_handling == 'drop':
        duplicated = metadata.duplicated(subset=combo_column, keep=False)
    else:
        # this way of finding duplicates is relevant for "random" too
where is randomization done? as far as I can tell, this rather takes the first occurrence. Taking the first is not all that bad either 😁 and could be exposed as a separate option. random could just randomly shuffle before duplicated(keep='first').
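A sketch of how those options could look with toy data (column name is hypothetical): "drop" removes every replicate, "first" keeps the first occurrence, and "random" is a shuffle followed by keeping the first occurrence.

```python
import pandas as pd

metadata = pd.DataFrame(
    {'combo': ['A-0', 'A-0', 'A-1', 'B-0']},  # individual + state combination
    index=['s0', 's1', 's2', 's3'])

# 'drop': discard every replicated row
dropped = metadata[~metadata.duplicated(subset='combo', keep=False)]

# 'first': keep the first occurrence of each combination
first = metadata[~metadata.duplicated(subset='combo', keep='first')]

# 'random': shuffle, then keep the (now random) first occurrence
shuffled = metadata.sample(frac=1, random_state=42)
random_pick = shuffled[~shuffled.duplicated(subset='combo', keep='first')]
```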
    f' state value of "{baseline}": {missing}')
    # use -np.inf to always make the baseline the first in the list
    metadata.replace({baseline: -np.inf}, inplace=True)
this will replace all matching values in the entire df, correct? instead of just the state_column column?
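A sketch of limiting the replacement to the state column (hypothetical column names; -np.inf is the sentinel used in the diff):

```python
import numpy as np
import pandas as pd

metadata = pd.DataFrame({'month': [0, 6, 12], 'other': [12, 5, 0]})
baseline = 0

# DataFrame.replace would also rewrite the 0 in 'other'; restricting the
# replacement to one column leaves the rest of the metadata untouched.
metadata['month'] = metadata['month'].replace({baseline: -np.inf})
print(metadata)
```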
this also relates to my question about _vectorized_distance_to_baseline... instead of doing a pass over the df to replace baseline values now, maybe this could just be done inside that function when grabbing the baseline row?
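A sketch of that alternative (hypothetical names, complementing the groupby sketch earlier in the review): select the baseline row by its state value inside the per-individual function, so no sentinel value ever needs to be written into the metadata.

```python
import pandas as pd

samples = ['a0', 'a1', 'a2']
dm = pd.DataFrame([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]],
                  index=samples, columns=samples)
metadata = pd.DataFrame({'individual': ['A'] * 3, 'month': [0, 6, 12]},
                        index=samples)
baseline = 0

def to_baseline(group):
    baseline_id = group.index[group['month'] == baseline][0]
    others = group.index[group['month'] != baseline]
    return dm.loc[others, baseline_id]

output = metadata.groupby('individual', group_keys=False).apply(to_baseline)
print(output)
```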
@nbokulich thanks for the detailed review, and apologies for the delay getting back to you on these comments; I'll be addressing them in the near future. Thanks so much for considering, and for your patience! ⏳ 🚤
Hey @ElDeveloper, thanks for all of your hard work here! We are currently doing some PR triage and review - are you still in the process of addressing @nbokulich's review comments?
Hi @lizgehret, thanks for the heads up. I have replied to almost all of the comments (although I haven't posted the review yet), but there are a few where I still need to do some additional work. I'll respond ASAP.