PERF: Speed up distance computations #169

Draft: wants to merge 2 commits into dev

Conversation

@ElDeveloper (Member)

Use Pandas vectorized operations to speed up the validation and
computation of first distances and distances to baseline.

In a dataset with ~6K samples runtime went from >1 minute to under a
second.


All tests pass except for test_first_distances_ecam, which fails because of replicate_handling='drop'. As far as I can tell, when there's a repeated state, both the repeated state and the next state are dropped. That seems odd, but I wanted to check whether this is expected or whether I am misunderstanding something. It seems to me that only the repeated state should be dropped.

For example, when I go to ecam_map_maturity.txt, I noticed that sample 10249.C001.07SS is not in ecam-first-distances even though the state for that sample isn't repeated. The same is true for 10249.C001.17SS, which should be included since state 10 is repeated but state 11 isn't. Any help here would be great.

@nbokulich (Member)

thanks @ElDeveloper! this sounds great.

Re: the test failure with drop, this is the intended behavior:

'replicate_handling': (
'Choose how replicate samples are handled. If replicates are '
'detected, "error" causes method to fail; "drop" will discard all '
'replicated samples; "random" chooses one representative at random '
'from among replicates.')

As the first-distances are calculated across each interval, missing timepoints (including if a timepoint is dropped) effectively cause two first distances to drop (at time t and time t+1). So the examples you gave are correct; these are cases where the first distance at time t+1 cannot be calculated because the preceding timepoint t is missing (because all samples were dropped). So the first distances cannot be calculated at either interval t-1 <--> t or t <--> t+1.

For these reasons, the "drop" option is rather specific and might only be valuable in some situations. "random" is the more suitable option in most real situations, as it will lead to a random sample being selected at those timepoints so that the first distances at both intervals can be calculated. But it sounds like the test is working as originally intended.
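A minimal pandas sketch of this behavior, using made-up sample IDs and monthly states (not the actual ECAM data):

```python
import pandas as pd

# Hypothetical per-subject metadata: one subject sampled at months 0-3,
# with a replicate at month 2.
md = pd.DataFrame({
    'sample-id': ['s0', 's1', 's2a', 's2b', 's3'],
    'month': [0, 1, 2, 2, 3],
}).set_index('sample-id')

# replicate_handling='drop': keep=False discards *all* copies of a
# replicated state, so month 2 vanishes entirely.
kept = md[~md.duplicated(subset='month', keep=False)]
assert list(kept['month']) == [0, 1, 3]

# A first distance needs the immediately preceding timepoint, so both
# the 1->2 and 2->3 intervals are lost, not just the replicated samples.
months = list(kept['month'])
intervals = [(a, b) for a, b in zip(months, months[1:]) if b - a == 1]
# only the 0->1 interval survives
```

(The `b - a == 1` check assumes evenly spaced monthly sampling in this toy example; the plugin derives intervals from the states actually present.)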

@ElDeveloper (Member, Author)

Fair enough, thanks for the explanation! I'll fix the code in that case. Would you also be OK if I changed the documentation in plugin_setup.py? Seems like "will discard all replicated samples" is ambiguous about the fact that both t and t+1 are dropped.

@nbokulich (Member)

> Would you also be OK if I changed the documentation in plugin_setup.py? Seems like "will discard all replicated samples" is ambiguous about the fact that both t and t+1 are dropped.

documentation updates are very welcome 😄

but not that line: a bunch of actions in this plugin share the same parameter descriptions. So this action should either get a unique parameter description, or better describe this behavior in the action description (which IMO would be best, since this is the behavior for any missing samples, not specifically replicate_handling='drop')

@ElDeveloper (Member, Author)

Cheers, thanks so much.

Would you be able to help me diagnose one more case? I've updated my code, but now I am seeing that other samples are being excluded from the expected output. For example: 10249.C002.09SS, 10249.C002.15SS, 10249.C002.18SS, 10249.C005.07SS, 10249.C008.04SS, 10249.C008.12SS, 10249.C009.16SS, 10249.C011.04SS, 10249.C014.12SS, 10249.C018.03SS, 10249.C018.17SS, 10249.C021.04SS, 10249.C021.05SS, 10249.C021.07SS, 10249.C021.08SS, etc. When I look at the mapping file for a few of these, I don't see them associated with any of the repeated sample pairs that are dropped. Any hints as to why these are dropped from the expected output?

I've highlighted the first few in the context of the mapping file here:

10249.C002.09SS, 10249.C002.15SS, 10249.C002.18SS

[Screenshot: mapping file rows around these samples]

10249.C005.07SS

[Screenshot: mapping file rows around this sample]

10249.C008.04SS, 10249.C008.12SS

[Screenshot: mapping file rows around these samples]

@nbokulich (Member) commented Feb 9, 2022

Hey @ElDeveloper ,
Those first distances are not being calculated because they are missing the immediately preceding timepoint (i.e., it's not being dropped; rather, that timepoint is just missing to begin with).

The first distances intervals are meant to be measured across the same time intervals in each subject. So if you have a table like this (rows are samples, cols are months, cells indicate if a sample was collected at that timepoint):

subject-id  0  1  2  3  4  5  10
A           +  +  +  +  +  +  +
B           +        +  +  +  +

A has all first distances calculated, but B only has distances starting at timepoint 4 (i.e., samples were collected at 0 and 3, but the immediately preceding timepoint is missing). On the other hand, one could drop those columns and all distances would be calculated for each sample in a table that looks like this:

subject-id  0  3  4  5  10
A           +  +  +  +  +
B           +  +  +  +  +

This is the case above, where 10249.C002.09SS is a sample at month 8, but month 7 is missing. If month 7 were missing in all subjects, the distance interval would be from months 6->8. But that's not the case here, as the interval is based on all samples in the table.
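A small pandas sketch of this rule, with made-up data shaped like the tables above (the helper name and frame are illustrative, not the plugin's actual code):

```python
import pandas as pd

# Hypothetical data: subject B is missing months 1 and 2, so its first
# distances only start once the preceding global timepoint is present.
md = pd.DataFrame({
    'subject': ['A'] * 7 + ['B'] * 5,
    'month':   [0, 1, 2, 3, 4, 5, 10] + [0, 3, 4, 5, 10],
})

# The interval structure comes from *all* samples in the table.
states = sorted(md['month'].unique())
prev = dict(zip(states[1:], states[:-1]))  # month -> preceding month

def has_first_distance(group, month):
    # a first distance at `month` needs the preceding global timepoint
    return prev.get(month) in set(group['month'])

b = md[md['subject'] == 'B']
assert not has_first_distance(b, 3)  # month 2 is missing for B
assert has_first_distance(b, 4)      # month 3 is present for B
```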

Does that make sense?

Note: the intervals are measured consistently across samples, but they are not necessarily even across time. One big enhancement here could be to also expose an interval parameter, allowing "auto" (status quo, variable spacing possible) or explicit intervals (e.g., to ensure that all distances represent 1-month intervals, to set an explicit lag window, or even to skip shorter intervals). I see this as an ENH not needed now, but maybe based on how your code is set up you already have this implemented and it would be easier to expose now instead of totally refactoring later.

@ElDeveloper (Member, Author)

Thanks so much for the explanation. I didn't realize all this was going on under the hood. I like the idea of auto vs shared? or something like that. Let me rework this current version and I'll see where I get. 👍🏽

@ElDeveloper (Member, Author)

@nbokulich all tests are passing and this should be ready for review now. I've left interval selection out for this pass, but it's something we could naturally add in a follow-up PR. Looking forward to hearing your thoughts; I know this sort of code can sometimes be hard to read.

@nbokulich (Member) left a comment


thanks @ElDeveloper ! looks like a vast improvement 😁

I have not done a test run yet, but just finished reviewing the code. I have a few comments and questions throughout. Pls let me know what you think!

metadata.index[1:], state_column].values

states = metadata[state_column].unique()
states.sort()

why sort then flip? why not just sort descending?

]['Distance']

output.index.name = '#SampleID'
output = output.iloc[::-1]

can you pls drop in a comment here — why are you flipping the dataframe again?


output = metadata.groupby(individual_id_column).apply(column_getter)

# When the output of groupby is a single series i.e. when there's only one

would this be better caught with a conditional, e.g., by checking the number of unique individuals?

distance_matrix: pd.DataFrame, metadata: pd.DataFrame,
state_column: str, individual_id_column: str) -> pd.Series:

# with the sorted table, we add an extra column with indices so we make

I am not really following this idea... adding, renaming, and handling this extra column seems to take quite a few lines. Wouldn't it just be easier to reset_index? instead of indexing on an artificial new column?
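The reset_index suggestion in miniature, with a hypothetical frame (not the plugin's actual variables): resetting the index exposes positional row numbers directly, with no artificial column to add, rename, and clean up later.

```python
import pandas as pd

# Hypothetical sorted metadata: sample IDs in the index, states in a column.
md = pd.DataFrame({'month': [3, 1, 2]}, index=['s3', 's1', 's2'])

# reset_index() moves the old index into a column and leaves a clean
# 0..n-1 positional index, which can then be used for positional slicing.
flat = md.sort_values('month').reset_index()
assert list(flat['index']) == ['s1', 's2', 's3']
assert list(flat.index) == [0, 1, 2]
```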

return output


def _vectorized_distance_to_baseline(

I like the idea of merging metadata + distance matrix, then using a groupby by individual_id

but some of the logic involved here seems a bit convoluted. It might be easier + more transparent to drop the slicing by index etc. In the groupby operation you could drop the first row (the baseline) and grab the column name matching that index. The result would be a vector of distances to baseline for that individual.
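A rough sketch of that suggestion with a made-up merged frame (sample IDs, column names, and the random distance matrix below are all illustrative, not the plugin's actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric distance matrix over five samples.
ids = ['a0', 'a1', 'a2', 'b0', 'b1']
rng = np.random.default_rng(0)
sq = rng.random((5, 5))
sq = (sq + sq.T) / 2
np.fill_diagonal(sq, 0)
dm = pd.DataFrame(sq, index=ids, columns=ids)

# Metadata merged with the distance matrix columns.
meta = pd.DataFrame({'subject': ['A', 'A', 'A', 'B', 'B'],
                     'month': [0, 1, 2, 0, 1]}, index=ids)
merged = meta.join(dm)

def to_baseline(group):
    # sort by state, take the first row as the baseline, then grab the
    # distance-matrix column named after that baseline sample and drop
    # the baseline row itself
    group = group.sort_values('month')
    baseline = group.index[0]
    return group[baseline].iloc[1:]

# one vector of distances-to-baseline per individual
result = pd.concat(to_baseline(g) for _, g in merged.groupby('subject'))
```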


# there's four "utility" columns that need to be ignored:
# individual, state, combo, and index
def column_getter(frame):

this part is a bit difficult to follow — see my comment above.

else:
# this way of finding duplicates is relevant for "random" too
duplicated = metadata.duplicated(subset=combo_column)
if replicate_handling == 'error':

change to:

if duplicated.any() and replicate_handling == 'error':

???

if replicate_handling == 'drop':
duplicated = metadata.duplicated(subset=combo_column, keep=False)
else:
# this way of finding duplicates is relevant for "random" too

where is randomization done? as far as I can tell, this rather takes the first occurrence

taking first is not all that bad either 😁 and could be exposed as a separate option. random could just randomly shuffle before duplicated(keep='first')
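That shuffle-then-dedupe idea sketched with made-up data (column names are illustrative):

```python
import pandas as pd

# Hypothetical metadata with replicated (subject, month) states.
md = pd.DataFrame({
    'subject': ['A', 'A', 'A', 'B'],
    'month':   [1, 1, 2, 1],
    'sample':  ['a1x', 'a1y', 'a2', 'b1'],
})

# "random" replicate handling as suggested: shuffle the rows first, then
# the usual duplicated(keep='first') picks a random representative.
shuffled = md.sample(frac=1, random_state=42)
picked = shuffled[~shuffled.duplicated(subset=['subject', 'month'],
                                       keep='first')]

# exactly one sample survives per (subject, month) state
assert not picked.duplicated(subset=['subject', 'month']).any()
```

Without the shuffle, the same expression implements "first", so both options could share one code path.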

f' state value of "{baseline}": {missing}')

# use -np.inf to always make the baseline the first in the list
metadata.replace({baseline: -np.inf}, inplace=True)

this will replace all matching values in the entire df, correct? instead of just the state_column column?
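The concern in miniature (hypothetical columns; the plugin's real frame has different names): a dict passed to `DataFrame.replace` matches values in every column, while restricting the call to the state column avoids collateral changes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the baseline value (0) also appears in a
# column other than the state column.
md = pd.DataFrame({'month': [0, 1, 2], 'score': [5, 0, 7]})

# DataFrame.replace touches *every* column...
whole = md.replace({0: -np.inf})
assert whole.loc[1, 'score'] == -np.inf  # unintended change

# ...so restricting the replacement to the state column is safer:
md['month'] = md['month'].replace({0: -np.inf})
assert md.loc[1, 'score'] == 0           # other columns untouched
assert md.loc[0, 'month'] == -np.inf
```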


this also relates to my question about _vectorized_distance_to_baseline... instead of doing a pass over the df to replace baseline values now, maybe this could just be done inside that function when grabbing the baseline row?

@ElDeveloper (Member, Author)

@nbokulich thanks for the detailed review, and apologies for the delay getting back to you on these comments. I'll be addressing them in the near future. Thanks so much for your patience! ⏳ 🚤

@lizgehret (Member)

Hey @ElDeveloper, thanks for all of your hard work here! We are doing some PR triage and review right now - are you still in the process of addressing @nbokulich's review comments?

@ElDeveloper (Member, Author)

Hi @lizgehret, thanks for the heads up. I have replied to almost all of the comments (although I haven't posted the review yet), but there are a few where I still need to do some additional work. I'll respond ASAP.

@gregcaporaso gregcaporaso marked this pull request as draft December 11, 2023 22:10