Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filterPrecursorMzValue to support filtering by multiple m/z values #232

Merged
merged 8 commits into from
Jan 19, 2022

Conversation

jorainer
Copy link
Member

This PR adds the filterPrecursorMzValue function discussed in issue #230

- Add the `filterPrecursorMzValues` method to enable filtering based on multiple
  target precursor m/z values.
@jorainer jorainer requested review from sgibb and lgatto January 13, 2022 11:49
Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine! I found a typo and like to suggest a minor change using common instead of closest

R/MsBackend.R Outdated Show resolved Hide resolved
Comment on lines 88 to 90
mtch <- closest(x[keep][idx], sort(mz), tolerance = tolerance, ppm = ppm,
duplicates = "keep", .check = FALSE)
keep[!is.na(mtch[order(idx)])]
Copy link
Member

@sgibb sgibb Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the order irrelevant here?

Suggested change
mtch <- closest(x[keep][idx], sort(mz), tolerance = tolerance, ppm = ppm,
duplicates = "keep", .check = FALSE)
keep[!is.na(mtch[order(idx)])]
cmn <- common(x[keep][idx], sort(mz), tolerance = tolerance, ppm = ppm,
duplicates = "keep", .check = FALSE)
keep[cmn]

EDIT: or keep[idx][cmn]? It's already late and I am a bit confused (is hard to tell without trying it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order and idx was important (otherwise I wouldn't do such costly operations) - but I have to check it again. Thanks for the common suggestion - I'll have a try with that and see how it works/compares.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I get the same results with:

.values_match_mz2 <- function(x, mz, ppm = 20, tolerance = 0) {
    keep <- which(!is.na(x))
    idx <- order(x[keep])
    cmn <- common(x[keep][idx], sort(mz), tolerance = tolerance, ppm = ppm,
                  duplicates = "keep", .check = FALSE)
    sort(keep[idx][cmn])
}

(i.e. returning sort(keep[idx][cmn]))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in fact your implementation is still (slightly) faster:

> pmz <- c(12.4, 15, 3, 12.4, 3, 1234, 23, 5, 12.4, NA, 3)
> mz <- c(200, 12.4, 3)
> library(microbenchmark)
> microbenchmark(.values_match_mz(pmz, mz), .values_match_mz2(pmz, mz))
Unit: microseconds
                       expr     min       lq     mean   median       uq     max
  .values_match_mz(pmz, mz) 235.827 272.2785 292.6092 284.0745 317.7635 416.735
 .values_match_mz2(pmz, mz) 178.463 205.5580 221.5553 212.7210 224.7145 446.566
 neval cld
   100   b
   100  a 
> pmz <- rep(pmz, 100)
> mz <- rep(mz, 100)
> microbenchmark(.values_match_mz(pmz, mz), .values_match_mz2(pmz, mz))
Unit: microseconds
                       expr     min       lq     mean   median       uq     max
  .values_match_mz(pmz, mz) 318.139 349.4490 394.0592 376.8095 417.4760 704.230
 .values_match_mz2(pmz, mz) 267.281 298.0505 330.7795 320.6225 349.9465 572.168
 neval cld
   100   b
   100  a 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now changed it to your implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I don't have an idea how to remove the last sort step but you could avoid the which(!is.na(x)) call and get minimal more speed:

library("microbenchmark")
library("MsCoreUtils")
#> 
#> Attaching package: 'MsCoreUtils'
#> The following object is masked from 'package:stats':
#> 
#>     smooth

pmz <- c(12.4, 15, 3, 12.4, 3, 1234, 23, 5, 12.4, NA, 3)
mz <- c(200, 12.4, 3)

.values_match_mz2 <- function(x, mz, ppm = 20, tolerance = 0) {
    keep <- which(!is.na(x))
    idx <- order(x[keep])
    cmn <- common(x[keep][idx], sort(mz), tolerance = tolerance, ppm = ppm,
                  duplicates = "keep", .check = FALSE)
    sort(keep[idx][cmn])
}

.values_match_mz3 <- function(x, mz, ppm = 20, tolerance = 0) {
    o <- order(x, na.last = NA) # na.last = NA will remove NA
    cmn <- common(x[o], sort(mz), tolerance = tolerance, ppm = ppm,
                  duplicates = "keep", .check = FALSE)
    sort(o[cmn])
}
microbenchmark(.values_match_mz2(pmz, mz), .values_match_mz3(pmz, mz), check = "identical")
#> Unit: microseconds
#>                        expr     min      lq     mean   median       uq
#>  .values_match_mz2(pmz, mz) 114.598 117.081 185.6789 118.6825 122.5355
#>  .values_match_mz3(pmz, mz) 111.121 113.789 324.7867 114.7715 120.4950
#>        max neval
#>   6300.412   100
#>  20524.152   100

Created on 2022-01-16 by the reprex package (v2.0.1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I learned again something new :)

@jorainer jorainer requested a review from sgibb January 14, 2022 10:57
@lgatto
Copy link
Member

lgatto commented Jan 14, 2022

Code-wise, looks good to me, but I am wondering whether we aren't making things too complicated with that many functions. Wouldn't it be easier to have a single vectorised filterPrecursorMz() function? Same applies to other cases.

@jorainer
Copy link
Member Author

jorainer commented Jan 18, 2022

Yes, we're adding yet another filter function, but IMHO it is justified here, since the original filterPrecursorMz and the newly added filterPrecursorMzValues work differently, the former based on m/z ranges, the latter on individual m/z values. This is similar to the discussion in #133. I believe it is easier on the user to have dedicated functions instead of having to set multiple additional parameters. Also, changing filterPrecursorMz to a vectorized version might break backward compatibility (also based on what users are used from the MSnbase package).

But obviously that's open for discussion @lgatto @sgibb

@lgatto
Copy link
Member

lgatto commented Jan 19, 2022

Ok, indeed, thanks. I totally agree that different appropriately named functions are better that many arguments.

What about renaming filterPrecursorMz() to filterPrecursorMzRange() to clarify that it works on ranges, rather than values, and be consistent with #133 ?

And re vectorisation, it's easy enough to pipe multiple filterPrecursorMz[Range]() into each other to filter multiple ranges.

@jorainer
Copy link
Member Author

I've now added a filterPrecursorMzRange method and deprecated the filterPrecursorMz method - I wouldn't like to remove it yet, because we have many workflows depending on that method.

@jorainer jorainer merged commit 627df30 into master Jan 19, 2022
@jorainer jorainer deleted the filterPrecursorMzValue branch January 19, 2022 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants