Performance problems of multi process implementation of a block of code #28

wabomo · 2021-05-27T08:55:40Z

In the "inside_polygon" function of "cytopy/data/geometry.py", the multi-process implementation performs very poorly. Processing 10,000 rows of data takes nearly 30 seconds, but less than 0.04 seconds with a plain single-process loop. For the time being, I simply handle it like this, adding a row-count check. Maybe you have a better way to deal with it.

if len(df) < 100000:  # row count judgment.
    # Single thread implementation
    xy = df[[x, y]].values
    (min_x, min_y, max_x, max_y) = poly.bounds
    mask = []
    for p in xy:
        bol = min_x <= p[0] <= max_x and min_y <= p[1] <= max_y and point_in_poly(p, poly) is True
        mask.append(bol)
else:
    # Multi process implementation
    if njobs < 0:
        njobs = cpu_count()
    xy = df[[x, y]].values
    f = partial(point_in_poly, poly=poly)
    with Pool(njobs) as pool:
        mask = list(pool.map(f, xy))
return df.iloc[mask]
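
For readers who want to try the single-process path outside CytoPy, here is a minimal, dependency-free sketch of the same idea: a bounding-box pre-check followed by a point-in-polygon test. It is an illustration only, not CytoPy's code. The names `point_in_polygon` and `inside_polygon_mask` are hypothetical, the polygon is assumed to be a plain list of (x, y) vertices rather than the `poly` object used above, and the even-odd ray-casting test stands in for `point_in_poly`. Points exactly on the boundary are not treated specially.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, vertices: Sequence[Point]) -> bool:
    """Even-odd ray casting (hypothetical stand-in for point_in_poly)."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def inside_polygon_mask(points: Sequence[Point], vertices: Sequence[Point]) -> List[bool]:
    """Boolean mask: cheap bounding-box check first, full test only if it passes."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    return [
        min_x <= px <= max_x
        and min_y <= py <= max_y
        and point_in_polygon((px, py), vertices)
        for px, py in points
    ]


# Example: an L-shaped polygon, so the bounding box alone is not sufficient.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
mask = inside_polygon_mask([(1, 1), (3, 3), (1, 3), (5, 1)], l_shape)
```

With a DataFrame, the resulting list could be used the same way as `mask` in the snippet above. Whether a loop like this beats the process pool will depend on the data size and platform, so it is worth timing both on your own data.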

My configuration:
python: 3.8.7
Memory: 32g
CPU: i7-1165g7 (4 cores and 8 threads)

burtonrj · 2021-05-27T12:03:38Z

Hi, thanks for the feedback, it is much appreciated. Sorry about the poor performance. To be honest, I haven't touched this particular functionality in a long time!

I don't think I ever did exhaustive performance checks (which is embarrassing) as it worked okay on the validation data I was running.

I'm currently working on a new release, albeit slowly (I'm the only dev on this project). If you're happy I would like to test your solution and include it in the new release with attribution ofc. Or you can make a PR on the v2.1 branch and make these changes.

Feel free to contact me at [email protected] for further discussion.

Thanks,
Ross

whitews · 2022-07-01T21:38:48Z

Hey Ross!

Not sure if you've already addressed this, but wanted to chime in. I recently moved my implementation for finding points in a polygon from FlowKit to FlowUtils. Basically, all C extension based functions will go in FlowUtils from now on, and I have binary distributions on PyPI for FlowUtils for all the major platforms & recent Python versions. This should go a long way in helping those users without C compilers get the dependencies set up.

The points in ellipsoid (supporting true n-dimensional data) function was moved to FlowUtils as well. Both of these are compatible with the GatingML definitions, i.e. how to handle points on the boundary, etc.

On a side note, have you considered using FlowKit for CytoPy? CytoPy is a much more ambitious project in many ways. FlowKit aims to be an API for the more traditional flow concepts, and will never be or have a GUI in its code base. I've toyed with the idea of an Electron-based front-end, but it would be a separate project and I need to finish the FlowJo workspace support in FlowKit first. Would be happy to discuss via email if you are interested.

Regards,
-Scott

burtonrj · 2022-07-13T21:47:38Z

Hey Scott!

Thanks for reaching out, I think if I remember rightly I have resolved this issue in the unpublished dev branch (v3.0) of CytoPy, but this is good to know. My implementation is still in Python so your C extension will probably be quicker.

I haven't considered FlowKit, mostly because CytoPy is aimed more at high-dimensional cytometry data and biomarker discovery. It's designed around the assumption of the following workflow:

1. clean with autogates
2. batch correct if needed
3. cluster/classify
4. visualise and extract results.

I do really admire what you have done with FlowKit, however, and I think its major advantage is the GatingML compatibility, which is a huge gap in CytoPy currently.

The future for CytoPy, once I've finished writing my PhD thesis and have time to implement it, is a decoupled ecosystem.
v3.0 of CytoPy is going to decouple many of the tools so that it's not such a behemoth. Tbh, CytoPy was a bit too ambitious to start with and this created a lot of problems in the original code base.

So in the future there will be CytoPy v3.0, which will depend on some smaller packages that can be used in isolation for those that don't need automated gating and a complex MongoDB database to track meta-data. There is probably a lot of overlap here in our work - I use your FlowUtils and FlowIO packages extensively! Great job btw. If you ever need help maintaining them, please let me know, I owe you one for creating such great packages!

As for future CytoPy work, I've managed to make a start by pulling out some of the functionality into smaller packages:
CytoTools [https://github.com/burtonrj/CytoTools] - reading files, transformations, dimension reduction, peak alignment and more
CytoPlots [https://github.com/burtonrj/CytoPlots] - classic FACS plots in Python and some more general tools for stuff like scatterplots to be used with UMAP, tSNE etc
CytoCluster [https://github.com/burtonrj/CytoCluster] - as long as you have your results in a Pandas DataFrame you can apply all your popular clustering algorithms plus a novel ensemble clustering algorithm currently under peer-review (pre-print here: https://doi.org/10.1101/2022.06.30.496829)

Perhaps there are some areas where we can collaborate here. I'm of the same mind and have thought about an Electron frontend or perhaps Django and React (a bit like your ReFlow work) but again, it always comes down to time.
