Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable HDBSCAN gpu training and cpu inference #6108

Merged

Conversation

divyegala
Copy link
Member

@divyegala divyegala commented Oct 11, 2024

Until now, we supported all combinations of GPU/CPU interoperability except the one mentioned in the title. This was because the CPU HDBSCAN package was missing attribute setters. With scikit-learn-contrib/hdbscan#657, attribute setters are now available which allow us to transfer GPU trained attributes to the CPU model. This feature is available as part of hdbscan=0.8.39

@divyegala divyegala added feature request New feature or request 5 - DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change labels Oct 11, 2024
@divyegala divyegala self-assigned this Oct 11, 2024
@github-actions github-actions bot added conda conda issue Cython / Python Cython or Python issue labels Oct 11, 2024
@@ -526,7 +526,15 @@ dependencies:
- statsmodels
- umap-learn==0.5.6
- pynndescent
- setuptools # Needed on Python 3.12 for dask-glm, which requires pkg_resources but Python 3.12 doesn't have setuptools by default
Copy link
Member Author

@divyegala divyegala Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dask-glm was removed by PR #6028 so this is now unnecessary

@divyegala divyegala removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Oct 16, 2024
@divyegala divyegala marked this pull request as ready for review October 16, 2024 22:54
@divyegala divyegala requested review from a team as code owners October 16, 2024 22:54
Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great to me, had a question about a comment but that's all

Comment on lines +172 to +174
# These attributes have to be reassigned to the CPU model
# as the raw arrays because the reference HDBSCAN implementation
# reconstructs the objects from the raw arrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens in the setters in the hdbscan library, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially it happens in the getters. Here's the issue, consider CondensedTree object:

  1. We use setter to assign CondensedTree object to self.condensed_tree_ from cuML to hdbscan
  2. The getter for self.condensed_tree_ checks if it has a value already. If it does, it assumes that it is raw numpy arrays and creates another CondensedTree object without any value sanitization

That's why I re-assigned the raw arrays, so when hdbscan library internally calls the getters it reconstructs the object correctly.

@divyegala
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit fb4c8af into rapidsai:branch-24.12 Oct 17, 2024
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
conda conda issue Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants