Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism #67

Open
wants to merge 38 commits into
base: submodulev3
Choose a base branch
from

Conversation

SamuelCarliles3
Copy link

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Includes splitter injection and adds refactor of DepthFirstTreeBuilder.build

Any other comments?

SamuelCarliles3 and others added 30 commits February 16, 2024 13:36
Copy link

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


cython-lint

cython-lint detected issues. Please fix them locally and push the changes. Here you can see the detected issues. Note that the installed cython-lint version is cython-lint=0.16.2.


/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:22:26: 'uintptr_t' imported but unused
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:23:33: 'free' imported but unused
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:76:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:99:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:120:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:148:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:152:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:176:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:180:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:206:53: E703 statement ends with a semicolon
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:292:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:376:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:380:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:385:5: E303 too many blank lines (2)
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:806:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:812:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:835:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:838:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:842:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:273:25: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:274:25: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:275:25: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:294:29: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:295:29: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:388:1: W293 blank line contains whitespace

Generated for commit: f225658. Link to the linter CI: here

Copy link
Member

@PSSF23 PSSF23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring of streaming code looks good to me. But these two errors occurred in checks:

FAILED tree/tests/test_tree.py::test_missing_values_on_equal_nodes_no_missing[squared_error] - AssertionError
FAILED tree/tests/test_tree.py::test_missing_values_on_equal_nodes_no_missing[friedman_mse] - AssertionError

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few questions and requests for explanation, or improved documentation. A few general comments:

  1. Can you copy/paste results of benchmarking into the PR description somewhere, so we can have this documented?
  2. Can we describe in the PR what is being changed and why? If any of this explanation is good to include also as in-line comments, then feel free to do so.

Comment on lines +46 to +56
# A record on the stack for depth-first tree growing
cdef struct StackRecord:
intp_t start
intp_t end
intp_t depth
intp_t parent
bint is_left
float64_t impurity
intp_t n_constant_features
float64_t lower_bound
float64_t upper_bound
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this moved for any particular reason? Just so I'm aware.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is required by BuildEnv, which is defined in this file.

Comment on lines +58 to +65
cdef extern from "<stack>" namespace "std" nogil:
cdef cppclass stack[T]:
ctypedef T value_type
stack() except +
bint empty()
void pop()
void push(T&) except + # Raise c++ exception for bad_alloc -> MemoryError
T& top()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of code was moved from _tree.pyx so that BuildEnv would have the definition of stack. If the definition can be simplified, we should do that.

void push(T&) except + # Raise c++ exception for bad_alloc -> MemoryError
T& top()

cdef struct BuildEnv:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any technical reason we need the struct here besides encapsulation of all the parameters?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as opposed to _tree.pyx? I put BuildEnv here because it will likely need to be visible to event handlers slated for addition, and _tree.pxd seemed like the place to put the "interface" to the module. It can go anywhere that can be made available to external event handlers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant, why do we need the BuildEnv in the first place.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you mean why is the struct required at all? It is just to encapsulate the function state; we could just as easily pass all the variables/references in as function args, but this seemed cleaner and definitely more expedient during development. More importantly, it's a pattern that will be required for the forthcoming tree build event handling, and might need to be added to the splitter injection.

In general we have this algorithm which remains broadly the same shape but with an arbitrary (and growing) number of variations we'd like to add without foreknowledge and without perpetual updates to the algorithm code itself. Those future additions will require differing degrees of visibility into the algorithm state. So IMO it seemed cleanest (and most performant) to simply encapsulate the algorithm state in a struct whose address we can pass around.

Comment on lines +22 to +24
# NICE IDEAS THAT DON'T APPEAR POSSIBLE
# - accessing elements of a memory view of cython extension types in a nogil block/function
# - storing cython extension types in cpp vectors
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# NICE IDEAS THAT DON'T APPEAR POSSIBLE
# - accessing elements of a memory view of cython extension types in a nogil block/function
# - storing cython extension types in cpp vectors
# NICE IDEAS THAT DON'T APPEAR POSSIBLE (Samuel)
# 1. accessing elements of a memory view of cython extension types in a nogil block/function
# 2. storing cython extension types in cpp vectors

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to also comment on what these nice ideas are trying to accomplish. I.e. what's the problem for a new developer coming in and reading this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're simply trying to add a way of injecting functionality whose implementation details are TBD. We just want a way of saying "here's a candidate split, let me check it against any arbitrary validity constraints you may want to impose at some future date as of the time of this writing". So we want to accept a list, a memoryview, array, vector, whatever, of instantiated split constraints. Ideally the interface is a simple python one-liner, so at runtime I can just define an inline python list of constraints. But that list of constraints then needs to be executable performantly in a cython nogil block.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. Just hoping to document all the thoughts in a clean manner, so we don't lose this trains of thoughts when new developers come thru.

Comment on lines +26 to +28
# despite the fact that we can access scalar extension type properties in such a context,
# as for instance node_split_best does with Criterion and Partition,
# and we can access the elements of a memory view of primitive types in such a context
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't follow what you mean here. Is this related to the "nice ideas" listed above?

Comment on lines +81 to +84
intp_t n_missing,
bint missing_go_to_left,
float64_t lower_bound,
float64_t upper_bound,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are part of SplitRecord

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read the existing code correctly, those values are set in current_split only after it is accepted as passing pre-and-post split conditions, and yielding a greater impurity improvement than best_split, so that the n_missing, missing_go_to_left, lower_bound, and upper_bound values in current_split are potentially garbage values at the time these split rejection conditions are tested.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a better alternative to passing current_split into these split rejection conditions would be to simply pass the candidate feature dimension and split point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would say either:

i) pass in only SplitRecord and let the function implementation worry about what is garbage vs not, cuz you shouldn't use garbage values anyways
ii) only pass in parameters that are necessary

It shouldn't pass in split record and parameters explicitly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What it will likely ultimately end up looking like is something similar to the BuildEnv struct pattern added to DepthFirstTreeBuilder.build... I haven't yet thought hard about what the final form of this signature should look like, but it would need to contain all the splitter state information necessary for arbitrary split constraints to decide thumbs up or down. The platonic ideal would make that part of a deliberately curated interface.

Comment on lines +104 to +107
intp_t n_missing,
bint missing_go_to_left,
float64_t lower_bound,
float64_t upper_bound,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are part of SplitRecord

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +128 to +192
# cdef struct HasDataEnv:
# int min_samples

# cdef bint has_data_condition(
# Splitter splitter,
# SplitRecord* current_split,
# intp_t n_missing,
# bint missing_go_to_left,
# float64_t lower_bound,
# float64_t upper_bound,
# SplitConditionEnv split_condition_env
# ) noexcept nogil:
# cdef HasDataEnv* e = <HasDataEnv*>split_condition_env
# return splitter.n_samples >= e.min_samples

# cdef class HasDataCondition(SplitCondition):
# def __cinit__(self, int min_samples):
# self.c.f = has_data_condition
# self.c.e = malloc(sizeof(HasDataEnv))
# (<HasDataEnv*>self.c.e).min_samples = min_samples

# def __dealloc__(self):
# if self.c.e is not NULL:
# free(self.c.e)

# super.__dealloc__(self)

# cdef struct AlphaRegularityEnv:
# float64_t alpha

# cdef bint alpha_regularity_condition(
# Splitter splitter,
# SplitRecord* current_split,
# intp_t n_missing,
# bint missing_go_to_left,
# float64_t lower_bound,
# float64_t upper_bound,
# SplitConditionEnv split_condition_env
# ) noexcept nogil:
# cdef AlphaRegularityEnv* e = <AlphaRegularityEnv*>split_condition_env

# return True

# cdef class AlphaRegularityCondition(SplitCondition):
# def __cinit__(self, float64_t alpha):
# self.c.f = alpha_regularity_condition
# self.c.e = malloc(sizeof(AlphaRegularityEnv))
# (<AlphaRegularityEnv*>self.c.e).alpha = alpha

# def __dealloc__(self):
# if self.c.e is not NULL:
# free(self.c.e)

# super.__dealloc__(self)


# from ._tree cimport Tree
# cdef class FooTree(Tree):
# cdef Splitter splitter

# def __init__(self):
# self.splitter = Splitter(
# presplit_conditions = [HasDataCondition(10)],
# postsplit_conditions = [AlphaRegularityCondition(0.1)],
# )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the lines above outdated fluff we can remove?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I mainly left them in as a demonstration of the need for the SplitConditionEnv; none of the currently existing SplitConditions require an env because their env is built into the legacy pattern of the omniscient Splitter. So for example if we wanted to do alpha regularity, alpha would be a hyperparameter that would ideally go into a closure. This is a cython implementation of a closure pattern, specifically one that avoids extension types due to the field lookup overhead.

@@ -59,6 +107,8 @@ cdef class BaseSplitter:

cdef const float64_t[:] sample_weight

cdef SplitRecordFactoryClosure split_record_factory
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it named Closure?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is a cython implementation of a closure. C doesn't support closures as a language level feature, but a struct of a function pointer bound with a struct of variable values functions the same.

@@ -485,6 +679,8 @@ cdef inline intp_t node_split_best(
# n_total_constants = n_known_constants + n_found_constants
cdef intp_t n_total_constants = n_known_constants

cdef bint conditions_hold = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cdef bint conditions_hold = True
cdef bint split_is_valid = True

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a more explicit name to me

@adam2392 adam2392 changed the title Scarliles/update node refactor3 ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism Jun 12, 2024
@adam2392
Copy link
Collaborator

Another dumb question: why is depthfirsttreebuilder need to change, but not bestfirsttreebuilder? @SamuelCarliles3

@SamuelCarliles3
Copy link
Author

Another dumb question: why is depthfirsttreebuilder need to change, but not bestfirsttreebuilder? @SamuelCarliles3

BFTB will most certainly need to change as well, I'm just starting with DFTB, and have not yet gotten to BFTB. IIRC the update functionality had not been added to BFTB(?), and so it did not require an analogous refactor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants