Added FUSION metrics #70

Merged
merged 43 commits into main on Sep 16, 2024

Conversation

bmcgaughey1
Collaborator

Added FUSION metrics except the cover-related metrics that require a height threshold. I think I have everything in place but haven't looked at the values that come out. I suspect there are much faster ways to compute some things. For example, I compute 15 percentile height metrics using a function call for each. It is easy with numpy.percentile() to compute them all with one call, but things aren't set up to pull a list of metrics into TileDB. The same goes for the l-moments and l-moment ratios: having each in a separate function means the l-moments are computed 7 times, when they could all be done with a single call.
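
For illustration, here is what a single-call version might look like. This is only a numpy sketch (the 15 percentile levels are assumed from FUSION's usual list, and the variable names are made up), not the per-metric functions in this PR:

import numpy as np

# example height vector; in practice this would be the Z values for one cell
heights = np.random.default_rng(0).gamma(2.0, 5.0, 1000).astype(np.float32)

# current approach: one function call (and one np.percentile call) per metric
p95 = np.percentile(heights, 95)

# single-call alternative: all percentile metrics come back as one array,
# which would require the TileDB side to accept a list of values per metric
levels = [1, 5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 95, 99]
all_pcts = np.percentile(heights, levels)   # shape (15,)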

@hobu
Collaborator

hobu commented Feb 14, 2024

It should be a default that the type for all metrics is np.float32 so we don't have to copy-paste that all over the place.
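
One way to get that default (a hypothetical sketch only; the actual Metric constructor may look different) is to let the dtype argument default to np.float32 so individual metric definitions stop repeating it:

import numpy as np

class Metric:
    # hypothetical constructor: dtype defaults to np.float32
    def __init__(self, name, method, dtype=np.float32):
        self.name = name
        self._method = method
        self.dtype = np.dtype(dtype)

mean = Metric("mean", np.mean)                 # picks up float32 automatically
count = Metric("count", len, dtype=np.int32)   # override only where needed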

I also wonder if each metric shouldn't provide a test() harness that can be run by pytest to protect from dry rot.

@bmcgaughey1
Collaborator Author

Is there an example of what the test() should look like? I poked around in the tests a bit but didn't see anything.

Most of the metrics are computed in simple ways, but I suppose the underlying numpy methods could change and cause problems or differences.

@hobu
Collaborator

hobu commented Feb 15, 2024

Is there an example of what the test() should look like?

I was thinking of adding something like this to Metric:

    def test(self, vector):
        try:
            results = self._method(vector)
        except Exception as e:
            # if we div0 or throw or something we failed
            raise e

        # running the method over a vector of input 
        # should only return a single value that is a number (for now)
        assert isinstance(results, (int, float, complex)) and not isinstance(results, bool)
        return True

Then the test harness would pass in stuff that could be weird (like a vector of 0s or a string or other bad input). If any of the test calls raise an exception, we report to the user, etc.
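
A minimal pytest sketch of that harness, using a stand-in Metric class with the test() method above (the real class and the way metrics are registered will differ):

import numpy as np
import pytest

class Metric:
    # stand-in for illustration; mirrors the test() idea above
    def __init__(self, name, method):
        self.name = name
        self._method = method

    def test(self, vector):
        results = self._method(vector)
        # a metric should reduce its input to a single numeric value
        assert isinstance(results, (int, float, complex)) and not isinstance(results, bool)
        return True

mean_metric = Metric("mean", lambda v: float(np.mean(v)))

# deliberately "weird" inputs: all zeros, a single return, negative heights
@pytest.mark.parametrize("vector", [
    np.zeros(100, dtype=np.float32),
    np.array([1.0], dtype=np.float32),
    np.array([-1.0, -2.0, -3.0], dtype=np.float32),
])
def test_metric_survives_odd_input(vector):
    # any exception raised inside test() shows up as a pytest failure
    assert mean_metric.test(vector)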

@kylemann16
Collaborator

kylemann16 commented Feb 23, 2024

@bmcgaughey1 I made a PR with some changes to fix some small stuff as well as adding a simple framework for testing metrics. If you want to look that over, I can merge that into this PR whenever.

One thing that I'm finding with this PR is that some of the methods end up dividing by zero with data that hasn't been sanitized. The ones I can think of off the top of my head are m_crr and lmom4.

@bmcgaughey1
Collaborator Author

The divide by zero makes sense. Do you think it is best to test for divide by zero or just trap the exception and return a value indicating the metric couldn't be computed?

@bmcgaughey1
Collaborator Author

I accepted the change to the call for np.percentile to use 1 to 100 instead of 1 to 99. The 100th percentile is the max and it isn't used in the profilearea metric because of potential outliers. No problem leaving it at 100, but it adds time (arguably a small amount of time, but time nonetheless).

@kylemann16
Collaborator

kylemann16 commented Feb 23, 2024

I accepted the change to the call for np.percentile to use 1 to 100 instead of 1 to 99. The 100th percentile is the max and it isn't used in the profilearea metric because of potential outliers. No problem leaving it at 100, but it adds time (arguably a small amount of time, but time nonetheless).

The range function in Python returns results that are inclusive on the bottom and exclusive on the top, so p[98] was throwing an out-of-bounds error here:

>>> print(list(range(9)))
[0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> print(list(range(1,10)))
[1, 2, 3, 4, 5, 6, 7, 8, 9]

@hobu
Collaborator

hobu commented Feb 23, 2024

The divide by zero makes sense. Do you think it is best to test for divide by zero or just trap the exception and return a value indicating the metric couldn't be computed?

I guess we should leave it up to the Metric to decide what it wants to do, but it should be a requirement that if a Metric throws for ANY input on computation, its default value is written into TileDB. Then a Metric writer has the option of doing nan or min/max or -9999 or whatever it wants.
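
A sketch of that contract (illustrative names, not the actual implementation): the compute wrapper traps any exception and returns the metric's declared default, which is what gets written to TileDB:

import numpy as np

class Metric:
    # hypothetical: each metric declares the value stored on failure
    def __init__(self, name, method, default=np.nan):
        self.name = name
        self._method = method
        self.default = default

    def compute(self, data):
        try:
            return self._method(data)
        except Exception:
            # divide by zero, empty input, whatever: fall back to the default
            return self.default

# canopy relief ratio: (mean - min) / (max - min); fails when max == min
crr = Metric("m_crr",
             lambda z: (float(np.mean(z)) - float(np.min(z)))
                       / (float(np.max(z)) - float(np.min(z))),
             default=-9999.0)

print(crr.compute(np.zeros(10)))   # ZeroDivisionError trapped -> -9999.0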

@bmcgaughey1
Collaborator Author

Still waiting on a decision about how we want to handle the height thresholds. There is a "base" threshold for all of the simple metrics, a cover threshold, and a "no threshold" case for computing metrics by height strata; the last actually needs a list of thresholds that define the height strata. It seems to me that we want these as part of init. Calls to simple metric functions could filter the data prior to the call, or pass the threshold value to the function and let the filtering happen inside the function. For the cover metrics (ratios of return counts above the threshold to total return count), the function will need all of the points plus the threshold. For strata, we might need to move ahead with the ability to have the metric function return a list of values. However, we could call the strata-metric functions with a lower and upper threshold, one call for each stratum. That is just one more case where things could be done more efficiently (compute multiple metrics with a single function call).
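
To make the three cases concrete, here is a rough numpy sketch (the threshold values and function names are made up for illustration; the real plumbing would carry these in from init):

import numpy as np

htmin = 2.0                            # "base" threshold for the simple metrics
cover_htmin = 2.0                      # cover threshold
strata = [0.0, 2.0, 8.0, 16.0, 32.0]   # break heights defining the strata

def mean_height(z, htmin):
    # simple metric: threshold passed in, filtering done inside the function
    z = z[z > htmin]
    return float(np.mean(z)) if z.size else np.nan

def cover(z, htmin):
    # cover metric needs all returns: count above threshold / total count
    return float(np.count_nonzero(z > htmin)) / z.size if z.size else np.nan

def strata_mean(z, lower, upper):
    # strata variant: one call per stratum with its lower/upper thresholds
    zz = z[(z >= lower) & (z < upper)]
    return float(np.mean(zz)) if zz.size else np.nan

z = np.random.default_rng(1).gamma(2.0, 5.0, 500)
per_stratum = [strata_mean(z, lo, hi) for lo, hi in zip(strata[:-1], strata[1:])]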

@kylemann16
Collaborator

So what we definitely need in order to function correctly are optional arguments for low and high thresholds for each metric.

I agree that it would be simpler and more efficient for the Shatter process if we allowed metrics to output multiple results, but I think the complexity is transferred to the Extract step, because we would have to include a lot more information in what's being passed to it in order to know how to extract that information.

The biggest problem I see with this is how to generalize this information across Metrics: what is the relevance of each entry in a list passed to Extract? For percentiles, each entry may correspond to its percentage, but how can we generalize that across multiple Metrics?

I don't immediately see an elegant way to solve this problem. I think we should probably do it eventually, but maybe we should just do the thresholds for now and wait to do multiple value outputs until we have a fleshed out approach to it?

What are your thoughts @bmcgaughey1 ?

@bmcgaughey1
Collaborator Author

The single-output metric functions make sense for now. Adding thresholds passed to each metric function also makes sense. These would be set globally for all metrics: one set (high and low) for the base metrics, another for the cover metrics, and then either a list for the strata metrics, or the function gets called for each stratum with that stratum's lower and upper thresholds. I would see these thresholds stored in TileDB along with the list of metrics.

There aren't many FUSION metrics that would have multiple return values; the percentiles and l-moment metrics are the ones that stand out (just because there are easy ways to compute them all at once). I agree that having a flexible way to handle the multiple-return cases is not trivial (or at least not trivial to make robust).
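
For reference, the first four l-moments and the l-moment ratios can all come from one pass over the sorted sample via probability-weighted moments. This is a generic sketch (it assumes at least four values), not the functions in this PR:

import numpy as np

def lmoments(z):
    # probability-weighted moments b0..b3 from the ascending-sorted sample,
    # then the first four l-moments and the ratios, all from one call
    z = np.sort(np.asarray(z, dtype=np.float64))
    n = z.size
    j = np.arange(1, n + 1)
    b0 = z.mean()
    b1 = np.sum((j - 1) * z) / (n * (n - 1))
    b2 = np.sum((j - 1) * (j - 2) * z) / (n * (n - 1) * (n - 2))
    b3 = np.sum((j - 1) * (j - 2) * (j - 3) * z) / (n * (n - 1) * (n - 2) * (n - 3))
    l1 = b0
    l2 = 2 * b1 - b0
    l3 = 6 * b2 - 6 * b1 + b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    # note: l1 or l2 equal to zero (e.g. constant input) is the ratio
    # divide-by-zero case mentioned earlier in this thread
    return l1, l2, l3, l4, l2 / l1, l3 / l2, l4 / l2  # L1-L4, L-CV, L-skew, L-kurt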

@kylemann16
Collaborator

I'm going to go ahead and merge this. It's in working order, and I would like to apply issues to this as a base and work in smaller increments as it's begun to outgrow its scope.

kylemann16 merged commit 1454d5f into main on Sep 16, 2024
9 checks passed