
Use spectral decomposition as more accurate fallback from Cholesky #2930

Conversation

david-cortes-intel (Contributor)

Description

This PR changes the logic for solving non-positive-definite linear systems to use spectral decomposition as a fallback instead of PLU.

Motivation

Linear systems of the form $\mathbf{Ax} = \mathbf{b}$ are currently solved through Cholesky factorization of $\mathbf{A}$.

When the matrix $\mathbf{A}$ is not positive-definite, it falls back to performing PLU factorization. This factorization might still be able to produce a solution to the linear system if the matrix is close to positive-definite, in the sense that the solution will satisfy $\lVert \mathbf{Ax} - \mathbf{b} \rVert \approx 0$.

However, a rank-deficient system (e.g. when $\mathbf{A}$ is the cross-product $\mathbf{X}^T \mathbf{X}$ of a matrix $\mathbf{X}$ that has more columns than rows) might have an infinite number of solutions, of which either the minimum-rank or the minimum-norm one is typically preferred, as in LAPACK's own least-squares routines (which prefer minimum norm). In such cases, Cholesky and PLU will not produce the most desirable solution, assuming that they are able to produce a solution at all, which isn't always the case.

Solution

This PR changes the logic to instead fall back to inversion based on spectral decomposition, discarding singular values that are too small. It follows the same criterion for discarding small singular values as LAPACK's gelsd with its default tolerance.
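
For illustration, the fallback amounts to roughly the following sketch. This is not oneDAL's actual code: Eigen is used here as a stand-in for the LAPACK syev call, and the tolerance constant is an assumption in the spirit of gelsd's default.

```cpp
// Sketch: solve A*x = b for symmetric A via eigendecomposition, discarding
// components whose eigenvalue is negligible relative to the largest one.
#include <Eigen/Dense>
#include <cmath>
#include <limits>

Eigen::VectorXd solve_spectral_fallback(const Eigen::MatrixXd &A, const Eigen::VectorXd &b)
{
    // A = Q * diag(w) * Q^T; no positive-definiteness is required here.
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> eig(A);
    const Eigen::VectorXd &w = eig.eigenvalues();
    const Eigen::MatrixXd &Q = eig.eigenvectors();

    // Illustrative gelsd-style cutoff relative to the largest eigenvalue.
    const double cutoff = std::numeric_limits<double>::epsilon()
                          * static_cast<double>(A.rows()) * w.cwiseAbs().maxCoeff();

    Eigen::VectorXd w_inv = Eigen::VectorXd::Zero(w.size());
    for (Eigen::Index i = 0; i < w.size(); ++i)
        if (std::abs(w(i)) > cutoff) w_inv(i) = 1.0 / w(i);

    // Minimum-norm solution: x = Q * diag(w_inv) * Q^T * b.
    return Q * (w_inv.asDiagonal() * (Q.transpose() * b));
}
```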

This has the added bonus that, in such cases, the results will match those produced by Python software such as scikit-learn, SciPy, or statsmodels, which is not currently the case for linear regression problems in which there are more columns than rows.

It additionally adds an extra check in the Cholesky approach for whether the diagonal elements might be too small, which could be indicative of numerical instability in the results; such cases are also better handled by the fallback added here.
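
As a rough illustration of that kind of check (the storage layout, signature, and threshold here are assumptions for the sketch, not oneDAL's actual criterion):

```cpp
#include <cstddef>

// After a successful potrf, inspect the Cholesky diagonal and request the
// spectral fallback if any element looks too small to be trustworthy.
bool cholesky_diagonal_looks_ok(const double *L, std::size_t n, std::size_t lda, double tol)
{
    for (std::size_t i = 0; i < n; ++i)
        if (L[i * lda + i] < tol) return false; // near-zero pivot: likely unstable
    return true;
}
```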

This doesn't guarantee 100% agreement with Python libraries, since there isn't a one-to-one correspondence between the Cholesky diagonal values and the singular values, so there can be edge cases in which the criterion here ends up using Cholesky in situations where SciPy would discard singular values, but it should make mismatches much less frequent.

Other changes

This PR also fixes an issue whereby calling the function to solve linear systems with more than one RHS would produce an incorrect solution when it fell back to PLU.

Additionally, it adds a few comments in the LAPACK wrappers that might help other people when looking at the code and function signatures.

Next steps

There are cases in which one can know beforehand that Cholesky will fail, and in those situations it'd be better to go straight to the spectral decomposition. A potential next step would be to add it as an additional solver option, and to make wrappers such as scikit-learn-intelex request that solver when there are more columns than rows in regression scenarios.

Another next step would be to allow scikit-learn-intelex to use oneDAL's routines in cases in which there are more columns than rows, which is currently disabled there.


Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least summary table with measured data, if performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@Vika-F Vika-F self-requested a review October 2, 2024 08:38
@Vika-F (Contributor) left a comment:

The code looks good to me.
The biggest concern is that syev might also fail to converge for some reason. What should we do then?

Wouldn't it be safer to use a more general spectral decomposition routine like gesvd?

Another comment is about testing.
I think the case of RHS > 1 also needs to be covered. It would also be better to have a test that covers a wider variety of input data sizes, as is done for the run_and_check_linear case.

cpp/daal/src/algorithms/service_kernel_math.h
LapackInst<FPType, cpu>::xsyev(&jobz, &uplo, (DAAL_INT *)&n, a, (DAAL_INT *)&n, eigenvalues.get(), work_buffer.get(), &work_buffer_size,
&info);
}
if (info) return false;
Contributor:

What if the symmetric matrix decomposition fails to converge, for example due to rounding errors?
Isn't it required to fall back to a general SVD decomposition (like gesvd, which cannot fail)?

Or maybe it would be an even more stable approach to use gesvd instead of syev for the spectral decomposition from the beginning?

Contributor Author:

Yes, gesvd on $\mathbf{X}$ would be more accurate than syev on $\mathbf{X}^T \mathbf{X}$, but it would be slower. However, the normEq solver is based on $\mathbf{X}^T \mathbf{X}$ (I guess that's what the name implies, but I'm not 100% sure). gesvd could potentially be a fallback from QR, but that'd be a different PR.

I think scikit-learn-intelex only calls oneDAL's solver with normEq BTW.

Contributor:

I meant not to apply gesvd to $X$, but rather to $X^T X$.
Due to factors like rounding errors, $X^T X$ might not be positive semi-definite, syev might fail to converge, and we would run into the same issue we have now with Cholesky.
With SVD we can do:
$X^T X \beta = X^T y$
$X^T X = U \Sigma V^T$
$\beta = V \Sigma^{-1} U^T X^T y$

Then there should be no possibility of failure.
Or maybe this is overly cautious.
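
To make the suggestion concrete, a minimal sketch of that SVD-based solve (Eigen is used purely for illustration; oneDAL would call a LAPACK routine such as gesvd instead):

```cpp
#include <Eigen/Dense>

// Solves (X^T X) beta = X^T y as beta = V * Sigma^{-1} * U^T * (X^T y),
// with near-zero singular values discarded by the solver's internal threshold.
Eigen::VectorXd solve_gram_via_svd(const Eigen::MatrixXd &X, const Eigen::VectorXd &y)
{
    const Eigen::MatrixXd G   = X.transpose() * X;
    const Eigen::VectorXd rhs = X.transpose() * y;

    Eigen::JacobiSVD<Eigen::MatrixXd> svd(G, Eigen::ComputeThinU | Eigen::ComputeThinV);
    return svd.solve(rhs);
}
```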

Contributor:

Yes, it seems that sklearnex only uses the normal-equations method now. We have the QR method implemented, but on CPUs only.
And it is rather inefficient to use normal equations first and, if that fails, apply QR to the whole $X$ matrix: roughly double the cost in terms of performance.

Contributor Author:

I don't follow.

The eigendecomposition part doesn't require positive-definiteness: it would succeed even if the input had negative entries on the main diagonal. If, due to numerical inaccuracies, an eigenvalue that is close to zero in theory comes out with a negative sign in practice (negative-definite or indefinite matrices would have negative eigenvalues), it would be removed as part of the procedure anyway, so the end result would not include that component.

These "sy" routines take values only from the lower triangle of the input matrix, so there shouldn't be any issues with symmetry either, as long as the matrix is filled in both corners. I understand they should also be more accurate than gesvd when the input matrix is actually symmetric, but I'm not 100% sure about that.

About the QR part: sorry, to clarify, what I meant was that gesvd on $\mathbf{X}$ could be a fallback when geqrf on $\mathbf{X}$ fails. In those cases, $\mathbf{X}^T \mathbf{X}$ would not get computed.

Contributor:

Yes, right. I forgot that positive-definiteness is not a problem for eigenvalue decomposition, only for Cholesky.
Sorry for the misleading info.
My comment was based on experience with eigenvalue decomposition methods, but I forgot what exactly led to the failures. My point is that those methods can fail, and I'm trying to avoid that.

Probably it was not positive-definiteness, but numerical stability issues.
If the condition number of $X^T X$ is too small, i.e. all the eigenvalues are close, some eigenvalue decomposition methods can fail. Similar things might happen when the condition number is too big.

But I am not too familiar with which MKL decomposition routine is better in the case of small or big condition numbers (I thought gesvd was the best, but it looks like it's not).

Anyway, from your comment I see that for symmetric matrices the sy methods should be better than the ge ones, so the choice of syevr looks reasonable now.

I also understand the point about QR and gesvd for decomposing $X$, but that is not in the scope of this PR.

Contributor Author (@david-cortes-intel, Oct 8, 2024):

Got it. From the syevd docs it looks like it is indeed possible for this routine to fail due to the values in the inputs, although the docs don't explain when that could happen.

That being said, in case it makes this look less bad: if the eigendecomposition fails, it follows that other methods like SciPy's lstsq or scikit-learn's LinearRegression would also fail, since they are also based on spectral decomposition.

@david-cortes-intel (Contributor Author):

Another comment is about testing. I think the case of RHS > 1 also needs to be covered. It would also be better to have a test that covers a wider variety of input data sizes, as is done for the run_and_check_linear case.

Yes, I was thinking about adding such a test with more than one regression target. But unfortunately I cannot find the signature of the train method that is called in the tests, so I wasn't sure if it supports it.

Testing with more input sizes would be trickier though: the logic for the current tests would not generalize to non-positive-definite systems, and here we're looking for the minimum-norm solution instead of just any solution. Ideas would be welcome, but the only potential approach I see would be to compare against LAPACK's gelsd, which might not be as desirable for a test.
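
For what it's worth, a sketch of what such a comparison could look like, with Eigen's CompleteOrthogonalDecomposition standing in for a gelsd-style minimum-norm reference (the function name and tolerance are illustrative assumptions, not part of this PR's test suite):

```cpp
#include <Eigen/Dense>
#include <cassert>

// Check a candidate solution for a wide system against a minimum-norm reference.
void check_minimum_norm(const Eigen::MatrixXd &X, const Eigen::VectorXd &y,
                        const Eigen::VectorXd &beta_candidate, double tol = 1e-6)
{
    Eigen::CompleteOrthogonalDecomposition<Eigen::MatrixXd> cod(X);
    const Eigen::VectorXd beta_ref = cod.solve(y); // minimum-norm least-squares solution

    // Both the fitted values and the coefficients should match the reference closely.
    assert((X * (beta_candidate - beta_ref)).norm() <= tol * (1.0 + y.norm()));
    assert((beta_candidate - beta_ref).norm() <= tol * (1.0 + beta_ref.norm()));
}
```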

@Vika-F (Contributor) commented Oct 2, 2024:

@david-cortes-intel

Yes, I was thinking about adding such a test with more than one regression target. But unfortunately I cannot find the signature of the train method that is called in the tests, so I wasn't sure if it supports it.

Yes, I really cannot find any information about support for multiple right-hand sides in the oneDAL linear regression docs.
This probably needs to be fixed.
But from the code of the CPU part I can see that oneDAL gets the number of RHS from the number of columns in the responses table:
https://github.com/oneapi-src/oneDAL/blob/main/cpp/oneapi/dal/algo/linear_regression/backend/cpu/train_kernel_norm_eq.cpp#L83
And this number is passed to DAAL.

So it is possible to test multiple RHS just by providing a responses table with more than one column.

I will also try to provide ideas about testing with wide matrices.

@icfaust (Contributor) commented Oct 2, 2024:

/intelci: run

@icfaust (Contributor) commented Oct 2, 2024:

So @Vika-F, I'm leaning towards asking for benchmarks on it, just to be careful. To be honest, I'm not sure if this occurs in the datasets of the benchmarks, but it's worth checking (since it's so core). What do you think? Could be a bit of a pain, sorry @david-cortes-intel.

@david-cortes-intel (Contributor Author) commented Oct 2, 2024:

So @Vika-F, I'm leaning towards asking for benchmarks on it, just to be careful. To be honest, I'm not sure if this occurs in the datasets of the benchmarks, but it's worth checking (since it's so core). What do you think? Could be a bit of a pain, sorry @david-cortes-intel.

Do you mean running the existing benchmarks after this PR, or generating some other kind of benchmark comparing the eigenvalue solvers?

I don't think re-running the existing benchmarks would make much sense; there are three possible cases:

  • Cholesky succeeded before and will keep succeeding, in which case the fallback will not be called. You might see a very small increase in latency in multi-output regression, since this PR fixes the issue of an insufficient number of elements being copied to a backup, but that should be nanoseconds for most problems.
  • Cholesky and PLU both failed before; here the new method would not fail, so they aren't comparable.
  • Cholesky failed before, whereas PLU didn't; here the new PR should be slower, but this would be a very uncommon occurrence, since PLU, just like Cholesky, is only meant for positive-definite systems. There could be edge cases where the matrix is singular but, due to numerical inaccuracies, the factorization still succeeds; that should be very uncommon, and the Cholesky/PLU results in that case would likely contain very large numbers (e.g. > 10^10), which is not a desirable solution.

@david-cortes-intel (Contributor Author):


Actually, I was wrong about the second point: there are many cases, even in the tests, where Cholesky fails but PLU succeeds, which would now fall back to spectral decomposition. I need to investigate what characteristics they have ...

@david-cortes-intel (Contributor Author):


Adding tests for multi-output regression seems to be quite tricky, as there's one issue that's currently a blocker: the tests here assume that coefficients come with shape [num_targets, num_cols]:

te::dataframe_builder{ std::int64_t(1), this->r_count_ }.fill_uniform(-15.5,

Whereas the current solver produces them with shape [num_cols, num_targets]:

LapackInst<FPType, cpu>::xxpotrs(&uplo, (DAAL_INT *)&n, (DAAL_INT *)&nX, a, (DAAL_INT *)&n, b, (DAAL_INT *)&n, &info);

There do seem to be some tests currently running that execute multi-output regressions, but they use the coefficients assuming shape [num_cols, num_targets], ignoring the shape they have in the table container.

Hence, if I add a test with expected results as was done in previous commits, it will end up erroring out either due to shape mismatches (if I change the table to [num_cols, num_targets]) or due to the coefficients being transposed (if I set the table shape to [num_targets, num_cols]).

So I'd say adding tests for multi-output regression would first require fixing that issue with the mismatched dimensions, which could perhaps be left for a different PR.

@david-cortes-intel (Contributor Author):

Sorry, I actually had some mistakes in my code:

  • I was incorrectly checking the diagonal elements of the Cholesky factor, which is why it was falling back in cases where it shouldn't. So in this regard the earlier comment about benchmarks still stands: it wouldn't make much sense to re-run the current benchmarks, since this PR extends the logic to cover cases that previously would have failed and doesn't change the cases that would not have failed.
  • There actually are tests for multi-output regression, and the shapes are correct and match the storage order; I just had some confusion between the dimensions that refer to the coefficients and those that refer to the intercepts. I've added an additional test for a multi-output case with a non-positive-definite system, which gets handled by the fallback.

@david-cortes-intel david-cortes-intel marked this pull request as ready for review October 3, 2024 08:06
@david-cortes-intel (Contributor Author):

/intelci: run

@david-cortes-intel (Contributor Author):

/intelci: run

@Vika-F (Contributor) left a comment:

Thanks for addressing the feedback.
The changes look good to me.
I am almost ready to approve; I have only one comment, regarding gesvd use. Please see below. What do you think?

@david-cortes-intel (Contributor Author):

Looks like my earlier comment somehow got lost, so reposting.


I don't follow.

The eigendecomposition part doesn't require positive-definiteness: it would succeed even if the input had negative entries on the main diagonal. If, due to numerical inaccuracies, an eigenvalue that is close to zero in theory comes out with a negative sign in practice (negative-definite or indefinite matrices would have negative eigenvalues), it would be removed as part of the procedure anyway, so the end result would not include that component.

These "sy" routines take values only from the lower triangle of the input matrix, so there shouldn't be any issues with symmetry either, as long as the matrix is filled in both corners. I understand they should also be more accurate than gesvd when the input matrix is actually symmetric, but I'm not 100% sure about that.

About the QR part: sorry, to clarify, what I meant was that gesvd on $\mathbf{X}$ could be a fallback when geqrf on $\mathbf{X}$ fails. In those cases, $\mathbf{X}^T \mathbf{X}$ would not get computed.

@david-cortes-intel (Contributor Author):

/intelci: run

@david-cortes-intel (Contributor Author):

/intelci: run

@david-cortes-intel (Contributor Author):

/intelci: run

@@ -89,6 +89,10 @@ class float_algo_fixture : public algo_fixture {
return is_double && !this->get_policy().has_native_float64();
}

bool running_on_gpu() {
Contributor:

Please move this function into linear_regression fixture: https://github.com/david-cortes-intel/oneDAL/blob/overdetermined_minimum_norm/cpp/oneapi/dal/algo/linear_regression/test/fixture.hpp

Also, it would be better to rename it to something more descriptive, like "non_psd_system_not_supported_on_device".

Contributor Author:

Moved to that file and renamed it.

@Vika-F (Contributor) commented Oct 11, 2024:

@icfaust it seems I've missed your comment.
Do you mean we need benchmarks for the cases of wide matrices? Or matrices with multicollinearity across features? Or both?

Or do you mean that David needs to check that the performance of the algorithm, as we measure it now, has not degraded after these changes?

@icfaust (Contributor) commented Oct 11, 2024:

@icfaust it seems I've missed your comment. Do you mean we need benchmarks for the cases of wide matrices? Or matrices with multicollinearity across features? Or both?

Or do you mean that David needs to check that the performance of the algorithm, as we measure it now, has not degraded after these changes?

I just wonder if we need to verify that performance has not degraded with these changes. What do you think, @Vika-F?

@david-cortes-intel (Contributor Author):

/intelci: run

@Vika-F (Contributor) left a comment:

LGTM

@david-cortes-intel david-cortes-intel merged commit e93063f into oneapi-src:main Oct 16, 2024
16 of 18 checks passed