Fix exponent overflow in strings-to-double conversion #15517

Merged
merged 7 commits into from
Apr 15, 2024

davidwendt
Contributor

Description

Adds a check when computing the exponent in the strings-to-double conversion to prevent an integer overflow.

Closes #15508

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the bug, 2 - In Progress, libcudf, strings, and non-breaking labels on Apr 11, 2024
@davidwendt self-assigned this on Apr 11, 2024
@davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label on Apr 11, 2024
@davidwendt marked this pull request as ready for review on April 11, 2024 21:46
@davidwendt requested a review from a team as a code owner on April 11, 2024 21:46
@@ -102,6 +102,7 @@ __device__ inline double stod(string_view const& d_str)
       ch = *in_ptr++;
       if (ch < '0' || ch > '9') break;
       exp_ten = (exp_ten * 10) + (int)(ch - '0');
+      if (exp_ten >= 1e8) { break; }
Contributor

It seems like there's a logical flaw here. We know the exponent of a finite double value can only be as large as std::numeric_limits<double>::max_exponent10 == 308, so why check against 100'000'000? Anything else would go to infinity, unless I'm missing something.

Maybe this is what I am expecting to see:

Suggested change
- if (exp_ten >= 1e8) { break; }
+ if (exp_ten >= (exp_sign == 1 ? std::numeric_limits<double>::max_exponent10 : std::numeric_limits<double>::min_exponent10)) { break; }
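
For reference, the limits named above can be checked on the host. This is a minimal standalone sketch, not libcudf code: the helper `parses_to_infinity` is a hypothetical name, and it uses `std::strtod` only to illustrate what happens to a decimal exponent beyond `max_exponent10`. It assumes an IEEE-754 binary64 `double`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>

// Decimal exponent range of finite, normalized doubles (IEEE-754 binary64 assumed).
static_assert(std::numeric_limits<double>::max_exponent10 == 308, "binary64 expected");
static_assert(std::numeric_limits<double>::min_exponent10 == -307, "binary64 expected");

// Hypothetical helper: true if the decimal text converts to +/-infinity.
inline bool parses_to_infinity(const char* text)
{
  return std::isinf(std::strtod(text, nullptr));
}

// Small self-check: 1e308 is still finite, 1e309 is not.
inline void check_exponent_limits()
{
  assert(!parses_to_infinity("1e308"));
  assert(parses_to_infinity("1e309"));
}
```

Note that `min_exponent10` is negative, so it cannot serve directly as an upper bound on a non-negative digit accumulator such as `exp_ten`.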

Contributor

If I'm mistaken about the logic above, let's at least use an integer-to-integer comparison (1e8 is a double literal, not an integer).

Suggested change
- if (exp_ten >= 1e8) { break; }
+ if (exp_ten >= 100'000'000) { break; }
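
To make the type difference concrete, here is a small host-side sketch (the function names `at_limit_double` and `at_limit_int` are hypothetical). It assumes a 32-bit `int`, in which case both spellings give the same answer for every input and only the type used for the comparison differs.

```cpp
#include <cassert>
#include <type_traits>

// 1e8 is a floating-point literal of type double; 100'000'000 is an int literal
// (assuming int is at least 32 bits wide).
static_assert(std::is_same<decltype(1e8), double>::value, "1e8 is a double");
static_assert(std::is_same<decltype(100'000'000), int>::value, "digit-separated literal is an int");

// Mixing int and double converts the int operand to double first.
static_assert(std::is_same<decltype(0 + 1e8), double>::value, "int + double yields double");

// Compares after converting exp_ten to double.
inline bool at_limit_double(int exp_ten) { return exp_ten >= 1e8; }

// Compares int to int with no conversion.
inline bool at_limit_int(int exp_ten) { return exp_ten >= 100'000'000; }

inline void check_limits_agree()
{
  assert(at_limit_double(100'000'000) && at_limit_int(100'000'000));
  assert(!at_limit_double(99'999'999) && !at_limit_int(99'999'999));
}
```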

Contributor Author

Exceeding max_exponent10 is actually handled below, where the result is mapped to infinity (or -infinity).
The check here only makes sure we don't overflow the integer, which is UB.

I had assumed the compiler would convert 1e8 to an integer, but it appears that is incorrect.
I'll change it to an integer.
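
The purpose of the guard can be sketched on the host as follows. This is a simplified stand-in for the exponent-digit loop, not the libcudf source: the function name `parse_exponent_digits` and the constant name `limit` are hypothetical, and a 32-bit `int` is assumed. Because `exp_ten` is below the limit whenever the loop continues, the next multiply-and-add cannot exceed `INT_MAX`.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the exponent-digit loop in stod (host-side sketch).
inline int parse_exponent_digits(const char* in_ptr, const char* end)
{
  constexpr int limit = 100'000'000;
  // If exp_ten < limit on entry, then exp_ten * 10 + 9 must still fit in an int.
  static_assert(limit <= (INT_MAX - 9) / 10 + 1, "limit too large to prevent overflow");

  int exp_ten = 0;
  while (in_ptr < end) {
    char const ch = *in_ptr++;
    if (ch < '0' || ch > '9') break;
    exp_ten = (exp_ten * 10) + (ch - '0');
    if (exp_ten >= limit) break;  // stop before another multiply could overflow
  }
  return exp_ten;
}

inline void check_parse_exponent_digits()
{
  const char* small = "308";
  assert(parse_exponent_digits(small, small + 3) == 308);
  const char* huge = "999999999999";  // 12 digits; parsing stops after 9
  assert(parse_exponent_digits(huge, huge + 12) == 999999999);
}
```

Any value at or above the limit is far beyond the range of a finite double, so the later mapping to infinity (or zero for negative exponents) is unaffected by stopping early, apart from the long-mantissa edge case discussed in the comments that follow.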

Contributor

exp_off stores the extra exponent contributed by the mantissa, and it is added to exp_ten after this while loop. So for a large number (which we expect to become infinity), if the mantissa is very long and exp_ten is capped too low, the final exp_ten could be wrong. We therefore need to pick a limit that is as large as possible.

Even with a limit of 100'000'000, this case can still occur for a string longer than 1e8 characters. It's an extreme edge case; I just want to point it out.

Contributor

E.g., a very long number (length is Int.max_value):

0.00...[about Int.max_value zeros]...1E999999999
Because of the following adjustment of exp_ten, the final exp_ten will be wrong.

I propose using a long to store exp_ten, since the current max string length is Int.max_value.

I'm also not sure whether cuDF will support strings of Long.max_value length in the future.

  exp_ten *= exp_sign;
  exp_ten += exp_off;
  exp_ten += num_digits - 1;
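
A rough sketch of the concern, using 64-bit arithmetic as proposed: the helper names `adjusted_exponent` and `fits_in_int32` are hypothetical, and the `exp_off` value below is purely illustrative (it is not derived from how libcudf actually computes `exp_off`). It shows that the three adjustment lines can leave the 32-bit range when a near-limit `exp_ten` is combined with an offset on the order of the maximum string length.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: the three adjustment lines, carried out in 64-bit arithmetic.
inline std::int64_t adjusted_exponent(int exp_sign, int exp_ten, int exp_off, int num_digits)
{
  std::int64_t result = static_cast<std::int64_t>(exp_ten) * exp_sign;
  result += exp_off;
  result += num_digits - 1;
  return result;
}

// True if the value could have been stored in a 32-bit int without overflow.
inline bool fits_in_int32(std::int64_t value)
{
  return value >= std::numeric_limits<std::int32_t>::min() &&
         value <= std::numeric_limits<std::int32_t>::max();
}

inline void check_adjusted_exponent()
{
  // Positive exponent, illustrative offset near -Int.max_value: still representable.
  assert(adjusted_exponent(1, 999999999, -2147483647, 1) == -1147483648LL);
  assert(fits_in_int32(adjusted_exponent(1, 999999999, -2147483647, 1)));
  // Negative exponent with the same offset: outside the 32-bit range.
  assert(adjusted_exponent(-1, 999999999, -2147483647, 1) == -3147483646LL);
  assert(!fits_in_int32(adjusted_exponent(-1, 999999999, -2147483647, 1)));
}
```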

Contributor Author

A string of int.max length could only be a one-row column in libcudf, since we have a max column size of int.max right now.
Regardless, I don't feel we need to increase the register size of this function to handle such a case. Likewise, a column of 100M-length strings would hold only about 20 rows. I think this is a reasonable limit, and I could even be convinced that a lower value is more practical.
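
The row estimate above can be reproduced with a line of arithmetic. This sketch assumes the int.max limit applies to the total number of characters in a strings column; the helper name `max_rows_of_length` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: how many rows of a given length fit under an int.max character limit.
inline std::int64_t max_rows_of_length(std::int64_t chars_per_row)
{
  constexpr std::int64_t max_column_chars = std::numeric_limits<std::int32_t>::max();
  return max_column_chars / chars_per_row;
}

inline void check_max_rows()
{
  assert(max_rows_of_length(100'000'000) == 21);    // "about 20 rows"
  assert(max_rows_of_length(2'147'483'647) == 1);   // an int.max-length string is a one-row column
}
```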

Member

@mhaseeb123 mhaseeb123 left a comment

Minor comment about nullptr -> "" in tests. Looks good otherwise considering the discussion with Bradley above.

Contributor

@bdice bdice left a comment

Let's add more detail to the comment. Otherwise LGTM.

@davidwendt
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 74b39e2 into rapidsai:branch-24.06 Apr 15, 2024
71 checks passed
@davidwendt davidwendt deleted the stod-overflow-exp branch April 15, 2024 18:57
Labels
3 - Ready for Review, bug, libcudf, non-breaking, strings
Development

Successfully merging this pull request may close these issues.

[BUG] String to Double return 0.0 for very large number and vice versa
5 participants