Fix exponent overflow in strings-to-double conversion #15517

Merged
merged 7 commits into from
Apr 15, 2024
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
* Copyright (c) 2019-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -102,6 +102,7 @@ __device__ inline double stod(string_view const& d_str)
ch = *in_ptr++;
if (ch < '0' || ch > '9') break;
exp_ten = (exp_ten * 10) + (int)(ch - '0');
if (exp_ten >= 1e8) { break; }
Contributor

It seems like there's a logical flaw here. We know the exponent of a finite double value can only be as large as std::numeric_limits<double>::max_exponent10 == 308, so why check against 100'000'000? Anything else would go to infinity, unless I'm missing something.

Maybe this is what I am expecting to see:

Suggested change
- if (exp_ten >= 1e8) { break; }
+ if (exp_ten >= (exp_sign == 1 ? std::numeric_limits<double>::max_exponent10 : std::numeric_limits<double>::min_exponent10)) { break; }
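
For reference, the decimal exponent bounds mentioned in this comment can be checked on the host. This is only a sketch and not part of the PR: it assumes `double` is IEEE 754 binary64, and `overflows_to_infinity` is a hypothetical helper name. Note that `min_exponent10` is negative, so it would need its sign flipped before being compared with a digit count that is accumulated as a non-negative value.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <limits>

// Assumption: IEEE 754 binary64 doubles, for which these are the standard bounds.
static_assert(std::numeric_limits<double>::max_exponent10 == 308);
static_assert(std::numeric_limits<double>::min_exponent10 == -307);

// Hypothetical helper: true when the host C library parses `text` to +/-infinity.
inline bool overflows_to_infinity(char const* text)
{
  return std::isinf(std::strtod(text, nullptr));
}
```

With these definitions, `overflows_to_infinity("12e+309")` is expected to hold, while a value such as `"1.28e256"` stays finite.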

Contributor

If I'm mistaken about the logic above, let's at least use integer-integer comparisons (1e8 is not an integer).

Suggested change
- if (exp_ten >= 1e8) { break; }
+ if (exp_ten >= 100'000'000) { break; }
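
The type difference behind this suggestion can be shown with a few compile-time checks. This is a sketch under the assumption that `int` is at least 32 bits wide; the two helper functions are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <type_traits>

// 1e8 is a floating-point literal; the digit-separated form is an integer literal.
static_assert(std::is_same_v<decltype(1e8), double>);
static_assert(std::is_same_v<decltype(100'000'000), int>);  // assumes int is at least 32 bits

// Mixing int and double converts the int operand to double before the operation.
static_assert(std::is_same_v<decltype(int{} + 1e8), double>);

// Hypothetical helpers: the same check written with each literal.
constexpr bool reaches_limit_mixed(int exp_ten) { return exp_ten >= 1e8; }
constexpr bool reaches_limit_int(int exp_ten) { return exp_ten >= 100'000'000; }
```

Both helpers return the same result for every 32-bit `int`, because each such value is exactly representable as a `double`. The integer form simply avoids the int-to-double conversion in the comparison.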

Contributor Author

Exceeding max_exponent is actually handled below, where the result is mapped to infinity (or -infinity).
The check here is to make sure we don't overflow the integer, which is UB.

I had assumed the compiler would convert 1e8 to an integer, but it appears that is incorrect.
I'll change it to an integer.
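
A minimal host-side sketch of the overflow guard described here, assuming a 32-bit `int`. The function `parse_exponent_digits` and the constant `exp_limit` are hypothetical names; this is not the libcudf `stod` implementation, only the same accumulate-and-cap pattern in isolation.

```cpp
#include <cassert>
#include <limits>
#include <string_view>

// Illustrative cap taken from the discussion. Any cap that passes the
// static_assert keeps the accumulation below inside the range of int.
constexpr int exp_limit = 100'000'000;
static_assert(exp_limit <= (std::numeric_limits<int>::max() - 9) / 10,
              "exp_ten * 10 + 9 must fit in int for every exp_ten below the cap");

// Hypothetical helper: accumulates leading decimal digits and stops as soon as
// the value reaches the cap, so the next multiply-add can never overflow.
constexpr int parse_exponent_digits(std::string_view digits)
{
  int exp_ten = 0;
  for (char ch : digits) {
    if (ch < '0' || ch > '9') { break; }
    exp_ten = (exp_ten * 10) + (ch - '0');
    if (exp_ten >= exp_limit) { break; }
  }
  return exp_ten;
}
```

The returned value for an oversized exponent is not the true exponent, only something at or above the cap. That is sufficient when later code maps any exponent beyond the representable range to infinity or zero.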

Contributor

There is an exp_off that stores the extra exponent contributed by the mantissa, and it is added to exp_ten after this while loop. So for a large number (which we expect to become infinity), if the mantissa is very long and the exp_ten limit is not large enough, the final exp_ten could be wrong. So we need to pick a limit that is as large as possible.

Even if we set it to 100'000'000, the above case could still happen for a string longer than 1e8 characters. It's an extreme edge case; I just want to point it out.

Contributor

E.g., a very, very long number (length is Int.max_value):

0.00...[There are about Int.max_value zeros].....1E999999999
Because of the following adjustment of exp_ten, the final exp_ten will be wrong.

I propose using a long to store exp_ten, since the current max string length is Int.max_value.

I'm also not sure whether cuDF will support Long.max_value-length strings in the future.

  exp_ten *= exp_sign;
  exp_ten += exp_off;
  exp_ten += num_digits - 1;
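
To make the concern concrete, the three adjustment lines quoted above can be replayed with 64-bit arithmetic. This is a sketch only: `adjust_exponent` is a hypothetical function, the parameter types are assumptions, and the input values are illustrative magnitudes (an exponent just under 1e9 and an offset on the order of the maximum string length), not values taken from libcudf.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical 64-bit variant of the adjustment; names follow the quoted snippet.
constexpr std::int64_t adjust_exponent(std::int64_t exp_ten,
                                       int exp_sign,
                                       std::int64_t exp_off,
                                       std::int64_t num_digits)
{
  exp_ten *= exp_sign;
  exp_ten += exp_off;
  exp_ten += num_digits - 1;
  return exp_ten;
}

// Illustrative inputs: a capped negative exponent plus a large negative offset.
constexpr std::int64_t adjusted = adjust_exponent(999'999'999, -1, -2'000'000'000, 1);
static_assert(adjusted == -2'999'999'999);

// The combined value no longer fits in a 32-bit int.
static_assert(adjusted < std::numeric_limits<std::int32_t>::min());
```

Under these assumed magnitudes the sum only fits in a wider type, which is the motivation for the proposal to store exp_ten in a long.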

Contributor Author

An int.max-length string would only fit in a one-row column in libcudf, since we have a max column size of int.max right now.
Regardless, I don't feel we need to increase the register size of this function to handle such a case. Likewise, a 100M-length string would only allow about 20 rows. I think this is a reasonable limit, and I could even be convinced that a lower value is more practical.
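
The row counts quoted here follow from simple division. The sketch below assumes the int.max limit applies to the total number of bytes in a strings column; `max_rows` is a hypothetical helper used only for the arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: how many strings of `string_length` bytes fit under a
// total byte cap (assumed here to be the int.max limit from the discussion).
constexpr std::int64_t max_rows(std::int64_t total_bytes_cap, std::int64_t string_length)
{
  return total_bytes_cap / string_length;
}

constexpr std::int64_t int_max = std::numeric_limits<std::int32_t>::max();

// 2'147'483'647 / 100'000'000 rounds down to 21, i.e. "about 20 rows".
static_assert(max_rows(int_max, 100'000'000) == 21);

// A string of int.max bytes leaves room for exactly one row.
static_assert(max_rows(int_max, int_max) == 1);
```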

}
}
}
37 changes: 20 additions & 17 deletions cpp/tests/strings/floats_tests.cpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
* Copyright (c) 2019-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -17,6 +17,8 @@
#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>
#include <cudf_test/debug_utilities.hpp>
#include <cudf_test/iterator_utilities.hpp>

#include <cudf/strings/convert/convert_floats.hpp>
#include <cudf/strings/strings_column_view.hpp>
@@ -25,8 +27,6 @@

#include <vector>

constexpr cudf::test::debug_output_level verbosity{cudf::test::debug_output_level::ALL_ERRORS};

struct StringsConvertTest : public cudf::test::BaseFixture {};

TEST_F(StringsConvertTest, IsFloat)
@@ -89,7 +89,7 @@ TEST_F(StringsConvertTest, ToFloats32)
h_expected.begin(),
h_expected.end(),
thrust::make_transform_iterator(h_strings.begin(), [](auto str) { return str != nullptr; }));
CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected, verbosity);
CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected);
}

TEST_F(StringsConvertTest, FromFloats32)
@@ -118,38 +118,41 @@ TEST_F(StringsConvertTest, FromFloats32)
h_expected.end(),
thrust::make_transform_iterator(h_expected.begin(), [](auto str) { return str != nullptr; }));

CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected, verbosity);
CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected);
}

TEST_F(StringsConvertTest, ToFloats64)
{
// clang-format off
std::vector<const char*> h_strings{
"1234", nullptr, "-876", "543.2", "-0.12", ".25",
"1234", "", "-876", "543.2", "-0.12", ".25",
"-.002", "", "-0.0", "1.28e256", "NaN", "abc123",
"123abc", "456e", "-1.78e+5", "-122.33644782", "12e+309", "1.7976931348623159E308",
"-Inf", "-INFINITY", "1.0", "1.7976931348623157e+308", "1.7976931348623157e-307",
// subnormal numbers: v--- smallest double v--- result is 0
"4e-308", "3.3333333333e-320", "4.940656458412465441765688e-324", "1.e-324" };
"4e-308", "3.3333333333e-320", "4.940656458412465441765688e-324", "1.e-324",
// another very small number
"9.299999257686047e-0005603333574677677" };
// clang-format on
cudf::test::strings_column_wrapper strings(
h_strings.begin(),
h_strings.end(),
thrust::make_transform_iterator(h_strings.begin(), [](auto str) { return str != nullptr; }));
auto validity = cudf::test::iterators::null_at(1);
cudf::test::strings_column_wrapper strings(h_strings.begin(), h_strings.end(), validity);

std::vector<double> h_expected;
std::for_each(h_strings.begin(), h_strings.end(), [&](char const* str) {
h_expected.push_back(str ? std::atof(str) : 0);
h_expected.push_back(std::atof(str));
});

auto strings_view = cudf::strings_column_view(strings);
auto results = cudf::strings::to_floats(strings_view, cudf::data_type{cudf::type_id::FLOAT64});

cudf::test::fixed_width_column_wrapper<double> expected(
h_expected.begin(),
h_expected.end(),
thrust::make_transform_iterator(h_strings.begin(), [](auto str) { return str != nullptr; }));
CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected, verbosity);
h_expected.begin(), h_expected.end(), validity);
CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected);

results = cudf::strings::is_float(strings_view);
cudf::test::fixed_width_column_wrapper<bool> is_expected(
{1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, validity);
CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, is_expected);
}
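
A note on the expected values in `ToFloats64` above: they are produced by `std::atof`, so the new edge cases depend on how the host C library treats out-of-range exponents. The sketch below is an assumption-laden illustration, not part of the test. It calls `std::strtod` directly, which common C libraries use to implement `atof` and whose out-of-range behavior is specified; `reference_value` and `parses_to_zero` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Hypothetical helper standing in for the std::atof call used by the test.
inline double reference_value(char const* str) { return std::strtod(str, nullptr); }

// Hypothetical helper: true when the parsed value is exactly zero (either sign).
inline bool parses_to_zero(char const* str) { return reference_value(str) == 0.0; }
```

On a typical glibc host, `parses_to_zero` is expected to hold for both `"1.e-324"` and the newly added `"9.299999257686047e-0005603333574677677"`, while `reference_value("12e+309")` is expected to be infinite.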

TEST_F(StringsConvertTest, FromFloats64)
@@ -178,7 +181,7 @@ TEST_F(StringsConvertTest, FromFloats64)
h_expected.end(),
thrust::make_transform_iterator(h_expected.begin(), [](auto str) { return str != nullptr; }));

CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected, verbosity);
CUDF_TEST_EXPECT_COLUMNS_EQUIVALENT(*results, expected);
}

TEST_F(StringsConvertTest, ZeroSizeStringsColumnFloat)