Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libc++] Speed up set_intersection() by fast-forwarding over ranges of non-matching elements with one-sided binary search. #75230

Merged
merged 62 commits into from
Jul 18, 2024

Conversation

ichaer
Copy link
Contributor

@ichaer ichaer commented Dec 12, 2023

One-sided binary search, aka meta binary search, has been in the public domain for decades, and has the general advantage of being constant time in the best case, with the downside of executing at most 2*log(N) comparisons vs classic binary search's exact log(N). There are two scenarios in which it really shines: the first one is when operating over non-random-access iterators, because the classic algorithm requires knowing the container's size upfront, which adds N iterator increments to the complexity. The second one is when traversing the container in order, trying to fast-forward to the next value: in that case the classic algorithm requires at least O(N*log(N)) comparisons and, for non-random-access iterators, O(N^2) iterator increments, whereas the one-sided version will yield O(N) operations on both counts, with a best-case of O(log(N)) comparisons which is very common in practice.

Benchmark results for libcxx/benchmarks/set_intersection.libcxx.out on the following system:
Run on (16 X 4800.01 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1280 KiB (x8)
L3 Unified 24576 KiB (x1)

Comparison between main branch at b926f75 and this branch at be6c5c8, with a fixed seed of 0 hardcoded in getRandomEngine() for consistent and reproducible results -- the full output has 1.3k lines, so I'm only pasting here the lines with best and worst CPU time differences, plus the summary:

Benchmark                                                Time      CPU     Time Old     Time New   CPU Old   CPU New
--------------------------------------------------------------------------------------------------------------------
SetIntersection_None_Vector_uint32_262144_262144      -0.0210  -0.7069  13352003320  13071795797   1204337    352960
SetIntersection_Interlaced_Set_uint64_262144_262144   -0.0207  +0.0459  13404448283  13127521495  53605744  56068263
OVERALL_GEOMEAN                                       -0.0222  -0.1220            4            4         0         0

…ntroducing one-sided binary search for non-random iterators.
…dom iterators.

One-sided binary search, aka meta binary search, has been in the public domain for decades, and has the general
advantage of being Ω(1) rather than the classic algorithm's Ω(log(n)), with the downside of executing at most
2*log(n) comparisons vs the classic algorithm's exact log(n). There are two scenarios in which it really shines:
the first one is when operating over non-random iterators, because the classic algorithm requires knowing the
container's size upfront, which adds Ω(n) iterator increments to the complexity. The second one is when you're
traversing the container in order, trying to fast-forward to the next value: in that case, the classic algorithm
would yield Ω(n*log(n)) comparisons and, for non-random iterators, Ω(n^2) iterator increments, whereas the one-sided
version will yield O(n) operations on both counts, with a Ω(log(n)) bound on the number of comparisons.
… to introducing use of one-sided binary search to fast-forward over ranges of elements.
@ichaer ichaer requested a review from a team as a code owner December 12, 2023 18:18
Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Dec 12, 2023
@llvmbot
Copy link

llvmbot commented Dec 12, 2023

@llvm/pr-subscribers-libcxx

Author: None (ichaer)

Changes

One-sided binary search, aka meta binary search, has been in the public domain for decades, and has the general advantage of being Ω(1) rather than the classic algorithm's Ω(log(n)), with the downside of executing at most 2log(n) comparisons vs the classic algorithm's exact log(n). There are two scenarios in which it really shines: the first one is when operating over non-random iterators, because the classic algorithm requires knowing the container's size upfront, which adds Ω(n) iterator increments to the complexity. The second one is when you're traversing the container in order, trying to fast-forward to the next value: in that case, the classic algorithm would yield Ω(nlog(n)) comparisons and, for non-random iterators, Ω(n^2) iterator increments, whereas the one-sided version will yield O(n) operations on both counts, with a Ω(log(n)) bound on the number of comparisons.

One use case that can often fit both scenarios is a std::set_intersection() on std::set instances.

This change is split into 4 separate commits. First we add tests validating the algorithmic complexity of std::lower_bound(), then we introduce the use of one-sided binary search for non-random iterators, then we add complexity validation tests for std::set_intersection(), and, finally, we introduce explicit use of the one-sided binary search version of std::lower_bound() for the instances of std::set_intersection() where there is no question about it improving time complexity.


Patch is 34.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75230.diff

7 Files Affected:

  • (modified) libcxx/include/__algorithm/iterator_operations.h (+47)
  • (modified) libcxx/include/__algorithm/lower_bound.h (+63-6)
  • (modified) libcxx/include/__algorithm/set_intersection.h (+150-4)
  • (modified) libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound.pass.cpp (+14-5)
  • (modified) libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound_comp.pass.cpp (+23-5)
  • (modified) libcxx/test/std/algorithms/alg.sorting/alg.set.operations/set.intersection/ranges_set_intersection.pass.cpp (+234-6)
  • (modified) libcxx/test/support/test_iterators.h (+42-13)
diff --git a/libcxx/include/__algorithm/iterator_operations.h b/libcxx/include/__algorithm/iterator_operations.h
index e6176da4f5606..d73573747087e 100644
--- a/libcxx/include/__algorithm/iterator_operations.h
+++ b/libcxx/include/__algorithm/iterator_operations.h
@@ -87,6 +87,53 @@ struct _IterOps<_ClassicAlgPolicy> {
     std::advance(__iter, __count);
   }
 
+  // advance with sentinel, a la std::ranges::advance
+  // it's unclear whether _Iter has a difference_type and whether that's signed, so we play it safe:
+  // use the incoming type for returning and steer clear of negative overflows
+  template <class _Iter, class _Distance>
+  _LIBCPP_HIDE_FROM_ABI constexpr static _Distance advance(_Iter& __iter, _Distance __count, const _Iter& __sentinel) {
+    return _IterOps::__advance(__iter, __count, __sentinel, typename iterator_traits<_Iter>::iterator_category());
+  }
+
+  // advance with sentinel, a la std::ranges::advance -- InputIterator specialization
+  template <class _InputIter, class _Distance>
+  _LIBCPP_HIDE_FROM_ABI constexpr static _Distance
+  __advance(_InputIter& __iter, _Distance __count, const _InputIter& __sentinel, input_iterator_tag) {
+    _Distance __dist{};
+    for (; __dist < __count && __iter != __sentinel; ++__dist)
+      ++__iter;
+    return __count - __dist;
+  }
+
+  // advance with sentinel, a la std::ranges::advance -- BidirectionalIterator specialization
+  template <class _BiDirIter, class _Distance>
+  _LIBCPP_HIDE_FROM_ABI constexpr static _Distance
+  __advance(_BiDirIter& __iter, _Distance __count, const _BiDirIter& __sentinel, bidirectional_iterator_tag) {
+    _Distance __dist{};
+    if (__count >= 0)
+      for (; __dist < __count && __iter != __sentinel; ++__dist)
+        ++__iter;
+    else
+      for (__count = -__count; __dist < __count && __iter != __sentinel; ++__dist)
+        --__iter;
+    return __count - __dist;
+  }
+
+  // advance with sentinel, a la std::ranges::advance -- RandomIterator specialization
+  template <class _RandIter, class _Distance>
+  _LIBCPP_HIDE_FROM_ABI constexpr static _Distance
+  __advance(_RandIter& __iter, _Distance __count, const _RandIter& __sentinel, random_access_iterator_tag) {
+    auto __dist = _IterOps::distance(__iter, __sentinel);
+    _LIBCPP_ASSERT_UNCATEGORIZED(
+        __count == 0 || (__dist < 0) == (__count < 0), "__sentinel must precede __iter when __count<0");
+    if (__count < 0)
+      __dist = __dist > __count ? __dist : __count;
+    else
+      __dist = __dist < __count ? __dist : __count;
+    __iter += __dist;
+    return __count - __dist;
+  }
+
   // distance
   template <class _Iter>
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14
diff --git a/libcxx/include/__algorithm/lower_bound.h b/libcxx/include/__algorithm/lower_bound.h
index 91c3bdaafd0cf..b432829667fa9 100644
--- a/libcxx/include/__algorithm/lower_bound.h
+++ b/libcxx/include/__algorithm/lower_bound.h
@@ -27,11 +27,13 @@
 
 _LIBCPP_BEGIN_NAMESPACE_STD
 
-template <class _AlgPolicy, class _Iter, class _Sent, class _Type, class _Proj, class _Comp>
-_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20
-_Iter __lower_bound(_Iter __first, _Sent __last, const _Type& __value, _Comp& __comp, _Proj& __proj) {
-  auto __len = _IterOps<_AlgPolicy>::distance(__first, __last);
-
+template <class _AlgPolicy, class _Iter, class _Type, class _Proj, class _Comp>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _Iter __lower_bound_bisecting(
+    _Iter __first,
+    const _Type& __value,
+    typename iterator_traits<_Iter>::difference_type __len,
+    _Comp& __comp,
+    _Proj& __proj) {
   while (__len != 0) {
     auto __l2 = std::__half_positive(__len);
     _Iter __m = __first;
@@ -46,13 +48,68 @@ _Iter __lower_bound(_Iter __first, _Sent __last, const _Type& __value, _Comp& __
   return __first;
 }
 
+// One-sided binary search, aka meta binary search, has been in the public domain for decades, and has the general
+// advantage of being Ω(1) rather than the classic algorithm's Ω(log(n)), with the downside of executing at most
+// 2*(log(n)-1) comparisons vs the classic algorithm's exact log(n). There are two scenarios in which it really shines:
+// the first one is when operating over non-random iterators, because the classic algorithm requires knowing the
+// container's size upfront, which adds Ω(n) iterator increments to the complexity. The second one is when you're
+// traversing the container in order, trying to fast-forward to the next value: in that case, the classic algorithm
+// would yield Ω(n*log(n)) comparisons and, for non-random iterators, Ω(n^2) iterator increments, whereas the one-sided
+// version will yield O(n) operations on both counts, with a Ω(log(n)) bound on the number of comparisons.
+template <class _AlgPolicy, class _Iter, class _Sent, class _Type, class _Proj, class _Comp>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _Iter
+__lower_bound_onesided(_Iter __first, _Sent __last, const _Type& __value, _Comp& __comp, _Proj& __proj) {
+  // static_assert(std::is_base_of<std::forward_iterator_tag, typename _IterOps<_AlgPolicy>::template
+  // __iterator_category<_Iter>>::value,
+  //       "lower_bound() is a multipass algorithm and requires forward iterator or better");
+
+  using _Distance = typename iterator_traits<_Iter>::difference_type;
+  for (_Distance __step = 1; __first != __last; __step <<= 1) {
+    auto __it   = __first;
+    auto __dist = __step - _IterOps<_AlgPolicy>::advance(__it, __step, __last);
+    // once we reach the last range where needle can be we must start
+    // looking inwards, bisecting that range
+    if (__it == __last || !std::__invoke(__comp, std::__invoke(__proj, *__it), __value)) {
+      return std::__lower_bound_bisecting<_AlgPolicy>(__first, __value, __dist, __comp, __proj);
+    }
+    // range not found, move forward!
+    __first = std::move(__it);
+  }
+  return __first;
+}
+
+template <class _AlgPolicy, class _InputIter, class _Sent, class _Type, class _Proj, class _Comp>
+_LIBCPP_NODISCARD_EXT inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIter __lower_bound(
+    _InputIter __first, _Sent __last, const _Type& __value, _Comp& __comp, _Proj& __proj, std::input_iterator_tag) {
+  return std::__lower_bound_onesided<_AlgPolicy>(__first, __last, __value, __comp, __proj);
+}
+
+template <class _AlgPolicy, class _RandIter, class _Sent, class _Type, class _Proj, class _Comp>
+_LIBCPP_NODISCARD_EXT inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _RandIter __lower_bound(
+    _RandIter __first,
+    _Sent __last,
+    const _Type& __value,
+    _Comp& __comp,
+    _Proj& __proj,
+    std::random_access_iterator_tag) {
+  const auto __dist = _IterOps<_AlgPolicy>::distance(__first, __last);
+  return std::__lower_bound_bisecting<_AlgPolicy>(__first, __value, __dist, __comp, __proj);
+}
+
+template <class _AlgPolicy, class _Iter, class _Sent, class _Type, class _Proj, class _Comp>
+_LIBCPP_NODISCARD_EXT inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _Iter
+__lower_bound(_Iter __first, _Sent __last, const _Type& __value, _Comp&& __comp, _Proj&& __proj) {
+  return std::__lower_bound<_AlgPolicy>(
+      __first, __last, __value, __comp, __proj, typename _IterOps<_AlgPolicy>::template __iterator_category<_Iter>());
+}
+
 template <class _ForwardIterator, class _Tp, class _Compare>
 _LIBCPP_NODISCARD_EXT inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20
 _ForwardIterator lower_bound(_ForwardIterator __first, _ForwardIterator __last, const _Tp& __value, _Compare __comp) {
   static_assert(__is_callable<_Compare, decltype(*__first), const _Tp&>::value,
                 "The comparator has to be callable");
   auto __proj = std::__identity();
-  return std::__lower_bound<_ClassicAlgPolicy>(__first, __last, __value, __comp, __proj);
+  return std::__lower_bound<_ClassicAlgPolicy>(__first, __last, __value, std::move(__comp), std::move(__proj));
 }
 
 template <class _ForwardIterator, class _Tp>
diff --git a/libcxx/include/__algorithm/set_intersection.h b/libcxx/include/__algorithm/set_intersection.h
index f2603fe1365ac..556738022f485 100644
--- a/libcxx/include/__algorithm/set_intersection.h
+++ b/libcxx/include/__algorithm/set_intersection.h
@@ -12,9 +12,13 @@
 #include <__algorithm/comp.h>
 #include <__algorithm/comp_ref_type.h>
 #include <__algorithm/iterator_operations.h>
+#include <__algorithm/lower_bound.h>
 #include <__config>
+#include <__functional/identity.h>
 #include <__iterator/iterator_traits.h>
 #include <__iterator/next.h>
+#include <__type_traits/is_same.h>
+#include <__utility/exchange.h>
 #include <__utility/move.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
@@ -36,9 +40,122 @@ struct __set_intersection_result {
 };
 
 template <class _AlgPolicy, class _Compare, class _InIter1, class _Sent1, class _InIter2, class _Sent2, class _OutIter>
-_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 __set_intersection_result<_InIter1, _InIter2, _OutIter>
-__set_intersection(
-    _InIter1 __first1, _Sent1 __last1, _InIter2 __first2, _Sent2 __last2, _OutIter __result, _Compare&& __comp) {
+struct _LIBCPP_NODISCARD_EXT __set_intersector {
+  _InIter1& __first1_;
+  const _Sent1& __last1_;
+  _InIter2& __first2_;
+  const _Sent2& __last2_;
+  _OutIter& __result_;
+  _Compare& __comp_;
+  static constexpr auto __proj_ = std::__identity();
+  bool __prev_advanced_         = true;
+
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 __set_intersector(
+      _InIter1& __first1, _Sent1& __last1, _InIter2& __first2, _Sent2& __last2, _OutIter& __result, _Compare& __comp)
+      : __first1_(__first1),
+        __last1_(__last1),
+        __first2_(__first2),
+        __last2_(__last2),
+        __result_(__result),
+        __comp_(__comp) {}
+
+  _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI
+      _LIBCPP_CONSTEXPR_SINCE_CXX20 __set_intersection_result<_InIter1, _InIter2, _OutIter>
+      operator()() && {
+    while (__first2_ != __last2_) {
+      __advance1_and_maybe_add_result();
+      if (__first1_ == __last1_)
+        break;
+      __advance2_and_maybe_add_result();
+    }
+    return __set_intersection_result<_InIter1, _InIter2, _OutIter>(
+        _IterOps<_AlgPolicy>::next(std::move(__first1_), std::move(__last1_)),
+        _IterOps<_AlgPolicy>::next(std::move(__first2_), std::move(__last2_)),
+        std::move(__result_));
+  }
+
+private:
+  // advance __iter to the first element in the range where !__comp_(__iter, __value)
+  // add result if this is the second consecutive call without advancing
+  // this method only works if you alternate calls between __advance1_and_maybe_add_result() and
+  // __advance2_and_maybe_add_result()
+  template <class _Iter, class _Sent, class _Value>
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void
+  __advance_and_maybe_add_result(_Iter& __iter, const _Sent& __sentinel, const _Value& __value) {
+    // use one-sided lower bound for improved algorithmic complexity bounds
+    const auto __tmp =
+        std::exchange(__iter, std::__lower_bound_onesided<_AlgPolicy>(__iter, __sentinel, __value, __comp_, __proj_));
+    __add_output_unless(__tmp != __iter);
+  }
+
+  // advance __first1_ to the first element in the range where !__comp_(*__first1_, *__first2_)
+  // add result if neither __first1_ nor __first2_ advanced in the last attempt (meaning they are equal)
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __advance1_and_maybe_add_result() {
+    __advance_and_maybe_add_result(__first1_, __last1_, *__first2_);
+  }
+
+  // advance __first2_ to the first element in the range where !__comp_(*__first2_, *__first1_)
+  // add result if neither __first1_ nor __first2_ advanced in the last attempt (meaning they are equal)
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __advance2_and_maybe_add_result() {
+    __advance_and_maybe_add_result(__first2_, __last2_, *__first1_);
+  }
+
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 void __add_output_unless(bool __advanced) {
+    if (__advanced | __prev_advanced_) {
+      __prev_advanced_ = __advanced;
+    } else {
+      *__result_ = *__first1_;
+      ++__result_;
+      ++__first1_;
+      ++__first2_;
+      __prev_advanced_ = true;
+    }
+  }
+};
+
+// with forward iterators we can use binary search to skip over entries
+template <class _AlgPolicy,
+          class _Compare,
+          class _InForwardIter1,
+          class _Sent1,
+          class _InForwardIter2,
+          class _Sent2,
+          class _OutIter>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI
+    _LIBCPP_CONSTEXPR_SINCE_CXX20 __set_intersection_result<_InForwardIter1, _InForwardIter2, _OutIter>
+    __set_intersection(
+        _InForwardIter1 __first1,
+        _Sent1 __last1,
+        _InForwardIter2 __first2,
+        _Sent2 __last2,
+        _OutIter __result,
+        _Compare&& __comp,
+        std::forward_iterator_tag,
+        std::forward_iterator_tag) {
+  std::__set_intersector<_AlgPolicy, _Compare, _InForwardIter1, _Sent1, _InForwardIter2, _Sent2, _OutIter>
+      __intersector(__first1, __last1, __first2, __last2, __result, __comp);
+  return std::move(__intersector)();
+}
+
+// input iterators are not suitable for multipass algorithms, so we stick to the classic single-pass version
+template <class _AlgPolicy,
+          class _Compare,
+          class _InInputIter1,
+          class _Sent1,
+          class _InInputIter2,
+          class _Sent2,
+          class _OutIter>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI
+    _LIBCPP_CONSTEXPR_SINCE_CXX20 __set_intersection_result<_InInputIter1, _InInputIter2, _OutIter>
+    __set_intersection(
+        _InInputIter1 __first1,
+        _Sent1 __last1,
+        _InInputIter2 __first2,
+        _Sent2 __last2,
+        _OutIter __result,
+        _Compare&& __comp,
+        std::input_iterator_tag,
+        std::input_iterator_tag) {
   while (__first1 != __last1 && __first2 != __last2) {
     if (__comp(*__first1, *__first2))
       ++__first1;
@@ -52,12 +169,41 @@ __set_intersection(
     }
   }
 
-  return __set_intersection_result<_InIter1, _InIter2, _OutIter>(
+  return std::__set_intersection_result<_InInputIter1, _InInputIter2, _OutIter>(
       _IterOps<_AlgPolicy>::next(std::move(__first1), std::move(__last1)),
       _IterOps<_AlgPolicy>::next(std::move(__first2), std::move(__last2)),
       std::move(__result));
 }
 
+template <class _AlgPolicy, class _Iter>
+class __set_intersection_iter_category {
+  template <class _It>
+  using __cat = typename std::_IterOps<_AlgPolicy>::template __iterator_category<_It>;
+  template <class _It>
+  static auto test(__cat<_It>*) -> __cat<_It>;
+  template <class>
+  static std::input_iterator_tag test(...);
+
+public:
+  using __type = decltype(test<_Iter>(nullptr));
+};
+
+template <class _AlgPolicy, class _Compare, class _InIter1, class _Sent1, class _InIter2, class _Sent2, class _OutIter>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI
+    _LIBCPP_CONSTEXPR_SINCE_CXX20 __set_intersection_result<_InIter1, _InIter2, _OutIter>
+    __set_intersection(
+        _InIter1 __first1, _Sent1 __last1, _InIter2 __first2, _Sent2 __last2, _OutIter __result, _Compare&& __comp) {
+  return std::__set_intersection<_AlgPolicy>(
+      std::move(__first1),
+      std::move(__last1),
+      std::move(__first2),
+      std::move(__last2),
+      std::move(__result),
+      std::forward<_Compare>(__comp),
+      typename std::__set_intersection_iter_category<_AlgPolicy, _InIter1>::__type(),
+      typename std::__set_intersection_iter_category<_AlgPolicy, _InIter2>::__type());
+}
+
 template <class _InputIterator1, class _InputIterator2, class _OutputIterator, class _Compare>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _OutputIterator set_intersection(
     _InputIterator1 __first1,
diff --git a/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound.pass.cpp b/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound.pass.cpp
index a2d8ab632303c..5c11962d13777 100644
--- a/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound.pass.cpp
@@ -39,11 +39,20 @@ template <class Iter, class T>
 void
 test(Iter first, Iter last, const T& value)
 {
-    Iter i = std::lower_bound(first, last, value);
-    for (Iter j = first; j != i; ++j)
-        assert(*j < value);
-    for (Iter j = i; j != last; ++j)
-        assert(!(*j < value));
+  std::size_t strides{};
+  std::size_t displacement{};
+  stride_counting_iterator f(first, &strides, &displacement);
+  stride_counting_iterator l(last, &strides, &displacement);
+
+  auto i = std::lower_bound(f, l, value);
+  for (auto j = f; j != i; ++j)
+    assert(*j < value);
+  for (auto j = i; j != l; ++j)
+    assert(!(*j < value));
+
+  auto len = std::distance(first, last);
+  assert(strides <= 2.5 * len + 1);
+  assert(displacement <= 2.5 * len + 1);
 }
 
 template <class Iter>
diff --git a/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound_comp.pass.cpp b/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound_comp.pass.cpp
index b9133028d9ade..05fd43eada461 100644
--- a/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound_comp.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.sorting/alg.binary.search/lower.bound/lower_bound_comp.pass.cpp
@@ -17,6 +17,7 @@
 #include <vector>
 #include <cassert>
 #include <cstddef>
+#include <cmath>
 
 #include "test_macros.h"
 #include "test_iterators.h"
@@ -38,11 +39,28 @@ template <class Iter, class T>
 void
 test(Iter first, Iter last, const T& value)
 {
-    Iter i = std::lower_bound(first, last, value, std::greater<int>());
-    for (Iter j = first; j != i; ++j)
-        assert(std::greater<int>()(*j, value));
-    for (Iter j = i; j != last; ++j)
-        assert(!std::greater<int>()(*j, value));
+  std::size_t strides{};
+  std::size_t displacement{};
+  stride_counting_iterator f(first, &strides, &displacement);
+  stride_counting_iterator l(last, &strides, &displacement);
+
+  std::size_t comparisons{};
+  auto cmp = [&comparisons](int rhs, int lhs) {
+    ++comparisons;
+    return std::greater<int>()(rhs, lhs);
+  };
+
+  auto i = std::lower_bound(f, l, value, cmp);
+
+  for (auto j = f; j != i; ++j)
+    assert(std::greater<int>()(*j, value));
+  for (auto j = i; j != l; ++j)
+    assert(!std::greater<int>()(*j, value));
+
+  auto len = std::distance(first, last);
+  assert(strides <= 2.5 * len + 1);
+  assert(displacement <= 2.5 * len + 1);
+  assert(comparisons <= 2 * ceil(log(len + 1) + 2));
 }
 
 template <class Iter>
diff --git a/libcxx/test/std/algorithms/alg.sorting/alg.set.operations/set.intersection/ranges_set_intersection.pass.cpp b/libcxx/test/std/algorithms/alg.sorting/alg.set.operations/set.intersection/ranges_set_intersection.pass.cpp
index 0ee89e0131a07..30cedd19038d7 100644
--- a/libcxx/test/std/algorithms/alg.sorting/alg.set.operations/set.intersection/ranges_set_intersection.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.sorting/alg.set.operations/set.intersection/ranges_set_intersection.pass.cpp
@@ -28,6 +28,9 @@
 #include <algorithm>
 #include <array>
 #include <concepts>
+#include <cstddef>
+#include <iterator>
+#include <type_traits>
 
 #include "almost_satisfies_types.h"
 #include "MoveOnly.h"
@@ -93,14 +96,17 @@ static_assert(!HasSetIntersectionRange<UncheckedRange<MoveOnly*>, UncheckedRange
 
 using std::ranges::set_intersection_result;
 
+// TODO: std::ranges::set_intersection calls std::ranges::copy
+// std::ranges::copy(contiguous_iterator<int*>, sentinel_wrapper<contiguous_iterator<int*>>, contiguous_iterator<int*>) doesn't seem to work.
+// It seems that std::ranges::copy calls std::copy, which unwraps contiguous_iterator<int*> into int*,
+// and then it failed because there is no == between int* and sentinel_wrapper<contiguous_iterator<int*>>
+template <typename Iter>
+using SentinelWorkaround = std::conditional_t<std::contiguous_iterator<Iter>, Iter, sentinel_wrapper<Iter>>;
+
 template <class In1, class In2, class Out, std::size_t N1, std::size_t N2, std::size_t N3>
 constexpr void testSetIntersectionImpl(std::array<int, N1> in1, std::array<int, N2> in2, std::array<int, N3> expected) {
-  ...
[truncated]

Copy link

github-actions bot commented Dec 12, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@mordante
Copy link
Member

Thanks for the patch. If I understand it right this might violate the complexity requirements in http://eel.is/c++draft/lower.bound#4. Is that correct?

@ichaer
Copy link
Contributor Author

ichaer commented Dec 12, 2023

Hi Mark, thanks for your reply!

Strictly speaking, yes, that is true: the number of comparisons may violate the requirement of there being at most log(last - first) + O(1) projections and comparisons. However, that is only being done for non-random access iterators, where we are also executing 2*(last-first) iterator mutations, which are not part of the explicit algorithm guarantees, and are the reason why std::set and std::map have their own lower_bound -- std::lower_bound is really hard to use on non-random iterators because of this additional cost.

Would this be a blocker for accepting the contribution?

@philnik777
Copy link
Contributor

Hi Mark, thanks for your reply!

Strictly speaking, yes, that is true: the number of comparisons may violate the requirement of there being at most log(last - first) + O(1) projections and comparisons. However, that is only being done for non-random access iterators, where we are also executing 2*(last-first) iterator mutations, which are not part of the explicit algorithm guarantees, and are the reason why std::set and std::map have their own lower_bound -- std::lower_bound is really hard to use on non-random iterators because of this additional cost.

Would this be a blocker for accepting the contribution?

Yes. We have to stay standards-conforming - libc++ is an implementation of the C++ standard after all, so even if the trade-off seems like a good idea we can't make the call in this case. If this improves the performance for trivial calls like an integer and std::less or friends as the comparator, that should be allowable, since that's not user-visible. Anything as generic as you currently have won't be possible.

@ichaer
Copy link
Contributor Author

ichaer commented Dec 12, 2023

OK, that's fair. How about I just stop exposing one-sided lower bound through std::lower_bound, and just use it for std::set_intersection, where all complexity guarantees are preserved, would that be OK?

@philnik777
Copy link
Contributor

OK, that's fair. How about I just stop exposing one-sided lower bound through std::lower_bound, and just use it for std::set_intersection, where all complexity guarantees are preserved, would that be OK?

If all the complexity guarantees are met, that would be OK. We always like to see some benchmarks to have some data that the optimization actually improves performance.

@ichaer
Copy link
Contributor Author

ichaer commented Dec 12, 2023

SGTM! I'll amend my submission, thanks!

@mordante
Copy link
Member

SGTM! I'll amend my submission, thanks!

Great.

As @philnik777 said we need to be careful about observable behaviour. I love to see performance improvements :-)

@ichaer
Copy link
Contributor Author

ichaer commented Dec 12, 2023

Cool, thanks :). Just for context, this code is actually being used in production out there. I work for Splunk, and we've implemented this outside of the standard library to speed up a component that does a number of set intersections, overlaying containers on top of each other (it's more complex, but that's the relevant part in a nutshell). The huge majority of the sets involved contain a tiny subset of what the target contains, so this makes a huge difference, and we thought we should contribute upstream.

Obviously I won't ask you to take me on my word :). Code is truth, I'll spend a bit of time understanding existing benchmarks in the LLVM repo to include something relevant in the pull request.

@philnik777
Copy link
Contributor

I don't think you're wrong that this improves performance, but being able to point to hard data is always great. With algorithms it's almost always really easy to write micro benchmarks, so there we really like to have them.

@ichaer
Copy link
Contributor Author

ichaer commented Dec 13, 2023

Sounds good to me! I'll be out on holiday for a couple of weeks and should be able to update this PR in early January. Thanks for your feedback so far, I appreciate it :).

…)` must start with `__step==0`, otherwise we can't match the complexity of linear search when continually matching (like a std::set_intersection() of matching containers will).
@ldionne
Copy link
Member

ldionne commented May 24, 2024

Also let's add a release note for this in libcxx/docs/ReleaseNotes/19.rst

@ldionne ldionne added this to the LLVM 19.X Release milestone Jul 10, 2024
Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically LGTM but I have a few small comments left.

_InForwardIter1 __first1_next =
std::__lower_bound_onesided<_AlgPolicy>(__first1, __last1, *__first2, __comp, __proj);
std::swap(__first1_next, __first1);
std::__set_intersection_add_output_unless(__first1 != __first1_next, __first1, __first2, __result, __prev_advanced);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the algorithm would be easier to read if we inlined this function at its two call sites. Why do we need to keep track of whether we previously advanced the iterators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe I should have documented that! It's a (hopefully :)) clever optimisation to preserve the number of comparisons. In set_intersection() we compute a==b by means of !(a<b || b<a), right? We can avoid further comparison operator calls by checking whether we've advanced any side. If, at any time, we have two subsequent attempts to advance iterators failing, then we know elements are equal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Maybe simply calling it an optimisation is unfair. This is what allows us to preserve the complexity guarantees from the current standard.)

std::forward_iterator_tag,
std::forward_iterator_tag) {
_LIBCPP_CONSTEXPR std::__identity __proj;
bool __prev_advanced = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels off that we start with __prev_advanced = true. I'm not saying the algorithm's wrong, but is it possible that another name for this variable is trying to emerge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. Would you be happier with something like __maybe_equals? I'd have to invert the logic, but maybe that would help make it more readable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might help, yes!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I've renamed the variables and added a comment at the point of call which I'm hoping captures what I said here :). Let me know what you think, and whether you'd still prefer for the code inside __set_intersection_add_output_if_equal() to be inlined. My personal feeling about it is that the function wraps a fiddly idea, and that it's easier for me to unpack the algorithm when I give that idea a name, but I defer to you.

* s/__set_intersection_add_output_unless/__set_intersection_add_output_if_equal/
* Add comments to explain the logic for equality comparison
* llvm/main: (8718 commits)
  [LLVM] Add `llvm.experimental.vector.compress` intrinsic (llvm#92289)
  [Clang] [C23] Implement N2653: u8 strings are char8_t[] (llvm#97208)
  [SLP][REVEC] Make SLP support revectorization (-slp-revec) and add simple test. (llvm#98269)
  [Flang] Exclude the reference to TIME_UTC for AIX. (llvm#99069)
  [clang][Sema] Improve `Sema::CheckCXXDefaultArguments` (llvm#97338)
  adjust the Xtensa backend after change f270a4d
  [lldb][Bazel]: Second attempt to adapt for a751f65
  Revert "[lldb][Bazel]: Adapt BUILD.bazel file for a751f65"
  [lldb][Bazel]: Adapt BUILD.bazel file for a751f65
  [emacs] Fix autoloading for llvm-mir-mode (llvm#98984)
  [SLP]Improve minbitwidth analysis for trun'ed gather nodes.
  [Clang][Concepts] Avoid substituting into constraints for invalid TemplateDecls (llvm#75697)
  [LV][NFC]Introduce isScalableVectorizationAllowed() to refactor getMaxLegalScalableVF().
  Revert "[AArch64] Remove superfluous sxtw in peephole opt (llvm#96293)"
  [AArch64][GISel] Add test cases for folding shifts into load/store addressing modes (NFC)
  [LV] Process dead interleave pointer ops in reverse order.
  [AMDGPU] Remove SIWholeQuadMode pass early exit (llvm#98450)
  [InstrRef][NFC] Avoid un-necessary DenseMap queries (llvm#99048)
  [gn build] Port e94e72a
  [LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (llvm#95819)
  ...
Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ichaer I personally find the helper unhelpful in understanding the algorithm, but I'm OK with merging as-is. Will merge once CI is green.

Thanks for the patch and for all the iterations you did on this.

@ichaer
Copy link
Contributor Author

ichaer commented Jul 17, 2024

Thank you, @ldionne :).

@ldionne ldionne merged commit a066217 into llvm:main Jul 18, 2024
55 checks passed
sgundapa pushed a commit to sgundapa/upstream_effort that referenced this pull request Jul 23, 2024
…of non-matching elements with one-sided binary search. (llvm#75230)

One-sided binary search, aka meta binary search, has been in the public
domain for decades, and has the general advantage of being constant time
in the best case, with the downside of executing at most 2*log(N)
comparisons vs classic binary search's exact log(N). There are two
scenarios in which it really shines: the first one is when operating
over non-random-access iterators, because the classic algorithm requires
knowing the container's size upfront, which adds N iterator increments
to the complexity. The second one is when traversing the container in
order, trying to fast-forward to the next value: in that case the
classic algorithm requires at least O(N*log(N)) comparisons and, for
non-random-access iterators, O(N^2) iterator increments, whereas the
one-sided version will yield O(N) operations on both counts, with a
best-case of O(log(N)) comparisons which is very common in practice.
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
…of non-matching elements with one-sided binary search. (#75230)

One-sided binary search, aka meta binary search, has been in the public
domain for decades, and has the general advantage of being constant time
in the best case, with the downside of executing at most 2*log(N)
comparisons vs classic binary search's exact log(N). There are two
scenarios in which it really shines: the first one is when operating
over non-random-access iterators, because the classic algorithm requires
knowing the container's size upfront, which adds N iterator increments
to the complexity. The second one is when traversing the container in
order, trying to fast-forward to the next value: in that case the
classic algorithm requires at least O(N*log(N)) comparisons and, for
non-random-access iterators, O(N^2) iterator increments, whereas the
one-sided version will yield O(N) operations on both counts, with a
best-case of O(log(N)) comparisons which is very common in practice.
@alexfh
Copy link
Contributor

alexfh commented Jul 31, 2024

Thanks for the nice optimization!

However, I'd like to point out one real-world complication it brings: it breaks some invalid usages of std::set_intersection in a hard-to-debug way. I've just investigated a test failure that boiled down to unsorted sequences passed to std::set_intersection. The only clue was that the failure started appearing after this libc++ commit. Without this hint I can imagine many more hours of poking around.

I wonder, if it would be possible to add runtime checks in one of the hardening modes to assist with finding this sort of an issue? For example, a check for the correct ordering of the bounds could be made on each step of the binary search. This should at least retain the complexity. A more comprehensive check (ordering of all elements in both input sequences) could be done in _LIBCPP_HARDENING_MODE_EXTENSIVE or _LIBCPP_HARDENING_MODE_DEBUG mode. What do you think about this?

An example of a problematic use: https://gcc.godbolt.org/z/Y5dxsxzbe.

@ichaer
Copy link
Contributor Author

ichaer commented Jul 31, 2024

@alexfh, I did think of that when proposing the change, but I couldn't think of a good way to help users. Your idea of adding validation to the hardening modes sounds good to me, I've got my hands full right now, but I'll open a PR for that as soon as I can.

The case you shared, with {944, 3982, 10041, 443} and {10041}, is easy to detect cheaply, but if the last entry were 3982 instead it would be trickier... I'm torn on this. What did you have in mind, a full scan over inputs to check if they are sorted, or the cheap but much weaker validation that we can't reach a value which is less than the previous one we've inspected?

Edit: I was going to say that I think that the cheap check is too weak to be valuable, but I'm really torn on this... We do have a full linear scan in std::stable_sort() via __check_strict_weak_ordering_sorted(), and the thing really does only add a constant-time factor, even if it is multiplicative. On the other hand, the general unordered range will likely have many items out of order, and sampling log(n) of them sounds like it could be good enough in many, if not most, cases, and a slow accurate validation can be less useful than a fast but not-great one... tradeoffs, tradeoffs. Maybe the way to go is to use the weaker one with _LIBCPP_HARDENING_MODE_EXTENSIVE and the stricter one with _LIBCPP_HARDENING_MODE_DEBUG?

@alexfh
Copy link
Contributor

alexfh commented Jul 31, 2024

I didn't think too much about tradeoffs here, but it seems like a full linear scan in _LIBCPP_HARDENING_MODE_DEBUG should be totally uncontroversial.

As for a cheaper check for _LIBCPP_HARDENING_MODE_EXTENSIVE or even _LIBCPP_HARDENING_MODE_FAST, a log(n) sampling while doing the binary search could be helpful and cheap enough, but I'd actually expect some measurements there, especially in case of _LIBCPP_HARDENING_MODE_FAST.

@philnik777
Copy link
Contributor

@alexfh Can this result in any security-concerning bugs like out-of-bounds reads? If not I'd just go for a _LIBCPP_ASSERT_WHATEVER(std::is_sorted(__first, __last, __comp), "Passed unsored range to set_intersection") that only runs in debug mode and call it a day. Asking for people to run the debug mode when debugging stuff seems reasonable to me.

@ichaer
Copy link
Contributor Author

ichaer commented Jul 31, 2024

No out-of-bounds reads, we still respect the end sentinel. What happens is that we now skip over potential matches, and if the sequence is unsorted we may not go back to see what we skipped over.

ichaer added a commit to ichaer/llvm-project-onesided_lower_bound that referenced this pull request Aug 1, 2024
The use of one-sided binary search introduced by a066217 changes behaviour on invalid, unsorted input (see llvm#75230 (comment)). Add input validation on `_LIBCPP_HARDENING_MODE_DEBUG` to help users.

* Change interface of `__is_sorted_until()` so that it accepts a sentinel that's of a different type than the beginning iterator, and to ensure it won't try to copy the comparison function object.
* Add one assertion for each input range confirming that they are sorted.
* Stop validating complexity of `set_intersection()` in debug mode, it's hopeless and also not meaningful: there are no complexity guarantees in debug mode, we're happy to trade performance for diagnosability.
* Fix bugs in `ranges_robust_against_differing_projections.pass`: we were using an input range as output for `std::ranges::partial_sort_copy()`, and using projections which return the opposite value means that algorithms requiring a sorted range can't use ranges sorted with ascending values if the comparator is `std::ranges::less`. Added `const` where appropriate to make sure we weren't using inputs as outputs in other places.
@ichaer
Copy link
Contributor Author

ichaer commented Aug 1, 2024

@alexfh, @philnik777: I've opened #101508 with what we discussed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants