winner245
Contributor

@winner245 winner245 commented Mar 25, 2025

Previously, the segmented iterator optimization was limited to std::{for_each, for_each_n}. This patch extends the optimization to std::ranges::for_each and std::ranges::for_each_n, so that the classic and ranges algorithms are optimized consistently. The patch first generalizes the std algorithms by introducing a Projection parameter, which is set to __identity for the std algorithms. The ranges algorithms then call their std counterparts directly, passing a general __proj argument. Benchmarks demonstrate performance improvements of up to 21.4x for std::deque<short>::iterator and 22.3x for join_view of vector<vector<short>>.

Addresses a subtask of #102817.
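
To illustrate the generalization described above, here is a minimal sketch of one projection-aware core loop shared by a classic-style and a ranges-style entry point. All names (`identity_proj`, `for_each_core`, `classic_for_each`, `ranges_style_for_each`) are hypothetical stand-ins chosen for this example; they are not the libc++ internals, and the sketch omits the segmented dispatch itself.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for an identity projection (not libc++'s __identity).
struct identity_proj {
  template <class T>
  T&& operator()(T&& t) const noexcept { return std::forward<T>(t); }
};

// Hypothetical shared core: applies f to proj(*it) for every element.
template <class Iter, class Sent, class Func, class Proj>
Iter for_each_core(Iter first, Sent last, Func& f, Proj& proj) {
  for (; first != last; ++first)
    std::invoke(f, std::invoke(proj, *first));
  return first;
}

// Classic-style entry point: fixes the projection to the identity.
template <class Iter, class Func>
Func classic_for_each(Iter first, Iter last, Func f) {
  identity_proj proj;
  for_each_core(first, last, f, proj);
  return f;
}

// Ranges-style entry point: forwards a caller-supplied projection to the same core.
template <class Iter, class Sent, class Func, class Proj = identity_proj>
std::pair<Iter, Func> ranges_style_for_each(Iter first, Sent last, Func f, Proj proj = {}) {
  Iter end = for_each_core(first, last, f, proj);
  return {std::move(end), std::move(f)};
}

struct Item { int weight; };

// Usage: both entry points go through for_each_core.
int demo_projection() {
  std::deque<int> values{1, 2, 3};
  classic_for_each(values.begin(), values.end(), [](int& v) { v *= 2; });
  assert(values[0] == 2 && values[2] == 6);

  std::deque<Item> items{{1}, {2}, {3}};
  int sum = 0;
  ranges_style_for_each(items.begin(), items.end(), [&](int w) { sum += w; }, &Item::weight);
  return sum;
}
```

Because both entry points share one core, an optimization placed in the core would reach both call styles, which is the consistency this patch is after.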

Summary of speedups for deque iterators

-------------------------------------------------------------------------------
Benchmark                        deque<char>    deque<short>    deque<int>
-------------------------------------------------------------------------------
rng::for_each                       14.4x          21.4x           4.6x
rng::for_each_n                     12.9x          15.5x           4.1x
-------------------------------------------------------------------------------

Summary of speedups for join_view iterators

-----------------------------------------------------------------------------------------
Benchmark          vector<vector<char>>    vector<vector<short>>    vector<vector<int>>
-----------------------------------------------------------------------------------------
rng::for_each             19.0x                   22.3x                    4.8x
rng::for_each_n           16.3x                   20.1x                    3.9x
-----------------------------------------------------------------------------------------
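
For readers unfamiliar with segmented traversal, the sketch below shows the general idea using an explicit vector of segments: an outer loop walks the segments and an inner loop runs over each segment's contiguous storage, so the hot loop contains no per-element segment-boundary check. The helper `for_each_segmented` is hypothetical and only illustrates the technique; libc++ expresses it through its internal segmented-iterator traits rather than through a nested vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: visit every element of a "vector of segments".
template <class T, class Func>
Func for_each_segmented(std::vector<std::vector<T>>& segments, Func f) {
  for (auto& segment : segments) {
    // Inner loop over one contiguous segment: plain pointer iteration.
    T* first = segment.data();
    T* last  = first + segment.size();
    for (; first != last; ++first)
      f(*first);
  }
  return f;
}

// Usage: increments every element and returns the sum of the new values.
int demo_segmented_sum() {
  std::vector<std::vector<int>> segments{{0}, {1, 2}, {}, {3, 4, 5}};
  int sum = 0;
  for_each_segmented(segments, [&](int& x) {
    x += 1;
    sum += x;
  });
  assert(segments[3][2] == 6);
  return sum;
}
```

Empty segments are handled naturally because the inner loop simply runs zero times.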

Benchmarks:

std::ranges::for_each with deque iterators

--------------------------------------------------------------------------
Benchmark                                    Before        After   Speedup
--------------------------------------------------------------------------
rng::for_each(deque<char>)/8                 8.39 ns      2.63 ns    3.2x
rng::for_each(deque<char>)/32               28.70 ns      3.05 ns    9.4x
rng::for_each(deque<char>)/50               42.00 ns      4.53 ns    9.3x
rng::for_each(deque<char>)/1024            657.00 ns     45.60 ns   14.4x
rng::for_each(deque<char>)/4096           2272.00 ns    169.00 ns   13.4x
rng::for_each(deque<char>)/8192           4525.00 ns    355.00 ns   12.7x
rng::for_each(deque<char>)/16384          9445.00 ns    722.00 ns   13.1x
rng::for_each(deque<char>)/65536         36880.00 ns   2902.00 ns   12.7x
rng::for_each(deque<char>)/262144       157774.00 ns  11577.00 ns   13.6x
rng::for_each(deque<short>)/8                5.70 ns      1.62 ns    3.5x
rng::for_each(deque<short>)/32              26.80 ns      1.69 ns   15.9x
rng::for_each(deque<short>)/50              38.40 ns      3.06 ns   12.5x
rng::for_each(deque<short>)/1024           700.00 ns     40.40 ns   17.3x
rng::for_each(deque<short>)/4096          2782.00 ns    133.00 ns   20.9x
rng::for_each(deque<short>)/8192          5554.00 ns    260.00 ns   21.4x
rng::for_each(deque<short>)/16384        11093.00 ns    521.00 ns   21.3x
rng::for_each(deque<short>)/65536        44035.00 ns   2495.00 ns   17.6x
rng::for_each(deque<short>)/262144      177784.00 ns   9915.00 ns   17.9x
rng::for_each(deque<int>)/8                 5.43 ns       3.00 ns    1.8x
rng::for_each(deque<int>)/32               25.50 ns       5.60 ns    4.6x
rng::for_each(deque<int>)/50               38.50 ns       8.61 ns    4.5x
rng::for_each(deque<int>)/1024            706.00 ns     169.00 ns    4.2x
rng::for_each(deque<int>)/4096           2789.00 ns     670.00 ns    4.2x
rng::for_each(deque<int>)/8192           5547.00 ns    1330.00 ns    4.2x
rng::for_each(deque<int>)/16384         11098.00 ns    2711.00 ns    4.1x
rng::for_each(deque<int>)/65536         44404.00 ns   10709.00 ns    4.1x
rng::for_each(deque<int>)/262144       180739.00 ns   43645.00 ns    4.1x

std::ranges::for_each_n with deque iterators

--------------------------------------------------------------------------
Benchmark                                    Before        After   Speedup
--------------------------------------------------------------------------
rng::for_each_n(deque<char>)/8              8.22 ns       3.28 ns     2.5x
rng::for_each_n(deque<char>)/32             28.5 ns       3.66 ns     7.8x
rng::for_each_n(deque<char>)/50             37.6 ns       6.15 ns     6.1x
rng::for_each_n(deque<char>)/1024            590 ns       47.0 ns    12.6x
rng::for_each_n(deque<char>)/4096           2151 ns        167 ns    12.9x
rng::for_each_n(deque<char>)/8192           4199 ns        344 ns    12.2x
rng::for_each_n(deque<char>)/16384          8626 ns        701 ns    12.3x
rng::for_each_n(deque<char>)/65536         33613 ns       2845 ns    11.8x
rng::for_each_n(deque<char>)/262144       132493 ns      11291 ns    11.7x
rng::for_each_n(deque<short>)/8             6.53 ns       3.72 ns     1.8x
rng::for_each_n(deque<short>)/32            23.2 ns       3.75 ns     6.2x
rng::for_each_n(deque<short>)/50            32.7 ns       5.54 ns     5.9x
rng::for_each_n(deque<short>)/1024           560 ns       37.4 ns    15.0x
rng::for_each_n(deque<short>)/4096          2105 ns        136 ns    15.5x
rng::for_each_n(deque<short>)/8192          3981 ns        264 ns    15.1x
rng::for_each_n(deque<short>)/16384         7736 ns        525 ns    14.7x
rng::for_each_n(deque<short>)/65536        30359 ns       2459 ns    12.3x
rng::for_each_n(deque<short>)/262144      121006 ns       9852 ns    12.3x
rng::for_each_n(deque<int>)/8               5.59 ns       4.16 ns     1.3x
rng::for_each_n(deque<int>)/32              19.9 ns       6.89 ns     2.9x
rng::for_each_n(deque<int>)/50              32.6 ns       10.1 ns     3.2x
rng::for_each_n(deque<int>)/1024             605 ns        180 ns     3.4x
rng::for_each_n(deque<int>)/4096            2517 ns        715 ns     3.5x
rng::for_each_n(deque<int>)/8192            4942 ns       1431 ns     3.5x
rng::for_each_n(deque<int>)/16384           9809 ns       2906 ns     3.4x
rng::for_each_n(deque<int>)/65536          40199 ns      11316 ns     3.6x
rng::for_each_n(deque<int>)/262144        181371 ns      44000 ns     4.1x

std::ranges::for_each with join_view iterators

----------------------------------------------------------------------------------------------------
Benchmark                                                       Before           After       Speedup
----------------------------------------------------------------------------------------------------
rng::for_each(join_view(vector<vector<char>>)/8                7.02 ns         2.58 ns         2.7x
rng::for_each(join_view(vector<vector<char>>)/32               32.1 ns         3.03 ns        10.6x
rng::for_each(join_view(vector<vector<char>>)/50               45.2 ns         5.34 ns         8.5x
rng::for_each(join_view(vector<vector<char>>)/1024              782 ns         43.4 ns        18.0x
rng::for_each(join_view(vector<vector<char>>)/4096             3113 ns          168 ns        18.5x
rng::for_each(join_view(vector<vector<char>>)/8192             6231 ns          339 ns        18.4x
rng::for_each(join_view(vector<vector<char>>)/16384           12783 ns          691 ns        18.5x
rng::for_each(join_view(vector<vector<char>>)/65536           53732 ns         2829 ns        19.0x
rng::for_each(join_view(vector<vector<char>>)/262144         210286 ns        11241 ns        18.7x
rng::for_each(join_view(vector<vector<short>>)/8               7.46 ns         2.40 ns         3.1x
rng::for_each(join_view(vector<vector<short>>)/32              33.4 ns         2.81 ns        11.9x
rng::for_each(join_view(vector<vector<short>>)/50              46.1 ns         5.66 ns         8.1x
rng::for_each(join_view(vector<vector<short>>)/1024             791 ns         37.0 ns        21.4x
rng::for_each(join_view(vector<vector<short>>)/4096            3183 ns          149 ns        21.4x
rng::for_each(join_view(vector<vector<short>>)/8192            6360 ns          292 ns        21.8x
rng::for_each(join_view(vector<vector<short>>)/16384          12825 ns          574 ns        22.3x
rng::for_each(join_view(vector<vector<short>>)/65536          51638 ns         2745 ns        18.8x
rng::for_each(join_view(vector<vector<short>>)/262144        210929 ns        10964 ns        19.2x
rng::for_each(join_view(vector<vector<int>>)/8                 7.04 ns         3.02 ns         2.3x
rng::for_each(join_view(vector<vector<int>>)/32                24.4 ns         6.62 ns         3.7x
rng::for_each(join_view(vector<vector<int>>)/50                47.6 ns         9.91 ns         4.8x
rng::for_each(join_view(vector<vector<int>>)/1024               727 ns          180 ns         4.0x
rng::for_each(join_view(vector<vector<int>>)/4096              3110 ns          748 ns         4.2x
rng::for_each(join_view(vector<vector<int>>)/8192              6193 ns         1480 ns         4.2x
rng::for_each(join_view(vector<vector<int>>)/16384            12391 ns         2993 ns         4.1x
rng::for_each(join_view(vector<vector<int>>)/65536            49505 ns        11950 ns         4.1x
rng::for_each(join_view(vector<vector<int>>)/262144          199253 ns        47921 ns         4.2x

std::ranges::for_each_n with join_view iterators

----------------------------------------------------------------------------------------------------
Benchmark                                                       Before           After       Speedup
----------------------------------------------------------------------------------------------------
rng::for_each_n(join_view(vector<vector<char>>)/8              7.97 ns         2.82 ns         2.8x
rng::for_each_n(join_view(vector<vector<char>>)/32             28.7 ns         3.29 ns         8.7x
rng::for_each_n(join_view(vector<vector<char>>)/50             42.8 ns         6.24 ns         6.9x
rng::for_each_n(join_view(vector<vector<char>>)/1024            728 ns         45.5 ns        16.0x
rng::for_each_n(join_view(vector<vector<char>>)/4096           2891 ns          177 ns        16.3x
rng::for_each_n(join_view(vector<vector<char>>)/8192           5769 ns          359 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/16384         11576 ns          720 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/65536         46525 ns         2889 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/262144       186093 ns        11640 ns        16.0x
rng::for_each_n(join_view(vector<vector<short>>)/8             6.95 ns         3.32 ns         2.1x
rng::for_each_n(join_view(vector<vector<short>>)/32            29.4 ns         3.30 ns         8.9x
rng::for_each_n(join_view(vector<vector<short>>)/50            40.8 ns         5.58 ns         7.3x
rng::for_each_n(join_view(vector<vector<short>>)/1024           719 ns         35.9 ns        20.0x
rng::for_each_n(join_view(vector<vector<short>>)/4096          2875 ns          144 ns        20.0x
rng::for_each_n(join_view(vector<vector<short>>)/8192          5632 ns          283 ns        19.9x
rng::for_each_n(join_view(vector<vector<short>>)/16384        11481 ns          570 ns        20.1x
rng::for_each_n(join_view(vector<vector<short>>)/65536        45355 ns         2616 ns        17.3x
rng::for_each_n(join_view(vector<vector<short>>)/262144      181890 ns        10958 ns        16.6x
rng::for_each_n(join_view(vector<vector<int>>)/8               6.61 ns         3.49 ns         1.9x
rng::for_each_n(join_view(vector<vector<int>>)/32              27.5 ns         7.09 ns         3.9x
rng::for_each_n(join_view(vector<vector<int>>)/50              40.4 ns         10.5 ns         3.8x
rng::for_each_n(join_view(vector<vector<int>>)/1024             674 ns          188 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/4096            2717 ns          766 ns         3.5x
rng::for_each_n(join_view(vector<vector<int>>)/8192            5422 ns         1524 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/16384          11024 ns         3037 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/65536          44197 ns        12159 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/262144        175819 ns        48274 ns         3.6x

@winner245 winner245 marked this pull request as ready for review March 25, 2025 15:59
@winner245 winner245 requested a review from a team as a code owner March 25, 2025 15:59
@llvmbot llvmbot added the libc++ label (libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.) Mar 25, 2025
@llvmbot
Member

llvmbot commented Mar 25, 2025

@llvm/pr-subscribers-libcxx

Author: Peng Liu (winner245)

Changes

This patch extends segmented iterator optimizations, previously applied to std::for_each, to std::for_each_n, std::ranges::for_each, and std::ranges::for_each_n by forwarding to std::for_each. New tests validate these optimizations for segmented iterators (e.g., deque<int> and join_view iterators). Benchmarks demonstrate up to 3.9x performance improvement for deque<int> iterators, aligning their performance with contiguous iterators (e.g., vector<int>). The vector<int> performance serves as a baseline for contiguous iterators, representing the upper bound for segmented deque<int> inputs.

Addresses a subtask of #102817.
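
The forwarding described above can be sketched as follows: for random-access iterators the end iterator is computed up front and the work is handed to `std::for_each`, while other iterators keep the counting loop. The name `for_each_n_sketch` is hypothetical, and the sketch dispatches with `if constexpr` on the iterator category instead of the `__enable_if_t` overloads used in the diff; it is an illustration under those assumptions, not the libc++ implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <iterator>
#include <list>
#include <type_traits>

// Hypothetical sketch of for_each_n forwarding to std::for_each.
template <class Iter, class Size, class Func>
Iter for_each_n_sketch(Iter first, Size n, Func f) {
  using traits   = std::iterator_traits<Iter>;
  using category = typename traits::iterator_category;
  if constexpr (std::is_base_of_v<std::random_access_iterator_tag, category>) {
    // Random access: compute the end once, then reuse std::for_each.
    Iter last = first + static_cast<typename traits::difference_type>(n);
    std::for_each(first, last, f);
    return last;
  } else {
    // Fallback: count elements one at a time.
    for (; n > 0; --n, ++first)
      f(*first);
    return first;
  }
}

// Usage: exercises both branches.
bool demo_for_each_n_sketch() {
  std::deque<int> d(5, 1);
  auto it = for_each_n_sketch(d.begin(), 3, [](int& x) { x *= 10; });
  bool deque_ok = it == d.begin() + 3 && d[0] == 10 && d[2] == 10 && d[3] == 1;

  std::list<int> l(4, 1);
  auto lit = for_each_n_sketch(l.begin(), 2, [](int& x) { x += 1; });
  bool list_ok = lit == std::next(l.begin(), 2) && l.front() == 2 && l.back() == 1;
  return deque_ok && list_ok;
}
```

Whether the forwarded call is actually faster depends on `std::for_each` having a segmented fast path for the iterator type, as it does for deque iterators in libc++ according to this PR.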

for_each_n

--------------------------------------------------------------------------------
Benchmark                                       Before          After    Speedup
--------------------------------------------------------------------------------
std::for_each_n(deque<int>)/8                  5.31 ns         3.39 ns      1.6x
std::for_each_n(deque<int>)/32                 20.1 ns         6.89 ns      2.9x
std::for_each_n(deque<int>)/1024                612 ns          171 ns      3.6x
std::for_each_n(deque<int>)/8192               4892 ns         1350 ns      3.6x
std::for_each_n(deque<int>)/16384              9786 ns         2774 ns      3.5x
std::for_each_n(deque<int>)/65536             39026 ns        11339 ns      3.4x
std::for_each_n(deque<int>)/262144           157897 ns        45166 ns      3.5x
std::for_each_n(deque<int>)/1048576          643836 ns       184999 ns      3.5x
rng::for_each_n(deque<int>)/8                  4.85 ns         4.94 ns      1.0x
rng::for_each_n(deque<int>)/32                 18.1 ns         8.47 ns      2.1x
rng::for_each_n(deque<int>)/1024                622 ns          171 ns      3.6x
rng::for_each_n(deque<int>)/8192               5008 ns         1363 ns      3.7x
rng::for_each_n(deque<int>)/16384              9952 ns         2744 ns      3.6x
rng::for_each_n(deque<int>)/65536             40204 ns        10841 ns      3.7x
rng::for_each_n(deque<int>)/262144           157713 ns        43386 ns      3.6x
rng::for_each_n(deque<int>)/1048576          637549 ns       177042 ns      3.6x
std::for_each_n(vector<int>)/8                 2.91 ns         2.94 ns      1.0x
std::for_each_n(vector<int>)/32                5.42 ns         5.54 ns      1.0x
std::for_each_n(vector<int>)/1024               161 ns          165 ns      1.0x
std::for_each_n(vector<int>)/8192              1271 ns         1292 ns      1.0x
std::for_each_n(vector<int>)/16384             2556 ns         2619 ns      1.0x
std::for_each_n(vector<int>)/65536            10125 ns        10659 ns      1.0x
std::for_each_n(vector<int>)/262144           44572 ns        44372 ns      1.0x
std::for_each_n(vector<int>)/1048576         180804 ns       183389 ns      1.0x
rng::for_each_n(vector<int>)/8                 3.05 ns         3.05 ns      1.0x
rng::for_each_n(vector<int>)/32                5.71 ns         5.85 ns      1.0x
rng::for_each_n(vector<int>)/1024               167 ns          183 ns      0.9x
rng::for_each_n(vector<int>)/8192              1298 ns         1429 ns      0.9x
rng::for_each_n(vector<int>)/16384             2691 ns         2870 ns      0.9x
rng::for_each_n(vector<int>)/65536            10632 ns        11465 ns      0.9x
rng::for_each_n(vector<int>)/262144           53031 ns        45948 ns      1.2x
rng::for_each_n(vector<int>)/1048576         174328 ns       184270 ns      0.9x

for_each

--------------------------------------------------------------------------------
Benchmark                                     Before           After     Speedup
--------------------------------------------------------------------------------
std::for_each(deque<int>)/8                  3.18 ns         2.96 ns        1.1x
std::for_each(deque<int>)/32                 5.70 ns         5.54 ns        1.0x
std::for_each(deque<int>)/1024                183 ns          180 ns        1.0x
std::for_each(deque<int>)/8192               1435 ns         1422 ns        1.0x
std::for_each(deque<int>)/16384              2885 ns         2879 ns        1.0x
std::for_each(deque<int>)/65536             11423 ns        11378 ns        1.0x
std::for_each(deque<int>)/262144            45203 ns        43686 ns        1.0x
std::for_each(deque<int>)/1048576          181832 ns       173832 ns        1.0x
rng::for_each(deque<int>)/8                  5.10 ns         3.75 ns        1.4x
rng::for_each(deque<int>)/32                 23.5 ns         7.49 ns        3.1x
rng::for_each(deque<int>)/1024                693 ns          184 ns        3.8x
rng::for_each(deque<int>)/8192               5522 ns         1430 ns        3.9x
rng::for_each(deque<int>)/16384             11112 ns         2930 ns        3.8x
rng::for_each(deque<int>)/65536             44390 ns        11656 ns        3.8x
rng::for_each(deque<int>)/262144           179419 ns        46582 ns        3.9x
rng::for_each(deque<int>)/1048576          711406 ns       189658 ns        3.8x
std::for_each(vector<int>)/8                 2.96 ns         2.91 ns        1.0x
std::for_each(vector<int>)/32                5.54 ns         5.49 ns        1.0x
std::for_each(vector<int>)/1024               165 ns          162 ns        1.0x
std::for_each(vector<int>)/8192              1269 ns         1257 ns        1.0x
std::for_each(vector<int>)/16384             2636 ns         2567 ns        1.0x
std::for_each(vector<int>)/65536            10231 ns        10215 ns        1.0x
std::for_each(vector<int>)/262144           41544 ns        40719 ns        1.0x
std::for_each(vector<int>)/1048576         173667 ns       167878 ns        1.0x
rng::for_each(vector<int>)/8                 3.09 ns         3.06 ns        1.0x
rng::for_each(vector<int>)/32                5.85 ns         5.77 ns        1.0x
rng::for_each(vector<int>)/1024               179 ns          168 ns        1.1x
rng::for_each(vector<int>)/8192              1346 ns         1309 ns        1.0x
rng::for_each(vector<int>)/16384             2714 ns         2664 ns        1.0x
rng::for_each(vector<int>)/65536            10979 ns        10523 ns        1.0x
rng::for_each(vector<int>)/262144           42994 ns        42535 ns        1.0x
rng::for_each(vector<int>)/1048576         175633 ns       173933 ns        1.0x

Full diff: https://github.com/llvm/llvm-project/pull/132896.diff

8 Files Affected:

  • (modified) libcxx/include/__algorithm/for_each_n.h (+24-1)
  • (modified) libcxx/include/__algorithm/ranges_for_each.h (+11-3)
  • (modified) libcxx/include/__algorithm/ranges_for_each_n.h (+11-4)
  • (added) libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp (+57)
  • (modified) libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp (+1-1)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp (+82-38)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp (+41-5)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp (+44-2)
diff --git a/libcxx/include/__algorithm/for_each_n.h b/libcxx/include/__algorithm/for_each_n.h
index fce380b49df3e..3d91124432f56 100644
--- a/libcxx/include/__algorithm/for_each_n.h
+++ b/libcxx/include/__algorithm/for_each_n.h
@@ -10,7 +10,11 @@
 #ifndef _LIBCPP___ALGORITHM_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__config>
+#include <__iterator/iterator_traits.h>
+#include <__iterator/segmented_iterator.h>
+#include <__type_traits/enable_if.h>
 #include <__utility/convert_to_integral.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
@@ -21,7 +25,13 @@ _LIBCPP_BEGIN_NAMESPACE_STD
 
 #if _LIBCPP_STD_VER >= 17
 
-template <class _InputIterator, class _Size, class _Function>
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<!__is_segmented_iterator<_InputIterator>::value ||
+                            (__has_input_iterator_category<_InputIterator>::value &&
+                             !__has_random_access_iterator_category<_InputIterator>::value),
+                        int> = 0>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
 for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   typedef decltype(std::__convert_to_integral(__orig_n)) _IntegralSize;
@@ -34,6 +44,19 @@ for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   return __first;
 }
 
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<__is_segmented_iterator<_InputIterator>::value &&
+                            __has_random_access_iterator_category<_InputIterator>::value,
+                        int> = 0>
+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
+for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
+  _InputIterator __last = __first + __orig_n;
+  std::for_each(__first, __last, __f);
+  return __last;
+}
+
 #endif
 
 _LIBCPP_END_NAMESPACE_STD
diff --git a/libcxx/include/__algorithm/ranges_for_each.h b/libcxx/include/__algorithm/ranges_for_each.h
index de39bc5522753..475f85366188e 100644
--- a/libcxx/include/__algorithm/ranges_for_each.h
+++ b/libcxx/include/__algorithm/ranges_for_each.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -41,9 +42,16 @@ struct __for_each {
   template <class _Iter, class _Sent, class _Proj, class _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr static for_each_result<_Iter, _Func>
   __for_each_impl(_Iter __first, _Sent __last, _Func& __func, _Proj& __proj) {
-    for (; __first != __last; ++__first)
-      std::invoke(__func, std::invoke(__proj, *__first));
-    return {std::move(__first), std::move(__func)};
+    if constexpr (random_access_iterator<_Iter> && sized_sentinel_for<_Sent, _Iter>) {
+      auto __n   = __last - __first;
+      auto __end = __first + __n;
+      std::for_each(__first, __end, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__end), std::move(__func)};
+    } else {
+      for (; __first != __last; ++__first)
+        std::invoke(__func, std::invoke(__proj, *__first));
+      return {std::move(__first), std::move(__func)};
+    }
   }
 
 public:
diff --git a/libcxx/include/__algorithm/ranges_for_each_n.h b/libcxx/include/__algorithm/ranges_for_each_n.h
index 603cb723233c8..3108d66001295 100644
--- a/libcxx/include/__algorithm/ranges_for_each_n.h
+++ b/libcxx/include/__algorithm/ranges_for_each_n.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -40,11 +41,17 @@ struct __for_each_n {
   template <input_iterator _Iter, class _Proj = identity, indirectly_unary_invocable<projected<_Iter, _Proj>> _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr for_each_n_result<_Iter, _Func>
   operator()(_Iter __first, iter_difference_t<_Iter> __count, _Func __func, _Proj __proj = {}) const {
-    while (__count-- > 0) {
-      std::invoke(__func, std::invoke(__proj, *__first));
-      ++__first;
+    if constexpr (random_access_iterator<_Iter>) {
+      auto __last = __first + __count;
+      std::for_each(__first, __last, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__last), std::move(__func)};
+    } else {
+      while (__count-- > 0) {
+        std::invoke(__func, std::invoke(__proj, *__first));
+        ++__first;
+      }
+      return {std::move(__first), std::move(__func)};
     }
-    return {std::move(__first), std::move(__func)};
   }
 };
 
diff --git a/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
new file mode 100644
index 0000000000000..af46371881577
--- /dev/null
+++ b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
@@ -0,0 +1,57 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17
+
+#include <algorithm>
+#include <cstddef>
+#include <deque>
+#include <list>
+#include <string>
+#include <vector>
+
+#include <benchmark/benchmark.h>
+
+int main(int argc, char** argv) {
+  auto std_for_each_n = [](auto first, auto n, auto f) { return std::for_each_n(first, n, f); };
+
+  // {std,ranges}::for_each_n
+  {
+    auto bm = []<class Container>(std::string name, auto for_each_n) {
+      benchmark::RegisterBenchmark(
+          name,
+          [for_each_n](auto& st) {
+            std::size_t const n = st.range(0);
+            Container c(n, 1);
+            auto first = c.begin();
+
+            for ([[maybe_unused]] auto _ : st) {
+              benchmark::DoNotOptimize(c);
+              auto result = for_each_n(first, n, [](int& x) { x = std::clamp(x, 10, 100); });
+              benchmark::DoNotOptimize(result);
+            }
+          })
+          ->Arg(8)
+          ->Arg(32)
+          ->Arg(50) // non power-of-two
+          ->Arg(8192)
+          ->Arg(1 << 20);
+    };
+    bm.operator()<std::vector<int>>("std::for_each_n(vector<int>)", std_for_each_n);
+    bm.operator()<std::deque<int>>("std::for_each_n(deque<int>)", std_for_each_n);
+    bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
+    bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);
+    bm.operator()<std::deque<int>>("rng::for_each_n(deque<int>)", std::ranges::for_each_n);
+    bm.operator()<std::list<int>>("rng::for_each_n(list<int>)", std::ranges::for_each_n);
+  }
+
+  benchmark::Initialize(&argc, argv);
+  benchmark::RunSpecifiedBenchmarks();
+  benchmark::Shutdown();
+  return 0;
+}
diff --git a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
index dd026444330ea..beb4c7f675a6e 100644
--- a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
+++ b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
@@ -258,7 +258,7 @@ constexpr bool all_the_algorithms()
 int main(int, char**)
 {
     all_the_algorithms();
-    static_assert(all_the_algorithms());
+    // static_assert(all_the_algorithms());
 
     return 0;
 }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
index 371f6c92f1ed1..42f1a41a27096 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
@@ -13,69 +13,113 @@
 //    constexpr InputIterator      // constexpr after C++17
 //    for_each_n(InputIterator first, Size n, Function f);
 
-
 #include <algorithm>
 #include <cassert>
+#include <deque>
 #include <functional>
+#include <iterator>
+#include <ranges>
+#include <vector>
 
 #include "test_macros.h"
 #include "test_iterators.h"
 
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
-    int ia[] = {1, 3, 6, 7};
-    int expected[] = {3, 5, 8, 9};
-    const std::size_t N = 4;
+struct for_each_test {
+  TEST_CONSTEXPR for_each_test(int c) : count(c) {}
+  int count;
+  TEST_CONSTEXPR_CXX14 void operator()(int& i) {
+    ++i;
+    ++count;
+  }
+};
 
-    auto it = std::for_each_n(std::begin(ia), N, [](int &a) { a += 2; });
-    return it == (std::begin(ia) + N)
-        && std::equal(std::begin(ia), std::end(ia), std::begin(expected))
-        ;
-    }
-#endif
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
 
-struct for_each_test
-{
-    for_each_test(int c) : count(c) {}
-    int count;
-    void operator()(int& i) {++i; ++count;}
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
 };
 
-int main(int, char**)
-{
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
+TEST_CONSTEXPR_CXX20 bool test() {
+  {
     typedef cpp17_input_iterator<int*> Iter;
-    int ia[] = {0, 1, 2, 3, 4, 5};
-    const unsigned s = sizeof(ia)/sizeof(ia[0]);
+    int ia[]         = {0, 1, 2, 3, 4, 5};
+    const unsigned s = sizeof(ia) / sizeof(ia[0]);
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
-    assert(it == Iter(ia));
-    assert(f.count == 0);
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
+      assert(it == Iter(ia));
+      assert(f.count == 0);
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
 
-    assert(it == Iter(ia+s));
-    assert(f.count == s);
-    for (unsigned i = 0; i < s; ++i)
-        assert(ia[i] == static_cast<int>(i+1));
+      assert(it == Iter(ia + s));
+      assert(f.count == s);
+      for (unsigned i = 0; i < s; ++i)
+        assert(ia[i] == static_cast<int>(i + 1));
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
 
-    assert(it == Iter(ia+1));
-    assert(f.count == 1);
-    for (unsigned i = 0; i < 1; ++i)
-        assert(ia[i] == static_cast<int>(i+2));
+      assert(it == Iter(ia + 1));
+      assert(f.count == 1);
+      for (unsigned i = 0; i < 1; ++i)
+        assert(ia[i] == static_cast<int>(i + 2));
     }
+  }
+
+#if TEST_STD_VER > 11
+  {
+    int ia[]            = {1, 3, 6, 7};
+    int expected[]      = {3, 5, 8, 9};
+    const std::size_t N = 4;
+
+    auto it = std::for_each_n(std::begin(ia), N, [](int& a) { a += 2; });
+    assert(it == (std::begin(ia) + N) && std::equal(std::begin(ia), std::end(ia), std::begin(expected)));
+  }
+#endif
+
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+#if TEST_STD_VER >= 20
+  { // Make sure that the segmented iterator optimization works during constant evaluation
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::for_each_n(v.begin(), std::ranges::distance(v), [i = 0](int& a) mutable { assert(a == i++); });
+  }
+#endif
+
+  return true;
+}
 
+int main(int, char**) {
+  assert(test());
 #if TEST_STD_VER > 17
-    static_assert(test_constexpr());
+  static_assert(test());
 #endif
 
   return 0;
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
index 8b9b6e82cbcb2..2f4bfb9db6dba 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
@@ -20,7 +20,10 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
 #include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -30,7 +33,7 @@ struct Callable {
 };
 
 template <class Iter, class Sent = Iter>
-concept HasForEachIt = requires (Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
+concept HasForEachIt = requires(Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
 
 static_assert(HasForEachIt<int*>);
 static_assert(!HasForEachIt<InputIteratorNotDerivedFrom>);
@@ -47,7 +50,7 @@ static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotPredicate>);
 static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotCopyConstructible>);
 
 template <class Range>
-concept HasForEachR = requires (Range range) { std::ranges::for_each(range, Callable{}); };
+concept HasForEachR = requires(Range range) { std::ranges::for_each(range, Callable{}); };
 
 static_assert(HasForEachR<UncheckedRange<int*>>);
 static_assert(!HasForEachR<InputRangeNotDerivedFrom>);
@@ -68,7 +71,7 @@ constexpr void test_iterator() {
   { // simple test
     {
       auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      int a[]   = {1, 6, 3, 4};
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(Iter(a), Sent(Iter(a + 4)), func);
       assert(a[0] == 1);
@@ -81,8 +84,8 @@ constexpr void test_iterator() {
       assert(i == 4);
     }
     {
-      auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      auto func  = [i = 0](int& a) mutable { a += i++; };
+      int a[]    = {1, 6, 3, 4};
       auto range = std::ranges::subrange(Iter(a), Sent(Iter(a + 4)));
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(range, func);
@@ -110,6 +113,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each(d, deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>, sentinel_wrapper<cpp17_input_iterator<int*>>>();
   test_iterator<cpp20_input_iterator<int*>, sentinel_wrapper<cpp20_input_iterator<int*>>>();
@@ -146,6 +173,15 @@ constexpr bool test() {
     }
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each(v, [i = 0](int x) mutable { assert(x == 2 * i++); }, [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
index d4b2d053d08ce..ad1447b7348f5 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
@@ -17,7 +17,12 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
+#include <iterator>
 #include <ranges>
+#include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -27,7 +32,7 @@ struct Callable {
 };
 
 template <class Iter>
-concept HasForEachN = requires (Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
+concept HasForEachN = requires(Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
 
 static_assert(HasForEachN<int*>);
 static_assert(!HasForEachN<InputIteratorNotDerivedFrom>);
@@ -45,7 +50,7 @@ template <class Iter>
 constexpr void test_iterator() {
   { // simple test
     auto func = [i = 0](int& a) mutable { a += i++; };
-    int a[] = {1, 6, 3, 4};
+    int a[]   = {1, 6, 3, 4};
     std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> auto ret =
         std::ranges::for_each_n(Iter(a), 4, func);
     assert(a[0] == 1);
@@ -64,6 +69,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>>();
   test_iterator<cpp20_input_iterator<int*>>();
@@ -89,6 +118,19 @@ constexpr bool test() {
     assert(a[2].other == 6);
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each_n(
+        v.begin(),
+        std::ranges::distance(v),
+        [i = 0](int x) mutable { assert(x == 2 * i++); },
+        [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 

@winner245 winner245 force-pushed the for-each-segment branch 2 times, most recently from 16438be to 047acfd Compare March 27, 2025 01:08
Member

@ldionne ldionne left a comment

Thanks for the patch! I left some comments but I think this is going to be a nice optimization.

@winner245 winner245 force-pushed the for-each-segment branch 3 times, most recently from d14bde4 to 8a5bcdc Compare April 5, 2025 02:43
Contributor

@philnik777 philnik777 left a comment

I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

@winner245
Contributor Author

winner245 commented Apr 5, 2025

I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

Thank you for your feedback! I agree that the scope of the patch has expanded beyond its original intent. Initially, the goal was simply to extend the std::for_each optimization to its variants ranges::for_each{,_n}. As the review and revisions progressed, however, I also tried to address the inconsistent segmented iterator optimization support between for_each_n and for_each, since the optimization for for_each_n includes C++03. I think back-porting the std::for_each optimization to C++03 could be useful, because we may be able to extend the optimization to other algorithms by letting them simply forward to std::for_each (as per your comment in another PR).

However, I agree that this made the patch diverge from its original purpose and may complicate the review process. Following your suggestion, I will work on splitting it to make it clear what this patch focuses on.

-------------- Update --------------
As per your suggestion, I have split this into the following PRs, each focusing on an independent and self-contained subtask for the classical algorithms:

This separation allows the current PR to focus exclusively on the optimization of the ranges algorithms. I will rebase my current patch on top of the split-out pieces once they have landed.


github-actions bot commented May 22, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@winner245
Contributor Author

With std::for_each backported to C++11 in #134960 and std::for_each_n carved out into #135468, this PR is now much cleaner, focusing exclusively on std::ranges::{for_each, for_each_n}.

Member

@ldionne ldionne left a comment

LGTM once comments are addressed. Thanks a lot for this series of refactorings / optimizations!

@ldionne ldionne merged commit 9827440 into llvm:main Jun 18, 2025
121 of 127 checks passed
@winner245 winner245 deleted the for-each-segment branch June 18, 2025 17:21
Labels: libc++, performance