std.mem.indexOfScalarPos: ~20% faster #24772


Open · wants to merge 7 commits into master

Conversation


GiuseppeCesarano (Contributor) commented Aug 9, 2025

This patch halves the number of match checks in the wider SIMD code path, so the measured impact is largest on larger buffers.
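The halving comes from OR-combining the per-block match masks so that only one reduction runs per unrolled iteration, instead of one per block. A minimal sketch of the idea (illustrative only, not the actual stdlib code; the function name and structure are assumptions):

```zig
const std = @import("std");

// Illustrative sketch, not the stdlib implementation: scan four vector blocks
// per iteration, OR-combining the boolean match masks so a single @reduce
// decides whether any of the 4 * block_len lanes matched.
fn indexOfScalarUnrolled(comptime T: type, slice: []const T, value: T) ?usize {
    const block_len = std.simd.suggestVectorLength(T) orelse
        return std.mem.indexOfScalar(T, slice, value);
    const Block = @Vector(block_len, T);
    const needle: Block = @splat(value);

    var i: usize = 0;
    while (i + block_len * 4 <= slice.len) : (i += block_len * 4) {
        var any: @Vector(block_len, bool) = @splat(false);
        inline for (0..4) |b| {
            const block: Block = slice[i + b * block_len ..][0..block_len].*;
            // Lane-wise `any = any or (block == needle)`.
            any = @select(bool, block == needle, @as(@Vector(block_len, bool), @splat(true)), any);
        }
        if (@reduce(.Or, any)) {
            // A match exists somewhere in these four blocks; locate it scalarly.
            for (slice[i .. i + block_len * 4], i..) |item, j| {
                if (item == value) return j;
            }
        }
    }
    // Scalar tail for the remainder.
    for (slice[i..], i..) |item, j| {
        if (item == value) return j;
    }
    return null;
}
```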

The ~20% claim comes from the following test:

const std = @import("std");

pub fn main() !void {
    var buf: [4096]u8 = undefined;
    var file = try std.fs.cwd().openFile("test", .{});
    defer file.close();

    const slice = buf[0..try file.readAll(buf[0..])];

    var indx: usize = 0;
    for (0..4096) |s| {
        for (0..255) |v| {
            indx = indx + (std.mem.indexOfScalarPos(u8, slice, s, @truncate(v)) orelse 0);
        }
    }
    std.debug.print("{}\n", .{indx});
}

The test file is 4KiB of random alphanumeric characters.

Benchmark 1 (198 runs): ./old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          24.6ms ±  411us    24.5ms … 30.2ms         33 (17%)        0%
  peak_rss            778KB ±  411       774KB …  778KB          2 ( 1%)        0%
  cpu_cycles          110M  ±  366K      109M  …  112M          14 ( 7%)        0%
  instructions        395M  ± 3.38       395M  …  395M          22 (11%)        0%
  cache_references    890   ±  255       462   … 1.83K           9 ( 5%)        0%
  cache_misses       62.5   ± 26.6         2   …  232            9 ( 5%)        0%
  branch_misses       242K  ±  859       240K  …  244K           0 ( 0%)        0%
Benchmark 2 (343 runs): ./glibc
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          14.0ms ±  394us    13.0ms … 15.3ms          6 ( 2%)        ⚡- 43.3% ±  0.3%
  peak_rss           1.62MB ± 12.4KB    1.50MB … 1.63MB         30 ( 9%)        💩+108.7% ±  0.2%
  cpu_cycles         61.1M  ± 1.70M     56.3M  … 67.0M           8 ( 2%)        ⚡- 44.4% ±  0.2%
  instructions        224M  ± 1.57M      221M  …  227M           0 ( 0%)        ⚡- 43.2% ±  0.1%
  cache_references   13.8K  ±  402      12.5K  … 15.0K           6 ( 2%)        💩+1447.4% ±  7.0%
  cache_misses       1.87K  ± 1.11K      589   … 7.93K           1 ( 0%)        💩+2898.9% ± 246.7%
  branch_misses       189K  ± 50.1K     76.1K  …  375K           4 ( 1%)        ⚡- 21.9% ±  2.9%
Benchmark 3 (245 runs): ./new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          19.9ms ± 65.2us    19.8ms … 20.5ms         16 ( 7%)        ⚡- 19.4% ±  0.2%
  peak_rss            807KB ±    0       807KB …  807KB          0 ( 0%)        💩+  3.7% ±  0.0%
  cpu_cycles         88.5M  ±  277K     88.0M  … 91.1M          26 (11%)        ⚡- 19.4% ±  0.1%
  instructions        322M  ± 0.89       322M  …  322M           6 ( 2%)        ⚡- 18.4% ±  0.0%
  cache_references    734   ±  177       374   … 1.90K          11 ( 4%)        ⚡- 17.5% ±  4.5%
  cache_misses       40.7   ± 29.9         0   …  206            1 ( 0%)        ⚡- 34.9% ±  8.5%
  branch_misses       259K  ±  741       256K  …  260K           1 ( 0%)        💩+  6.8% ±  0.1%

1KiB Benchmark:

Benchmark 1 (1633 runs): ./old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.72ms ± 73.7us    2.67ms … 4.49ms         41 ( 3%)        0%
  peak_rss            778KB ±  320       774KB …  778KB         10 ( 1%)        0%
  cpu_cycles         11.4M  ± 55.8K     11.2M  … 12.7M          42 ( 3%)        0%
  instructions       36.5M  ± 0.65      36.5M  … 36.5M           6 ( 0%)        0%
  cache_references    376   ± 54.7       251   …  624           19 ( 1%)        0%
  cache_misses       10.2   ± 22.9         0   …  201          240 (15%)        0%
  branch_misses      51.6K  ±  382      50.4K  … 52.9K          14 ( 1%)        0%
Benchmark 2 (1355 runs): ./glibc
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.33ms ±  811us    1.91ms … 4.15ms          0 ( 0%)        💩+ 22.4% ±  1.5%
  peak_rss           1.62MB ± 15.5KB    1.50MB … 1.63MB        118 ( 9%)        💩+108.7% ±  0.1%
  cpu_cycles         13.4M  ± 3.63M     7.12M  … 16.8M           0 ( 0%)        💩+ 17.3% ±  1.5%
  instructions       24.3M  ±  335K     23.7M  … 24.9M           0 ( 0%)        ⚡- 33.4% ±  0.0%
  cache_references   13.3K  ±  361      11.8K  … 14.7K          14 ( 1%)        💩+3432.2% ±  4.7%
  cache_misses        559   ±  733        30   … 8.14K         134 (10%)        💩+5372.2% ± 348.1%
  branch_misses       196K  ±  107K     18.0K  …  284K           0 ( 0%)        💩+279.8% ± 10.1%
Benchmark 3 (1851 runs): ./new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.37ms ± 14.5us    2.34ms … 2.45ms         28 ( 2%)        ⚡- 12.8% ±  0.1%
  peak_rss            807KB ±    0       807KB …  807KB          0 ( 0%)        💩+  3.7% ±  0.0%
  cpu_cycles         9.88M  ± 30.6K     9.80M  … 10.1M          48 ( 3%)        ⚡- 13.6% ±  0.0%
  instructions       32.7M  ± 0.63      32.7M  … 32.7M           8 ( 0%)        ⚡- 10.4% ±  0.0%
  cache_references    401   ± 48.5       274   …  624           20 ( 1%)        💩+  6.6% ±  0.9%
  cache_misses       2.42   ± 6.59         0   …  188          228 (12%)        ⚡- 76.4% ± 10.7%
  branch_misses      55.8K  ±  441      54.3K  … 57.4K          15 ( 1%)        💩+  8.2% ±  0.1%

The processor is an Intel Core i7-7700K @ 4.20GHz.

nektro commented Aug 10, 2025

can you edit the benchmark so that ./main is the first/baseline?

@GiuseppeCesarano (Contributor Author)

Got closer to glibc:

4KiB benchmark:

Benchmark 1 (195 runs): ./old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          24.9ms ±  977us    24.7ms … 36.8ms          6 ( 3%)        0%
  peak_rss            778KB ±  293       774KB …  778KB          1 ( 1%)        0%
  cpu_cycles          111M  ± 2.09M      110M  …  132M          17 ( 9%)        0%
  instructions        395M  ± 4.39       395M  …  395M          21 (11%)        0%
  cache_references    979   ±  460       391   … 4.45K           9 ( 5%)        0%
  cache_misses        117   ±  122        13   …  995           18 ( 9%)        0%
  branch_misses       261K  ± 1.71K      258K  …  278K           2 ( 1%)        0%
Benchmark 2 (337 runs): ./glibc
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          14.2ms ±  466us    13.2ms … 16.1ms         10 ( 3%)        ⚡- 42.9% ±  0.5%
  peak_rss           1.62MB ± 14.3KB    1.50MB … 1.63MB         23 ( 7%)        💩+108.7% ±  0.3%
  cpu_cycles         62.4M  ± 2.10M     57.8M  … 70.7M           8 ( 2%)        ⚡- 43.7% ±  0.3%
  instructions        226M  ± 1.55M      223M  …  228M           0 ( 0%)        ⚡- 42.9% ±  0.1%
  cache_references   13.6K  ±  455      12.1K  … 15.4K          15 ( 4%)        💩+1292.7% ±  8.2%
  cache_misses       2.58K  ± 1.60K      662   … 8.64K          13 ( 4%)        💩+2106.1% ± 191.8%
  branch_misses       231K  ± 62.7K      108K  …  478K           8 ( 2%)        ⚡- 11.4% ±  3.4%
Benchmark 3 (288 runs): ./new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          16.8ms ±  868us    16.6ms … 27.3ms         13 ( 5%)        ⚡- 32.7% ±  0.7%
  peak_rss            807KB ±    0       807KB …  807KB          0 ( 0%)        💩+  3.7% ±  0.0%
  cpu_cycles         74.5M  ± 3.85M     73.7M  …  122M          12 ( 4%)        ⚡- 32.7% ±  0.5%
  instructions        231M  ± 1.40       231M  …  231M          13 ( 5%)        ⚡- 41.5% ±  0.0%
  cache_references    807   ±  552       365   … 6.75K          12 ( 4%)        ⚡- 17.6% ±  9.6%
  cache_misses       80.2   ±  113         0   … 1.36K          15 ( 5%)        ⚡- 31.4% ± 18.2%
  branch_misses       218K  ± 1.50K      214K  …  227K          21 ( 7%)        ⚡- 16.3% ±  0.1%

I think we win over glibc on small buffers because they pay an initial cost to align their search to the memory page boundary. But if the buffer spans multiple pages, they quickly pull ahead. If we want that feature, it could be implemented fairly easily.

I’ve checked the x86 code generation for the unrolled section, and it’s essentially identical to glibc’s hand-written AVX2 assembly.

An interesting quirk: if we change the unrolled loop condition from

while (i <= slice.len -| block_len * 4)

to

while (i + block_len * 4 <= slice.len)

the performance of this patch drops back to the level of the old implementation. This happens because the generated code in the hot loop changes slightly, and in that form the 4-way unroll actually hurts performance instead of helping.
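For readers unfamiliar with `-|`: it is Zig's saturating subtraction, so `slice.len -| block_len * 4` clamps to zero instead of wrapping when the slice is shorter than four blocks. A small self-contained illustration of the difference between the two conditions (the concrete values are arbitrary):

```zig
const std = @import("std");

pub fn main() void {
    const block_len: usize = 32;
    const short_len: usize = 10; // shorter than block_len * 4

    // Saturating subtraction clamps at zero rather than wrapping around:
    const bound = short_len -| block_len * 4; // 0, not a huge wrapped value
    std.debug.print("bound = {}\n", .{bound});

    // With `i <= bound` and i starting at 0, the condition holds (0 <= 0)
    // even though four blocks do not fit in the slice, whereas
    // `i + block_len * 4 <= short_len` does not hold.
    std.debug.assert(0 + block_len * 4 > short_len);
}
```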

@andrewrk (Member)

A nice property of the implementation here is that it obeys pointer provenance rules. It's technically undefined behavior to assume you can read past a memory allocation, even if you stay within a page boundary.

I'll keep working on the language spec and compiler implementation to try and resolve this problem, but in the meantime, it would be good to not assume you can read past a memory allocation.


GiuseppeCesarano commented Aug 13, 2025

Simply removing the `=` sign in the `while` condition should resolve this; the benchmarks are within 1% of the previous results.

Edit:
I just realized your comment might have been referring to my note about memory page alignment. If so, what I meant shouldn't involve reading unallocated memory: when a search spans multiple pages, glibc aligns its loads to avoid loading vectors that contain elements from both pages.

@GiuseppeCesarano (Contributor Author)

I experimented with aligning the read to the page boundary, but the gains were outweighed by the branch mispredictions. As @andrewrk pointed out, glibc mitigates this issue by handling the tail with SIMD loads that extend beyond the array boundary. Nevertheless, I thought it was worth trying.
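For reference, the alignment experiment amounts to rounding the load address up to a boundary before entering the vector loop. A hedged sketch of just the address arithmetic, using `std.mem.alignForward` (the buffer and offset are arbitrary):

```zig
const std = @import("std");

pub fn main() void {
    const block_len: usize = 32;
    var buf: [4096]u8 align(64) = undefined;
    const base = @intFromPtr(&buf) + 5; // pretend the search starts mid-block

    // Round the address up to the next block boundary so subsequent SIMD
    // loads never straddle a boundary (for glibc, a page boundary).
    const aligned = std.mem.alignForward(usize, base, block_len);
    std.debug.assert(aligned % block_len == 0);
    std.debug.assert(aligned - base < block_len);
    std.debug.print("skipped {} bytes to align\n", .{aligned - base});
}
```

The bytes skipped by the alignment step have to be scanned separately first, which is where the extra branches (and the mispredictions mentioned above) come from.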


GiuseppeCesarano commented Aug 19, 2025

@andrewrk I think there is a bug in std.simd.firstTrue. Starting from my last commit, which doesn't pass the CI tests, I can reproduce one of the errors with this command:

zig build test -fqemu -Dtest-filter="priority_queue.test.siftUp in remove" -Dtest-target-filter=aarch64_be-linux-musl

Inserting the following print statement before the return in std.simd.firstTrue:

if (@TypeOf(vec) == @Vector(32, bool) or @TypeOf(vec) == @Vector(4, bool)) std.debug.print("vec{}\nindices:{}\n\n", .{ vec, indices });

Gives the following output when the test is run (cleaned to only relevant parts):

vec{ false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, false, false }
indices:{ 31, 1, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 26, 31, 31, 31, 31, 31 }

vec{ false, false, true, false }
indices:{ 3, 1, 3, 3 }

Notice how the behavior is flipped: the shorter vector counts lanes from right to left, but the assigned position is still from left to right. Since the counting is reversed, the returned value is incorrect.

Meanwhile, the 32-long vector counts from left to right and the result is correct.

I think the reversed index could be big-endianness at play, but in that case std.simd.firstTrue would have a bug where it doesn't account for it.

But then I can't explain why the behavior simply flips for the 32-long vector.

Even weirder, I could not get this behavior to appear in a standalone binary, even when compiled for the same arch, so everything reported here was observed in the zig test specified above.
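For context on what the printed `indices` vector is: std.simd.firstTrue selects between an index vector and a sentinel, then reduces with `.Min`, roughly like the following paraphrase (not the verbatim stdlib source):

```zig
const std = @import("std");

// Paraphrase of std.simd.firstTrue's approach, not the verbatim source.
fn firstTrueSketch(comptime len: comptime_int, vec: @Vector(len, bool)) ?usize {
    if (!@reduce(.Or, vec)) return null;
    const Index = std.math.IntFittingRange(0, len - 1);
    const indices = std.simd.iota(Index, len); // { 0, 1, 2, ... }
    const sentinel: @Vector(len, Index) = @splat(len - 1);
    // False lanes get the sentinel, true lanes keep their own index, so the
    // minimum is the first true lane -- but only if the backend keeps lane 0
    // as the lowest index, which is exactly what the aarch64_be repro above
    // appears to violate.
    return @reduce(.Min, @select(Index, vec, indices, sentinel));
}
```

This matches the debug output above: false lanes print the sentinel (31 for the 32-long vector, 3 for the 4-long one), and the reversed `{ 3, 1, 3, 3 }` corresponds to a reversed index vector feeding the select.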


GiuseppeCesarano commented Aug 20, 2025

I've spent some time reading how LLVM treats BE vectors and investigating the issue. I can confirm it's not a bug in std.simd.firstTrue: its output is wrong in the test, but the function itself should not have to account for BE vectors, which is a backend concern.

Another thing I've noticed: one of the failing tests, priority_queue.test.siftUp in remove, fails under aarch64_be-linux-musl in debug mode, but passes when compiled in release mode, and also passes for aarch64_be-linux-gnu in debug mode.

I've checked the LLVM IR for those targets: in release mode the IR of the std.mem.indexOfScalarPos function that handles 4 x u32 differs slightly, while the two debug builds are identical; the std.simd.firstTrue 4 x i1 function likewise has identical IR across them.

Still, only the musl debug build shows the behavior of counting lanes from the right.

Could the problem be how the IR gets lowered to assembly for the musl debug target?

Edit:
I was finally able to reproduce the bug in a standalone binary, relevant issue: #24920
