std.mem.indexOfScalarPos: ~20% faster #24772


Open · wants to merge 7 commits into master

Conversation


GiuseppeCesarano (Contributor) commented Aug 9, 2025

This patch halves the number of match checks in the wider SIMD code path, so the measured impact is largest on larger buffers.
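The halving comes from OR-combining the per-block match masks so that only one reduction runs per unrolled iteration, instead of one per block. A minimal sketch of the idea (illustrative only, not the actual stdlib code; the function name and structure are assumptions):

```zig
const std = @import("std");

// Illustrative sketch, not the stdlib implementation: scan four vector blocks
// per iteration, OR-combining the boolean match masks so a single @reduce
// decides whether any of the 4 * block_len lanes matched.
fn indexOfScalarUnrolled(comptime T: type, slice: []const T, value: T) ?usize {
    const block_len = std.simd.suggestVectorLength(T) orelse
        return std.mem.indexOfScalar(T, slice, value);
    const Block = @Vector(block_len, T);
    const needle: Block = @splat(value);

    var i: usize = 0;
    while (i + block_len * 4 <= slice.len) : (i += block_len * 4) {
        var any: @Vector(block_len, bool) = @splat(false);
        inline for (0..4) |b| {
            const block: Block = slice[i + b * block_len ..][0..block_len].*;
            // Lane-wise `any = any or (block == needle)`.
            any = @select(bool, block == needle, @as(@Vector(block_len, bool), @splat(true)), any);
        }
        if (@reduce(.Or, any)) {
            // A match exists somewhere in these four blocks; locate it scalarly.
            for (slice[i .. i + block_len * 4], i..) |item, j| {
                if (item == value) return j;
            }
        }
    }
    // Scalar tail for the remainder.
    for (slice[i..], i..) |item, j| {
        if (item == value) return j;
    }
    return null;
}
```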

The ~20% claim comes from the following test:

const std = @import("std");

pub fn main() !void {
    var buf: [4096]u8 = undefined;
    var file = try std.fs.cwd().openFile("test", .{});
    defer file.close();

    const slice = buf[0..try file.readAll(buf[0..])];

    var indx: usize = 0;
    for (0..4096) |s| {
        for (0..255) |v| {
            indx = indx + (std.mem.indexOfScalarPos(u8, slice, s, @truncate(v)) orelse 0);
        }
    }
    std.debug.print("{}\n", .{indx});
}

The test file is 4KiB of random alphanumeric characters.

Benchmark 1 (198 runs): ./old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          24.6ms ±  411us    24.5ms … 30.2ms         33 (17%)        0%
  peak_rss            778KB ±  411       774KB …  778KB          2 ( 1%)        0%
  cpu_cycles          110M  ±  366K      109M  …  112M          14 ( 7%)        0%
  instructions        395M  ± 3.38       395M  …  395M          22 (11%)        0%
  cache_references    890   ±  255       462   … 1.83K           9 ( 5%)        0%
  cache_misses       62.5   ± 26.6         2   …  232            9 ( 5%)        0%
  branch_misses       242K  ±  859       240K  …  244K           0 ( 0%)        0%
Benchmark 2 (343 runs): ./glibc
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          14.0ms ±  394us    13.0ms … 15.3ms          6 ( 2%)        ⚡- 43.3% ±  0.3%
  peak_rss           1.62MB ± 12.4KB    1.50MB … 1.63MB         30 ( 9%)        💩+108.7% ±  0.2%
  cpu_cycles         61.1M  ± 1.70M     56.3M  … 67.0M           8 ( 2%)        ⚡- 44.4% ±  0.2%
  instructions        224M  ± 1.57M      221M  …  227M           0 ( 0%)        ⚡- 43.2% ±  0.1%
  cache_references   13.8K  ±  402      12.5K  … 15.0K           6 ( 2%)        💩+1447.4% ±  7.0%
  cache_misses       1.87K  ± 1.11K      589   … 7.93K           1 ( 0%)        💩+2898.9% ± 246.7%
  branch_misses       189K  ± 50.1K     76.1K  …  375K           4 ( 1%)        ⚡- 21.9% ±  2.9%
Benchmark 3 (245 runs): ./new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          19.9ms ± 65.2us    19.8ms … 20.5ms         16 ( 7%)        ⚡- 19.4% ±  0.2%
  peak_rss            807KB ±    0       807KB …  807KB          0 ( 0%)        💩+  3.7% ±  0.0%
  cpu_cycles         88.5M  ±  277K     88.0M  … 91.1M          26 (11%)        ⚡- 19.4% ±  0.1%
  instructions        322M  ± 0.89       322M  …  322M           6 ( 2%)        ⚡- 18.4% ±  0.0%
  cache_references    734   ±  177       374   … 1.90K          11 ( 4%)        ⚡- 17.5% ±  4.5%
  cache_misses       40.7   ± 29.9         0   …  206            1 ( 0%)        ⚡- 34.9% ±  8.5%
  branch_misses       259K  ±  741       256K  …  260K           1 ( 0%)        💩+  6.8% ±  0.1%

1KiB Benchmark:

Benchmark 1 (1633 runs): ./old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.72ms ± 73.7us    2.67ms … 4.49ms         41 ( 3%)        0%
  peak_rss            778KB ±  320       774KB …  778KB         10 ( 1%)        0%
  cpu_cycles         11.4M  ± 55.8K     11.2M  … 12.7M          42 ( 3%)        0%
  instructions       36.5M  ± 0.65      36.5M  … 36.5M           6 ( 0%)        0%
  cache_references    376   ± 54.7       251   …  624           19 ( 1%)        0%
  cache_misses       10.2   ± 22.9         0   …  201          240 (15%)        0%
  branch_misses      51.6K  ±  382      50.4K  … 52.9K          14 ( 1%)        0%
Benchmark 2 (1355 runs): ./glibc
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.33ms ±  811us    1.91ms … 4.15ms          0 ( 0%)        💩+ 22.4% ±  1.5%
  peak_rss           1.62MB ± 15.5KB    1.50MB … 1.63MB        118 ( 9%)        💩+108.7% ±  0.1%
  cpu_cycles         13.4M  ± 3.63M     7.12M  … 16.8M           0 ( 0%)        💩+ 17.3% ±  1.5%
  instructions       24.3M  ±  335K     23.7M  … 24.9M           0 ( 0%)        ⚡- 33.4% ±  0.0%
  cache_references   13.3K  ±  361      11.8K  … 14.7K          14 ( 1%)        💩+3432.2% ±  4.7%
  cache_misses        559   ±  733        30   … 8.14K         134 (10%)        💩+5372.2% ± 348.1%
  branch_misses       196K  ±  107K     18.0K  …  284K           0 ( 0%)        💩+279.8% ± 10.1%
Benchmark 3 (1851 runs): ./new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.37ms ± 14.5us    2.34ms … 2.45ms         28 ( 2%)        ⚡- 12.8% ±  0.1%
  peak_rss            807KB ±    0       807KB …  807KB          0 ( 0%)        💩+  3.7% ±  0.0%
  cpu_cycles         9.88M  ± 30.6K     9.80M  … 10.1M          48 ( 3%)        ⚡- 13.6% ±  0.0%
  instructions       32.7M  ± 0.63      32.7M  … 32.7M           8 ( 0%)        ⚡- 10.4% ±  0.0%
  cache_references    401   ± 48.5       274   …  624           20 ( 1%)        💩+  6.6% ±  0.9%
  cache_misses       2.42   ± 6.59         0   …  188          228 (12%)        ⚡- 76.4% ± 10.7%
  branch_misses      55.8K  ±  441      54.3K  … 57.4K          15 ( 1%)        💩+  8.2% ±  0.1%

The processor is an Intel Core i7-7700K @ 4.20GHz.

nektro commented Aug 10, 2025

can you edit the benchmark so that ./main is the first/baseline?

@GiuseppeCesarano (Contributor Author)

Got closer to glibc:

4KiB benchmark:

Benchmark 1 (195 runs): ./old
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          24.9ms ±  977us    24.7ms … 36.8ms          6 ( 3%)        0%
  peak_rss            778KB ±  293       774KB …  778KB          1 ( 1%)        0%
  cpu_cycles          111M  ± 2.09M      110M  …  132M          17 ( 9%)        0%
  instructions        395M  ± 4.39       395M  …  395M          21 (11%)        0%
  cache_references    979   ±  460       391   … 4.45K           9 ( 5%)        0%
  cache_misses        117   ±  122        13   …  995           18 ( 9%)        0%
  branch_misses       261K  ± 1.71K      258K  …  278K           2 ( 1%)        0%
Benchmark 2 (337 runs): ./glibc
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          14.2ms ±  466us    13.2ms … 16.1ms         10 ( 3%)        ⚡- 42.9% ±  0.5%
  peak_rss           1.62MB ± 14.3KB    1.50MB … 1.63MB         23 ( 7%)        💩+108.7% ±  0.3%
  cpu_cycles         62.4M  ± 2.10M     57.8M  … 70.7M           8 ( 2%)        ⚡- 43.7% ±  0.3%
  instructions        226M  ± 1.55M      223M  …  228M           0 ( 0%)        ⚡- 42.9% ±  0.1%
  cache_references   13.6K  ±  455      12.1K  … 15.4K          15 ( 4%)        💩+1292.7% ±  8.2%
  cache_misses       2.58K  ± 1.60K      662   … 8.64K          13 ( 4%)        💩+2106.1% ± 191.8%
  branch_misses       231K  ± 62.7K      108K  …  478K           8 ( 2%)        ⚡- 11.4% ±  3.4%
Benchmark 3 (288 runs): ./new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          16.8ms ±  868us    16.6ms … 27.3ms         13 ( 5%)        ⚡- 32.7% ±  0.7%
  peak_rss            807KB ±    0       807KB …  807KB          0 ( 0%)        💩+  3.7% ±  0.0%
  cpu_cycles         74.5M  ± 3.85M     73.7M  …  122M          12 ( 4%)        ⚡- 32.7% ±  0.5%
  instructions        231M  ± 1.40       231M  …  231M          13 ( 5%)        ⚡- 41.5% ±  0.0%
  cache_references    807   ±  552       365   … 6.75K          12 ( 4%)        ⚡- 17.6% ±  9.6%
  cache_misses       80.2   ±  113         0   … 1.36K          15 ( 5%)        ⚡- 31.4% ± 18.2%
  branch_misses       218K  ± 1.50K      214K  …  227K          21 ( 7%)        ⚡- 16.3% ±  0.1%

I think we win over glibc on small buffers because they pay an initial cost to align their search to the memory page boundary. But if the buffer spans multiple pages, they quickly pull ahead. If we want that feature, it could be implemented fairly easily.

I’ve checked the x86 code generation for the unrolled section, and it’s essentially identical to glibc’s hand-written AVX2 assembly.

An interesting quirk: if we change the unrolled loop condition from

while (i <= slice.len -| block_len * 4)

to

while (i + block_len * 4 <= slice.len)

the performance of this patch drops back to the level of the old implementation. This happens because the generated code in the hot loop changes slightly, and in that form the 4-way unroll actually hurts performance instead of helping.
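For readers unfamiliar with `-|`: it is Zig's saturating subtraction, so `slice.len -| block_len * 4` clamps to zero instead of wrapping when the slice is shorter than four blocks. A small self-contained illustration of the difference between the two conditions (the concrete values are arbitrary):

```zig
const std = @import("std");

pub fn main() void {
    const block_len: usize = 32;
    const short_len: usize = 10; // shorter than block_len * 4

    // Saturating subtraction clamps at zero rather than wrapping around:
    const bound = short_len -| block_len * 4; // 0, not a huge wrapped value
    std.debug.print("bound = {}\n", .{bound});

    // With `i <= bound` and i starting at 0, the condition holds (0 <= 0)
    // even though four blocks do not fit in the slice, whereas
    // `i + block_len * 4 <= short_len` does not hold.
    std.debug.assert(0 + block_len * 4 > short_len);
}
```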

@andrewrk (Member)

A nice property of the implementation here is that it obeys pointer provenance rules. It's technically undefined behavior to assume you can read past a memory allocation, even if you stay within a page boundary.

I'll keep working on the language spec and compiler implementation to try and resolve this problem, but in the meantime, it would be good to not assume you can read past a memory allocation.


GiuseppeCesarano commented Aug 13, 2025

Simply removing the `=` sign in the `while` condition should resolve this; the benchmarks are within 1% of the previous results.

Edit:
I just realized your comment might have been referring to my note about memory page alignment. If so, what I meant shouldn't involve reading unallocated memory: when a search spans multiple pages, glibc aligns its loads to avoid loading vectors that contain elements from both pages.

@GiuseppeCesarano (Contributor Author)

I experimented with aligning the read to the page boundary, but the gains were outweighed by the branch mispredictions. As @andrewrk pointed out, glibc mitigates this issue by handling the tail with SIMD loads that extend beyond the array boundary. Nevertheless, I thought it was worth trying.
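For reference, the alignment experiment amounts to rounding the load address up to a boundary before entering the vector loop. A hedged sketch of just the address arithmetic, using `std.mem.alignForward` (the buffer and offset are arbitrary):

```zig
const std = @import("std");

pub fn main() void {
    const block_len: usize = 32;
    var buf: [4096]u8 align(64) = undefined;
    const base = @intFromPtr(&buf) + 5; // pretend the search starts mid-block

    // Round the address up to the next block boundary so subsequent SIMD
    // loads never straddle a boundary (for glibc, a page boundary).
    const aligned = std.mem.alignForward(usize, base, block_len);
    std.debug.assert(aligned % block_len == 0);
    std.debug.assert(aligned - base < block_len);
    std.debug.print("skipped {} bytes to align\n", .{aligned - base});
}
```

The bytes skipped by the alignment step have to be scanned separately first, which is where the extra branches (and the mispredictions mentioned above) come from.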


GiuseppeCesarano commented Aug 19, 2025

@andrewrk I think there is a bug in std.simd.firstTrue. Starting from my last commit, which doesn't pass the CI tests, I can reproduce one of the errors with this command:

zig build test -fqemu -Dtest-filter="priority_queue.test.siftUp in remove" -Dtest-target-filter=aarch64_be-linux-musl

Inserting the following print statement before the return in std.simd.firstTrue:

if (@TypeOf(vec) == @Vector(32, bool) or @TypeOf(vec) == @Vector(4, bool)) std.debug.print("vec{}\nindices:{}\n\n", .{ vec, indices });

Gives the following output when the test is run (cleaned to only relevant parts):

vec{ false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, false, false }
indices:{ 31, 1, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 26, 31, 31, 31, 31, 31 }

vec{ false, false, true, false }
indices:{ 3, 1, 3, 3 }

Notice how the behavior is flipped: the shorter vector counts lanes from right to left, but the assigned position is still from left to right. Since the counting is reversed, the returned value is incorrect.

Meanwhile, the 32-long vector counts from left to right and the result is correct.

I think the reversed index could be big-endianness at play, but in that case std.simd.firstTrue would have a bug where it doesn't account for it.

But then I can't explain why the behavior simply flips for the 32-long vector.

Even weirder, I could not get this behavior to appear in a standalone binary, even when compiled for the same arch, so everything reported here was observed in the zig test specified above.
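For context on what the printed `indices` vector is: std.simd.firstTrue selects between an index vector and a sentinel, then reduces with `.Min`, roughly like the following paraphrase (not the verbatim stdlib source):

```zig
const std = @import("std");

// Paraphrase of std.simd.firstTrue's approach, not the verbatim source.
fn firstTrueSketch(comptime len: comptime_int, vec: @Vector(len, bool)) ?usize {
    if (!@reduce(.Or, vec)) return null;
    const Index = std.math.IntFittingRange(0, len - 1);
    const indices = std.simd.iota(Index, len); // { 0, 1, 2, ... }
    const sentinel: @Vector(len, Index) = @splat(len - 1);
    // False lanes get the sentinel, true lanes keep their own index, so the
    // minimum is the first true lane -- but only if the backend keeps lane 0
    // as the lowest index, which is exactly what the aarch64_be repro above
    // appears to violate.
    return @reduce(.Min, @select(Index, vec, indices, sentinel));
}
```

This matches the debug output above: false lanes print the sentinel (31 for the 32-long vector, 3 for the 4-long one), and the reversed `{ 3, 1, 3, 3 }` corresponds to a reversed index vector feeding the select.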


GiuseppeCesarano commented Aug 20, 2025

I've spent some time reading how LLVM treats BE vectors and investigating the issue. I can confirm it's not a bug in std.simd.firstTrue: its output is wrong in the test, but the function itself should not have to account for BE vectors, which is a backend concern.

Another thing I've noticed: one of the failing tests, priority_queue.test.siftUp in remove, fails under aarch64_be-linux-musl in debug mode, but passes when compiled in release mode, and also passes for aarch64_be-linux-gnu in debug mode.

I've checked the LLVM IR for those targets: in release mode the IR of the std.mem.indexOfScalarPos function that handles 4 x u32 differs slightly, while the two debug builds are identical; the std.simd.firstTrue 4 x i1 function likewise has identical IR across them.

Still, only the musl debug build shows the behavior of counting lanes from the right.

Could the problem be how the IR gets lowered to assembly for the musl debug target?

Edit:
I was finally able to reproduce the bug in a standalone binary, relevant issue: #24920
