std.mem.indexOfScalarPos: ~20% faster #24772
Conversation
can you edit the benchmark so that
Got closer to glibc: 4KiB benchmark:
I think we win over glibc on small buffers because they pay an initial cost to align their search to the memory page boundary. But if the buffer spans multiple pages, they quickly pull ahead. If we want that feature, it could be implemented fairly easily.

I've checked the x86 code generation for the unrolled section, and it's essentially identical to glibc's hand-written AVX2 assembly.

An interesting quirk: if we change the unrolled loop condition from `while (i <= slice.len -| block_len * 4)` to `while (i + block_len * 4 <= slice.len)`, the performance of this patch drops to match the old implementation. This happens because the generated code in the hot loop changes slightly, and with that form, unrolling by 4 actually hurts performance instead of helping.
A nice property of the implementation here is that it obeys pointer provenance rules. It's technically undefined behavior to assume you can read past a memory allocation, even if you stay within a page boundary. I'll keep working on the language spec and compiler implementation to try to resolve this problem, but in the meantime, it would be good not to assume you can read past a memory allocation.
Simply removing the

Edit:
I experimented with aligning the read to the page boundary, but the gains were outweighed by the branch mispredictions. As @andrewrk pointed out, glibc mitigates this issue by handling the tail with SIMD loads that extend beyond the array boundary. Nevertheless, I thought it was worth trying.
@andrewrk I think that there is a bug in
Inserting the following print statement before the return in `if (@TypeOf(vec) == @Vector(32, bool) or @TypeOf(vec) == @Vector(4, bool)) std.debug.print("vec{}\nindices:{}\n\n", .{ vec, indices });`
Gives the following output when the test is run (cleaned to only relevant parts):
Notice how the behavior is flipped: the shorter vector starts counting from right to left, but the assigned position is still from left to right; since the counting is reversed, the returned value is incorrect. Meanwhile, the 32-long vector counts from left to right and the result is correct. I think the reversed index could be the big-endianness in play, but in that case But then I can't explain why the behavior simply flips for the 32-long vector. Even weirder, I could not get this behavior to appear in a standalone binary, even when compiled for the same arch. So everything reported here is observed in the zig test specified above.
I've spent some time reading how LLVM treats BE vectors and investigating the issue; I can confirm it's not a bug in

Another thing I've noticed is that one of the failing tests

I've checked the LLVM IR for those targets; the release mode has a slightly different IR for the

Still, only the musl debug version shows the behavior that counts lanes from the right. Could the problem be how the IR gets lowered to asm under the musl-debug target?

Edit:
This patch simply cuts in half the number of checks for matches in the wider SIMD code path, so the measured impact is on larger files.
The ~20% claim comes from the following test:
The test file is 4 KiB and is composed of random alphanumeric characters
1KiB Benchmark:
The processor is an i7-7700K CPU @ 4.20GHz