crypto/internal/nistec: remove ppc64le assembly #52424

Open
@FiloSottile

In #52182 (comment), @laboger reports that the fiat-crypto (#40171) code with @pmur's compiler improvements (https://go.dev/cl/393656) is within range of the assembly performance!

This is extremely impressive considering the fiat-crypto code also uses safer but slower complete formulas and a somewhat naive 4-bit scalar multiplication window.

name                 fiat-crypto time/op  assembly time/op  delta
ScalarBaseMult/P256           237µs ± 0%         52µs ± 0%  -78.22%  (p=1.000 n=1+1)
ScalarMult/P256               239µs ± 0%        213µs ± 0%  -10.95%  (p=1.000 n=1+1)

The ScalarBaseMult benchmark is still significantly slower, because the assembly uses a large precomputed table, while the fiat-crypto code just runs ScalarMult. This is very much fixable.

I will land the ScalarBaseMult optimization in the fiat-crypto code, and then we can remove the ppc64le assembly entirely!

Activity

self-assigned this on Apr 19, 2022
added the NeedsFix label (The path to resolution is known, but the work has not been done.) on Apr 22, 2022
FiloSottile (Contributor, Author) commented on May 4, 2022

https://go.dev/cl/404174 is the promised ScalarBaseMult optimization, so it's possible that the assembly is now slower than the fiat-crypto code!

gopherbot (Contributor) commented on May 5, 2022

Change https://go.dev/cl/404174 mentions this issue: crypto/elliptic: precompute ScalarBaseMult doublings

laboger (Contributor) commented on May 11, 2022

Here are comparisons using noasm vs. asm using latest:

crypto/internal/nistec:
ScalarMult/P256         153µs ± 0%     145µs ± 0%   -5.84%  (p=1.000 n=1+1)
ScalarBaseMult/P256    45.2µs ± 0%    23.5µs ± 0%  -48.14%  (p=1.000 n=1+1)

crypto/elliptic:
ScalarBaseMult/P256                   52.3µs ± 0%    38.3µs ± 0%  -26.78%  (p=1.000 n=1+1)
ScalarMult/P256                        161µs ± 0%     160µs ± 0%   -1.02%  (p=1.000 n=1+1)

crypto/ecdsa:
Sign/P256           96.2µs ± 0%    87.8µs ± 0%   -8.71%  (p=1.000 n=1+1)
Verify/P256          212µs ± 0%     196µs ± 0%   -7.43%  (p=1.000 n=1+1)
GenerateKey/P256    53.8µs ± 0%    40.0µs ± 0%  -25.63%  (p=1.000 n=1+1)

No meaningful difference in the crypto/tls benchmarks.
It looks like the assembly version is still significantly faster than the pure Go version in some benchmarks, notably ScalarBaseMult and GenerateKey.

added this to the Go1.20 milestone on Aug 20, 2022
modified the milestones: Go1.20, Go1.21 on Feb 1, 2023
modified the milestones: Go1.21, Go1.22 on Aug 8, 2023
modified the milestones: Go1.22, Go1.23 on Feb 6, 2024
modified the milestones: Go1.23, Go1.24 on Aug 13, 2024
modified the milestones: Go1.24, Go1.25 on Feb 11, 2025
modified the milestones: Go1.25, Backlog on Feb 27, 2025
Participants: @FiloSottile, @gopherbot, @laboger, @seankhliao