-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
std/crypto: add chacha20 #1369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std/crypto: add chacha20 #1369
Conversation
@shawnl This is cool -- can we get more test vectors? |
std/crypto/chacha20.zig
Outdated
var remaining: usize = undefined; | ||
var cursor: usize = 0; | ||
|
||
//if (in.len > out.len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to keep commented lines out of master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already had noticed this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should error out if the lengths are not equal, but it seemed like so little to make the function return an error.
std/crypto/chacha20.zig
Outdated
} | ||
} | ||
|
||
// https://tools.ietf.org/html/draft-agl-tls-chacha20poly1305-04#section-7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's test for all available test vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first 4 are trivial, I added the complicated one. I think we only need one of the trivial ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
3244390
to
f084d52
Compare
std/crypto/chacha20.zig
Outdated
return out; | ||
} | ||
|
||
pub fn chaCha20(in: []const u8, key: [32]u8, nonce: [8]u8, out: *[]u8) void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using a slice as an output result, you don't need to pass it as a pointer.
Use out: []u8
and then when calling just use result[0..]
to pass the slice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that resulted in a segfault in the compiler!: #1372
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to remove the current out
dereferences in your code such as out.*
. A slice in zig is just a pointer and a length pair. Since this is already a pointer to memory passing it as a pointer is necessary indirection and not required.
The segfault is a bug though and I've clarified in that issue.
Should |
ef1c30c
to
2d068ce
Compare
I would be more inclined to have the core functionality not return an error at all. Instead I'd just add debug assertions on what is required by the caller and make them handle the required invariants. So remove both the overflow and different buffer slice length errors. |
431ff2e
to
78b6134
Compare
Thanks! I'll do a full review when I get back from vacation in 1 week. Alternately I trust @tiehuis's judgment on this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't made any comments or looked into the algorithm details since I'm not that familiar so they are mostly style.
I did find the original Poly1305 and X25519 reference sources pretty opaque and terse (especially with regards to just using arrays directly everywhere). Here is an alternative which could provide some extra insight/clarity. This isn't a suggestion to completely change them though, since I can see these work fine as is right now.
You will also want to add entries to CMakeLists.txt
in the root to add these new files as part of the install process.
std/crypto/chacha20.zig
Outdated
@@ -0,0 +1,426 @@ | |||
// Based on public domain Supercop by Daniel J. Bernstein | |||
|
|||
const mem = @import("../mem.zig"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find doing a top level const std = @import("../index.zig")
and referring to submodules from that base is the cleanest way to do this in std.
std/crypto/chacha20.zig
Outdated
}; | ||
} | ||
|
||
fn rotate(a: u32, b: u5) u32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use std.math.rotl
for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std/crypto/chacha20.zig
Outdated
Rp( 3, 4, 9,14), | ||
}; | ||
comptime var j: usize = 20; | ||
inline while (j > 0) : (j -=2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd check the performance here when inlining things like this. In particular I would expect inlining the inner rounds to work better than inlining the outer loop. I don't think inlining the outer loop would provide much and the smaller code is probably more valuable. Of course you need to measure this before just taking my word.
Small note too but counting up with usize
is usually slightly better in my opinion as it avoids the possible zero underflow that is very prevalent when attempting to port C code. It doesn't matter here however.
std/crypto/chacha20.zig
Outdated
var remaining: usize = undefined; | ||
var cursor: usize = 0; | ||
|
||
if (in.len > out.len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simulate a ternary with assignment on single expression if statements
var remaining = if (in.len > out.len) in.len else out.len
.
std/crypto/chacha20.zig
Outdated
} else | ||
remaining = out.len; | ||
|
||
comptime const c = "expand 32-byte k"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would omit these explicit comptime statements. In general I'd only comptime if semantically required or there is measured performance improvement. LLVM can probably figure this out by itself anyway.
std/crypto/x25519.zig
Outdated
} | ||
|
||
fn mainloop(work: [*]u32, e: [*]u8) void { | ||
var xzm1: [64]u32 = undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at the source you took this off and it seems slightly questionable here in how it defines these arrays. It only ever uses these in halves so when switching to a specific field element it should be sufficient to simply split all the [64]u32
into two [32]u32
, replace those with a specific field element and then adjust the usages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it uses them together in select()
std/crypto/x25519.zig
Outdated
e[31] &= 127; | ||
e[31] |= 64; | ||
i = 0; | ||
inline while (i < 32) : (i += 1) work[i] = p[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a std.mem.copy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it isn't. It is moving from u8 to u32
std/crypto/x25519.zig
Outdated
mult(out,&t1,&z11); | ||
} | ||
|
||
fn crypto_scalarmult(n: []const u8, p: []const u8) [32]u8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a reminder from Poly1305, avoid returning an array. This is internal so isn't as important but for consistency I still would pass a result slice.
std/crypto/x25519.zig
Outdated
|
||
// https://tools.ietf.org/html/rfc7748#section-5.2 | ||
test "crypto.X25519Mul test vector 1" { | ||
const assert = @import("../debug/index.zig").assert; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you only require two asserts for these tests just @import("std")
at the top-level and fully qualify std.debug.assert
.
std/crypto/poly1305.zig
Outdated
// length 16, but the type system is annoying me | ||
//poly1305.zig:75:26: error: cast discards const qualifier | ||
// s.wholeBlock(@ptrCast(*[16]u8, buf[i - 16..i].ptr).*); | ||
fn wholeBlock(s: *Poly1305, buf: [*]const u8) void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this stage I'd simply recommend using []const u8
slices as arguments and never arrays, even internally. The extra length field being passed is negligible and they are much more versatile and easy to use. You can debug.assert
as I've mentioned a few times to enforce a length (at runtime which is the downside).
We can adjust at a later date if something like #863 is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I ran into #863, which is why I used raw pointers everywhere.
I had a closer look at the reference for Poly1305 and I'd strongly suggest instead basing your implementation on the ref10 version instead. The ref version as here appears an order of magnitude (x10) slower than that. The complexity isn't much greater and I think it is worthwhile to do this from the start. |
@tiehuis good catch |
Sorry meant the Curve25519 implementation. |
I found this: https://github.com/project-everest/hacl-star Formally verified reference cryptographic primitives. And now I am questioning all of these. |
I wouldn't get too caught up in formal verification of these and aiming for absolute perfection. This could be considered a preliminary version so we can work towards say, a fully-zig https server. |
std/crypto/poly1305.zig
Outdated
inline while (i < 16) : (i += 1) | ||
out[i] = @intCast(u8, s.h[i]); | ||
|
||
mem.set(u32, s.r[0..], 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a secureZero
function which you can use here to ensure the writes are not optimized out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome, I was just about to ask for it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also need a memory barrier to write it to memory from cpu cache.
lib/string.c:
/**
* memzero_explicit - Fill a region of memory (e.g. sensitive
* keying data) with 0s.
* @s: Pointer to the start of the area.
* @count: The size of the area.
*
* Note: usually using memset() is just fine (!), but in cases
* where clearing out _local_ data at the end of a scope is
* necessary, memzero_explicit() should be used instead in
* order to prevent the compiler from optimising away zeroing.
*
* memzero_explicit() doesn't need an arch-specific version as
* it just invokes the one of memset() implicitly.
*/
void memzero_explicit(void *s, size_t count)
{
memset(s, 0, count);
barrier_data(s);
}
EXPORT_SYMBOL(memzero_explicit);
I think you're thinking of that zig compiles programs as one big .o file. LLVM with -O3 can sometimes decide to turn that into one big function. |
@shawnl tell me when you consider this ready for review |
I think chacha20 is ready for review. the others are not as good. |
@shawnl maybe add another PR with x25519 |
I think the chacha20 is functionally fine. Here is a patch however which improves the performance. Goes from std/crypto/chacha20.zig | 46 ++++++++++++++++++++++------------------------
1 file changed, 22 insertions(+), 24 deletions(-)
diff --git a/std/crypto/chacha20.zig b/std/crypto/chacha20.zig
index 3030569a..836c8c88 100644
--- a/std/crypto/chacha20.zig
+++ b/std/crypto/chacha20.zig
@@ -22,19 +22,15 @@ fn Rp(a: usize, b: usize, c: usize, d: usize) QuarterRound {
};
}
-fn rotate(a: u32, b: u5) u32 {
- return ((a << b) |
- (a >> @intCast(u5, (32 - @intCast(u6, b))))
- );
-}
-
// The chacha family of ciphers are based on the salsa family.
-fn salsa20_wordtobyte(input: [16]u32) [64]u8 {
+fn salsa20_wordtobyte(out: []u8, input: [16]u32) void {
+ assert(out.len >= 64);
+
var x: [16]u32 = undefined;
- var out: [64]u8 = undefined;
for (x) |_, i|
x[i] = input[i];
+
const rounds = comptime []QuarterRound{
Rp( 0, 4, 8,12),
Rp( 1, 5, 9,13),
@@ -45,20 +41,21 @@ fn salsa20_wordtobyte(input: [16]u32) [64]u8 {
Rp( 2, 7, 8,13),
Rp( 3, 4, 9,14),
};
- comptime var j: usize = 20;
- inline while (j > 0) : (j -=2) {
- for (rounds) |r| {
- x[r.a] +%= x[r.b]; x[r.d] = rotate(x[r.d] ^ x[r.a], 16);
- x[r.c] +%= x[r.d]; x[r.b] = rotate(x[r.b] ^ x[r.c], 12);
- x[r.a] +%= x[r.b]; x[r.d] = rotate(x[r.d] ^ x[r.a], 8);
- x[r.c] +%= x[r.d]; x[r.b] = rotate(x[r.b] ^ x[r.c], 7);
+
+ comptime var j: usize = 0;
+ inline while (j < 20) : (j += 2) {
+ // two-round cycles
+ inline for (rounds) |r| {
+ x[r.a] +%= x[r.b]; x[r.d] = std.math.rotl(u32, x[r.d] ^ x[r.a], u32(16));
+ x[r.c] +%= x[r.d]; x[r.b] = std.math.rotl(u32, x[r.b] ^ x[r.c], u32(12));
+ x[r.a] +%= x[r.b]; x[r.d] = std.math.rotl(u32, x[r.d] ^ x[r.a], u32(8));
+ x[r.c] +%= x[r.d]; x[r.b] = std.math.rotl(u32, x[r.b] ^ x[r.c], u32(7));
}
}
- for (x) |_, i|
- x[i] +%= input[i];
- for (x) |_, i|
- mem.writeInt(out[4 * i .. 4 * i + 4], x[i], builtin.Endian.Little);
- return out;
+
+ for (x) |_, i| {
+ mem.writeInt(out[4 * i .. 4 * i + 4], x[i] +% input[i], builtin.Endian.Little);
+ }
}
fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) void {
@@ -73,13 +70,14 @@ fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) vo
mem.readIntLE(u32, c[8..12]),
mem.readIntLE(u32, c[12..16]),
};
-
+
mem.copy(u32, ctx[0..], constant_le[0..4]);
mem.copy(u32, ctx[4..12], key[0..8]);
mem.copy(u32, ctx[12..16], counter[0..4]);
while (true) {
- var buf = salsa20_wordtobyte(ctx);
+ var buf: [64]u8 = undefined;
+ salsa20_wordtobyte(buf[0..], ctx);
if (remaining < 64) {
var i: usize = 0;
@@ -88,8 +86,8 @@ fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) vo
return;
}
- comptime var i: usize = 0;
- inline while (i < 64) : (i += 1)
+ var i: usize = 0;
+ while (i < 64) : (i += 1)
out[cursor + i] = in[cursor + i] ^ buf[i];
cursor += 64;
The main changes are:
The modified throughput test I've used to benchmark is as follows. Interestingly we need to allocate memory instead of using a fixed buffer else Zig optimizes the whole thing out.
|
The main changes are: Unrolling the inner rounds of salsa20_wordtobyte which doubles the speed. Passing the slice explicitly instead of returning the array saves a copy (can optimize out in future with copy elision) and gives ~10% improvement. Inlining the outer loop gives ~15-20% improvement but it costs an extra 4Kb of code space. I think the tradeoff is worthwhile here. The other inline loops are small and can be done by the compiler if it is worthwhile. The rotate function replacement doesn't alter the performance from the former. The modified throughput test I've used to benchmark is as follows. Interestingly we need to allocate memory instead of using a fixed buffer else Zig optimizes the whole thing out. ziglang#1369 (comment)
the goal is to implement wireguard with chacha20-poly1305 and x25519, perhaps TLS later