std/crypto: add chacha20 #1369


Merged: 2 commits merged into ziglang:master on Aug 28, 2018

Conversation

@shawnl (Contributor) commented Aug 11, 2018

The goal is to implement WireGuard with ChaCha20-Poly1305 and X25519, and perhaps TLS later.

@shawnl mentioned this pull request on Aug 11, 2018
@kristate (Contributor):

@shawnl This is cool -- can we get more test vectors?
https://tools.ietf.org/html/draft-agl-tls-chacha20poly1305-04#section-7

var remaining: usize = undefined;
var cursor: usize = 0;

//if (in.len > out.len) {
Contributor:

Let's try to keep commented lines out of master.

Contributor Author:

I had already noticed this.

Contributor Author:

Maybe I should error out if the lengths are not equal, but it seemed like too minor a case to justify making the function return an error.

}
}

// https://tools.ietf.org/html/draft-agl-tls-chacha20poly1305-04#section-7
Contributor:

Let's test against all of the available test vectors.

Contributor Author:

The first 4 are trivial; I added the complicated one. I think we only need one of the trivial ones.

Contributor Author:

Done.

@shawnl force-pushed the crypto branch 3 times, most recently from 3244390 to f084d52 on August 11, 2018 04:45
return out;
}

pub fn chaCha20(in: []const u8, key: [32]u8, nonce: [8]u8, out: *[]u8) void {
Member:

When using a slice as an output result, you don't need to pass it as a pointer.

Use out: []u8 and then when calling just use result[0..] to pass the slice.
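For illustration, a minimal sketch of the two signatures (the function names here are placeholders, not the PR's code):

// Taking a pointer to a slice adds a level of indirection the callee does
// not need; a slice already carries a pointer and a length.
fn fillViaPointer(out: *[]u8) void {
    for (out.*) |*b| b.* = 0;
}

// Taking the slice by value still writes into the caller's memory,
// because the slice's pointer refers to that memory.
fn fillViaSlice(out: []u8) void {
    for (out) |*b| b.* = 0;
}

test "the call site just passes a slice" {
    var buf: [16]u8 = undefined;
    fillViaSlice(buf[0..]);
}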

Contributor Author:

That resulted in a segfault in the compiler: #1372

@tiehuis (Member) Aug 12, 2018:

You need to remove the current out dereferences in your code, such as out.*. A slice in Zig is just a pointer and a length pair; since it is already a pointer to memory, passing it by pointer adds indirection that is not required.

The segfault is a bug, though, and I've clarified that in the issue.

@shawnl (Contributor Author) commented Aug 11, 2018

Should chaCha20() return an error if the input and output are not the same length?

@shawnl force-pushed the crypto branch 3 times, most recently from ef1c30c to 2d068ce on August 11, 2018 20:09
@tiehuis (Member) commented Aug 12, 2018

I would be more inclined to have the core functionality not return an error at all. Instead, I'd just add debug assertions on what is required of the caller and have the caller handle the required invariants.

So remove both the overflow and different buffer slice length errors.
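For illustration, a minimal sketch of the assertion style being suggested (the signature and body are placeholders, not the PR's final code):

const std = @import("std");
const assert = std.debug.assert;

pub fn chaCha20(out: []u8, in: []const u8, key: [32]u8, nonce: [8]u8) void {
    // The invariant becomes the caller's responsibility: checked in debug
    // builds, compiled away in release-fast, and no error union in the API.
    assert(in.len == out.len);

    // ... keystream generation and XOR of `in` into `out` ...
}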

@shawnl force-pushed the crypto branch 3 times, most recently from 431ff2e to 78b6134 on August 12, 2018 01:20
@andrewrk (Member):

Thanks! I'll do a full review when I get back from vacation in 1 week.

Alternatively, I trust @tiehuis's judgment on this one.

@shawnl changed the title from std/crypto: add chacha20 to std/crypto: add chacha20 and x25519 on Aug 14, 2018
@shawnl changed the title from std/crypto: add chacha20 and x25519 to std/crypto: add chacha20 and x25519 and poly1305 on Aug 14, 2018
@tiehuis (Member) left a comment:

I haven't made any comments on the algorithm details or looked into them, since I'm not that familiar with them, so these comments are mostly about style.

I did find the original Poly1305 and X25519 reference sources pretty opaque and terse (especially with regard to just using arrays directly everywhere). Here is an alternative which could provide some extra insight/clarity. This isn't a suggestion to completely change them, though, since I can see these work fine as is right now.

You will also want to add entries to CMakeLists.txt in the root to add these new files as part of the install process.

@@ -0,0 +1,426 @@
// Based on public domain Supercop by Daniel J. Bernstein

const mem = @import("../mem.zig");
Member:

I find doing a top level const std = @import("../index.zig") and referring to submodules from that base is the cleanest way to do this in std.
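For illustration, that pattern might look like this (a sketch; the relative path follows the 2018 std tree, where the std root file was index.zig):

const std = @import("../index.zig");
const mem = std.mem;
const debug = std.debug;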

};
}

fn rotate(a: u32, b: u5) u32 {
Member:

You can use std.math.rotl for this.

Contributor Author:

Trying std.math.rotl currently hits a compile error:

zig/std/math/index.zig:517:50: error: type 'comptime_int' does not support field access
pub fn absCast(x: var) @inttype(false, @typeof(x).bit_count) {

Rp( 3, 4, 9,14),
};
comptime var j: usize = 20;
inline while (j > 0) : (j -=2) {
Member:

I'd check the performance here when inlining things like this. In particular, I would expect inlining the inner rounds to work better than inlining the outer loop. I don't think inlining the outer loop would provide much, and the smaller code is probably more valuable. Of course, you need to measure this rather than just taking my word for it.

A small note, too: counting up with usize is usually slightly better in my opinion, as it avoids the possible zero underflow that is very prevalent when porting C code. It doesn't matter here, however.
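For illustration (a sketch, not code from this PR): a C-style countdown such as for (i = len - 1; i >= 0; i--), ported naively to a usize index, underflows at zero, while counting up sidesteps the problem.

fn sumUp(buf: []const u8) usize {
    // counting up: the index only moves toward buf.len, so the usize
    // can never wrap below zero
    var total: usize = 0;
    var i: usize = 0;
    while (i < buf.len) : (i += 1) {
        total += buf[i];
    }
    return total;
}

fn sumDown(buf: []const u8) usize {
    // counting down needs the `i > 0` guard and `i - 1` indexing; an
    // `i >= 0` condition on a usize would never be false, and decrementing
    // past zero wraps around to a huge value
    var total: usize = 0;
    var i: usize = buf.len;
    while (i > 0) : (i -= 1) {
        total += buf[i - 1];
    }
    return total;
}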

var remaining: usize = undefined;
var cursor: usize = 0;

if (in.len > out.len) {
Member:

You can simulate a ternary by assigning the result of a single-expression if statement:

var remaining = if (in.len > out.len) in.len else out.len;

} else
remaining = out.len;

comptime const c = "expand 32-byte k";
Member:

I would omit these explicit comptime statements. In general, I'd only use comptime if it is semantically required or there is a measured performance improvement. LLVM can probably figure this out by itself anyway.

}

fn mainloop(work: [*]u32, e: [*]u8) void {
var xzm1: [64]u32 = undefined;
@tiehuis (Member) Aug 14, 2018:

I had a look at the source you took this from, and how it defines these arrays seems slightly questionable. It only ever uses them in halves, so when switching to a specific field element it should be sufficient to split each [64]u32 into two [32]u32, replace those with a specific field element, and then adjust the usages.

Contributor Author:

Yeah that would work.

Contributor Author:

It uses them together in select().

e[31] &= 127;
e[31] |= 64;
i = 0;
inline while (i < 32) : (i += 1) work[i] = p[i];
Member:

This is just a std.mem.copy.

Contributor Author:

No it isn't: it is converting from u8 to u32, so the element types differ.
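For context (a sketch, not the PR's code): std.mem.copy takes a single element type for both slices, so a widening u8-to-u32 copy needs an explicit loop.

const std = @import("std");
const mem = std.mem;

// same element type on both sides: mem.copy applies
fn copyBytes(dst: []u8, src: []const u8) void {
    mem.copy(u8, dst, src);
}

// different element types (u8 -> u32): copy element by element;
// the u8 value widens to u32 implicitly
fn widenBytes(dst: []u32, src: []const u8) void {
    for (src) |byte, i| {
        dst[i] = byte;
    }
}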

mult(out,&t1,&z11);
}

fn crypto_scalarmult(n: []const u8, p: []const u8) [32]u8 {
Member:

Just a reminder from the Poly1305 review: avoid returning an array. This is internal, so it isn't as important, but for consistency I would still pass a result slice.


// https://tools.ietf.org/html/rfc7748#section-5.2
test "crypto.X25519Mul test vector 1" {
const assert = @import("../debug/index.zig").assert;
Member:

If you only require a couple of asserts for these tests, just @import("std") at the top level and fully qualify std.debug.assert.

// length 16, but the type system is annoying me
//poly1305.zig:75:26: error: cast discards const qualifier
// s.wholeBlock(@ptrCast(*[16]u8, buf[i - 16..i].ptr).*);
fn wholeBlock(s: *Poly1305, buf: [*]const u8) void {
Member:

At this stage I'd simply recommend using []const u8 slices as arguments and never arrays, even internally. The extra length field being passed is negligible, and slices are much more versatile and easier to use. You can debug.assert, as I've mentioned a few times, to enforce a length (at runtime, which is the downside).

We can adjust at a later date if something like #863 is implemented.

Contributor Author:

Yes, I ran into #863, which is why I used raw pointers everywhere.

@tiehuis (Member) commented Aug 15, 2018

I had a closer look at the reference for Poly1305, and I'd strongly suggest basing your implementation on the ref10 version instead. The ref version used here appears to be an order of magnitude (10x) slower than that. The complexity isn't much greater, and I think it is worthwhile to do this from the start.

https://bench.cr.yp.to/impl-sign/ed25519.html

@shawnl (Contributor Author) commented Aug 15, 2018

@tiehuis good catch

@tiehuis (Member) commented Aug 15, 2018

Sorry, I meant the Curve25519 implementation.

https://bench.cr.yp.to/impl-scalarmult/curve25519.html

@shawnl (Contributor Author) commented Aug 17, 2018

I found this: https://github.com/project-everest/hacl-star

Formally verified reference cryptographic primitives. And now I am questioning all of these.

@tiehuis (Member) commented Aug 17, 2018

I wouldn't get too caught up in formally verifying these or in aiming for absolute perfection. This can be considered a preliminary version so that we can work towards, say, a fully-Zig HTTPS server.

inline while (i < 16) : (i += 1)
out[i] = @intCast(u8, s.h[i]);

mem.set(u32, s.r[0..], 0);
Member:

I've added a secureZero function which you can use here to ensure the writes are not optimized out.

53b18b0
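A sketch of the call site (assuming secureZero mirrors mem.set's (type, slice) shape, as the linked commit suggests):

// unlike mem.set, secureZero is meant not to be elided as a dead store
mem.secureZero(u32, s.r[0..]);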

Contributor Author:

Awesome, I was just about to ask for it!

@shawnl (Contributor Author) Aug 18, 2018:

You also need a memory barrier to force the write out to memory from the CPU cache. Compare the Linux kernel's lib/string.c:

/**
 * memzero_explicit - Fill a region of memory (e.g. sensitive
 *		      keying data) with 0s.
 * @s: Pointer to the start of the area.
 * @count: The size of the area.
 *
 * Note: usually using memset() is just fine (!), but in cases
 * where clearing out _local_ data at the end of a scope is
 * necessary, memzero_explicit() should be used instead in
 * order to prevent the compiler from optimising away zeroing.
 *
 * memzero_explicit() doesn't need an arch-specific version as
 * it just invokes the one of memset() implicitly.
 */
void memzero_explicit(void *s, size_t count)
{
	memset(s, 0, count);
	barrier_data(s);
}
EXPORT_SYMBOL(memzero_explicit);

@shawnl (Contributor Author) commented Aug 18, 2018

X25519 would be better with @mul() (#1350).

Most of the performance gain of ref10 comes from using 32-bit x 32-bit => 64-bit multiplication instead of 16-bit x 16-bit => 32-bit multiplication. I believe all the temporary-variable machinery is unnecessary; doesn't Zig compile programs as one big function?
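For illustration (a sketch, not code from this PR, using the old-style casts seen elsewhere in this thread): the widening multiply can already be written by casting before multiplying, which is what a builtin like @mul() would express in one step.

// 16-bit x 16-bit => 32-bit limbs: more limbs and more partial products
fn mul16(a: u16, b: u16) u32 {
    return u32(a) * u32(b);
}

// 32-bit x 32-bit => 64-bit limbs: half as many limbs, so far fewer
// multiplications overall
fn mul32(a: u32, b: u32) u64 {
    return u64(a) * u64(b);
}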

@andrewrk (Member):

> doesn't zig compile programs as one big function?

I think you're thinking of the fact that Zig compiles programs as one big .o file. LLVM with -O3 can sometimes decide to turn that into one big function.

@andrewrk (Member):

@shawnl tell me when you consider this ready for review

@andrewrk added the "work in progress" label on Aug 20, 2018
@andrewrk added this to the 0.4.0 milestone on Aug 25, 2018
@shawnl changed the title from std/crypto: add chacha20 and x25519 and poly1305 to std/crypto: add chacha20 on Aug 28, 2018
@shawnl (Contributor Author) commented Aug 28, 2018

I think chacha20 is ready for review. The others are not as good.

@kristate (Contributor):

@shawnl, maybe add another PR with x25519.

@tiehuis (Member) commented Aug 28, 2018

I think the chacha20 is functionally fine. Here is a patch, however, that improves the performance.

It goes from 176 MiB/s to 431 MiB/s for me.

  std/crypto/chacha20.zig | 46 ++++++++++++++++++++++------------------------
 1 file changed, 22 insertions(+), 24 deletions(-)

diff --git a/std/crypto/chacha20.zig b/std/crypto/chacha20.zig
index 3030569a..836c8c88 100644
--- a/std/crypto/chacha20.zig
+++ b/std/crypto/chacha20.zig
@@ -22,19 +22,15 @@ fn Rp(a: usize, b: usize, c: usize, d: usize) QuarterRound {
     };
 }
 
-fn rotate(a: u32, b: u5) u32 {
-    return ((a << b) |
-            (a >> @intCast(u5, (32 - @intCast(u6, b))))
-           );
-}
-
 // The chacha family of ciphers are based on the salsa family.
-fn salsa20_wordtobyte(input: [16]u32) [64]u8 {
+fn salsa20_wordtobyte(out: []u8, input: [16]u32) void {
+    assert(out.len >= 64);
+
     var x: [16]u32 = undefined;
-    var out: [64]u8 = undefined;
 
     for (x) |_, i|
         x[i] = input[i];
+
     const rounds = comptime []QuarterRound{
         Rp( 0, 4, 8,12),
         Rp( 1, 5, 9,13),
@@ -45,20 +41,21 @@ fn salsa20_wordtobyte(input: [16]u32) [64]u8 {
         Rp( 2, 7, 8,13),
         Rp( 3, 4, 9,14),
     };
-    comptime var j: usize = 20;
-    inline while (j > 0) : (j -=2) {
-        for (rounds) |r| {
-            x[r.a] +%= x[r.b]; x[r.d] = rotate(x[r.d] ^ x[r.a], 16);
-            x[r.c] +%= x[r.d]; x[r.b] = rotate(x[r.b] ^ x[r.c], 12);
-            x[r.a] +%= x[r.b]; x[r.d] = rotate(x[r.d] ^ x[r.a],  8);
-            x[r.c] +%= x[r.d]; x[r.b] = rotate(x[r.b] ^ x[r.c],  7);
+
+    comptime var j: usize = 0;
+    inline while (j < 20) : (j += 2) {
+        // two-round cycles
+        inline for (rounds) |r| {
+            x[r.a] +%= x[r.b]; x[r.d] = std.math.rotl(u32, x[r.d] ^ x[r.a], u32(16));
+            x[r.c] +%= x[r.d]; x[r.b] = std.math.rotl(u32, x[r.b] ^ x[r.c], u32(12));
+            x[r.a] +%= x[r.b]; x[r.d] = std.math.rotl(u32, x[r.d] ^ x[r.a],  u32(8));
+            x[r.c] +%= x[r.d]; x[r.b] = std.math.rotl(u32, x[r.b] ^ x[r.c],  u32(7));
         }
     }
-    for (x) |_, i|
-        x[i] +%= input[i];
-    for (x) |_, i|
-        mem.writeInt(out[4 * i .. 4 * i + 4], x[i], builtin.Endian.Little);
-    return out;
+
+    for (x) |_, i| {
+        mem.writeInt(out[4 * i .. 4 * i + 4], x[i] +% input[i], builtin.Endian.Little);
+    }
 }
 
 fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) void {
@@ -73,13 +70,14 @@ fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) vo
         mem.readIntLE(u32, c[8..12]),
         mem.readIntLE(u32, c[12..16]),
     };
-    
+
     mem.copy(u32, ctx[0..], constant_le[0..4]);
     mem.copy(u32, ctx[4..12], key[0..8]);
     mem.copy(u32, ctx[12..16], counter[0..4]);
 
     while (true) {
-        var buf = salsa20_wordtobyte(ctx);
+        var buf: [64]u8 = undefined;
+        salsa20_wordtobyte(buf[0..], ctx);
 
         if (remaining < 64) {
             var i: usize = 0;
@@ -88,8 +86,8 @@ fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) vo
             return;
         }
 
-        comptime var i: usize = 0;
-        inline while (i < 64) : (i += 1)
+        var i: usize = 0;
+        while (i < 64) : (i += 1)
             out[cursor + i] = in[cursor + i] ^ buf[i];
 
         cursor += 64;

The main changes are:

  • Unrolling the inner rounds of salsa20_wordtobyte, which doubles the speed.
  • Passing the output slice explicitly instead of returning the array saves a copy (this can be optimized away in future with copy elision) and gives a ~10% improvement.
  • Inlining the outer loop gives a ~15-20% improvement but costs an extra 4 KB of code space. I think the tradeoff is worthwhile here.
  • The other inline loops are small and can be left to the compiler if it decides they are worthwhile.
  • The rotate function replacement doesn't alter the performance compared to the former version.

The modified throughput test I've used to benchmark is as follows. Interestingly, we need to allocate the memory instead of using a fixed buffer, or else Zig optimizes the whole thing away.

const std = @import("std");
const time = std.os.time;
const Timer = time.Timer;
const chacha20 = @import("chacha20.zig");

const MiB = 1024 * 1024;
const BytesToHash = 128 * MiB;

pub fn main() !void {
    const key = []u8{
        0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,
        24, 25, 26, 27, 28, 29, 30, 31,
    };
    const nonce = []u8{
        0, 0, 0, 0,
        0, 0, 0, 0x4a,
        0, 0, 0, 0,
    };

    var stdout_file = try std.io.getStdOut();
    var stdout_out_stream = std.io.FileOutStream.init(&stdout_file);
    const stdout = &stdout_out_stream.stream;

    var rng = std.rand.DefaultPrng.init(0);

    var input = try std.heap.c_allocator.alloc(u8, BytesToHash);
    //var input: [BytesToHash]u8 = undefined;
    rng.random.bytes(input);

    var output = try std.heap.c_allocator.alloc(u8, BytesToHash);

    var timer = try Timer.start();
    const start = timer.lap();
    chacha20.chaCha20IETF(output, input, 1, key, nonce);
    const end = timer.read();

    const elapsed_s = @intToFloat(f64, end - start) / time.ns_per_s;
    const throughput = @floatToInt(u64, BytesToHash / elapsed_s);

    try stdout.print("{}: {} MiB/s\n", "chacha20", throughput / (1 * MiB));
}

@andrewrk removed the "work in progress" label on Aug 28, 2018
@andrewrk modified the milestones: 0.4.0, 0.3.0 on Aug 28, 2018
@andrewrk merged commit 9de0f90 into ziglang:master on Aug 28, 2018
@shawnl deleted the crypto branch on March 29, 2019 16:39