std/crypto: add chacha20 #1369


Merged: 2 commits merged into ziglang:master on Aug 28, 2018

Conversation

@shawnl (Contributor) commented Aug 11, 2018

The goal is to implement WireGuard with ChaCha20-Poly1305 and X25519, and perhaps TLS later.

@shawnl mentioned this pull request on Aug 11, 2018
@kristate (Contributor):

@shawnl This is cool -- can we get more test vectors?
https://tools.ietf.org/html/draft-agl-tls-chacha20poly1305-04#section-7

var remaining: usize = undefined;
var cursor: usize = 0;

//if (in.len > out.len) {
Contributor:

Let's try to keep commented lines out of master.

Contributor Author:

I had already noticed this.

Contributor Author:

Maybe I should error out if the lengths are not equal, but it seemed like too minor a case to justify making the function return an error.

}
}

// https://tools.ietf.org/html/draft-agl-tls-chacha20poly1305-04#section-7
Contributor:

Let's test against all of the available test vectors.

Contributor Author:

The first 4 are trivial; I added the complicated one. I think we only need one of the trivial ones.

Contributor Author:

Done.

@shawnl force-pushed the crypto branch 3 times, most recently from 3244390 to f084d52 on August 11, 2018 04:45
return out;
}

pub fn chaCha20(in: []const u8, key: [32]u8, nonce: [8]u8, out: *[]u8) void {
Member:

When using a slice as an output result, you don't need to pass it as a pointer.

Use out: []u8 and then when calling just use result[0..] to pass the slice.
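For illustration, a minimal sketch of the two signatures (the function names here are placeholders, not the PR's code):

// Taking a pointer to a slice adds a level of indirection the callee does
// not need; a slice already carries a pointer and a length.
fn fillViaPointer(out: *[]u8) void {
    for (out.*) |*b| b.* = 0;
}

// Taking the slice by value still writes into the caller's memory,
// because the slice's pointer refers to that memory.
fn fillViaSlice(out: []u8) void {
    for (out) |*b| b.* = 0;
}

test "the call site just passes a slice" {
    var buf: [16]u8 = undefined;
    fillViaSlice(buf[0..]);
}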

Contributor Author:

That resulted in a segfault in the compiler: #1372

@tiehuis (Member) Aug 12, 2018:

You need to remove the current out dereferences in your code, such as out.*. A slice in Zig is just a pointer and a length pair; since it is already a pointer to memory, passing it by pointer adds indirection that is not required.

The segfault is a bug, though, and I've clarified that in the issue.

@shawnl (Contributor Author) commented Aug 11, 2018

Should chaCha20() return an error if the input and output are not the same length?

@shawnl force-pushed the crypto branch 3 times, most recently from ef1c30c to 2d068ce on August 11, 2018 20:09
@tiehuis (Member) commented Aug 12, 2018

I would be more inclined to have the core functionality not return an error at all. Instead, I'd just add debug assertions on what is required of the caller and have the caller handle the required invariants.

So remove both the overflow and different buffer slice length errors.
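For illustration, a minimal sketch of the assertion style being suggested (the signature and body are placeholders, not the PR's final code):

const std = @import("std");
const assert = std.debug.assert;

pub fn chaCha20(out: []u8, in: []const u8, key: [32]u8, nonce: [8]u8) void {
    // The invariant becomes the caller's responsibility: checked in debug
    // builds, compiled away in release-fast, and no error union in the API.
    assert(in.len == out.len);

    // ... keystream generation and XOR of `in` into `out` ...
}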

@shawnl force-pushed the crypto branch 3 times, most recently from 431ff2e to 78b6134 on August 12, 2018 01:20
@andrewrk (Member):

Thanks! I'll do a full review when I get back from vacation in 1 week.

Alternatively, I trust @tiehuis's judgment on this one.

@shawnl changed the title from std/crypto: add chacha20 to std/crypto: add chacha20 and x25519 on Aug 14, 2018
@shawnl changed the title from std/crypto: add chacha20 and x25519 to std/crypto: add chacha20 and x25519 and poly1305 on Aug 14, 2018
@tiehuis (Member) left a comment:

I haven't made any comments on the algorithm details or looked into them, since I'm not that familiar with them, so these comments are mostly about style.

I did find the original Poly1305 and X25519 reference sources pretty opaque and terse (especially with regard to just using arrays directly everywhere). Here is an alternative which could provide some extra insight/clarity. This isn't a suggestion to completely change them, though, since I can see these work fine as is right now.

You will also want to add entries to CMakeLists.txt in the root to add these new files as part of the install process.

@@ -0,0 +1,426 @@
// Based on public domain Supercop by Daniel J. Bernstein

const mem = @import("../mem.zig");
Member:

I find doing a top level const std = @import("../index.zig") and referring to submodules from that base is the cleanest way to do this in std.
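For illustration, that pattern might look like this (a sketch; the relative path follows the 2018 std tree, where the std root file was index.zig):

const std = @import("../index.zig");
const mem = std.mem;
const debug = std.debug;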

};
}

fn rotate(a: u32, b: u5) u32 {
Member:

You can use std.math.rotl for this.

Contributor Author:

Trying std.math.rotl currently hits a compile error:

zig/std/math/index.zig:517:50: error: type 'comptime_int' does not support field access
pub fn absCast(x: var) @inttype(false, @typeof(x).bit_count) {

Rp( 3, 4, 9,14),
};
comptime var j: usize = 20;
inline while (j > 0) : (j -=2) {
Member:

I'd check the performance here when inlining things like this. In particular, I would expect inlining the inner rounds to work better than inlining the outer loop. I don't think inlining the outer loop would provide much, and the smaller code is probably more valuable. Of course, you need to measure this rather than just taking my word for it.

A small note, too: counting up with usize is usually slightly better in my opinion, as it avoids the possible zero underflow that is very prevalent when porting C code. It doesn't matter here, however.
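For illustration (a sketch, not code from this PR): a C-style countdown such as for (i = len - 1; i >= 0; i--), ported naively to a usize index, underflows at zero, while counting up sidesteps the problem.

fn sumUp(buf: []const u8) usize {
    // counting up: the index only moves toward buf.len, so the usize
    // can never wrap below zero
    var total: usize = 0;
    var i: usize = 0;
    while (i < buf.len) : (i += 1) {
        total += buf[i];
    }
    return total;
}

fn sumDown(buf: []const u8) usize {
    // counting down needs the `i > 0` guard and `i - 1` indexing; an
    // `i >= 0` condition on a usize would never be false, and decrementing
    // past zero wraps around to a huge value
    var total: usize = 0;
    var i: usize = buf.len;
    while (i > 0) : (i -= 1) {
        total += buf[i - 1];
    }
    return total;
}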

var remaining: usize = undefined;
var cursor: usize = 0;

if (in.len > out.len) {
Member:

You can simulate a ternary by assigning the result of a single-expression if statement:

var remaining = if (in.len > out.len) in.len else out.len;

} else
remaining = out.len;

comptime const c = "expand 32-byte k";
Member:

I would omit these explicit comptime statements. In general, I'd only use comptime if it is semantically required or there is a measured performance improvement. LLVM can probably figure this out by itself anyway.

}

fn mainloop(work: [*]u32, e: [*]u8) void {
var xzm1: [64]u32 = undefined;
@tiehuis (Member) Aug 14, 2018:

I had a look at the source you took this from, and how it defines these arrays seems slightly questionable. It only ever uses them in halves, so when switching to a specific field element it should be sufficient to split each [64]u32 into two [32]u32, replace those with a specific field element, and then adjust the usages.

Contributor Author:

Yeah that would work.

Contributor Author:

It uses them together in select().

e[31] &= 127;
e[31] |= 64;
i = 0;
inline while (i < 32) : (i += 1) work[i] = p[i];
Member:

This is just a std.mem.copy.

Contributor Author:

No it isn't: it is converting from u8 to u32, so the element types differ.
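For context (a sketch, not the PR's code): std.mem.copy takes a single element type for both slices, so a widening u8-to-u32 copy needs an explicit loop.

const std = @import("std");
const mem = std.mem;

// same element type on both sides: mem.copy applies
fn copyBytes(dst: []u8, src: []const u8) void {
    mem.copy(u8, dst, src);
}

// different element types (u8 -> u32): copy element by element;
// the u8 value widens to u32 implicitly
fn widenBytes(dst: []u32, src: []const u8) void {
    for (src) |byte, i| {
        dst[i] = byte;
    }
}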

mult(out,&t1,&z11);
}

fn crypto_scalarmult(n: []const u8, p: []const u8) [32]u8 {
Member:

Just a reminder from the Poly1305 review: avoid returning an array. This is internal, so it isn't as important, but for consistency I would still pass a result slice.


// https://tools.ietf.org/html/rfc7748#section-5.2
test "crypto.X25519Mul test vector 1" {
const assert = @import("../debug/index.zig").assert;
Member:

If you only require a couple of asserts for these tests, just @import("std") at the top level and fully qualify std.debug.assert.

// length 16, but the type system is annoying me
//poly1305.zig:75:26: error: cast discards const qualifier
// s.wholeBlock(@ptrCast(*[16]u8, buf[i - 16..i].ptr).*);
fn wholeBlock(s: *Poly1305, buf: [*]const u8) void {
Member:

At this stage I'd simply recommend using []const u8 slices as arguments and never arrays, even internally. The extra length field being passed is negligible, and slices are much more versatile and easier to use. You can debug.assert, as I've mentioned a few times, to enforce a length (at runtime, which is the downside).

We can adjust at a later date if something like #863 is implemented.

Contributor Author:

Yes, I ran into #863, which is why I used raw pointers everywhere.

@tiehuis (Member) commented Aug 15, 2018

I had a closer look at the reference for Poly1305, and I'd strongly suggest basing your implementation on the ref10 version instead. The ref version used here appears to be an order of magnitude (10x) slower than that. The complexity isn't much greater, and I think it is worthwhile to do this from the start.

https://bench.cr.yp.to/impl-sign/ed25519.html

@shawnl (Contributor Author) commented Aug 15, 2018

@tiehuis good catch

@tiehuis (Member) commented Aug 15, 2018

Sorry, I meant the Curve25519 implementation.

https://bench.cr.yp.to/impl-scalarmult/curve25519.html

@shawnl (Contributor Author) commented Aug 17, 2018

I found this: https://github.com/project-everest/hacl-star

Formally verified reference cryptographic primitives. And now I am questioning all of these.

@tiehuis (Member) commented Aug 17, 2018

I wouldn't get too caught up in formally verifying these or in aiming for absolute perfection. This can be considered a preliminary version so that we can work towards, say, a fully-Zig HTTPS server.

inline while (i < 16) : (i += 1)
out[i] = @intCast(u8, s.h[i]);

mem.set(u32, s.r[0..], 0);
Member:

I've added a secureZero function which you can use here to ensure the writes are not optimized out.

53b18b0
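A sketch of the call site (assuming secureZero mirrors mem.set's (type, slice) shape, as the linked commit suggests):

// unlike mem.set, secureZero is meant not to be elided as a dead store
mem.secureZero(u32, s.r[0..]);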

Contributor Author:

Awesome, I was just about to ask for it!

@shawnl (Contributor Author) Aug 18, 2018:

You also need a memory barrier to force the write out to memory from the CPU cache. Compare the Linux kernel's lib/string.c:

/**
 * memzero_explicit - Fill a region of memory (e.g. sensitive
 *		      keying data) with 0s.
 * @s: Pointer to the start of the area.
 * @count: The size of the area.
 *
 * Note: usually using memset() is just fine (!), but in cases
 * where clearing out _local_ data at the end of a scope is
 * necessary, memzero_explicit() should be used instead in
 * order to prevent the compiler from optimising away zeroing.
 *
 * memzero_explicit() doesn't need an arch-specific version as
 * it just invokes the one of memset() implicitly.
 */
void memzero_explicit(void *s, size_t count)
{
	memset(s, 0, count);
	barrier_data(s);
}
EXPORT_SYMBOL(memzero_explicit);

@shawnl (Contributor Author) commented Aug 18, 2018

X25519 would be better with @mul() (#1350).

Most of the performance gain of ref10 comes from using 32-bit x 32-bit => 64-bit multiplication instead of 16-bit x 16-bit => 32-bit multiplication. I believe all the temporary-variable machinery is unnecessary; doesn't Zig compile programs as one big function?
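For illustration (a sketch, not code from this PR, using the old-style casts seen elsewhere in this thread): the widening multiply can already be written by casting before multiplying, which is what a builtin like @mul() would express in one step.

// 16-bit x 16-bit => 32-bit limbs: more limbs and more partial products
fn mul16(a: u16, b: u16) u32 {
    return u32(a) * u32(b);
}

// 32-bit x 32-bit => 64-bit limbs: half as many limbs, so far fewer
// multiplications overall
fn mul32(a: u32, b: u32) u64 {
    return u64(a) * u64(b);
}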

@andrewrk (Member):

> doesn't zig compile programs as one big function?

I think you're thinking of the fact that Zig compiles programs as one big .o file. LLVM with -O3 can sometimes decide to turn that into one big function.

@andrewrk (Member):

@shawnl tell me when you consider this ready for review

@andrewrk added the "work in progress" label on Aug 20, 2018
@andrewrk added this to the 0.4.0 milestone on Aug 25, 2018
@shawnl changed the title from std/crypto: add chacha20 and x25519 and poly1305 to std/crypto: add chacha20 on Aug 28, 2018
@shawnl (Contributor Author) commented Aug 28, 2018

I think chacha20 is ready for review. The others are not as good.

@kristate (Contributor):

@shawnl, maybe add another PR with x25519.

@tiehuis (Member) commented Aug 28, 2018

I think the chacha20 is functionally fine. Here is a patch, however, that improves the performance.

It goes from 176 MiB/s to 431 MiB/s for me.

  std/crypto/chacha20.zig | 46 ++++++++++++++++++++++------------------------
 1 file changed, 22 insertions(+), 24 deletions(-)

diff --git a/std/crypto/chacha20.zig b/std/crypto/chacha20.zig
index 3030569a..836c8c88 100644
--- a/std/crypto/chacha20.zig
+++ b/std/crypto/chacha20.zig
@@ -22,19 +22,15 @@ fn Rp(a: usize, b: usize, c: usize, d: usize) QuarterRound {
     };
 }
 
-fn rotate(a: u32, b: u5) u32 {
-    return ((a << b) |
-            (a >> @intCast(u5, (32 - @intCast(u6, b))))
-           );
-}
-
 // The chacha family of ciphers are based on the salsa family.
-fn salsa20_wordtobyte(input: [16]u32) [64]u8 {
+fn salsa20_wordtobyte(out: []u8, input: [16]u32) void {
+    assert(out.len >= 64);
+
     var x: [16]u32 = undefined;
-    var out: [64]u8 = undefined;
 
     for (x) |_, i|
         x[i] = input[i];
+
     const rounds = comptime []QuarterRound{
         Rp( 0, 4, 8,12),
         Rp( 1, 5, 9,13),
@@ -45,20 +41,21 @@ fn salsa20_wordtobyte(input: [16]u32) [64]u8 {
         Rp( 2, 7, 8,13),
         Rp( 3, 4, 9,14),
     };
-    comptime var j: usize = 20;
-    inline while (j > 0) : (j -=2) {
-        for (rounds) |r| {
-            x[r.a] +%= x[r.b]; x[r.d] = rotate(x[r.d] ^ x[r.a], 16);
-            x[r.c] +%= x[r.d]; x[r.b] = rotate(x[r.b] ^ x[r.c], 12);
-            x[r.a] +%= x[r.b]; x[r.d] = rotate(x[r.d] ^ x[r.a],  8);
-            x[r.c] +%= x[r.d]; x[r.b] = rotate(x[r.b] ^ x[r.c],  7);
+
+    comptime var j: usize = 0;
+    inline while (j < 20) : (j += 2) {
+        // two-round cycles
+        inline for (rounds) |r| {
+            x[r.a] +%= x[r.b]; x[r.d] = std.math.rotl(u32, x[r.d] ^ x[r.a], u32(16));
+            x[r.c] +%= x[r.d]; x[r.b] = std.math.rotl(u32, x[r.b] ^ x[r.c], u32(12));
+            x[r.a] +%= x[r.b]; x[r.d] = std.math.rotl(u32, x[r.d] ^ x[r.a],  u32(8));
+            x[r.c] +%= x[r.d]; x[r.b] = std.math.rotl(u32, x[r.b] ^ x[r.c],  u32(7));
         }
     }
-    for (x) |_, i|
-        x[i] +%= input[i];
-    for (x) |_, i|
-        mem.writeInt(out[4 * i .. 4 * i + 4], x[i], builtin.Endian.Little);
-    return out;
+
+    for (x) |_, i| {
+        mem.writeInt(out[4 * i .. 4 * i + 4], x[i] +% input[i], builtin.Endian.Little);
+    }
 }
 
 fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) void {
@@ -73,13 +70,14 @@ fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) vo
         mem.readIntLE(u32, c[8..12]),
         mem.readIntLE(u32, c[12..16]),
     };
-    
+
     mem.copy(u32, ctx[0..], constant_le[0..4]);
     mem.copy(u32, ctx[4..12], key[0..8]);
     mem.copy(u32, ctx[12..16], counter[0..4]);
 
     while (true) {
-        var buf = salsa20_wordtobyte(ctx);
+        var buf: [64]u8 = undefined;
+        salsa20_wordtobyte(buf[0..], ctx);
 
         if (remaining < 64) {
             var i: usize = 0;
@@ -88,8 +86,8 @@ fn chaCha20_internal(out: []u8, in: []const u8, key: [8]u32, counter: [4]u32) vo
             return;
         }
 
-        comptime var i: usize = 0;
-        inline while (i < 64) : (i += 1)
+        var i: usize = 0;
+        while (i < 64) : (i += 1)
             out[cursor + i] = in[cursor + i] ^ buf[i];
 
         cursor += 64;

The main changes are:

  • Unrolling the inner rounds of salsa20_wordtobyte, which doubles the speed.
  • Passing the output slice explicitly instead of returning the array saves a copy (this can be optimized away in future with copy elision) and gives a ~10% improvement.
  • Inlining the outer loop gives a ~15-20% improvement but costs an extra 4 KB of code space. I think the tradeoff is worthwhile here.
  • The other inline loops are small and can be left to the compiler if it decides they are worthwhile.
  • The rotate function replacement doesn't alter the performance compared to the former version.

The modified throughput test I've used to benchmark is as follows. Interestingly, we need to allocate the memory instead of using a fixed buffer, or else Zig optimizes the whole thing away.

const std = @import("std");
const time = std.os.time;
const Timer = time.Timer;
const chacha20 = @import("chacha20.zig");

const MiB = 1024 * 1024;
const BytesToHash = 128 * MiB;

pub fn main() !void {
    const key = []u8{
        0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,
        24, 25, 26, 27, 28, 29, 30, 31,
    };
    const nonce = []u8{
        0, 0, 0, 0,
        0, 0, 0, 0x4a,
        0, 0, 0, 0,
    };

    var stdout_file = try std.io.getStdOut();
    var stdout_out_stream = std.io.FileOutStream.init(&stdout_file);
    const stdout = &stdout_out_stream.stream;

    var rng = std.rand.DefaultPrng.init(0);

    var input = try std.heap.c_allocator.alloc(u8, BytesToHash);
    //var input: [BytesToHash]u8 = undefined;
    rng.random.bytes(input);

    var output = try std.heap.c_allocator.alloc(u8, BytesToHash);

    var timer = try Timer.start();
    const start = timer.lap();
    chacha20.chaCha20IETF(output, input, 1, key, nonce);
    const end = timer.read();

    const elapsed_s = @intToFloat(f64, end - start) / time.ns_per_s;
    const throughput = @floatToInt(u64, BytesToHash / elapsed_s);

    try stdout.print("{}: {} MiB/s\n", "chacha20", throughput / (1 * MiB));
}

@andrewrk removed the "work in progress" label on Aug 28, 2018
@andrewrk modified the milestones: 0.4.0, 0.3.0 on Aug 28, 2018
@andrewrk merged commit 9de0f90 into ziglang:master on Aug 28, 2018
@shawnl deleted the crypto branch on March 29, 2019 16:39