-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Type Independent String Functions #891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'll style guide it up sometime soon :). |
A few of these seem very simular to function from
I don't think we need the same functions in two places. As for the rest, I like the idea of having an |
std/string_utils.zig
Outdated
return byte >= 'A' and byte <= 'Z'; | ||
} | ||
|
||
test "String_Utils" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend splitting this into some smaller tests like:
test "string_utils.is_ascii_letter" {
assert(is_ascii_letter('C'));
assert(is_ascii_letter('e'));
assert(!is_ascii_letter('2'));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes of course, I just wanted to get a basic testing suite up first :).
find_str handles both lowest and highest index which mem.indexOf doesn't, strip handles the ability to just do left or right side, or both where mem.trim doesn't. Split is well not lazy so yes perhaps it does, and str_eql calls mem_eql :), it is just there to replace mem.str_eql (I moved it as per my description). So overall the only function that really doesn't need to exist is starts_with which I just added due to ends_with existing. Would it be better for me to port my methods over to memory instead of having a separate string utility class? That is add the non-lazy method, the trimming either side, the highest/lowest indexOf choice, and the str_eql (since its used a lot for hashmaps) over to memory? But still keep the other 'ifX' functions in string utilities? |
I think that is better. After all, trimming, finding and splitting is not unique to strings. Unicode strings would probably need their own versions of some of these (if that is useful). |
Yes, I plan to do that separately :). I would say it is useful? Though we need a sort of locale sorted first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I'm a little worried that this leads us down a path of misuse of strings and unicode. See #234
.gitignore
Outdated
@@ -1,3 +1,4 @@ | |||
zig-cache/ | |||
build/ | |||
build-*/ | |||
.vscode/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #888
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough I'll remove it :), I shouldn't be calling 'git add .' anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be able to use git add .
. Everything to ignore falls into 2 categories: system-wide ignore, or project-specific ignore. If both are correct then git add .
works just fine.
std/index.zig
Outdated
@@ -30,6 +30,7 @@ pub const rand = @import("rand/index.zig"); | |||
pub const sort = @import("sort.zig"); | |||
pub const unicode = @import("unicode.zig"); | |||
pub const zig = @import("zig/index.zig"); | |||
pub const stringUtils = @import("string_utils.zig"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decision to put these things in something called "memory" rather than "string" is intentional.
A "string" would be something like []u32
, where each item is a unicode point. Zig std lib does not have a string data type yet - and it looks like this PR does not introduce one. Although we do have some utf8 decoding and encoding code in std/unicode.zig.
String utilities would be a welcome addition to the zig standard library, but they should operate on unicode points, not encoded bytes. Utilities that operate on encoded bytes should be clear about that, using parameter names such as "bytes" and explaining the difference between encoded strings and actual strings in the docs.
std/string_utils.zig
Outdated
return byte >= 'a' and byte <= 'z'; | ||
} | ||
|
||
pub fn is_ascii_upper(byte: u8) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly confident that we should not have asciiUpper and asciiLower in the standard library for these reasons:
- it's too easy to misuse, when actual strings should be used instead
- if it really is the odd case where you need ascii upper/ascii lower, writing out the function body is easy enough. it should be enough tedium to get people to consider using the actual string functions.
Do you have an actual use case for ascii upper/lower right now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the randomizer I use a few ascii functions(They take &const u8
because I use them as fn ptrs to generic functions), because I know the strings I get from the nds headers are ascii. I don't mind not having it in std
though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that's fair. Maybe it's not so bad to have them. We should definitely put a bunch of noticeable warning signs near them and avoid the use of the word "string".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They could live in ascii.zig
. Then you sign up for ascii problems by importing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that "ascii" is different from []u8
. ascii requires the top bit of every byte to be 0
, so if you're doing ascii string processing, you might consider validating that top bit.
those ascii functions you linked are correct, because they check for specific ranges of values, but something like ascii_to_upper()
should perhaps assert the top bit is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One could also just check if it was within 127? That avoids the bit masking operations (which are more expensive and remove clarity)? But yes it probably does have to be checked, though I would rather verify data through the locale viewer (which would check just once) rather than an indeterminate amount of times :).
std/unicode.zig
Outdated
// Is made up of one byte | ||
// Can just add a '0' and then the code point | ||
// Thus can just output 'c' | ||
var result: [1]u8 = undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewrk, @Hejsil
This is an idiom that I've done a lot since returning arrays from functions ain't pretty due to the arrays being allocated within the function stack, is there a better way to get an array output, and furthermore is there a better way to assign a slice to a new slice size without doing this 'trick'.
In most languages there is a syntax for creating a new array on the stack and assigning its value to something else, currently this requires a hidden memcpy presumably; which I think we could optimise away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is returning stack-allocated data from within the function to the caller and isn't valid.
Consider the following example:
const std = @import("std");
const warn = std.debug.warn;
fn example(out: &[]const u8, s: u8) void {
var a: [4]u8 = undefined;
a[0] = s;
a[1] = s + 1;
a[2] = s + 2;
a[3] = s + 3;
*out = a;
}
pub fn main() void {
var a: []const u8 = undefined;
example(&a, 0);
warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");
var b: []const u8 = undefined;
example(&b, 1);
warn("b: "); for (b) |e| { warn("{} ", e); } warn("\n");
warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");
}
Gives the output:
$ zig run example.zig
a: 0 1 2 3
b: 1 2 3 4
a: 1 2 3 4
That is the best case that is happening, there are no guarantees about the data returned since everything within the function call is out of scope on exit unless it was dynamically allocated.
I would suggest instead taking a raw slice and simply asserting that the user has provided a sufficiently sized buffer. This allows the user to use their own buffer which they may want and provides the most flexibility.
Fixing up that previous example:
const std = @import("std");
const warn = std.debug.warn;
fn example(out: []u8, s: u8) void {
std.debug.assert(out.len >= 4);
out[0] = s;
out[1] = s + 1;
out[2] = s + 2;
out[3] = s + 3;
}
pub fn main() void {
var a: [4]u8 = undefined;
example(a[0..], 0);
warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");
var b: [4]u8 = undefined;
example(b[0..], 1);
warn("b: "); for (b) |e| { warn("{} ", e); } warn("\n");
warn("a: "); for (a) |e| { warn("{} ", e); } warn("\n");
}
If you really wanted to return fixed data you could create a new struct and wrap the array. This requires bounding the array size to the maximum needed. This is less flexible and I wouldn't implement it this way.
const std = @import("std");
const warn = std.debug.warn;
const Encoded = struct { out: [4]u8 };
fn example(s: u8) Encoded {
var a: Encoded = undefined;
a.out[0] = s;
a.out[1] = s + 1;
a.out[2] = s + 2;
a.out[3] = s + 3;
return a;
}
pub fn main() void {
const a = example(0);
warn("a: "); for (a.out) |e| { warn("{} ", e); } warn("\n");
const b = example(1);
warn("b: "); for (b.out) |e| { warn("{} ", e); } warn("\n");
warn("a: "); for (a.out) |e| { warn("{} ", e); } warn("\n");
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To return the length I'd just return it as the return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only problem with it is that this is used for encoding bytes back to a byte[] which means the size isn't known prior for example, this means that it gets a bit finicky. The way I think may be better for this unknown size thing could either be requiring them to run a function to get how many bytes the code point will be OR;
fn example(s: [4]const u8) []const u8 {
// ... do stuff with 's'
return s[0..sizeWanted];
}
Perhaps? Or just result to a full buffer based system and return the length of that byte? Giving them the opportunity to either slice it or leave it? Tricky problem this one is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I don't like the struct way; it feels icky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do this, and get a static guarantee that the buffer is a certain size:
fn example(out: &[4]u8, s: u8) void {
(*out)[0] = s;
(*out)[1] = s + 1;
(*out)[2] = s + 2;
(*out)[3] = s + 3;
}
Related: #863
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh I do like that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue with requiring an array of a specific size is that you are forcing the caller to use an array of this type. The caller will need to memcpy the bytes if they want to move the bytes into any other sized array. The static guarantees are nice, but if the data is copied anywhere else you will be using a slice interface anyway so will simply be performing the implied check later down the track.
Also, does anywhere else in std take a fixed array argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why I want #863, which will allow slice -> array pointer casts.
It is true that the check has to be performed some other place than the call, but isn't this the same as the maybe type. Giving you a static guarantee that something is true, requiring the caller to ensure the truth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose the main point I'm trying to say is that since a slice is a view into a backing array so in my view it doesn't make sense to limit the output type to an array of a specific size, when you could use a slice and take an arbitrary sized array.
Your proposal looks nice and handles exactly this scenario. For now, I'd stick with taking a slice and if down the track we get better guarantees we can update any of these occurences.
@thejoshwolfe Pinging you because I know you don't think we should do locales :). If you look at locale.zig and unicode.zig (I just made a few minor improvements) in this PR (ignoring string_utils.zig as that will disappear I'm just keeping it around as I'll reflect some of the nice options of it like finding highest index over to memory). This is kinda the intention of the scope, it is purely based around unicode and not trying to handle localisation. It can also handle things like time/number formatting of course but that's again around what I would call unicode. Most countries would fit a blanket toUpper and toLower and some would need a custom, at minimum this would be quite effective for atleast the english language as it does allow you to insert whatever type you want, though I would cut down on the formatter and make it just english. I however feel that we could cover a series of similar languages with it atleast? |
std/unicode.zig
Outdated
@@ -155,6 +156,18 @@ pub const Utf8View = struct { | |||
return initUnchecked(s); | |||
} | |||
|
|||
pub fn eql(self: &const Utf8View, other: &const Utf8View) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think mem.eql
should be used when you don't care about encoding and only about memory, while something like unicode.eql
should take into account that a "letter" can be a combination of many unicode code points, and that the order of these code points, don't always matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not realise that, I'll look into it and build a proper solution :0
std/unicode.zig
Outdated
return mem.eql(u8, self.bytes, other.bytes); | ||
} | ||
|
||
pub fn slice(self: &const Utf8View, start: usize, end: usize) !Utf8View { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that slicing a unicode string would be from start_code_point
to end_code_point
, and not just slice the memory and validate that the substring still is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then again, the string could be big so maybe this is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that check is not needed :), as said in my other response. This slice will be renamed because that is a valid point it mainly is used in my split function for example which doesn't need a concept of end points; how about sliceRaw? I'll do a proper slice one since I can just check the first bit to detect when to iterate the index though it'll be max O(N) { N := the raw point corresponding to the code point index } just to detect the range which is unavoidable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I missed your second point, it isn't always needed :). So I'll probably add it as 'sliceRawUnchecked' for example
std/mem.zig
Outdated
.split_bytes = split_bytes, | ||
}; | ||
} | ||
pub fn split(comptime viewType: type, comptime iterator_type: type, comptime baseType: type) type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the things in mem
should use these "string iterators". When I use mem.split
I only care about memory, not encoding, views, interators and all that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next
fn in here is a good reason, why mem.split
shouldn't use these iterators. Iterating over memory can't fail with an error, because the elements don't need to be validated to decoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with both points because of a few reasons :).
- We need split in a few cases (just in std); first for string splitting in windows which is Utf16 with a few weird oddities, with my example you can just create a Utf16Win view and corresponding iterator and you don't have to create a unique function for it :), we also need a generic Utf8 one for 'general use' and a raw Ascii one for memory or when you know your just dealing with bland non encoded data.
Thus if we just pretend that iterators and views don't exist it becomes combersome :).
-
You generally don't have to care about views and iterators :) as shown in the test example there exists a type under stringUtils called string_it which maps the iterators and views for you :). That is instead of calling
split("A/B/C/", "/")
you callstringUtils.string_it.init("A/B/C/", "/")
, currently 'next' returns a view however I'm also gonna add a function that returns the raw type that you specify which in the case of the string one is []const u8, this should solve your issue of views as you'll never need to touch one. Views are typically a good thing as they generalise. -
Your second statement is also false, Iterators don't EVER fail with an error :), just look at both the Utf8 Iterator and Ascii Iterator they have type ?[]u8 not !?[]u8 because of course they don't fail, HOWEVER the views can fail if you give them invalid data, I can change it to just set the data regardless however this would then ensure that the data you originally give it has to be valid which it doesn't necessarily have to be :), however technically you are right next doesn't have to be error checked since I can validate that the slice can use initUnchecked.
So overall I can make next() not return an error ever and it kinda is just due to me wanting to be 100% safe and that kinda eroded the practicality of it :). So with all that said I don't particularly get why you don't like it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only real problem is that I don't think it should be in mem
. Memory splitting and string splitting is different. When we split memory, there is no concept of valid code points, encoding and all that. Therefore I think the interface should be simpler.
I agree, that for splitting paths, the string splitting functions you provide should be used, as paths are strings. However, there are cases where you just have some memory ([]u8, []16, []32, []64) and want to split on some values. Btw, stringUtils.string_it
only works on ascii, so you can't split any arbitrary sequence of bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you could provide a memory view and iterator though, and that might work out. I really think split
should not be able to fail not next
for these though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh it won't fail for next I already finished that, currently it can only fail upon initialisation of the iterator. Also you can get it to return the raw data instead of a view.
And yes it only works for ascii that is just due to me not bothering naming it properly. I think I'll name it stringUtils.split_ascii_it, and have stringUtils.split_it for Utf8 and have stringUtils.split_win_it for the windows special Utf16 (that is super annoying why can't windows do anything right lel).
Again I completely agree next will not fail for anything after my next set of changes :).
One last thing; so you would prefer mem.split to use the old system and have the new split exist under stringUtils :)?
Really appreciate you having an in-depth look at my code by the way, your points are really valid and are helpful 😄 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. I think mem
should only concern itself with slices of data. Not a more complicated iterator/view system. Ofc, I don't have the final say, as I'm not a "designer" of Zig or it's std library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note: paths are not strings in linux. They are null-terminated arrays of bytes. You can have invalid utf-8 as a linux file path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes I'll make sure that it allows for null-terminated arrays of bytes :). Pretty sure that path sanitises it to a slice anyway but I'll double check (then converts it back at the end).
CMakeLists.txt
Outdated
@@ -452,6 +452,10 @@ set(ZIG_STD_FILES | |||
"index.zig" | |||
"io.zig" | |||
"linked_list.zig" | |||
"string_utils.zig" | |||
"representations/utf8.zig" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
representations
is an ambiguous word. it asks the question, "representing what?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh fair, I'll work on a better name :).
build.zig
Outdated
@@ -55,8 +56,8 @@ pub fn build(b: &Builder) !void { | |||
addCppLib(b, exe, cmake_binary_dir, "zig_cpp"); | |||
if (lld_include_dir.len != 0) { | |||
exe.addIncludeDir(lld_include_dir); | |||
var it = mem.split(lld_libraries, ";"); | |||
while (it.next()) |lib| { | |||
var it = try string.asciiSplit(lld_libraries, ";"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks to me that every place you replaced mem.split
with asciiSplit
is incorrect. Here we are quite intentionally splitting on utf8-encoded bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes :), mem.split only worked on ascii bytes so I've replaced it to only work with ascii then I was aiming to go through and replace it with utf8 where needed :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I want to note that the fact you can just change it to string.utf8Split(...)
and you suddenly use unicode correctly kinda shows how 'cool' it is.
std/os/path.zig
Outdated
|
||
pub const sep_windows = '\\'; | ||
pub const sep_posix = '/'; | ||
pub const sep_windows : u8 = '\\'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why add u8
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't gone through and removed little needless changes sorry.
std/os/path.zig
Outdated
} else { | ||
return joinPosix(allocator, paths); | ||
} | ||
// Currently a work around for the expansion to allow string separators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std.os.path.join
as it is currently implemented is correct, fully implemented, finished, and passes all tests. Why are we adding workarounds for anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just converting a character into a string since the new join takes a string separator since I found it rarer to want a character than a string i.e. ", " is very common. If you know of a way to convert a character into a string without this little workaround then I would be overjoyed :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, I've decided to have a function to handle that, it literally does just convert it into an array like []u8 { sep }
so functionally its the same but atleast its a bit nicer :). So you can just call string.joinCharSep()
and give a single character sep rather than a string :).
@andrewrk I don't have much else I want to do with this PR, so it comes down to whether or not you want me to keep the type invariant stuff or make it JUST utf8. I'm happy to change it over, however I still stand by my original view that its more helpful to have it then it is to remove it :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss the best way to solve this use case in #912
@@ -55,8 +56,8 @@ pub fn build(b: &Builder) !void { | |||
addCppLib(b, exe, cmake_binary_dir, "zig_cpp"); | |||
if (lld_include_dir.len != 0) { | |||
exe.addIncludeDir(lld_include_dir); | |||
var it = mem.split(lld_libraries, ";"); | |||
while (it.next()) |lib| { | |||
var it = try string.utf8Split(lld_libraries, ";"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't changed my mind about this since the last PR comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need more clarification about what specifically you dislike :). Is it the call name ? Like is it just because it is not called string.split
or is it the fact it uses generics behind the scene?
Again i'm happy to remove the type independent stuff if need be :)
@@ -309,7 +310,7 @@ const Node = union(enum) { | |||
const Toc = struct { | |||
nodes: []Node, | |||
toc: []u8, | |||
urls: std.HashMap([]const u8, Token, mem.hash_slice_u8, mem.eql_slice_u8), | |||
urls: std.HashMap([]const u8, Token, string.hashStr, string.strEql), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this hash map is hashing bytes, not strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this shows a problem with the way its done because the hash_slice_u8 for example and hashStr would be identical and that leads to a singular name meaning two different things, I guess how would you approach this because in situations where it is hashing strings not bytes it leads to this same problem? I would either declare aliases or just live with one of them.
std/mem.zig
Outdated
@@ -130,6 +130,16 @@ pub fn copy(comptime T: type, dest: []T, source: []const T) void { | |||
for (source) |s, i| dest[i] = s; | |||
} | |||
|
|||
pub fn copyRange(comptime T: type, dest: []T, destStart: usize, destEnd: usize, source: []const T, sourceStart: usize, sourceEnd: usize) void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead, callers should use slicing, and then call mem.copy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I used that but then I realised slicing is soo much easier then just slipped my mind to remove it, even when you go through all your commits it is hard to find little things like this haha.
std/mem.zig
Outdated
/// Remove values from the beginning and end of a slice. | ||
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T) []const T { | ||
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T, side: Side) []const T { | ||
// Asserting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this comment communicating?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That I forgot to remove it :P
std/mem.zig
Outdated
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T) []const T { | ||
pub fn trim(comptime T: type, slice: []const T, values_to_strip: []const T, side: Side) []const T { | ||
// Asserting | ||
assert(side == Side.LEFT or side == Side.RIGHT or side == Side.BOTH); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is guaranteed by the type system. any time a type is casted to an enum there is a runtime safety check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sweet, still learning things
std/os/child_process.zig
Outdated
@@ -63,9 +64,11 @@ pub const ChildProcess = struct { | |||
PermissionDenied, | |||
InvalidUserId, | |||
ResourceLimitReached, | |||
InvalidCharacter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should not be errors that are possible for ChildProcess.spawn
to emit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? They are possible now due to it now using Utf8? In saying that it should only return one of them, I believe InvalidCharacter, so I'll remove the other one.
const ascii = std.string.ascii; | ||
const utf8 = std.string.utf8; | ||
|
||
/// Returns an iterator that iterates over the slices of `buffer` that are not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these docs are incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only incorrect bit is that it references 'bytes' :)
@andrewrk I fixed all the changes requested BUT the hash situation and the split one. This is because I need clarification on those two things :). I will be happy to make the changes once I get clarification. |
I'll resolve this PR this weekend, one way or another
The stdlib string data type and unicode handling is extremely important. We need to get it right. And how we do it is going to be affected by #130 and #770. It's not time to introduce a string type yet. This is an area where it's difficult to contribute to. Thank you for putting effort into this change; you've given the zig project some things to think about. For now, zig's stance on strings is that we deal with encoded UTF-8 bytes. The standard library does not yet have any concept of a string, although it provides some unicode encoding and decoding facilities. When zig becomes more mature, we will solve this problem. As always, if someone is working on a project using zig and runs into a use case where this would have solved the problem, please open an issue and we can figure out short-term and long-term solutions. |
From the OP:
These sound valuable regardless of the rest of this PR. You could probably make a PR for each one. |
Fair enough, thanks for having a look :). @thejoshwolfe I'll do a smaller PR with just a few of the key changes like those :) (as well as ones like codePointSequenceLength, as well as like allowing to trim from only one side and so on). |
What is this PR
See here #912, this explains it in a deeper level.
What EXACTLY is in this PR
utf8.View
rather thanutf8View
should just make the code within it easier to understand :), this is similar to how its been done in MathWhat this PR ISN'T
This PR is not going to be ANYTHING to do with LOCALE.
Furthermore this PR isn't going to edit any code to use Unicode string functions as that'll push it away from its focused use, it'll still add tests of course and I presume any new code will use unicode but any old code should continue to use Ascii (and will merge over to ascii views) as for example I don't know if Path.zig just pretends the bytes it gets are ascii but really are unicode or if it reports an error if you give it a unicode or whatever, and due to the size of path.zig along with the fact that Windows uses a special Utf16 version that should be a separate PR :).
Checklist
Short Explanation of Views/Iterators
Views are just a way to view a memory in two ways simultaneously, both as raw data and as what it represents for example utf8 you can index the raw data or the code points it represents (which are u32 instead of u8 for example), you can of course do other actions like comparing views (if same type of course) and slicing/indexing both code points and raw data. The only function that can return an error is
view.init
which should perform a validity check on the data (there is also an initUnchecked function) as when indexing data there is no reason why it should be invalid.Iterators just allow you to cycle through a view, they allow you to obtain the 'nextBytes' or the 'nextCodepoint' representing either the next set of bytes or the actual code point for some iterators they are the same thing for example Ascii which will always return a byte array of size 1 for nextBytes (though of course it'll return it as an array rather than an individual element to fit the spec of iterators), for that reason you should commonly use nextCodepoint as that is more efficient in cases like Ascii and the reason you are using iterators is to get the functionality of turning the raw unicode data into code points :). There are no error throwing functions in iterators, this is to make your loops nicer and your code much nicer :).