Locale in Std #893
The industry-standard implementation of internationalization, localization, Unicode data, etc. is ICU (used by Chrome and Firefox). Maintaining locale-specific data for German or any other language (except the subset of English that is relevant to programming) seems out of scope and too hard for the Zig standard library. ICU is a big project, and interfacing with it is not trivial. Someone should look into what it's like to use that library from a Zig app. Hopefully it will be straightforward and not involve C++ or uncontrollable panics or any nonsense like that. If ICU appears to be unsatisfactory for Zig use, I would recommend making contributions there to improve the situation rather than writing a competitor.
I think there would be objections to having the Zig standard library rely on another library? For example, Python has its own standard-library Unicode implementation. The biggest issue I see is how we keep up with ICU's versions, and how it affects installation: to compile Zig, would we require it as a dependency, breaking the idea that Zig compiles dependency-free, or would this be a separate non-std library?

I think adding an interface to ICU would be fine, but some basic locale-specific data is important as well. For example, in Python a lot of people use PyICU since it is more in-depth than the std library, but in most cases the generic one is 'good enough' for any Unicode handling that isn't "I need to localise this as something else". Locale isn't intended to act as a localisation system; it is merely meant to be a way to obtain subsets of the Unicode data for use in things like time formatting, 'IsUpper', and other string functions.

Keeping in mind that we are just maintaining things like whitespace, upper/lower-case letters, and numbers, I don't see how the std library can't maintain this, or how this is 'too hard'. The majority of this could exist in data files that are loaded at compile time; a lot of these files are located online, and the build could even download them from a source at compile time, meaning the maintenance effort effectively goes to zero (that is, if they contain all the data we want). This isn't aimed at being a competitor. Also, ICU suffers from only supporting char* and not []u8, which hurts us, I would say :).
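The data-file idea mentioned above is concrete: the Unicode Character Database ships `UnicodeData.txt`, a semicolon-separated file whose fields include the simple case mappings, and it could be parsed at build time. A rough Python sketch (not from the thread; the line shown is the real UCD record for U+00E9, and field 12 is the simple uppercase mapping per the UCD file format):

```python
# One real record from UnicodeData.txt (U+00E9, "é").
# Field 0 is the code point; field 12 is the simple uppercase mapping.
line = "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9"

fields = line.split(";")
codepoint = int(fields[0], 16)
# An empty uppercase field means the character maps to itself.
upper = int(fields[12], 16) if fields[12] else codepoint

print(chr(codepoint), "->", chr(upper))  # é -> É
```

A build step doing this over the whole file would yield exactly the kind of "subset of the Unicode data" tables being proposed, without hand-maintaining them.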
I want to note that this is mainly for implementing things like toUppercase and toLowercase, or things like the thousands separator in number formatting. I can't think of a modern language (from the last 10 years) that doesn't have its own locale system, without requiring external libraries.
We should certainly not include ICU in the Zig standard library. I'm also advocating against any kind of "good enough" solution in the standard library. Unicode support is very difficult, and the difficulty comes from a mess of subtle corner cases that a non-expert would never catch. In order to not be wrong, you have to go all the way and do it right. And for that, there's ICU.
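Not from the original thread, but the kind of corner case meant here is easy to demonstrate with Python's built-in `str` methods, which implement Unicode's default, locale-independent case mappings:

```python
# German ß uppercases to two characters: case mapping can change string length.
assert "ß".upper() == "SS"

# Turkish needs i -> İ, but the default locale-blind mapping gives I,
# which is wrong for Turkish text.
assert "i".upper() == "I"

# U+0130 (İ) lowercases to "i" plus a combining dot above:
# one code point becomes two.
assert len("İ".lower()) == 2

# Case mapping doesn't round-trip: Greek final sigma ς uppercases to Σ,
# which then lowercases to the non-final form σ.
assert "ς".upper().lower() == "σ"
```

A naive 1:1 byte or code-point mapping table gets every one of these wrong.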
In the Zen of Zig, corner cases matter, and a "good enough" solution is contrary to this principle. There could, however, be something that is limited enough in scope that a complete and correct implementation is maintainable and valuable. For example, hex literals in Zig (and many other languages) can use upper or lowercase letters A-F or a-f, and this is really only talking about English. The English language comes up in some programming contexts, and that can be a well defined and useful set of features to support.
Implementing those corner cases isn't hard, since in most cases you can just map 1:1, as in English/Turkish (i:İ for lower→upper)? In the cases where it is more complicated, it is merely some small analysis of surrounding letters, as you said. I can't see a case that can't be solved with a small amount of effort. In most cases a generic 'lowerToUpper map' function will be fine, but in others, such as Greek, you would need a custom one, which, as I've said, I don't see how it is out of scope. Of course I'm not talking about anything overly complex; we are still talking about a limited-scope solution covering everything I've stated previously :).

Building a solution that just handles English is not good; its usefulness is often actually negative in terms of usability, because whenever you want to move your application from English to a greater range of languages, you now have to change your API too, so we have really just provided an annoyance. This is like how C handles strings, just pretending Unicode doesn't exist, which causes so many problems.

My solution would just grant a few basic functions like those stated previously, but the big benefit is the ability for the user to define more. The user can define an English system that uses u17 if they want, or u42, and as long as they have views and iterators for those types, it would work seamlessly; they don't have to define their own ranges for things like letters and numbers, and they can use the benefits of most of the Unicode machinery despite having their own types :). This is the benefit, this is the reason I think Zig should have it, and this is in line with the Zen of Zig, if you will: it gives the user the ability to do what they want to do :). Have a look here; maybe this will get my point across better?
What's the use case here? Why do you want to convert between uppercase and lowercase?
Purely an example; it could be many things, such as splitting a string or checking whether characters are within certain sets. String manipulation exists in almost every program, I would say; I've rarely had one where I didn't use something like joining sets of data with a certain string, checking if characters were uppercase, converting between cases, or checking if whitespace existed. And especially things like outputting time, or outputting numbers in either the 1,000 or 1.000 format depending on locale :). So I guess the question I should pose is: what isn't the use case? C#'s locale system is a great example of one :), and so is Python's.
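The 1,000-versus-1.000 point is just a data lookup once the grouping logic exists; a minimal Python sketch (the function name and its `sep` parameter are hypothetical, standing in for whatever a locale table would supply):

```python
def group_thousands(n: int, sep: str) -> str:
    """Insert a grouping separator every three digits; sep comes from
    locale data. Illustrative only: real locales also vary group sizes
    (e.g. the Indian numbering system's 1,00,000)."""
    digits = str(abs(n))
    groups = []
    while digits:
        groups.append(digits[-3:])
        digits = digits[:-3]
    body = sep.join(reversed(groups))
    return "-" + body if n < 0 else body

print(group_thousands(1234567, ","))  # en-style: 1,234,567
print(group_thousands(1234567, "."))  # de-style: 1.234.567
```

The algorithm is locale-independent; only the separator (and, in a fuller version, the group sizes) is locale data, which is the split between code and data the proposal is driving at.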
You've clearly never written a device driver :)
I did say rarely, not never :P. So I have to install ICU just to output time and numbers nicely? Something that isn't overly complex (it ain't no device driver), and I'm adding an external dependency and relying on it for user output? I don't know, I just don't 'like that answer' haha. Honestly, if I found out that a language's solution to 'do this' is to go get some other library and try to interface with it through an unofficial and possibly broken/non-existent binding, I would just go: yep, not using this anymore. It won't be a nice interface, due to the difference in types, and I still don't see how the massive issue of these two different Unicode systems is answered by using ICU.

Why even have a Unicode system at all? Just use ICU if you need it; most won't. Why even have crypto functions? Most won't need them; just use library X, which does it slightly faster/better/whatever than us... and so on. Why even have a std library at all? (Haha, okay, a little hyperbolic, but it still carries my point.) And aren't these modules edge cases in themselves? (Care about edge cases :P.)

Basically, I view a std library as such: if a user is constantly using a library or writing a function, shouldn't it be in std? Isn't that the point, giving the user a set of nicely done functions so they don't have to write them every time? I remember someone saying something like "a language's syntax and functionality is what draws people in; its standard library makes them stay". I don't see how ignoring Unicode would be a benefit, and you effectively are ignoring it by saying that to use it in any functional way, beyond just getting text and outputting it unmanipulated (and in most cases not even outputting it correctly), you need another library. Some languages (Python 3) even go so far as to force you to use Unicode, and going against the grain requires significant effort :).
I can answer that. ICU has a C API and a pass-the-buck approach to error handling; no panics or aborts. Apropos locales: ICU is to a large extent data-driven. We could use its data files as a stepping stone for a Zig locale library. There's also a lot in ICU that can be omitted to keep things small.
@bnoordhuis Yes, exactly what my aim is in my PR :). Though I can support different encodings really easily, as it is just plug and play. As stated previously, as long as the data files contain the data we want, this works. Overall, I really am passionate about making this work :); I've been burnt by one too many bad Unicode implementations :).
I think we need a clear use case for variable-sized characters in strings. I know some Windows APIs require UTF-16 or some other form of "wide char" strings, but if interfacing with certain APIs is the only case for non-UTF-8 strings, I'm not sure we have a compelling case. You can always have a translation layer for APIs with strange data types. Why would we want, for example, to join/split "wide char" strings? Note that the current join/split functions in the std lib work correctly for UTF-8 strings already.
We need a clear, detailed, focused proposal for me to label this as a proposal, and then evaluate it for acceptance or rejection. I don't want your work to be wasted, and for something important like this topic, we need to get on the same page. Here is what I suggest: Choose the smallest yet meaningful thing you can, and make a very concrete proposal. Explain the use cases for your proposed changes. Show some example code. And then let's talk about that. You've produced a lot of discussion and a lot of code, but we need to focus your energy for it to be harnessed by this project.
@thejoshwolfe Split does NOT work with UTF-8 strings :). It currently only works with ASCII strings; maybe give it another look. The reason why is this loop:

```zig
for (self.split_bytes) |split_byte| {
    if (byte == split_byte) { // a single matching byte triggers a split :)
        return true;
    }
}
```

So my version is currently the ONLY one that is UTF-8 compatible. Also, for example, os/path.zig already needs to do a lot of combining and splitting of UTF-16 (the weird Windows flavour), so that's a good use case :).
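The failure mode being claimed here is easy to reproduce; below is a Python re-creation of the quoted per-byte comparison (not Zig's actual code), splitting wherever any single byte matches any byte of a multi-byte delimiter:

```python
text = "zürich-éclair".encode("utf-8")
delimiter_bytes = set("é".encode("utf-8"))  # {0xC3, 0xA9}

# Mirror the quoted loop: a position splits if its one byte
# equals any delimiter byte.
pieces, current = [], bytearray()
for b in text:
    if b in delimiter_bytes:
        pieces.append(bytes(current))
        current = bytearray()
    else:
        current.append(b)
pieces.append(bytes(current))

# "ü" is 0xC3 0xBC; its lead byte 0xC3 also appears in "é" (0xC3 0xA9),
# so the split fires in the middle of "ü", leaving an invalid UTF-8
# fragment that starts with a bare continuation byte.
print(pieces)  # [b'z', b'\xbcrich-', b'', b'clair']
```

Note, as the follow-up comments point out, this only demonstrates the problem for a *set of single delimiter bytes*; matching a complete delimiter byte sequence behaves differently.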
@andrewrk That is perfectly understandable; I wanted more to gather opinions with this issue than to propose any solution. I'll work on a nice proposal :).
I think @thejoshwolfe was trying to say that split currently works correctly on utf8 haystacks split by bytes. It won't work splitting against utf8 as you say, although this would probably require a different signature anyway to be explicit about whether one was wanting to split on graphemes/codepoints etc.
Ahhh yes; well, it would be a bit weird not to split on code points, and I think that should be the default behaviour? Maybe not for mem.split, but for others :)
I stand corrected. If there were a split function that takes a delimiter string, then as long as both input strings are valid utf8, then the outputs will be valid utf8 as well. |
No locale in standard library. Locale will have to be solved with a third party package. |
Felt this should be an issue, as it is relevant enough to be discussed :).
Currently, as shown in my PR #891, I'm doing locale as a duck-typed system, with the type of characters being one of the possibilities and the others being function pointers. This allows you, for example, to create a locale that handles `[]const u32`, `[]const u8`, or really any type, as long as you supply a way to view it and a way to iterate it (effectively, the view is used to get the iterator). This does produce a bit of 'useless' boilerplate code for u8, but allows it to use the unicode views effortlessly; you can just do `CreateLocale(u8, unicode.Utf8View, unicode.Utf8Iterator)` and tada, you have a locale that uses unicode.

This effectively means we can provide a locale basis in multiple formats, since for example interacting with windows sometimes requires u16 and sometimes u8 (it is ugly). It would also support C-style strings. Overall, I hope I'm not over-engineering, but effectively you would then, from these locales, provide for example all the letters in the alphabet for certain functions (as shown in my PR). This means we could have a locale just for German, one just for English, and one that contains everything? This would also bleed a little into time later on, and even into things like number formatting; we can get the default locale based on the system, of course.
Note: this isn't changing the unicode system merely a way to show subsets of the unicode data!
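A rough Python analogue of the duck-typed idea (the names `make_locale`, `iter_codepoints`, and `uppercase_of` are hypothetical illustrations, not the PR's API): a locale bundles an encoding-specific iterator with locale data, and the same operations then work over any string representation that supplies an iterator.

```python
def make_locale(iter_codepoints, uppercase_of):
    """Bundle an encoding-specific code-point iterator with locale data.
    iter_codepoints: yields code points from whatever string type is used.
    uppercase_of: dict of simple uppercase mappings (the locale data)."""
    def to_upper(s):
        return [uppercase_of.get(cp, cp) for cp in iter_codepoints(s)]
    return {"to_upper": to_upper}

LATIN_UPPER = {ord("a"): ord("A"), ord("é"): ord("É")}

# A UTF-8 "view" for []u8-style byte strings.
utf8_locale = make_locale(
    iter_codepoints=lambda b: (ord(c) for c in b.decode("utf-8")),
    uppercase_of=LATIN_UPPER,
)

# A wide "view" for u32-style code-point sequences; same locale data,
# different iteration strategy.
u32_locale = make_locale(iter_codepoints=iter, uppercase_of=LATIN_UPPER)

assert utf8_locale["to_upper"]("aé".encode("utf-8")) == [ord("A"), ord("É")]
assert u32_locale["to_upper"]([ord("a"), ord("é")]) == [ord("A"), ord("É")]
```

The design point this sketches is that the locale data is written once, while each character type only contributes its view/iterator pair.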