Locale in Std #893

Closed
BraedonWooding opened this issue Apr 5, 2018 · 19 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Comments

@BraedonWooding
Contributor

BraedonWooding commented Apr 5, 2018

I felt this should be an issue, since it deserves discussion :).

Currently, as shown in my PR #891, I'm implementing locale as a duck-typed system, with the character type being one of the parameters and the others being function pointers. This lets you create a locale that works on []const u32, on []const u8, or really on any type, as long as you supply a way to view it and a way to iterate it (effectively, the view is used to get the iterator). This does produce a bit of 'useless' boilerplate code for u8, but it lets the locale use the unicode views effortlessly: you can just write CreateLocale(u8, unicode.Utf8View, unicode.Utf8Iterator) and, ta-da, you have a locale that uses unicode. This means we can provide a locale basis in multiple formats, which matters because, for example, interacting with Windows sometimes requires u16 and sometimes u8 (it is ugly). It would also support C-style strings.

Overall, I hope I'm not over-engineering, but the idea is that these locales would then provide, for example, all the letters of the alphabet for certain functions (as shown in my PR). This means we could have a locale just for German, one just for English, and one that contains everything. This would also bleed a little into time handling later on, and even into things like number formatting. We can get the default locale from the system, of course.

Note: this isn't changing the unicode system merely a way to show subsets of the unicode data!

@thejoshwolfe
Contributor

thejoshwolfe commented Apr 5, 2018

The industry standard implementation of internationalization, localization, unicode data, etc. is ICU (used by Chrome and Firefox). Maintaining locale-specific data for German or any other language (except the subset of English that is relevant to programming) seems out of scope and too hard for the Zig standard library.

ICU is a big project, and interfacing with it is not trivial. Someone should look into what it's like to use that library from a Zig app. Hopefully it will be straightforward and not involve C++ or uncontrollable panics or any nonsense like that.

If ICU appears to be unsatisfactory for Zig use, I would recommend making contributions there to improve the situation rather than writing a competitor.

@BraedonWooding
Contributor Author

BraedonWooding commented Apr 5, 2018

I think there would be objections to having the Zig standard library rely on another library? For example, Python has its own unicode support in its standard library. The biggest issues I see are how we keep up with their versions, and how it affects installation. Would compiling Zig require it as a dependency, which breaks the idea that Zig is meant to compile dependency-free, or would this be a separate, non-std library?

I think adding an interface to ICU would be fine, but I think some basic locale-specific data is important as well? For example, in Python a lot of people use PyICU since it is more in-depth than the std library; however, in most cases the generic one is 'good enough' for most unicode work that isn't "I need to localise this as something else".

Locale isn't intended to act as a localisation system; it is merely meant to act as a way to obtain subsets of the unicode system for use in things like Time, 'IsUpper' and other string functions. Keeping in mind that we are just maintaining things like whitespace, upper/lower case letters, and numbers, I don't see how the std library can't maintain this or how this is 'too hard'? The majority of this could live in configuration files that are loaded at compile time. A lot of these files are available online, and they could even be downloaded from a source at compile time, meaning that the maintenance effort effectively goes towards 0 (that is, if they contain all the data we want).

This isn't aimed towards being a competitor, also ICU suffers from only supporting char* and not []u8 which hurts us I would say :).

Edit:
Oops accidentally closed issue whoops.

Want to note: that this is mainly for implementation of things like toUppercase and toLowercase for example or for things like the thousand separator in number formatting. I can't think of a modern language (last 10yrs) that hasn't had its own locale system without requiring external libraries.

@thejoshwolfe
Contributor

We should certainly not include ICU in the Zig standard library. I'm also advocating against any kind of "good enough" solution in the standard library.

Unicode support is very difficult, and the difficulty comes from a mess of subtle corner cases that a non-expert would never catch. In order to not be wrong, you gotta go all the way and do it right. And for that, there's ICU.

Here's a taste of some corner cases:

  • in English, "i" to uppercase is "I" and "I" to lowercase is "i".
  • in Turkish "i" to uppercase is "İ" and "I" to lowercase is "ı".
  • in Greek, "Σ" to lowercase is either "σ" or "ς" depending on the surrounding letters.

In the Zen of Zig, corner cases matter, and a "good enough" solution is contrary to this principle.
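
To make the first two corner cases concrete, here is a minimal sketch showing that a case-mapping function cannot be correct without knowing the language. The names `Lang` and `toUpperCodepoint` are hypothetical and not part of Zig's standard library. The function deliberately covers only ASCII letters and the two Turkish letters from the list, and it does not attempt the Greek sigma case, which needs surrounding context rather than a single code point.

```zig
const std = @import("std");

// Hypothetical names for illustration only; not a std API.
const Lang = enum { english, turkish };

/// Uppercases a single code point. Handles only ASCII a-z plus the two
/// Turkish-specific letters; every other code point is returned unchanged.
fn toUpperCodepoint(cp: u21, lang: Lang) u21 {
    if (lang == .turkish and cp == 'i') return 0x0130; // 'i' -> U+0130 (dotted capital I)
    if (cp == 0x0131) return 'I'; // U+0131 (dotless small i) -> 'I'
    if (cp >= 'a' and cp <= 'z') return cp - ('a' - 'A');
    return cp;
}

test "the same input uppercases differently per language" {
    try std.testing.expectEqual(@as(u21, 'I'), toUpperCodepoint('i', .english));
    try std.testing.expectEqual(@as(u21, 0x0130), toUpperCodepoint('i', .turkish));
}
```

The extra `lang` parameter is the point: any API that omits it has already picked one language's answer.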

There could, however, be something that is limited enough in scope that a complete and correct implementation is maintainable and valuable. For example, hex literals in Zig (and many other languages) can use upper or lowercase letters A-F or a-f, and this is really only talking about English. The English language comes up in some programming contexts, and that can be a well defined and useful set of features to support.
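
As a sketch of what such a limited, fully specified scope could look like, the functions below handle only ASCII letters and hex digits. The names `asciiToUpper` and `hexDigitValue` are made up for this example and are not claimed to match anything in the standard library.

```zig
const std = @import("std");

/// Uppercases only ASCII a-z; every other byte is returned unchanged.
fn asciiToUpper(c: u8) u8 {
    return if (c >= 'a' and c <= 'z') c - ('a' - 'A') else c;
}

/// Numeric value of a hex digit in either case, or null for any other byte.
fn hexDigitValue(c: u8) ?u8 {
    return switch (asciiToUpper(c)) {
        '0'...'9' => c - '0',
        'A'...'F' => asciiToUpper(c) - 'A' + 10,
        else => null,
    };
}

test "hex digits accept both cases" {
    try std.testing.expectEqual(@as(?u8, 10), hexDigitValue('a'));
    try std.testing.expectEqual(@as(?u8, 15), hexDigitValue('F'));
    try std.testing.expect(hexDigitValue('g') == null);
}
```

Because the input domain is a closed set of ASCII bytes, there are no locale-dependent corner cases left to get wrong.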

@BraedonWooding
Contributor Author

BraedonWooding commented Apr 5, 2018

  1. I agree but the problem is we need a solution for manipulating unicode that is 'official' of sorts :).
  2. By 'good enough' I don't mean that it is unsatisfactory, but that it is a limited-scope solution that solves some of the problems but not all; that is, it is 'good enough' for most cases, which is what you are advocating towards the end.

Implementing those corner cases isn't hard, since in most cases you can just map 1:1, like in English/Turkish (i:İ for low->up)? In the cases where it is more complicated, it is merely some small analysis of the surrounding letters, as you said? I can't see a case that can't be solved with a small amount of effort. In most cases a generic 'lowerToUpper map' function will be fine, but in others, such as Greek, you would need a custom one, and as I've said I don't see how that is out of scope?

Of course, I'm not talking about anything overly complex; we are still talking about a limited-scope solution for everything I've stated previously :). Building a solution that only handles English is not good, since its usefulness is often actually negative: whenever you want to move your application from English-only to accepting a greater range of languages, you now have to change your API too, so we have really just provided an annoyance. This is like how C handles strings by just pretending unicode doesn't exist, which causes sooo many problems.

My solution would just provide a few basic functions like those stated previously, but the big benefit is the ability for the user to define more. The user can define an English system that uses u17 if they want, or u42, and as long as they have views and iterators for those types it would work seamlessly. They don't have to define their own ranges for things like letters and numbers, and they can use the benefits of most unicode systems despite having their own types :). This is the benefit, and this is the reason I think Zig should have it. It is in line with the Zen of Zig, if you like: it gives the user the ability to do what they want to do :).

Have a look here; maybe this will get my point across better?

@thejoshwolfe
Contributor

What's the usecase here? Why do you want to convert between uppercase and lowercase?

@BraedonWooding
Contributor Author

BraedonWooding commented Apr 5, 2018

Purely an example, could be many things such as splitting a string, or checking if characters are within certain sets. String manipulation exists in almost every program I would say, I've rarely had one where I didn't use something like joining sets of data with a certain string, checking if characters were uppercase, converting between cases, checking if whitespace existed. And especially things like outputting time or outputting numbers in either 1,000 or 1.000 format for example dependent on locale :).

So I guess the question I should pose is what isn't the usecase?

C#'s Locale system is a great example of one :), so is Python's.

@thejoshwolfe
Contributor

I've rarely had one where I didn't use something like joining sets of data with a certain string, ...

You've clearly never written a device driver :)

  • joining sets of data with a certain string: sounds like a job for Buffer
  • checking if characters were uppercase, converting between cases, checking if whitespace existed, outputting time, outputting numbers in either 1,000 or 1.000 format: use ICU.

@BraedonWooding
Contributor Author

BraedonWooding commented Apr 5, 2018

I did say rarely not use them :P.

So I have to install ICU just to output time and numbers nicely? Something that isn't overly complex (it ain't no device driver) and I'm adding an external dependency and relying on it for user output? I don't know I just don't 'like that answer' haha.

Cause honestly, if I found out that a language's solution to 'do this' is to go get some other library and try to interface with it through an unofficial and possibly broken/non-existent file, I would just go: yep, not using this anymore. It won't be a nice interface due to the difference in types, and I still don't see how the massive issue of these two different unicode systems is answered by using ICU? Like, why even have a unicode system; just use ICU if you need it, most won't. Why even have crypto functions; most won't need them, just use X library, which does it slightly faster/better/whatever than us... and so on. Why even have a std library at all (haha, okay, a little hyperbolic, but it still carries my point). And aren't these modules edge cases in themselves? (Care about edge cases :P).

Basically, I view a std like this: if a user is constantly using a library or creating a function, then shouldn't it be in std? Isn't that the point: giving the user a list of nicely done functions so they don't have to do it every time? I remember someone saying something like "a language's syntax and functionality is what draws people in, its standard library makes them stay". I don't see how ignoring unicode would be a benefit (which you effectively are, by saying that you need ICU if you want to use it in any functional way beyond getting text and outputting it without manipulating it in any way, and in most cases not even outputting it correctly). Some languages (python3) even go so far as to force you to use unicode, and going against the grain requires significant effort :).

@bnoordhuis
Contributor

Someone should look into what it's like to use that library from a Zig app. Hopefully it will be straight forward and not involve C++ or uncontrollable panics or any nonsense like that.

I can answer that. ICU has a C API and a pass-the-buck approach to error handling, no panics or aborts.

Apropos locales: ICU is to a large extent data-driven. We could use its data files as a stepping stone for a zig locale library.

There's a lot in ICU that can be omitted to keep things small. Just to name a few:

  • zig uses utf-8 so all the different character encoding conversion tables can probably be dropped
  • break iteration (word wrapping) takes up a lot of space but is often pretty niche
  • ditto gender- and pluralization rules
  • unicode normalization
  • etc., etc.

@BraedonWooding
Contributor Author

@bnoordhuis Yes, that is exactly my aim in my PR :). Though I can support different encodings really easily, as it is just plug and play. As stated previously, as long as the data files contain things like timeSeparator, letters, ... it would work perfectly, and we can read them in at compile time, resulting in no text files past compilation :). This would of course increase the binary size, though I'm sure some arguments could disable the loading of this information to reduce it :).

Overall I really am passionate about making this work :), I've been burnt by one too many bad unicode implementations :).

@BraedonWooding
Contributor Author

"// TODO ASCII is wrong, we actually need full unicode support to compare paths." is one of the comments in path.zig, which shows the importance of this :). Luckily enough, I got splitting to work with any kind of 'string' (well, actually almost anything that you can give a view and iterator for), thus we can use unicode for it :)

@andrewrk andrewrk added this to the 0.4.0 milestone Apr 6, 2018
@thejoshwolfe
Contributor

I think we need a clear usecase for variable sized characters in strings. I know some windows apis require utf16 or some other form of "wide char" strings, but if interfacing with certain apis is the only case for non-utf8 strings, I'm not sure we have a compelling case. You can always have a translation layer for apis with strange data types.

Why would we want, for example, to join/split "wide char" strings? Note that the current join/split functions in the std lib work correctly for utf8 strings already.

@andrewrk
Member

@BraedonWooding

We need a clear, detailed, focused proposal for me to label this as a proposal, and then evaluate it for acceptance or rejection. I don't want your work to be wasted, and for something important like this topic, we need to get on the same page.

Here is what I suggest:

Choose the smallest yet meaningful thing you can, and make a very concrete proposal. Explain the use cases for your proposed changes. Show some example code. And then let's talk about that.

You've produced a lot of discussion and a lot of code, but we need to focus your energy for it to be harnessed by this project.

@BraedonWooding
Contributor Author

BraedonWooding commented Apr 11, 2018

@thejoshwolfe Split does NOT work with UTF-8 strings :). It currently only works with ASCII strings; maybe give it another look. The reason is these lines:

        for (self.split_bytes) |split_byte| {
            if (byte == split_byte) { // Meaning only one of the delimiter's bytes has to match for it to split :)
                return true;
            }
        }

So my version is currently the ONLY one that is UTF-8 compatible. Also, for example, os/path.zig already needs to do a lot of combining and splitting of UTF-16 (the weird Windows ones), so that's a good use case :).
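
The behaviour described above can be reproduced in isolation. This is a minimal sketch: `isSplitByte` is a hypothetical free-standing copy of the quoted loop, not the actual std function. It shows that a two-byte UTF-8 delimiter is treated as a set of two independent bytes, so a different character that merely shares its lead byte also triggers a split and leaves an invalid fragment behind.

```zig
const std = @import("std");

/// Mirrors the quoted loop: a byte counts as a separator if it equals
/// ANY byte in `split_bytes`.
fn isSplitByte(split_bytes: []const u8, byte: u8) bool {
    for (split_bytes) |split_byte| {
        if (byte == split_byte) return true;
    }
    return false;
}

test "a multi-byte delimiter matches bytes of unrelated characters" {
    const delimiter = "\xc3\xa9"; // U+00E9 (e with acute), two bytes in UTF-8
    const text = "\xc3\x83"; // U+00C3 (A with tilde), shares the lead byte 0xC3

    // The lead byte matches even though this is a different code point...
    try std.testing.expect(isSplitByte(delimiter, text[0]));
    try std.testing.expect(!isSplitByte(delimiter, text[1]));
    // ...and the byte left over after that split is not valid UTF-8 on its own.
    try std.testing.expect(!std.unicode.utf8ValidateSlice(text[1..]));
}
```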

@BraedonWooding
Contributor Author

@andrewrk, that is perfectly understandable. I wanted to gather opinions with this issue more than propose any solution. I'll work on a nice proposal :).

@tiehuis
Member

tiehuis commented Apr 11, 2018

@BraedonWooding

I think @thejoshwolfe was trying to say that split currently works correctly on utf8 haystacks split by bytes. It won't work splitting against utf8 as you say, although this would probably require a different signature anyway to be explicit about whether one was wanting to split on graphemes/codepoints etc.

@BraedonWooding
Contributor Author

Ahhh yes; well it would be a bit weird to not split on codepoints and I think that should be the default behaviour? Maybe not for mem.split but for others :)

@thejoshwolfe
Contributor

I stand corrected. split() does not behave how I thought it would, which is like Python, Java, and JavaScript string split. Zig's split is like C#'s string split, taking a set of characters rather than a delimiter string (or regex). That was surprising to find out.

If there were a split function that takes a delimiter string, then as long as both input strings are valid utf8, then the outputs will be valid utf8 as well.
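
A minimal sketch of that idea, assuming a hypothetical helper named `splitOnce` (not an existing std function) built on `std.mem.indexOf`. Because UTF-8 is self-synchronizing, a valid delimiter can only match at a code point boundary of a valid input, so both halves stay valid.

```zig
const std = @import("std");

// Hypothetical result type for this example.
const Parts = struct { head: []const u8, tail: []const u8 };

/// Splits `input` at the first occurrence of the whole `delimiter` string.
/// Returns null if the delimiter does not occur.
fn splitOnce(input: []const u8, delimiter: []const u8) ?Parts {
    const index = std.mem.indexOf(u8, input, delimiter) orelse return null;
    return Parts{
        .head = input[0..index],
        .tail = input[index + delimiter.len ..],
    };
}

test "splitting on a delimiter string keeps both halves valid UTF-8" {
    // U+00C3, then U+00E9, then 'b'. The first two share the lead byte 0xC3.
    const input = "\xc3\x83\xc3\xa9b";
    const parts = splitOnce(input, "\xc3\xa9").?;

    try std.testing.expectEqualStrings("\xc3\x83", parts.head);
    try std.testing.expectEqualStrings("b", parts.tail);
    try std.testing.expect(std.unicode.utf8ValidateSlice(parts.head));
    try std.testing.expect(std.unicode.utf8ValidateSlice(parts.tail));
}
```

Whether such a function should split on code points or graphemes is a separate question; this sketch only compares byte sequences.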

@andrewrk andrewrk added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Feb 7, 2019
@andrewrk
Member

andrewrk commented Feb 7, 2019

No locale in standard library. Locale will have to be solved with a third party package.

@andrewrk andrewrk closed this as completed Feb 7, 2019
5 participants