Include unicode escape sequence in docs on String literal #232
We do mention the encoding in the Prim docs, but it’s probably worth saying it here too. See https://pursuit.purescript.org/builtins/docs/Prim#t:String. The example you have there isn’t to do with encoding, though. That would evaluate to false regardless of the encoding, because you have a different set of code points between a and b. I consider the current behaviour of the String instances like Eq and Ord to be the most sensible option though: see purescript/purescript-strings#79 (comment) and the few comments following it.
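To make that concrete, here is a small PSCi sketch (it uses the \x code point escape this issue is asking to document, so treat that literal syntax as an assumption): "\x00E9" is the precomposed "é", while "e\x0301" is "e" followed by the combining acute accent U+0301. They render identically but are different code point sequences, so Eq answers false, and it would answer false whether the runtime stored the strings as UTF-8 or UTF-16.

> import Prelude
> "\x00E9" == "e\x0301"
false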
Interestingly, it looks like Perl6 has two different built-in types for strings -
You're right, but here's an example which demonstrates my point:
Data.String re-exports Data.String.CodePoints, effectively softly recommending it. The question, then, is: how should PureScript-the-language define how Strings work? I'd prefer it does the most intuitive thing by default and allows you to opt in to a data type which is faster on the specific backend you are targeting.
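As a hedged illustration of that soft recommendation (assuming purescript-strings v4 or later, where Data.String re-exports the code point API), the un-prefixed functions already count code points rather than UTF-16 code units:

> import Data.String as String
> String.length "🐱🐲"
2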
Yes, this is totally on purpose: you should be using Data.String.CodePoints unless you have a specific reason to use Data.String.CodeUnits. Using the functions in Data.String.CodeUnits makes it way too easy to accidentally do things like split surrogate pairs in half:

> import Data.String.CodeUnits as CU
> import Data.String.CodePoints as CP
> CU.splitAt 1 "🐱🐲"
{ after: "�🐲", before: "�" }
> CP.splitAt 1 "🐱🐲"
{ after: "🐲", before: "🐱" }
I'm not sure it's safe to say that any option is the most intuitive, or even the fastest on any given platform. It's highly context-dependent: UTF-8 is often a good choice, especially for English text, because you generally need just one byte per code point, and because (I think) it's the most common encoding on the web. However, UTF-16 can be better for other languages, e.g. East Asian languages, whose text can fit in fewer bytes: many characters which fit into two bytes when encoded as UTF-16 require three in UTF-8. Also, many programming languages' default string type (including JS's) uses UTF-16. PureScript's String is a JavaScript string on the default backend, so it inherits UTF-16 there as well.
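A hedged PSCi sketch of that last point: Data.String.CodeUnits.length reports the length of the underlying JavaScript string, i.e. the number of UTF-16 code units, so "日本語" (three BMP characters, 2 bytes each in UTF-16 but 3 bytes each in UTF-8) counts as 3, while a single astral emoji counts as 2 because it is stored as a surrogate pair:

> import Data.String.CodeUnits as CU
> CU.length "日本語"
3
> CU.length "🐱"
2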
I had to look into PureScript source code and test cases to learn how to input Unicode characters by code points like this.
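For anyone else landing here, a minimal sketch of the input syntax in question, assuming the \x escape form found in the compiler's tests (a backslash, an x, then up to six hex digits naming a code point); treat the exact rule as an assumption until the docs spell it out:

> import Prelude
> "\x2603" == "☃"
true
> "\x1F431" == "🐱"
true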
At the same time, it might also be good to define that PS strings use UTF-16 character-encoding internally, and whatever other details that entails.
Looks like Swift does String equality by code points [1], not code units like many languages do. I wonder if PureScript should treat this as a bug to fix.
[1] https://oleb.net/blog/2017/11/swift-4-strings/