Unicode to_lower() and to_upper #9363

Closed
aaronlaursen opened this issue Sep 20, 2013 · 9 comments
Labels
A-Unicode Area: Unicode

Comments

@aaronlaursen
Contributor

As mentioned by Josh Aas on IRC, Rust doesn't seem to have a function to convert Unicode strings to their uppercase and lowercase equivalents, analogous to to_lower() and to_upper() for ASCII.

@aaronlaursen
Contributor Author

I'm interested in working on this.

At the moment, the best option seems to be changing /src/etc/unicode.py to also read in CaseFolding.txt and add a corresponding table to /libstd/Unicode.rs. Then to_lower() and to_upper() can be added to /libstd/char.rs, mirroring their ASCII counterparts.

Does anyone know of a better way to do this, a reason this shouldn't be done, or a better place to put this?
At some point there was mention of doing proper ICU bindings, but lowercase/uppercase conversion seems fundamental enough to be in the core.

Edit: It looks like this data can also be parsed out of UnicodeData.txt, so CaseFolding.txt is unnecessary.
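
A minimal sketch of the UnicodeData.txt approach, written in present-day Rust rather than the 2013 dialect. It assumes the standard semicolon-separated layout, in which field 12 holds the simple uppercase mapping and field 13 the simple lowercase mapping. The names SimpleCase, parse_hex_char and parse_line are made up for illustration and are not taken from unicode.py or libstd.

```rust
/// Simple (one-to-one) case mappings parsed from a single UnicodeData.txt line.
#[derive(Debug, PartialEq)]
struct SimpleCase {
    ch: char,
    upper: Option<char>,
    lower: Option<char>,
}

/// Parses a hexadecimal code point field; an empty field means "no mapping".
fn parse_hex_char(field: &str) -> Option<char> {
    if field.is_empty() {
        return None;
    }
    u32::from_str_radix(field, 16).ok().and_then(char::from_u32)
}

/// Extracts the code point and its simple upper/lower mappings from one line.
/// Returns None for lines that do not have the expected 15 fields.
fn parse_line(line: &str) -> Option<SimpleCase> {
    let fields: Vec<&str> = line.trim_end().split(';').collect();
    if fields.len() < 15 {
        return None;
    }
    Some(SimpleCase {
        ch: parse_hex_char(fields[0])?,
        upper: parse_hex_char(fields[12]), // Simple_Uppercase_Mapping
        lower: parse_hex_char(fields[13]), // Simple_Lowercase_Mapping
    })
}
```

Feeding every line of the file through parse_line and keeping the entries where upper or lower is Some would give the raw material for a lookup table. Multi-character mappings are not covered by these fields.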

@Kimundi
Member

Kimundi commented Sep 20, 2013

@aaronlaursen The problem is that case folding depends on the locale. For example, in the Turkish locale i and I don't map to each other, so the API decisions get bigger than just adding a new table.

Personally, I think we should support at least a best-effort default mapping, but that might not be the general consensus.
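
To make the Turkish example concrete, here is a small sketch in present-day Rust. The helper to_upper_turkish is hypothetical and only special-cases the dotted and dotless i; everything else falls through to the locale-independent default mapping, which is the kind of best-effort behaviour discussed above.

```rust
/// Hypothetical helper: uppercases a string using the Turkish rules for
/// dotted and dotless i, and the default mapping for every other character.
fn to_upper_turkish(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Dotted lowercase i maps to capital I with dot above (U+0130).
            'i' => out.push('\u{0130}'),
            // Dotless lowercase i (U+0131) maps to plain capital I.
            '\u{0131}' => out.push('I'),
            // Everything else uses the locale-independent default mapping.
            _ => out.extend(c.to_uppercase()),
        }
    }
    out
}
```

A default, locale-free mapping turns "i" into "I", which is wrong for Turkish text but right for most other languages, so a locale-aware variant would have to be a separate entry point.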

@aaronlaursen
Contributor Author

@Kimundi Good point. I hadn't thought of the locale issues...

CaseFolding.txt does contain these locale-dependent mappings (at least for the Turkish locale), but using them would require a more complicated system for locale detection, etc.

I still share your desire for a default mapping. UnicodeData.txt has only a single mapping built in and could serve as a sane default...

I'm not sure how to go about finding the general consensus on these things...
I'll start a branch, and I guess I can sit on the changes until it becomes clear which direction to go.

@pzol added the A-unicode label Feb 26, 2014
@pzol
Contributor

pzol commented Feb 26, 2014

Related #9084 #12561
I would suggest implementing this for char:

fn to_case_fold(c: char) -> &'static [char];

in the spirit of http://docs.python.org/dev/library/stdtypes.html#str.casefold

@emberian
Member

This should live in a libunicode, so as not to bloat libstd further with Unicode tables.

@pzol
Contributor

pzol commented Feb 26, 2014

Shouldn't we move all Unicode tables to libunicode then?

@emberian
Member

Ideally, yes.


@huonw
Member

huonw commented Feb 26, 2014

(It's probably more useful/efficient for case folding to yield &'static str.)
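
A rough sketch of what a fold that yields &'static str could look like, in present-day Rust. The three table entries are full case foldings as listed in CaseFolding.txt. The table itself, FOLD_TABLE, and the case_fold helper are hypothetical, and falling back to to_lowercase() for characters outside the table is an approximation, not real case folding.

```rust
/// Tiny illustrative subset of multi-character case foldings,
/// sorted by code point so it can be binary-searched.
static FOLD_TABLE: &[(char, &str)] = &[
    ('\u{00DF}', "ss"),        // sharp s
    ('\u{0130}', "i\u{0307}"), // capital I with dot above
    ('\u{FB01}', "fi"),        // fi ligature
];

/// Returns the folded form of `c` if the table has an entry for it.
fn to_case_fold(c: char) -> Option<&'static str> {
    FOLD_TABLE
        .binary_search_by_key(&c, |&(k, _)| k)
        .ok()
        .map(|i| FOLD_TABLE[i].1)
}

/// Folds a whole string: table hits are copied as-is, everything else
/// is lowercased as a stand-in for the missing table entries.
fn case_fold(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match to_case_fold(c) {
            Some(folded) => out.push_str(folded),
            None => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

Returning a string slice means a table hit can be appended with a single push_str, with no per-character re-encoding, which is one way to read the efficiency argument above.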

@huonw
Member

huonw commented May 31, 2014

This is now implemented as to_uppercase (and to_lowercase).
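
For readers on present-day Rust, a short sketch of how these methods behave. Note that the signatures have changed since 2014: char::to_uppercase and char::to_lowercase now return iterators, because one character can map to several. The helper names upper_char and lower_char are made up for illustration.

```rust
/// Collects the (possibly multi-character) uppercase form of one char.
fn upper_char(c: char) -> String {
    c.to_uppercase().collect()
}

/// Collects the (possibly multi-character) lowercase form of one char.
fn lower_char(c: char) -> String {
    c.to_lowercase().collect()
}

/// Uppercases and lowercases a whole string with the default,
/// locale-independent mappings.
fn both_cases(s: &str) -> (String, String) {
    (s.to_uppercase(), s.to_lowercase())
}
```

These mappings are not locale-aware, so the Turkish case raised earlier in the thread is not handled by them.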

@huonw closed this as completed May 31, 2014

5 participants