Proposal
Problem statement
The char module provides char::is_uppercase() and char::is_lowercase() for determining whether a character is uppercase or lowercase, and char::to_uppercase() and char::to_lowercase() for converting to uppercase or lowercase. However, a small number of characters are titlecase, which is in between the two; the standard library provides no APIs for handling titlecase.
Motivating examples or use cases
Many software systems place restrictions on the allowed case of characters, or use case for various semantic distinctions. For example, a programming language might require local variable names to be lowercase, or constants to be uppercase. Because most characters and languages are caseless, such rules are usually best implemented by excluding particular cases rather than requiring a particular case. Titlecase characters are conceptually both partly lowercase and partly uppercase, so an API that excludes either lowercase or uppercase characters will want to exclude titlecase as well, and an API that assigns special meaning to a particular case will generally want to assign the meaning to titlecase also.
In addition, it's common to want to convert a string to titlecase, which means capitalizing the first letter of all or most words. Defining what a word is, and deciding which words should be capitalized, is complex and context-dependent, and thus unsuited for the standard library. (Notably, UAX 29 and the unicode-segmentation crate are not the end-all-be-all of determining word boundaries. For example, software identifiers, like those dealt with by heck, have a very different concept of what a word is compared to normal running text.) However, once individual words have been isolated for capitalization, the capitalization process and result are the same across all domains (disregarding locale-specific special casings that the standard library does not handle). The exact rule is defined by the Unicode Standard:
For a string X: [...]
R3 toTitlecase(X): Find the word boundaries in X [...] For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).
This algorithm is not complicated to implement, as long as the titlecase mappings are available. However, if the titlecase mappings are not available, users are far more likely to resort to an erroneous implementation using to_uppercase than to add an additional crates.io dependency or compile the data themselves.
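As a rough illustration of rule R3 applied to a single, already-isolated word, here is a minimal sketch on stable Rust. The helper names (`is_titlecase_letter`, `to_titlecase_sketch`, `is_cased_sketch`, `titlecase_word`) are hypothetical and not part of the proposal. The titlecase table is deliberately incomplete: it covers only the four Latin digraph families and falls back to `to_uppercase` elsewhere, which is wrong for characters such as 'ß'. The final-sigma rule is also ignored.

```rust
/// Hypothetical helper: true for the characters in General_Category Lt.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// Hypothetical, INCOMPLETE stand-in for the proposed `char::to_titlecase`.
/// Only the Latin digraphs are mapped; titlecase letters map to themselves;
/// everything else falls back to `to_uppercase` (incorrect for e.g. 'ß').
fn to_titlecase_sketch(c: char) -> String {
    match c {
        '\u{01C4}'..='\u{01C6}' => "\u{01C5}".to_string(), // DŽ, Dž, dž -> Dž
        '\u{01C7}'..='\u{01C9}' => "\u{01C8}".to_string(), // LJ, Lj, lj -> Lj
        '\u{01CA}'..='\u{01CC}' => "\u{01CB}".to_string(), // NJ, Nj, nj -> Nj
        '\u{01F1}'..='\u{01F3}' => "\u{01F2}".to_string(), // DZ, Dz, dz -> Dz
        _ if is_titlecase_letter(c) => c.to_string(),
        _ => c.to_uppercase().collect(),
    }
}

/// Approximation of the Cased property built from stable APIs.
fn is_cased_sketch(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase() || is_titlecase_letter(c)
}

/// Applies R3 to one word: characters before the first cased character are
/// left alone, that character is titlecased, and the rest are lowercased.
fn titlecase_word(word: &str) -> String {
    let mut out = String::with_capacity(word.len());
    let mut seen_cased = false;
    for c in word.chars() {
        if seen_cased {
            out.extend(c.to_lowercase());
        } else if is_cased_sketch(c) {
            out.push_str(&to_titlecase_sketch(c));
            seen_cased = true;
        } else {
            out.push(c);
        }
    }
    out
}
```

With this sketch, a word beginning with U+01C6 (dž) starts with U+01C5 (Dž) after conversion, whereas a `to_uppercase`-based shortcut would produce U+01C4 (DŽ), capitalizing both halves of the digraph.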
Solution sketch
Add the following to core::char:
/// Analogous to [`ToUppercase`](https://doc.rust-lang.org/core/char/struct.ToUppercase.html)
/// and [`ToLowercase`](https://doc.rust-lang.org/core/char/struct.ToLowercase.html).
#[derive(Clone, Debug)]
pub struct ToTitlecase(/*...*/);

impl Iterator for ToTitlecase {
    type Item = char;
    /* ... */
}

impl fmt::Display for ToTitlecase {
    /* ... */
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum CharCase {
    Lower = 0b00,
    Title = 0b10,
    Upper = 0b11,
}
Add the following implementations to char:
/// Whether the character is uppercase, lowercase, or titlecase.
/// `core` already includes a data table for this property internally
/// (used to implement the final-sigma casing rules),
/// so implementation is trivial.
#[must_use]
#[inline]
pub fn is_cased(self) -> bool {
    match self {
        'A'..='Z' | 'a'..='z' => true,
        '\0'..='\u{A9}' => false,
        _ => unicode::Cased(self),
    }
}

/// Whether the character is in Unicode general category Titlecase_Letter.
#[must_use]
#[inline]
pub fn is_titlecase(self) -> bool {
    match self {
        '\0'..='\u{01C4}' => false,
        _ => self.is_cased() && !self.is_lowercase() && !self.is_uppercase(),
    }
}

use core::char::CharCase;

/// The case of this character, or `None` if it is uncased.
#[must_use]
pub fn case(self) -> Option<CharCase> {
    match self {
        'A'..='Z' => Some(CharCase::Upper),
        'a'..='z' => Some(CharCase::Lower),
        '\0'..='\u{A9}' => None,
        _ if !self.is_cased() => None,
        _ if self.is_lowercase() => Some(CharCase::Lower),
        _ if self.is_uppercase() => Some(CharCase::Upper),
        _ => Some(CharCase::Title),
    }
}

use core::char::ToTitlecase;

/// The only proposed API
/// that requires adding new static data to `core::unicode`.
/// Most characters map to the same uppercase and titlecase,
/// so we would only need to store the mappings that differ.
#[must_use]
#[inline]
pub fn to_titlecase(self) -> ToTitlecase {
    ToTitlecase(CaseMappingIter::new(conversions::to_title(self)))
}
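To show how a case-classification API like this might be used for the "exclude a case" rules described in the motivation, here is a minimal sketch that approximates it with free functions on stable Rust. The names `char_case`, `is_titlecase_letter`, and `is_valid_constant_name` are hypothetical. The titlecase check hard-codes the General_Category Lt code points instead of using the internal `core` tables, so it is an illustration and not the proposed implementation.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum CharCase {
    Lower = 0b00,
    Title = 0b10,
    Upper = 0b11,
}

/// Hypothetical helper: hard-coded list of the Titlecase_Letter (Lt) characters.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// Free-function approximation of the proposed `char::case`.
pub fn char_case(c: char) -> Option<CharCase> {
    if is_titlecase_letter(c) {
        Some(CharCase::Title)
    } else if c.is_lowercase() {
        Some(CharCase::Lower)
    } else if c.is_uppercase() {
        Some(CharCase::Upper)
    } else {
        None // uncased: digits, punctuation, most scripts
    }
}

/// Example rule: a constant name may not contain lowercase or titlecase
/// characters. Uncased characters (digits, '_', caseless scripts) pass,
/// because the rule excludes cases instead of requiring uppercase.
pub fn is_valid_constant_name(name: &str) -> bool {
    name.chars()
        .all(|c| matches!(char_case(c), None | Some(CharCase::Upper)))
}
```

Because the discriminants are ordered and `Ord` is derived, `CharCase::Lower < CharCase::Title < CharCase::Upper` holds, which matches the idea of titlecase sitting between the other two.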
Alternatives
These APIs could be implemented by a crate on crates.io (and in fact, several options exist already). However, doing so in core is more efficient for binary sizes, as core already contains an internal data table for the Cased property (while third-party implementations must include their own duplicate copy). Also, developers are far more likely to simply not handle titlecase correctly than they are to add a dependency just to deal with it.
Links and related work
char::to_titlecase was added before 1.0 (rust-lang/rust#26039) but later removed (rust-lang/rust#26555, rust-lang/rust#26561), with the justification that converting a string to titlecase requires a word breaking algorithm from outside std. However, as I have argued above, providing titlecase APIs within core would be beneficial even if word breaking must still be implemented outside of it.
Nit: the initial paragraph seems to suggest that "being titlecase" is a boolean property of characters, but my best interpretation of the linked sources suggests that Unicode only defines a function for converting a character to titlecase, i.e. it is a transformation, not a property.
No, titlecase is a property of Unicode scalar values. It is the Lt category in Unicode (even if it is much smaller than the uppercase and lowercase categories).
To be exact, titlecase can refer to several things:
The titlecase transformation toTitlecase(), which converts a character or string into…
Its titlecase form; a character or string s is said to be “in titlecase” if toTitlecase(s) == s.
The titlecase property of characters, represented by the General_Category of Titlecase_Letter (Lt); a character c is said to have this property, to “be titlecase”, if c == toTitlecase(c) != toUppercase(c).