Add titlecase APIs to char #354

Open
Jules-Bertholet opened this issue Mar 16, 2024 · 3 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@Jules-Bertholet

Jules-Bertholet commented Mar 16, 2024

Proposal

Problem statement

The char module provides char::is_uppercase() and char::is_lowercase() for determining whether a character is uppercase or lowercase, and char::to_uppercase() and char::to_lowercase() for converting to uppercase or lowercase. However, a small number of characters are titlecase, which is in between the two; the standard library provides no APIs for handling titlecase.
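To make the gap concrete, here is a minimal sketch that uses only the existing stable `char` methods. The helper name `case_report` is hypothetical and exists only for illustration. U+01C5 (ǅ) is one of the titlecase letters.

```rust
/// Hypothetical helper: collects what the existing `char` case APIs
/// report for a single character, in the order
/// (is_uppercase, is_lowercase, uppercase mapping, lowercase mapping).
fn case_report(c: char) -> (bool, bool, String, String) {
    (
        c.is_uppercase(),
        c.is_lowercase(),
        c.to_uppercase().collect(),
        c.to_lowercase().collect(),
    )
}
```

For `'\u{01C5}'` both predicates return `false`, and the two mappings yield U+01C4 (Ǆ) and U+01C6 (ǆ) respectively, so nothing in the current API identifies the character as titlecase or maps other characters to it.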

Motivating examples or use cases

Many software systems place restrictions on the allowed case of characters, or use case for various semantic distinctions. For example, a programming language might require local variable names to be lowercase, or constants to be uppercase. Because most characters and languages are caseless, such rules are usually best implemented by excluding particular cases rather than requiring a particular case. Titlecase characters are conceptually both partly lowercase and partly uppercase, so an API that excludes either lowercase or uppercase characters will want to exclude titlecase as well, and an API that assigns special meaning to a particular case will generally want to assign the meaning to titlecase also.
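As a rough illustration of such exclusion-based rules, the sketch below writes them with today's stable APIs. The function names are hypothetical. Both checks accept a name that starts with the titlecase letter U+01C5 (ǅ), which is the gap described above.

```rust
/// Hypothetical "locals must not be uppercase" rule, written by
/// excluding uppercase characters rather than requiring lowercase ones,
/// so that caseless scripts and digits are still accepted.
fn has_no_uppercase(name: &str) -> bool {
    !name.chars().any(char::is_uppercase)
}

/// Hypothetical "constants must not be lowercase" rule, written the same way.
fn has_no_lowercase(name: &str) -> bool {
    !name.chars().any(char::is_lowercase)
}
```

With an `is_titlecase` predicate available, each closure could also reject titlecase characters, which is presumably what both rules intend.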

In addition, it's common to want to convert a string to titlecase, which means capitalizing the first letter of all or most words. Defining what a word is, and deciding which words should be capitalized, is complex and context-dependent, and thus unsuited for the standard library. (Notably, UAX 29 and the unicode-segmentation crate are not the end-all-be-all of determining word boundaries. For example, software identifiers, like those dealt with by heck, have a very different concept of what a word is compared to normal running text.) However, once individual words have been isolated for capitalization, the capitalization process and result are the same across all domains (disregarding locale-specific special casings that the standard library does not handle). The exact rule is defined by the Unicode Standard:

For a string X: [...]

R3 toTitlecase(X): Find the word boundaries in X [...] For each word boundary, find the first cased character F following the word boundary. If F exists, map F to Titlecase_Mapping(F); then map all characters C between F and the following word boundary to Lowercase_Mapping(C).

This algorithm is not complicated to implement, as long as the titlecase mappings are available. However, if the titlecase mappings are not available, users are far more likely to resort to an erroneous implementation using to_uppercase, rather than add an additional crates.io dependency or compile the data themselves.
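The per-word part of R3 can be sketched on stable Rust as follows. This is an illustration under stated assumptions, not the proposed implementation: `to_titlecase_partial` and `is_cased_approx` are hypothetical stand-ins with hand-written, deliberately incomplete data, and `naive_capitalize` is the kind of `to_uppercase`-based shortcut the paragraph above warns about.

```rust
/// Hypothetical, deliberately partial titlecase mapping. Only the Latin
/// digraphs and U+00DF (ß) are special-cased; everything else falls back
/// to the uppercase mapping, which is wrong for some characters (for
/// example the Greek titlecase letters). Not a complete table.
fn to_titlecase_partial(c: char) -> String {
    match c {
        '\u{01C4}'..='\u{01C6}' => "\u{01C5}".to_string(), // Ǆ ǅ ǆ -> ǅ
        '\u{01C7}'..='\u{01C9}' => "\u{01C8}".to_string(), // Ǉ ǈ ǉ -> ǈ
        '\u{01CA}'..='\u{01CC}' => "\u{01CB}".to_string(), // Ǌ ǋ ǌ -> ǋ
        '\u{01F1}'..='\u{01F3}' => "\u{01F2}".to_string(), // Ǳ ǲ ǳ -> ǲ
        '\u{00DF}' => "Ss".to_string(),                    // ß -> Ss
        _ => c.to_uppercase().collect(),
    }
}

/// Approximation of the Cased property: lowercase, uppercase, or one of
/// the four Latin titlecase digraphs (the Greek ones are omitted here).
fn is_cased_approx(c: char) -> bool {
    c.is_lowercase()
        || c.is_uppercase()
        || matches!(c, '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}')
}

/// R3 applied to one already-isolated word: characters before the first
/// cased character are kept, that character is titlecased, and every
/// character after it is lowercased.
fn titlecase_word(word: &str) -> String {
    let mut out = String::with_capacity(word.len());
    let mut seen_first_cased = false;
    for c in word.chars() {
        if seen_first_cased {
            out.extend(c.to_lowercase());
        } else if is_cased_approx(c) {
            out.push_str(&to_titlecase_partial(c));
            seen_first_cased = true;
        } else {
            out.push(c);
        }
    }
    out
}

/// The common shortcut: uppercase the first character, lowercase the rest.
fn naive_capitalize(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first
            .to_uppercase()
            .chain(chars.flat_map(char::to_lowercase))
            .collect(),
        None => String::new(),
    }
}
```

For a word beginning with U+01C6 (ǆ), `titlecase_word` produces the titlecase digraph U+01C5 (ǅ) while `naive_capitalize` produces the fully uppercase U+01C4 (Ǆ). The shortcut also leaves a word such as `'tis` unchanged, because its first character is not cased.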

Solution sketch

Add the following to core::char:

/// Analogous to [`ToUppercase`](https://doc.rust-lang.org/core/char/struct.ToUppercase.html)
/// and [`ToLowercase`](https://doc.rust-lang.org/core/char/struct.ToLowercase.html).
#[derive(Clone, Debug)]
pub struct ToTitlecase(/*...*/);

impl Iterator for ToTitlecase {
    type Item = char;
    /* ... */ 
}

impl fmt::Display for ToTitlecase { /* ... */ }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum CharCase {
    Lower = 0b00,
    Title = 0b10,
    Upper = 0b11,
}
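For orientation, this is how the existing `ToUppercase` type can be consumed today through the two traits the sketch also lists for `ToTitlecase`. The helper names are hypothetical; presumably a `to_titlecase()` result would be usable the same way.

```rust
/// Consume the case-mapping iterator through its `Iterator` impl.
fn upper_via_iterator(c: char) -> String {
    c.to_uppercase().collect()
}

/// Consume the same type through its `Display` impl.
fn upper_via_display(c: char) -> String {
    c.to_uppercase().to_string()
}
```

Both helpers return `"SS"` for `'ß'`, which shows why the return type is an iterator rather than a single `char`: one character can map to several.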

Add the following implementations to char:

/// Whether the character is uppercase, lowercase, or titlecase.
/// `core` already includes a data table for this property internally
/// (used to implement the final-sigma casing rules),
/// so implementation is trivial.
#[must_use]
#[inline]
pub fn is_cased(self) -> bool {
    match self {
        'A'..='Z' | 'a'..='z' => true,
        '\0'..='\u{A9}' => false,
        _ => unicode::Cased(self),
    }
}

/// Whether the character is in Unicode general category Titlecase_Letter.
#[must_use]
#[inline]
pub fn is_titlecase(self) -> bool {
    match self {
        '\0'..='\u{01C4}' => false,
        _ => self.is_cased() && !self.is_lowercase() && !self.is_uppercase()
    }
}

use core::char::CharCase;

/// The case of this character, or `None` if it is uncased.
#[must_use]
pub fn case(self) -> Option<CharCase> {
    match self {
        'A'..='Z' => Some(CharCase::Upper),
        'a'..='z' => Some(CharCase::Lower),
        '\0'..='\u{A9}' => None,
        _ if !self.is_cased() => None,
        _ if self.is_lowercase() => Some(CharCase::Lower),
        _ if self.is_uppercase() => Some(CharCase::Upper),
        _ => Some(CharCase::Title),
    }
}

use core::char::ToTitlecase;

/// The only proposed API
/// that requires adding new static data to `core::unicode`.
/// Most characters map to the same uppercase and titlecase,
/// so we would only need to store the mappings that differ.
#[must_use]
#[inline]
pub fn to_titlecase(self) -> ToTitlecase {
    ToTitlecase(CaseMappingIter::new(conversions::to_title(self)))
}
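The `is_titlecase` and `case` sketches above rely on the internal `unicode::Cased` table, so they cannot be compiled outside `core`. A rough userland equivalent for experimenting on stable Rust might look like the following. The names `is_titlecase_letter` and `case_of` are hypothetical, the enum is a local copy of the proposed one, and the code points of general category Lt are listed by hand.

```rust
/// Local copy of the proposed enum, so the sketch compiles on stable Rust.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum CharCase {
    Lower = 0b00,
    Title = 0b10,
    Upper = 0b11,
}

/// Stand-in for the proposed `char::is_titlecase`: the code points in
/// general category Lt (Titlecase_Letter), written out by hand.
fn is_titlecase_letter(c: char) -> bool {
    matches!(
        c,
        '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}'
            | '\u{1F88}'..='\u{1F8F}'
            | '\u{1F98}'..='\u{1F9F}'
            | '\u{1FA8}'..='\u{1FAF}'
            | '\u{1FBC}' | '\u{1FCC}' | '\u{1FFC}'
    )
}

/// Stand-in for the proposed `char::case`. A character is cased if it is
/// lowercase, uppercase, or a titlecase letter, so testing the three
/// classes in turn covers every cased character.
fn case_of(c: char) -> Option<CharCase> {
    if c.is_lowercase() {
        Some(CharCase::Lower)
    } else if c.is_uppercase() {
        Some(CharCase::Upper)
    } else if is_titlecase_letter(c) {
        Some(CharCase::Title)
    } else {
        None
    }
}
```

Because the enum derives `Ord` with the discriminants shown, `CharCase::Lower < CharCase::Title < CharCase::Upper` holds, which matches the description of titlecase as sitting between the other two cases.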

Alternatives

These APIs could be implemented by a crate on crates.io (and in fact, several options exist already). However, doing so in core is more efficient for binary sizes, as core already contains an internal data table for the Cased property (while third-party implementations must include their own duplicate copy). Also, developers are far more likely to simply not handle titlecase correctly than to add a dependency just to deal with it.

Links and related work

char::to_titlecase was added before 1.0 (rust-lang/rust#26039) but later removed (rust-lang/rust#26555, rust-lang/rust#26561), with the justification that converting a string to titlecase requires a word breaking algorithm from outside std. However, as I have argued above, providing titlecase APIs within core would be beneficial even if word breaking must still be implemented outside of it.

@Jules-Bertholet Jules-Bertholet added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Mar 16, 2024
@lolbinarycat

Nit: the initial paragraph seems to suggest that "being titlecase" is a boolean property of characters, but my best interpretation of the linked sources suggests that Unicode only defines a function for converting a character to titlecase, i.e. it is a transformation, not a property.

@RustyYato

Nit: the initial paragraph seems to suggest that "being titlecase" is a boolean property of characters, but my best interpretation of the linked sources suggests that Unicode only defines a function for converting a character to titlecase, i.e. it is a transformation, not a property.

No, title case is a property of Unicode scalars. It is the Lt category in Unicode (even if it is much smaller than the upper case and lower case categories).

See: https://www.unicode.org/reports/tr44/#GC_Values_Table

For reference: here's a list of all characters in the Lt category: https://www.compart.com/en/unicode/category/Lt

@Jules-Bertholet
Author

To be exact, titlecase can refer to several things:

  • The titlecase transformation toTitlecase(), which converts a character or string into…
  • Its titlecase form; a character or string s is said to be “in titlecase” if toTitlecase(s) == s.
  • The titlecase property of characters, represented by the General_Category of Titlecase_Letter (Lt); a character c is said to have this property, to “be titlecase”, if c == toTitlecase(c) != toUppercase(c).
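The difference between the second and third senses can be checked with a small sketch. `title_of` is a hypothetical, hand-written mapping that only knows the Ǆ/ǅ/ǆ digraph and otherwise falls back to the uppercase mapping, so it is only meaningful for the characters used here.

```rust
/// Hypothetical titlecase mapping covering only U+01C4..=U+01C6;
/// all other characters fall back to their uppercase mapping.
fn title_of(c: char) -> String {
    match c {
        '\u{01C4}'..='\u{01C6}' => "\u{01C5}".to_string(),
        _ => c.to_uppercase().collect(),
    }
}

/// Second sense: `c` is "in titlecase" if toTitlecase(c) == c.
fn is_in_titlecase_form(c: char) -> bool {
    title_of(c) == c.to_string()
}

/// Third sense: `c` "is titlecase" if c == toTitlecase(c) != toUppercase(c).
fn has_titlecase_property(c: char) -> bool {
    let upper: String = c.to_uppercase().collect();
    is_in_titlecase_form(c) && title_of(c) != upper
}
```

Under these definitions `'A'` is in titlecase form but does not have the titlecase property, because its titlecase and uppercase mappings coincide, while U+01C5 (ǅ) satisfies both.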
