Skip to content

add codepoint-based string functions as Data.String.CodePoints #79

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 48 commits into from
Jul 10, 2017

Conversation

michaelficarra
Copy link
Contributor

Functions in this module treat strings as if they were sequences of code points. I still want to pull more out of the FFI code and need to exercise all the FFI code paths by running on older browsers. But still, it should be ready for review. I'll squash after the review is done.

@michaelficarra
Copy link
Contributor Author

I've now tested all of the JS code paths meant for legacy browsers and moved everything I reasonably could from JS to PureScript. This should be good to go. Please review.

Copy link

@chexxor chexxor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great module! 👏 👏
I'm not super familiar with Unicode, so another reviewer would be nice.

import Data.Newtype (class Newtype)
import Data.String as String
-- WARN: This list must be updated to re-export any exports added to Data.String. That makes me sad.
import Data.String (Pattern(..), Replacement(..), charAt, charCodeAt, contains, fromCharArray, joinWith, localeCompare, null, replace, replaceAll, split, stripPrefix, stripSuffix, toChar, toCharArray, toLower, toUpper, trim) as StringReExports
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why export functions from a different module? I've seen this before in other modules, but I've never understood why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-export the Data.String functions so that you can switch to code point based functions by simply changing the module name. All functions that would be implemented the same (such as null, contains, toUpper, trim, etc.) are just re-exported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update the comment here to clarify this? I don't think what's currently there is quite correct.

derive instance newtypeCodePoint :: Newtype CodePoint _

codePointFromInt :: Int -> Maybe CodePoint
codePointFromInt n | 0 <= n && n <= 0x10FFFF = Just (CodePoint n)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to document why these code-points are significant and therefore are hardcoded. They are hardcoded in a few functions in this module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 17 planes in Unicode, each with 0xFFFF code points. That means that there are 0x110000 code points in total. I will pull this value out as a constant if it is used in more than one place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @chexxor - I think the best place to document this would be on the CodePoint newtype.

@@ -0,0 +1,216 @@
module Data.String.CodePoints
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for code points in a specific character set? Perhaps UTF-8 or UTF-16? If so, I wonder if it should be in the module name.

Looks like you're referencing Unicode definitions for some of this - http://www.unicode.org/glossary/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions allow you to treat PureScript/JavaScript strings as if they were sequences of Unicode code points instead of their true underlying implementation, a sequence of UTF-16 code units.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add something along those lines as a doc-comment for the module please? Just so that people who come across it on Pursuit know what it's for and when it should be used.

codePointAtFallback n s = Array.index (toCodePointArray s) n


count :: (CodePoint -> Boolean) -> String -> Int
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does count require a predicate? Is it common to consider only certain characters of a code set when counting? Or is it because code points have complications, like digraphs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See how count behaves in Data.String. This is just the equivalent function. It counts the number of leading code points/units that pass the predicate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a really odd name for that function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think there was a discussion about it somewhere else. Maybe we should keep this name for now, and open an issue to rename both instances of it the next time we make a breaking change. countPrefix or something perhaps.

@paf31
Copy link
Contributor

paf31 commented Jun 3, 2017

Agreed, the API looks nice. Thanks!

I don't really feel qualified to review this fully, but I'd just like to suggest that this could be a separate library (in core or otherwise), which would avoid the new dependencies on arrays and lists.

@michaelficarra
Copy link
Contributor Author

Okay, I think we're ready for another review. You can review just the changes since last time here: michaelficarra/purescript-strings@205838c^...codepoints

Copy link
Contributor

@hdgarrood hdgarrood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, I think we're nearly there now :)


-- | Returns a record with the first code point and the remaining code points
-- | of the given string. Returns Nothing if the string is empty. Operates in
-- | space and time linear to the length of the string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this linear? I would have hoped it would be constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was operating under the assumption that a naïve implementation of String.drop would copy the remainder of the string. But that's actually probably not the case. I can do tests, but I'm comfortable just asserting it will be constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great, I'm comfortable with that too.

import Data.String as String
import Data.String.Unsafe as Unsafe
-- WARN: In order for this module to be a drop-in replacement for Data.String,
-- this list must be updated to re-export any exports added to Data.String.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add a new function to Data.String which operates on the level of code units and therefore needs a slightly different implementation in this module instead of just re-exporting, this comment isn't quite correct, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you suggest saying here? "any exports which don't operate on indices or Chars"? Either way, whenever Data.String is changed, there is an action item, which is what I'm trying to convey.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right; how about something like:

If a new function is added to Data.String, a version of that function should be exported from this module, which should be the same except that it should operate on the code point level rather than the code unit level. If the function's behaviour does not change based on whether we consider strings as sequences of code points or code units, it can simply be re-exported from Data.String.

codePointToInt :: CodePoint -> Int
codePointToInt (CodePoint n) = n

unsurrogate :: Int -> Int -> CodePoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every use of this function has an isLead cu0 && isTrail cu1 check beforehand; do you think it might make more sense to include this check inside the unsurrogate function and have it return a Maybe CodePoint, returning Nothing if it is given arguments which don't form a surrogate pair?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only an internal function, it only has two usages, and neither would be more convenient to write that way, so I'm going to leave it as-is for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok then 👍


-- | Returns the first code point of the string after dropping the given number
-- | of code points from the beginning, if there is such a code point. Operates
-- | in constant space and in time linear to `n`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be unclear what n refers to here from the point of view of someone reading docs on Pursuit.



-- | Returns the number of code points in the given string. Operates in
-- | constant space and time linear to the length of the string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it might be slightly clearer to say "Operates in constant space and in time linear to the length of the string"?


-- | Returns a string containing the leading sequence of code points which all
-- | match the given predicate from the given string. Operates in space and
-- | time linear to the given number.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify what is meant by "the given number" here?

@michaelficarra
Copy link
Contributor Author

@hdgarrood Comments addressed.

@michaelficarra
Copy link
Contributor Author

@hdgarrood Updated.

toCodePointArrayFallback s = unfoldr unconsButWithTuple s

unconsButWithTuple :: String -> Maybe (Tuple CodePoint String)
unconsButWithTuple s' = (\{ head, tail } -> Tuple head tail) <$> uncons s'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly better to just use s as the identifier here?

@hdgarrood
Copy link
Contributor

Ok, I think this looks good. @paf31, do you have any other comments?

@paf31
Copy link
Contributor

paf31 commented Jul 9, 2017

👍 @hdgarrood Seems like you've reviewed this pretty thoroughly, so I'm happy to merge. Thanks for reviewing everything.

Thank you @michaelficarra for adding plenty of tests, it makes it easier to see what's going on 😄

Is there anything in particular you'd like to add in the release notes for this change?

@michaelficarra
Copy link
Contributor Author

@paf31 You might want to mention that in the future we may switch this with the code unit based functions as the default.

@hdgarrood
Copy link
Contributor

🎉 Thanks very much!

@hdgarrood hdgarrood merged commit d2e3292 into purescript:master Jul 10, 2017
@hdgarrood
Copy link
Contributor

Oops, I probably should have squashed. Oh well.

@davidchambers
Copy link
Contributor

I probably should have squashed.

Michael intended to do so himself:

I'll squash after the review is done.

@michaelficarra michaelficarra deleted the codepoints branch July 10, 2017 16:16
@michaelficarra
Copy link
Contributor Author

Hahaha, oh boy. That's a lot of commits. Oh well, not like we can do anything about it now. Thanks, all, for the reviews and support. This wouldn't have been nearly as good without them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants