-
Notifications
You must be signed in to change notification settings - Fork 73
add codepoint-based string functions as Data.String.CodePoints #79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I've now tested all of the JS code paths meant for legacy browsers and moved everything I reasonably could from JS to PureScript. This should be good to go. Please review. |
99ee1bf
to
c59f340
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great module! 👏 👏
I'm not super familiar with Unicode, so another reviewer would be nice.
src/Data/String/CodePoints.purs
Outdated
import Data.Newtype (class Newtype) | ||
import Data.String as String | ||
-- WARN: This list must be updated to re-export any exports added to Data.String. That makes me sad. | ||
import Data.String (Pattern(..), Replacement(..), charAt, charCodeAt, contains, fromCharArray, joinWith, localeCompare, null, replace, replaceAll, split, stripPrefix, stripSuffix, toChar, toCharArray, toLower, toUpper, trim) as StringReExports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why export functions from a different module? I've seen this before in other modules, but I've never understood why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-export the Data.String functions so that you can switch to code point based functions by simply changing the module name. All functions that would be implemented the same (such as null
, contains
, toUpper
, trim
, etc.) are just re-exported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please update the comment here to clarify this? I don't think what's currently there is quite correct.
src/Data/String/CodePoints.purs
Outdated
derive instance newtypeCodePoint :: Newtype CodePoint _ | ||
|
||
codePointFromInt :: Int -> Maybe CodePoint | ||
codePointFromInt n | 0 <= n && n <= 0x10FFFF = Just (CodePoint n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to document why these code-points are significant and therefore are hardcoded. They are hardcoded in a few functions in this module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 17 planes in Unicode, each with 0xFFFF code points. That means that there are 0x110000 code points in total. I will pull this value out as a constant if it is used in more than one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with @chexxor - I think the best place to document this would be on the CodePoint
newtype.
@@ -0,0 +1,216 @@ | |||
module Data.String.CodePoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for code points in a specific character set? Perhaps UTF-8 or UTF-16? If so, I wonder if it should be in the module name.
Looks like you're referencing Unicode definitions for some of this - http://www.unicode.org/glossary/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions allow you to treat PureScript/JavaScript strings as if they were sequences of Unicode code points instead of their true underlying implementation, a sequence of UTF-16 code units.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add something along those lines as a doc-comment for the module please? Just so that people who come across it on Pursuit know what it's for and when it should be used.
src/Data/String/CodePoints.purs
Outdated
codePointAtFallback n s = Array.index (toCodePointArray s) n | ||
|
||
|
||
count :: (CodePoint -> Boolean) -> String -> Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does count
require a predicate? Is it common to consider only certain characters of a code set when counting? Or is it because code points have complications, like digraphs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See how count
behaves in Data.String
. This is just the equivalent function. It counts the number of leading code points/units that pass the predicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a really odd name for that function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think there was a discussion about it somewhere else. Maybe we should keep this name for now, and open an issue to rename both instances of it the next time we make a breaking change. countPrefix
or something perhaps.
Agreed, the API looks nice. Thanks! I don't really feel qualified to review this fully, but I'd just like to suggest that this could be a separate library (in core or otherwise), which would avoid the new dependencies on |
Okay, I think we're ready for another review. You can review just the changes since last time here: michaelficarra/purescript-strings@205838c^...codepoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, I think we're nearly there now :)
src/Data/String/CodePoints.purs
Outdated
|
||
-- | Returns a record with the first code point and the remaining code points | ||
-- | of the given string. Returns Nothing if the string is empty. Operates in | ||
-- | space and time linear to the length of the string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this linear? I would have hoped it would be constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was operating under the assumption that a naïve implementation of String.drop
would copy the remainder of the string. But that's actually probably not the case. I can do tests, but I'm comfortable just asserting it will be constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok great, I'm comfortable with that too.
src/Data/String/CodePoints.purs
Outdated
import Data.String as String | ||
import Data.String.Unsafe as Unsafe | ||
-- WARN: In order for this module to be a drop-in replacement for Data.String, | ||
-- this list must be updated to re-export any exports added to Data.String. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add a new function to Data.String
which operates on the level of code units and therefore needs a slightly different implementation in this module instead of just re-exporting, this comment isn't quite correct, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you suggest saying here? "any exports which don't operate on indices or Chars"? Either way, whenever Data.String is changed, there is an action item, which is what I'm trying to convey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right; how about something like:
If a new function is added to Data.String, a version of that function should be exported from this module, which should be the same except that it should operate on the code point level rather than the code unit level. If the function's behaviour does not change based on whether we consider strings as sequences of code points or code units, it can simply be re-exported from Data.String.
codePointToInt :: CodePoint -> Int | ||
codePointToInt (CodePoint n) = n | ||
|
||
unsurrogate :: Int -> Int -> CodePoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every use of this function has an isLead cu0 && isTrail cu1
check beforehand; do you think it might make more sense to include this check inside the unsurrogate
function and have it return a Maybe CodePoint
, returning Nothing
if it is given arguments which don't form a surrogate pair?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only an internal function, it only has two usages, and neither would be more convenient to write that way, so I'm going to leave it as-is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok then 👍
src/Data/String/CodePoints.purs
Outdated
|
||
-- | Returns the first code point of the string after dropping the given number | ||
-- | of code points from the beginning, if there is such a code point. Operates | ||
-- | in constant space and in time linear to `n`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be unclear what n
refers to here from the point of view of someone reading docs on Pursuit.
src/Data/String/CodePoints.purs
Outdated
|
||
|
||
-- | Returns the number of code points in the given string. Operates in | ||
-- | constant space and time linear to the length of the string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it might be slightly clearer to say "Operates in constant space and in time linear to the length of the string"?
src/Data/String/CodePoints.purs
Outdated
|
||
-- | Returns a string containing the leading sequence of code points which all | ||
-- | match the given predicate from the given string. Operates in space and | ||
-- | time linear to the given number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify what is meant by "the given number" here?
28d72d0
to
255ab82
Compare
255ab82
to
3a24c8d
Compare
@hdgarrood Comments addressed. |
@hdgarrood Updated. |
src/Data/String/CodePoints.purs
Outdated
toCodePointArrayFallback s = unfoldr unconsButWithTuple s | ||
|
||
unconsButWithTuple :: String -> Maybe (Tuple CodePoint String) | ||
unconsButWithTuple s' = (\{ head, tail } -> Tuple head tail) <$> uncons s' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly better to just use s
as the identifier here?
Ok, I think this looks good. @paf31, do you have any other comments? |
👍 @hdgarrood Seems like you've reviewed this pretty thoroughly, so I'm happy to merge. Thanks for reviewing everything. Thank you @michaelficarra for adding plenty of tests, it makes it easier to see what's going on 😄 Is there anything in particular you'd like to add in the release notes for this change? |
@paf31 You might want to mention that in the future we may switch this with the code unit based functions as the default. |
🎉 Thanks very much! |
Oops, I probably should have squashed. Oh well. |
Michael intended to do so himself:
|
Hahaha, oh boy. That's a lot of commits. Oh well, not like we can do anything about it now. Thanks, all, for the reviews and support. This wouldn't have been nearly as good without them. |
Functions in this module treat strings as if they were sequences of code points. I still want to pull more out of the FFI code and need to exercise all the FFI code paths by running on older browsers. But still, it should be ready for review. I'll squash after the review is done.