add codepoint-based string functions as Data.String.CodePoints #79

michaelficarra · 2017-05-27T22:17:56Z

Functions in this module treat strings as if they were sequences of code points. I still want to pull more out of the FFI code and need to exercise all the FFI code paths by running on older browsers. But still, it should be ready for review. I'll squash after the review is done.

michaelficarra · 2017-05-29T18:08:15Z

I've now tested all of the JS code paths meant for legacy browsers and moved everything I reasonably could from JS to PureScript. This should be good to go. Please review.

chexxor

Great module! 👏 👏
I'm not super familiar with Unicode, so another reviewer would be nice.

chexxor · 2017-06-01T02:54:40Z

src/Data/String/CodePoints.purs

+import Data.Newtype (class Newtype)
+import Data.String as String
+-- WARN: This list must be updated to re-export any exports added to Data.String. That makes me sad.
+import Data.String (Pattern(..), Replacement(..), charAt, charCodeAt, contains, fromCharArray, joinWith, localeCompare, null, replace, replaceAll, split, stripPrefix, stripSuffix, toChar, toCharArray, toLower, toUpper, trim) as StringReExports


Why export functions from a different module? I've seen this before in other modules, but I've never understood why.

I re-export the Data.String functions so that you can switch to code point based functions by simply changing the module name. All functions that would be implemented the same (such as null, contains, toUpper, trim, etc.) are just re-exported.

Can you please update the comment here to clarify this? I don't think what's currently there is quite correct.

chexxor · 2017-06-01T02:55:35Z

src/Data/String/CodePoints.purs

+derive instance newtypeCodePoint :: Newtype CodePoint _
+
+codePointFromInt :: Int -> Maybe CodePoint
+codePointFromInt n | 0 <= n && n <= 0x10FFFF = Just (CodePoint n)


Would be nice to document why these code-points are significant and therefore are hardcoded. They are hardcoded in a few functions in this module.

There are 17 planes in Unicode, each with 0x10000 code points. That means that there are 0x110000 code points in total, so the largest valid code point is 0x10FFFF. I will pull this value out as a constant if it is used in more than one place.

Agreed with @chexxor - I think the best place to document this would be on the CodePoint newtype.
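For readers following along, here is a minimal sketch of where the 0x10FFFF bound comes from, written in JavaScript rather than PureScript. The constant names and `isCodePointValue` are hypothetical and are not part of the PR; the sketch only mirrors the range check shown in the `codePointFromInt` excerpt.

```javascript
// Hypothetical constants: Unicode has 17 planes of 0x10000 (65,536) code points each.
const PLANES = 17;
const CODE_POINTS_PER_PLANE = 0x10000;
const CODE_POINT_COUNT = PLANES * CODE_POINTS_PER_PLANE; // 0x110000
const MAX_CODE_POINT = CODE_POINT_COUNT - 1; // 0x10FFFF

// Hypothetical helper: the same bounds as the guard in codePointFromInt.
function isCodePointValue(n) {
  return Number.isInteger(n) && 0 <= n && n <= MAX_CODE_POINT;
}
```

Naming the bound once, as suggested above, would avoid repeating the literal in each function that needs it.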

chexxor · 2017-06-01T02:57:25Z

src/Data/String/CodePoints.purs

@@ -0,0 +1,216 @@
+module Data.String.CodePoints


Is this for code points in a specific character set? Perhaps UTF-8 or UTF-16? If so, I wonder if it should be in the module name.

Looks like you're referencing Unicode definitions for some of this - http://www.unicode.org/glossary/

These functions allow you to treat PureScript/JavaScript strings as if they were sequences of Unicode code points instead of their true underlying implementation, a sequence of UTF-16 code units.

Could you add something along those lines as a doc-comment for the module please? Just so that people who come across it on Pursuit know what it's for and when it should be used.
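To make the distinction concrete, the following JavaScript sketch (not taken from the PR) shows the same string viewed as UTF-16 code units and as code points. The variable names are illustrative only.

```javascript
// 'a', U+1F600 (outside the Basic Multilingual Plane), 'b'
const s = "a\u{1F600}b";

// Code-unit view: .length and charCodeAt work on UTF-16 code units,
// so U+1F600 occupies two positions (a surrogate pair).
const codeUnitLength = s.length;
const secondUnit = s.charCodeAt(1); // the lead surrogate, not the whole character

// Code-point view: the string iterator yields one whole code point at a time.
const codePoints = Array.from(s, (c) => c.codePointAt(0));
```

Here `codeUnitLength` is 4 while `codePoints` has 3 elements, which is the difference this module is meant to hide.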

chexxor · 2017-06-01T03:08:51Z

src/Data/String/CodePoints.purs

+codePointAtFallback n s = Array.index (toCodePointArray s) n
+
+
+count :: (CodePoint -> Boolean) -> String -> Int


Why does count require a predicate? Is it common to consider only certain characters of a code set when counting? Or is it because code points have complications, like digraphs?

See how count behaves in Data.String. This is just the equivalent function. It counts the number of leading code points/units that pass the predicate.

It's just a really odd name for that function.

Yeah I think there was a discussion about it somewhere else. Maybe we should keep this name for now, and open an issue to rename both instances of it the next time we make a breaking change. countPrefix or something perhaps.
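As an illustration of the behaviour described above (counting only the leading code points that satisfy the predicate), here is a small JavaScript sketch. It uses the `countPrefix` name floated in this thread; it is an assumption about how such a function could look, not the PR's implementation.

```javascript
// Hypothetical sketch: counts the leading code points that satisfy `pred`,
// stopping at the first one that does not.
function countPrefix(pred, s) {
  let n = 0;
  for (const ch of s) { // for...of iterates a string by code point
    if (!pred(ch.codePointAt(0))) break;
    n += 1;
  }
  return n;
}
```

For example, with a predicate that accepts only code points above 0xFFFF, a string that starts with two such code points followed by `ab` gives 2, regardless of what comes later.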

paf31 · 2017-06-03T21:15:53Z

Agreed, the API looks nice. Thanks!

I don't really feel qualified to review this fully, but I'd just like to suggest that this could be a separate library (in core or otherwise), which would avoid the new dependencies on arrays and lists.

michaelficarra · 2017-07-07T09:36:25Z

Okay, I think we're ready for another review. You can review just the changes since last time here: michaelficarra/purescript-strings@205838c^...codepoints

hdgarrood

Thanks for this, I think we're nearly there now :)

hdgarrood · 2017-07-07T15:04:07Z

src/Data/String/CodePoints.purs

+
+-- | Returns a record with the first code point and the remaining code points
+-- | of the given string. Returns Nothing if the string is empty. Operates in
+-- | space and time linear to the length of the string.


Is this linear? I would have hoped it would be constant.

I was operating under the assumption that a naïve implementation of String.drop would copy the remainder of the string. But that's actually probably not the case. I can do tests, but I'm comfortable just asserting it will be constant.

Ok great, I'm comfortable with that too.

hdgarrood · 2017-07-07T15:17:14Z

src/Data/String/CodePoints.purs

+import Data.String as String
+import Data.String.Unsafe as Unsafe
+-- WARN: In order for this module to be a drop-in replacement for Data.String,
+-- this list must be updated to re-export any exports added to Data.String.


If we add a new function to Data.String which operates on the level of code units and therefore needs a slightly different implementation in this module instead of just re-exporting, this comment isn't quite correct, right?

What would you suggest saying here? "any exports which don't operate on indices or Chars"? Either way, whenever Data.String is changed, there is an action item, which is what I'm trying to convey.

Right; how about something like:

If a new function is added to Data.String, a version of that function should be exported from this module, which should be the same except that it should operate on the code point level rather than the code unit level. If the function's behaviour does not change based on whether we consider strings as sequences of code points or code units, it can simply be re-exported from Data.String.

hdgarrood · 2017-07-07T15:21:31Z

src/Data/String/CodePoints.purs

+codePointToInt :: CodePoint -> Int
+codePointToInt (CodePoint n) = n
+
+unsurrogate :: Int -> Int -> CodePoint


Every use of this function has an isLead cu0 && isTrail cu1 check beforehand; do you think it might make more sense to include this check inside the unsurrogate function and have it return a Maybe CodePoint, returning Nothing if it is given arguments which don't form a surrogate pair?

It's only an internal function, it only has two usages, and neither would be more convenient to write that way, so I'm going to leave it as-is for now.

Ok then 👍
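For reference, the suggested variant could be sketched as follows in JavaScript. This is only an illustration of the reviewer's idea, which was not adopted; `unsurrogateMaybe` is a hypothetical name, `null` stands in for `Nothing`, and the `isLead`/`isTrail` definitions are assumed to be the standard UTF-16 surrogate ranges.

```javascript
// Assumed definitions: the standard UTF-16 lead and trail surrogate ranges.
const isLead = (cu) => 0xD800 <= cu && cu <= 0xDBFF;
const isTrail = (cu) => 0xDC00 <= cu && cu <= 0xDFFF;

// Hypothetical variant: performs the pair check itself and returns null
// (standing in for Nothing) when the inputs are not a surrogate pair.
function unsurrogateMaybe(cu0, cu1) {
  if (!(isLead(cu0) && isTrail(cu1))) return null;
  return (cu0 - 0xD800) * 0x400 + (cu1 - 0xDC00) + 0x10000;
}
```

The trade-off discussed above still applies: callers that have already checked the pair would have to unwrap a result that cannot fail.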

hdgarrood · 2017-07-07T15:26:33Z

src/Data/String/CodePoints.purs

+
+-- | Returns the first code point of the string after dropping the given number
+-- | of code points from the beginning, if there is such a code point. Operates
+-- | in constant space and in time linear to `n`.


I think it might be unclear what n refers to here from the point of view of someone reading docs on Pursuit.
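A short JavaScript sketch may help show what `n` means in that doc comment: it is the index argument, counted in code points. `codePointAtIndex` is a hypothetical name and `null` stands in for `Nothing`; this is not the PR's code.

```javascript
// Hypothetical sketch: returns the code point at code-point index `n`,
// or null if the string has fewer than n + 1 code points.
// Walks at most n + 1 code points and allocates nothing proportional to the input.
function codePointAtIndex(n, s) {
  if (!Number.isInteger(n) || n < 0) return null;
  let i = 0;
  for (const ch of s) { // iterates by code point
    if (i === n) return ch.codePointAt(0);
    i += 1;
  }
  return null;
}
```

Note that index 2 of `"a\u{1F600}b"` is `b` here, whereas the code-unit index 2 of the same string is the trail surrogate of U+1F600.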

hdgarrood · 2017-07-07T15:30:13Z

src/Data/String/CodePoints.purs

+
+
+-- | Returns the number of code points in the given string. Operates in
+-- | constant space and time linear to the length of the string.


Do you think it might be slightly clearer to say "Operates in constant space and in time linear to the length of the string"?

hdgarrood · 2017-07-07T15:31:51Z

src/Data/String/CodePoints.purs

+
+-- | Returns a string containing the leading sequence of code points which all
+-- | match the given predicate from the given string. Operates in space and
+-- | time linear to the given number.


Could you please clarify what is meant by "the given number" here?

michaelficarra · 2017-07-07T16:57:42Z

@hdgarrood Comments addressed.

michaelficarra · 2017-07-08T05:42:15Z

@hdgarrood Updated.

hdgarrood · 2017-07-08T12:24:45Z

src/Data/String/CodePoints.purs

+toCodePointArrayFallback s = unfoldr unconsButWithTuple s
+
+unconsButWithTuple :: String -> Maybe (Tuple CodePoint String)
+unconsButWithTuple s' = (\{ head, tail } -> Tuple head tail) <$> uncons s'


Possibly better to just use s as the identifier here?
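For context, the shape of this fallback (repeatedly splitting off the first code point until the string is empty) can be sketched in JavaScript as below. The function bodies are assumptions made for illustration, with `null` standing in for `Nothing`; they are not the PR's implementation.

```javascript
// Hypothetical sketch of uncons: splits off the first code point.
function uncons(s) {
  if (s.length === 0) return null;
  const head = s.codePointAt(0);
  const width = head > 0xFFFF ? 2 : 1; // number of code units the head occupies
  return { head, tail: s.slice(width) };
}

// Unfold: keep applying uncons to the remaining string, collecting each head.
function toCodePointArray(s) {
  const out = [];
  for (let r = uncons(s); r !== null; r = uncons(r.tail)) {
    out.push(r.head);
  }
  return out;
}
```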

hdgarrood · 2017-07-08T12:25:23Z

Ok, I think this looks good. @paf31, do you have any other comments?

paf31 · 2017-07-09T20:25:25Z

👍 @hdgarrood Seems like you've reviewed this pretty thoroughly, so I'm happy to merge. Thanks for reviewing everything.

Thank you @michaelficarra for adding plenty of tests, it makes it easier to see what's going on 😄

Is there anything in particular you'd like to add in the release notes for this change?

michaelficarra · 2017-07-09T20:57:08Z

@paf31 You might want to mention that, in the future, we may make these code point based functions the default in place of the code unit based ones.

hdgarrood · 2017-07-10T13:22:00Z

🎉 Thanks very much!

hdgarrood · 2017-07-10T13:23:36Z

Oops, I probably should have squashed. Oh well.

davidchambers · 2017-07-10T13:34:27Z

> I probably should have squashed.

Michael intended to do so himself:

> I'll squash after the review is done.

michaelficarra · 2017-07-10T16:19:15Z

Hahaha, oh boy. That's a lot of commits. Oh well, not like we can do anything about it now. Thanks, all, for the reviews and support. This wouldn't have been nearly as good without them.

michaelficarra added 23 commits May 24, 2017 23:29

WIP code point based string functions

428b995

more progress

dc0577c

minor stuff

25572de

count

8279da8

drop and take

292e0de

length

fd91b0b

singleton

8387641

splitAt

fb47387

use String.fromCodePoint in singleton implementation when available

5a6cfd0

re-export Data.String

3003c09

uncons

75117d2

re-arrange imports

ecfbf0b

re-arrange JS exports

d5b6d92

fix count; implement dropWhile and takeWhile

8860295

indexOf and lastIndexOf

a6855b4

add some initial tests and fix some bugs

8c55257

trailing whitespace

a26afdf

finished the tests

c1ff8c5

fix linting errors

04154a5

change re-export of Data.String

c798dfe

bugfixes

71cdcf2

move fromCodePoint from JS to purs

2c2418a

move codePointAt0 from JS to purs

46e9545

remove TODOs

c59f340

michaelficarra force-pushed the codepoints branch from 99ee1bf to c59f340 Compare May 29, 2017 18:41

chexxor reviewed Jun 1, 2017

use charCodeAt from Data.String.Unsafe

71c5156

chexxor mentioned this pull request Jun 2, 2017

Rename "count" to "countPrefix" #81

Closed

michaelficarra added 7 commits July 5, 2017 09:51

consistent code unit variable names

4292a8b

bug fix lastIndexOf'

0d81e0b

add comments and complexity notes

370af7c

update Data.String import warning comment

cef521a

refactor to avoid lists dep; better complexity adherence in fallbacks

b38eb80

remove fallback to Array.from in codePointAt JS implementation for now

4f3d71d

prefer let over where

e3cea19

hdgarrood reviewed Jul 7, 2017

michaelficarra force-pushed the codepoints branch from 28d72d0 to 255ab82 Compare July 7, 2017 16:44

michaelficarra added 2 commits July 7, 2017 09:54

change JS implementation of count to use string iterator if possible

db3eba3

update comments

3a24c8d

michaelficarra force-pushed the codepoints branch from 255ab82 to 3a24c8d Compare July 7, 2017 16:54

michaelficarra added 2 commits July 7, 2017 22:27

pull functions out of where clauses

82a502f

change complexity documentation for drop{,While} and take{,While}

085022e

hdgarrood reviewed Jul 8, 2017

forgot about a prime

6edb70f

hdgarrood approved these changes Jul 8, 2017

hdgarrood merged commit d2e3292 into purescript:master Jul 10, 2017

michaelficarra deleted the codepoints branch July 10, 2017 16:16

hdgarrood mentioned this pull request Jan 14, 2019

Include unicode escape sequence in docs on String literal purescript/documentation#232

Open
