Skip to content

Refactor/cleanup of String/ByteString usage #4666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 10, 2017
Merged

Conversation

hvr
Copy link
Member

@hvr hvr commented Aug 5, 2017

This refactoring is intended to clean-up various instances where [Char] is used to convey
something other than Unicode text. This picks up unfinished yak-shaving that was put aside when working
on #3913 ... ;-)

Note: this PR has been carefully split into self-contained commits to ease reviewing

@mention-bot
Copy link

@hvr, thanks for your PR! By analyzing the history of the files in this pull request, we identified @dcoutts, @ezyang and @23Skidoo to be potential reviewers.

hvr added 3 commits August 5, 2017 19:16
This is the proper type, as 'getFileContents' is supposed to return
contents read in binary file mode, and prior to this patch, `[Char]` was
abused to return binary data.
The API was assymetric, as there was `fromUTF8(L)BS` but not
the dual operations.

The plan is refactor all occurences of

 - `fromUTF8 :: String -> String`
 - `toUTF8`  :: String -> String`

until `fromUTF8`/`toUTF8` is unused, at which point
we can officially deprecate or remove it.
@hvr hvr force-pushed the pr/utf8-refactor branch from a0408be to 881ba6b Compare August 5, 2017 17:16
This was introduced in 1821d80 but it
would imply that stdout was set to binary mode, which it isn't.
@hvr hvr changed the title WIP: Refactor/cleanup of String/ByteString usage Refactor/cleanup of String/ByteString usage Aug 5, 2017
@hvr hvr force-pushed the pr/utf8-refactor branch 2 times, most recently from 8912f68 to 3856f76 Compare August 5, 2017 18:52
hvr added 3 commits August 6, 2017 00:03
This new type will be used to disentangle conflated uses of `String` and
clearly distinguish between binary and textual data.
This removes the remaining occurences of the weakly typed
`{to,from}UTF8` conversion.
Since we don't use those functions anymore, we can finally `DEPRECATE`
them.
@hvr hvr force-pushed the pr/utf8-refactor branch from 3856f76 to 9f7f733 Compare August 5, 2017 22:03
@hvr hvr requested a review from 23Skidoo August 6, 2017 07:40
Copy link
Member

@23Skidoo 23Skidoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo minor comments.

@@ -1777,12 +1776,16 @@ checkCabalFileBOM ops = do
-- --cabal-file is specified. So if you can't find the file,
-- just don't bother with this check.
Left _ -> return $ Nothing
Right pdfile -> (flip check pc . startsWithBOM . fromUTF8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can startsWithBOM be removed now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -246,8 +246,7 @@ startsWithBOM _ = False

-- | Check whether a file has Unicode byte order mark (BOM).
fileHasBOM :: FilePath -> NoCallStackIO Bool
fileHasBOM f = fmap (startsWithBOM . fromUTF8)
. hGetContents =<< openBinaryFile f ReadMode
fileHasBOM f = (startsWithBOM . fromUTF8LBS) <$> BS.readFile f
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you use BS.isPrefixOf here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but we can also just remove fileHasBOM as nobody uses it anymore iirc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

-- closed.
--
-- @since 2.2.0
ioDataHGetContents :: Handle -> IODataMode -> IO IOData
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move IOData & friends to a separate module and import that qualified instead of using a prefix? Like the existing BS./LBS. precedent.

Copy link
Member Author

@hvr hvr Aug 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@23Skidoo what do you suggest? Something like

module Distribution.Simple.IOData
    ( IOData(..)
    , hGetContents
    , hPutContents
    -- ...
    ) where

-- ...

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's what I had in mind.

Copy link
Member Author

@hvr hvr Aug 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go for D.Utils.IOData, as that's where e.g. ShortText lives too

| c <= '\xFB' = moreBytes 5 0x200000 cs (ord c .&. 0x3)
| c <= '\xFD' = moreBytes 6 0x4000000 cs (ord c .&. 0x1)
| otherwise = replacementChar : fromUTF8 cs
fromUTF8 = decodeStringUtf8 . map c2w
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, we could also remove to/fromUTF8 w/o any deprecation. Nothing in the code-base needs it anymore

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, since this will be in Cabal 2.2, setup scripts shouldn't be affected. May be worth adding a note to the changelog about the stuff that is gone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for changelog see b1bbf2e

@hvr
Copy link
Member Author

hvr commented Aug 9, 2017

@23Skidoo I've implemented your suggestion; can you look over it?

I'll write the changelog as soon as you confirm this is what you intended :-)

@23Skidoo
Copy link
Member

23Skidoo commented Aug 9, 2017

@hvr Looks good, though you forgot to git add the new module.

hvr added 2 commits August 9, 2017 21:56
This reduces the surface area of lib:Cabal by removing
entry points that Setup.hs scripts are very unlikely to
use.
@hvr hvr force-pushed the pr/utf8-refactor branch from 353e758 to 2430a29 Compare August 9, 2017 19:56
@hvr
Copy link
Member Author

hvr commented Aug 9, 2017

@23Skidoo Doh! ... fixed...

@hvr hvr added this to the 2.2 milestone Aug 10, 2017
@hvr hvr merged commit 6a08850 into haskell:master Aug 10, 2017
@hvr hvr deleted the pr/utf8-refactor branch August 10, 2017 06:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants