Use the GHC lexer for the Hyperlinker backend #714
Conversation
This reverts commit b605510. Conflicts: haddock-api/haddock-api.cabal haddock-api/src/Haddock/Interface.hs
Things now run using the GHC lexer. There are still - stray debug statements - unnecessary changes w.r.t. master
Things are looking good. quasiquotes in particular look beautiful: the TH ones (with Haskell source inside) colour/link their contents too! Haven't yet begun to check for possible performance problems.
The support for these is hackier - but no more hacky than the existing support.
Alright. I think I've actually found an acceptable workaround to my two points from the previous comment. In a nutshell, we will do no better than we did before for parsing CPP and pragmas.
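The two heuristics in question - treating a line as CPP when it starts with `#`, and treating anything of the form `{-# ... #-}` as a pragma - might be sketched like this (illustrative predicates, not the PR's actual code):

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf, isSuffixOf)

-- A line is treated as CPP if its first non-whitespace character is '#'.
looksLikeCPP :: String -> Bool
looksLikeCPP line = "#" `isPrefixOf` dropWhile isSpace line

-- Anything of the form {-# ... #-} is assumed to be a pragma.
looksLikePragma :: String -> Bool
looksLikePragma s = "{-#" `isPrefixOf` s && "#-}" `isSuffixOf` s
```

Both predicates are deliberately suboptimal: the first also fires on, e.g., a string literal gap or a shebang-like line, and the second cannot tell a real pragma from a comment that merely looks like one.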
These both sound like acceptable workarounds. We could probably do better for the second point, but it would require some non-trivial GHC lexer hacking - which is probably not worth it. TODO before this is mergeable:
Nice! Finally a chance to throw the custom lexer overboard! This sounds reasonable! Did you run the hyperlinker tests?
I think a lot of the performance improvement in this area came with 740458a, which eliminates an accidentally quadratic look-up.
Since we are working with GHC's SrcLoc and SrcSpan, we can also eliminate the custom Position and Span from Haddock, right?
Alec Theriault <[email protected]> wrote on Wed., Dec. 6, 2017, 06:31:
… Alright. I think I've actually found an acceptable workaround to my two
points from the previous comment. In a nutshell, we will do no better than
we did before for parsing CPP and pragmas.
- we still identify CPP based on the suboptimal criteria that the line
starts with #
- we still assume anything of the form {-# ... #-} is a pragma
These both sound like acceptable workarounds. We could probably do better
for the second point, but it would require some non-trivial GHC lexer
hacking - which is probably not worth it.
TODO before this is mergeable:
- test performance
- update tests
The tests were in some cases altered: I consider the new output to be more correct than the old one.
Replaces 'Position' -> 'GHC.RealSrcLoc' and 'Span' -> 'GHC.RealSrcSpan'.
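Since the old types carried only a line and a column, while GHC's carry the file name too, a conversion shim between them would look roughly as follows. This is a hypothetical sketch against the GHC 8.x API - the PR removes the custom types outright rather than converting between them:

```haskell
import SrcLoc (RealSrcLoc, mkRealSrcLoc, srcLocLine, srcLocCol)
import FastString (mkFastString)

-- Haddock's old Position carried only line/column; RealSrcLoc also
-- records the file name, so a conversion needs one supplied.
data Position = Position { posRow :: !Int, posCol :: !Int }

toRealSrcLoc :: FilePath -> Position -> RealSrcLoc
toRealSrcLoc file (Position r c) = mkRealSrcLoc (mkFastString file) r c

fromRealSrcLoc :: RealSrcLoc -> Position
fromRealSrcLoc loc = Position (srcLocLine loc) (srcLocCol loc)
```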
I think this is ready to merge. Sample of changes (before and after): Here are the only caveats I can think of:
Please do try this out before merging - I'd like more confidence about performance in particular. 😄 Thanks!
Exciting stuff! Will try this at home! Boy, the next haddock release is packed with new features and fixes. Awesome!
Almost there! Just one more nit.
hSrc = concat [ hLine | Right hLine <- hLinesRight ]
cppSrc = concat [ cppLine | Left cppLine <- cppLinesLeft ]
...
in case L.lexTokenStream (stringToStringBuffer hSrc) pos dflags of
IIUC, for each chunk (that is, code between some CPP markers) we call the lexer, and for each chunk we call stringToStringBuffer. If the lexer fails, we make the offending part into a comment and try again with the rest, creating a new StringBuffer and calling the lexer again. Calling stringToStringBuffer is pretty expensive, and doing it over and over this way seems quadratic.
I don't think that is happening. If the lexer fails, that chunk gets marked as one big "unknown" token and we move on (see the PFailed case - we just shove hSrc into one big token).
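A minimal sketch of that failure mode, assuming the GHC 8.x Lexer API (the helper name `lexChunk` is made up, not the PR's actual code):

```haskell
import qualified Lexer as L
import StringBuffer (stringToStringBuffer)
import SrcLoc (RealSrcLoc, Located)
import DynFlags (DynFlags)

-- 'Right' carries the lexed tokens; 'Left' carries the raw chunk text,
-- which the caller renders as a single opaque "unknown" token. There is
-- no re-buffering and no second lexer pass, hence no quadratic blow-up.
lexChunk :: DynFlags -> RealSrcLoc -> String -> Either String [Located L.Token]
lexChunk dflags pos hSrc =
  case L.lexTokenStream (stringToStringBuffer hSrc) pos dflags of
    L.POk _ toks -> Right toks
    L.PFailed {} -> Left hSrc
```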
Right! The result looks really good!
Thanks for reviewing - I really appreciate how responsive you've been!
* Start changing to use GHC lexer * better cpp * Change SrcSpan to RealSrcSpan * Remove error * Try to stop too many open files * wip * wip * Revert "wip" This reverts commit b605510. Conflicts: haddock-api/haddock-api.cabal haddock-api/src/Haddock/Interface.hs * Remove pointless 'caching' * Use dlist rather than lists when finding vars * Use a map rather than list * Delete bogus comment * Rebase followup Things now run using the GHC lexer. There are still - stray debug statements - unnecessary changes w.r.t. master * Cleaned up differences w.r.t. current Haddock HEAD Things are looking good. quasiquotes in particular look beautiful: the TH ones (with Haskell source inside) colour/link their contents too! Haven't yet begun to check for possible performance problems. * Support CPP and top-level pragmas The support for these is hackier - but no more hacky than the existing support. * Tests pass, CPP is better recognized The tests were in some cases altered: I consider the new output to be more correct than the old one.... * Fix shrinking of source without tabs in test * Replace 'Position'/'Span' with GHC counterparts Replaces 'Position' -> 'GHC.RealSrcLoc' and 'Span' -> 'GHC.RealSrcSpan'. * Nits * Forgot entry in .cabal * Update changelog
(cherry picked from commit 4f75be9)
Lead-up discussion is here. Moving my concerns here:
CPP is not handled in the GHC lexer, yet Haddock will have to handle it. Currently, the solution is to recover from all lexical errors by turning problematic lines into comments (and then restarting the lexer from after that line). That seems like an OK failure mode to me, but I think it should be discussed...
Language pragmas (actually all top-level pragmas) are not lexed as pragmas. This (probably) requires a GHC change. In the meantime, language pragmas would look like regular comments.
GHC's handling of top-level pragmas is (IMHO) a huge lexer-level hack to try to avoid loading the whole file, and we are feeling that here. (If bytestring ever becomes a stage 0 package, Data.ByteString.Lazy is what that really needs - not a hand-coded list-based parser.)
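To make the first concern concrete, the recovery step of turning a problematic line into a comment before re-lexing could be sketched as below. `neutraliseLine` is a hypothetical helper, not the PR's code, and a real implementation would also have to keep later source positions consistent:

```haskell
-- Replace line n (1-indexed) with a commented-out copy, leaving the
-- rest of the source untouched, so the lexer can be restarted on it.
neutraliseLine :: Int -> String -> String
neutraliseLine n src =
  unlines [ if i == n then "-- " ++ line else line
          | (i, line) <- zip [1 ..] (lines src) ]
```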