This repository was archived by the owner on Aug 3, 2024. It is now read-only.

Use the GHC lexer for the Hyperlinker backend #714

Merged
merged 21 commits into from
Dec 10, 2017


harpocrates
Collaborator

Lead up discussion is here. Moving my concerns here:

  • CPP is not handled in the GHC lexer, yet Haddock will have to handle it. Currently, the solution is to recover from all lexical errors by turning problematic lines into comments (and then restarting the lexer from after that line). That seems like an OK failure mode to me, but I think it should be discussed...

  • Language pragmas (actually all top-level pragmas) are not lexed as pragmas. This (probably) requires a GHC change. In the meantime, language pragmas would look like regular comments.

    GHC's handling of top-level pragmas is (IMHO) a huge lexer-level hack to try to avoid loading the whole file, and we are feeling that here. (If bytestring ever becomes a stage 0 package, Data.ByteString.Lazy is what that really needs - not a hand-coded list-based parser.)

mpickering and others added 15 commits December 5, 2017 00:58
This reverts commit b605510.

Conflicts:
	haddock-api/haddock-api.cabal
	haddock-api/src/Haddock/Interface.hs
Things now run using the GHC lexer. There are still

  - stray debug statements
  - unnecessary changes w.r.t. master
Things are looking good. quasiquotes in particular look beautiful: the
TH ones (with Haskell source inside) colour/link their contents too!

Haven't yet begun to check for possible performance problems.
The support for these is hackier - but no more hacky than the existing
support.
@harpocrates
Collaborator Author

harpocrates commented Dec 6, 2017

Alright. I think I've actually found an acceptable workaround to my two points from the previous comment. In a nutshell, we will do no better than we did before for parsing CPP and pragmas.

  • we still identify CPP based on the suboptimal criteria that the line starts with #
  • we still assume anything of the form {-# ... #-} is a pragma

These both sound like acceptable workarounds. We could probably do better for the second point, but it would require some non-trivial GHC lexer hacking - which is probably not worth it.
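The two heuristics above can be sketched as a per-line classifier. This is only an illustration under stated assumptions, not the code in this PR: `LineKind`, `trim` and `classifyLine` are hypothetical names, and the sketch only recognises pragmas that occupy a single line.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, isPrefixOf, isSuffixOf)

-- | Hypothetical classification of a source line (not Haddock's own type).
data LineKind = CppLine | PragmaLine | OtherLine
  deriving (Eq, Show)

-- | Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Apply the two heuristics from the comment above:
--   a line starting with '#' is treated as CPP, and anything of the
--   form {-# ... #-} is treated as a pragma.
classifyLine :: String -> LineKind
classifyLine line
  | "#" `isPrefixOf` line                        = CppLine
  | "{-#" `isPrefixOf` t && "#-}" `isSuffixOf` t = PragmaLine
  | otherwise                                    = OtherLine
  where
    t = trim line
```

Both checks are purely textual, which is why they are suboptimal: a line of ordinary Haskell that happens to begin with `#` would be classified as CPP.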

TODO before this is mergeable:

  • test performance
  • update tests

@alexbiehl
Member

alexbiehl commented Dec 6, 2017 via email

The tests were in some cases altered: I consider the new output to be more
correct than the old one....
Replaces 'Position' -> 'GHC.RealSrcLoc' and 'Span' -> 'GHC.RealSrcSpan'.
@harpocrates
Collaborator Author

I think this is ready to merge. Sample of changes (before and after):

[Two screenshots, taken 2017-12-07: hyperlinked source output before and after the change]

Here are the only caveats I can think of:

  • Less flexibility around failure. If the lexer fails on some input, the whole chunk of Haskell source (bounded by the nearest CPP lines) will be marked as "unknown". We could conceivably just fall back onto the lexer we had before, but that isn't currently implemented. I'd like to witness some failing chunk of valid Haskell before bothering to do this.
  • CPP is a hack. In a nutshell, the file gets split up into CPP chunks and non-CPP chunks. The claim is that the non-CPP chunks can be lexed just fine on their own. I'm not sure this claim always holds. Besides, our heuristics for separating CPP from non-CPP could be buggy.
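To make the chunking idea concrete, here is a minimal sketch of splitting a file into alternating CPP and non-CPP chunks. It assumes the "line starts with #" heuristic mentioned earlier in this thread and ignores backslash line continuations; `Chunk`, `isCpp` and `chunkSource` are hypothetical names, not the ones used in this PR.

```haskell
import Data.List (isPrefixOf)

-- | Left  = a run of consecutive CPP lines,
--   Right = a run of consecutive ordinary Haskell lines.
type Chunk = Either String String

-- | Heuristic from the discussion: a line is CPP if it starts with '#'.
isCpp :: String -> Bool
isCpp = ("#" `isPrefixOf`)

-- | Split source text into maximal alternating CPP / non-CPP chunks.
--   Each chunk keeps its line terminators, so concatenating the chunks
--   reproduces the input (assuming it ends with a newline).
chunkSource :: String -> [Chunk]
chunkSource = go . lines
  where
    go [] = []
    go ls@(l : _)
      | isCpp l   = let (cpp, rest) = span isCpp ls
                    in Left (unlines cpp) : go rest
      | otherwise = let (hs, rest) = break isCpp ls
                    in Right (unlines hs) : go rest
```

Each `Right` chunk is then handed to the lexer on its own, which is exactly where the second caveat bites: a construct that spans a CPP line (a block comment, for example) ends up cut in two.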

Please do try this out before merging - I'd like more confidence about performance in particular. 😄

Thanks!

@harpocrates harpocrates changed the title WIP: Use the GHC lexer for the Hyperlinker backend Use the GHC lexer for the Hyperlinker backend Dec 7, 2017
@alexbiehl
Member

Exciting stuff! Will try this at home!

Boy, the next haddock release is packed with new features and fixes. Awesome!

Member

@alexbiehl alexbiehl left a comment


Almost there! Just one more nit.

hSrc = concat [ hLine | Right hLine <- hLinesRight ]
cppSrc = concat [ cppLine | Left cppLine <- cppLinesLeft ]

in case L.lexTokenStream (stringToStringBuffer hSrc) pos dflags of
Member

@alexbiehl alexbiehl Dec 10, 2017


IIUC, for each chunk (that is, the code between two CPP markers) we call the lexer, and for each chunk we call stringToStringBuffer. If the lexer fails, we turn the offending part into a comment, create a new StringBuffer for the rest, and call the lexer again. Calling stringToStringBuffer is pretty expensive, and doing it over and over this way seems quadratic.

Collaborator Author


I don't think that is happening. If the lexer fails, that chunk gets marked as one big "unknown" token and we move on (see the PFailed case - we just shove hSrc into one big token).
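A rough sketch of the behaviour described in this reply, with a toy lexer standing in for GHC's: every chunk is lexed exactly once, and a failure turns the whole chunk into one "unknown" token with no retry. `Token`, `TokenKind`, `toyLexer`, `lexChunk` and `lexChunks` are hypothetical names for illustration; the toy failure condition (an odd number of double quotes) is an assumption and has nothing to do with how GHC's lexer actually fails.

```haskell
import Data.Char (isSpace)

-- | Hypothetical token type; the real one carries spans and finer kinds.
data Token = Token { tkKind :: TokenKind, tkText :: String }
  deriving (Eq, Show)

data TokenKind = TkCpp | TkWord | TkSpace | TkUnknown
  deriving (Eq, Show)

-- | Stand-in for the real lexer. It splits input into runs of whitespace
--   and non-whitespace, and "fails" (Nothing) when the chunk contains an
--   odd number of double quotes, mimicking an unterminated string.
toyLexer :: String -> Maybe [Token]
toyLexer src
  | odd (length (filter (== '"') src)) = Nothing
  | otherwise                          = Just (go src)
  where
    go [] = []
    go s@(c : _)
      | isSpace c = let (ws, rest) = span isSpace s
                    in Token TkSpace ws : go rest
      | otherwise = let (w, rest) = break isSpace s
                    in Token TkWord w : go rest

-- | Lex one chunk. CPP chunks become a single CPP token. Haskell chunks
--   are lexed once; on failure the whole chunk becomes one unknown token.
lexChunk :: Either String String -> [Token]
lexChunk (Left cpp) = [Token TkCpp cpp]
lexChunk (Right hs) = case toyLexer hs of
  Just toks -> toks
  Nothing   -> [Token TkUnknown hs]  -- no retry on the remainder

lexChunks :: [Either String String] -> [Token]
lexChunks = concatMap lexChunk
```

Because nothing is re-lexed after a failure, each character of the source is handed to the lexer at most once in this sketch, so the total work stays linear in the file size.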

@alexbiehl alexbiehl merged commit 3a34ce5 into haskell:master Dec 10, 2017
@alexbiehl
Member

Right! The result looks really good!

@harpocrates
Collaborator Author

Thanks for reviewing - I really appreciate how responsive you've been!

@harpocrates harpocrates deleted the ghc-lexer branch December 11, 2017 00:25
alexbiehl pushed a commit to alexbiehl/haddock that referenced this pull request Feb 1, 2018
* Start changing to use GHC lexer

* better cpp

* Change SrcSpan to RealSrcSpan

* Remove error

* Try to stop too many open files

* wip

* wip

* Revert "wip"

This reverts commit b605510.

Conflicts:
	haddock-api/haddock-api.cabal
	haddock-api/src/Haddock/Interface.hs

* Remove pointless 'caching'

* Use dlist rather than lists when finding vars

* Use a map rather than list

* Delete bogus comment

* Rebase followup

Things now run using the GHC lexer. There are still

  - stray debug statements
  - unnecessary changes w.r.t. master

* Cleaned up differences w.r.t. current Haddock HEAD

Things are looking good. quasiquotes in particular look beautiful: the
TH ones (with Haskell source inside) colour/link their contents too!

Haven't yet begun to check for possible performance problems.

* Support CPP and top-level pragmas

The support for these is hackier - but no more hacky than the existing
support.

* Tests pass, CPP is better recognized

The tests were in some cases altered: I consider the new output to be more
correct than the old one....

* Fix shrinking of source without tabs in test

* Replace 'Position'/'Span' with GHC counterparts

Replaces 'Position' -> 'GHC.RealSrcLoc' and 'Span' -> 'GHC.RealSrcSpan'.

* Nits

* Forgot entry in .cabal

* Update changelog
alexbiehl pushed a commit that referenced this pull request Feb 1, 2018
sjakobi pushed a commit to sjakobi/haddock that referenced this pull request Jun 10, 2018
(cherry picked from commit 4f75be9)