Skip to content

Unicode symbols >= 𐀀 are broken #2646

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Bodigrim opened this issue Jan 28, 2022 · 17 comments
Open

Unicode symbols >= 𐀀 are broken #2646

Bodigrim opened this issue Jan 28, 2022 · 17 comments
Labels
component: ghcide component: lsp type: bug Something isn't right: doesn't work as intended, documentation is missing/outdated, etc..

Comments

@Bodigrim
Copy link
Contributor

Your environment

Which OS do you use: MacOS
Which LSP client (editor/plugin) do you use: Sublime Code

Steps to reproduce

aaa :: Word 
aaa = 10

𐀀𐀀𐀀 :: Word 
𐀀𐀀𐀀 = 10 

Expected behaviour

Both aaa and 𐀀𐀀𐀀 should be underlined as unused bindings.

Actual behaviour

aaa is underlined correctly, but only the first character of 𐀀𐀀𐀀 is underlined.

изображение

That's because HLS does not distinguish positions returned from GHC (which are in code points = characters) from positions mandated by LSP (which are UTF-16 code units). Basically, GHC says us that 3 first code points in line 5 are an unused binding. Each 𐀀 is a single character, but 2 UTF-16 code units, so HLS should ask LSP to underline first 6 code units. Instead of this HLS asks LSP to underline only 3, and since the 3rd one is in the middle of the 2nd character, only the 1st character gets underlined.

CC @michaelpj @alanz, this is related to haskell/lsp#392 (comment).

@Bodigrim Bodigrim added type: bug Something isn't right: doesn't work as intended, documentation is missing/outdated, etc.. status: needs triage labels Jan 28, 2022
@Bodigrim
Copy link
Contributor Author

Code actions are also broken. E. g., choosing "Delete '𐀀𐀀𐀀'" results in the following program

aaa :: Word 
aaa = 10 10

That's again because HLS meant to ask LSP to delete 8 characters, but asked to delete 8 code points, 6 of which were 𐀀𐀀𐀀.

@michaelpj
Copy link
Collaborator

Thanks for the reproduction. To be clear, did you mean that this is related to the issue with splitting in the middle of code units, or is it this one? haskell/lsp#392 (comment)

That is, do you think this will be fixed by the changes you made to lsp, or is there still something broken?

@jneira
Copy link
Member

jneira commented Jan 28, 2022

In any case a regression test (or several of them) here exercising unicode symbols would be great

@Bodigrim
Copy link
Contributor Author

@michaelpj This is unlikely to be fixed in recent lsp patches. Essentially there must be a place, responsible for (fast) conversion between positions in code points and positions in code units. I think I can provide such low-level routine from text-rope and then someone will have to migrate ghcide to it.

@jneira I’m sorry, I’m unlikely to have capacity to add a regression test. Could someone else possibly please pick it up?

@michaelpj
Copy link
Collaborator

I'll look into it (inc tests), I just thought you might already have a hunch as to where the problem lies.

@michaelpj
Copy link
Collaborator

Looking at this more, I think this is utterly broken throughout the codebase. We frequently change from GHC SrcSpans to Positions, blindly assuming that the numbers can just be translated between them, which is very wrong. And we use the numbers from Position with functions from Text, which is also wrong.

If I understand correctly, you can't even convert from a postion-in-code-points to a position-in-code-units without having the whole text in question to hand, which seems quite annoying. In that case it probably would be useful to have such a function in text-rope and we'll just have to painfully audit HLS for direct use of SrcSpans.

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

@michaelpj
Copy link
Collaborator

Not really sure what the best way to tackle this systematically is. We could have a CodeUnit newtype in lsp that we use for the column offset, that might force us to look at all the use sites. I guess I can't argue that the Text methods should take a CodePoint newtype instead of Int :)

@michaelpj
Copy link
Collaborator

Vague plan:

  • Since we need the text, let's ask the VFS to do the conversion for us
  • Add CodePointPosition to the VFS, which is a mirror of Position with code points for the column offset. Perhaps also CodePointRange, CodePointLocation
  • Add a function of type VFS -> URI -> CodePointPosition -> Position, etc.

@Bodigrim
Copy link
Contributor Author

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

Already there: it provides both character-based API (Data.Text.Lines / Data.Text.Rope) and UTF16-based API (Data.Text.Utf16.Lines / Data.Text.Utf16.Rope). I just need to provide conversions.

@Bodigrim
Copy link
Contributor Author

I realised that text-rope already provides a way to convert between positions: one can Data.Text.Lines.splitAtPosition with character-based counter and then call Data.Text.Utf16.Lines.lengthAsPosition on the prefix to receive UTF-16-based counter. I might provide a dedicated helper in the future, but this would work for a proof of concept.

Looking on ghcide, I think it can really benefit from using Data.Text.Lines from text-rope instead of a plain Text, because this will improve ubiquitous line splitting.

So I'd suggest to wait for an lsp release with text-rope-based VFS, then take another look at this issue.

@michaelpj
Copy link
Collaborator

Yes, definitely not planning to do anything before then. Thanks for letting me know how to do the conversion!

michaelpj added a commit to michaelpj/lsp that referenced this issue Feb 19, 2022
LSP `Position`s use UTF-16 code units for offsets within lines; most
other sane tools (like GHC) use Unicode code points. We need to use the
right one in the right place, otherwise we get issues like
haskell/haskell-language-server#2646.

This is pretty unpleasant, since code points are variable-size, so you
can't do the conversion without having the file text itself.

This PR provides a type for positions using code points (for clients to
use to help them be less confused) and functions for using the VFS to
convert between those and LSP positions.
michaelpj added a commit to michaelpj/lsp that referenced this issue Feb 19, 2022
LSP `Position`s use UTF-16 code units for offsets within lines; most
other sane tools (like GHC) use Unicode code points. We need to use the
right one in the right place, otherwise we get issues like
haskell/haskell-language-server#2646.

This is pretty unpleasant, since code points are variable-size, so you
can't do the conversion without having the file text itself.

This PR provides a type for positions using code points (for clients to
use to help them be less confused) and functions for using the VFS to
convert between those and LSP positions.
michaelpj added a commit to michaelpj/lsp that referenced this issue Feb 20, 2022
LSP `Position`s use UTF-16 code units for offsets within lines; most
other sane tools (like GHC) use Unicode code points. We need to use the
right one in the right place, otherwise we get issues like
haskell/haskell-language-server#2646.

This is pretty unpleasant, since code points are variable-size, so you
can't do the conversion without having the file text itself.

This PR provides a type for positions using code points (for clients to
use to help them be less confused) and functions for using the VFS to
convert between those and LSP positions.
@Ptival
Copy link

Ptival commented Jul 29, 2024

Maybe the example in this issue makes this look too benign, but this bug makes Emacs w/ Haskell LSP completely unusable on projects that use Unicode. For instance with this file:

{-# LANGUAGE UnicodeSyntax #-}
module Mini where
type 𝐿 a = [a]

Once you start editing the last line (e.g. try to replace 𝐿 with a normal L and then back), the LSP backend and what's in your buffer start becoming completely out of sync, and you get error messages such as parse errors or text that is no longer present in your buffer.

@michaelpj
Copy link
Collaborator

Yeah, I would expect "unusable" to be about right.

I've been surprised that there hasn't been more complaint on this issue, and I've guessed that means that Unicode just isn't that popular with Haskell developers.

That said, I don't think this would be too hard to fix, just tedious. So if someone is keen to see it done then that could be great!

@Ptival
Copy link

Ptival commented Aug 2, 2024

I could be bothered enough to attempt a fix over the weekend, though I'm not quite sure where to begin.

Do you think the solution would look more or less like chasing the points at which positions are being exchanged between haskell-language-server and lsp, and figure out the correct invocation of your codePointPositionToPosition and positionToCodePointPosition functions, and its possible consequences on the nearby code?

@michaelpj
Copy link
Collaborator

Yes, that's pretty much it. I would probably use hlint to help: add an error rule for the use of the Position constructor directly, make a whitelist, and then start whittling it down. That way we can also hopefully stop it getting worse 😅

@Ptival
Copy link

Ptival commented Aug 4, 2024

Almost immediately running into an odd situation. Consider:

realSrcLocToPosition :: RealSrcLoc -> Position
realSrcLocToPosition real =
Position (fromIntegral $ srcLocLine real - 1) (fromIntegral $ srcLocCol real - 1)

My gut feeling is this is the place that is ripe for column confusion: supposedly, the srcLocCol is 1-based and code point-based, while Position is 0-based and code unit-based. However, this is nowhere near a VirtualFile.

Chasing up the call chain, I often end up at a StringBuffer as the thing looking the most like the contents of a file, e.g.:

parsePragmasIntoHscEnv env fp contents = catchSrcErrors dflags0 "pragmas" $ do
let (_warns,opts) = getOptions (initParserOpts dflags0) contents fp

where the pragmas in a file are being checked and catchSrcErrors eventually calls realSrcLocToPosition, or:

typecheckModule (IdeDefer defer) hsc tc_helpers pm = do
let modSummary = pm_mod_summary pm
dflags = ms_hspp_opts modSummary
initialized <- catchSrcErrors (hsc_dflags hsc) "typecheck (initialize plugins)"
(Loader.initializePlugins (hscSetFlags (ms_hspp_opts modSummary) hsc))
case initialized of
Left errs -> return (errs, Nothing)
Right hscEnv -> do
(warnings, etcm) <- withWarnings sourceTypecheck $ \tweak ->
let
session = tweak (hscSetFlags dflags hscEnv)
-- TODO: maybe settings ms_hspp_opts is unnecessary?
mod_summary'' = modSummary { ms_hspp_opts = hsc_dflags session}
in
catchSrcErrors (hsc_dflags hscEnv) sourceTypecheck $ do
tcRnModule session tc_helpers $ demoteIfDefer pm{pm_mod_summary = mod_summary''}

where maybe the module contents are available in the ms_hspp_buf field of the ParsedModule...

So:

  1. Should I try to grab and pass around these StringBuffers all the way down? Or is this a recipe for disaster?
  2. If so, should this call for versions of your helper converter functions that just take in the contents, without the extra LSP-related stuff in a VirtualFile?
  3. And finally, these are StringBuffers, not Ropes. What would be the expectation here?

Overall, I get the feeling that the parts of the GHCIDE code I'm looking at are fairly LSP agnostic, so reaching the gap to the VirtualFile abstraction feels odd.

Any thoughts?

@Bodigrim
Copy link
Contributor Author

More trail of destruction: https://gitlab.haskell.org/ghc/ghc/-/issues/25396

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: ghcide component: lsp type: bug Something isn't right: doesn't work as intended, documentation is missing/outdated, etc..
Projects
None yet
Development

No branches or pull requests

6 participants