Unicode symbols >= 𐀀 are broken #2646
Comments
Code actions are also broken. E.g., choosing "Delete '𐀀𐀀𐀀'" results in a mangled program.

That's again because HLS meant to ask LSP to delete 8 characters, but effectively asked it to delete 8 UTF-16 code units, 6 of which were taken up by the three 𐀀 characters (two code units each).
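As a minimal illustration (not HLS code) of why the two counts diverge: any character at or above U+10000 occupies two UTF-16 code units (a surrogate pair), so character counts and code-unit counts drift apart.

```haskell
import Data.Char (ord)

-- Number of UTF-16 code units needed for one character:
-- 2 for characters outside the Basic Multilingual Plane, else 1.
utf16Units :: Char -> Int
utf16Units c = if ord c >= 0x10000 then 2 else 1

main :: IO ()
main = do
  let s = "\x10000\x10000\x10000"      -- "𐀀𐀀𐀀"
  print (length s)                     -- 3 code points
  print (sum (map utf16Units s))       -- 6 UTF-16 code units
```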
Thanks for the reproduction. To be clear, did you mean that this is related to the issue with splitting in the middle of code units, or is it this one? haskell/lsp#392 (comment) That is, do you think this will be fixed by the changes you made?
In any case, a regression test (or several of them) here exercising unicode symbols would be great.
@michaelpj This is unlikely to be fixed in recent releases, then.

@jneira I'm sorry, I'm unlikely to have capacity to add a regression test. Could someone else possibly please pick it up?
I'll look into it (including tests); I just thought you might already have a hunch as to where the problem lies.
Looking at this more, I think this is utterly broken throughout the codebase. We frequently change between GHC positions and LSP positions.

If I understand correctly, you can't even convert from a position-in-code-points to a position-in-code-units without having the whole text in question to hand, which seems quite annoying. In that case it would probably be useful to have such a function in the VFS.
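A sketch of what such a conversion looks like (the function name is mine, not from any release of `lsp`): the line's text is unavoidable input, because each character contributes a variable number of code units.

```haskell
import Data.Char (ord)

-- Hypothetical helper: translate a column measured in code points
-- into a column measured in UTF-16 code units, given the line's text.
codePointColToCodeUnitCol :: String -> Int -> Int
codePointColToCodeUnitCol lineText col =
  sum [ if ord c >= 0x10000 then 2 else 1 | c <- take col lineText ]

main :: IO ()
main =
  -- On a line starting with three astral characters, code-point
  -- column 3 corresponds to code-unit column 6.
  print (codePointColToCodeUnitCol "\x10000\x10000\x10000 = ()" 3)
```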
Not really sure what the best way to tackle this systematically is. We could have a distinct type for positions measured in code points, so the two kinds of position can't be mixed up silently.

Vague plan:
Already there: it provides both a character-based API and a UTF-16 code-unit-based API.
I realised that the conversion can already be done with the rope's code-unit API. So I'd suggest waiting for an upcoming release.
Yes, definitely not planning to do anything before then. Thanks for letting me know how to do the conversion!
LSP `Position`s use UTF-16 code units for offsets within lines; most other sane tools (like GHC) use Unicode code points. We need to use the right one in the right place, otherwise we get issues like haskell/haskell-language-server#2646. This is pretty unpleasant, since code points are variable-size, so you can't do the conversion without having the file text itself. This PR provides a type for positions using code points (for clients to use to help them be less confused) and functions for using the VFS to convert between those and LSP positions.
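The shape of the addition might look roughly like this (a sketch based only on the description above; the actual names and representation in `lsp` may differ):

```haskell
-- A position indexed by Unicode code points rather than UTF-16 code
-- units, letting clients keep the two notions apart in the type system.
data CodePointPosition = CodePointPosition
  { _line      :: Int  -- zero-based line number
  , _character :: Int  -- zero-based offset within the line, in code points
  } deriving (Eq, Ord, Show)

-- The VFS-backed conversions would then have signatures along the lines
-- of (VirtualFile and Position as in lsp; Maybe because a position can
-- be out of range for the file's actual contents):
--   codePointPositionToPosition :: VirtualFile -> CodePointPosition -> Maybe Position
--   positionToCodePointPosition :: VirtualFile -> Position -> Maybe CodePointPosition

main :: IO ()
main = print (CodePointPosition 0 3)
```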
Maybe the example in this issue makes this look too benign, but this bug makes Emacs with Haskell LSP completely unusable on projects that use Unicode. For instance, with this file:

```haskell
{-# LANGUAGE UnicodeSyntax #-}
module Mini where

type 𝐿 a = [a]
```

Once you start editing the last line, edits and diagnostics land at the wrong positions.
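An illustration (assuming nothing beyond the file above) of why that line is fragile: `𝐿` is U+1D43F, outside the Basic Multilingual Plane, so it is one code point but two UTF-16 code units, and every LSP position after it on the line is off by one unit.

```haskell
import Data.Char (ord)

-- Code-unit width of each character on the line "type 𝐿 a = [a]".
main :: IO ()
main = do
  let line   = "type \x1D43F a = [a]"
      widths = [ if ord c >= 0x10000 then 2 else 1 | c <- line ]
  -- The 'a' after 𝐿 sits at code-point column 7 (zero-based) but
  -- code-unit column 8, because 𝐿 is a surrogate pair in UTF-16.
  print (sum (take 7 widths))
```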
Yeah, I would expect "unusable" to be about right. I've been surprised that there haven't been more complaints on this issue, and I've guessed that means Unicode just isn't that popular with Haskell developers. That said, I don't think this would be too hard to fix, just tedious. So if someone is keen to see it done, that could be great!
I could be bothered enough to attempt a fix over the weekend, though I'm not quite sure where to begin. Do you think the solution would look more or less like chasing the points at which positions are exchanged between haskell-language-server and lsp, and figuring out the correct invocation of your conversion functions?
Yes, that's pretty much it. I would probably use the VFS-based conversion functions wherever the file contents are available.
Almost immediately I'm running into an odd situation. Consider `ghcide/src/Development/IDE/GHC/Error.hs`, lines 89 to 91 (at 9565d0b).

My gut feeling is that this is the place that is ripe for column confusion: supposedly, the GHC column (measured in code points) is passed through unchanged as an LSP column (measured in UTF-16 code units).

Chasing up the call chain, I often end up either at `ghcide/src/Development/IDE/Core/Preprocessor.hs`, lines 146 to 147 (at 9565d0b), where the pragmas in a file are being checked and `catchSrcErrors` eventually calls `realSrcLocToPosition`, or at `ghcide/src/Development/IDE/Core/Compile.hs`, lines 174 to 189 (at 9565d0b), where maybe the module contents are available in the `ms_hspp_buf` field of the `ParsedModule`...
Overall, I get the feeling that the parts of the ghcide code I'm looking at are fairly LSP-agnostic, so bridging the gap to the LSP position types is not straightforward. Any thoughts?
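For reference, the suspected conversion is essentially this (a simplified sketch, not the verbatim HLS source): GHC source locations are 1-based and count columns in code points, while LSP positions are 0-based and count columns in UTF-16 code units; subtracting 1 converts the base, but nothing converts the unit of measure, which is exactly the column confusion described above.

```haskell
-- Simplified sketch of realSrcLocToPosition: (line, column) pairs
-- stand in for GHC's RealSrcLoc and LSP's Position. The subtraction
-- fixes the 1-based/0-based mismatch but leaves code-point columns
-- masquerading as code-unit columns.
realSrcLocToPosition :: (Int, Int) -> (Int, Int)
realSrcLocToPosition (srcLine, srcCol) = (srcLine - 1, srcCol - 1)

main :: IO ()
main = print (realSrcLocToPosition (5, 1))
```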
More trail of destruction: https://gitlab.haskell.org/ghc/ghc/-/issues/25396
Your environment

Which OS do you use: MacOS
Which LSP client (editor/plugin) do you use: Sublime Code

Steps to reproduce

Expected behaviour

Both `aaa` and `𐀀𐀀𐀀` should be underlined as unused bindings.

Actual behaviour

`aaa` is underlined correctly, but only the first character of `𐀀𐀀𐀀` is underlined.

That's because HLS does not distinguish positions returned from GHC (which are in code points, i.e. characters) from positions mandated by LSP (which are in UTF-16 code units). Basically, GHC tells us that the first 3 code points in line 5 are an unused binding. Each `𐀀` is a single character but 2 UTF-16 code units, so HLS should ask LSP to underline the first 6 code units. Instead, HLS asks LSP to underline only 3, and since the 3rd one is in the middle of the 2nd character, only the 1st character gets underlined.

CC @michaelpj @alanz, this is related to haskell/lsp#392 (comment).