Migration to text 2.0 #391
Comments
The rope is critical to performance. But perhaps @ollef could update the package to use text-2.*. |
Where did you see that the LSP spec calls for UTF-16? This is what the 3.16 spec says:
|
From this section of the latest spec: https://microsoft.github.io/language-server-protocol/specifications/specification-3-17/#textDocuments
|
There is also microsoft/language-server-protocol#376 (comment) |
rust-analyzer supports an offset encoding setting to select between UTF-8 and UTF-16 offsets, which I presume is an extension at the moment, as per my prior link. |
Another slightly orthogonal problem is that GHC reported |
It should be possible to update rope-utf16-splay to use text-2.0 while keeping the same UTF-16 code unit based interface for indexing. This will add an additional O(n) overhead to construct the rope, as getting the length in UTF-16 code units is no longer O(1) as it was before (it now has to iterate through the characters in the text to count how many UTF-16 code units there would have been in the text), and an additional O( It should also be possible to add support for indexing in other bases (code points or UTF-8 code units, whatever GHC uses for its |
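To illustrate why the UTF-16 length is no longer free once the text is stored as UTF-8, here is a minimal sketch of the counting step described above. It is not the code from rope-utf16-splay; the names `utf16Units` and `utf16Length` are made up for illustration, and it only assumes the public `Data.Text` API.

```haskell
import Data.Char (ord)
import qualified Data.Text as T

-- | Number of UTF-16 code units needed for one character:
-- 1 inside the Basic Multilingual Plane, 2 (a surrogate pair) above it.
-- (Hypothetical helper, not part of any library.)
utf16Units :: Char -> Int
utf16Units c
  | ord c >= 0x10000 = 2
  | otherwise = 1

-- | UTF-16 length of a Text, found by visiting every character.
-- This walk is the O(n) cost paid when the text is not stored as UTF-16.
utf16Length :: T.Text -> Int
utf16Length = T.foldl' (\acc c -> acc + utf16Units c) 0
```

A rope would presumably run something like this once per chunk at construction time and cache the result, so that later indexing does not repeat the walk.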
@ollef The offsets are only ever used within a given line. Could that fact be used to control the splitting perhaps? I.e. always split on a line boundary, so that the offset is local to a line, and we only care when it is actually used. |
Splitting on line boundaries would be problematic for files with extremely long lines. But ah! We can detect cases when we can use UTF-8 code unit indexing to do splitting with no additional overhead --- that's when a chunk's UTF-16 length is the same as the UTF-8 length in code units, i.e. the chunk is ASCII only. |
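The ASCII check works because every non-ASCII character takes strictly more UTF-8 code units than UTF-16 code units (2 vs 1, 3 vs 1, or 4 vs 2), so the two lengths agree only when every character is ASCII. A rough sketch of that idea follows. The `Chunk` type and all function names are hypothetical, not the actual rope-utf16-splay implementation, and the UTF-8 length is computed portably by encoding rather than by reading it from the text's internals.

```haskell
import Data.Char (ord)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical chunk: the text plus its UTF-16 length, cached at construction.
data Chunk = Chunk { chunkText :: T.Text, chunkUtf16Len :: Int }

-- | UTF-16 code units for one character (2 for a surrogate pair).
units16 :: Char -> Int
units16 c = if ord c >= 0x10000 then 2 else 1

mkChunk :: T.Text -> Chunk
mkChunk t = Chunk t (T.foldl' (\n c -> n + units16 c) 0 t)

-- | UTF-8 length in code units (bytes). Encoding is used here only to
-- stay within the public API; a real rope would avoid this extra pass.
utf8Len :: T.Text -> Int
utf8Len = B.length . TE.encodeUtf8

-- | The two lengths coincide exactly when the chunk is ASCII only.
isAsciiChunk :: Chunk -> Bool
isAsciiChunk c = chunkUtf16Len c == utf8Len (chunkText c)

-- | Split at a UTF-16 code unit offset. For ASCII chunks the offset is
-- used directly; otherwise characters are walked to translate it.
-- An offset inside a surrogate pair splits before that character.
splitAtUtf16 :: Int -> Chunk -> (T.Text, T.Text)
splitAtUtf16 n c
  | isAsciiChunk c = T.splitAt n t
  | otherwise = T.splitAt (go 0 0 (T.unpack t)) t
  where
    t = chunkText c
    -- Count how many whole characters fit within n UTF-16 code units.
    go :: Int -> Int -> String -> Int
    go chars units (x:xs)
      | units + units16 x <= n = go (chars + 1) (units + units16 x) xs
    go chars _ _ = chars
```

In this sketch the fast path still goes through `T.splitAt`, which counts characters; the point is only that on an ASCII chunk a UTF-16 offset, a code point offset and a UTF-8 offset are the same number, so no translation is needed.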
rope-utf16-splay-0.4.0.0 supports text-2.0 with the same interface as before. I'll leave it to the maintainers of this library to decide on how to proceed. Let me know if you run into any issues. |
@ollef awesome, great job! Any chance you have sources for https://github.com/ollef/rope-utf16-splay/blob/master/bench.html somewhere? |
I believe this is what I was using back then: https://github.com/ollef/rope-bench/. A bit bitrotted, but maybe the benchmarks can be adapted if nothing else (https://github.com/ollef/rope-bench/blob/main/bench/Bench.hs). |
That's weird. I'd expect GHC to use Unicode code points, not UTF-8 (or UTF-16) code units?.. |
lsp uses rope-utf16-splay to represent buffers (lsp/lsp-types/lsp-types.cabal, line 91 in 1468d13).
This is not a good fit since text-2.* uses a UTF-8 encoding. We can either create an equivalent data structure for UTF-8, or drop it and work with Text directly. @alanz do you know how critical the rope data structure is for performance?