
Migration to text 2.0 #391


Closed
pepeiborra opened this issue Jan 13, 2022 · 15 comments · Fixed by #392

Comments

@pepeiborra
Collaborator

lsp uses rope-utf16-splay to represent buffers:

, rope-utf16-splay >= 0.3.1.0

This is not a good fit, since text-2.* uses a UTF-8 encoding internally. We can either create an equivalent data structure for UTF-8, or drop it and work with Text directly. @alanz, do you know how critical the rope data structure is for performance?
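
For context, the mismatch is roughly this: LSP positions are UTF-16 code-unit offsets, while text-2.* stores UTF-8, so turning an LSP column into a character index means walking the line. A minimal sketch of that conversion, with a made-up helper name (not part of the lsp API):

```haskell
import qualified Data.Text as T

-- Map an LSP UTF-16 code-unit offset within a single line to a character
-- index into that line. Characters outside the BMP occupy two UTF-16 code
-- units, everything else occupies one.
utf16ColToCharIndex :: Int -> T.Text -> Int
utf16ColToCharIndex col line = go 0 0 (T.unpack line)
  where
    go chars units cs
      | units >= col = chars
      | otherwise = case cs of
          []      -> chars
          c : cs' -> go (chars + 1) (units + if fromEnum c >= 0x10000 then 2 else 1) cs'
```

With text-1.* the Text payload itself was an array of UTF-16 code units, so offsets like this were direct indices; with text-2.* they have to be recomputed by scanning or cached alongside the buffer.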

@alanz
Collaborator

alanz commented Jan 13, 2022

The rope is critical to performance. But perhaps @ollef could update the package to use text-2.*.
One issue is that the LSP spec originally called for UTF-16 encoding and has only recently allowed a choice.
So the library should always support UTF-16 for backwards compatibility, and may also support UTF-8.

@pepeiborra
Collaborator Author

pepeiborra commented Jan 14, 2022

Where did you see that the LSP spec calls for UTF-16? This is what the 3.16 spec says:

Contains the actual content of the message. The content part of a message uses JSON-RPC to describe requests, responses and notifications. The content part is encoded using the charset provided in the Content-Type field. It defaults to utf-8, which is the only encoding supported right now. If a server or client receives a header with a different encoding than utf-8 it should respond with an error.

(Prior versions of the protocol used the string constant utf8 which is not a correct encoding constant according to specification.) For backwards compatibility it is highly recommended that a client and a server treats the string utf8 as utf-8.

@wz1000
Collaborator

wz1000 commented Jan 14, 2022

From this section of the latest spec: https://microsoft.github.io/language-server-protocol/specifications/specification-3-17/#textDocuments

The current protocol is tailored for textual documents whose content can be represented as a string. There is currently no support for binary documents. A position inside a document (see Position definition below) is expressed as a zero-based line and character offset. The offsets are based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16. To ensure that both client and server split the string into the same line representation the protocol specifies the following end-of-line sequences: ‘\n’, ‘\r\n’ and ‘\r’.

@alanz
Collaborator

alanz commented Jan 14, 2022

@alanz
Collaborator

alanz commented Jan 14, 2022

rust-analyzer supports the use of an offset encoding to select between UTF-8 and UTF-16 offset encodings, which I presume is an extension at the moment, as per my prior link.

@wz1000
Collaborator

wz1000 commented Jan 14, 2022

Another slightly orthogonal problem is that GHC reported SrcSpans are always with respect to UTF-8 (I think), and we make no effort to do any conversions.

@ollef
Contributor

ollef commented Jan 14, 2022

It should be possible to update rope-utf16-splay to use text-2.0 while keeping the same UTF-16 code unit based interface for indexing.

This will add an O(n) overhead to constructing the rope, since getting the length in UTF-16 code units is no longer O(1) as it was before: it now has to iterate through the characters of the text to count how many UTF-16 code units they would occupy. Splitting operations incur an additional O(chunkLength) overhead for the same reason. But I think this will be acceptable for the LSP use case. It's possible that we will want to decrease the chunkLength (currently 1000 UTF-16 code units) to find a new sweet spot for splitting performance.
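
For concreteness, the O(n) length computation described above amounts to something like the following sketch (not the actual rope-utf16-splay code):

```haskell
import qualified Data.Text as T

-- Number of UTF-16 code units a Text value would occupy. With text-2.0 the
-- payload is UTF-8, so this is a full traversal: one code unit per BMP
-- character, two for anything that needs a surrogate pair.
utf16Length :: T.Text -> Int
utf16Length = T.foldl' (\n c -> n + if fromEnum c >= 0x10000 then 2 else 1) 0
```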

It should also be possible to add support for indexing in other bases (code points or UTF-8 code units, whatever GHC uses for its SrcSpans).

@alanz
Collaborator

alanz commented Jan 14, 2022

@ollef The offsets are only ever used within a given line. Could that fact be used to control the splitting perhaps? I.e. always split on a line boundary; then the offset is local to a line, and we only care about it when it is actually used.
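
To illustrate the idea (using a plain list of lines as a stand-in for the rope, and the illustrative utf16ColToCharIndex helper sketched earlier): if chunks never cross line boundaries, resolving an LSP position only ever scans a single line.

```haskell
import qualified Data.Text as T

-- Stand-in for a line-boundary-split buffer: a list of lines. Resolving an
-- LSP (line, UTF-16 column) position only needs to scan the one line involved.
-- 'utf16ColToCharIndex' is the illustrative helper sketched earlier in the thread.
charIndexAt :: Int -> Int -> [T.Text] -> Maybe Int
charIndexAt line col ls = case drop line ls of
  l : _ -> Just (utf16ColToCharIndex col l)
  []    -> Nothing
```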

@ollef
Contributor

ollef commented Jan 14, 2022

Splitting on line boundaries would be problematic for files with extremely long lines.

But ah! We can detect cases where UTF-8 code unit indexing can be used for splitting with no additional overhead: that's when a chunk's UTF-16 length is the same as its UTF-8 length in code units, i.e. the chunk is ASCII-only.
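
That check is cheap to express. Equality of the two lengths implies ASCII-only because every non-ASCII character needs strictly more UTF-8 code units than UTF-16 code units. A sketch, assuming the chunk is held as a plain Text (the helper name is illustrative):

```haskell
import Data.Char (isAscii)
import qualified Data.Text as T

-- An ASCII-only chunk encodes every character as exactly one UTF-8 byte and
-- one UTF-16 code unit, so the two kinds of offsets coincide and the chunk
-- can be split at a UTF-8 position without any conversion.
utf16OffsetsMatchUtf8 :: T.Text -> Bool
utf16OffsetsMatchUtf8 = T.all isAscii
```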

@pepeiborra
Collaborator Author

@Bodigrim has already contributed a solution in #392
Suggest we focus on reviewing that one!

@michaelpj
Collaborator

Perhaps @ollef and @Bodigrim can collude on the ideal solution?

@ollef
Contributor

ollef commented Jan 14, 2022

rope-utf16-splay-0.4.0.0 supports text-2.0 with the same interface as before. I'll leave it to the maintainers of this library to decide on how to proceed. Let me know if you run into any issues.

@Bodigrim
Contributor

@ollef awesome, great job! Any chance you have sources for https://github.com/ollef/rope-utf16-splay/blob/master/bench.html somewhere?

@ollef
Contributor

ollef commented Jan 15, 2022

I believe this is what I was using back then: https://github.com/ollef/rope-bench/. A bit bitrotted, but maybe the benchmarks can be adapted if nothing else (https://github.com/ollef/rope-bench/blob/main/bench/Bench.hs).

@Bodigrim
Contributor

Another slightly orthogonal problem is that GHC reported SrcSpans are always with respect to UTF-8 (I think), and we make no effort to do any conversions.

That's weird. I'd expect GHC to use Unicode code points, not UTF-8 (or UTF-16) code units?..
