
Migration to text 2.0 #391


Closed
pepeiborra opened this issue Jan 13, 2022 · 15 comments · Fixed by #392

Comments

@pepeiborra
Collaborator

lsp uses rope-utf16-splay to represent buffers:

, rope-utf16-splay >= 0.3.1.0

This is not a good fit, since text-2.* uses a UTF-8 encoding internally. We can either create an equivalent data structure for UTF-8, or drop it and work with Text directly. @alanz, do you know how critical the rope data structure is for performance?
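
For context, the mismatch is roughly this: LSP positions are UTF-16 code-unit offsets, while text-2.* stores UTF-8, so turning an LSP column into a character index means walking the line. A minimal sketch of that conversion, with a made-up helper name (not part of the lsp API):

```haskell
import qualified Data.Text as T

-- Map an LSP UTF-16 code-unit offset within a single line to a character
-- index into that line. Characters outside the BMP occupy two UTF-16 code
-- units, everything else occupies one.
utf16ColToCharIndex :: Int -> T.Text -> Int
utf16ColToCharIndex col line = go 0 0 (T.unpack line)
  where
    go chars units cs
      | units >= col = chars
      | otherwise = case cs of
          []      -> chars
          c : cs' -> go (chars + 1) (units + if fromEnum c >= 0x10000 then 2 else 1) cs'
```

With text-1.* the Text payload itself was an array of UTF-16 code units, so offsets like this were direct indices; with text-2.* they have to be recomputed by scanning or cached alongside the buffer.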

@alanz
Collaborator

alanz commented Jan 13, 2022

The rope is critical to performance. But perhaps @ollef could update the package to use text-2.*.
One issue is that the LSP spec originally called for UTF-16 encoding and has only recently allowed a choice.
So the library should always support UTF-16 for backwards compatibility, and may also support UTF-8.

@pepeiborra
Collaborator Author

pepeiborra commented Jan 14, 2022

Where did you see that the LSP spec calls for UTF-16? This is what the 3.16 spec says:

Contains the actual content of the message. The content part of a message uses JSON-RPC to describe requests, responses and notifications. The content part is encoded using the charset provided in the Content-Type field. It defaults to utf-8, which is the only encoding supported right now. If a server or client receives a header with a different encoding than utf-8 it should respond with an error.

(Prior versions of the protocol used the string constant utf8 which is not a correct encoding constant according to specification.) For backwards compatibility it is highly recommended that a client and a server treats the string utf8 as utf-8.

@wz1000
Collaborator

wz1000 commented Jan 14, 2022

From this section of the latest spec: https://microsoft.github.io/language-server-protocol/specifications/specification-3-17/#textDocuments

The current protocol is tailored for textual documents whose content can be represented as a string. There is currently no support for binary documents. A position inside a document (see Position definition below) is expressed as a zero-based line and character offset. The offsets are based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16. To ensure that both client and server split the string into the same line representation the protocol specifies the following end-of-line sequences: ‘\n’, ‘\r\n’ and ‘\r’.

@alanz
Collaborator

alanz commented Jan 14, 2022

@alanz
Collaborator

alanz commented Jan 14, 2022

rust-analyzer supports the use of an offset encoding to select between UTF-8 and UTF-16 offset encodings, which I presume is an extension at the moment, as per my prior link.

@wz1000
Collaborator

wz1000 commented Jan 14, 2022

Another slightly orthogonal problem is that GHC reported SrcSpans are always with respect to UTF-8 (I think), and we make no effort to do any conversions.

@ollef
Contributor

ollef commented Jan 14, 2022

It should be possible to update rope-utf16-splay to use text-2.0 while keeping the same UTF-16 code unit based interface for indexing.

This will add an O(n) overhead to constructing the rope, since getting the length in UTF-16 code units is no longer O(1) as it was before: it now has to iterate through the characters of the text to count how many UTF-16 code units they would occupy. Splitting operations incur an additional O(chunkLength) overhead for the same reason. But I think this will be acceptable for the LSP use case. It's possible that we will want to decrease the chunkLength (currently 1000 UTF-16 code units) to find a new sweet spot for splitting performance.
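
For concreteness, the O(n) length computation described above amounts to something like the following sketch (not the actual rope-utf16-splay code):

```haskell
import qualified Data.Text as T

-- Number of UTF-16 code units a Text value would occupy. With text-2.0 the
-- payload is UTF-8, so this is a full traversal: one code unit per BMP
-- character, two for anything that needs a surrogate pair.
utf16Length :: T.Text -> Int
utf16Length = T.foldl' (\n c -> n + if fromEnum c >= 0x10000 then 2 else 1) 0
```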

It should also be possible to add support for indexing in other bases (code points or UTF-8 code units, whatever GHC uses for its SrcSpans).

@alanz
Collaborator

alanz commented Jan 14, 2022

@ollef The offsets are only ever used within a given line. Could that fact be used to control the splitting perhaps? I.e. always split on a line boundary; then the offset is local to a line, and we only care about it when it is actually used.
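
To illustrate the idea (using a plain list of lines as a stand-in for the rope, and the illustrative utf16ColToCharIndex helper sketched earlier): if chunks never cross line boundaries, resolving an LSP position only ever scans a single line.

```haskell
import qualified Data.Text as T

-- Stand-in for a line-boundary-split buffer: a list of lines. Resolving an
-- LSP (line, UTF-16 column) position only needs to scan the one line involved.
-- 'utf16ColToCharIndex' is the illustrative helper sketched earlier in the thread.
charIndexAt :: Int -> Int -> [T.Text] -> Maybe Int
charIndexAt line col ls = case drop line ls of
  l : _ -> Just (utf16ColToCharIndex col l)
  []    -> Nothing
```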

@ollef
Contributor

ollef commented Jan 14, 2022

Splitting on line boundaries would be problematic for files with extremely long lines.

But ah! We can detect cases where UTF-8 code unit indexing can be used for splitting with no additional overhead: that's when a chunk's UTF-16 length is the same as its UTF-8 length in code units, i.e. the chunk is ASCII-only.
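
That check is cheap to express. Equality of the two lengths implies ASCII-only because every non-ASCII character needs strictly more UTF-8 code units than UTF-16 code units. A sketch, assuming the chunk is held as a plain Text (the helper name is illustrative):

```haskell
import Data.Char (isAscii)
import qualified Data.Text as T

-- An ASCII-only chunk encodes every character as exactly one UTF-8 byte and
-- one UTF-16 code unit, so the two kinds of offsets coincide and the chunk
-- can be split at a UTF-8 position without any conversion.
utf16OffsetsMatchUtf8 :: T.Text -> Bool
utf16OffsetsMatchUtf8 = T.all isAscii
```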

@pepeiborra
Collaborator Author

@Bodigrim has already contributed a solution in #392
Suggest we focus on reviewing that one!

@michaelpj
Collaborator

Perhaps @ollef and @Bodigrim can collude on the ideal solution?

@ollef
Contributor

ollef commented Jan 14, 2022

rope-utf16-splay-0.4.0.0 supports text-2.0 with the same interface as before. I'll leave it to the maintainers of this library to decide on how to proceed. Let me know if you run into any issues.

@Bodigrim
Contributor

@ollef awesome, great job! Any chance you have sources for https://github.com/ollef/rope-utf16-splay/blob/master/bench.html somewhere?

@ollef
Contributor

ollef commented Jan 15, 2022

I believe this is what I was using back then: https://github.com/ollef/rope-bench/. A bit bitrotted, but maybe the benchmarks can be adapted if nothing else (https://github.com/ollef/rope-bench/blob/main/bench/Bench.hs).

@Bodigrim
Contributor

Another slightly orthogonal problem is that GHC reported SrcSpans are always with respect to UTF-8 (I think), and we make no effort to do any conversions.

That's weird. I'd expect GHC to use Unicode code points, not UTF-8 (or UTF-16) code units?..
