Treesitter challenges #808

Open
@jennybc

My recent foray into the language server and, more specifically, completions has given me Some Opinions™️, which I discussed with @lionel- and @DavisVaughan. We agreed I'd capture some of the friction I'm noticing as a newcomer to the codebase. This issue is inspired by work on #778 and #805.

As I get to know the completions codebase, I've been surprised at the widespread, very low-level interaction with the treesitter syntax tree. In hindsight, it's clear I expected activities to be framed in terms of nodes of type, e.g., "call", "arguments", or "argument". I didn't expect so much logic around anonymous nodes, such as "(", ")", ",", and "=".

To a considerable extent, you cannot avoid working with these low-level nodes. I mean, someone has to do it! Therefore, one coping strategy is to keep building out a nice layer of well-tested wrappers around treesitter and to increase usage of those wrappers (i.e. try to eliminate bespoke low-level tree handling in functions that do high-level tasks).

But it may also be true that this is a legitimate downside of using the treesitter tree or parser directly (or at all?). This issue is a place to record challenges that come from treesitter.

Whitespace is hard

The fact that whitespace is basically not accounted for in the syntax tree is quite painful, because the cursor quite often has whitespace on one or both sides. In the language server, we constantly need to determine which node is "most associated" with the cursor. It's accurate-ish to say that treesitter's treatment of whitespace makes the "most associated" node almost undefined in these cases. It certainly puts you in a gray area.

Let's look at an example! Consider this code, where @ indicates the cursor:

options(
  a = @
)

Here's treesitter's view of that code. On the left, I overlay treesitter coordinates and on the right is the resulting syntax tree.

    0  1  2  3  4  5  6  7  8
    ┌──┬──┬──┬──┬──┬──┬──┬──┐
 0  │ o│ p│ t│ i│ o│ n│ s│ (│
    └──┴──┴──┴──┴──┴──┴──┴──┘

    0  1  2  3  4  5  6            program [0, 0] - [3, 0]
    ┌──┬──┬──┬──┬──┬──┐              call [0, 0] - [2, 1]
 1  │  │  │ a│  │ =│  │                function: identifier [0, 0] - [0, 7]
    └──┴──┴──┴──┴──┴──┘                arguments: arguments [0, 7] - [2, 1]
                                         open: ( [0, 7] - [0, 8]
    0  1                                 argument: argument [1, 2] - [1, 5]
    ┌──┐                                   name: identifier [1, 2] - [1, 3]
 2  │ )│                                   = [1, 4] - [1, 5]
    └──┘                                 close: ) [2, 0] - [2, 1]

To paraphrase the treesitter docs about [i, j] coordinates:

The row number i gives the number of newlines before a given position.
The column j gives the number of characters between the position and the beginning of the line.

(It's really bytes, not characters, but that's not important for this discussion.)

A treesitter position:

  • Sits ON a line
  • Sits BETWEEN two characters
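
To make the coordinate system concrete, here is a minimal sketch that computes a treesitter-style position from text containing a cursor marker. It is plain Python and does not use treesitter at all; `point_for_cursor` is a hypothetical helper made up for this illustration, and the column is counted in UTF-8 bytes, as noted above.

```python
def point_for_cursor(text_with_cursor, marker="@"):
    """Return the (row, column) of the first marker, treesitter style.

    row    = number of newlines before the marker
    column = number of UTF-8 bytes between the start of that line and the marker
    """
    offset = text_with_cursor.index(marker)
    before = text_with_cursor[:offset]
    row = before.count("\n")
    line_start = before.rfind("\n") + 1  # 0 when there is no newline yet
    column = len(before[line_start:].encode("utf-8"))
    return (row, column)


code = "options(\n  a = @\n)"
print(point_for_cursor(code))  # (1, 6)
```

The position sits between the space after "=" and the end of the line, which is why it does not belong to any character.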

In the example, the cursor @ is at position [1, 6].
So which node is the cursor "in"?
If your job is to provide completions, which bit of syntax are you helping the user to fill in?

IMO there are two reasonable answers. You're either in:

  • An "argument" node. Here, the node with text a =.
  • The potential "value" node that could exist as a child of the "argument" node.

I view these as equivalent, because if you chose the first option, you would then need logic to bring yourself to the second. That's just a matter of how you design the interface.

If you use bare treesitter tooling, here's the node you are "in":

  • The "arguments" node, which spans everything from the "(" through the ")".

You can read this off the tree, because the "arguments" node is the smallest node with a span that contains position [1, 6].
Selecting the "arguments" node is very unfavorable for providing completions, though. It's too high.
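
Reading this off the tree can be sketched in a few lines. The following is not treesitter or ark code: it is a hand-built copy of the tree shown above, with a hypothetical `Node` class and a hypothetical `smallest_containing()` helper, assuming spans are inclusive on the left and exclusive on the right.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    start: tuple  # (row, column), inclusive
    end: tuple    # (row, column), exclusive
    children: list = field(default_factory=list)


def contains(node, point):
    # Tuples compare row first, then column, which matches point ordering.
    return node.start <= point < node.end


def smallest_containing(node, point):
    """Descend to the smallest node whose span contains point, or None."""
    if not contains(node, point):
        return None
    for child in node.children:
        found = smallest_containing(child, point)
        if found is not None:
            return found
    return node


# Hand-copied from the tree in this issue.
tree = Node("program", (0, 0), (3, 0), [
    Node("call", (0, 0), (2, 1), [
        Node("identifier", (0, 0), (0, 7)),
        Node("arguments", (0, 7), (2, 1), [
            Node("(", (0, 7), (0, 8)),
            Node("argument", (1, 2), (1, 5), [
                Node("identifier", (1, 2), (1, 3)),
                Node("=", (1, 4), (1, 5)),
            ]),
            Node(")", (2, 0), (2, 1)),
        ]),
    ]),
])

print(smallest_containing(tree, (1, 6)).kind)  # arguments
print(smallest_containing(tree, (1, 4)).kind)  # =
```

The "argument" node ends at [1, 5], so a cursor at [1, 6] falls through to its parent.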

What if there was no space between the "=" and the cursor?

options(
  a =@
)

Bare treesitter tooling would still say the cursor is in the "arguments" node.
Ark already has some wrappers around treesitter where we have (somewhat) fixed this up.
find_closest_node_to_point() would latch on to the "=" in this case.
(And quite a bit of existing logic expects to solve problems in this "bottom up" way, although I'm not sure it has to be this way.)

This is a good place to record the capturing behaviour at node boundaries.
In treesitter, a node span is sticky / inclusive on the left and not sticky / exclusive on the right.
Concretely, where @ indicates the cursor position and [ ... ] indicates a node's span:

? ? @[ ... ]  ? ? <- the cursor IS in the node
? ?  [ ... ]@ ? ? <- the cursor IS NOT in the node

find_closest_node_to_point() would say the cursor is in the node in both cases.
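
The difference between the two boundary rules can be sketched as follows. This is a toy model, not ark's implementation of find_closest_node_to_point(): the tree is hand-copied from the example above as nested tuples, and `deepest()` with its `sticky_right` flag is a hypothetical helper that only switches between a half-open and a closed span test.

```python
# (kind, start, end, children), hand-copied from the tree in this issue.
TREE = ("program", (0, 0), (3, 0), [
    ("call", (0, 0), (2, 1), [
        ("identifier", (0, 0), (0, 7), []),
        ("arguments", (0, 7), (2, 1), [
            ("(", (0, 7), (0, 8), []),
            ("argument", (1, 2), (1, 5), [
                ("identifier", (1, 2), (1, 3), []),
                ("=", (1, 4), (1, 5), []),
            ]),
            (")", (2, 0), (2, 1), []),
        ]),
    ]),
])


def deepest(node, point, sticky_right):
    """Return the kind of the deepest node whose span holds point, or None.

    sticky_right=False: span is [start, end), the bare treesitter rule.
    sticky_right=True:  span is [start, end], so the right edge also captures.
    """
    kind, start, end, children = node
    if sticky_right:
        inside = start <= point <= end
    else:
        inside = start <= point < end
    if not inside:
        return None
    for child in children:
        found = deepest(child, point, sticky_right)
        if found is not None:
            return found
    return kind


# Cursor directly after "=", i.e. `a =@`
print(deepest(TREE, (1, 5), sticky_right=False))  # arguments
print(deepest(TREE, (1, 5), sticky_right=True))   # =

# Cursor after "= ", i.e. `a = @`: the whitespace case is unchanged
print(deepest(TREE, (1, 6), sticky_right=True))   # arguments
```

Making the right edge sticky helps when the cursor touches a node, but it does nothing for a cursor that is surrounded by whitespace. In this sketch, when two siblings touch, the left one wins because children are searched in source order.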

Executive summary

The treatment of whitespace makes treesitter syntax trees tricky to use directly for language server tasks.
You generally need to walk up and/or down to identify the node that really drives your actions.
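
As one illustration of "walking down", here is a rough sketch of how you might get from the "arguments" node to the "argument" the cursor most plausibly belongs to. The helper `adjacent_argument()` and its rule (take the last child that ends at or before the cursor, and accept it only if it is an "argument") are assumptions made up for this example, not existing ark behaviour. A real version would need more checks, e.g. whether the argument already has a value.

```python
def adjacent_argument(children, point):
    """Pick the "argument" child immediately to the left of point, if any.

    children: (kind, start, end) tuples in source order.
    Returns the last child that ends at or before point when that child is
    an "argument"; otherwise returns None.
    """
    preceding = [child for child in children if child[2] <= point]
    if preceding and preceding[-1][0] == "argument":
        return preceding[-1]
    return None


# Children of the "arguments" node from the example in this issue.
ARGUMENTS_CHILDREN = [
    ("(", (0, 7), (0, 8)),
    ("argument", (1, 2), (1, 5)),
    (")", (2, 0), (2, 1)),
]

print(adjacent_argument(ARGUMENTS_CHILDREN, (1, 6)))  # ('argument', (1, 2), (1, 5))
print(adjacent_argument(ARGUMENTS_CHILDREN, (1, 0)))  # None
```

With this rule, a cursor that follows a "," or the "(" yields None, which a caller could interpret as "the user is starting a fresh argument".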

It feels like ark's language server currently has these tricky gymnastics inlined throughout the codebase.
In the future, it would be nice to give ourselves a more ergonomic interface to the tree.
