Support multi-language documents #409
Comments
One interesting possible follow-on to this is the ability to define an intermediary language that would allow for more fluent chaining of commands. For example, when making a list in Ruby or JS/TS, I assume I need to add commas.
What if I could create the following and then say
This example seems trivial but I'm finding myself using |
@Will-Sommers yes, that would be nice to be able to do, but I guess I'm not sure how we'd know to treat this fragment as a separate language if it appeared in a Ruby document |
Heyo, I reached out and asked one of the GH engineers about multi-language documents and it looks like there is support within tree sitter. Here's the example he linked. That being said, being able to parse is just one step. Right now it looks like we rely on the editor's language selection to indicate which language we are in and which parser to use. Steps for further investigation:
I think that something like syntax within markdown might be sort of difficult unless you use an extended info string à la GitHub Flavored Markdown. |
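To make the info-string idea concrete, here is a minimal sketch of pulling a language id out of the opening line of a fenced code block. `parseFenceInfo` and `FenceInfo` are hypothetical names, not part of Cursorless or tree-sitter, and treating the first word as the language is only the common convention, not something either project guarantees.

```typescript
// Hypothetical helper for illustration; not an existing Cursorless API.
interface FenceInfo {
  // First word of the info string, lowercased; undefined for a bare fence.
  languageId: string | undefined;
  // Any remaining words of an "extended" info string, e.g. title=demo.
  attributes: string[];
}

function parseFenceInfo(openingLine: string): FenceInfo | undefined {
  // Opening fence: up to three spaces, then 3+ backticks or 3+ tildes.
  const match = /^ {0,3}(`{3,}|~{3,})\s*(.*)$/.exec(openingLine);
  if (match === null) {
    return undefined; // Not a fence opener at all.
  }
  const info = match[2] ?? "";
  const words = info.trim().split(/\s+/).filter((word) => word.length > 0);
  const [first, ...rest] = words;
  return { languageId: first?.toLowerCase(), attributes: rest };
}
```

The returned `languageId` would still need to be mapped onto whatever ids the editor or parser registry uses (for example `ts` versus `typescript`); that mapping is left out here.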
Yeah, that injection stuff looks like the way to go. Fwiw, VSCode knows that the segment is a different language, as evidenced by the fact that it's able to do syntax highlighting, but I don't think we can get that info. Worth a quick look, I guess. But if we can't use that, then the tree-sitter injection stuff definitely seems like the way to go. Will be interesting to see if you can get it to work. Yeah, for markdown I'd use the extended info string; it looks like tree-sitter has support for determining the language from the text of a token, as described in that doc you sent |
Heyo, it does look like this can be exposed to us via VSCode using a language server extension, either one that we import and depend on or one that ships with VSCode and does things like multi-language syntax highlighting. Here are the two VSCode approaches outlined. I think this is a better approach than relying on a parser since it will give more flexibility. What we really need to be able to say is: we're in language x or y when the cursor is within this block. I think then it would be easy to let cursorless, as it stands, take over with a specific languageId. I think the extra flexibility will help support custom code blocks that have slight differences in their specs, like I think the compelling use cases to support here are likely:
I think that the first two are tied in my mind, but the second case seems more commonly used than the first these days. |
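As a rough sketch of the "which language is the cursor in" question, assuming something upstream (a language server or a parser) has already produced a list of embedded regions: `EmbeddedRegion` and `languageIdAtOffset` are hypothetical names, and offsets are plain character offsets rather than VSCode positions.

```typescript
// Hypothetical shape for an embedded block; not a VSCode or Cursorless type.
interface EmbeddedRegion {
  languageId: string;
  start: number; // inclusive character offset
  end: number; // exclusive character offset
}

// Returns the language of the innermost region containing the offset,
// falling back to the document's own languageId outside any region.
function languageIdAtOffset(
  documentLanguageId: string,
  regions: readonly EmbeddedRegion[],
  offset: number,
): string {
  let best: EmbeddedRegion | undefined;
  for (const region of regions) {
    if (offset < region.start || offset >= region.end) {
      continue;
    }
    // Prefer the smallest enclosing region so nested blocks win.
    if (best === undefined || region.end - region.start < best.end - best.start) {
      best = region;
    }
  }
  return best?.languageId ?? documentLanguageId;
}
```

With a result like this in hand, the existing per-language machinery could in principle run unchanged with the returned id in place of the editor's languageId.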
Looking at that VSCode example, it looks like it may be a bit complex to get that direction working. It seems we need to fork each LSP extension we want to use? One thing to think about for either approach is how we handle referring to a scope type in the parent tree. For example, if we're inside an embedded code block in markdown and say "take section" to select the markdown section that contains our code block. Also, in the future we may want to be able to compute a list of the language ids of all visible editors. This way we can narrow down the set of scope types that are active in talon lists. That will potentially become important as we support more and more scope types, e.g. for things like LaTeX where you want "environment", etc. Probably not worth engineering too hard for right now, but just a slight consideration to keep in mind |
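The narrowing idea above could look roughly like this. Everything here is an assumption for illustration: the table contents, the `exampleScopeTypes` and `activeScopeTypes` names, and the idea that the visible language ids (including any embedded ones) arrive as a flat list.

```typescript
// Illustrative table only; real scope type names live in Cursorless itself.
const exampleScopeTypes: Record<string, readonly string[]> = {
  markdown: ["section", "list", "item"],
  latex: ["section", "environment"],
  typescript: ["function", "class", "argument"],
};

// Union of scope types for every language currently visible, sorted so the
// result is stable regardless of editor order.
function activeScopeTypes(
  visibleLanguageIds: readonly string[],
  table: Readonly<Record<string, readonly string[]>>,
): string[] {
  const active = new Set<string>();
  for (const languageId of visibleLanguageIds) {
    for (const scopeType of table[languageId] ?? []) {
      active.add(scopeType);
    }
  }
  return Array.from(active).sort();
}
```

Languages missing from the table simply contribute nothing, so an unknown embedded language would not break the list.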
Fwiw in cases where injections that ship with tree-sitter repos are lacking, we can borrow from these:
Also worth looking at how these projects implement injections to see if it's helpful |
Cool! Thanks for the links, I'll take a look at them tonight and see a bit more. Unrelated to this main issue, I think there could also be some interesting things here as well wrt naming of nodes. This is really cool. I need to think a bit more about it. |
Heyo, just some notes here: So it looks like what neo-vim and the tree-sitter test suite do is pass the queries included in each tree-sitter project into each parser's language; this returns a new query object, which can then be used to match against the rest of the file. This will return a set of matches. Within each match there's a field called e.g.
The injection.language names are specified in each tree-sitter project's package.json. Highlight and injection paths are also specified there. From the tree-sitter docs, it looks like from here, we need to reference that parse and use that parser's language for a re-parse.
Quickly looking at the queries files within neo-vim, I think we can crib a bunch from them. It looks like within the I'll start to think about how this might change things. For one, it looks like this supposes multiple trees, since one tree is returned from each parse. |
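A sketch of the step described in these notes: turning one injection query match into "this range should be re-parsed as language X". The `MatchLike` and `CaptureLike` shapes are simplified stand-ins, not the real tree-sitter binding types, and the capture and property names (`injection.content`, `injection.language`) are assumed from the tree-sitter injection query convention; individual query files may use other names.

```typescript
// Simplified stand-ins for query results; not the actual binding types.
interface CaptureLike {
  name: string;
  text: string;
  startIndex: number;
  endIndex: number;
}

interface MatchLike {
  captures: CaptureLike[];
  // Properties set in the query itself, e.g. a fixed injected language.
  setProperties?: Record<string, string | null>;
}

interface Injection {
  languageId: string;
  startIndex: number;
  endIndex: number;
}

function resolveInjection(match: MatchLike): Injection | undefined {
  const content = match.captures.find((c) => c.name === "injection.content");
  if (content === undefined) {
    return undefined; // Nothing to re-parse.
  }
  // A language fixed by the query wins; otherwise use the captured token text.
  const fixed = match.setProperties?.["injection.language"];
  const captured = match.captures.find((c) => c.name === "injection.language")?.text;
  const languageId = (fixed ?? captured)?.trim().toLowerCase();
  if (languageId === undefined || languageId === "") {
    return undefined;
  }
  return { languageId, startIndex: content.startIndex, endIndex: content.endIndex };
}
```

Each resolved injection would then be handed to the parser for `languageId`, which is where the "one tree per parse" question comes from.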
It's also worth thinking about how this works with incremental parsing |
Also wrt the multiple tree thing, I don't believe there's any reason we can't just graft the trees together ourselves, right? Tho maybe it's better to leave the trees untouched, and then just maintain our own map from nodes to injected subtrees |
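For the "leave the trees untouched and keep our own map" option, a minimal sketch might look like the following. `SyntaxNodeLike`, `InjectedTree` and `InjectionIndex` are hypothetical, and keying on a numeric node id assumes ids stay usable for as long as the host tree is kept; that would need checking against the actual bindings.

```typescript
// Minimal stand-in for a parse-tree node; only the fields used here.
interface SyntaxNodeLike {
  id: number;
  type: string;
  startIndex: number;
  endIndex: number;
}

interface InjectedTree {
  languageId: string;
  rootNode: SyntaxNodeLike;
}

// Side table from host nodes to the subtree parsed from their content.
class InjectionIndex {
  private readonly entries = new Map<number, { host: SyntaxNodeLike; injected: InjectedTree }>();

  attach(host: SyntaxNodeLike, injected: InjectedTree): void {
    this.entries.set(host.id, { host, injected });
  }

  lookup(host: SyntaxNodeLike): InjectedTree | undefined {
    return this.entries.get(host.id)?.injected;
  }

  // Drops every injected tree whose host overlaps the edited range
  // [editStart, editEnd) and returns how many were dropped.
  invalidate(editStart: number, editEnd: number): number {
    let dropped = 0;
    this.entries.forEach((entry, id) => {
      if (entry.host.startIndex < editEnd && editStart < entry.host.endIndex) {
        this.entries.delete(id);
        dropped += 1;
      }
    });
    return dropped;
  }
}
```

Dropping on overlap is the bluntest possible policy; a real version would presumably re-parse the affected injected trees incrementally rather than discard them.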
Also I think I'd argue we should try to push as much as possible into the parse-tree extension rather than cursorless |
Heyo, I'll think about all of these notes. I thought about this a bit more and think that we should first approach looking at languages where the sub-language is part of the grammar and there's a node similar to The other case handling cases where a transpilation step looks at the text and handles the inner blocks in a separate fashion, a la CSS in JSX.
Yep! This query can happen off of any
Just to be clear, what you're advocating is, for example, taking the
I agree in principle on this but I'm curious if it will be the best way. We'll need to see how |
re: Tree-Sitter being able to handle this by default.
This is meant to go in the Update: I reached out to one of the devs via email. |
Just to be clear, by "incremental parsing", I mean updating the parse tree as the document changes, which tree-sitter is able to do efficiently without reparsing the entire document. Here's where we do it in the parse-tree extension: https://github.com/cursorless-dev/vscode-parse-tree/blob/4af875b7cbd72d68c1e1eafe43ddabc3403264ce/src/extension.ts#L109-L134 |
Sorry—not sure I understand the difference between these two cases. Can you elaborate? Or maybe worth chatting on discord? |
Not really advocating one direction or the other tbh, just brainstorming |
Sorry, I don't follow. Maybe another thing for a discord |
Adding more notes: it looks like Neovim created their own data structure to track child trees as well as language injection/queries. [link] Looking more at this, including in the document example where an |
I think there are going to end up being some interesting design challenges here, for which we are going to have to develop principles as we go. This came up recently in the context of .talon file support, which kind of is two languages: k/v pairs and talonscript as values of keys. There was discussion about what |
There are also dotnet languages to consider here that also mix with HTML:
|
Now that Cursorless supports HTML (great job!) it would be amazing if it could also support nested languages inside of HTML. For example, JS/TS inside of <script> tags, and CSS/SCSS inside of <style> tags.
This would be super helpful for web developers, because plenty of popular frameworks (such as Vue) use single file components, where HTML, JS and CSS are all located in the same file.
I assume that this documentation is relevant to this issue:
https://tree-sitter.github.io/tree-sitter/using-parsers#multi-language-documents