The problem

Cursorless has some support for Unicode, but it could be improved. The core of the problem is that we don't properly determine grapheme boundaries, so our tokeniser and hat placement will sometimes split individual graphemes, resulting in problems such as:
- emoji being split in half by the hat renderer (eg 🤷‍♂️ being split into 🤷 and ♂️)
- deleting half of a surrogate pair, which would result in invalid Unicode in a document (see the snippet after this list)
- etc
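To make the surrogate-pair failure concrete, here is a minimal standalone TypeScript illustration (not Cursorless code) of what happens when a string is cut between the two halves of a surrogate pair:

```ts
// "🤷" (U+1F937) is one grapheme but two UTF-16 code units
const shrug = "\u{1F937}";
console.log(shrug.length); // 2

// Slicing between the code units leaves an unpaired high surrogate,
// which is not valid Unicode on its own
const half = shrug.slice(0, 1); // "\uD83E"
console.log(half + "x"); // typically renders as "�x"
```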
We need to properly handle Unicode in the following areas:

- cursorless/src/libs/cursorless-engine/tokenizer/tokenizer.ts, line 89 (at 2e5ba05)
- cursorless/src/libs/cursorless-engine/tokenizer/tokenizer.ts, line 88 (at 2e5ba05)
The solution

We'd like to vendor in VSCode's grapheme splitter, which appears to be a complete implementation of Unicode's recommended grapheme cluster splitting algorithm, optimised for use in a code editor. We will use this directly for grapheme splitting. For our other, regex-based segmentation algorithms, we will modify them to automatically extend matches to the nearest cluster boundary after they do regex splitting.
Steps

1. Copy grapheme splitter from VSCode

We begin by copying VSCode's grapheme splitter into our vendor directory (note that we could probably reuse the copy of CharCode that we've already vendored in if necessary). See https://github.com/microsoft/vscode/blob/9bc5f9b708c36a38ba7e5d41e291c1f244af5972/src/vs/editor/common/core/cursorColumns.ts#L44 for example usage of their splitter. We will probably want to wrap their splitter in something more ergonomic, eg something that returns an iterable of ranges (see the sketch below).

- Look into https://github.com/flmnt/graphemer as a possible replacement for the VSCode splitter? We should do some benchmarking, as the VSCode implementation seems well optimised for code, where the common case is mostly ASCII
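Here is a minimal sketch of the ergonomic wrapper shape, using Intl.Segmenter purely as a stand-in for the vendored VSCode splitter; the real implementation would call into the vendored code, and the names here are illustrative:

```ts
interface OffsetRange {
  start: number;
  end: number;
}

// Yield one offset range per grapheme cluster in the given text.
// Intl.Segmenter is a stand-in; swap in the vendored VSCode splitter.
function* generateGraphemeOffsets(text: string): Iterable<OffsetRange> {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  for (const { index, segment } of segmenter.segment(text)) {
    yield { start: index, end: index + segment.length };
  }
}

// [...generateGraphemeOffsets("a🤷‍♂️b")] yields three ranges, keeping the
// ZWJ emoji sequence intact as a single grapheme
```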
2. Create generateGraphemeAwareMatchesInRange function
The file will look roughly as follows:
```ts
import { Range, TextEditor } from "@cursorless/common";
import { RangeOffsets } from "../../../typings/updateSelections";
import { imap } from "itertools";

interface RangeWithGraphemes {
  offsets: RangeOffsets;
  graphemes: RangeOffsets[];
}

function* generateGraphemeAwareMatchesInRange(
  regex: RegExp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeWithGraphemes> {
  // Instantiate grapheme iterator for range
  const graphemeIterator =
    generateGraphemesInRange(editor, range, direction)[Symbol.iterator]();
  let grapheme = graphemeIterator.next();

  const matchIterable = generateMatchOffsetsInRange(
    regex,
    editor,
    range,
    direction,
  );

  for (const match of matchIterable) {
    // For each match returned by regex:
    // Discard graphemes until we find the first that overlaps with match
    while (!grapheme.done && grapheme.value.end <= match.start) {
      grapheme = graphemeIterator.next();
    }

    // Collect all graphemes that overlap with match
    const graphemesInMatch: RangeOffsets[] = [];
    while (!grapheme.done && grapheme.value.start < match.end) {
      graphemesInMatch.push(grapheme.value);
      grapheme = graphemeIterator.next();
    }

    if (graphemesInMatch.length === 0) {
      throw new Error("No graphemes in match; this should never happen");
    }

    // Expand match to include all graphemes, and return the graphemes for
    // convenience
    yield {
      offsets: {
        start: graphemesInMatch[0].start,
        end: graphemesInMatch[graphemesInMatch.length - 1].end,
      },
      graphemes: graphemesInMatch,
    };
  }
}

type Direction = "forward" | "backward";

function generateMatchOffsetsInRange(
  regex: RegExp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  const offset = editor.document.offsetAt(range.start);
  const text = editor.document.getText(range);
  const matchToRange = (match: RegExpMatchArray): RangeOffsets => ({
    start: offset + match.index!,
    end: offset + match.index! + match[0].length,
  });

  // Reset the regex to start at the beginning of string, in case the regex has
  // been used before.
  // See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec#finding_successive_matches
  regex.lastIndex = 0;

  return direction === "forward"
    ? imap(text.matchAll(regex), matchToRange)
    : Array.from(text.matchAll(regex), matchToRange).reverse();
}

function generateGraphemesInRange(
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  throw new Error(
    "Not implemented; we need to vendor this function in from VSCode source code",
  );
}
```
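The core move above, widening each regex match to enclosing grapheme boundaries, can be illustrated standalone with concrete offsets (the names below are illustrative, not part of the sketch above):

```ts
interface Offsets {
  start: number;
  end: number;
}

// Expand a regex match to cover every grapheme it overlaps, mirroring the
// two while-loops in generateGraphemeAwareMatchesInRange above
function expandToGraphemes(match: Offsets, graphemes: Offsets[]): Offsets {
  const overlapping = graphemes.filter(
    (g) => g.end > match.start && g.start < match.end,
  );
  return {
    start: overlapping[0].start,
    end: overlapping[overlapping.length - 1].end,
  };
}

// "🤷" occupies offsets 0-2 (a surrogate pair). A match covering only
// offsets 0-1 gets widened to the whole grapheme:
expandToGraphemes({ start: 0, end: 1 }, [{ start: 0, end: 2 }]);
// => { start: 0, end: 2 }
```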
3. Use generateGraphemeAwareMatchesInRange everywhere we need it
- Switch tokeniser to use generateGraphemeAwareMatchesInRange combined with our existing token regex (see the sketch after this list)
- Switch existing grapheme splitter to just perform normalization, using graphemes already output by generateGraphemeAwareMatchesInRange instead of doing its own grapheme splitting. This will require changes to allocateHats, which coordinates the interaction between tokeniser and grapheme splitter. We should probably rename the existing grapheme splitter to normalizeGraphemes or something
- Switch word scope to use generateGraphemeAwareMatchesInRange combined with our existing word regex
- Switch identifier scope to use generateGraphemeAwareMatchesInRange combined with our existing identifier regex
- Switch token scope to use generateGraphemeAwareMatchesInRange combined with our existing token regex
- Switch character scope to use generateGraphemeAwareMatchesInRange
- Remove the regex expansion behaviour when updating selections (cursorless/src/core/updateSelections/getOffsetsForEmptyRangeInsert.ts, line 68, and cursorless/src/core/updateSelections/getOffsetsForNonEmptyRangeInsert.ts, line 82, at e17628a), as it will require work to get it to be grapheme aware, and it's not even clear if the regex expansion behaviour was ever desirable. We should just switch tokens to be closed
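As a rough sketch of the first item (the regex here is a simplified placeholder, not the real Cursorless token regex), the tokeniser switch-over might look like:

```ts
import { Range, TextEditor } from "@cursorless/common";

// Simplified placeholder; the real token regex lives in tokenizer.ts and,
// as noted in the caveats below, already includes \p{M}
const tokenRegex = /[\p{L}\p{M}\p{N}_]+/gu;

function* generateTokensInRange(editor: TextEditor, range: Range) {
  // Every match comes back pre-expanded to whole graphemes
  yield* generateGraphemeAwareMatchesInRange(tokenRegex, editor, range, "forward");
}
```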
4. Add tests
We should add tests for the following:
- \u0027\u005a\u0351\u036b\u0343\u036a\u0302\u036b\u033d\u034f\u0334\u0319\u0324\u031e\u0349\u035a\u032f\u031e\u0320\u034d\u0041\u036b\u0357\u0334\u0362\u0335\u031c\u0330\u0354\u004c\u0368\u0367\u0369\u0358\u0320\u005f\u0047\u0311\u0357\u030e\u0305\u035b\u0341\u0334\u033b\u0348\u034d\u0354\u0339\u004f\u0342\u030c\u030c\u0358\u0328\u0335\u0339\u033b\u031d\u0333\u005f\u0061\u0069\u0072\u0021\u033f\u030b\u0365\u0365\u0302\u0363\u0310\u0301\u0301\u035e\u035c\u0356\u032c\u0330\u0319\u0317\u0061\u0027 (a string packed with combining marks)
- \u0049\u00f1\u0074\u00eb\u0072\u006e\u00e2\u0074\u0069\u00f4\u006e\u00e0\u006c\u0069\u007a\u00e6\u0074\u0069\u00f8\u006e\u2603\u{1f4a9} (decodes to "Iñtërnâtiônàlizætiøn☃💩"; from https://mathiasbynens.be/notes/javascript-unicode#poo-test)
For each of the above strings, we should run:
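As one example of such a test, here is a sketch using Intl.Segmenter as a stand-in reference splitter; the real tests would exercise the vendored splitter and the functions above:

```ts
import assert from "node:assert";

const text =
  "\u0049\u00f1\u0074\u00eb\u0072\u006e\u00e2\u0074\u0069\u00f4\u006e" +
  "\u00e0\u006c\u0069\u007a\u00e6\u0074\u0069\u00f8\u006e\u2603\u{1f4a9}";

const segments = [
  ...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(text),
].map((s) => s.segment);

// "Iñtërnâtiônàlizætiøn" is 20 graphemes, plus ☃ and 💩
assert.strictEqual(segments.length, 22);

// The astral-plane 💩 must come back as one grapheme, not two surrogates
assert.strictEqual(segments[segments.length - 1], "\u{1f4a9}");
```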
Caveats
Technically speaking, we could still end up splitting a token that we don't want to, if our token regex isn't smart enough to understand some construct that should actually be considered part of the middle of a token. Including \p{M} in our regex, as we do today, should solve almost every problem, and we can always improve our regex if necessary, so this should be a non-issue. In addition, the failure mode here is much less dramatic than today's, as we won't be corrupting graphemes by eg splitting surrogate pairs; we'll just end up with shorter tokens than we'd like in these rare cases.
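To see why \p{M} matters here, a quick illustration (the regex below is a simplified placeholder, not the actual Cursorless token regex):

```ts
// \p{M} matches combining marks, so they attach to the preceding base
// character instead of terminating the token
const tokenish = /[\p{L}\p{M}\p{N}_]+/gu;
console.log("Z\u0351\u036b_foo".match(tokenish)); // ["Z͑ͫ_foo"], one token

// Without \p{M}, the combining marks split the token in two
const noMarks = /[\p{L}\p{N}_]+/gu;
console.log("Z\u0351\u036b_foo".match(noMarks)); // ["Z", "_foo"]
```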