The problem

Cursorless has some support for Unicode, but it could be improved. The core of the problem is that we don't properly determine grapheme boundaries, so our tokeniser and hat placement will sometimes split individual graphemes, resulting in problems such as:
- emoji being split in half by the hat renderer (eg 🤷‍♂️ being split into 🤷 and ♂️)
- deleting half of a surrogate pair, which would result in invalid Unicode in a document (see the snippet after this list)
- etc
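To make the surrogate-pair failure concrete, here is a minimal standalone TypeScript illustration (not Cursorless code) of what happens when a string is cut between the two halves of a surrogate pair:

```ts
// "🤷" (U+1F937) is one grapheme but two UTF-16 code units
const shrug = "\u{1F937}";
console.log(shrug.length); // 2

// Slicing between the code units leaves an unpaired high surrogate,
// which is not valid Unicode on its own
const half = shrug.slice(0, 1); // "\uD83E"
console.log(half + "x"); // typically renders as "�x"
```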
We need to properly handle Unicode in the following areas:

- cursorless/src/libs/cursorless-engine/tokenizer/tokenizer.ts, line 89 (at 2e5ba05)
- cursorless/src/libs/cursorless-engine/tokenizer/tokenizer.ts, line 88 (at 2e5ba05)
The solution

We'd like to vendor in VSCode's grapheme splitter, which appears to be a complete implementation of Unicode's recommended grapheme cluster splitting algorithm, optimised for use in a code editor. We will use this directly for grapheme splitting. For our other, regex-based segmentation algorithms, we will modify them to automatically extend matches to the nearest cluster boundary after they do regex splitting.
Steps

1. Copy grapheme splitter from VSCode

We begin by copying VSCode's grapheme splitter into our vendor directory (note that we could probably reuse the copy of CharCode that we've already vendored in if necessary). See https://github.com/microsoft/vscode/blob/9bc5f9b708c36a38ba7e5d41e291c1f244af5972/src/vs/editor/common/core/cursorColumns.ts#L44 for example usage of their splitter. We will probably want to wrap their splitter in something more ergonomic, eg something that returns an iterable of ranges (see the sketch below).

- Look into https://github.com/flmnt/graphemer as a possible replacement for the VSCode splitter? We should do some benchmarking, as the VSCode implementation seems well optimised for code, where the common case is mostly ASCII
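Here is a minimal sketch of the ergonomic wrapper shape, using Intl.Segmenter purely as a stand-in for the vendored VSCode splitter; the real implementation would call into the vendored code, and the names here are illustrative:

```ts
interface OffsetRange {
  start: number;
  end: number;
}

// Yield one offset range per grapheme cluster in the given text.
// Intl.Segmenter is a stand-in; swap in the vendored VSCode splitter.
function* generateGraphemeOffsets(text: string): Iterable<OffsetRange> {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  for (const { index, segment } of segmenter.segment(text)) {
    yield { start: index, end: index + segment.length };
  }
}

// [...generateGraphemeOffsets("a🤷‍♂️b")] yields three ranges, keeping the
// ZWJ emoji sequence intact as a single grapheme
```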
2. Create generateGraphemeAwareMatchesInRange function
The file will look roughly as follows:
```ts
import { Range, TextEditor } from "@cursorless/common";
import { RangeOffsets } from "../../../typings/updateSelections";
import { imap } from "itertools";

interface RangeWithGraphemes {
  offsets: RangeOffsets;
  graphemes: RangeOffsets[];
}

function* generateGraphemeAwareMatchesInRange(
  regex: RegExp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeWithGraphemes> {
  // Instantiate grapheme iterator for range
  const graphemeIterator =
    generateGraphemesInRange(editor, range, direction)[Symbol.iterator]();
  let grapheme = graphemeIterator.next();

  const matchIterable = generateMatchOffsetsInRange(
    regex,
    editor,
    range,
    direction,
  );

  for (const match of matchIterable) {
    // For each match returned by regex:
    // Discard graphemes until we find the first that overlaps with match
    while (!grapheme.done && grapheme.value.end <= match.start) {
      grapheme = graphemeIterator.next();
    }

    // Collect all graphemes that overlap with match
    const graphemesInMatch: RangeOffsets[] = [];
    while (!grapheme.done && grapheme.value.start < match.end) {
      graphemesInMatch.push(grapheme.value);
      grapheme = graphemeIterator.next();
    }

    if (graphemesInMatch.length === 0) {
      throw new Error("No graphemes in match; this should never happen");
    }

    // Expand match to include all graphemes, and return the graphemes for
    // convenience
    yield {
      offsets: {
        start: graphemesInMatch[0].start,
        end: graphemesInMatch[graphemesInMatch.length - 1].end,
      },
      graphemes: graphemesInMatch,
    };
  }
}

type Direction = "forward" | "backward";

function generateMatchOffsetsInRange(
  regex: RegExp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  const offset = editor.document.offsetAt(range.start);
  const text = editor.document.getText(range);
  const matchToRange = (match: RegExpMatchArray): RangeOffsets => ({
    start: offset + match.index!,
    end: offset + match.index! + match[0].length,
  });

  // Reset the regex to start at the beginning of string, in case the regex has
  // been used before.
  // See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec#finding_successive_matches
  regex.lastIndex = 0;

  return direction === "forward"
    ? imap(text.matchAll(regex), matchToRange)
    : Array.from(text.matchAll(regex), matchToRange).reverse();
}

function generateGraphemesInRange(
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  throw new Error(
    "Not implemented; we need to vendor this function in from VSCode source code",
  );
}
```
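The core move above, widening each regex match to enclosing grapheme boundaries, can be illustrated standalone with concrete offsets (the names below are illustrative, not part of the sketch above):

```ts
interface Offsets {
  start: number;
  end: number;
}

// Expand a regex match to cover every grapheme it overlaps, mirroring the
// two while-loops in generateGraphemeAwareMatchesInRange above
function expandToGraphemes(match: Offsets, graphemes: Offsets[]): Offsets {
  const overlapping = graphemes.filter(
    (g) => g.end > match.start && g.start < match.end,
  );
  return {
    start: overlapping[0].start,
    end: overlapping[overlapping.length - 1].end,
  };
}

// "🤷" occupies offsets 0-2 (a surrogate pair). A match covering only
// offsets 0-1 gets widened to the whole grapheme:
expandToGraphemes({ start: 0, end: 1 }, [{ start: 0, end: 2 }]);
// => { start: 0, end: 2 }
```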
3. Use generateGraphemeAwareMatchesInRange everywhere we need it
- Switch tokeniser to use generateGraphemeAwareMatchesInRange combined with our existing token regex (see the sketch after this list)
- Switch existing grapheme splitter to just perform normalization, using graphemes already output by generateGraphemeAwareMatchesInRange instead of doing its own grapheme splitting. This will require changes to allocateHats, which coordinates the interaction between tokeniser and grapheme splitter. We should probably rename the existing grapheme splitter to normalizeGraphemes or something
- Switch word scope to use generateGraphemeAwareMatchesInRange combined with our existing word regex
- Switch identifier scope to use generateGraphemeAwareMatchesInRange combined with our existing identifier regex
- Switch token scope to use generateGraphemeAwareMatchesInRange combined with our existing token regex
- Switch character scope to use generateGraphemeAwareMatchesInRange
- Remove the regex expansion behaviour when updating selections (cursorless/src/core/updateSelections/getOffsetsForEmptyRangeInsert.ts, line 68, and cursorless/src/core/updateSelections/getOffsetsForNonEmptyRangeInsert.ts, line 82, at e17628a), as it will require work to get it to be grapheme aware, and it's not even clear if the regex expansion behaviour was ever desirable. We should just switch tokens to be closed
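As a rough sketch of the first item (the regex here is a simplified placeholder, not the real Cursorless token regex), the tokeniser switch-over might look like:

```ts
import { Range, TextEditor } from "@cursorless/common";

// Simplified placeholder; the real token regex lives in tokenizer.ts and,
// as noted in the caveats below, already includes \p{M}
const tokenRegex = /[\p{L}\p{M}\p{N}_]+/gu;

function* generateTokensInRange(editor: TextEditor, range: Range) {
  // Every match comes back pre-expanded to whole graphemes
  yield* generateGraphemeAwareMatchesInRange(tokenRegex, editor, range, "forward");
}
```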
4. Add tests
We should add tests for the following:
- \u0027\u005a\u0351\u036b\u0343\u036a\u0302\u036b\u033d\u034f\u0334\u0319\u0324\u031e\u0349\u035a\u032f\u031e\u0320\u034d\u0041\u036b\u0357\u0334\u0362\u0335\u031c\u0330\u0354\u004c\u0368\u0367\u0369\u0358\u0320\u005f\u0047\u0311\u0357\u030e\u0305\u035b\u0341\u0334\u033b\u0348\u034d\u0354\u0339\u004f\u0342\u030c\u030c\u0358\u0328\u0335\u0339\u033b\u031d\u0333\u005f\u0061\u0069\u0072\u0021\u033f\u030b\u0365\u0365\u0302\u0363\u0310\u0301\u0301\u035e\u035c\u0356\u032c\u0330\u0319\u0317\u0061\u0027 (a string packed with combining marks)
- \u0049\u00f1\u0074\u00eb\u0072\u006e\u00e2\u0074\u0069\u00f4\u006e\u00e0\u006c\u0069\u007a\u00e6\u0074\u0069\u00f8\u006e\u2603\u{1f4a9} (decodes to "Iñtërnâtiônàlizætiøn☃💩"; from https://mathiasbynens.be/notes/javascript-unicode#poo-test)
For each of the above strings, we should run:
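As one example of such a test, here is a sketch using Intl.Segmenter as a stand-in reference splitter; the real tests would exercise the vendored splitter and the functions above:

```ts
import assert from "node:assert";

const text =
  "\u0049\u00f1\u0074\u00eb\u0072\u006e\u00e2\u0074\u0069\u00f4\u006e" +
  "\u00e0\u006c\u0069\u007a\u00e6\u0074\u0069\u00f8\u006e\u2603\u{1f4a9}";

const segments = [
  ...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(text),
].map((s) => s.segment);

// "Iñtërnâtiônàlizætiøn" is 20 graphemes, plus ☃ and 💩
assert.strictEqual(segments.length, 22);

// The astral-plane 💩 must come back as one grapheme, not two surrogates
assert.strictEqual(segments[segments.length - 1], "\u{1f4a9}");
```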
Caveats
Technically speaking, we could still end up splitting a token that we don't want to, if our token regex isn't smart enough to understand some construct that should actually be considered part of the middle of a token. Including \p{M} in our regex, as we do today, should solve almost every problem, and we can always improve our regex if necessary, so this should be a non-issue. In addition, the failure mode here is much less dramatic than today's, as we won't be corrupting graphemes by eg splitting surrogate pairs; we'll just end up with shorter tokens than we'd like in these rare cases.
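To see why \p{M} matters here, a quick illustration (the regex below is a simplified placeholder, not the actual Cursorless token regex):

```ts
// \p{M} matches combining marks, so they attach to the preceding base
// character instead of terminating the token
const tokenish = /[\p{L}\p{M}\p{N}_]+/gu;
console.log("Z\u0351\u036b_foo".match(tokenish)); // ["Z͑ͫ_foo"], one token

// Without \p{M}, the combining marks split the token in two
const noMarks = /[\p{L}\p{N}_]+/gu;
console.log("Z\u0351\u036b_foo".match(noMarks)); // ["Z", "_foo"]
```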