Solidify our Unicode support #1264

Open · pokey opened this issue Feb 10, 2023 · 0 comments
Labels: bug (Something isn't working)

pokey commented Feb 10, 2023

The problem

Cursorless has some support for Unicode, but it could be improved. The core of the problem is that we don't properly determine grapheme boundaries, so our tokeniser and hat placement will sometimes split individual graphemes, resulting in problems such as

  • emoji being split in half by the hat renderer (eg 🤷‍♂️ being split into 🤷♂)
  • deleting half of a surrogate pair, which would result in invalid Unicode in the document
  • etc
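
For concreteness, here's a quick JavaScript illustration (a hypothetical snippet, not Cursorless code) of how naive UTF-16 indexing tears a ZWJ emoji apart:

const shrug = "🤷‍♂️"; // U+1F937 U+200D U+2642 U+FE0F
console.log(shrug.length); // 5 (UTF-16 code units, not graphemes)
console.log(shrug.slice(0, 2)); // "🤷": the ZWJ sequence is torn in half
console.log(shrug.slice(0, 1)); // "\ud83e": a lone surrogate, ie invalid text

Any code that computes ranges in raw code-unit offsets without snapping to cluster boundaries can produce exactly these splits.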

We need to properly handle Unicode in the areas covered by the steps below.

The solution

We'd like to vendor in VSCode's grapheme splitter, which appears to be a complete implementation of Unicode's recommended grapheme cluster splitting algorithm, optimised for use in a code editor. We will use this directly for grapheme splitting. For our other, regex-based segmentation algorithms, we will modify them to automatically extend each match to the nearest cluster boundaries after regex splitting.
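
To make the "extend to the nearest cluster boundary" behaviour concrete, here's a standalone sketch that uses the built-in Intl.Segmenter as the boundary oracle; the real implementation will use the vendored splitter instead:

const text = "a👍🏽b";
const match = /👍/u.exec(text)!; // matches only the base code point of 👍🏽

const segments = new Intl.Segmenter(undefined, {
  granularity: "grapheme",
}).segment(text);

// Snap both ends of the match outward to the containing cluster boundaries
const startSegment = segments.containing(match.index);
const endSegment = segments.containing(match.index + match[0].length - 1);
const snapped = {
  start: startSegment.index,
  end: endSegment.index + endSegment.segment.length,
};
// snapped now covers all of 👍🏽 instead of splitting off the skin tone modifier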

  • Look into https://github.com/flmnt/graphemer as a possible replacement for the VSCode splitter? We should do some benchmarking, as the VSCode implementation seems well optimised for code, where the common case is mostly ASCII

Steps

1. Copy grapheme splitter from VSCode

We begin by copying VSCode's grapheme splitter into our vendor directory (note that we could probably reuse the copy of CharCode that we've already vendored in if necessary). See https://github.com/microsoft/vscode/blob/9bc5f9b708c36a38ba7e5d41e291c1f244af5972/src/vs/editor/common/core/cursorColumns.ts#L44 for example usage of their splitter.

We will probably want to wrap their splitter in something more ergonomic, eg something that returns an iterable of ranges:

function generateGraphemesInRange(editor: TextEditor, range: Range, direction: Direction): Iterable<RangeOffsets>;

2. Create generateGraphemeAwareMatchesInRange function

The file will look roughly as follows:

import { Range, TextEditor } from "@cursorless/common";
import { RangeOffsets } from "../../../typings/updateSelections";
import { imap } from "itertools";

interface RangeWithGraphemes {
  offsets: RangeOffsets;
  graphemes: RangeOffsets[];
}

function* generateGraphemeAwareMatchesInRange(
  regex: RegExp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeWithGraphemes> {
  // Instantiate grapheme iterator for range
  const graphemeIterator = generateGraphemesInRange(editor, range, direction)[
    Symbol.iterator
  ]();
  let grapheme = graphemeIterator.next();

  const matchIterable = generateMatchOffsetsInRange(
    regex,
    editor,
    range,
    direction,
  );
  for (const match of matchIterable) {
    // For each match returned by regex:

    // Discard graphemes until we find the first that overlaps with match.
    // NB: these comparisons assume forward iteration; the backward case
    // would need them mirrored
    while (!grapheme.done && grapheme.value.end <= match.start) {
      grapheme = graphemeIterator.next();
    }

    // Collect all graphemes that overlap with match
    const graphemesInMatch: RangeOffsets[] = [];
    while (!grapheme.done && grapheme.value.start < match.end) {
      graphemesInMatch.push(grapheme.value);
      grapheme = graphemeIterator.next();
    }

    if (graphemesInMatch.length === 0) {
      throw new Error("No graphemes in match; this should never happen");
    }

    // Expand match to include all graphemes, and return the graphemes for
    // convenience
    yield {
      offsets: {
        start: graphemesInMatch[0].start,
        end: graphemesInMatch[graphemesInMatch.length - 1].end,
      },
      graphemes: graphemesInMatch,
    };
  }
}

type Direction = "forward" | "backward";

function generateMatchOffsetsInRange(
  regex: RegExp,
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  const offset = editor.document.offsetAt(range.start);
  const text = editor.document.getText(range);

  const matchToRange = (match: RegExpMatchArray): RangeOffsets => ({
    start: offset + match.index!,
    end: offset + match.index! + match[0].length,
  });

  // Reset the regex to start at the beginning of string, in case the regex has
  // been used before.
  // See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec#finding_successive_matches
  regex.lastIndex = 0;

  return direction === "forward"
    ? imap(text.matchAll(regex), matchToRange)
    : Array.from(text.matchAll(regex), matchToRange).reverse();
}

function generateGraphemesInRange(
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  throw new Error("Not implemented; we need to vendor this function in from VSCode source code");
}
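
As a rough sketch, once the splitter is vendored in, the stub might look as follows. This assumes the vendored code exposes something like VSCode's GraphemeIterator, with offset, eol(), and nextGraphemeLength(); those names are taken from vs/base/common/strings.ts and may differ in our vendored copy:

function* generateGraphemesInRange(
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  const offset = editor.document.offsetAt(range.start);
  const text = editor.document.getText(range);

  // Walk the text one cluster at a time; nextGraphemeLength() is assumed to
  // advance the iterator past one cluster and return its length in code units
  const iterator = new GraphemeIterator(text);
  const graphemes: RangeOffsets[] = [];
  while (!iterator.eol()) {
    const start = iterator.offset;
    const length = iterator.nextGraphemeLength();
    graphemes.push({ start: offset + start, end: offset + start + length });
  }

  yield* direction === "forward" ? graphemes : graphemes.reverse();
}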

3. Use generateGraphemeAwareMatchesInRange everywhere we need it

  • Switch tokeniser to use generateGraphemeAwareMatchesInRange combined with our existing token regex
  • Switch the existing grapheme splitter to just perform normalization, using the graphemes already output by generateGraphemeAwareMatchesInRange instead of doing its own grapheme splitting. This will require changes to allocateHats, which coordinates the interaction between the tokeniser and the grapheme splitter. We should probably rename the existing grapheme splitter to normalizeGraphemes or something
  • Switch word scope to use generateGraphemeAwareMatchesInRange combined with our existing word regex (see the sketch after this list)
  • Switch identifier scope to use generateGraphemeAwareMatchesInRange combined with our existing identifier regex
  • Switch token scope to use generateGraphemeAwareMatchesInRange combined with our existing token regex
  • Switch character scope to use generateGraphemeAwareMatchesInRange
  • Disable regex token expansion (see eg
    const matches = text.match(leftAnchored(expansionBehavior.regex));
    ), as it will require work to get it to be grapheme aware, and it's not even clear that the regex expansion behaviour was ever desirable. We should just switch tokens to be closed
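
As a sketch of what these switches look like, here's a hypothetical word scope built on the new helper; WORD_REGEX stands in for whatever word regex we already use, so only the segmentation plumbing changes:

function* generateWordRanges(
  editor: TextEditor,
  range: Range,
  direction: Direction,
): Iterable<RangeOffsets> {
  for (const { offsets } of generateGraphemeAwareMatchesInRange(
    WORD_REGEX, // hypothetical stand-in for our existing word regex
    editor,
    range,
    direction,
  )) {
    // Each match has already been expanded to cluster boundaries, so
    // downstream consumers can never split a grapheme
    yield offsets;
  }
}

The character scope is even simpler: the graphemes field of each yielded match gives the character ranges directly.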

4. Add tests

We should add tests for the following:

  • 🤷‍♂️ (zero-width joiner)
  • 👍🏽 (Fitzpatrick modifier)
  • ✏️ (variation selector)
  • 💩 (surrogate pair)
  • \u0027\u005a\u0351\u036b\u0343\u036a\u0302\u036b\u033d\u034f\u0334\u0319\u0324\u031e\u0349\u035a\u032f\u031e\u0320\u034d\u0041\u036b\u0357\u0334\u0362\u0335\u031c\u0330\u0354\u004c\u0368\u0367\u0369\u0358\u0320\u005f\u0047\u0311\u0357\u030e\u0305\u035b\u0341\u0334\u033b\u0348\u034d\u0354\u0339\u004f\u0342\u030c\u030c\u0358\u0328\u0335\u0339\u033b\u031d\u0333\u005f\u0061\u0069\u0072\u0021\u033f\u030b\u0365\u0365\u0302\u0363\u0310\u0301\u0301\u035e\u035c\u0356\u032c\u0330\u0319\u0317\u0061\u0027

  • Steal test cases from lodash
  • \u0049\u00f1\u0074\u00eb\u0072\u006e\u00e2\u0074\u0069\u00f4\u006e\u00e0\u006c\u0069\u007a\u00e6\u0074\u0069\u00f8\u006e\u2603\u{1f4a9}

(from https://mathiasbynens.be/notes/javascript-unicode#poo-test)

  • Try to find VSCode Unicode tests

For each of the above strings, we should run each of the following (a test sketch follows the list):

  • Tokeniser
  • Grapheme splitter
  • Word scope type
  • Character scope type
  • Token scope type
  • Identifier scope type
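
A possible shape for these tests, using Intl.Segmenter as the reference implementation; segmentersUnderTest is a hypothetical list wiring up the six segmenters above:

import assert from "node:assert";

// Hypothetical: tokeniser, grapheme splitter, and the four scope types, each
// exposed as a function from text to the RangeOffsets it produces
declare const segmentersUnderTest: Array<(text: string) => RangeOffsets[]>;

// Every boundary produced by one of our segmenters must also be a grapheme
// cluster boundary in the reference segmentation
function clusterBoundaries(text: string): Set<number> {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const boundaries = new Set<number>([text.length]);
  for (const { index } of segmenter.segment(text)) {
    boundaries.add(index);
  }
  return boundaries;
}

const testStrings = ["🤷‍♂️", "👍🏽", "✏️", "💩"];

for (const text of testStrings) {
  const boundaries = clusterBoundaries(text);
  for (const segmenterUnderTest of segmentersUnderTest) {
    for (const { start, end } of segmenterUnderTest(text)) {
      assert(boundaries.has(start) && boundaries.has(end));
    }
  }
}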

Caveats

Technically speaking, we could end up splitting a token that we don't want to, if our token regex isn't smart enough to understand some construct that should actually be considered part of the middle of a token. Including \p{M} in our regex, as we do today, should solve almost every problem, and we can always improve our regex if necessary, so this should be a non-issue. In addition, the failure mode here is much less dramatic than today's: we won't be corrupting graphemes by eg splitting surrogate pairs; we'll just end up with shorter tokens than we'd like in these rare cases.
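
For example, the \p{M} escape keeps combining marks attached to their base character, so even a decomposed "café" tokenises as a single token:

// Hypothetical simplified token regex; our real one is more involved
const tokenRegex = /[\p{L}\p{M}_][\p{L}\p{M}\p{N}_]*/gu;
console.log("cafe\u0301".match(tokenRegex)); // [ "café" ], with U+0301 attached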
