The Optimistic Runtime Parser #289

Merged: 70 commits into projectfluent:zeroseven on Oct 12, 2018

Conversation

@stasm (Contributor) commented Oct 10, 2018

I re-wrote the runtime parser. It supports Syntax 0.7, runs against the reference fixtures, has half the lines of code, and it's a bit slower on SpiderMonkey (but also a bit faster on V8, so I assume the perf will improve over time).

Goals

  1. Support 100% of Fluent Syntax 0.7. This includes the indentation relaxation, dropping tabs and CR as syntax whitespace, normalizing new lines to LF, and only allowing numbers and identifiers as variant keys.
  2. Maintain good performance. The parser is used in performance-critical code paths. Back in the days of Firefox OS it had to be both fast and produce tightly packed results so that translations don't take up too much space on the device. I think the storage requirements can be relaxed these days.
  3. Write code which will be easy to maintain in the future. The parser was first written even before Fluent branched off from L20n. It's seen many changes and additions over the last two years. As new features accrued, it became hard to maintain and to keep track of all known bugs. My goal for the re-write was not only to clean the parser up but also to define the conformance story for the future and to improve the testing infrastructure.

Design

The parser focuses on minimizing the number of false negatives at the expense of increasing the risk of false positives. In other words, it aims to parse valid Fluent messages with a success rate of 100%, but it may also parse some invalid messages which the reference parser would reject. The parser doesn't perform any validation and may produce entries which wouldn't make sense in the real world. For best results, users are advised to validate translations with the fluent-syntax parser before runtime.

The main parser loop iterates over the beginnings of messages and terms. This is to efficiently skip over comments (which have no use at runtime) and to recover from errors. When a fatal error is encountered, the parser instantly bails out of the currently-parsed message and moves on to the next one. Errors are discarded and are not visible to the users of FluentResource. They do carry a minimal description of what went wrong, though, which may be useful when reading the code and for debugging.
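As a minimal sketch of this recovery strategy (the names and the much-simplified start regex are illustrative, not the actual fluent.js API):

```javascript
// Hypothetical sketch of the outer loop: a global multiline regex finds the
// next message start; a thrown error discards only the current entry and
// the loop resumes at the following start.
const RE_MESSAGE_START = /^([a-zA-Z][a-zA-Z0-9_-]*) *= */mg;

function parseResource(source, parseValue) {
  const entries = {};
  let match;
  while ((match = RE_MESSAGE_START.exec(source)) !== null) {
    try {
      entries[match[1]] = parseValue(source, RE_MESSAGE_START.lastIndex);
    } catch (err) {
      // Fatal error: bail out of this message and move on to the next one.
    }
  }
  return entries;
}

// Stub value parser for this sketch: reads to the end of the line and
// rejects anything with a placeable-like "{".
const toLineEnd = (src, pos) => {
  const end = src.indexOf("\n", pos);
  const value = end === -1 ? src.slice(pos) : src.slice(pos, end);
  if (value.includes("{")) throw new SyntaxError("not supported in this sketch");
  return value;
};

parseResource("hello = Hello\nbroken = {$\nworld = World", toLineEnd);
// → { hello: "Hello", world: "World" }
```

Note how the broken middle entry is silently dropped while the entries around it still parse.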

The parser makes extensive use of sticky regexes, which can be anchored to any offset of the source string without slicing it. In some places it's easier to just check the character currently at the cursor, so it does a fair share of that, too.
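For illustration, a sticky (`y`) regex is re-anchored by setting its `lastIndex`; `matchAt` below is a hypothetical helper, not the parser's actual API:

```javascript
// Sticky (/y) regexes only match starting exactly at lastIndex, so the
// parser can test the source at the current cursor without slicing it.
const RE_IDENTIFIER = /[a-zA-Z][a-zA-Z0-9_-]*/y;

// Hypothetical helper: try to match `re` at `cursor` in `source`.
function matchAt(re, source, cursor) {
  re.lastIndex = cursor;
  const result = re.exec(source);
  return result ? result[0] : null;
}

const source = "hello = World";
matchAt(RE_IDENTIFIER, source, 0); // → "hello"
matchAt(RE_IDENTIFIER, source, 6); // → null ("=" is at offset 6)
matchAt(RE_IDENTIFIER, source, 8); // → "World"
```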

Conformance

My original plan was to base the parser on the EBNF and only parse well-formed syntax. In this PR, I went for something a bit wider than that: a superset of well-formed syntax. The main deviation from the EBNF is related to parsing VariantExpressions and CallExpressions. The EBNF verifies that they are called on Terms and Functions respectively. The optimistic parser doesn't differentiate between Messages, Terms and Functions. I decided to implement it this way because this code might soon change anyway (see projectfluent/fluent#176).

Another deviation is that the parser treats commas in argument lists as whitespace, similar to how Clojure treats them in sequence lists. I might suggest we upstream this in the spec, too, because it makes the implementation of args lists much simpler.
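A sketch of this idea, assuming a whitespace token with an optional comma folded in (the `skipComma` helper is hypothetical):

```javascript
// Sketch: folding an optional comma into the whitespace token makes
// comma-separated and space-separated argument lists parse the same way.
const TOKEN_COMMA = /\s*,?\s*/y;

// Hypothetical helper: consume the token at `cursor`, return the new cursor.
function skipComma(source, cursor) {
  TOKEN_COMMA.lastIndex = cursor;
  TOKEN_COMMA.exec(source); // Always succeeds; may match the empty string.
  return TOKEN_COMMA.lastIndex;
}

skipComma("a, b", 1); // → 3 (cursor now at "b")
skipComma("a b", 1);  // → 2 (space only; the comma is optional)
```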

I based this PR on top of the zeroseven branch. The fluent-syntax parser already supports Syntax 0.7 and passes the [reference fixtures](https://github.com/projectfluent/fluent/tree/master/test/fixtures), which made it possible to turn on reference testing in the runtime parser, too. `make fixtures` creates the parsed results for all reference fixtures; for now, they must be verified manually before they're committed. `make test` can be used in development to assert that the output of the runtime parser still matches the committed one.

Performance

I'm seeing a slight slowdown when running make perf with SpiderMonkey (make perf-jsshell). Interestingly, I'm also seeing a slight speed up in V8.

SpiderMonkey 63 (in milliseconds, relative to master):
    mean:   5.99 (+26%)
    stdev:  0.22
    sample: 30

node 9.11 (in milliseconds, relative to master):
    mean:   3.14 (-15%)
    stdev:  0.12
    sample: 30

I asked @zbraniecki to run a Talos test using this branch. We'll see if the slowdown is visible during Firefox startup, and we'll then decide what to do about it. I'm also seeing an opportunity to create a test case for the SpiderMonkey team. Maybe we're using features of the engine which haven't been heavily optimized yet? (This also applies to the fluent-syntax parser, which is 4 times slower in SM than in V8.)

@zbraniecki (Collaborator) commented

I created a Talos try run which compares about:preferences loading with cache disabled on m-c vs. this patch.

My hypothesis is that this will let us see whether, over 20 runs of about:preferences in Firefox, there is any measurable performance impact of this patchset compared to central.

The results should be in within 2 hours: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=28966f66cf31&newProject=try&newRevision=c6ac0d1a5166&framework=1

stasm added a commit to projectfluent/fluent-runtime-perf that referenced this pull request Oct 10, 2018
@stasm (Contributor, Author) commented Oct 10, 2018

I created an isolated test case for measuring the performance of the parser: https://projectfluent.org/fluent-runtime-perf/. It looks like SpiderMonkey is slower than V8 only with smaller resources. The real-life resources are usually on the smaller side.

@zbraniecki (Collaborator) commented

I compared central, no-cache and no-cache + this scenarios: https://pike.github.io/talos-compare/?revision=f293199791bf&revision=28966f66cf31&revision=c6ac0d1a5166

Based on the read, I'd say there's no impact to a minor win on all platforms. I'm confident that this change does not pose a performance regression in the Gecko scenario.

@stasm stasm requested review from zbraniecki and Pike October 11, 2018 07:21
@stasm (Contributor, Author) commented Oct 11, 2018

@Pike, @zbraniecki, I'm asking you both to review this. Let me know if you're available and if you'd like me to provide more details on this PR. Thanks!

@stasm (Contributor, Author) commented Oct 11, 2018

I may have found the cause of the slowdown I observed yesterday. I documented the regexes used by the parser this morning, and in the process I realized that the RE_ATTRIBUTE_START and RE_VARIANT_START regexes could be used more idiomatically. Previously, I would use them inside the corresponding while loop to break out of it. I changed the code to use the regex as the while loop's end condition instead. That's not what made a difference performance-wise, however. What did make a difference is that I also removed an unneeded skipBlank from the while loop. Pattern already skips all its trailing blank space; doing it again before the next attribute/variant wasn't useful.

With the extra skipBlank (milliseconds, compared to master):
  mean:   6.02 (+27%)
  stdev:  0.38
  sample: 30

Without it (milliseconds, compared to master):
  mean:   4.82 (+1%)
  stdev:  0.32
  sample: 30
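A hypothetical sketch of the shape of the refactored loop (simplified regex, stub pattern parser; not the actual fluent.js code):

```javascript
// After the change: the start-of-attribute regex is the while loop's end
// condition instead of a break inside the loop, and the redundant skipBlank
// call is gone because the pattern parser already consumes its trailing
// blank space.
const RE_ATTRIBUTE_START = /\.([a-zA-Z][a-zA-Z0-9_-]*) *= */y;

function parseAttributes(source, cursor, parsePattern) {
  const attrs = {};
  RE_ATTRIBUTE_START.lastIndex = cursor;
  let match;
  while ((match = RE_ATTRIBUTE_START.exec(source)) !== null) {
    const [value, next] = parsePattern(source, RE_ATTRIBUTE_START.lastIndex);
    attrs[match[1]] = value;
    RE_ATTRIBUTE_START.lastIndex = next;
  }
  return attrs;
}

// Stub pattern parser for this sketch: reads to the end of the line and
// skips the newline.
const toLineEnd = (src, pos) => {
  const end = src.indexOf("\n", pos);
  return [src.slice(pos, end), end + 1];
};

parseAttributes(".title = Hi\n.tooltip = Hello\n", 0, toLineEnd);
// → { title: "Hi", tooltip: "Hello" }
```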

@zbraniecki (Collaborator) left a comment

This looks great! Very well documented and clean to read. I'm surprised by the performance of this code, because I remember this style of code not working great in SM back when we started working on the runtime parser, but I'm happy to see this has improved :)

Stylistically, I'm not a huge fan of naming functions in the parser starting with a capital letter, as per [0]:

> In JavaScript, functions should use camelCase, but should not capitalize the first letter.

but that's not a significant concern, and if you prefer it, I'm sure we can read the code just fine :)

[0] https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style#General_C.2FC.2B.2B_Practices

const TOKEN_ARROW = /\s*->\s*/y;
const TOKEN_COLON = /\s*:\s*/y;
// As a deviation from the well-formed Fluent grammar, accept argument lists
// without commas between arguments.
@zbraniecki (Collaborator) replied:
The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand this comment. How's it different from other normalizations?

@stasm (Contributor, Author) replied:
The reason will be displayed to describe this comment to others. Learn more.

Can you help me re-word the comment? What do you mean by "other normalizations"? The regex under this comment has an optional comma (`,?`), which essentially means that commas in argument lists are treated as whitespace.

@stasm (Contributor, Author) replied:
The reason will be displayed to describe this comment to others. Learn more.

How does the following sound?

// Note the optional comma. As a deviation from the Fluent EBNF, the parser
// accepts lists of call arguments without commas between them.
const TOKEN_COMMA = /\s*,?\s*/y;

@zbraniecki (Collaborator) replied:
The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I somehow misread the comment as "accept argument lists with commas between arguments". Maybe: "As a deviation from the well-formed Fluent grammar, this parser does not enforce commas between arguments."

var first = Match(RE_TEXT_RUN);
}

// If there's an backslash escape or a placeable on the first line, fall
@zbraniecki (Collaborator) commented:
The reason will be displayed to describe this comment to others. Learn more.

"there's a backslash"

@stasm (Contributor, Author) commented Oct 11, 2018

Thanks, @zbraniecki! I've wondered about the naming (of course I have :)) and I like using nouns over verbs specifically for writing parsers. I've used this same convention in the reference parser, so I might be biased by the code I already wrote before. To me, nouns represent well the concepts of tokens and grammar productions. I understand that not everyone may think that way.

I used the capitalization consistently for functions which return a parsed result, as opposed to functions which are operators or helpers. In a way, the parser uses its own DSL to compose smaller productions into bigger ones. Again, this is something I took away from writing the reference parser.

let selector = InlineExpression();
if (token(TOKEN_BRACE_CLOSE)) {
    return selector;
}

Do you think using verbs would make the code easier to read for you and for newcomers? Example:

let selector = parseInlineExpression();
if (skipToken(TOKEN_BRACE_CLOSE)) {
    return selector;
}

@stasm (Contributor, Author) commented Oct 11, 2018

Related: what's your take on using 4 spaces for indentation? I know Mozilla guidelines say 2 spaces for JS, but having more breathing room makes the code easier to read for me. I wonder if it's the same for other people. It also occurs to me that 2 spaces were perhaps a good compromise in the days of callbacks. A compromise which may have outlived its usefulness in the current days of async functions :) Can we turn this lint off in mozilla-central for Fluent files? This re-write would be a good opportunity to change the indentation.

@zbraniecki (Collaborator) commented

> Do you think using verbs would make the code easier to read for you and for newcomers?

I don't think it matters that much. My only concern with `attrs.push(Attribute())` is that it looks like a class rather than a function.

I looked at the Rust parser [0] and they do use parse_foo, get_foo, expect_foo, eat_foo, etc.

> what's your take on using 4 spaces for indentation?

No strong opinion. I'm comfortable with 2 spaces in JS, C++ and with 4 in Rust, so I don't expect a big difference.

One thing is that I would prefer to migrate all at once, and would prefer not to touch it while we're updating the code. I'd say that renaming variables/properties/classes/methods while updating the logic is already the single greatest burden I've encountered when reviewing Fluent code. I'd love not to add whitespace conventions to that. :)

[0] https://github.com/rust-lang/rust/blob/master/src/libsyntax/parse/parser.rs

@stasm (Contributor, Author) commented Oct 11, 2018

OK, let's keep 2 spaces in this PR and discuss changing all code to 4 in a separate issue.

@stasm (Contributor, Author) commented Oct 11, 2018

I decided to go with the verb+noun naming scheme: `consumeToken`, `parseInlineExpression`, etc. It's explicit, predictable, can't be confused with class construction, and hopefully won't catch anyone off-guard. Thanks for the feedback!

This is now ready to merge.

@stasm stasm merged commit 2071570 into projectfluent:zeroseven Oct 12, 2018
@stasm stasm deleted the optimistic-runtime branch October 12, 2018 06:37
stasm added a commit that referenced this pull request Oct 12, 2018
This is a re-write of the runtime parser. It supports Fluent Syntax 0.7, runs
against the reference fixtures, has half the lines of code, and is as fast in
SpiderMonkey as the old one (and slightly faster in V8).

