The Optimistic Runtime Parser #289

Merged: 70 commits into projectfluent:zeroseven on Oct 12, 2018

Conversation

@stasm (Contributor) commented Oct 10, 2018

I re-wrote the runtime parser. It supports Syntax 0.7, runs against the reference fixtures, has half the lines of code, and it's a bit slower on SpiderMonkey (but also a bit faster on V8, so I assume the perf will improve over time).

Goals

  1. Support 100% of Fluent Syntax 0.7. This includes the indentation relaxation, dropping tabs and CR as syntax whitespace, normalizing new lines to LF, and only allowing numbers and identifiers as variant keys.
  2. Maintain good performance. The parser is used in performance-critical code paths. Back in the days of Firefox OS it had to be both fast and produce tightly packed results so that translations don't take up too much space on the device. I think the storage requirements can be relaxed these days.
  3. Write code which will be easy to maintain in the future. The parser was first written even before Fluent branched off from L20n. It's seen many changes and additions over the last two years. As new features accrued, it became hard to maintain and to keep track of all known bugs. My goal for the re-write was not only to clean the parser up but also to define the conformance story for the future and to improve the testing infrastructure.

Design

The parser focuses on minimizing the number of false negatives at the expense of increasing the risk of false positives. In other words, it aims to parse valid Fluent messages with a success rate of 100%, but it may also parse some invalid messages which the reference parser would reject. The parser doesn't perform any validation and may produce entries which wouldn't make sense in the real world. For best results, users are advised to validate translations with the fluent-syntax parser before runtime.

The main parser loop iterates over the beginnings of messages and terms. This is to efficiently skip over comments (which have no use at runtime) and to recover from errors. When a fatal error is encountered, the parser instantly bails out of the currently-parsed message and moves on to the next one. Errors are discarded and are not visible to the users of FluentResource. They do carry a minimal description of what went wrong, though, which may be useful when reading the code and for debugging.
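As a minimal sketch of this recovery strategy (the names and the much-simplified start regex are illustrative, not the actual fluent.js API):

```javascript
// Hypothetical sketch of the outer loop: a global multiline regex finds the
// next message start; a thrown error discards only the current entry and
// the loop resumes at the following start.
const RE_MESSAGE_START = /^([a-zA-Z][a-zA-Z0-9_-]*) *= */mg;

function parseResource(source, parseValue) {
  const entries = {};
  let match;
  while ((match = RE_MESSAGE_START.exec(source)) !== null) {
    try {
      entries[match[1]] = parseValue(source, RE_MESSAGE_START.lastIndex);
    } catch (err) {
      // Fatal error: bail out of this message and move on to the next one.
    }
  }
  return entries;
}

// Stub value parser for this sketch: reads to the end of the line and
// rejects anything with a placeable-like "{".
const toLineEnd = (src, pos) => {
  const end = src.indexOf("\n", pos);
  const value = end === -1 ? src.slice(pos) : src.slice(pos, end);
  if (value.includes("{")) throw new SyntaxError("not supported in this sketch");
  return value;
};

parseResource("hello = Hello\nbroken = {$\nworld = World", toLineEnd);
// → { hello: "Hello", world: "World" }
```

Note how the broken middle entry is silently dropped while the entries around it still parse.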

The parser makes extensive use of sticky regexes, which can be anchored to any offset of the source string without slicing it. In some places it's easier to just check the character currently at the cursor, so it does a fair share of that, too.
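For illustration, a sticky (`y`) regex is re-anchored by setting its `lastIndex`; `matchAt` below is a hypothetical helper, not the parser's actual API:

```javascript
// Sticky (/y) regexes only match starting exactly at lastIndex, so the
// parser can test the source at the current cursor without slicing it.
const RE_IDENTIFIER = /[a-zA-Z][a-zA-Z0-9_-]*/y;

// Hypothetical helper: try to match `re` at `cursor` in `source`.
function matchAt(re, source, cursor) {
  re.lastIndex = cursor;
  const result = re.exec(source);
  return result ? result[0] : null;
}

const source = "hello = World";
matchAt(RE_IDENTIFIER, source, 0); // → "hello"
matchAt(RE_IDENTIFIER, source, 6); // → null ("=" is at offset 6)
matchAt(RE_IDENTIFIER, source, 8); // → "World"
```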

Conformance

My original plan was to base the parser on the EBNF and only parse well-formed syntax. In this PR, I went for something a bit wider than that: a superset of well-formed syntax. The main deviation from the EBNF is related to parsing VariantExpressions and CallExpressions. The EBNF verifies that they are called on Terms and Functions respectively. The optimistic parser doesn't differentiate between Messages, Terms and Functions. I decided to implement it this way because this code might soon change anyway (see projectfluent/fluent#176).

Another deviation is that the parser treats commas in argument lists as whitespace, similar to how Clojure treats them in sequence lists. I might suggest we upstream this in the spec, too, because it makes the implementation of args lists much simpler.
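A sketch of this idea, assuming a whitespace token with an optional comma folded in (the `skipComma` helper is hypothetical):

```javascript
// Sketch: folding an optional comma into the whitespace token makes
// comma-separated and space-separated argument lists parse the same way.
const TOKEN_COMMA = /\s*,?\s*/y;

// Hypothetical helper: consume the token at `cursor`, return the new cursor.
function skipComma(source, cursor) {
  TOKEN_COMMA.lastIndex = cursor;
  TOKEN_COMMA.exec(source); // Always succeeds; may match the empty string.
  return TOKEN_COMMA.lastIndex;
}

skipComma("a, b", 1); // → 3 (cursor now at "b")
skipComma("a b", 1);  // → 2 (space only; the comma is optional)
```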

I based this PR on top of the zeroseven branch. The fluent-syntax parser already supports Syntax 0.7 and passes the [reference fixtures](https://github.com/projectfluent/fluent/tree/master/test/fixtures), which made it possible to turn on reference testing in the runtime parser, too. `make fixtures` creates the parsed results for all reference fixtures; for now, they must be verified manually before they're committed. `make test` can be used in development to assert that the output of the runtime parser still matches the committed one.

Performance

I'm seeing a slight slowdown when running make perf with SpiderMonkey (make perf-jsshell). Interestingly, I'm also seeing a slight speed up in V8.

SpiderMonkey 63 (in milliseconds, relative to master):
    mean:   5.99 (+26%)
    stdev:  0.22
    sample: 30

node 9.11 (in milliseconds, relative to master):
    mean:   3.14 (-15%)
    stdev:  0.12
    sample: 30

I asked @zbraniecki to run a Talos test using this branch. We'll see if the slowdown is visible during Firefox startup, and we'll then decide what to do about it. I'm also seeing an opportunity to create a test case for the SpiderMonkey team. Maybe we're using features of the engine which haven't been heavily optimized yet? (This also applies to the fluent-syntax parser, which is 4 times slower in SM than in V8.)

@zbraniecki (Collaborator) commented

I created a Talos try run which compares about:preferences loading with cache disabled on m-c vs. this patch.

My hypothesis is that this will let us see whether, over 20 runs of about:preferences in Firefox, there is any measurable performance impact of this patchset compared to central.

The results should be in within 2 hours: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=28966f66cf31&newProject=try&newRevision=c6ac0d1a5166&framework=1

stasm added a commit to projectfluent/fluent-runtime-perf that referenced this pull request Oct 10, 2018
@stasm (Contributor, Author) commented Oct 10, 2018

I created an isolated test case for measuring the performance of the parser: https://projectfluent.org/fluent-runtime-perf/. It looks like SpiderMonkey is slower than V8 only with smaller resources. The real-life resources are usually on the smaller side.

@zbraniecki (Collaborator) commented

I compared central, no-cache and no-cache + this scenarios: https://pike.github.io/talos-compare/?revision=f293199791bf&revision=28966f66cf31&revision=c6ac0d1a5166

Based on the read, I'd say there's no impact to a minor win on all platforms. I'm confident that this change does not pose a performance regression in the Gecko scenario.

@stasm stasm requested review from zbraniecki and Pike October 11, 2018 07:21
@stasm (Contributor, Author) commented Oct 11, 2018

@Pike, @zbraniecki, I'm asking you both to review this. Let me know if you're available and if you'd like me to provide more details on this PR. Thanks!

@stasm (Contributor, Author) commented Oct 11, 2018

I may have found the cause of the slowdown I observed yesterday. I documented the regexes used by the parser this morning, and in the process I realized that the RE_ATTRIBUTE_START and RE_VARIANT_START regexes could be used more idiomatically. Previously, I would use them inside the corresponding while loop to break out of it. I changed the code to use the regex as the while loop's end condition instead. That's not what made a difference performance-wise, however. What did make a difference is that I also removed an unneeded skipBlank from the while loop. Pattern already skips all its trailing blank space; doing it again before the next attribute/variant wasn't useful.

With the extra skipBlank (milliseconds, compared to master):
  mean:   6.02 (+27%)
  stdev:  0.38
  sample: 30

Without it (milliseconds, compared to master):
  mean:   4.82 (+1%)
  stdev:  0.32
  sample: 30
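A hypothetical sketch of the shape of the refactored loop (simplified regex, stub pattern parser; not the actual fluent.js code):

```javascript
// After the change: the start-of-attribute regex is the while loop's end
// condition instead of a break inside the loop, and the redundant skipBlank
// call is gone because the pattern parser already consumes its trailing
// blank space.
const RE_ATTRIBUTE_START = /\.([a-zA-Z][a-zA-Z0-9_-]*) *= */y;

function parseAttributes(source, cursor, parsePattern) {
  const attrs = {};
  RE_ATTRIBUTE_START.lastIndex = cursor;
  let match;
  while ((match = RE_ATTRIBUTE_START.exec(source)) !== null) {
    const [value, next] = parsePattern(source, RE_ATTRIBUTE_START.lastIndex);
    attrs[match[1]] = value;
    RE_ATTRIBUTE_START.lastIndex = next;
  }
  return attrs;
}

// Stub pattern parser for this sketch: reads to the end of the line and
// skips the newline.
const toLineEnd = (src, pos) => {
  const end = src.indexOf("\n", pos);
  return [src.slice(pos, end), end + 1];
};

parseAttributes(".title = Hi\n.tooltip = Hello\n", 0, toLineEnd);
// → { title: "Hi", tooltip: "Hello" }
```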

@zbraniecki (Collaborator) left a comment

This looks great! Very well documented and clean to read. I'm surprised by the performance of this code, because I remember this style of code not working great in SM back when we started working on the runtime parser, but I'm happy to see this has improved :)

Stylistically, I'm not a huge fan of naming functions in the parser starting with a capital letter, as per [0]:

> In JavaScript, functions should use camelCase, but should not capitalize the first letter.

but that's not a significant concern, and if you prefer it, I'm sure we can read the code just fine :)

[0] https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style#General_C.2FC.2B.2B_Practices

const TOKEN_ARROW = /\s*->\s*/y;
const TOKEN_COLON = /\s*:\s*/y;
// As a deviation from the well-formed Fluent grammar, accept argument lists
// without commas between arguments.
@zbraniecki (Collaborator) replied:
The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand this comment. How's it different from other normalizations?

@stasm (Contributor, Author) replied:
The reason will be displayed to describe this comment to others. Learn more.

Can you help me re-word the comment? What do you mean by "other normalizations"? The regex under this comment has an optional comma (`,?`), which essentially means that commas in argument lists are treated as whitespace.

@stasm (Contributor, Author) replied:
The reason will be displayed to describe this comment to others. Learn more.

How does the following sound?

// Note the optional comma. As a deviation from the Fluent EBNF, the parser
// accepts lists of call arguments without commas between them.
const TOKEN_COMMA = /\s*,?\s*/y;

@zbraniecki (Collaborator) replied:
The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I somehow misread the comment as "accept argument lists with commas between arguments". Maybe: "As a deviation from the well-formed Fluent grammar, this parser does not enforce commas between arguments."

var first = Match(RE_TEXT_RUN);
}

// If there's an backslash escape or a placeable on the first line, fall
@zbraniecki (Collaborator) commented:
The reason will be displayed to describe this comment to others. Learn more.

"there's a backslash"

@stasm (Contributor, Author) commented Oct 11, 2018

Thanks, @zbraniecki! I've wondered about the naming (of course I have :)) and I like using nouns over verbs specifically for writing parsers. I've used this same convention in the reference parser, so I might be biased by the code I already wrote before. To me, nouns represent well the concepts of tokens and grammar productions. I understand that not everyone may think that way.

I used the capitalization consistently for functions which return a parsed result, as opposed to functions which are operators or helpers. In a way, the parser uses its own DSL to compose smaller productions into bigger ones. Again, this is something I took away from writing the reference parser.

let selector = InlineExpression();
if (token(TOKEN_BRACE_CLOSE)) {
    return selector;
}

Do you think using verbs would make the code easier to read for you and for newcomers? Example:

let selector = parseInlineExpression();
if (skipToken(TOKEN_BRACE_CLOSE)) {
    return selector;
}

@stasm (Contributor, Author) commented Oct 11, 2018

Related: what's your take on using 4 spaces for indentation? I know Mozilla guidelines say 2 spaces for JS, but having more breathing room makes the code easier to read for me. I wonder if it's the same for other people. It also occurs to me that 2 spaces were perhaps a good compromise in the days of callbacks. A compromise which may have outlived its usefulness in the current days of async functions :) Can we turn this lint off in mozilla-central for Fluent files? This re-write would be a good opportunity to change the indentation.

@zbraniecki (Collaborator) commented

> Do you think using verbs would make the code easier to read for you and for newcomers?

I don't think it matters that much. My only concern with `attrs.push(Attribute())` is that it looks like a class rather than a function.

I looked at the Rust parser [0] and they do use parse_foo, get_foo, expect_foo, eat_foo, etc.

> what's your take on using 4 spaces for indentation?

No strong opinion. I'm comfortable with 2 spaces in JS, C++ and with 4 in Rust, so I don't expect a big difference.

One thing is that I would prefer to migrate all at once, and would prefer not to touch it while we're updating the code. I'd say that renaming variables/properties/classes/methods while updating the logic is already the single greatest burden I've encountered when reviewing Fluent code. I'd love not to add whitespace conventions to that. :)

[0] https://github.com/rust-lang/rust/blob/master/src/libsyntax/parse/parser.rs

@stasm (Contributor, Author) commented Oct 11, 2018

OK, let's keep 2 spaces in this PR and discuss changing all code to 4 in a separate issue.

@stasm (Contributor, Author) commented Oct 11, 2018

I decided to go with the verb+noun naming scheme: `consumeToken`, `parseInlineExpression`, etc. It's explicit, predictable, can't be confused with class construction, and hopefully won't catch anyone off-guard. Thanks for the feedback!

This is now ready to merge.

@stasm stasm merged commit 2071570 into projectfluent:zeroseven Oct 12, 2018
@stasm stasm deleted the optimistic-runtime branch October 12, 2018 06:37
stasm added a commit that referenced this pull request Oct 12, 2018
This is a re-write of the runtime parser. It supports Fluent Syntax 0.7, runs
against the reference fixtures, has half the lines of code, and is as fast in
SpiderMonkey as the old one (and slightly faster in V8).

