The Optimistic Runtime Parser #289
I created a Talos try run which compares about_preferences loading with the cache disabled on m-c vs. this patch. My hypothesis is that over 20 runs of about:preferences in Firefox we will be able to observe any measurable performance impact of this patchset compared to central. The results should be in within 2 hours: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=28966f66cf31&newProject=try&newRevision=c6ac0d1a5166&framework=1
I created an isolated test case for measuring the performance of the parser: https://projectfluent.org/fluent-runtime-perf/. It looks like SpiderMonkey is slower than V8 only with smaller resources. The real-life resources are usually on the smaller side.
I compared the central, no-cache and no-cache + this patch scenarios: https://pike.github.io/talos-compare/?revision=f293199791bf&revision=28966f66cf31&revision=c6ac0d1a5166 Based on my reading of the results, I'd say the impact ranges from none to a minor win on all platforms. I'm confident that this change does not cause a performance regression in the Gecko scenario.
@Pike, @zbraniecki, I'm asking you both to review this. Let me know if you're available and if you'd like me to provide more details on this PR. Thanks!
I may have found the cause of the slowdown which I observed yesterday. I documented the regexes used by the parser this morning and in the process I realized that the
This looks great! Very well documented and clean to read. I'm surprised by the performance of this code because I remember such code not working great in SM back when we started working on the runtime parser, but I'm happy to see this improved :)
Stylistically, I'm not a huge fan of naming functions in the parser starting with a capital letter, as per [0]:
In JavaScript, functions should use camelCase, but should not capitalize the first letter.
but that's not a significant concern and if you prefer it, I'm sure we can read the code just fine :)
fluent/src/resource.js
Outdated
const TOKEN_ARROW = /\s*->\s*/y;
const TOKEN_COLON = /\s*:\s*/y;
// As a deviation from the well-formed Fluent grammar, accept argument lists
// without commas between arguments.
I don't think I understand this comment. How's it different from other normalizations?
Can you help me re-word the comment? What do you mean by "other normalizations"? The regex under this comment has an optional comma (`,?`), which essentially means that commas in argument lists are treated as whitespace.
How does the following sound?
// Note the optional comma. As a deviation from the Fluent EBNF, the parser
// accepts lists of call arguments without commas between them.
const TOKEN_COMMA = /\s*,?\s*/y;
Ah, I see. I somehow misread the comment as "accept argument lists with commas between arguments". Maybe: "As a deviation from the well-formed Fluent grammar, this parser does not enforce commas between arguments."
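To make the effect of the optional comma concrete, here is a minimal sketch of how an argument list could be consumed with it. Only the `TOKEN_COMMA` pattern comes from this thread; `RE_IDENTIFIER`, `parseArgumentNames`, and the restriction to bare identifiers are assumptions made for illustration and are not taken from the patch.

```javascript
// Pattern proposed in the review thread: the comma is optional, so it is
// consumed together with the whitespace around it.
const TOKEN_COMMA = /\s*,?\s*/y;

// Hypothetical identifier pattern, assumed for this sketch only.
const RE_IDENTIFIER = /[a-zA-Z][a-zA-Z0-9_-]*/y;

// Hypothetical helper: collect identifiers until the closing parenthesis.
function parseArgumentNames(source, start) {
  let cursor = start;
  const names = [];
  while (cursor < source.length && source[cursor] !== ")") {
    RE_IDENTIFIER.lastIndex = cursor;
    const match = RE_IDENTIFIER.exec(source);
    if (match === null) {
      throw new SyntaxError(`Expected an identifier at offset ${cursor}`);
    }
    names.push(match[0]);
    cursor = RE_IDENTIFIER.lastIndex;

    // Always matches, possibly with an empty string; advances past an
    // optional comma and any surrounding whitespace.
    TOKEN_COMMA.lastIndex = cursor;
    TOKEN_COMMA.exec(source);
    cursor = TOKEN_COMMA.lastIndex;
  }
  return names;
}

console.log(parseArgumentNames("a, b c ,d)", 0));
```

With this shape, `a, b` and `a b` take the same code path, which is why the parser ends up accepting argument lists that the EBNF would reject.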
fluent/src/resource.js
Outdated
var first = Match(RE_TEXT_RUN);
}

// If there's an backslash escape or a placeable on the first line, fall
"there's a backslash"
Thanks, @zbraniecki! I've wondered about the naming (of course I have :)) and I like using nouns over verbs specifically for writing parsers. I've used this same convention in the reference parser, so I might be biased by the code I already wrote before. To me, nouns represent well the concepts of tokens and grammar productions. I understand that not everyone may think that way.
I used the capitalization consistently for functions which return a parsed result, as opposed to functions which are operators or helpers. In a way, the parser uses its own DSL to compose smaller productions into bigger ones. Again, this is something that I took away from writing the reference parser.
let selector = InlineExpression();
if (token(TOKEN_BRACE_CLOSE)) {
  return selector;
}
Do you think using verbs would make the code easier to read for you and for newcomers? Example:
let selector = parseInlineExpression();
if (skipToken(TOKEN_BRACE_CLOSE)) {
  return selector;
}
Related: what's your take on using 4 spaces for indentation? I know Mozilla guidelines say 2 spaces for JS, but having more breathing room makes the code easier to read for me. I wonder if it's the same for other people. It also occurs to me that 2 spaces were perhaps a good compromise in the days of callbacks. A compromise which may have outlived its usefulness in the current days of async functions :) Can we turn this lint off in mozilla-central for Fluent files? This re-write would be a good opportunity to change the indentation.
I don't think it matters that much. My only concern with I looked at rust parser [0] and they do use
No strong opinion. I'm comfortable with 2 spaces in JS and C++ and with 4 in Rust, so I don't expect a big difference. One thing is that I would prefer to migrate everything at once and not to touch it while we're updating the code. I'd say that renaming variables/properties/classes/methods while updating the logic is already the single greatest burden I've encountered when reviewing Fluent code. I'd love not to add whitespace conventions to that. :) [0] https://github.com/rust-lang/rust/blob/master/src/libsyntax/parse/parser.rs
OK, let's keep 2 spaces in this PR and discuss changing all code to 4 in a separate issue.
I decided to go with the verb+noun naming scheme: This is now ready to merge. |
This is a re-write of the runtime parser. It supports Fluent Syntax 0.7, runs against the reference fixtures, has half the lines of code, and is as fast in SpiderMonkey as the old one (and slightly faster in V8).

Goals

1. Support 100% of Fluent Syntax 0.7. This includes the indentation relaxation, dropping tabs and CR as syntax whitespace, normalizing new lines to LF, and only allowing numbers and identifiers as variant keys.
2. Maintain good performance. The parser is used in performance-critical code paths. Back in the days of Firefox OS it had to be both fast _and_ produce tightly packed results so that translations don't take up too much space on the device. I think the storage requirements can be relaxed these days.
3. Write code which will be easy to maintain in the future. The parser was first written even before Fluent branched off from L20n. It's seen many changes and additions over the last two years. As new features accrued, it became hard to maintain and to keep track of all known bugs. My goal for the re-write was not only to clean it up but also to define the conformance story for the future and to improve the testing infrastructure.

Design

The parser focuses on minimizing the number of false negatives at the expense of increasing the risk of false positives. In other words, it aims at parsing _valid_ Fluent messages with a success rate of 100%, but it may also parse some invalid messages which the reference parser would reject. The parser doesn't perform any validation and may produce entries which wouldn't make sense in the real world. For best results, users are advised to validate translations with the fluent-syntax parser before runtime.

The main parser loop iterates over the beginnings of messages and terms. This is to efficiently skip over comments (which have no use at runtime), and to recover from errors. When a fatal error is encountered, the parser instantly bails out of the currently-parsed message and moves on to the next one. Errors are discarded and are not visible to the users of `FluentResource`. They do carry a minimal description of what went wrong, though, which may be useful when reading the code and for debugging.

The parser makes extensive use of sticky regexes which can be anchored to any offset of the source string without slicing it. In some places, it's easier to just check the character currently at the cursor, so it does a fair share of that, too.

Conformance

My original plan was to base the parser on the EBNF and only parse well-formed syntax. In this PR, I went for something a bit wider than that: a superset of well-formed syntax.

The main deviation from the EBNF is related to parsing `VariantExpressions` and `CallExpressions`. The EBNF verifies that they are called on `Terms` and `Functions` respectively. The optimistic parser doesn't differentiate between `Messages`, `Terms` and `Functions`. I decided to implement it this way because this code might soon change anyways (see projectfluent/fluent#176).

Another deviation is that the parser treats commas in argument lists as whitespace, similar to how Clojure treats them in sequence lists. I might suggest we upstream this in the spec, too, because it makes the implementation of args lists _much_ simpler.

I based this PR on top of the `zeroseven` branch. The `fluent-syntax` parser already supports Syntax 0.7 and passes the [reference fixtures](https://github.com/projectfluent/fluent/tree/master/test/fixtures). This made it possible to turn on the reference testing in the runtime parser, too. `make fixtures` creates the parsed results for all reference fixtures; for now they must be verified manually before they're committed. `make test` can be used in development to assert that the output of the runtime parser still matches the committed one.
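The main loop described under Design can be sketched roughly as follows. This is an illustration of the idea and not the code from the patch: `RE_ENTRY_START`, `RE_VALUE`, `parseEntries`, and `parseValue` are hypothetical names, and the value grammar is reduced to a single line of text without braces.

```javascript
// Hypothetical pattern for the beginning of a message or a term: an optional
// leading dash, an identifier, and an equals sign at the start of a line.
const RE_ENTRY_START = /^(-?[a-zA-Z][\w-]*) *= */gm;

// Drastically simplified value: one line of text with no braces.
const RE_VALUE = /[^\n{}]+/y;

function parseValue(source, offset) {
  RE_VALUE.lastIndex = offset;
  const match = RE_VALUE.exec(source);
  if (match === null) {
    throw new SyntaxError("Expected a value");
  }
  return match[0];
}

function parseEntries(source) {
  const entries = [];
  RE_ENTRY_START.lastIndex = 0;
  while (true) {
    // Jump straight to the next line which looks like an entry; comments
    // and other junk in between are never looked at.
    const next = RE_ENTRY_START.exec(source);
    if (next === null) {
      break;
    }
    try {
      entries.push([next[1], parseValue(source, RE_ENTRY_START.lastIndex)]);
    } catch (err) {
      if (err instanceof SyntaxError) {
        // Discard the error and resume the search for the next entry.
        continue;
      }
      throw err;
    }
  }
  return entries;
}

const source = "# A comment\nhello = Hello\nbroken = {\n-brand = Firefox\n";
console.log(parseEntries(source));
```

Because the global regex keeps searching from its own `lastIndex`, a failed entry needs no explicit resynchronization step: the next call to `exec` simply finds the next line that looks like the start of an entry.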
I re-wrote the runtime parser. It supports Syntax 0.7, runs against the reference fixtures, has half the lines of code, aaand it's a bit slower on SpiderMonkey (but also a bit faster on V8, so I assume the perf will improve over time).
Goals
Design
The parser focuses on minimizing the number of false negatives at the expense of increasing the risk of false positives. In other words, it aims at parsing valid Fluent messages with a success rate of 100%, but it may also parse some invalid messages which the reference parser would reject. The parser doesn't perform any validation and may produce entries which wouldn't make sense in the real world. For best results users are advised to validate translations with the fluent-syntax parser pre-runtime.
The main parser loop iterates over the beginnings of messages and terms. This is to efficiently skip over comments (which have no use at runtime), and to recover from errors. When a fatal error is encountered, the parser instantly bails out of the currently-parsed message and moves on to the next one. Errors are discarded and are not visible to the users of `FluentResource`. They do carry a minimal description of what went wrong, though, which may be useful when reading the code and for debugging.
The parser makes extensive use of sticky regexes which can be anchored to any offset of the source string without slicing it. In some places, it's easier to just check the character currently at the cursor, so it does a fair share of that, too.
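As a small illustration of the sticky-regex technique, the sketch below anchors a pattern at an arbitrary offset by setting `lastIndex` before calling `exec`. The `y` flag and `lastIndex` are standard JavaScript; `RE_IDENTIFIER` and the `matchAt` helper are names assumed for this example.

```javascript
// Hypothetical identifier pattern. The sticky flag makes the match succeed
// only if it starts exactly at lastIndex.
const RE_IDENTIFIER = /[a-zA-Z][a-zA-Z0-9_-]*/y;

// Hypothetical helper: try to match `re` at `offset` without slicing `source`.
function matchAt(re, source, offset) {
  re.lastIndex = offset;
  const result = re.exec(source);
  if (result === null) {
    return null;
  }
  // After a successful match, lastIndex points just past the matched text.
  return { text: result[0], end: re.lastIndex };
}

const source = "hello = World";
console.log(matchAt(RE_IDENTIFIER, source, 0)); // matches "hello"
console.log(matchAt(RE_IDENTIFIER, source, 5)); // null: a space is at offset 5
console.log(matchAt(RE_IDENTIFIER, source, 8)); // matches "World"
```

Unlike a regex with only the `g` flag, a sticky regex does not scan forward looking for a match, so a failure at the cursor is reported immediately.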
Conformance
My original plan was to base the parser on the EBNF and only parse well-formed syntax. In this PR, I went for something a bit wider than that: a superset of well-formed syntax. The main deviation from the EBNF is related to parsing `VariantExpressions` and `CallExpressions`. The EBNF verifies that they are called on `Terms` and `Functions` respectively. The optimistic parser doesn't differentiate between `Messages`, `Terms` and `Functions`. I decided to implement it this way because this code might soon change anyways (see projectfluent/fluent#176).
Another deviation is that the parser treats commas in argument lists as whitespace, similar to how Clojure treats them in sequence lists. I might suggest we upstream this in the spec, too, because it makes the implementation of args lists much simpler.
I based this PR on top of the `zeroseven` branch. The `fluent-syntax` parser already supports Syntax 0.7 and passes the reference fixtures. This made it possible to turn on the reference testing in the runtime parser, too. `make fixtures` creates the parsed results for all reference fixtures; for now they must be verified manually before they're committed. `make test` can be used in development to assert that the output of the runtime parser still matches the committed one.
Performance
I'm seeing a slight slowdown when running `make perf` with SpiderMonkey (`make perf-jsshell`). Interestingly, I'm also seeing a slight speed-up in V8.
I asked @zbraniecki to run a Talos test using this branch. We'll see if the slowdown is visible during Firefox startup and we'll then decide what to do about it. I'm also seeing an opportunity to create a test case for the SpiderMonkey team. Maybe we're using features of the engine which haven't been optimized too much, yet? (This also applies to the `fluent-syntax` parser, which is 4 times slower in SM than it is in V8.)