✨♻️🐛⚡ Support for UTF8, optional text, fix namespace and bodystructure bugs #104

nevans · 2023-02-08T14:23:23Z

These parser updates share some of their foundations, and would have many conflicts in separate PRs. So, even though it's several different features, bug fixes, or refactorings, it's simplest to combine them. They also fix some existing issues and lay foundations for various other features I've worked on.

This also showcases the coding style I've been using. It isn't my normal style... my standard rubocop config would complain! 😉 But I personally found that it was much easier to compare code to ABNF this way, while still staying within a simple recursive descent, predictive parser paradigm. @shugo (and anyone else), please let me know if this style is off-putting or confusing, and you think it should be changed to a less "clever" style. I have also started some simple experiments with e.g. racc and ragel, and parser combinators, and other more complex code generation, and even meta-programmed method inlining... but I'd prefer to avoid that added complexity, at least for now. 😉

My preliminary benchmarks were all positive. No enormous speedups, but in most cases a little bit faster. I'll edit this description with some of the relevant benchmarks later. In particular, I wanted to avoid any code generation with eval unless the benchmarks justify it (but I think they do).

✨ Make text in resp-text optional (IMAP4rev2): `fe9ded8`

moved to #111

RFC3501 (IMAP4rev1):

resp-text       = ["[" resp-text-code "]" SP] text

RFC9051 (IMAP4rev2):

resp-text       = ["[" resp-text-code "]" SP] [text]

And in Appendix E:

23.  resp-text ABNF non-terminal was updated to allow for empty text.

In the spirit of Appendix E, item 23 (and based on some actual server responses I've seen over the years), I've leniently reinterpreted this as also allowing the trailing SP char after "[" resp-text-code "]" to be dropped, like so:

resp-text       = "[" resp-text-code "]" [SP [text]] / [text]

Actually, the original parser already mostly behaved this way, because the original regexps for T_TEXT used * and not +. But, as I updated the parser in many other places to more closely match the RFCs, I broke this behavior. This commit originally came after many many other changes. But, while rebasing, I moved this commit first because it simplified later commits.
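To make the lenient interpretation concrete, here is a minimal sketch. It is not the parser code from this PR: `CODED`, `PLAIN`, and `parse_resp_text` are hypothetical names, and resp-text-code is simplified to "anything except `]`, CR, or LF".

```ruby
# Hypothetical sketch, not net-imap's implementation.
# Simplification: resp-text-code is treated as any run of chars except "]".
CODED = /\A\[(?<code>[^\]\r\n]+)\](?: (?<text>[^\r\n]*))?\z/
PLAIN = /\A(?<text>[^\r\n]*)\z/

# Returns {code:, text:} for one resp-text (without the trailing CRLF),
# or nil when the input contains CR or LF.
def parse_resp_text(str)
  if (m = CODED.match(str))
    code = m[:code]
  elsif (m = PLAIN.match(str))
    code = nil
  else
    return nil
  end
  text = m[:text]
  text = nil if text.nil? || text.empty? # empty text is allowed (IMAP4rev2)
  { code: code, text: text }
end

parse_resp_text("[ALERT] System shutdown") # strict RFC3501 form
parse_resp_text("[ALERT] ")                # RFC9051: empty text
parse_resp_text("[ALERT]")                 # lenient: SP dropped too
```

The three calls cover the RFC3501 form, the RFC9051 form with empty text, and the extra leniency described above.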

Also:

♻️ Add Patterns module, to organize regexps.
♻️ Use Patterns::CharClassSubtraction refinement to simplify exceptions.
♻️ Add ParserUtils::Generator#def_char_matchers to define SP, LBRA, RBRA.
♻️ Add ParserUtils#{match,accept}_re to replace TEXT, CTEXT lex states.
♻️ Remove unused lex_state kwarg from match
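As a rough illustration of how generated single-char matchers can keep parser methods close to the ABNF, here is a sketch. The module `CharMatcherGenerator`, the class `TinyParser`, and the rule `bracketed_atom` are all hypothetical, and this version uses `define_method` rather than whatever the PR's `ParserUtils::Generator` actually does.

```ruby
require "strscan"

# Hypothetical mini-generator, only loosely inspired by the names in this PR.
module CharMatcherGenerator
  def def_char_matchers(name, char)
    re = /#{Regexp.escape(char)}/
    # NAME? "accepts": consumes the char and returns it, or returns nil.
    define_method(:"#{name}?") { @scanner.scan(re) }
    # NAME "matches": consumes the char, or raises.
    define_method(name) do
      send(:"#{name}?") or
        raise ArgumentError, "expected #{char.inspect} at #{@scanner.pos}"
    end
  end
end

class TinyParser
  extend CharMatcherGenerator
  def_char_matchers :SP,   " "
  def_char_matchers :LBRA, "["
  def_char_matchers :RBRA, "]"

  def initialize(str)
    @scanner = StringScanner.new(str)
  end

  # Made-up rule for illustration:  bracketed-atom = "[" 1*ALPHA "]" [SP]
  def bracketed_atom
    LBRA()
    atom = @scanner.scan(/[A-Z]+/)
    RBRA()
    SP?
    atom
  end
end

TinyParser.new("[ALERT] rest").bracketed_atom
```

Because the method names start with an uppercase letter, the required forms need explicit parentheses (`LBRA()`), while the optional forms (`SP?`) can be called bare.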

✨ Add UTF-8 support for quoted and text: `f30690e`

moved to #111

The parser update supports both RFC6855 (UTF8=ACCEPT, UTF8=ONLY) and the extra UTF8 requirements of IMAP4rev2 (the human-readable text in resp-text).

Also updated #enable documentation and method signature:

📖 Document UTF8=ACCEPT as "supported"
♻️ Use *rest args => flatten => map(aliases) => uniq
✨ Add :utf8 as an alias for UTF8=ACCEPT
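The `*rest` args => flatten => map(aliases) => uniq pipeline can be sketched roughly like this. `ENABLE_ALIASES` and `normalize_enable_args` are names assumed for illustration, and the alias table only contains the one alias named above.

```ruby
# Hypothetical sketch of the argument normalization, not the actual #enable code.
# Assumed alias table: only :utf8 => "UTF8=ACCEPT" is taken from this PR.
ENABLE_ALIASES = { utf8: "UTF8=ACCEPT" }.freeze

def normalize_enable_args(*capabilities)
  capabilities
    .flatten                                         # accept nested arrays
    .map { |cap| ENABLE_ALIASES.fetch(cap, cap).to_s } # expand aliases
    .uniq                                            # drop duplicates
end

normalize_enable_args(:utf8, ["UTF8=ACCEPT", "CONDSTORE"])
```

Passing both the alias and the capability it expands to yields a single entry, so the same capability is not requested twice.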

🐛⚡♻️ NAMESPACE: fix parsing (not SP-delimited!): `5f08055`

moved to #112

I misread or misunderstood the spec when I first implemented this...!
I wrongly inserted SP-delimiters. Most servers don't list more than one
namespace, so I guess that very few noticed the bug!
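To show what "not SP-delimited" means in practice, here is a small sketch of parsing one namespace list in which the descriptors follow each other directly. It is not the parser from this PR: `parse_namespace_list` is a hypothetical helper, and it is deliberately simplified (quoted strings without escapes, no extension data).

```ruby
require "strscan"

# Hypothetical sketch. Parses one namespace list such as
#   (("" "/")("#mh/" "/"))   or   NIL
# Simplified: no escaped chars in quoted strings, no extension data.
def parse_namespace_list(str)
  s = StringScanner.new(str)
  return nil if s.scan(/NIL\z/i)
  s.scan(/\(/) or raise ArgumentError, "expected '(' at #{s.pos}"
  descriptors = []
  until s.scan(/\)/)
    # Descriptors are adjacent: there is no SP between ")" and the next "(".
    s.scan(/\("([^"\\\r\n]*)" (?:"([^"\\\r\n])"|NIL)\)/i) or
      raise ArgumentError, "expected namespace descriptor at #{s.pos}"
    descriptors << { prefix: s[1], delim: s[2] }
  end
  raise ArgumentError, "need at least one descriptor" if descriptors.empty?
  raise ArgumentError, "trailing data at #{s.pos}" unless s.eos?
  descriptors
end

parse_namespace_list('(("" "/")("#mh/" "/"))')
```

With this sketch, an input that puts SP between the two descriptors raises `ArgumentError`, which is the shape of input the earlier, SP-delimited implementation wrongly expected.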

Also:

♻️ Rewrote using "new parser style" to more directly imitate the ABNF.
⚡ Small but measurable performance improvement.
♻️ Add ParserUtils::Generator#def_token_matchers for quoted, string, nil, tagged_ext_label, etc.
♻️ Move atom, astring, nstring, etc to top, so they can be aliased.
♻️ Use NIL? in nstring, nquoted

✨🐛 Update FETCH BODYSTRUCTURE msg-att parser: `a449243`

moved to #113

✨ Add missing location extension data. This was missing from RFC2060 but part of RFC3501.
It was also missing from Net::IMAP... until now! 😄

🐛 Fix many bugs. Most importantly:

More strict about where NIL is allowed, e.g. number, envelope, and body. Ignoring these rare server bugs made it difficult to work around much more common server bugs elsewhere.
🗑️ BodyTypeAttachment and BodyTypeExtension won't be returned any more and the constants have been deprecated.
Better workaround for multipart parts with... zero parts.

🚧 TODO: Although this will parse most strange BODYSTRUCTURE msg-att found in the wild, a future PR will backtrack on parse errors and try one or more "fool-proof" algorithms that partially parse nearly all invalid body structures sent by buggy servers... even in pathological cases, such as when servers send the message-id as a quoted string containing unescaped quotation marks!

Also:

♻️ Add lookahead and peek methods to def_char_matchers, and peek_str?, peek_re, for matching without consuming and using MatchData.
♻️ rename case_insensitive__string to match new parser style.
♻️ add number64 aliases. (size is unenforced)


nevans · 2023-02-12T06:24:43Z

N.B.: this PR was split into three other, more cohesive PRs: #111, #112, and #113.

nevans requested a review from shugo February 8, 2023 14:46

nevans force-pushed the parser-text-utf8-namespace-bodystructure branch 2 times, most recently from a2bf1be to a0fe7d5 Compare February 10, 2023 14:09

Base automatically changed from parser-tests-and-benchmarks to master February 10, 2023 14:33

nevans added 5 commits February 10, 2023 09:34

🔎 Improve parse error debugging

de223a6

This is especially helpful when making big changes to the parser. :)

nevans force-pushed the parser-text-utf8-namespace-bodystructure branch from a0fe7d5 to e5aeeb0 Compare February 10, 2023 14:36

nevans mentioned this pull request Feb 12, 2023

Support for IMAP4rev2 and modern extensions #12

Open

nevans added the IMAP4rev2 Requirement for IMAP4rev2, RFC9051 label Feb 12, 2023

This was referenced Feb 12, 2023

✨ Parse UTF-8 encoded strings, for UTF8=ACCEPT and IMAP4rev2 #111

Merged

🐛 Fix NAMESPACE parsing (and other ♻️ refactoring) #112

Merged

✨🐛 Update BODYSTRUCTURE parser; add location; fix bugs #113

Merged

nevans closed this Feb 12, 2023

nevans deleted the parser-text-utf8-namespace-bodystructure branch February 12, 2023 06:24

nevans restored the parser-text-utf8-namespace-bodystructure branch February 12, 2023 06:24

nevans deleted the parser-text-utf8-namespace-bodystructure branch February 12, 2023 06:24

nevans added the IMAP4rev1 Requirement for IMAP4rev1, RFC3501 label Sep 27, 2023
