Skip to content

✨♻️🐛⚡ Support for UTF8, optional text, fix namespace and bodystructure bugs #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

nevans
Copy link
Collaborator

@nevans nevans commented Feb 8, 2023

These parser updates share some of their foundations, and would have many conflicts in separate PRs. So, even though it's several different features, bug fixes, or refactorings, it's simplest to combine them. They also fix some existing issues and lay foundations for various other features I've worked on.

This also showcases the coding style I've been using. It isn't my normal style... my standard rubocop config would complain! 😉 But I personally found that it was much easier to compare code to ABNF this way, while still staying within a simple recursive descent, predictive parser paradigm. @shugo (and anyone else), please let me know if this style is off-putting or confusing, and you think it should be changed to a less "clever" style. I have also started some simple experiments with e.g. racc and ragel, and parser combinators, and other more complex code generation, and even meta-programmed method inlining... but I'd prefer to avoid that added complexity, at least for now. 😉

My preliminary benchmarks were all positive. No enormous speedups, but in most cases a little bit faster. I'll edit this description with some of the relevant benchmarks later. In particular, I wanted to avoid any code generation with eval unless the benchmarks justify it (but I think they do).


✨ Make text in resp-text optional (IMAP4rev2): fe9ded8

moved to #111

RFC3501 (IMAP4rev1):

resp-text       = ["[" resp-text-code "]" SP] text

RFC9051 (IMAP4rev2):

resp-text       = "[" resp-text-code "]" SP] [text]

And in Appendix E:

23.  resp-text ABNF non-terminal was updated to allow for empty text.

In the spirit of Appendix E. 23 (and based on some actual server responses I've seen over the years), I've leniently re-interpreted this as also allowing us to drop the trailing SP char after [resp-text-code parsable code data], like so:

resp-text       = ["[" resp-text-code "]" [SP [text]] / [text]

Actually, the original parser already mostly behaved this way, because the original regexps for T_TEXT used * and not +. But, as I updated the parser in many other places to more closely match the RFCs, I broke this behavior. This commit originally came after many many other changes. But, while rebasing, I moved this commit first because it simplified later commits.

Also:

  • ♻️ Add Patterns module, to organize regexps.
  • ♻️ Use Patterns::CharClassSubtraction refinement to simplify exceptions.
  • ♻️ Add ParserUtils::Generator#def_char_matchers to define SP, LBRA, RBRA.
  • ♻️ Add ParserUtils#{match,accept}_re to replace TEXT, CTEXT lex states.
  • ♻️ Remove unused lex_state kwarg from match

✨ Add UTF-8 support for quoted and text: f30690e

moved to #111

The parser update supports both RFC6855 (UTF8=ALLOW, UTF8=ONLY) and the extra UTF8 requirements of IMAP4rev2 (the human-readable text in resp-text).

Also updated #enable documentation and method signature:

  • 📖 Document UTF8=ACCEPT as "supported"
  • ♻️ Use *rest args => flatten => map(aliases) => uniq
  • ✨ Add :utf8 as an alias for UTF8=ACCEPT

🐛⚡♻️ NAMESPACE: fix parsing (not SP-delimited!): 5f08055

moved to #112

I misread or misunderstood the spec when I first implemented this...!
I wrongly inserted SP-delimiters. Most servers don't list more than one
namespace, so I guess that very few noticed the bug!

Also:

  • ♻️ Rewrote using "new parser style" to more directly imitate the ABNF.
  • ⚡ Small but measurable performance improvement.
  • ♻️ Add ParserUtils::Generator#def_token_matchers for quoted, string, nil, tagged_ext_label, etc.
  • ♻️ Move atom, astring, nstring, etc to top, so they can be aliased.
  • ♻️ Use NIL? in nstring, nquoted

✨🐛 Update FETCH BODYSTRUCTURE msg-att parser: a449243

moved to #113

✨ Add missing location extension data. This was missing from RFC2060 but part of RFC3501.
It was also missing from Net::IMAP... until now! 😄

🐛 Fix many bugs. Most importantly:

  • More strict about where NIL is allowed, e.g: number, envelope, and body. Ignoring these rare server bugs made it difficult to workaround much more common server bugs elsewhere.
  • 🗑️ BodyTypeAttachment and BodyTypeExtension won't be returned any more and the constants have been deprecated.
  • Better workaround for multipart parts with... zero parts.

🚧 TODO: Although this will parse most strange BODYSTRUCTURE msg-att found in the wild, a future PR will backtrack on parse errors and try one or more "fool-proof" algorithms that partially parse nearly all invalid body structures sent by buggy servers... even in pathological cases, such as when servers send the message-id as a quoted string
containing unescaped quotation marks!

Also:

  • ♻️ Add lookahead and peek methods to def_char_matchers, and peek_str?, peek_re, for matching without consuming and using MatchData.
  • ♻️ rename case_insensitive__string to match new parser style.
  • ♻️ add number64 aliases. (size is unenforced)

@nevans nevans requested a review from shugo February 8, 2023 14:46
@nevans nevans force-pushed the parser-text-utf8-namespace-bodystructure branch 2 times, most recently from a2bf1be to a0fe7d5 Compare February 10, 2023 14:09
Base automatically changed from parser-tests-and-benchmarks to master February 10, 2023 14:33
This is especially helpful when making big changes to the parser. :)
RFC3501 (IMAP4rev1):
  resp-text       = ["[" resp-text-code "]" SP] text

RFC9051 (IMAP4rev2):
  resp-text       = ["[" resp-text-code "]" SP] [text]

And in Appendix E:
   23.  resp-text ABNF non-terminal was updated to allow for empty text.

We leniently re-interpret this as also allowing us to drop the trailing
SP char after "[resp-text-code parsable code data]", like so:
  resp-text       = ["[" resp-text-code "]" [SP [text]] / [text]

Also:
♻️ Add Patterns module, to organize regexps.
♻️ Use Patterns::CharClassSubtraction refinement to simplify exceptions.
♻️ Add ParserUtils::Generator#def_char_matchers to define SP, LBRA, RBRA.
♻️ Add ParserUtils#{match,accept}_re to replace TEXT, CTEXT lex_states.
♻️ Remove unused lex_state kwarg from match
The parser update supports both RFC6855 (UTF8=ALLOW, UTF8=ONLY) and the
UTF8 requirements of IMAP4rev2 (resp-text).

Also updated #enable documentation and method signature:
* document `UTF8=ACCEPT` as "supported"
* use `*rest` args => flatten => map(aliases) => uniq
* add `:utf8` as an alias for `UTF8=ACCEPT`
I misread or misunderstood the spec when I first implemented this...
I wrongly inserted SP-delimiters.  Most servers don't list more than one
namespace, so probably very few noticed the bug!

Also:
* ♻️ Rewrote using "new parser style" to more directly imitate the ABNF.
* ⚡️ Small but measurable performance improvement.
* ♻️ Add ParserUtils::Generator#def_token_matchers for quoted, string,
  nil, tagged_ext_label, etc.
* ♻️ Move atom, astring, nstring, etc to top, so they can be aliased.
* ♻️ Use NIL in nstring, nquoted
✨ Add missing "location" extension data.

This was missing from RFC2060 but part of RFC3501.
It was also missing from Net::IMAP... until now! 😄

🐛 Fix many bugs.  Most importantly:
* More strict about where NIL is allowed, e.g: `number`, `envelope`,
  and `body`.  Ignoring these rare server bugs made it difficult to
  workaround much more common server bugs elsewhere.
* BodyTypeAttachment and BodyTypeExtension won't be returned any more
  and the constants have been deprecated.
* Better workaround for multipart parts with... zero parts.

🚧 TODO: Although this will parse *most* strange BODYSTRUCTURE msg-att
found in the wild, a future PR will backtrack on parse errors and try
one or more "fool-proof" algorithms that partially parse *nearly* all
invalid body structures sent by buggy servers... even in pathological
cases, such as when servers send the message-id as a quoted string
containing unescaped quotation marks!

♻️ Add lookahead and peek methods to def_char_matchers, and peek_str?,
peek_re, for matching without consuming and using MatchData.
♻️ rename case_insensitive__string to match new parser style.
♻️ add number64 aliases. (size is unenforced)
@nevans
Copy link
Collaborator Author

nevans commented Feb 12, 2023

@nevans nevans closed this Feb 12, 2023
@nevans nevans deleted the parser-text-utf8-namespace-bodystructure branch February 12, 2023 06:24
@nevans nevans restored the parser-text-utf8-namespace-bodystructure branch February 12, 2023 06:24
@nevans nevans deleted the parser-text-utf8-namespace-bodystructure branch February 12, 2023 06:24
@nevans nevans added the IMAP4rev1 Requirement for IMAP4rev1, RFC3501 label Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IMAP4rev1 Requirement for IMAP4rev1, RFC3501 IMAP4rev2 Requirement for IMAP4rev2, RFC9051
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant