-
Notifications
You must be signed in to change notification settings - Fork 34
✨♻️🐛⚡ Support for UTF8, optional text, fix namespace and bodystructure bugs #104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
a2bf1be
to
a0fe7d5
Compare
This is especially helpful when making big changes to the parser. :)
RFC3501 (IMAP4rev1): resp-text = ["[" resp-text-code "]" SP] text RFC9051 (IMAP4rev2): resp-text = ["[" resp-text-code "]" SP] [text] And in Appendix E: 23. resp-text ABNF non-terminal was updated to allow for empty text. We leniently re-interpret this as also allowing us to drop the trailing SP char after "[resp-text-code parsable code data]", like so: resp-text = ["[" resp-text-code "]" [SP [text]] / [text] Also: ♻️ Add Patterns module, to organize regexps. ♻️ Use Patterns::CharClassSubtraction refinement to simplify exceptions. ♻️ Add ParserUtils::Generator#def_char_matchers to define SP, LBRA, RBRA. ♻️ Add ParserUtils#{match,accept}_re to replace TEXT, CTEXT lex_states. ♻️ Remove unused lex_state kwarg from match
The parser update supports both RFC6855 (UTF8=ALLOW, UTF8=ONLY) and the UTF8 requirements of IMAP4rev2 (resp-text). Also updated #enable documentation and method signature: * document `UTF8=ACCEPT` as "supported" * use `*rest` args => flatten => map(aliases) => uniq * add `:utf8` as an alias for `UTF8=ACCEPT`
I misread or misunderstood the spec when I first implemented this... I wrongly inserted SP-delimiters. Most servers don't list more than one namespace, so probably very few noticed the bug! Also: * ♻️ Rewrote using "new parser style" to more directly imitate the ABNF. * ⚡️ Small but measurable performance improvement. * ♻️ Add ParserUtils::Generator#def_token_matchers for quoted, string, nil, tagged_ext_label, etc. * ♻️ Move atom, astring, nstring, etc to top, so they can be aliased. * ♻️ Use NIL in nstring, nquoted
✨ Add missing "location" extension data. This was missing from RFC2060 but part of RFC3501. It was also missing from Net::IMAP... until now! 😄 🐛 Fix many bugs. Most importantly: * More strict about where NIL is allowed, e.g: `number`, `envelope`, and `body`. Ignoring these rare server bugs made it difficult to workaround much more common server bugs elsewhere. * BodyTypeAttachment and BodyTypeExtension won't be returned any more and the constants have been deprecated. * Better workaround for multipart parts with... zero parts. 🚧 TODO: Although this will parse *most* strange BODYSTRUCTURE msg-att found in the wild, a future PR will backtrack on parse errors and try one or more "fool-proof" algorithms that partially parse *nearly* all invalid body structures sent by buggy servers... even in pathological cases, such as when servers send the message-id as a quoted string containing unescaped quotation marks! ♻️ Add lookahead and peek methods to def_char_matchers, and peek_str?, peek_re, for matching without consuming and using MatchData. ♻️ rename case_insensitive__string to match new parser style. ♻️ add number64 aliases. (size is unenforced)
a0fe7d5
to
e5aeeb0
Compare
This was referenced Feb 12, 2023
n.b, this PR was split into three other more cohesive PRs: |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
These parser updates share some of their foundations, and would have many conflicts in separate PRs. So, even though it's several different features, bug fixes, or refactorings, it's simplest to combine them. They also fix some existing issues and lay foundations for various other features I've worked on.
This also showcases the coding style I've been using. It isn't my normal style... my standard rubocop config would complain! 😉 But I personally found that it was much easier to compare code to ABNF this way, while still staying within a simple recursive descent, predictive parser paradigm. @shugo (and anyone else), please let me know if this style is off-putting or confusing, and you think it should be changed to a less "clever" style. I have also started some simple experiments with e.g. racc and ragel, and parser combinators, and other more complex code generation, and even meta-programmed method inlining... but I'd prefer to avoid that added complexity, at least for now. 😉
My preliminary benchmarks were all positive. No enormous speedups, but in most cases a little bit faster. I'll edit this description with some of the relevant benchmarks later. In particular, I wanted to avoid any code generation with
eval
unless the benchmarks justify it (but I think they do).✨ Make text in resp-text optional (IMAP4rev2): fe9ded8
moved to #111
RFC3501 (IMAP4rev1):
RFC9051 (IMAP4rev2):
And in Appendix E:
In the spirit of Appendix E. 23 (and based on some actual server responses I've seen over the years), I've leniently re-interpreted this as also allowing us to drop the trailing
SP
char after[resp-text-code parsable code data]
, like so:Actually, the original parser already mostly behaved this way, because the original regexps for
T_TEXT
used*
and not+
. But, as I updated the parser in many other places to more closely match the RFCs, I broke this behavior. This commit originally came after many many other changes. But, while rebasing, I moved this commit first because it simplified later commits.Also:
Patterns
module, to organize regexps.Patterns::CharClassSubtraction
refinement to simplify exceptions.ParserUtils::Generator#def_char_matchers
to defineSP
,LBRA
,RBRA
.ParserUtils#{match,accept}_re
to replaceTEXT
,CTEXT
lex states.lex_state
kwarg frommatch
✨ Add UTF-8 support for quoted and text: f30690e
moved to #111
The parser update supports both RFC6855 (
UTF8=ALLOW
,UTF8=ONLY
) and the extra UTF8 requirements ofIMAP4rev2
(the human-readabletext
inresp-text
).Also updated
#enable
documentation and method signature:UTF8=ACCEPT
as "supported"*rest
args => flatten => map(aliases) => uniq:utf8
as an alias forUTF8=ACCEPT
🐛⚡♻️ NAMESPACE: fix parsing (not SP-delimited!): 5f08055
moved to #112
I misread or misunderstood the spec when I first implemented this...!
I wrongly inserted
SP
-delimiters. Most servers don't list more than onenamespace, so I guess that very few noticed the bug!
Also:
ParserUtils::Generator#def_token_matchers
forquoted
,string
,nil
,tagged_ext_label
, etc.atom
,astring
,nstring
, etc to top, so they can be aliased.NIL?
in nstring, nquoted✨🐛 Update FETCH BODYSTRUCTURE msg-att parser: a449243
moved to #113
✨ Add missing
location
extension data. This was missing from RFC2060 but part of RFC3501.It was also missing from
Net::IMAP
... until now! 😄🐛 Fix many bugs. Most importantly:
NIL
is allowed, e.g:number
,envelope
, andbody
. Ignoring these rare server bugs made it difficult to workaround much more common server bugs elsewhere.BodyTypeAttachment
andBodyTypeExtension
won't be returned any more and the constants have been deprecated.🚧 TODO: Although this will parse most strange
BODYSTRUCTURE
msg-att
found in the wild, a future PR will backtrack on parse errors and try one or more "fool-proof" algorithms that partially parse nearly all invalid body structures sent by buggy servers... even in pathological cases, such as when servers send the message-id as a quoted stringcontaining unescaped quotation marks!
Also:
lookahead
andpeek
methods todef_char_matchers
, andpeek_str?
,peek_re
, for matching without consuming and usingMatchData
.case_insensitive__string
to match new parser style.number64
aliases. (size is unenforced)