"Short circuiting" in base64's b64decode, decode, decodebytes #79013
When given an invalid base64 string that starts with a valid base64 substring, the functions return the decoded bytes only up to the end of that substring rather than ignoring the non-alphabet character. Examples:
>>> base64.b64decode("AAAAAAAA")
b'\x00\x00\x00\x00\x00\x00'
>>> base64.b64decode("AA=AAAAAA")
b'\x00\x00\x00\x00\x00\x00'
>>> base64.b64decode("AAA=AAAAA")
b'\x00\x00'
I am not sure that simply ignoring the invalid character is the best way to go. It feels like silencing errors. b64decode does accept a 'validate' flag, False by default, that halts execution and raises an error when set. A good idea might be to implement an 'errors' argument that accepts 'ignore' as a value, as we do for bytes.decode (https://docs.python.org/3/library/stdtypes.html#bytes.decode)
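To make the 'validate' flag concrete, here is a minimal sketch of the difference between the default and validate=True. The helper name decode_strict and the sample input are only illustrative and are not part of the base64 module.

```python
import base64
import binascii


def decode_strict(data):
    """Decode base64 and raise on any character outside the alphabet.

    Hypothetical helper: it only forwards to b64decode with validate=True.
    """
    return base64.b64decode(data, validate=True)


sample = "AAAA!AAAA"  # '!' is not in the base64 alphabet

# Default behavior: the stray character is silently discarded.
print(base64.b64decode(sample))

# validate=True: the same input is rejected with binascii.Error.
try:
    decode_strict(sample)
except binascii.Error as exc:
    print("rejected:", exc)
```

The exact error message differs between Python versions, so the sketch only relies on the exception type.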
Actually, I'm not even sure it makes sense to decode the 'first valid substring'... IMHO, we should warn the user.
For reference in future discussions, Python's base64 module implements RFC 3548 (https://tools.ietf.org/html/rfc3548), whose section 2.3 (https://tools.ietf.org/html/rfc3548#section-2.3) discusses the "Interpretation of non-alphabet characters in encoded data". The section reads, in part: "Base encodings use a specific, reduced, alphabet to encode binary [...] Implementations MUST reject the encoding if it contains characters [...]". In my opinion, the RFC is rather permissive about strange characters in the encoded data. It refers to the MIME specification, which ignores such characters, and hints at the possibility of rejecting the pad symbol '=' unless it is found at the end of the string. I think our best option, if we want to address this issue, is to add an 'errors' argument whose default value keeps the current behavior for backwards compatibility but which accepts further options: one to ignore the strange characters and carry on with the processing, like bytes.decode's errors='ignore', and one to raise an error in such situations, like bytes.decode's errors='strict'.
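A rough sketch of what such an 'errors' argument could look like, written as a wrapper rather than a change to the module. The function name b64decode_with_errors, the regular expression, and the choice to treat a misplaced '=' as a character to ignore are all assumptions made for illustration; the sketch also omits the "keep current behavior" default and only shows the two new modes.

```python
import base64
import binascii
import re

# Assumption: in 'ignore' mode everything outside A-Z, a-z, 0-9, '+', '/'
# is dropped, including '=' signs, and padding is rebuilt afterwards.
_NON_ALPHABET = re.compile(rb"[^A-Za-z0-9+/]")


def b64decode_with_errors(data, errors="strict"):
    """Hypothetical wrapper mimicking bytes.decode's 'errors' argument."""
    if isinstance(data, str):
        data = data.encode("ascii")
    if errors == "strict":
        # Reject any non-alphabet character or misplaced padding.
        return base64.b64decode(data, validate=True)
    if errors == "ignore":
        stripped = _NON_ALPHABET.sub(b"", data)
        # Re-pad to a multiple of four so decoding can continue.
        stripped += b"=" * (-len(stripped) % 4)
        return base64.b64decode(stripped)
    raise ValueError(f"unknown errors value: {errors!r}")


print(b64decode_with_errors("AA=AAAAAA", errors="ignore"))
try:
    b64decode_with_errors("AA=AAAAAA", errors="strict")
except binascii.Error as exc:
    print("strict mode rejected the input:", exc)
```

With 'ignore', the stray '=' no longer truncates the result; with 'strict', the same input raises. If the stripped input has an impossible length, the underlying decoder still raises binascii.Error.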
@fbidu, I agree with your reasoning. The default behavior should error on any non-base-alphabet characters. There are inconsistencies in the API. A single corrupt character will silently corrupt the rest of the data. Disagreement about which alphabet to use (urlsafe?) will lead to data that decodes fine but isn't the right data. Since it's fairly trivial (something like [...] Related: people want to ignore incorrect padding: #73613
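The alphabet disagreement can be illustrated with a small sketch. The payload below is made up for the demonstration: it was chosen so that its URL-safe encoding starts with four '-' characters, which the standard decoder discards when validate is left at its default.

```python
import base64
import binascii

# Made-up payload: the first three bytes encode to '----' in the
# URL-safe alphabet ('++++' in the standard one).
payload = b"\xfb\xef\xbeabc"
encoded = base64.urlsafe_b64encode(payload)

# Decoding with the standard alphabet drops the '-' characters without
# complaint, so the call succeeds but part of the payload is lost.
lossy = base64.b64decode(encoded)
print(encoded, lossy, lossy == payload)

# With validate=True the mismatch is reported instead.
try:
    base64.b64decode(encoded, validate=True)
except binascii.Error as exc:
    print("rejected:", exc)

# Decoding with the matching alphabet round-trips correctly.
print(base64.urlsafe_b64decode(encoded) == payload)
```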