"Short circuiting" in base64's b64decode, decode, decodebytes #79013

pwmichaelharris · 2018-09-28T12:20:13Z

BPO	34832
Nosy	@fbidu, @pwmichaelharris

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-09-28.12:20:13.340>
labels = ['type-bug', 'library']
title = '"Short circuiting" in base64\'s b64decode, decode, decodebytes'
updated_at = <Date 2018-09-29.16:05:02.898>
user = 'https://github.com/pwmichaelharris'

bugs.python.org fields:

activity = <Date 2018-09-29.16:05:02.898>
actor = 'fbidu'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2018-09-28.12:20:13.340>
creator = 'pw.michael.harris'
dependencies = []
files = []
hgrepos = []
issue_num = 34832
keywords = []
message_count = 4.0
messages = ['326630', '326661', '326662', '326680']
nosy_count = 2.0
nosy_names = ['fbidu', 'pw.michael.harris']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue34832'
versions = ['Python 3.5']

pwmichaelharris · 2018-09-28T12:20:13Z

When given an invalid base64 string that starts with a valid base64 substring, the functions will return the decoded bytes only up to the substring rather then ignoring the non-alphabet character.

Examples:
>>> base64.b64decode("AAAAAAAA")
b'\x00\x00\x00\x00\x00\x00'
>>> base64.b64decode("AA=AAAAAA")
b'\x00\x00\x00\x00\x00\x00'
>>> base64.b64decode("AAA=AAAAA")
b'\x00\x00'

fbidu · 2018-09-28T23:37:28Z

I am not sure if simply ignoring the non-valid character is the best way to go. Feels like silencing errors.

b64decode does accept the 'validate' flag - defaulted to False - that will halt the execution and throw an error.

What might be a good idea is to implement an 'errors' argument that accepts 'ignore' as a value, like we do for bytes.decode (https://docs.python.org/3/library/stdtypes.html#bytes.decode)

fbidu · 2018-09-28T23:49:43Z

Actually, I'm not even sure if it makes sense to decode the 'first valid substring'... IMHO, we should warn the user

fbidu · 2018-09-29T16:05:03Z

For reference in future discussions, Python's base64 module implements RFC 3548 (https://tools.ietf.org/html/rfc3548) whose section 2.3 (https://tools.ietf.org/html/rfc3548#section-2.3) discusses about "Interpretation of non-alphabet characters in encoded data".

The section's content is:

Base encodings use a specific, reduced, alphabet to encode binary
data. Non alphabet characters could exist within base encoded data,
caused by data corruption or by design. Non alphabet characters may
be exploited as a "covert channel", where non-protocol data can be
sent for nefarious purposes. Non alphabet characters might also be
sent in order to exploit implementation errors leading to, e.g.,
buffer overflow attacks.

Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise. Such specifications may, as MIME does, instead state that
characters outside the base encoding alphabet should simply be
ignored when interpreting data ("be liberal in what you accept").
Note that this means that any CRLF constitute "non alphabet
characters" and are ignored. Furthermore, such specifications may
consider the pad character, "=", as not part of the base alphabet
until the end of the string. If more than the allowed number of pad
characters are found at the end of the string, e.g., a base 64 string
terminated with "===", the excess pad characters could be ignored.

In my opinion, the RFC is rather permissive about strange characters in the encoded data. The RFC refers to the MIME specification that ignores the data and hints the possibility of rejecting the pad symbol '=' unless it is found in the end of the string.

I think that our best option if we would like to address this issue is to add an 'errors' argument whose default value will keep the current behavior for backwards compatibility but will accept more options in order to both ignore the strange characters and carry on with the processing - like bytes.decode's errors=ignore flag - and to raise an error in such situations, like bytes.decode's errors=strict.

simon-friedberger · 2023-08-08T06:25:07Z

@fbidu, I agree with your reasoning. The default behavior should error on any non-base-alphabet characters. validate=True should be the default. I would even suggest removing the option.

There are inconsistencies in the API, b64_decode has validate(default False) but urlsafe_b64decode and b32_decode do not. b32_decode implicitly has validate=True but has a switch for allowing lower-case characters.

A single corrupt character will silently corrupt the rest of the data. Disagreement about which alphabet to use (urlsafe?) will lead to data that decodes fine but isn't the right data.

Since it's fairly trivial (something like re.sub(r'[a-zA-Z0-9]', ' ', s)) to remove unwanted characters people who want that could just do that.

Related: People want to ignore incorrect padding: #73613
This can easily be done using +"==" as long as those characters get ignored so a solution which does remove validate should probably add an ignore_missing_padding.

pwmichaelharris mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Sep 28, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

"Short circuiting" in base64's b64decode, decode, decodebytes #79013

"Short circuiting" in base64's b64decode, decode, decodebytes #79013

pwmichaelharris mannequin commented Sep 28, 2018

pwmichaelharris mannequin commented Sep 28, 2018

Uh oh!

fbidu mannequin commented Sep 28, 2018

Uh oh!

fbidu mannequin commented Sep 28, 2018

Uh oh!

fbidu mannequin commented Sep 29, 2018

Uh oh!

simon-friedberger commented Aug 8, 2023 •

edited

Loading

Uh oh!

Uh oh!

"Short circuiting" in base64's b64decode, decode, decodebytes #79013

"Short circuiting" in base64's b64decode, decode, decodebytes #79013

Comments

pwmichaelharris mannequin commented Sep 28, 2018

pwmichaelharris mannequin commented Sep 28, 2018

Uh oh!

fbidu mannequin commented Sep 28, 2018

Uh oh!

fbidu mannequin commented Sep 28, 2018

Uh oh!

fbidu mannequin commented Sep 29, 2018

Uh oh!

simon-friedberger commented Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simon-friedberger commented Aug 8, 2023 •

edited

Loading