-
Notifications
You must be signed in to change notification settings - Fork 429
Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library #3474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Ah I see the problem, it's a problem introduced by JSON.dumps() so we need to do
To fix this |
|
looking... TIL, I always thought by default std lib JSON would use utf-8 since logging and all the rest is -- let's have that as a default if it backwards compatible. Would you like to send a PR for this so we can have it in the next release? PS: If not backwards compatible, we should at least improve our documentation to call this out! |
Thanks @heitorlessa - good to see it's not just me surprised by this behaviour. Actually I spotted that the solution I suggested is potentially going to introduce crashes into code bases For example,
Yields Whereas:
Yields:
So something like
Is what I have implemented so far, but would love if you have any suggestions? Especially if we'd consider changing the default behaviour, we'd need to take a lot of care here I think |
sensible, I ran tests and didn't run into any issue, which still makes me nervous (hyrum law) as we can't get all edge cases. Wrapping an exception is more helpful. We'll update the PR. Thank you! |
@leandrodamascena got to jump on a meeting I can't reschedule, could you take over, pretty please? Need to do some extensive exploratory testing besides what we have just in case. There could be data loss when converting to |
I also wonder if you think it would be safe to call LOGGER.exception() within the except statement here of the serializer? def serialiser(data: Any) -> str:
try:
return json.dumps(data, ensure_ascii=False)
except TypeError:
# e.g. if dumping a python set, convert to string them try again
LOGGER.exception("Failed to serialize object to JSON, converting to string")
return json.dumps(str(data), ensure_ascii=False) It's somewhat risk of recursion maybe - using the logging library inside it's own json_serialisation function. We use Sentry in our company for error reporting, so in the above case, it would also trigger a sentry alert something is wrong. Maybe safer I could do: from sentry_sdk import capture_message
def serialiser(data: Any) -> str:
"""A custom serialiser which allows displaying of Japanese text as utf-8 in logs."""
try:
return json.dumps(data, ensure_ascii=False)
except TypeError:
# e.g. if dumping a python set, convert to string them try again
capture_message(f"Failed to serialize object to JSON, converting to string. data: {data}")
print(f"Failed to serialize object to JSON, converting to string. data: {data}")
return json.dumps(str(data), ensure_ascii=False) |
Hey @bml1g12 this is a really interesting finding in our Logger utility. I tested our Logger with this new parameter I agree this is a bug because the default behavior in the Logging library is to convert the string and show the Japanese string, in this case, but I'm wondering if we should add this flag to our constructor, add a section in our documentation and let the clients decide what they want. What do you think? @bml1g12 and @heitorlessa |
I know it's late Friday, so...would you be available for a call on Monday to go through options @bml1g12? It'd be nice to put a face into a name too!! We're having a chat now and will answer with a few options with pros and cons, so if you can't do a call on Monday we can decide async too -- since there is a workaround however imperfect we'd be better off not making any hotfix release at haste until all options are on the table (backward compatibility is a tenet). email us if you are and we send a calendar link to choose the best slot: [email protected] |
@leandrodamascena and I had the call and here's the options we could think of - our preferred is 1. We'd love to hear your thoughts and anyone else from the community too:
|
Just to note, probably needs no explanation, but I gave the example snippet as
But as you say, it's possible some users are already using the default behaviour intentionally to output Unicode encoded strings - I think it's unlikely, but it's definately possible. It seems like a breaking change to me, so in my mind it would need to be 1 or 2. Thanks for the invite to discuss - we use this library a lot in our company (great work team!), so I'll discuss with my colleagues if they have any thoughts on best way to proceed on Monday and get back to you via this issue. Async is probably easiest for me. I'm also little unclear why it is that:
Does not throw an error when
Does? Maybe this TypeError I spotted is unrelated to how |
I think I found a solution but we’d need some form of mutation testing to
put my mind at ease — using ensure_ascii=False + our internal JSON Encoder
default method possibly combined with a singledispatch.
I’ll ping Nate from the SDK team who’s got experience with this ~6 years
ago in the requests module back in the days.
That should solve the edge cases since we’ve had this working for ~4 years
and only discovered it now.
…On Fri, 8 Dec 2023 at 20:57, Benjamin Lowe ***@***.***> wrote:
Just to note, probably needs no explanation, but I gave the example
snippet as japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12"
but it's the same result if we provide japanese_string = "スコビルデモ2" which
of course would be the more likely example (user's providing string
naturally in their native language for logging).
is someone intentionally creating a Unicode-encoded log line to export it
maybe?
But as you say, it's possible some users are already using the default
behaviour intentionally to output Unicode encoded strings - I think it's
unlikely, but it's definately possible.
It seems like a breaking change to me, so in my mind it would need to be 1
or 2. Thanks for the invite to discuss - we use this library a lot in our
company (great work team!), so I'll discuss with my colleagues if they have
any thoughts on best way to proceed on Monday and get back to you via this
issue. Async is probably easiest for me.
I'm also little unclear why it is that:
japanese_set = {"スコビルデモ2"}
logger = Logger()logger.info(japanese_set)
Does not throw an error when
json.dumps(japanese_set)
Does?
—
Reply to this email directly, view it on GitHub
<#3474 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZPQBAFAR4VNDCA4VJMJ7LYINWJZAVCNFSM6AAAAABAMZ6QGCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBXG43DSMJTGE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***
com>
|
I discussed with my team today. We all agree (3) would be ideal, but we appreciate the risks involved of any potential bugs introduced, given how widespread the library is. So if you could somehow guarantee (3) won't introduce unexpected errors for users of the default setting then that would be the way to go, but otherwise, we think option (2) would be the best way to go. We hope this can become the default behaviour in V3, given that currently the library would behave unexpectedly for most users in Asia I think. |
Great, thank you @bml1g12 for gathering everyone's consensus (say hello to team on our behalf!) -- We're running some additional tests and benchmark today for option 3. Did additional research on weekend in CPython code, Loguru and Requests library, old CPython bug tracking, Python 2 vs Python 3 necessary changes, to make us more comfortable with optional 3 -- if our benchmark and additional tests work out nicely today, we'd be comfortable with the risk, provide a quick way to rollback if we must, and if we do need to rollback we'll make it the default in v3. |
@bml1g12 Verdict -- we're going with option 3... changing the default. We did our due diligence as much as possible and consider this to be a long standing bug. Main reasons but not limited to why we've decided to flip the default:
|
|
@leandrodamascena is triggering a hotfix release |
This is now released under 2.29.1 version! |
@bml1g12 could you try the latest version and let us know, please? 2.29.1 on
|
Great stuff, thanks very much for the rapid fix! This week is a busy week for me so might take me a little while to get around to integrating this to some repos with a real trial by fire, but I did a quick local pip install and kicked the wheels a bit, and all seems healthy to me! |
No pressure ;) we wanted to make sure this was available as soon as we
could given its impact
…On Tue, 12 Dec 2023 at 20:27, Benjamin Lowe ***@***.***> wrote:
Great stuff, this week is a busy week for me so might take me a little
while to get around to integrating this to some repos with a real trial by
fire, but I did a quick local pip install and kicked the wheels a bit, and
all seems healthy to me!
—
Reply to this email directly, view it on GitHub
<#3474 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZPQBEMHMOM5OBGJBF5TEDYJCVYPAVCNFSM6AAAAABAMZ6QGCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJSGY3TENZUGA>
.
You are receiving this because you were assigned.Message ID:
***@***.***
com>
|
Expected Behaviour
Current Behaviour
Code snippet
Possible Solution
Use a custom formatter via
Logger(logger_formatter=formatter)
- although I'd need to do some research as to what would be the correct formatter to fix this issue, it does seem e.g.Can work around this issue, at the expense of losing the structured logging
I also tried
Which I thought might be the issue, but it seems no matter what I pass to
json_serializer
I get the same outputSteps to Reproduce
Powertools for AWS Lambda (Python) version
latest
AWS Lambda function runtime
3.10
Packaging format used
Lambda Layers
Debugging logs
No response
The text was updated successfully, but these errors were encountered: