Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library #3474

bml1g12 · 2023-12-08T15:24:27Z

Expected Behaviour

{"level":"INFO","location":"<module>:8","message":"スコビルデモ２","timestamp":"2023-12-08 15:21:28,502+0000","service":"service_undefined"}
スコビルデモ２
INFO:root:スコビルデモ２

Current Behaviour

{"level":"INFO","location":"<module>:8","message":"\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12","timestamp":"2023-12-08 15:21:28,502+0000","service":"service_undefined"}
スコビルデモ２
INFO:root:スコビルデモ２

Code snippet

from aws_lambda_powertools import Logger
import logging

japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12"

logger = Logger()
logger.info(japanese_string)


print(japanese_string)

import logging
logging.basicConfig(level=logging.INFO)
logging.info(japanese_string)

Possible Solution

Use a custom formatter via Logger(logger_formatter=formatter) - although I'd need to do some research as to what would be the correct formatter to fix this issue, it does seem e.g.

class LocalPlainFormatter(LambdaPowertoolsFormatter):
    # pylint: disable = no-self-use, too-few-public-methods
    """Converts from the JSON structured data to a plain logging format"""

    def serialize(self, log: dict[Any, Any]) -> str:
        """Converts from the JSON structured data to a plain logging format.

        :param log: A JSON log record.
        :return: Plan logging format version of the input `log`.
        """
        timestamp = log["timestamp"]
        levelname = log["level"]
        func_name_location = log["location"]
        message = log["message"]
        exception_name = log.get("exception_name", "")
        traceback = log.get("exception", "")
        local_context = log.get("local_context", "")
        if exception_name or traceback:
            exception_info = f" - {exception_name} - {traceback}"
        else:
            exception_info = ""

        return f"{levelname} — {timestamp} —  {func_name_location} — {message}{local_context}" \
               f"{exception_info}"

Can work around this issue, at the expense of losing the structured logging

I also tried

def myserializer(x):
  return json.dumps(x, ensure_ascii=False)

logger = Logger(json_serializer=myserial)

Which I thought might be the issue, but it seems no matter what I pass to json_serializer I get the same output

Steps to Reproduce

aws_lambda_powertools-2.29.0
python 3.11

Powertools for AWS Lambda (Python) version

latest

AWS Lambda function runtime

3.10

Packaging format used

Lambda Layers

Debugging logs

No response

The text was updated successfully, but these errors were encountered:

bml1g12 · 2023-12-08T15:54:26Z

Ah I see the problem, it's a problem introduced by JSON.dumps() so we need to do

logger = Logger(json_serializer=lambda x: json.dumps(x, ensure_ascii=False))

To fix this

github-actions · 2023-12-08T15:54:47Z

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

heitorlessa · 2023-12-08T15:57:40Z

looking... TIL, I always thought by default std lib JSON would use utf-8 since logging and all the rest is -- let's have that as a default if it backwards compatible.

Would you like to send a PR for this so we can have it in the next release?

PS: If not backwards compatible, we should at least improve our documentation to call this out!

bml1g12 · 2023-12-08T16:19:21Z

Thanks @heitorlessa - good to see it's not just me surprised by this behaviour.

Actually I spotted that the solution I suggested is potentially going to introduce crashes into code bases

For example,

japanese_string = {"スコビルデモ２"}
logger = Logger(json_serializer=lambda data: json.dumps(data, ensure_ascii=False))
logger.info(japanese_string)

Yields TypeError: Object of type set is not JSON serializable

Whereas:

logger = Logger()
logger.info(japanese_string)

Yields:

logger.info(japanese_string)
{"level":"INFO","location":"<module>:18","message":"{'\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12'}","timestamp":"2023-12-08 16:17:25,496+0000","service":"service_undefined"}

So something like

def serialiser(data: Any) -> str:
    """A custom serialiser which allows displaying of Japanese text as utf-8 in logs."""
    try:
        return json.dumps(data, ensure_ascii=False)
    except TypeError:
        # e.g. if dumping a python set, convert to string them try again
        print("Failed to serialize object to JSON, converting to string")
        return json.dumps(str(data), ensure_ascii=False)

logger = Logger(json_serializer=serialiser)

Is what I have implemented so far, but would love if you have any suggestions? Especially if we'd consider changing the default behaviour, we'd need to take a lot of care here I think

heitorlessa · 2023-12-08T16:20:43Z

sensible, I ran tests and didn't run into any issue, which still makes me nervous (hyrum law) as we can't get all edge cases.

Wrapping an exception is more helpful. We'll update the PR.

Thank you!

heitorlessa · 2023-12-08T16:22:09Z

@leandrodamascena got to jump on a meeting I can't reschedule, could you take over, pretty please? Need to do some extensive exploratory testing besides what we have just in case.

There could be data loss when converting to str when getting it back too... hmmm tricky. If we can't be 99% sure, I'd add under a flag, and document the behaviour

bml1g12 · 2023-12-08T16:28:16Z

I also wonder if you think it would be safe to call LOGGER.exception() within the except statement here of the serializer?

def serialiser(data: Any) -> str:
    try:
        return json.dumps(data, ensure_ascii=False)
    except TypeError:
        # e.g. if dumping a python set, convert to string them try again
        LOGGER.exception("Failed to serialize object to JSON, converting to string")
        return json.dumps(str(data), ensure_ascii=False)

It's somewhat risk of recursion maybe - using the logging library inside it's own json_serialisation function.

We use Sentry in our company for error reporting, so in the above case, it would also trigger a sentry alert something is wrong.

Maybe safer I could do:

from sentry_sdk import capture_message

def serialiser(data: Any) -> str:
    """A custom serialiser which allows displaying of Japanese text as utf-8 in logs."""
    try:
        return json.dumps(data, ensure_ascii=False)
    except TypeError:
        # e.g. if dumping a python set, convert to string them try again
        capture_message(f"Failed to serialize object to JSON, converting to string. data: {data}")
        print(f"Failed to serialize object to JSON, converting to string. data: {data}")
        return json.dumps(str(data), ensure_ascii=False)

leandrodamascena · 2023-12-08T17:02:27Z

Hey @bml1g12 this is a really interesting finding in our Logger utility.

I tested our Logger with this new parameter ensure_ascii=False and it seems to be ok, converting as expected. I can also view the logs (Unicode decoded/encoded) in CloudWatch and search them in CloudWatch Insights. But my point here is: is someone intentionally creating a Unicode-encoded log line to export it maybe? And if we change this behavior, can we break someone? I don't know.

I agree this is a bug because the default behavior in the Logging library is to convert the string and show the Japanese string, in this case, but I'm wondering if we should add this flag to our constructor, add a section in our documentation and let the clients decide what they want.

What do you think? @bml1g12 and @heitorlessa

heitorlessa · 2023-12-08T17:18:58Z

I know it's late Friday, so...would you be available for a call on Monday to go through options @bml1g12? It'd be nice to put a face into a name too!!

We're having a chat now and will answer with a few options with pros and cons, so if you can't do a call on Monday we can decide async too -- since there is a workaround however imperfect we'd be better off not making any hotfix release at haste until all options are on the table (backward compatibility is a tenet).

email us if you are and we send a calendar link to choose the best slot: [email protected]

heitorlessa · 2023-12-08T17:55:19Z

@leandrodamascena and I had the call and here's the options we could think of - our preferred is 1. We'd love to hear your thoughts and anyone else from the community too:

Document, add a new flag and change default in v3. Document non-ascii chars are escaped by default, and how to use this new flag to not escape it along with an example.
- Pros. Reduces boilerplate from customers like you and prepare new customers if they ever encounter this issue with a friendly solution.
- Cons. It doesn't prevent a customer from hitting this issue, we can't expect everyone will read documentation, release notes, or IDE explanations.
Document and change default in v3. Document as proposed in 1/, but don't introduce a new flag which will have to be deprecated in v3, and removed only in v4 (we value stability).
- Pros. Keeps Logger constructor from having yet more dials like datetime, and saving customers from watching out for breaking change in v4.
- Cons. Same as 1, except customers will have to bring additional boilerplate until v3 (~H1 2024).
Change the default and catch exceptions. Update default serializer and catch TypeError and potential other edge cases we might find.
- Pros. Prevents new customers to ever hitting this issue, no need to wait until v3.
- Cons. We cannot be absolutely certain of the impact; we're talking billions of transactions a week. We'd need to educate ourselves better of the possibilities before this becoming an option.

bml1g12 · 2023-12-08T19:57:05Z

Just to note, probably needs no explanation, but I gave the example snippet as japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12" but it's the same result if we provide japanese_string = "スコビルデモ２" which of course would be the more likely example (user's providing string naturally in their native language for logging).

is someone intentionally creating a Unicode-encoded log line to export it maybe?

But as you say, it's possible some users are already using the default behaviour intentionally to output Unicode encoded strings - I think it's unlikely, but it's definately possible.

It seems like a breaking change to me, so in my mind it would need to be 1 or 2. Thanks for the invite to discuss - we use this library a lot in our company (great work team!), so I'll discuss with my colleagues if they have any thoughts on best way to proceed on Monday and get back to you via this issue. Async is probably easiest for me.

I'm also little unclear why it is that:

japanese_set = {"スコビルデモ２"}
logger = Logger()
logger.info(japanese_set)

Does not throw an error when

json.dumps(japanese_set)

Does? Maybe this TypeError I spotted is unrelated to how ensure_ascii is set, but something in the way powertools library is handling things.

heitorlessa · 2023-12-08T23:04:58Z

I think I found a solution but we’d need some form of mutation testing to put my mind at ease — using ensure_ascii=False + our internal JSON Encoder default method possibly combined with a singledispatch. I’ll ping Nate from the SDK team who’s got experience with this ~6 years ago in the requests module back in the days. That should solve the edge cases since we’ve had this working for ~4 years and only discovered it now.

…

On Fri, 8 Dec 2023 at 20:57, Benjamin Lowe ***@***.***> wrote: Just to note, probably needs no explanation, but I gave the example snippet as japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12" but it's the same result if we provide japanese_string = "スコビルデモ２" which of course would be the more likely example (user's providing string naturally in their native language for logging). is someone intentionally creating a Unicode-encoded log line to export it maybe? But as you say, it's possible some users are already using the default behaviour intentionally to output Unicode encoded strings - I think it's unlikely, but it's definately possible. It seems like a breaking change to me, so in my mind it would need to be 1 or 2. Thanks for the invite to discuss - we use this library a lot in our company (great work team!), so I'll discuss with my colleagues if they have any thoughts on best way to proceed on Monday and get back to you via this issue. Async is probably easiest for me. I'm also little unclear why it is that: japanese_set = {"スコビルデモ２"} logger = Logger()logger.info(japanese_set) Does not throw an error when json.dumps(japanese_set) Does? — Reply to this email directly, view it on GitHub <#3474 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZPQBAFAR4VNDCA4VJMJ7LYINWJZAVCNFSM6AAAAABAMZ6QGCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBXG43DSMJTGE> . You are receiving this because you were mentioned.Message ID: ***@***.*** com>

bml1g12 · 2023-12-11T11:01:31Z

I discussed with my team today. We all agree (3) would be ideal, but we appreciate the risks involved of any potential bugs introduced, given how widespread the library is. So if you could somehow guarantee (3) won't introduce unexpected errors for users of the default setting then that would be the way to go, but otherwise, we think option (2) would be the best way to go. We hope this can become the default behaviour in V3, given that currently the library would behave unexpectedly for most users in Asia I think.

heitorlessa · 2023-12-11T12:22:14Z

Great, thank you @bml1g12 for gathering everyone's consensus (say hello to team on our behalf!) -- We're running some additional tests and benchmark today for option 3.

Did additional research on weekend in CPython code, Loguru and Requests library, old CPython bug tracking, Python 2 vs Python 3 necessary changes, to make us more comfortable with optional 3 -- if our benchmark and additional tests work out nicely today, we'd be comfortable with the risk, provide a quick way to rollback if we must, and if we do need to rollback we'll make it the default in v3.

heitorlessa · 2023-12-11T13:49:04Z

@bml1g12 Verdict -- we're going with option 3... changing the default. We did our due diligence as much as possible and consider this to be a long standing bug.

Main reasons but not limited to why we've decided to flip the default:

ensure_ascii=False only affects non-ascii (~99% of the use cases atm).and no performance impact when toggling so. Python 2 and Python 3 would've been a different story (unicode obj vs str).
There isn't any performance overhead in toggling this feature (hyperfine test). Largely due to doing less checks typically done for non-ascii that needs to be reverted and escaped.
Log analytics tools don't offer auto-deserialization of encoded non-ascii characters, making search troublesome at best. This blocks customers from using Logger not only with CloudWatch Logs but also Datadog and others. This alone is a big reason to change it.
TypeError shared earlier is related to a non-JSON serializable type which is already handled today via the default method.
We've added an extra test to cover both all non-ascii characters (3.7+) and edge cases like field names being non-ascii.
Our destination is standard output, not HTTP requests. It makes it less likely to break any deserialization as the consumer. It's quite the opposite, it creates an usability issue today (point 3). To be extra sure, we also checked that Loguru (popular logging framework) uses ensure_ascii=False as the default too

github-actions · 2023-12-11T13:55:04Z

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

heitorlessa · 2023-12-11T13:55:24Z

@leandrodamascena is triggering a hotfix release

github-actions · 2023-12-11T14:24:14Z

This is now released under 2.29.1 version!

heitorlessa · 2023-12-11T14:42:15Z

@bml1g12 could you try the latest version and let us know, please? 2.29.1 on pip install and Lambda Layer version 53

arn:aws:lambda:{region}:017000801446:layer:AWSLambdaPowertoolsPythonV2:53
arn:aws:lambda:{region}:017000801446:layer:AWSLambdaPowertoolsPythonV2-Arm64:53

bml1g12 · 2023-12-12T19:26:53Z

Great stuff, thanks very much for the rapid fix! This week is a busy week for me so might take me a little while to get around to integrating this to some repos with a real trial by fire, but I did a quick local pip install and kicked the wheels a bit, and all seems healthy to me!

heitorlessa · 2023-12-12T20:05:55Z

No pressure ;) we wanted to make sure this was available as soon as we could given its impact

…

On Tue, 12 Dec 2023 at 20:27, Benjamin Lowe ***@***.***> wrote: Great stuff, this week is a busy week for me so might take me a little while to get around to integrating this to some repos with a real trial by fire, but I did a quick local pip install and kicked the wheels a bit, and all seems healthy to me! — Reply to this email directly, view it on GitHub <#3474 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZPQBEMHMOM5OBGJBF5TEDYJCVYPAVCNFSM6AAAAABAMZ6QGCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNJSGY3TENZUGA> . You are receiving this because you were assigned.Message ID: ***@***.*** com>

bml1g12 added bug Something isn't working triage Pending triage from maintainers labels Dec 8, 2023

github-project-automation bot added this to Powertools for AWS Lambda (Python) Dec 8, 2023

github-project-automation bot moved this to Triage in Powertools for AWS Lambda (Python) Dec 8, 2023

bml1g12 changed the title ~~Bug: TITLE~~ Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library Dec 8, 2023

bml1g12 closed this as completed Dec 8, 2023

github-project-automation bot moved this from Triage to Coming soon in Powertools for AWS Lambda (Python) Dec 8, 2023

heitorlessa reopened this Dec 8, 2023

github-project-automation bot moved this from Coming soon to Pending review in Powertools for AWS Lambda (Python) Dec 8, 2023

heitorlessa added logger and removed triage Pending triage from maintainers labels Dec 8, 2023

heitorlessa moved this from Pending review to Working on it in Powertools for AWS Lambda (Python) Dec 8, 2023

heitorlessa mentioned this issue Dec 8, 2023

fix(logger): log non-ascii characters as is when JSON stringifying #3475

Merged

7 tasks

heitorlessa linked a pull request Dec 11, 2023 that will close this issue

fix(logger): log non-ascii characters as is when JSON stringifying #3475

Merged

7 tasks

leandrodamascena closed this as completed in #3475 Dec 11, 2023

github-project-automation bot moved this from Working on it to Coming soon in Powertools for AWS Lambda (Python) Dec 11, 2023

heitorlessa self-assigned this Dec 11, 2023

github-actions bot added the pending-release Fix or implementation already in dev waiting to be released label Dec 11, 2023

github-actions bot removed the pending-release Fix or implementation already in dev waiting to be released label Dec 11, 2023

heitorlessa moved this from Coming soon to Shipped in Powertools for AWS Lambda (Python) Dec 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library #3474

Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library #3474

bml1g12 commented Dec 8, 2023 •

edited

Loading

bml1g12 commented Dec 8, 2023

github-actions bot commented Dec 8, 2023

heitorlessa commented Dec 8, 2023 •

edited

Loading

bml1g12 commented Dec 8, 2023 •

edited

Loading

heitorlessa commented Dec 8, 2023

heitorlessa commented Dec 8, 2023 •

edited

Loading

bml1g12 commented Dec 8, 2023 •

edited

Loading

leandrodamascena commented Dec 8, 2023 •

edited

Loading

heitorlessa commented Dec 8, 2023

heitorlessa commented Dec 8, 2023

bml1g12 commented Dec 8, 2023 •

edited

Loading

heitorlessa commented Dec 8, 2023 via email

bml1g12 commented Dec 11, 2023

heitorlessa commented Dec 11, 2023

heitorlessa commented Dec 11, 2023

github-actions bot commented Dec 11, 2023

heitorlessa commented Dec 11, 2023 •

edited

Loading

github-actions bot commented Dec 11, 2023

heitorlessa commented Dec 11, 2023

bml1g12 commented Dec 12, 2023 •

edited

Loading

heitorlessa commented Dec 12, 2023 via email

Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library #3474

Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library #3474

Comments

bml1g12 commented Dec 8, 2023 • edited Loading

Expected Behaviour

Current Behaviour

Code snippet

Possible Solution

Steps to Reproduce

Powertools for AWS Lambda (Python) version

AWS Lambda function runtime

Packaging format used

Debugging logs

bml1g12 commented Dec 8, 2023

github-actions bot commented Dec 8, 2023

⚠️COMMENT VISIBILITY WARNING⚠️

heitorlessa commented Dec 8, 2023 • edited Loading

bml1g12 commented Dec 8, 2023 • edited Loading

heitorlessa commented Dec 8, 2023

heitorlessa commented Dec 8, 2023 • edited Loading

bml1g12 commented Dec 8, 2023 • edited Loading

leandrodamascena commented Dec 8, 2023 • edited Loading

heitorlessa commented Dec 8, 2023

heitorlessa commented Dec 8, 2023

bml1g12 commented Dec 8, 2023 • edited Loading

heitorlessa commented Dec 8, 2023 via email

bml1g12 commented Dec 11, 2023

heitorlessa commented Dec 11, 2023

heitorlessa commented Dec 11, 2023

github-actions bot commented Dec 11, 2023

⚠️COMMENT VISIBILITY WARNING⚠️

heitorlessa commented Dec 11, 2023 • edited Loading

github-actions bot commented Dec 11, 2023

heitorlessa commented Dec 11, 2023

bml1g12 commented Dec 12, 2023 • edited Loading

heitorlessa commented Dec 12, 2023 via email

bml1g12 commented Dec 8, 2023 •

edited

Loading

heitorlessa commented Dec 8, 2023 •

edited

Loading

bml1g12 commented Dec 8, 2023 •

edited

Loading

heitorlessa commented Dec 8, 2023 •

edited

Loading

bml1g12 commented Dec 8, 2023 •

edited

Loading

leandrodamascena commented Dec 8, 2023 •

edited

Loading

bml1g12 commented Dec 8, 2023 •

edited

Loading

heitorlessa commented Dec 11, 2023 •

edited

Loading

bml1g12 commented Dec 12, 2023 •

edited

Loading