Skip to content

Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library #3474

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bml1g12 opened this issue Dec 8, 2023 · 21 comments · Fixed by #3475
Assignees
Labels
bug Something isn't working logger

Comments

@bml1g12
Copy link

bml1g12 commented Dec 8, 2023

Expected Behaviour

{"level":"INFO","location":"<module>:8","message":"スコビルデモ2","timestamp":"2023-12-08 15:21:28,502+0000","service":"service_undefined"}
スコビルデモ2
INFO:root:スコビルデモ2

Current Behaviour

{"level":"INFO","location":"<module>:8","message":"\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12","timestamp":"2023-12-08 15:21:28,502+0000","service":"service_undefined"}
スコビルデモ2
INFO:root:スコビルデモ2

Code snippet

from aws_lambda_powertools import Logger
import logging

japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12"

logger = Logger()
logger.info(japanese_string)


print(japanese_string)

import logging
logging.basicConfig(level=logging.INFO)
logging.info(japanese_string)

Possible Solution

Use a custom formatter via Logger(logger_formatter=formatter) - although I'd need to do some research as to what would be the correct formatter to fix this issue, it does seem e.g.

class LocalPlainFormatter(LambdaPowertoolsFormatter):
    # pylint: disable = no-self-use, too-few-public-methods
    """Converts from the JSON structured data to a plain logging format"""

    def serialize(self, log: dict[Any, Any]) -> str:
        """Converts from the JSON structured data to a plain logging format.

        :param log: A JSON log record.
        :return: Plan logging format version of the input `log`.
        """
        timestamp = log["timestamp"]
        levelname = log["level"]
        func_name_location = log["location"]
        message = log["message"]
        exception_name = log.get("exception_name", "")
        traceback = log.get("exception", "")
        local_context = log.get("local_context", "")
        if exception_name or traceback:
            exception_info = f" - {exception_name} - {traceback}"
        else:
            exception_info = ""

        return f"{levelname} — {timestamp} —  {func_name_location} — {message}{local_context}" \
               f"{exception_info}"

Can work around this issue, at the expense of losing the structured logging

I also tried

def myserializer(x):
  return json.dumps(x, ensure_ascii=False)

logger = Logger(json_serializer=myserial)

Which I thought might be the issue, but it seems no matter what I pass to json_serializer I get the same output

Steps to Reproduce

  • aws_lambda_powertools-2.29.0
  • python 3.11

Powertools for AWS Lambda (Python) version

latest

AWS Lambda function runtime

3.10

Packaging format used

Lambda Layers

Debugging logs

No response

@bml1g12 bml1g12 added bug Something isn't working triage Pending triage from maintainers labels Dec 8, 2023
@bml1g12 bml1g12 changed the title Bug: TITLE Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library Dec 8, 2023
@bml1g12
Copy link
Author

bml1g12 commented Dec 8, 2023

Ah I see the problem, it's a problem introduced by JSON.dumps() so we need to do

logger = Logger(json_serializer=lambda x: json.dumps(x, ensure_ascii=False))

To fix this

@bml1g12 bml1g12 closed this as completed Dec 8, 2023
@github-project-automation github-project-automation bot moved this from Triage to Coming soon in Powertools for AWS Lambda (Python) Dec 8, 2023
Copy link
Contributor

github-actions bot commented Dec 8, 2023

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

@heitorlessa heitorlessa reopened this Dec 8, 2023
@github-project-automation github-project-automation bot moved this from Coming soon to Pending review in Powertools for AWS Lambda (Python) Dec 8, 2023
@heitorlessa
Copy link
Contributor

heitorlessa commented Dec 8, 2023

looking... TIL, I always thought by default std lib JSON would use utf-8 since logging and all the rest is -- let's have that as a default if it backwards compatible.

Would you like to send a PR for this so we can have it in the next release?

PS: If not backwards compatible, we should at least improve our documentation to call this out!

@heitorlessa heitorlessa added logger and removed triage Pending triage from maintainers labels Dec 8, 2023
@heitorlessa heitorlessa moved this from Pending review to Working on it in Powertools for AWS Lambda (Python) Dec 8, 2023
@bml1g12
Copy link
Author

bml1g12 commented Dec 8, 2023

Thanks @heitorlessa - good to see it's not just me surprised by this behaviour.

Actually I spotted that the solution I suggested is potentially going to introduce crashes into code bases

For example,

japanese_string = {"スコビルデモ2"}
logger = Logger(json_serializer=lambda data: json.dumps(data, ensure_ascii=False))
logger.info(japanese_string)

Yields TypeError: Object of type set is not JSON serializable

Whereas:

logger = Logger()
logger.info(japanese_string)

Yields:

logger.info(japanese_string)
{"level":"INFO","location":"<module>:18","message":"{'\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12'}","timestamp":"2023-12-08 16:17:25,496+0000","service":"service_undefined"}

So something like

def serialiser(data: Any) -> str:
    """A custom serialiser which allows displaying of Japanese text as utf-8 in logs."""
    try:
        return json.dumps(data, ensure_ascii=False)
    except TypeError:
        # e.g. if dumping a python set, convert to string them try again
        print("Failed to serialize object to JSON, converting to string")
        return json.dumps(str(data), ensure_ascii=False)

logger = Logger(json_serializer=serialiser)

Is what I have implemented so far, but would love if you have any suggestions? Especially if we'd consider changing the default behaviour, we'd need to take a lot of care here I think

@heitorlessa
Copy link
Contributor

sensible, I ran tests and didn't run into any issue, which still makes me nervous (hyrum law) as we can't get all edge cases.

Wrapping an exception is more helpful. We'll update the PR.

Thank you!

@heitorlessa
Copy link
Contributor

heitorlessa commented Dec 8, 2023

@leandrodamascena got to jump on a meeting I can't reschedule, could you take over, pretty please? Need to do some extensive exploratory testing besides what we have just in case.

There could be data loss when converting to str when getting it back too... hmmm tricky. If we can't be 99% sure, I'd add under a flag, and document the behaviour

@bml1g12
Copy link
Author

bml1g12 commented Dec 8, 2023

I also wonder if you think it would be safe to call LOGGER.exception() within the except statement here of the serializer?

def serialiser(data: Any) -> str:
    try:
        return json.dumps(data, ensure_ascii=False)
    except TypeError:
        # e.g. if dumping a python set, convert to string them try again
        LOGGER.exception("Failed to serialize object to JSON, converting to string")
        return json.dumps(str(data), ensure_ascii=False)

It's somewhat risk of recursion maybe - using the logging library inside it's own json_serialisation function.

We use Sentry in our company for error reporting, so in the above case, it would also trigger a sentry alert something is wrong.

Maybe safer I could do:

from sentry_sdk import capture_message

def serialiser(data: Any) -> str:
    """A custom serialiser which allows displaying of Japanese text as utf-8 in logs."""
    try:
        return json.dumps(data, ensure_ascii=False)
    except TypeError:
        # e.g. if dumping a python set, convert to string them try again
        capture_message(f"Failed to serialize object to JSON, converting to string. data: {data}")
        print(f"Failed to serialize object to JSON, converting to string. data: {data}")
        return json.dumps(str(data), ensure_ascii=False)

@leandrodamascena
Copy link
Contributor

leandrodamascena commented Dec 8, 2023

Hey @bml1g12 this is a really interesting finding in our Logger utility.

I tested our Logger with this new parameter ensure_ascii=False and it seems to be ok, converting as expected. I can also view the logs (Unicode decoded/encoded) in CloudWatch and search them in CloudWatch Insights. But my point here is: is someone intentionally creating a Unicode-encoded log line to export it maybe? And if we change this behavior, can we break someone? I don't know.

I agree this is a bug because the default behavior in the Logging library is to convert the string and show the Japanese string, in this case, but I'm wondering if we should add this flag to our constructor, add a section in our documentation and let the clients decide what they want.

What do you think? @bml1g12 and @heitorlessa

@heitorlessa
Copy link
Contributor

I know it's late Friday, so...would you be available for a call on Monday to go through options @bml1g12? It'd be nice to put a face into a name too!!

We're having a chat now and will answer with a few options with pros and cons, so if you can't do a call on Monday we can decide async too -- since there is a workaround however imperfect we'd be better off not making any hotfix release at haste until all options are on the table (backward compatibility is a tenet).

email us if you are and we send a calendar link to choose the best slot: [email protected]

@heitorlessa
Copy link
Contributor

@leandrodamascena and I had the call and here's the options we could think of - our preferred is 1. We'd love to hear your thoughts and anyone else from the community too:

  1. Document, add a new flag and change default in v3. Document non-ascii chars are escaped by default, and how to use this new flag to not escape it along with an example.

    • Pros. Reduces boilerplate from customers like you and prepare new customers if they ever encounter this issue with a friendly solution.
    • Cons. It doesn't prevent a customer from hitting this issue, we can't expect everyone will read documentation, release notes, or IDE explanations.
  2. Document and change default in v3. Document as proposed in 1/, but don't introduce a new flag which will have to be deprecated in v3, and removed only in v4 (we value stability).

    • Pros. Keeps Logger constructor from having yet more dials like datetime, and saving customers from watching out for breaking change in v4.
    • Cons. Same as 1, except customers will have to bring additional boilerplate until v3 (~H1 2024).
  3. Change the default and catch exceptions. Update default serializer and catch TypeError and potential other edge cases we might find.

    • Pros. Prevents new customers to ever hitting this issue, no need to wait until v3.
    • Cons. We cannot be absolutely certain of the impact; we're talking billions of transactions a week. We'd need to educate ourselves better of the possibilities before this becoming an option.

@bml1g12
Copy link
Author

bml1g12 commented Dec 8, 2023

Just to note, probably needs no explanation, but I gave the example snippet as japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12" but it's the same result if we provide japanese_string = "スコビルデモ2" which of course would be the more likely example (user's providing string naturally in their native language for logging).

is someone intentionally creating a Unicode-encoded log line to export it maybe?

But as you say, it's possible some users are already using the default behaviour intentionally to output Unicode encoded strings - I think it's unlikely, but it's definately possible.

It seems like a breaking change to me, so in my mind it would need to be 1 or 2. Thanks for the invite to discuss - we use this library a lot in our company (great work team!), so I'll discuss with my colleagues if they have any thoughts on best way to proceed on Monday and get back to you via this issue. Async is probably easiest for me.

I'm also little unclear why it is that:

japanese_set = {"スコビルデモ2"}
logger = Logger()
logger.info(japanese_set)

Does not throw an error when

json.dumps(japanese_set)

Does? Maybe this TypeError I spotted is unrelated to how ensure_ascii is set, but something in the way powertools library is handling things.

@heitorlessa
Copy link
Contributor

heitorlessa commented Dec 8, 2023 via email

@bml1g12
Copy link
Author

bml1g12 commented Dec 11, 2023

I discussed with my team today. We all agree (3) would be ideal, but we appreciate the risks involved of any potential bugs introduced, given how widespread the library is. So if you could somehow guarantee (3) won't introduce unexpected errors for users of the default setting then that would be the way to go, but otherwise, we think option (2) would be the best way to go. We hope this can become the default behaviour in V3, given that currently the library would behave unexpectedly for most users in Asia I think.

@heitorlessa
Copy link
Contributor

Great, thank you @bml1g12 for gathering everyone's consensus (say hello to team on our behalf!) -- We're running some additional tests and benchmark today for option 3.

Did additional research on weekend in CPython code, Loguru and Requests library, old CPython bug tracking, Python 2 vs Python 3 necessary changes, to make us more comfortable with optional 3 -- if our benchmark and additional tests work out nicely today, we'd be comfortable with the risk, provide a quick way to rollback if we must, and if we do need to rollback we'll make it the default in v3.

@heitorlessa
Copy link
Contributor

@bml1g12 Verdict -- we're going with option 3... changing the default. We did our due diligence as much as possible and consider this to be a long standing bug.

Main reasons but not limited to why we've decided to flip the default:

  1. ensure_ascii=False only affects non-ascii (~99% of the use cases atm).and no performance impact when toggling so. Python 2 and Python 3 would've been a different story (unicode obj vs str).
  2. There isn't any performance overhead in toggling this feature (hyperfine test). Largely due to doing less checks typically done for non-ascii that needs to be reverted and escaped.
  3. Log analytics tools don't offer auto-deserialization of encoded non-ascii characters, making search troublesome at best. This blocks customers from using Logger not only with CloudWatch Logs but also Datadog and others. This alone is a big reason to change it.
  4. TypeError shared earlier is related to a non-JSON serializable type which is already handled today via the default method.
  5. We've added an extra test to cover both all non-ascii characters (3.7+) and edge cases like field names being non-ascii.
  6. Our destination is standard output, not HTTP requests. It makes it less likely to break any deserialization as the consumer. It's quite the opposite, it creates an usability issue today (point 3). To be extra sure, we also checked that Loguru (popular logging framework) uses ensure_ascii=False as the default too

@github-project-automation github-project-automation bot moved this from Working on it to Coming soon in Powertools for AWS Lambda (Python) Dec 11, 2023
Copy link
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

@heitorlessa heitorlessa self-assigned this Dec 11, 2023
@heitorlessa
Copy link
Contributor

heitorlessa commented Dec 11, 2023

@leandrodamascena is triggering a hotfix release

@github-actions github-actions bot added the pending-release Fix or implementation already in dev waiting to be released label Dec 11, 2023
Copy link
Contributor

This is now released under 2.29.1 version!

@github-actions github-actions bot removed the pending-release Fix or implementation already in dev waiting to be released label Dec 11, 2023
@heitorlessa
Copy link
Contributor

@bml1g12 could you try the latest version and let us know, please? 2.29.1 on pip install and Lambda Layer version 53

  • arn:aws:lambda:{region}:017000801446:layer:AWSLambdaPowertoolsPythonV2:53
  • arn:aws:lambda:{region}:017000801446:layer:AWSLambdaPowertoolsPythonV2-Arm64:53

@heitorlessa heitorlessa moved this from Coming soon to Shipped in Powertools for AWS Lambda (Python) Dec 12, 2023
@bml1g12
Copy link
Author

bml1g12 commented Dec 12, 2023

Great stuff, thanks very much for the rapid fix! This week is a busy week for me so might take me a little while to get around to integrating this to some repos with a real trial by fire, but I did a quick local pip install and kicked the wheels a bit, and all seems healthy to me!

@heitorlessa
Copy link
Contributor

heitorlessa commented Dec 12, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working logger
Projects
Status: Shipped
3 participants