std.os.windows: Support UTF-8 ↔ UTF-16 conversion for Windows native console I/O #16618
Conversation
Thanks for working on this! At some point I'd like to benchmark reading/writing to console against master to see what the performance cost of this approach is. Prior attempt at the same thing: #12400
Update: I also tried UTF-16 to UTF-8 conversion for raw mode, but I encountered ferocious edge cases, especially when I read the console input partially (leaving UTF-16 code units in the console input text buffer) and then switch the console input mode on/off. It should be perfectly fine as long as people don't mess with the console input mode. I am optimistic enough to assume that it is effectively done.
Is the hidden allocation here fine? Wouldn't this be better solved with a stack-allocated buffer and multiple writes?
I did that just to keep the logic simple.
Done. The temporary buffer is now 1024 bytes long, and no heap allocation is involved.
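For readers following along, a rough sketch of what such a chunked conversion can look like is below: UTF-8 input is converted through a fixed-size stack buffer and handed to WriteConsoleW in pieces. The function name is hypothetical, the WriteConsoleW binding is assumed to follow the documented Win32 signature, the std.unicode helper is named as it was around the time of this PR, and this is not the PR's actual code.

```zig
const std = @import("std");
const windows = std.os.windows;

/// Sketch: write UTF-8 `bytes` to a console handle by converting chunks
/// through a fixed-size stack buffer and passing them to WriteConsoleW.
fn writeConsoleUtf8(handle: windows.HANDLE, bytes: []const u8) !void {
    var utf16_buf: [1024]u16 = undefined;
    var index: usize = 0;
    while (index < bytes.len) {
        // A 4-byte UTF-8 code point becomes at most 2 UTF-16 code units,
        // so a chunk of utf16_buf.len / 2 input bytes always fits.
        var end = @min(bytes.len, index + utf16_buf.len / 2);
        // Do not split a UTF-8 sequence: back up to a code point boundary.
        while (end < bytes.len and (bytes[end] & 0b1100_0000) == 0b1000_0000) end -= 1;
        const n = try std.unicode.utf8ToUtf16Le(&utf16_buf, bytes[index..end]);
        var written: windows.DWORD = undefined;
        if (windows.kernel32.WriteConsoleW(
            handle,
            &utf16_buf,
            @as(windows.DWORD, @intCast(n)),
            &written,
            null,
        ) == 0) return error.WriteConsoleFailed;
        index = end;
    }
}
```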
@squeek502 Just rebased onto master.
Will try writing some benchmarks / test cases when I get a chance. In the meantime, it might be helpful if you could take a look at #12400 and compare your implementation with its (relatively simple) implementation. What are the relevant differences? Are there things that your implementation handles that wouldn't be handled by the one in #12400?
As far as I can tell after a quick glance at #12400:
This is not good. Let's take a step back and understand what is going on before adding such hacks to the codebase.
@@ -523,19 +527,186 @@ pub fn ReadFile(in_hFile: HANDLE, buffer: []u8, offset: ?u64, io_mode: std.io.Mo
};
break :blk &overlapped_data;
} else null;
if (kernel32.ReadFile(in_hFile, buffer.ptr, want_read_count, &amt_read, overlapped) == 0) {
var console_mode: DWORD = undefined;
const is_console_handle: bool = kernel32.GetConsoleMode(in_hFile, &console_mode) != FALSE;
Really? You want to call GetConsoleMode with every call to ReadFile?
Some context:

- In std: Windows Console IO Fix #12400 it used isTty, which uses os.isatty, which calls GetConsoleMode (so it'd call it on every read/write).
- Rust uses an is_console function which is a wrapper around GetConsoleMode (so it calls it on every read/write AFAICT).

However, it seems Go stores the kind as "console" in newFile to skip the GetConsoleMode on read/write:

- https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/os/file_windows.go#L49-L51
- https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/internal/poll/fd_windows.go#L419-L420

But it still seems to have a GetConsoleMode call in a (lower-level?) API? I'm not familiar with Go.
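For reference, the per-call check that #12400, Rust, and this PR all reduce to is tiny. A minimal sketch using the existing GetConsoleMode binding in std.os.windows.kernel32 (the helper name is made up for illustration):

```zig
const windows = @import("std").os.windows;

/// GetConsoleMode succeeds only for console handles, so its return value
/// doubles as an "is this a console?" test (the same idea as Rust's
/// is_console and as os.isatty on Windows).
fn isConsoleHandle(handle: windows.HANDLE) bool {
    var mode: windows.DWORD = undefined;
    return windows.kernel32.GetConsoleMode(handle, &mode) != 0;
}
```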
> Really? You want to call GetConsoleMode with every call to ReadFile?
No, I don't.
However, std.fs.File is merely a thin wrapper around the underlying OS file handle. If we want to remember whether the OS handle is a console handle at std.fs.File creation, we will need to add OS-specific fields to the std.fs.File data structure, or add additional records in std.os.windows along with a lookup scheme keyed by the std.fs.File handle (either a pointer or a unique ID), just like how C runtime libraries deal with POSIX file descriptors internally.

For minimally intrusive changes to platform-agnostic code (i.e. std.fs.File), I chose calling GetConsoleMode with every ReadFile call. I would like to hear your thoughts on design choices before making any change.
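To illustrate the alternative being weighed here, a hypothetical wrapper that caches the console check once at handle-acquisition time might look like the sketch below. This is not what std.fs.File does today; the type and field names are made up.

```zig
const windows = @import("std").os.windows;

/// Determine console-ness once, when the handle is obtained, and remember
/// it next to the handle instead of re-querying on every read/write.
const CachedConsoleHandle = struct {
    handle: windows.HANDLE,
    is_console: bool,

    fn init(handle: windows.HANDLE) CachedConsoleHandle {
        var mode: windows.DWORD = undefined;
        return .{
            .handle = handle,
            .is_console = windows.kernel32.GetConsoleMode(handle, &mode) != 0,
        };
    }
};
```

The trade-off is exactly the one described above: the cached flag avoids a syscall per read/write, but it has to live somewhere, either as an OS-specific field in the platform-agnostic std.fs.File or in a side table keyed by handle.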
// There is no reliable way to implement perfectly platform-agnostic UTF-16 to UTF-8
// conversion for raw mode, because it is impossible to know the number of pending
// code units stored in the console input buffer, while in cooked mode we can rely on
// the terminating LF character. Without knowing that, ReadConsoleW() may accidentally
// pop out characters without blocking, or prompt for user input at unexpected times.
// In the case of raw mode, redirect to kernel32.ReadFile() without conversion for now;
// just don't make things worse.
This is not good enough. If it cannot be implemented perfectly using the current abstractions, then the current abstractions need to change.
Let me elaborate on this:
The current goal of my implementation is to make users totally unaware of the presence of UTF-16, even if they call std.os.windows.ReadFile directly. Users are able to pass UTF-8-encoded strings everywhere, and std.os.windows.ReadFile does the translation job under the hood. That means I need to insert a UTF-16 → UTF-8 translation layer between std.os.windows.ReadFile and kernel32.ReadConsoleW.
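As a concrete (and deliberately simplified) illustration of that layer's read direction, the sketch below reads UTF-16 code units with ReadConsoleW and converts them to UTF-8 for the caller. The function name is hypothetical, the ReadConsoleW binding is assumed to follow the documented Win32 signature, the std.unicode helper is named as it was around the time of this PR, and the partial-read state discussed next is ignored entirely.

```zig
const std = @import("std");
const windows = std.os.windows;

/// Sketch: read UTF-16 from a console handle and hand UTF-8 to the caller.
fn readConsoleUtf8(handle: windows.HANDLE, out: []u8) !usize {
    var utf16_buf: [256]u16 = undefined;
    // Read at most as many UTF-16 code units as are guaranteed to fit into
    // `out` after conversion (one code unit expands to at most 3 UTF-8 bytes).
    const max_units = @min(utf16_buf.len, out.len / 3);
    if (max_units == 0) return error.BufferTooSmall;
    var units_read: windows.DWORD = undefined;
    if (windows.kernel32.ReadConsoleW(
        handle,
        &utf16_buf,
        @as(windows.DWORD, @intCast(max_units)),
        &units_read,
        null,
    ) == 0) return error.ReadConsoleFailed;
    return std.unicode.utf16leToUtf8(out, utf16_buf[0..units_read]);
}
```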
Here comes the key problem that gives rise to annoying edge cases: the user-provided buffer is meant to be UTF-8 encoded. Console input (which is UTF-16 encoded) must be converted to UTF-8 before being copied into the user-provided buffer, while the length of the converted UTF-8 can't be easily determined up front. When the user-provided buffer is not large enough to hold the converted console input, there are three pieces of data we have to deal with carefully (a sketch of this state follows the list):

- UTF-8 fragments. For example, a Unicode character from console input converts into a 3-byte sequence, but the user-provided buffer is only 2 bytes long. Then ReadFile should store the first 2 bytes into the user-provided buffer and stash the final byte. Whenever ReadFile is called again, this byte should be popped first before reading anything else from the console.
- A single UTF-16 high surrogate that does not yet form a complete Unicode code point. We should stash it separately and wait for a low surrogate from another ReadConsoleW call before converting it into a UTF-8 byte sequence.
- UTF-16 code units left in ReadConsoleW's dedicated buffer. These are completely opaque to user-mode code, causing serious trouble, as discussed below.
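A hypothetical shape for the per-handle state implied by this list (field names are illustrative, not the ones used in this PR):

```zig
/// State a UTF-16 → UTF-8 read translation layer has to carry between calls.
const ConsoleReadState = struct {
    // (1) Converted UTF-8 bytes that did not fit into the caller's buffer
    // on a previous call; a single code point is at most 4 bytes.
    utf8_fragment: [4]u8 = undefined,
    utf8_fragment_len: u3 = 0,
    // (2) A UTF-16 high surrogate still waiting for its low surrogate
    // from a later ReadConsoleW call.
    pending_high_surrogate: ?u16 = null,
    // (3) intentionally has no field: the UTF-16 code units left in
    // ReadConsoleW's dedicated buffer live inside the console host and
    // cannot be observed or stored by user-mode code at all.
};
```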
There are two modes affecting ReadConsoleW behaviour, determined by the ENABLE_LINE_INPUT flag (accessed with the GetConsoleMode/SetConsoleMode functions); a small mode check is sketched after this list:

- Cooked mode (ENABLE_LINE_INPUT set): if a CR is encountered in the dedicated buffer, an LF is appended to the user-provided buffer and reading from the console ends immediately. Otherwise, wait for console input until a CR is entered (either a single key press or pasting multiple characters). Everything after the CR is left inside the dedicated buffer after the read.
- Raw mode (ENABLE_LINE_INPUT unset): if there is at least one character available in the dedicated buffer, reading from the console ends immediately. Otherwise, wait for console input (either a single key press or pasting multiple characters). The dedicated buffer is flushed after the read.
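Checking which of the two modes a console input handle is in only takes a GetConsoleMode call. A minimal sketch (the helper name and the locally defined constant are for illustration; ENABLE_LINE_INPUT is 0x0002 per the Win32 documentation):

```zig
const windows = @import("std").os.windows;

const ENABLE_LINE_INPUT: windows.DWORD = 0x0002;

/// Distinguish cooked mode from raw mode for a console input handle.
fn isCookedMode(handle: windows.HANDLE) bool {
    var mode: windows.DWORD = undefined;
    if (windows.kernel32.GetConsoleMode(handle, &mode) == 0) return false; // not a console handle
    return (mode & ENABLE_LINE_INPUT) != 0;
}
```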
Cooked mode can be emulated perfectly, as in my implementation. However, to emulate the raw-mode behaviour, we have to try flushing the UTF-8 fragments and then call ReadConsoleW to flush the dedicated buffer. The problem arises when we have popped at least one byte from the UTF-8 fragments: because we have already "read" something (from the UTF-8 fragments), std.os.windows.ReadFile should return immediately if the ReadConsoleW dedicated buffer is empty, but calling ReadConsoleW when the dedicated buffer is empty causes an unwanted wait for console input. As I explained before, there is no way to know whether the dedicated buffer is empty other than calling ReadConsoleW, so it is a stalemate.
If there are better abstractions, please let me know.
Relevant commit message from Go: golang/go@610d522
The readConsole implementation has changed a bit since then; the latest is here: https://github.com/golang/go/blob/6e8caefc19cae465444775f6cd107b138a26cce7/src/internal/poll/fd_windows.go#L454-L521

There does not seem to be any special handling of ENABLE_LINE_INPUT in Rust or Go at all.
For what it's worth, here are my initial reactions to these points:
This is a neutral point IMO, not a pro nor a con.
This makes sense to me and seems like a good change. #12400 returned
This needs some justification. Why should Zig care about LF ↔ CRLF conversion in console input/output? It doesn't seem like Rust/Go do this AFAICT.
👍
For console output, I agree that converting LF to CRLF will not bring special benefits, as the Windows Console Host should recognize a single LF. This conversion makes more sense when writing to a text file. I implemented it this way just because C
But for console input, if we convert only the text encoding after
Matching libc behavior is not a goal here. In my opinion the goals should be:
All right, I will remove LF → CRLF conversion for console output. But I still believe it is a good idea to keep CRLF → LF conversion for console input:
What do you think of this?
It seems like handling this via conversion within
All right, I will remove LF ↔ CRLF conversion completely and leave the job to users, considering that there are no abstractions of "text mode" and "binary mode" for Zig's
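For readers who end up doing this themselves, a user-side CRLF → LF normalization of already-read console input can be as small as the sketch below (illustrative only; a CR at the very end of the buffer, whose LF may arrive with the next read, is deliberately left alone):

```zig
/// Rewrite `buf` in place, collapsing CRLF pairs to LF, and return the new length.
fn crlfToLf(buf: []u8) usize {
    var out: usize = 0;
    var i: usize = 0;
    while (i < buf.len) : (i += 1) {
        // Drop a CR only when it is immediately followed by an LF.
        if (buf[i] == '\r' and i + 1 < buf.len and buf[i + 1] == '\n') continue;
        buf[out] = buf[i];
        out += 1;
    }
    return out;
}
```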
I'm sorry, I didn't review this in time, and now it has bitrotted. Furthermore, so many pull requests have stacked up that I can't keep up and I am therefore declaring Pull Request Bankruptcy and closing old PRs that now have conflicts with master branch. If you want to reroll, you are by all means welcome to revisit this changeset with respect to the current state of master branch, and there's a decent chance your patch will be reviewed the second time around. Either way, I'm closing this now, otherwise the PR queue will continue to grow indefinitely.
All right. I have been busy recently; hopefully I will come back to this PR and rebase all the changes in a month or two.
These commits aim to eliminate text encoding problems regarding Windows native console I/O.
To-do list:
- WriteConsoleW() (resolves Zig runtime should set console output mode to utf-8 by default on Windows #7600, closes std.start: initialize Windows console output CP to UTF-8 on exe startup #14411)
- ReadConsoleW() (resolves Cannot read unicode data from STDIN on Windows #5148)