Skip to content

Fix ZipArchiveEntry names shown as corrupted on other zip programs #65886

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 3, 2022

Conversation

jozkee
Copy link
Member

@jozkee jozkee commented Feb 25, 2022

Fixes #65750
Due to changes in #59442, _generalPurposeBitFlag didn't have the proper value when it was written here:

A test may or may not be required as .NET is resilient to the error reported, I think is because it decodes with UTF8 regardless. cc @GrabYourPitchforks.

@ghost
Copy link

ghost commented Feb 25, 2022

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #65750
Due to changes in #59442, _generalPurposeBitFlag didn't have the proper value when it was written here:

A test may or may not be required as .NET is resilient to the error reported, I think is because it decodes with UTF8 regardless. cc @GrabYourPitchforks.

Author: Jozkee
Assignees: -
Labels:

area-System.IO.Compression

Milestone: -

_storedEntryName = value;

if (isUTF8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I understand. When Utf8 is detected, we enable the Unicode flag?

Copy link
Contributor

@carlossanlop carlossanlop Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Maybe the name of the flag should be utf8FileNameAndComment to be very clear.

Per the spec: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

 Bit 11: Language encoding flag (EFS).  If this bit is set,
                the filename and comment fields for this file
                MUST be encoded using UTF-8. (see APPENDIX D)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Maybe the name of the flag should be utf8FileNameAndComment to be very clear.

That would definitely help ;) thanks for the pointer to the docs!

Comment on lines +225 to +231
{
_generalPurposeBitFlag |= BitFlagValues.UnicodeFileNameAndComment;
}
else
{
_generalPurposeBitFlag &= ~BitFlagValues.UnicodeFileNameAndComment;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that if the comment has utf8 characters, but the filename does not, now you're removing the utf8 flag.

My intention was to detect utf8 characters in either filename or comment. If one of the two had them, then the general purpose bit flag bit 11 would be turned on. Hence the |=.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably look like the code you added in the Comment setter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that the FullName setter is private, because the name is only set once, in the constructor. The user cannot change it. They have to create a new ZipArchiveEntry instance if they want to change the name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm now that I think of it more, maybe it makes sense to set it to 0 or 1 here, since as I said, this is called in the constructor only, which means Comment has not modified this bit yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding/removing the flag was being performed before your changes so this is really going back to update _generalPurposeBitFlag in the ctor before it spreads.


if (isUTF8)
{
_generalPurposeBitFlag |= BitFlagValues.UnicodeFileNameAndComment;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, since you're avoiding removing the bit flag unexpectedly if filename already set it to true.

Comment on lines -508 to -512
if (_hasUnicodeEntryNameOrComment)
_generalPurposeBitFlag |= BitFlagValues.UnicodeFileNameAndComment;
else
_generalPurposeBitFlag &= ~BitFlagValues.UnicodeFileNameAndComment;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with removing this if we are directly modifying the general purpose bit flag when detecting the comment or filename encoding.

Copy link
Contributor

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense. Just one request before approving it: Can you generate a few zip files, and then open them with 7-zip or winzip to verify they are correct? (both apps show entry comments IIRC, I don't remember if the Windows File Explorer does).

Create two zips where you pass the UTF8 encoding to the constructor, then:

  • One zip file where an entry has utf8 characters in the name, but none in the comments.
  • Another zip file where the comment has utf8 characters in the name, but none in the filename.

Create two zips like the ones above, but instead of UTF8, you pass Latin1 encoding to the constructor.

@jozkee
Copy link
Member Author

jozkee commented Feb 25, 2022

@carlossanlop I tried that locally and verified filename and comment were shown correct on 7zip, winzip doesn't show comments for UTF8.
Latin1 encoding is a subset of UTF8, except values 128 to 159 are unused, Unicode filenames/comments show garbled and other filenames/comments show correct.

FWIW this is the code I used:

  [Theory]
  [InlineData("folder-with-unicode-filenames", true)]
  [InlineData("folder-with-ascii-filenames", false)]
  public static void test(string test, bool unicodeFileName)
  {
      foreach (var encoding in new[] { Encoding.UTF8, Encoding.Latin1 })
      {
          var pwd = @"C:\Users\david\Desktop\zip-file-main\" + test;
          using var stream = File.Create($@"C:\Users\david\Desktop\zip-file-main\out22-{test}-{(unicodeFileName ? "unicodeFileName" : "unicodeComment")}-{encoding.EncodingName}.zip");
          var directory = new DirectoryInfo(pwd);

          using (var zip = new ZipArchive(stream, ZipArchiveMode.Create, true, encoding))
          {
              foreach (FileInfo file in directory.EnumerateFiles("*", SearchOption.AllDirectories))
              {
                  string relativePath = file.FullName.Substring(pwd.Length + 1); // +1 prevents it from including the leading backslash
                  string zipEntryName = relativePath.Replace('\\', '/'); // Normalize slashes

                  ZipArchiveEntry entry = zip.CreateEntryFromFile(file.FullName, zipEntryName);
                  entry.Comment = unicodeFileName ? "Comment" : "你好";// If Filename is unicode, use an ascii comment and viceversa.
              }
          }
      }
  }

@jozkee jozkee requested a review from carlossanlop February 26, 2022 06:54
@JamesNK
Copy link
Member

JamesNK commented Mar 1, 2022

This fix is blocking dotnet/aspnetcore from updating to the latest SDK. It would be good to make some progress towards getting it merged to unblock us.

@carlossanlop
Copy link
Contributor

carlossanlop commented Mar 3, 2022

There were many unrelated failures in (Libraries Test Run release coreclr windows x86 Release) with the message below, which prevented the System.IO.Compression and System.IO.Compression.ZipFile tests to run in that CI leg. I'm re-running them. Will merge them as soon as they finish.

Edit: They failed again, but other CI legs passed. I'll open an issue to investigate the CI failures and will merge this.

Console log: 'Common.Tests' from job 034a7bfb-e0b8-4d76-9e4a-2c85d35b89c7 workitem c44acf02-e12c-46ca-b51f-dabec51eb076 (windows.7.amd64.open.rt) executed on machine a00AC0P

C:\h\w\B41D09B0\w\BD8A0A4C\e>taskkill.exe /f /im corerun.exe 
ERROR: The process "corerun.exe" not found.

C:\h\w\B41D09B0\w\BD8A0A4C\e>call RunTests.cmd --runtime-path C:\h\w\B41D09B0\p 
----- start Thu 03/03/2022  1:11:23.18 ===============  To repro directly: ===================================================== 
pushd C:\h\w\B41D09B0\w\BD8A0A4C\e\
"C:\h\w\B41D09B0\p\dotnet.exe" exec --runtimeconfig Common.Tests.runtimeconfig.json --depsfile Common.Tests.deps.json xunit.console.dll Common.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================

C:\h\w\B41D09B0\w\BD8A0A4C\e>"C:\h\w\B41D09B0\p\dotnet.exe" exec --runtimeconfig Common.Tests.runtimeconfig.json --depsfile Common.Tests.deps.json xunit.console.dll Common.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: Common.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  Common.Tests (found 257 of 265 test cases)
  Starting:    Common.Tests (parallel test collections = on, max threads = 2)
----- end Thu 03/03/2022  1:11:25.52 ----- exit code -1073741819 ----------------------------------------------------------
2022-03-03T01:11:27.277Z	INFO   	run.py	run(48)	main	Beginning reading of test results.
2022-03-03T01:11:27.277Z	INFO   	run.py	__init__(48)	get_log_files	Searching 'C:\h\w\B41D09B0\w\BD8A0A4C\e\..' for log files
Found log 'C:\h\w\B41D09B0\w\BD8A0A4C\e\..\console.9795886e.log'
Uri 'https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-65886-merge-034a7bfbe0b84d769e/Common.Tests/console.9795886e.log?sv=2019-07-07&se=2022-03-23T01%3A11%3A07Z&sr=c&sp=rl&sig=eIE%2Fj1BffznzHobS6c5%2BWwmJ9GA5puhG9SCiRb21MtY%3D'
2022-03-03T01:11:27.277Z	INFO   	run.py	__init__(66)	construct_log_list	Generated log list: console.9795886e.log:
  https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-65886-merge-034a7bfbe0b84d769e/Common.Tests/console.9795886e.log?sv=2019-07-07&se=2022-03-23T01%3A11%3A07Z&sr=c&sp=rl&sig=eIE%2Fj1BffznzHobS6c5%2BWwmJ9GA5puhG9SCiRb21MtY%3D

2022-03-03T01:11:27.277Z	INFO   	run.py	__init__(90)	read_results	Searching 'C:\h\w\B41D09B0\w\BD8A0A4C\e' for test results files
2022-03-03T01:11:27.277Z	INFO   	run.py	__init__(90)	read_results	Searching 'C:\h\w\B41D09B0\w\BD8A0A4C\uploads' for test results files
2022-03-03T01:11:27.277Z	WARNING	run.py	__init__(103)	read_results	No results file found in any of the following formats: xunit, junit, trx
2022-03-03T01:11:27.277Z	INFO   	run.py	packing_test_reporter(30)	report_results	Packing 0 test reports to 'C:\h\w\B41D09B0\w\BD8A0A4C\e\__test_report.json'
2022-03-03T01:11:27.277Z	INFO   	run.py	packing_test_reporter(33)	report_results	Packed 1718 bytes
read file: C:\h\w\B41D09B0\p\debug-dump-template.md
writing output file: C:\h\w\B41D09B0\w\BD8A0A4C\uploads\how-to-debug-dump.md
done writing debug dump information

@carlossanlop carlossanlop merged commit 83adfae into dotnet:main Mar 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Apr 2, 2022
@jozkee jozkee deleted the zipentry_name_corrupt branch May 3, 2022 21:33
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZipArchive corrupts files with unicode characters
4 participants