Description
Background and motivation
The user @Bio2hazard was kind enough to report this bug when testing the new `System.Formats.Tar` APIs published in .NET 7 Preview 4, using a tar file generated with their own work tools. They shared the file here.
The tar file contains 4 entries:
- 2 consecutive `RegularFile` `ustar` entries.
- Then a `gnu` metadata entry of type `LongLink`.
- Then the `RegularFile` entry with the actual data that the previous `LongLink` metadata entry describes.
I confirmed this by inspecting the archive with a Hex Editor.
The archive also shows that all 4 entries have `magic` and `version` metadata fields that follow the `ustar` rules. This includes the 3rd entry, which is clearly not supported by `ustar` because its entry type is `L`. In the `gnu` format, the `magic` and `version` fields are slightly different from those of the POSIX formats (`ustar` and `pax`). I describe those differences here.
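As a rough illustration of the `magic` and `version` differences mentioned above, here is a minimal sketch that classifies a raw 512-byte header by those two fields alone. The helper name `ClassifyMagic` and its string return values are hypothetical and are not part of `System.Formats.Tar`. The offsets (type flag at 156, `magic` at 257, `version` at 263) and the field values are taken from the POSIX `ustar` header layout and the GNU tar header definitions. The fields alone cannot tell `ustar` from `pax`, so both are reported as "posix".

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; it is not part of System.Formats.Tar.
// It classifies a 512-byte tar header by its magic and version fields only.
static string ClassifyMagic(byte[] header)
{
    const int MagicOffset = 257;   // 6-byte magic field
    const int VersionOffset = 263; // 2-byte version field

    if (header.Length < 512)
        throw new ArgumentException("A tar header block is 512 bytes.", nameof(header));

    string magic = Encoding.ASCII.GetString(header, MagicOffset, 6);
    string version = Encoding.ASCII.GetString(header, VersionOffset, 2);

    if (magic == "ustar\0" && version == "00") return "posix"; // ustar or pax
    if (magic == "ustar " && version == " \0") return "gnu";   // "ustar  \0" across both fields
    if (magic == "\0\0\0\0\0\0") return "v7";                   // v7 has no magic at all
    return "unknown";
}

// Simulate a header like the 3rd entry described above:
// ustar-style magic and version, but a GNU-only type flag 'L'.
byte[] header = new byte[512];
Encoding.ASCII.GetBytes("ustar\0").CopyTo(header, 257);
Encoding.ASCII.GetBytes("00").CopyTo(header, 263);
header[156] = (byte)'L'; // type flag field

Console.WriteLine(ClassifyMagic(header)); // prints "posix"
Console.WriteLine((char)header[156]);     // prints "L"
```

The mismatch is visible in the output: the fields say POSIX while the type flag belongs to `gnu`, which is the combination that a reader assuming one format per archive rejects.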
The current behavior of our `TarReader` is to throw `FormatException` when, after reading an initial entry of a particular format, it encounters a subsequent entry in a different format, or an entry whose type is unsupported by the format initially assumed for the whole archive.
Surprisingly, if this same file is opened and traversed with SharpCompress or 7-zip, they can traverse all the entries without problems. This means that:
- They do not assume the whole archive is in the same format, and allow intermixing entries of different formats.
- They do not mind if the entry has a magic and a version that belongs to a particular format, but the entry type is for another format.
I am opening this proposal to discuss the possibility of becoming as flexible as SharpCompress and 7-zip, at least when it comes to the `TarReader`.
The `TarWriter` should keep its current behavior: the user should specify the initial format in the constructor, and if an unsupported entry is inserted, it should be converted to that format if possible, or an exception should be thrown if the entry is unsupported. This behavior aligns with the Unix `tar` tool, which fails if, for example, the user tries to add a block device entry or a long path entry when creating a `v7` or `ustar` archive (these two formats do not support those types of files).
@bartonjs @stephentoub @eerhardt @adamsitnik @jozkee @jeffhandley @baronfel
API Proposal
Remove the `Format` property from `TarReader` to stop assuming all entries are in the same format:
Remove:
```diff
namespace System.Formats.Tar;

public class TarReader : IDisposable
{
-    public TarFormat Format { get; }
}
```
Add:
```diff
public class TarEntry
{
+    public TarFormat Format { get; }
}
```
API Usage
```csharp
using System.Formats.Tar;
using System.IO;

using FileStream fs = File.OpenRead("archive.tar"); // Archive with intermixed entries
using TarReader reader = new(fs);

// GetNextEntry returns null once there are no more entries to read
while (reader.GetNextEntry() is TarEntry entry)
{
    // The entry format can be detected by using the Format property
    switch (entry.Format)
    {
        case TarFormat.V7:
            //...
            break;
        case TarFormat.Ustar:
            //...
            break;
        case TarFormat.Pax:
            //...
            break;
        case TarFormat.Gnu:
            //...
            break;
        case TarFormat.Unknown:
            //...
            break;
    }
}
```
Alternative Designs
One alternative is to not remove anything: if we encounter an entry of a different format than the first one, we switch the `Format` property to `Unknown`, but keep returning entries to the user when they call `GetNextEntry`. We would have to document this very clearly.
Risks
Low.
- We are still in preview, so we have time for adjustments.
- We want to be as helpful as SharpCompress and 7-zip, especially if they are able to handle these cases and we don't.
- The spec does not explicitly indicate that all entries are expected to be in the same format, but it does specify clear rules about the metadata differences between formats, which should help ensure we properly detect the format of each entry individually and independently from the others.