[API Proposal]: TarReader should not assume same format for all entries in a tar file #69544

@carlossanlop

Background and motivation

The user @Bio2hazard was kind enough to report this bug while testing the new System.Formats.Tar APIs published in .NET 7 Preview 4 with a tar file generated by their own work tools. They shared the file here.

The tar file contains 4 entries:

  • 2 consecutive RegularFile ustar entries.
  • Then a gnu metadata entry of type LongLink.
  • Then the RegularFile entry with the actual data that the previous LongLink metadata entry describes.

I confirmed this by inspecting the archive with a Hex Editor.

The archive also shows that all 4 entries have magic and version metadata fields that follow the ustar rules, even the 3rd entry, whose entry type L is clearly not supported by ustar. In the gnu format, the magic and version fields differ slightly from those of the POSIX formats (ustar and pax). I describe those differences here.
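To make the magic and version difference concrete, here is a minimal sketch of how a single 512-byte header block could be classified on its own. It assumes the standard header layout (type flag at offset 156, a 6-byte magic at offset 257, a 2-byte version at offset 263) and the usual field values: `ustar\0` plus `00` for the POSIX formats, and `ustar ` plus a space and a NUL for gnu. `TarMagicKind` and `TarHeaderProbe` are hypothetical names used only for illustration; they are not part of System.Formats.Tar and do not reflect how TarReader is implemented.

```csharp
using System;
using System.Text;

// Hypothetical types for illustration only; not part of System.Formats.Tar.
public enum TarMagicKind { None, Posix, Gnu, Unrecognized }

public static class TarHeaderProbe
{
    private const int HeaderSize = 512;
    private const int TypeFlagOffset = 156; // 1 byte
    private const int MagicOffset = 257;    // 6 bytes
    private const int VersionOffset = 263;  // 2 bytes

    private static readonly byte[] PosixMagic = Encoding.ASCII.GetBytes("ustar\0");
    private static readonly byte[] PosixVersion = Encoding.ASCII.GetBytes("00");
    private static readonly byte[] GnuMagic = Encoding.ASCII.GetBytes("ustar ");
    private static readonly byte[] GnuVersion = Encoding.ASCII.GetBytes(" \0");

    // Classifies one header block by its magic and version fields only.
    public static TarMagicKind ClassifyMagic(ReadOnlySpan<byte> header)
    {
        if (header.Length < HeaderSize)
            throw new ArgumentException("Expected a 512-byte header block.", nameof(header));

        ReadOnlySpan<byte> magic = header.Slice(MagicOffset, 6);
        ReadOnlySpan<byte> version = header.Slice(VersionOffset, 2);

        if (magic.SequenceEqual(PosixMagic) && version.SequenceEqual(PosixVersion))
            return TarMagicKind.Posix; // ustar or pax
        if (magic.SequenceEqual(GnuMagic) && version.SequenceEqual(GnuVersion))
            return TarMagicKind.Gnu;

        // v7 headers predate the magic field, so an all-zero magic is reported as None.
        foreach (byte b in magic)
        {
            if (b != 0)
                return TarMagicKind.Unrecognized;
        }
        return TarMagicKind.None;
    }

    // True for the situation described above: POSIX magic and version,
    // but a gnu-only metadata type flag ('L' or 'K').
    public static bool HasGnuTypeWithPosixMagic(ReadOnlySpan<byte> header)
    {
        if (ClassifyMagic(header) != TarMagicKind.Posix)
            return false;
        byte typeFlag = header[TypeFlagOffset];
        return typeFlag == (byte)'L' || typeFlag == (byte)'K';
    }
}
```

Under these assumptions, the 3rd entry of the reported archive would classify as `Posix` while `HasGnuTypeWithPosixMagic` returns true, which is exactly the mismatch a strict reader rejects.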

The current behavior of our TarReader is to throw FormatException when, after reading the first entry in a particular format, it encounters a subsequent entry in a different format, or one whose entry type is unsupported by the format initially assumed for the whole archive.

Surprisingly, if this same file is opened and traversed with SharpCompress or 7-zip, they can traverse all the entries without problems. This means that:

  • They do not assume the whole archive is in the same format, and allow intermixing entries of different formats.
  • They do not mind if the entry has a magic and a version that belongs to a particular format, but the entry type is for another format.

I am opening this proposal to discuss the possibility of becoming as flexible as SharpCompress and 7-zip, at least when it comes to the TarReader.

The TarWriter should keep its current behavior: The user should specify the initial format in the constructor, and if an unsupported entry is inserted, it should be converted to that format if possible, or throw an exception if the file is unsupported. This behavior aligns with the Unix tar tool, which fails if, for example, the user tries to add a block device entry, or a long path entry, when creating a v7 or ustar archive (these two formats do not support those types of files).

@bartonjs @stephentoub @eerhardt @adamsitnik @jozkee @jeffhandley @baronfel

API Proposal

Remove the Format property from TarReader to stop assuming all entries are in the same format:

Remove:

namespace System.Formats.Tar;

public class TarReader : IDisposable
{
-    public TarFormat Format { get; }
}

Add:

public class TarEntry
{
+    public TarFormat Format { get; }
}

API Usage

using FileStream fs = File.OpenRead("archive.tar"); // Archive with intermixed entries
using TarReader reader = new(fs);
TarEntry entry = reader.GetNextEntry();

// The entry format can be detected by using the Format property
switch (entry.Format)
{
    case TarFormat.V7:
        //...
        break;
    case TarFormat.Ustar:
        //...
        break;
    case TarFormat.Pax:
        //...
        break;
    case TarFormat.Gnu:
        //...
        break;
    case TarFormat.Unknown:
        //...
        break;
}
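The snippet above inspects only the first entry. As a rough sketch of how the proposed property might be used across a whole archive with intermixed entries, the helper below tallies entries per format. It assumes the proposed `TarEntry.Format` property exists as described; `TarFormatSurvey` and `CountFormats` are hypothetical names used for illustration. Which entries the reader actually surfaces (for example, whether gnu metadata entries are returned or consumed internally) is left to the implementation.

```csharp
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;

// Hypothetical helper for illustration; assumes the proposed TarEntry.Format property.
public static class TarFormatSurvey
{
    // Counts how many entries of each format the reader returns for the given stream.
    public static Dictionary<TarFormat, int> CountFormats(Stream archive)
    {
        var counts = new Dictionary<TarFormat, int>();

        // leaveOpen: true so the caller keeps ownership of the stream.
        using var reader = new TarReader(archive, leaveOpen: true);

        // GetNextEntry returns null once the end of the archive is reached.
        while (reader.GetNextEntry() is TarEntry entry)
        {
            counts.TryGetValue(entry.Format, out int seen);
            counts[entry.Format] = seen + 1;
        }

        return counts;
    }
}
```

A caller could pass `File.OpenRead("archive.tar")` and check whether the resulting dictionary has more than one key to learn that the archive mixes formats.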

Alternative Designs

One alternative is to not remove anything: if we encounter an entry in a different format than the first one, we switch the reader's Format property to Unknown but keep returning entries to the user when they call GetNextEntry. We would have to document this very clearly.

Risks

Low.

  • We are still in preview, so we have time for adjustments.
  • We want to be as helpful as SharpCompress and 7-zip, especially if they are able to handle these cases and we don't.
  • The spec does not explicitly indicate that all entries are expected to be in the same format, but it does specify clear rules about the metadata differences between formats, which should help ensure we properly detect the format of each entry individually and independently from the others.
