Cargo packages duplicate files on case-insensitive file systems #13722

@kornelski (Contributor)

Problem

It seems that Cargo excludes already-packaged files using exact name comparison, which doesn't always match how the file system sees name equality.

   Archiving Cargo.lock
   Archiving Cargo.toml
   Archiving Cargo.toml.orig
   Archiving README.md
   Archiving readme.Md
   Archiving src/main.rs

Example crate:

https://docs.rs/crate/rosu/0.6.0/source/

Steps

In Cargo.toml:

   [package]

   readme = "README.md"

Then run:

   echo case > readme.Md
   cargo package

The same applies to license-file, cargo.lock.

Possible Solution(s)

Theoretically there could be other gotchas of this kind, e.g. HFS+ file system on macOS forces file names to use NFD Unicode form, while most text has NFC form, which makes codepoint-by-codepoint comparisons not equal. However HFS+ is on its way out, so perhaps a simple case-insensitive comparison will suffice.
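To illustrate the NFD/NFC point, here is a minimal sketch using only the Rust standard library. The helper `same_code_points` and the file name `café.md` are made up for illustration; this is not Cargo's code. The two constants render identically but differ codepoint by codepoint, and lowercasing does not reconcile them.

```rust
/// Hypothetical helper: exact, codepoint-by-codepoint comparison of two names.
pub fn same_code_points(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

/// "é" stored as a single precomposed code point (NFC form).
pub const NAME_NFC: &str = "caf\u{e9}.md";

/// "é" stored as "e" followed by a combining acute accent (NFD form).
pub const NAME_NFD: &str = "cafe\u{301}.md";

/// Returns true when the two spellings differ even after lowercasing,
/// i.e. a case-insensitive comparison alone would not treat them as equal.
pub fn differ_after_lowercasing() -> bool {
    !same_code_points(&NAME_NFC.to_lowercase(), &NAME_NFD.to_lowercase())
}
```

Catching this class of collision would need Unicode normalization on top of case folding, which the standard library does not provide.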

Notes

No response

Version

cargo 1.79.0-nightly (499a61ce7 2024-03-26)

Activity

Labels added on Apr 7, 2024: C-bug (Category: bug), S-triage (Status: This issue is waiting on initial triage.)
Title changed on Apr 7, 2024 from "Cargo packages duplicate README on case-insensitive file systems" to "Cargo packages duplicate files on case-insensitive file systems"

kornelski commented on Apr 7, 2024 (Contributor, Author)

In the same vein, if there's a TARGET/ directory, it doesn't get excluded when packaging.

Caused by:
  Source directory was modified by build.rs during cargo publish. Build scripts should not modify anything outside of OUT_DIR.
  Added: /private/tmp/bla/target/package/testx-0.0.0/TARGET/.rustc_info.json
  	/private/tmp/bla/target/package/testx-0.0.0/TARGET/debug
  	/private/tmp/bla/target/package/testx-0.0.0/TARGET/debug/.cargo-lock
  	/private/tmp/bla/target/package/testx-0.0.0/TARGET/debug/.fingerprint

heisen-li commented on Apr 9, 2024 (Contributor)

@rustbot label Command-package

VorpalBlade commented on Apr 23, 2024

> so perhaps a simple case-insensitive comparison will suffice

While that sounds lovely, in what locale? For the languages I speak it is relatively straightforward, but my understanding is that case handling is lossy in some languages, such as German (ß is SS in upper case, I think?) and Turkish (I believe they have the letter "i" both with and without a dot, and the upper/lower case mapping there isn't straightforward, but don't ask me how exactly).

As a Swedish/English speaker this is all hearsay though, and I don't know how e.g. Windows or macOS handle these, though I think I heard that NTFS stores a case normalisation table at file system creation time based on the locale set at that point?
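For reference, a small sketch of how Rust's standard library treats the characters mentioned above. `str::to_uppercase` and `str::to_lowercase` use the locale-independent Unicode mappings; the helper names `case_forms` and `round_trips` are made up for illustration. None of this describes what a particular file system does.

```rust
/// Returns (uppercased, lowercased) forms using Rust's
/// locale-independent Unicode case mappings.
pub fn case_forms(s: &str) -> (String, String) {
    (s.to_uppercase(), s.to_lowercase())
}

/// Hypothetical check: does upper-casing and then lower-casing
/// give back the original string?
pub fn round_trips(s: &str) -> bool {
    s.to_uppercase().to_lowercase() == s
}

/// German sharp s (U+00DF); upper-cases to the two letters "SS".
pub const SHARP_S: &str = "\u{df}";
/// Turkish dotless lowercase i (U+0131); upper-cases to plain "I".
pub const DOTLESS_I: &str = "\u{131}";
/// Turkish dotted capital I (U+0130); lower-cases to "i" plus U+0307.
pub const DOTTED_CAPITAL_I: &str = "\u{130}";
```

With these mappings, `round_trips(SHARP_S)` and `round_trips(DOTLESS_I)` are both false, which is the lossiness in question: once upper-cased, "ß" and "ss" (or "ı" and "i") can no longer be told apart.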

ChrisDenton commented on Apr 23, 2024 (Member)

On Windows, the NTFS up case table is initialized when the drive is first formatted. So it'll depend on the Windows version that did that. It is however language neutral and only acts on the Basic Multilingual Plane.

Also, depending on the configuration, NTFS can be case sensitive. In Windows this can even be set differently for each directory.

VorpalBlade commented on Apr 23, 2024

> On Windows, the NTFS up case table is initialized when the drive is first formatted. So it'll depend on the Windows version that did that. It is however language neutral and only acts on the Basic Multilingual Plane.

Hm, maybe I'm thinking of FAT and Windows 9x then? Pretty sure things differed depending on code pages and such there. Not sure how modern OSes interacting with FAT32/exFAT deal with that. Hopefully it is somewhat sane on any Windows version Rust still supports.

ChrisDenton commented on Apr 23, 2024 (Member)

Ah yes FAT32 is indeed a mess. But then I'm also not sure how well Cargo and rustc support it as it lacks a lot of filesystem features that may be expected. Probably it does at least work if it's only read from (e.g. the target directory is on another drive).

kornelski commented on Apr 24, 2024 (Contributor, Author)

> While that sounds lovely, in what locale?

It is a messy problem, but fortunately the detection algorithm doesn't need to produce user-facing text, so it doesn't need to be perfect from a linguistic perspective. It only needs to detect potential collisions between file names. Crates that work with only a specific combination of Windows locale and NTFS vintage are not generally useful, so the detection can also err on the side of over-normalizing (e.g. normalize all dotless ı's to i, forbid all control characters, check against both lower and upper case, treat codepoints with multiple transliterations/decompositions as a wildcard, etc.).

But for a start, even a simple .to_lowercase() will handle more than enough of the accidental variations such as Readme.Md.
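A minimal sketch of that kind of detection, assuming the packaged paths are available as plain strings. The names `collision_key` and `find_collisions` are hypothetical and this is not how Cargo implements packaging. The key lowercases the path and folds dotless ı to i, erring on the side of over-normalizing as suggested above.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical normalization key: lowercase the path, then fold
/// dotless ı (U+0131) to a plain "i". Deliberately over-normalizes.
pub fn collision_key(path: &str) -> String {
    path.to_lowercase().replace('\u{131}', "i")
}

/// Returns pairs (first_seen, later_duplicate) of paths whose keys collide.
pub fn find_collisions<'a>(paths: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut seen: HashMap<String, &'a str> = HashMap::new();
    let mut collisions = Vec::new();
    for &path in paths {
        match seen.entry(collision_key(path)) {
            // A path with the same key was already recorded: report the pair.
            Entry::Occupied(first) => collisions.push((*first.get(), path)),
            // First time this key appears: remember the original spelling.
            Entry::Vacant(slot) => {
                slot.insert(path);
            }
        }
    }
    collisions
}
```

Run over the file list from the report (`Cargo.toml`, `README.md`, `readme.Md`, `src/main.rs`), this flags the `README.md` / `readme.Md` pair. The same key also treats `TARGET` and `target` as equal. It does not handle Unicode normalization or the other over-normalizing rules listed above.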

Participants: @epage, @kornelski, @VorpalBlade, @ChrisDenton, @heisen-li

Cargo packages duplicate files on case-insensitive file systems · Issue #13722 · rust-lang/cargo