Skip to content

gix free pack verify --statistics uses ambiguous "KB" for SI kilobyte #1947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
EliahKagan opened this issue Apr 10, 2025 · 7 comments
Closed
Labels
acknowledged an issue is accepted as shortcoming to be fixed help wanted Extra attention is needed

Comments

@EliahKagan
Copy link
Member

EliahKagan commented Apr 10, 2025

Current behavior 😯

As attested by the current journey test snapshots, gix free pack verify --statistics outputs compression-related sizes in units of "KB":

compression
compressed entries size : 51.8 KB
decompressed entries size : 103.7 KB
total object size : 288.7 KB
pack size : 51.9 KB

But it is not immediately clear what unit that actually is. Is it…

  • …an SI decimal kilobyte, equal to 1000 bytes? (kB)
  • …an IEC binary kibibyte, equal to 1024 bytes? (KiB)

It turns out that, in this case, 1 KB = 1 kB, but it is not obvious.

Expected behavior 🤔

It is also not obvious what unit is intended. gitoxide-core uses the bytesize library to display the units:

writeln!(out, "\ncompression")?;
#[rustfmt::skip]
writeln!(
out, "\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}",
"compressed entries size", ByteSize(stats.total_compressed_entries_size),
"decompressed entries size", ByteSize(stats.total_decompressed_entries_size),
"total object size", ByteSize(stats.total_object_size),
"pack size", ByteSize(stats.pack_size),
width = width
)?;

What unit is displayed and what symbol is used to represent it varies across major versions of bytesize. In current (i.e. recent stable) releases, the default unit is the IEC binary kibibyte, which it abbreviates KiB; while one can explicitly request the SI decimal kilobyte, which it abbreviates kB. But old versions of bytesize behave differently, defaulting to the SI decimal kilobyte, and also abbreviating it with the non-SI symbol KB. The new behavior came in as of bytesize 2.0.0. But gitoxide-core depends on:

bytesize = "1.0.1"

I suggest upgrading to bytesize 2.0.* and deciding whether we actually want…

  • …units of 1000 bytes abbreviated kB, in which case the above would be changed to:

    writeln!(out, "\ncompression")?;
    #[rustfmt::skip]
    writeln!(
    out, "\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}\n\t{:<width$}: {}",
    "compressed entries size", ByteSize(stats.total_compressed_entries_size).display().si(),
    "decompressed entries size", ByteSize(stats.total_decompressed_entries_size).display().si(),
    "total object size", ByteSize(stats.total_object_size).display().si(),
    "pack size", ByteSize(stats.pack_size).display().si(),
    width = width
    )?;

  • …or units of 1024 bytes abbreviated KiB, in which case no change would be needed to that source code file.

    (Though it could, if desired, be made explicit by calling iec() where the SI alternative calls si().)

Git behavior

I'm not sure if there's a Git behavior that should be considered to correspond exactly to this, since gix doesn't have or aim for the same interface as git, and since git verify-pack does not show file sizes in its statistics:

$ git verify-pack -s tests/fixtures/packs/pack-11fdfa9e156ab73caae3b6da867192221f2089c2.idx
non delta: 18 objects
chain length = 1: 4 objects
chain length = 2: 3 objects
chain length = 3: 1 object
chain length = 4: 2 objects
chain length = 5: 1 object
chain length = 6: 1 object

But some other git commands do show sizes of things in "human" units. For example:

$ git count-objects -vH
count: 0
size: 0 bytes
in-pack: 163413
packs: 2
size-pack: 79.16 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

When it is run without -H the size-pack value is shown with no unit, but git-count-objects(1) documents it as being in units of KiB. With neither -v nor -H, one gets:

$ git count-objects
0 objects, 0 kilobytes

There, "kilobytes" is ambiguous. One might think it means decimal SI kilobytes (1000 bytes). But actually that occurrence of "kilobytes" means binary IEC kibibytes, as revealed by:

if (human_readable)
	strbuf_humanise_bytes(&buf, loose_size);
else
	strbuf_addf(&buf, "%lu kilobytes",
			(unsigned long)(loose_size / 1024));

I don't think any of this has much bearing on what we should do, since it's about display behavior that makes no effort to be similar to Git. However, it may be that the preference in Git for using binary IEC units--rather than decimal SI units--reflects a preference for those units, or would lead users to expect that gitoxide use such units. (My personal preference is also for binary IEC units.)

Any change here, especially if it includes upgrading bytesize, should be fairly convenient for me to include in a larger PR that I am already working on. (Although the above-linked code currently fixes this by keeping them SI decimal kilobytes and changing the unit to "kB", I did that because it is closer to the current behavior, not to express a preference for that approach.)

Steps to reproduce 🕹

Check the journey test snapshot file shown above and observe that the journey tests are passing. Alternatively, run:

cargo run --bin=gix -- --no-verbose free pack verify --statistics tests/fixtures/packs/pack-11fdfa9e156ab73caae3b6da867192221f2089c2.idx

This shows the following, which are "KB" units where by "KB" it means what would be less ambiguously called "kB":

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/debug/gix --no-verbose free pack verify --statistics tests/fixtures/packs/pack-11fdfa9e156ab73caae3b6da867192221f2089c2.idx`
objects per delta chain length
         0: 18
         1: 4
         2: 3
         3: 1
         4: 2
         5: 1
         6: 1
        ->: 30

averages
        delta chain length:            1;
        decompressed entry [B]:        3456;
        compressed entry [B]:          1725;
        decompressed object size [B]:  9621;

compression
        compressed entries size       : 51.8 KB
        decompressed entries size     : 103.7 KB
        total object size             : 288.7 KB
        pack size                     : 51.9 KB

        num trees                     : 15
        num blobs                     : 5
        num commits                   : 10
        num tags                      : 0

        compression ratio             : 2.00
        delta compression ratio       : 5.58
        delta gain                    : 2.78
        pack overhead                 : 0.235%

Broadening the scope: Other uses of ambiguous units

I've framed this in terms of gix free pack verify --statistics because that's what I stumbled upon first (EliahKagan#18 (comment)), and because the exact way that formats sizes is under test, and because I didn't really think through the full scope this issue should have. But this should very possibly be construed more broadly: some other places also show ambiguous units, and also show what I believe to be decimal SI units when they might perhaps better show binary IEC units.

$ gix clone [email protected]:EliahKagan/gitoxide.git
 19:17:24 indexing done 161.1K objects in 2.85s (56.5K objects/s)
 19:17:24 decompressing done 175.7MB in 2.85s (61.7MB/s)
 19:17:25     Resolving done 161.1K objects in 0.86s (188.1K objects/s)
 19:17:25      Decoding done 1.4GB in 0.86s (1.6GB/s)
 19:17:25 writing index file done 4.5MB in 0.02s (290.5MB/s)
 19:17:25  create index file done 161.1K objects in 3.80s (42.4K objects/s)
 19:17:25          read pack done 77.6MB in 4.10s (18.9MB/s)
 19:17:25           checkout done 2.4K files in 0.14s (16.9K files/s)
 19:17:25            writing done 72.3MB in 0.14s (516.5MB/s)
...

In contrast, Git uses binary IEC units:

git clone https://github.com/GitoxideLabs/gitoxide.git
Cloning into 'gitoxide'...
remote: Enumerating objects: 163078, done.
remote: Counting objects: 100% (869/869), done.
remote: Compressing objects: 100% (363/363), done.
remote: Total 163078 (delta 630), reused 506 (delta 506), pack-reused 162209 (from 6)
Receiving objects: 100% (163078/163078), 74.22 MiB | 37.44 MiB/s, done.
Resolving deltas: 100% (107181/107181), done.

Upgrading bytesize in all gitoxide crates' Cargo.toml does not affect that. I'm not sure if that's because prodash depends on bytesize 1.3.3, or for some other reason.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Apr 10, 2025
`gix free pack verify --statistics` formerly used "KB" for
kilobytes (i.e., SI decimal kilobytes, units of 1000 bytes). This
was somewhat ambiguous because it is occasionally also used for
kibibytes (i.e., IEC binary kibibytes, units of 1024 bytes).

Kilobytes and kibibytes can be more precisely distinguished by
using kB for kilobytes (since "k" is the SI prefix for "kilo") and
KiB for kibibytes (since decimal kilobytes are never written KiB).

This adapts `gitoxide-core` to changes in `bytesize` and, in so
doing, allows the SI unit symbol "kB" to be used.

Fixes GitoxideLabs#1947
@Byron Byron added help wanted Extra attention is needed acknowledged an issue is accepted as shortcoming to be fixed labels Apr 11, 2025
@Byron
Copy link
Member

Byron commented Apr 11, 2025

Thanks for bringing this up!

If a bytesize upgrade would do the trick, would you want to submit a PR for it?
Maybe prodash also needs an upgrade, and if it's not done yet it's likely that I wasn't willing to spend time to deal with breaking changes.

In any case, I think this should be fixed and I am happy to help with prodash .

@Byron Byron closed this as completed in 77a3a1b Apr 11, 2025
@EliahKagan
Copy link
Member Author

EliahKagan commented Apr 12, 2025

If a bytesize upgrade would do the trick, would you want to submit a PR for it?

#1949 included an upgrade to bytesize, except where it appears transitively through prodash. I noticed this issue with gix free pack verify --statistics due to a journey test failure that happened in the fork-internal test PR EliahKagan#18, which was an experiment to see what the effect of #1948 would be. (EliahKagan#18 was, in effect, a preview of #1949.)

So 77a3a1b in #1949 fixed this issue, at least so long as we construe this issue narrowly to be only about the behavior of gix free pack verify that is exercised in the journey tests. However, even aside from its narrowness, this may or may not be the preferred fix going forward.

As detailed above, it seems to me that, for amounts of data in units that are ultimately based on bytes, we shouldn't use SI decimal units like "kB" (formerly "KB") here at all, but that we should instead use IEC binary units like "KiB". This is subjective. Switching to binary IEC units would have come free (other than updating the journey tests) with upgrading bytesize, since in major version 2 it defaults to displaying those units.

The reason I went with keeping decimal SI units instead (with the less ambiguous "kB" symbol) is that I wanted to make the smallest change to observable behavior that fixed the bug and allowed all tests to pass in #1949.

A bigger concern is that there may be other areas where the observable behavior has also changed already due to #1949 upgrading bytesize, but which are not under test. This could be good or bad, depending on whether we intend the change. It is something I had meant to ask about in #1949 (comment), but didn't (except to the extent to which point 3 touched on the very edges of it). Sorry about that.

In addition, the crates for which bytesize was upgraded and that use bytesize as a non-dev direct dependency are gitoxide-core and gix-features. It looks to me now like this upgrade may actually have constituted a breaking change in gix-features. I have posted #1952 to ask about this.

Maybe prodash also needs an upgrade, and if it's not done yet it's likely that I wasn't willing to spend time to deal with breaking changes.

How do I determine whether a change to prodash itself requires a major version bump? As far as I can tell, it doesn't re-export bytesize. But since the main use of prodash is to display things, presumably there is some degree of change in how it displays things that would constitute a breaking change in and of itself. Is that so? If so, does changing both what units are used (which will change numbers) and the symbols for them constitute such a break?

In any case, I think this should be fixed and I am happy to help with prodash .

It's likely, though not certain, that I may be able to open a PR for that tonight.

EliahKagan added a commit to EliahKagan/prodash that referenced this issue Apr 13, 2025
And adapt `Bytes::format_bytes()` to the API change.

The old `bytesize::to_string()` function is confusing in its second
parameter `si_prefix: bool`, which is why it was removed. But it
looks like a value of `false`, as we were passing, actually caused
decimal SI units (rather than binary IEC units) to be used. It's
not obvious which units `prodash` intended to use `bytesize` to
convert to and display in. But this seems to be a minimal change to
adapt to the new major version.

In spite of the name `si_prefix` for the old parameter, experiments
show that setting it to `true` caused values to be converted to and
presented in binary IEC units.

Although this attempts not to change the units that are used, is is
expected to produce observable differneces for some of them, in how
they are presented. In particular, a decimal SI kilobyte is least
ambiguously abbreviated "kB" (because "k" is an SI unit symbol
prefix for "kilo" in its meaning of 1000), but it was previously
written as "KB". It is now expected to be writen as "kB".

See also:
GitoxideLabs/gitoxide#1947 (comment)
EliahKagan added a commit to EliahKagan/prodash that referenced this issue Apr 13, 2025
And adapt `Bytes::format_bytes()` to the API change.

The old `bytesize::to_string()` function is confusing in its second
parameter `si_prefix: bool`, which is why it was removed. But it
looks like a value of `false`, as we were passing, actually caused
decimal SI units (rather than binary IEC units) to be used. It's
not obvious which units `prodash` intended to use `bytesize` to
convert to and display in. But this seems to be a minimal change to
adapt to the new major version.

In spite of the name `si_prefix` for the old parameter, experiments
show that setting it to `true` caused values to be converted to and
presented in binary IEC units.

Although this attempts not to change which actual units are used,
it does produce observable differneces for some of them, in how
they are presented. In particular, a decimal SI kilobyte is least
ambiguously abbreviated "kB" (because "k" is an SI unit symbol
prefix for "kilo" in its meaning of 1000), but it was previously
written as "KB". It is now expected to be writen as "kB". Tests
catch this distinction, and are updated here accordingly to assert
that the generally preferable "kB" symbol for decimal SI kilobyte
is used.

See also:
GitoxideLabs/gitoxide#1947 (comment)
EliahKagan added a commit to EliahKagan/prodash that referenced this issue Apr 13, 2025
This upgrades the `bytesize` dependency in `Cargo.toml` from
version 1.0.1 (which was usually selecting 1.3.3) to 2.0.1 (which
is the latest version).

There were some nontrivial API changes from major version 1 to 2.
Accordingly, this adapts `Bytes::format_bytes()` to the API change.

The old `bytesize::to_string()` function was confusing in its
second parameter `si_prefix: bool`, which is why it was removed.
But it looks like a value of `false`, as we were passing, actually
caused decimal SI units (rather than binary IEC units) to be used.
It's not obvious which units `prodash` intended to use `bytesize`
to convert to and display in. But this seems to be a minimal change
to adapt to the new major version.

In spite of the name `si_prefix` for the old parameter, experiments
show that setting it to `true` caused values to be converted to and
presented in binary IEC units.

Although this attempts not to change which actual units are used,
it does produce observable differneces for some of them, in how
they are presented. In particular, a decimal SI kilobyte is least
ambiguously abbreviated "kB" (because "k" is an SI unit symbol
prefix for "kilo" in its meaning of 1000), but it was previously
written as "KB". It is now expected to be writen as "kB". Tests
catch this distinction, and are updated here accordingly to assert
that the generally preferable "kB" symbol for decimal SI kilobyte
is used.

See also:
GitoxideLabs/gitoxide#1947 (comment)
EliahKagan added a commit to EliahKagan/prodash that referenced this issue Apr 13, 2025
This upgrades the `bytesize` dependency in `Cargo.toml` from
version 1.0.1 (which was usually selecting 1.3.3) to 2.0.1 (which
is the latest version).

There were some nontrivial API changes from major version 1 to 2.
Accordingly, this adapts `Bytes::format_bytes()` to the API change.

The old `bytesize::to_string()` function was confusing in its
second parameter `si_prefix: bool`, which is why it was removed.
But it looks like a value of `false`, as we were passing, actually
caused decimal SI units (rather than binary IEC units) to be used.
It's not obvious which units `prodash` intended to use `bytesize`
to convert to and display in. But this seems to be a minimal change
to adapt to the new major version.

In spite of the name `si_prefix` for the old parameter, experiments
show that setting it to `true` caused values to be converted to and
presented in binary IEC units.

Although this attempts not to change which actual units are used,
it does produce observable differences for some of them, in how
they are presented. In particular, a decimal SI kilobyte is least
ambiguously abbreviated "kB" (because "k" is an SI unit symbol
prefix for "kilo" in its meaning of 1000), but it was previously
written as "KB". It is now expected to be writen as "kB". Tests
catch this distinction, and are updated here accordingly to assert
that the generally preferable "kB" symbol for decimal SI kilobyte
is used.

See also:
GitoxideLabs/gitoxide#1947 (comment)
@Byron
Copy link
Member

Byron commented Apr 13, 2025

A bigger concern is that there may be other areas where the observable behavior has also changed already due to #1949 upgrading bytesize, but which are not under test. This could be good or bad, depending on whether we intend the change. It is something I had meant to ask about in #1949 (comment), but didn't (except to the extent to which point 3 touched on the very edges of it). Sorry about that.

No worries at all - I missed it too, but admittedly I also didn't think much about. The system as is (without automation) is prone to failure when faced with very fallible humans.
Even if re-exports were forbidden, then using external crate types in the API would also have to be forbidden. In any case, right now one is going to fail to consistently declare breaking changes, it's a certainty.

How do I determine whether a change to prodash itself requires a major version bump? As far as I can tell, it doesn't re-export bytesize. But since the main use of prodash is to display things, presumably there is some degree of change in how it displays things that would constitute a breaking change in and of itself. Is that so? If so, does changing both what units are used (which will change numbers) and the symbols for them constitute such a break?

I never considered the display for humans could be a breaking change, only breaking the build and in the worst case, behaviour.

Thanks again for all your help with this, it's much appreciated and I am definitely feeling like I should rush things a little less.

@EliahKagan
Copy link
Member Author

I never considered the display for humans could be a breaking change, only breaking the build and in the worst case, behaviour.

I'm not sure how big of a change is needed before that would qualify as breaking. One factor that made me wonder about it is that the exact output of gix free pack verify has a journey test, which had to be adjusted to pass even when keeping decimal SI units but changing "KB" to "kB".

@Byron
Copy link
Member

Byron commented Apr 14, 2025

Right, the journey tests. These are from a time when I nailed the output 'just because', and before it was clear that gix was a developer tool which can change at will.
These days, new commands don't get journey tests, and the underlying functionality is expected to be tested sufficiently in unit tests.
In any case, their human-output can change at will.

@EliahKagan
Copy link
Member Author

In this case, the initial change before I did anything to adapt to the new version of bytesize was:

-compressed entries size       : 51.8 KB
-decompressed entries size     : 103.7 KB
-total object size             : 288.7 KB
-pack size                     : 51.9 KB
+compressed entries size       : 50.5 KiB
+decompressed entries size     : 101.3 KiB
+total object size             : 281.9 KiB
+pack size                     : 50.7 KiB

When that appears as part of larger output, or if it is formatted in a non-monospace font, it is not immediately obvious that what has happened is that the units being used have changed from units of 1000 bytes to units of 1024 bytes.

So, from my perspective, the journey test was helpful to me here. Without it, I would not have noticed this issue, and it might not have occurred to me that there could be skew in the behavior of bytesize when it is used directly versus through prodash. (Thus, the events that led to #1947 (comment) would not have occurred and prodash might not have ended up getting updated.)

To be clear, I am not saying this to argue that more journey tests be created for gix. Rather, I am curious if this ought to be considered something where the underlying functionality ought to have a unit test somewhere (other than within in the code of bytesize itself, where of course I presume it already does).

@Byron
Copy link
Member

Byron commented Apr 20, 2025

To be clear, I am not saying this to argue that more journey tests be created for gix. Rather, I am curious if this ought to be considered something where the underlying functionality ought to have a unit test somewhere (other than within in the code of bytesize itself, where of course I presume it already does).

In a way, the current suite of journey-tests is overly detailed where it does exist for gix, and in theory it could be reduced to a few strategic tests that are either validating important functionality on the highest level possible, or that are known to have failed in the past, catching real bugs in the process. As I'd be hesitant to remove anything from an existing test-suite, I'd at least say that finding and adding more strategic journey tests is an option.
For ein they are certainly more desirable as it wants to be more than a developer tool.

However, doing so is probably rarely necessary if there are enough unit tests, and adding more can always be done if coverage is too low in some places.
Lastly, with bytesize I am still quite indifferent - even though it's a breaking change, it's nothing that downstream can depend on (particularly not output for humans).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledged an issue is accepted as shortcoming to be fixed help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants