range-diff: add configurable memory limit for cost matrix #1958
base: master
Thank you for this patch! It is reasonable, I just have one suggestion how to improve it.
Please note that there has been a highly over-engineered attempt at addressing this problem before: https://lore.kernel.org/git/[email protected]/t/#me423268c4f14a0d37c0ac3e83dc7d5e9cea3661a. You probably want to mention this in the "cover letter" (i.e. in the initial PR comment that will be sent), even though that patch series' contributor seems to be AWOL for years already.
Thank you @dscho for the thoughtful review! I attempted to implement your suggestion of checking content size within read_patches(), but discovered an issue:
if (strbuf_read(&contents, cp.out, 0) < 0) { // Line 87 - reads ALL output
error_errno(_("could not read `log` output"));
...
}
// Only AFTER reading everything do we process line by line
for (; size > 0; size -= len, line += len) {
// Checking limits here is too late: the memory is already consumed
}
For the test case with 256k commits, this means ~6GB is read into the contents strbuf before any limits can be checked. By the time we could check content size or commit count in the loop, the memory is already exhausted. To properly implement early exit as you suggested, we would need to:
Would you prefer:
I'll also reference the previous RFC attempt as you suggested. |
@pcasaretto wow, thorough work! Personally, I would prefer the streaming approach, but I could understand if it is unreasonable to ask for such a huge refactor just to get the bug fix in. Your choice! |
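For readers wondering what the streaming idea could look like, here is a minimal sketch and not the refactor discussed above. It assumes the `log` output can be consumed through a `FILE *` and that every commit starts with a line beginning with `commit `. The function name `scan_log_with_limits` and both budgets are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of git: scan `in` chunk by chunk and bail
 * out as soon as a byte or commit budget is exceeded, instead of slurping
 * everything first. Assumes each commit starts with a "commit " line.
 * Returns 0 and stores the commit count on success, -1 on a limit hit.
 */
static int scan_log_with_limits(FILE *in, size_t max_bytes,
				size_t max_commits, size_t *commits_out)
{
	char chunk[4096];
	size_t bytes = 0, commits = 0;
	int at_line_start = 1;

	assert(in && commits_out);
	while (fgets(chunk, sizeof(chunk), in)) {
		size_t len = strlen(chunk);

		if (len > max_bytes - bytes)
			return -1; /* byte budget exceeded */
		bytes += len;
		if (at_line_start && !strncmp(chunk, "commit ", 7) &&
		    ++commits > max_commits)
			return -1; /* commit budget exceeded */
		/* A chunk without a trailing newline is a partial long line. */
		at_line_start = len > 0 && chunk[len - 1] == '\n';
	}
	*commits_out = commits;
	return 0;
}
```

The point of the sketch is only that the limit is checked while reading, so memory use stays bounded by the chunk size rather than by the total output.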
1a92256
to
daea1fe
Compare
After pairing with @thehcma, we've updated the approach to address the memory exhaustion issue more directly. Instead of pre-counting commits, we now check the actual memory requirements of the cost matrix just before allocation in
This solution avoids the performance overhead of spawning additional processes while still preventing the crashes. Note that the command that triggered the crash still takes a while to run and uses around 10GB of memory. As you noted, integrating this into What do you think about this approach, particularly:
|
cd92fde
to
e308b55
Compare
e308b55
to
dc9c6a6
Compare
There are issues in commit dc9c6a6: |
Update: 4GB was too much for 32bit systems. Made the limit 2GB in those cases. |
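One way to pick such a platform-dependent default is sketched below. This is an assumption about how it could be done, not necessarily what the patch does. The names `GIGABYTE` and `default_max_memory` are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GIGABYTE ((size_t)1 << 30)

/* Illustrative only: 4GB where size_t is wider than 32 bits, else 2GB. */
static size_t default_max_memory(void)
{
	size_t limit;

#if SIZE_MAX > UINT32_MAX
	limit = 4 * GIGABYTE;
#else
	limit = 2 * GIGABYTE; /* 4GB would wrap to 0 in a 32-bit size_t */
#endif
	/* Guard against the multiplication having wrapped to zero. */
	assert(limit != 0);
	return limit;
}
```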
dc9c6a6
to
f6a1c6d
Compare
f6a1c6d
to
90d0059
Compare
I like your approach! @pcasaretto please note that I am not a gate keeper here. The Git project does not accept code reviews in PRs, it requires the code review to happen on the list. In other words: Please If you'd like, I invite you to add an "Acked-by: Johannes Schindelin [email protected]" to the commit message footer (right before your "Signed-off-by:" line) and refer to this here comment in the "cover letter", i.e. in the PR description which will be sent as part of the email to the Git mailing list. |
90d0059
to
5cf3e89
Compare
/submit |
Submitted as [email protected] To fetch this version into
To fetch this version to local tag
|
Error: 5cf3e89 was already submitted |
On the Git mailing list, Junio C Hamano wrote (reply to this): "Paulo Casaretto via GitGitGadget" <[email protected]> writes:
> From: pcasaretto <[email protected]>
<administrivia>
It is usual to see a less human readable name embedded in the commit
object than the mail header when a mail comes from GGG.
Just in case you want to be known to this community as "Paulo
Casaretto", not "pcasaretto", I thought I'd point it out that you
may want to redo the commit. I do not mind what name you like to
use, as long as it is identifiable, and From: identity matches the
identity you add your Signed-off-by: with.
</administrivia>
> Acked-by: Johannes Schindelin [email protected]
It is unusual to lack <> around e-mail address here.
> Signed-off-by: pcasaretto <[email protected]>
> ---
> range-diff: add configurable memory limit for cost matrix
> +static int parse_max_memory(const struct option *opt, const char *arg, int unset)
> +{
> + size_t *max_memory = opt->value;
> + uintmax_t val;
> +
> + if (unset) {
> + return 0;
> + }
No unnecessary {braces} around a single statement, please.
> + if (!git_parse_unsigned(arg, &val, SIZE_MAX))
> + return error(_("invalid max-memory value: %s"), arg);
> +
> + *max_memory = (size_t)val;
> + return 0;
> +}
> @@ -33,17 +51,21 @@ int cmd_range_diff(int argc,
> OPT_INTEGER(0, "creation-factor",
> &range_diff_opts.creation_factor,
> N_("percentage by which creation is weighted")),
> + OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> + N_("style"), N_("passed to 'git log'"), 0),
> + OPT_BOOL(0, "left-only", &left_only,
> + N_("only emit output related to the first range")),
> + OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
> + N_("size"),
> + N_("maximum memory for cost matrix (default 4G)"),
> + parse_max_memory),
> OPT_BOOL(0, "no-dual-color", &simple_color,
> N_("use simple diff colors")),
> OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
> N_("notes"), N_("passed to 'git log'"),
> PARSE_OPT_OPTARG),
> - OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> - N_("style"), N_("passed to 'git log'"), 0),
> OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
> N_("passed to 'git log'"), PARSE_OPT_NOARG),
> - OPT_BOOL(0, "left-only", &left_only,
> - N_("only emit output related to the first range")),
> OPT_BOOL(0, "right-only", &right_only,
> N_("only emit output related to the second range")),
> OPT_END()
This seems to mix unrelated changes. Please don't.
Or if the reordering of options do have a reason to exist in _this_
commit, please justify it in your proposed log message. Even if
there were a good reason for reordering existing options, I strongly
suspect that the change would want to be done in a separate,
preparatory-clean-up commit (i.e., making this topic a two-patch
series), because it has nothing to do with preventing inefficient
cost matrix computation from consuming too much memory, which _is_
the theme of this commit.
> diff --git a/range-diff.c b/range-diff.c
> index 8a2dcbee322..6e9b6b115e5 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -21,6 +21,7 @@
> #include "apply.h"
> #include "revision.h"
>
> +
Unrelated, unexplained, and unnecessary change snuck in? Please
proof-read the patch yourself before sending.
> @@ -287,8 +288,8 @@ static void find_exact_matches(struct string_list *a, struct string_list *b)
> }
>
> static int diffsize_consume(void *data,
> - char *line UNUSED,
> - unsigned long len UNUSED)
> + char *line UNUSED,
> + unsigned long len UNUSED)
What is this change about???
> static void get_correspondences(struct string_list *a, struct string_list *b,
> - int creation_factor)
> + int creation_factor, size_t max_memory)
> {
> int n = a->nr + b->nr;
> int *cost, c, *a2b, *b2a;
> int i, j;
> -
> - ALLOC_ARRAY(cost, st_mult(n, n));
> + size_t cost_size = st_mult(n, n);
> + size_t cost_bytes = st_mult(sizeof(int), cost_size);
> + if (cost_bytes >= max_memory) {
> + struct strbuf cost_str = STRBUF_INIT;
> + struct strbuf max_str = STRBUF_INIT;
> + strbuf_humanise_bytes(&cost_str, cost_bytes);
> + strbuf_humanise_bytes(&max_str, max_memory);
> + die(_("range-diff: unable to compute the range-diff, since it "
> + "exceeds the maximum memory for the cost matrix: %s "
> + "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
> + cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
> + }
> + ALLOC_ARRAY(cost, cost_size);
Nicely done.
> @@ -351,7 +363,8 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
> }
>
> c = a_util->matching < 0 ?
> - a_util->diffsize * creation_factor / 100 : COST_MAX;
> + a_util->diffsize * creation_factor / 100 :
> + COST_MAX;
> for (j = b->nr; j < n; j++)
> cost[i + n * j] = c;
> }
There seem to be other unrelated indentation-only changes
mixed in with the changes to this file, not just this one.
As a style fix,
	c = a_util->matching < 0
		? a_util->diffsize * creation_factor / 100
		: COST_MAX;
would be easier to follow and read, but please do not do such a
cosmetic clean-up in the same patch. Do them in a separate
preliminary clean-up patch before the "real work".
> @@ -591,7 +605,8 @@ int show_range_diff(const char *range1, const char *range2,
> if (!res) {
> find_exact_matches(&branch1, &branch2);
> get_correspondences(&branch1, &branch2,
> - range_diff_opts->creation_factor);
> + range_diff_opts->creation_factor,
> + range_diff_opts->max_memory);
> output(&branch1, &branch2, range_diff_opts);
> }
OK. |
5cf3e89
to
c81f920
Compare
/preview |
Preview email sent as [email protected] |
Preview email sent as [email protected] |
/submit |
Submitted as [email protected] To fetch this version into
To fetch this version to local tag
|
builtin/range-diff.c
Outdated
@@ -33,17 +33,17 @@ int cmd_range_diff(int argc, | |||
OPT_INTEGER(0, "creation-factor", |
On the Git mailing list, Junio C Hamano wrote (reply to this):
"pcasaretto via GitGitGadget" <[email protected]> writes:
> From: pcasaretto <[email protected]>
>
> Reorder the command-line options in builtin/range-diff.c to be in
> lexicographic order for better organization and readability. This is
> a preparatory cleanup with no functional changes.
>
> Signed-off-by: Paulo Casaretto <[email protected]>
> ---
> builtin/range-diff.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
Thanks for splitting this out into its own commit.
I am not sure if "lexicographic order" fits well in the context of
"git cmd -h" that spews out many many options, shown with related
options together in groups. I find it aggressively annoying to show
left/right-only far apart. A user unfamiliar with the command would
look at the list, find "left-only" sitting in the list alone, and
waste time and break concentration wondering what in the first range
is so special to deserve such an option, until they see "right-only"
further down to realize that they are symmetric.
I'd rather not to see this "lexicographic" change done, but others
may have better justification (note: "for better organization and
readability" I just disagreed is a good justification) that may make
me change my mind.
What I would change, if there is something suboptimal in the current
output from "git range-diff -h" that deserves improvement, is the
lack of the grouping header before the options for range-diff
operation (i.e. creation-factor to left/right-only, before the next
"diff output" group begins).
Thanks.
> diff --git a/builtin/range-diff.c b/builtin/range-diff.c
> index a563abff5fee..283583a80d0b 100644
> --- a/builtin/range-diff.c
> +++ b/builtin/range-diff.c
> @@ -33,17 +33,17 @@ int cmd_range_diff(int argc,
> OPT_INTEGER(0, "creation-factor",
> &range_diff_opts.creation_factor,
> N_("percentage by which creation is weighted")),
> + OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> + N_("style"), N_("passed to 'git log'"), 0),
> + OPT_BOOL(0, "left-only", &left_only,
> + N_("only emit output related to the first range")),
> OPT_BOOL(0, "no-dual-color", &simple_color,
> N_("use simple diff colors")),
> OPT_PASSTHRU_ARGV(0, "notes", &other_arg,
> N_("notes"), N_("passed to 'git log'"),
> PARSE_OPT_OPTARG),
> - OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> - N_("style"), N_("passed to 'git log'"), 0),
> OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
> N_("passed to 'git log'"), PARSE_OPT_NOARG),
> - OPT_BOOL(0, "left-only", &left_only,
> - N_("only emit output related to the first range")),
> OPT_BOOL(0, "right-only", &right_only,
> N_("only emit output related to the second range")),
> OPT_END()
On the Git mailing list, Elijah Newren wrote (reply to this):
On Thu, Aug 28, 2025 at 8:24 AM Junio C Hamano <[email protected]> wrote:
>
> "pcasaretto via GitGitGadget" <[email protected]> writes:
>
> > From: pcasaretto <[email protected]>
> > Signed-off-by: Paulo Casaretto <[email protected]>
Same issue with name here.
> I am not sure if "lexicographic order" fits well in the context of
> "git cmd -h" that spews out many many options, shown with related
> options together in groups. I find it aggressively annoying to show
> left/right-only far apart. A user unfamiliar with the command would
> look at the list, find "left-only" sitting in the list alone, and
> waste time and break concentration wondering what in the first range
> is so special to deserve such an option, until they see "right-only"
> further down to realize that they are symmetric.
>
> I'd rather not to see this "lexicographic" change done, but others
> may have better justification (note: "for better organization and
> readability" I just disagreed is a good justification) that may make
> me change my mind.
>
> What I would change, if there is something suboptimal in the current
> output from "git range-diff -h" that deserves improvement, is the
> lack of the grouping header before the options for range-diff
> operation (i.e. creation-factor to left/right-only, before the next
> "diff output" group begins).
>
> Thanks.
I do like lexicographic ordering for unrelated options, but I prefer
options to be grouped by intent/use first, then by lexicographic
ordering. And here, not only are --left-only & --right-only related as
Junio points out, to me --diff-merges and --remerge-diff are a similar
grouping that belong together. So, my $0.02 is that I'd lean towards
calling both changes in the patch a reduction in organization rather
than an improvement.
On the Git mailing list, Paulo L F Casaretto wrote (reply to this):
Yes, I concur. I noticed these were "out of order" when I added the
new flag but now it's obvious that there was order. I'll remove this
commit.
Regarding the name problem, I've checked and I do have "Paulo
Casaretto" set as my name in my Github public profile.
I fixed my local git config and apparently that fixed it.
On Thu, Aug 28, 2025 at 7:12 PM Elijah Newren <[email protected]> wrote:
>
> On Thu, Aug 28, 2025 at 8:24 AM Junio C Hamano <[email protected]> wrote:
> >
> > "pcasaretto via GitGitGadget" <[email protected]> writes:
> >
> > > From: pcasaretto <[email protected]>
> > > Signed-off-by: Paulo Casaretto <[email protected]>
>
> Same issue with name here.
>
> > I am not sure if "lexicographic order" fits well in the context of
> > "git cmd -h" that spews out many many options, shown with related
> > options together in groups. I find it aggressively annoying to show
> > left/right-only far apart. A user unfamiliar with the command would
> > look at the list, find "left-only" sitting in the list alone, and
> > waste time and break concentration wondering what in the first range
> > is so special to deserve such an option, until they see "right-only"
> > further down to realize that they are symmetric.
> >
> > I'd rather not to see this "lexicographic" change done, but others
> > may have better justification (note: "for better organization and
> > readability" I just disagreed is a good justification) that may make
> > me change my mind.
> >
> > What I would change, if there is something suboptimal in the current
> > output from "git range-diff -h" that deserves improvement, is the
> > lack of the grouping header before the options for range-diff
> > operation (i.e. creation-factor to left/right-only, before the next
> > "diff output" group begins).
> >
> > Thanks.
>
> I do like lexicographic ordering for unrelated options, but I prefer
> options to be grouped by intent/use first, then by lexicographic
> ordering. And here, not only are --left-only & --right-only related as
> Junio points out, to me --diff-merges and --remerge-diff are a similar
> grouping that belong together. So, my $0.02 is that I'd lean towards
> calling both changes in the patch a reduction in organization rather
> than an improvement.
--
Paulo L F Casaretto
On the Git mailing list, Junio C Hamano wrote (reply to this):
Paulo L F Casaretto <[email protected]> writes:
> Yes, I concur. I noticed these were "out of order" when I added the
> new flag but now it's obvious that there was order. I'll remove this
> commit.
> Regarding the name problem, I've checked and I do have "Paulo
> Casaretto" set as my name in my Github public profile.
> I fixed my local git config and apparently that fixed it.
Yeah, these in-body From: lines GitGitGadget adds come from the
authorship of the commits you are sending (in other words, what you
see in "git cat-file commit <commit>" for these commits), and your
GitHub profile would not affect it (and you do not want your GitHub
profile name to be used---otherwise you cannot send a series that
contains a change written by somebody else without overtaking the
authorship of their commits).
I see v3 posted there; thanks.
@@ -1404,6 +1404,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file, | |||
struct range_diff_options range_diff_opts = { |
On the Git mailing list, Elijah Newren wrote (reply to this):
On Thu, Aug 28, 2025 at 2:00 AM pcasaretto via GitGitGadget
<[email protected]> wrote:
>
> From: pcasaretto <[email protected]>
> Signed-off-by: Paulo Casaretto <[email protected]>
The names (and emails) in these should match; I believe the name in
the From field is set by GitGitGadget based on your profile settings;
see https://github.com/settings/profile and set your name there.
> static void get_correspondences(struct string_list *a, struct string_list *b,
> - int creation_factor)
> + int creation_factor, size_t max_memory)
> {
> int n = a->nr + b->nr;
> int *cost, c, *a2b, *b2a;
> int i, j;
> -
> - ALLOC_ARRAY(cost, st_mult(n, n));
> + size_t cost_size = st_mult(n, n);
> + size_t cost_bytes = st_mult(sizeof(int), cost_size);
> + if (cost_bytes >= max_memory) {
> + struct strbuf cost_str = STRBUF_INIT;
> + struct strbuf max_str = STRBUF_INIT;
> + strbuf_humanise_bytes(&cost_str, cost_bytes);
> + strbuf_humanise_bytes(&max_str, max_memory);
> + die(_("range-diff: unable to compute the range-diff, since it "
> + "exceeds the maximum memory for the cost matrix: %s "
> + "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
available? I'm worried the error message will result in users
checking system memory, claiming they have 14GB available on their
system, and then reporting a "bug".
Perhaps something like:
+ "(%"PRIuMAX" bytes) needed, limited to %s
(%"PRIuMAX" bytes)"),
?
The rest of the patch looks good to me.
On the Git mailing list, Junio C Hamano wrote (reply to this):
Elijah Newren <[email protected]> writes:
> <[email protected]> wrote:
>>
>> From: pcasaretto <[email protected]>
>> Signed-off-by: Paulo Casaretto <[email protected]>
>
> The names (and emails) in these should match; I believe the name in
> the From field is set by Gitgitgadget based on your profile settings;
> see https://github.com/settings/profile and set your name there.
>
>> static void get_correspondences(struct string_list *a, struct string_list *b,
>> - int creation_factor)
>> + int creation_factor, size_t max_memory)
>> {
>> int n = a->nr + b->nr;
>> int *cost, c, *a2b, *b2a;
>> int i, j;
>> -
>> - ALLOC_ARRAY(cost, st_mult(n, n));
>> + size_t cost_size = st_mult(n, n);
>> + size_t cost_bytes = st_mult(sizeof(int), cost_size);
>> + if (cost_bytes >= max_memory) {
>> + struct strbuf cost_str = STRBUF_INIT;
>> + struct strbuf max_str = STRBUF_INIT;
>> + strbuf_humanise_bytes(&cost_str, cost_bytes);
>> + strbuf_humanise_bytes(&max_str, max_memory);
>> + die(_("range-diff: unable to compute the range-diff, since it "
>> + "exceeds the maximum memory for the cost matrix: %s "
>> + "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
>
> available? I'm worried the error message will result in users
> checking system memory, claiming they have 14GB available on their
> system, and then reporting a "bug".
>
> Perhaps something like:
>
> + "(%"PRIuMAX" bytes) needed, limited to %s
> (%"PRIuMAX" bytes)"),
Sounds like a good idea.
I am not a huge fan of configuration variables that do not have a
command line option. Assuming that it is not like you'd be doing
overly huge range-diff that would not fit your memory every day,
shouldn't we start this with a command line option without a
configuration variable to gauge how useful it would be for users
with such a need, and then after it proves useful and we identify a
workflow where a user would be passing this option all the time, add
a configuration to allow it always be in effect (with command line
override still available)?
On the Git mailing list, Elijah Newren wrote (reply to this):
On Thu, Aug 28, 2025 at 2:22 PM Junio C Hamano <[email protected]> wrote:
>
> I am not a huge fan of configuration variables that do not have a
> command line option. Assuming that it is not like you'd be doing
> overly huge range-diff that would not fit your memory every day,
> shouldn't we start this with a command line option without a
> configuration variable to gauge how useful it would be for users
> with such a need, and then after it proves useful and we identify a
> workflow where a user would be passing this option all the time, add
> a configuration to allow it always be in effect (with command line
> override still available)?
Isn't that what Paulo's patch does? Maybe I'm just blind, but I've
looked over the patch a couple times and don't see where he's reading
from a configuration variable; am I just missing it?
On the Git mailing list, Junio C Hamano wrote (reply to this):
Elijah Newren <[email protected]> writes:
> On Thu, Aug 28, 2025 at 2:22 PM Junio C Hamano <[email protected]> wrote:
>>
>> I am not a huge fan of configuration variables that do not have a
>> command line option. Assuming that it is not like you'd be doing
>> overly huge range-diff that would not fit your memory every day,
>> shouldn't we start this with a command line option without a
>> configuration variable to gauge how useful it would be for users
>> with such a need, and then after it proves useful and we identify a
>> workflow where a user would be passing this option all the time, add
>> a configuration to allow it always be in effect (with command line
>> override still available)?
>
> Isn't that what Paulo's patch does? Maybe I'm just blind, but I've
> looked over the patch a couple times and don't see where he's reading
> from a configuration variable; am I just missing it?
Ah, I just blindly trusted that the "configurable memory limit" on
the subject line is talking about configuring memory limit with some
mechanism. Thanks for correcting me.
User |
This patch series was integrated into seen via git@f007d0b. |
For the record: I did figure it out. For technical reasons (namely, to accommodate for the Principle of Minimal Permissions), GitGitGadget has two GitHub Apps: |
c81f920
to
6d7ff43
Compare
/preview |
Preview email sent as [email protected] |
/submit |
Submitted as [email protected] To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Elijah Newren wrote (reply to this): On Fri, Aug 29, 2025 at 4:00 AM Paulo Casaretto via GitGitGadget
<[email protected]> wrote:
> -
> - ALLOC_ARRAY(cost, st_mult(n, n));
> + size_t cost_size = st_mult(n, n);
> + size_t cost_bytes = st_mult(sizeof(int), cost_size);
> + if (cost_bytes >= max_memory) {
> + struct strbuf cost_str = STRBUF_INIT;
> + struct strbuf max_str = STRBUF_INIT;
> + strbuf_humanise_bytes(&cost_str, cost_bytes);
> + strbuf_humanise_bytes(&max_str, max_memory);
> + die(_("range-diff: unable to compute the range-diff, since it "
> + "exceeds the maximum memory for the cost matrix: %s "
> + "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
> + cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
> + }
> + ALLOC_ARRAY(cost, cost_size);
> ALLOC_ARRAY(a2b, n);
> ALLOC_ARRAY(b2a, n);
>
This still has the same wording issue that I commented on in v2:
https://lore.kernel.org/git/CABPp-BEDje5dYZHEyYMN6j_LdR5CqRN1cxc0riRK06qK-OxiTA@mail.gmail.com/ |
On the Git mailing list, Junio C Hamano wrote (reply to this): "Paulo Casaretto via GitGitGadget" <[email protected]> writes:
> From: Paulo Casaretto <[email protected]>
>
> When comparing large commit ranges (e.g., 250,000+ commits), range-diff
> attempts to allocate an n×n cost matrix that can exhaust available
> memory. For example, with 256,784 commits (n = 513,568), the matrix
> would require approximately 256GB of memory (513,568² × 4 bytes),
> causing either immediate segmentation faults due to integer overflow or
> system hangs.
>
> Add a memory limit check in get_correspondences() before allocating the
> cost matrix. This check uses the total size in bytes (n² × sizeof(int))
> and compares it against a configurable maximum, preventing both
> excessive memory usage and integer overflow issues.
>
> The limit is configurable via a new --max-memory option that accepts
> human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
> systems and 2GB for 32 bit systems. This allows comparing ranges of
> approximately 32,000 (16,000) commits - generous for real-world use cases
> while preventing impractical operations.
>
> When the limit is exceeded, range-diff now displays a clear error
> message showing both the requested memory size and the maximum allowed,
> formatted in human-readable units for better user experience.
>
> Example usage:
> git range-diff --max-memory=1G branch1...branch2
> git range-diff --max-memory=500M base..topic1 base..topic2
>
> This approach was chosen over alternatives:
> - Pre-counting commits: Would require spawning additional git processes
> and reading all commits twice
> - Limiting by commit count: Less precise than actual memory usage
> - Streaming approach: Would require significant refactoring of the
> current algorithm
>
> This issue was previously discussed in:
> https://lore.kernel.org/git/[email protected]/
>
> Acked-by: Johannes Schindelin <[email protected]>
> Signed-off-by: Paulo Casaretto <[email protected]>
> ---
Looks good, especially without the reordering existing entries in
the options list. The authorship information above looks much
better, too.
> @@ -40,6 +57,10 @@ int cmd_range_diff(int argc,
> PARSE_OPT_OPTARG),
> OPT_PASSTHRU_ARGV(0, "diff-merges", &diff_merges_arg,
> N_("style"), N_("passed to 'git log'"), 0),
> + OPT_CALLBACK(0, "max-memory", &range_diff_opts.max_memory,
> + N_("size"),
> + N_("maximum memory for cost matrix (default 4G)"),
> + parse_max_memory),
> OPT_PASSTHRU_ARGV(0, "remerge-diff", &diff_merges_arg, NULL,
> N_("passed to 'git log'"), PARSE_OPT_NOARG),
> OPT_BOOL(0, "left-only", &left_only,
Among existing options (an excerpt from "git range-diff -h")
--[no-]creation-factor <n>
percentage by which creation is weighted
This controls how correspondence between commits on old and new
branches are computed.
--no-dual-color use simple diff colors
--dual-color opposite of --no-dual-color
These control how the findings are shown, by painting the lines
in distinct colors.
--[no-]notes[=<notes>]
passed to 'git log'
--[no-]diff-merges <style>
passed to 'git log'
--[no-]remerge-diff passed to 'git log'
These control what text is used to represent each commit and
participate in comparison and display.
--[no-]left-only only emit output related to the first range
--[no-]right-only only emit output related to the second range
These again control how the findings are shown, by omitting some
commits from the output.
So there is no perfectly logical place to put the new option, but
between diff-merges and remerge-diff feels like a somewhat odder
choice than other possible places.
Will queue as is. If some users find the location in the "-h"
output too odd and disturbing, they can later send in a reordering
patch on top, but I would think the chosen location is good enough.
As #leftoverbits we might want to
* Group range-diff specific options with OPT_GROUP()
* Instead of having to match the full NxN matrix, perhaps reduce
the matrix by keeping the most promising M (which is much smaller
than N) for each N, or something?
but that (especially the latter) is totally outside the scope of
this patch.
Thanks.
|
6d7ff43
to
203113e
Compare
/submit |
Submitted as [email protected] To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Junio C Hamano wrote (reply to this): Elijah Newren <[email protected]> writes:
> On Fri, Aug 29, 2025 at 4:00 AM Paulo Casaretto via GitGitGadget
> <[email protected]> wrote:
>> -
>> - ALLOC_ARRAY(cost, st_mult(n, n));
>> + size_t cost_size = st_mult(n, n);
>> + size_t cost_bytes = st_mult(sizeof(int), cost_size);
>> + if (cost_bytes >= max_memory) {
>> + struct strbuf cost_str = STRBUF_INIT;
>> + struct strbuf max_str = STRBUF_INIT;
>> + strbuf_humanise_bytes(&cost_str, cost_bytes);
>> + strbuf_humanise_bytes(&max_str, max_memory);
>> + die(_("range-diff: unable to compute the range-diff, since it "
>> + "exceeds the maximum memory for the cost matrix: %s "
>> + "(%"PRIuMAX" bytes) needed, %s (%"PRIuMAX" bytes) available"),
>> + cost_str.buf, (uintmax_t)cost_bytes, max_str.buf, (uintmax_t)max_memory);
>> + }
>> + ALLOC_ARRAY(cost, cost_size);
>> ALLOC_ARRAY(a2b, n);
>> ALLOC_ARRAY(b2a, n);
>>
>
> This still has the same wording issue that I commented on in v2:
> https://lore.kernel.org/git/CABPp-BEDje5dYZHEyYMN6j_LdR5CqRN1cxc0riRK06qK-OxiTA@mail.gmail.com/
Right. I overlooked it, sorry. |
This branch is now known as |
There was a status update in the "New Topics" section about the branch "git range-diff" learned a way to limit the memory consumed by O(N*N) cost matrix. Will merge to 'next'? source: <[email protected]> |
There was a status update in the "Cooking" section about the branch "git range-diff" learned a way to limit the memory consumed by O(N*N) cost matrix. Will merge to 'next'? source: <[email protected]> |
This patch series was integrated into seen via git@5410dce. |
This patch series was integrated into seen via git@1674023. |
Problem Description
When git range-diff is given extremely large ranges, it can result in either an immediate segmentation fault due to integer overflow or a system hang.
Reproduction Case
In Shopify's large monorepo, a range-diff command like this crashes after several minutes with a SIGBUS error:
Range statistics:
Stack Trace (Segmentation Fault)
Root Cause Analysis
The crash occurs in get_correspondences() at line 356:
Problems:
n=256,784 fits in an int, n*n overflows
Solution
Add a memory limit check in get_correspondences() before allocating the
cost matrix. This check uses the total size in bytes (n² × sizeof(int))
and compares it against a configurable maximum, preventing both
excessive memory usage and integer overflow issues.
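The check described above can be sketched in isolation as follows. This is a standalone illustration, not the patch itself: `checked_mult` is a hypothetical stand-in for git's `st_mult()` (which dies on overflow rather than returning an error), and `cost_matrix_fits` is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a * b in *out; return -1 instead if the product would overflow. */
static int checked_mult(size_t a, size_t b, size_t *out)
{
	assert(out != NULL);
	if (a && b > SIZE_MAX / a)
		return -1;
	*out = a * b;
	return 0;
}

/*
 * Return 1 if an n x n matrix of int stays strictly below max_memory
 * bytes, 0 otherwise (including when the size computation overflows).
 */
static int cost_matrix_fits(size_t n, size_t max_memory)
{
	size_t cells, bytes;

	if (checked_mult(n, n, &cells) ||
	    checked_mult(sizeof(int), cells, &bytes))
		return 0;
	return bytes < max_memory;
}
```

Doing the multiplication in `size_t` with an explicit overflow test is what separates "too large" from "wrapped around to a small number", which is the failure mode named under Problems.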
The limit is configurable via a new --max-memory option that accepts
human-readable sizes (e.g., "1G", "500M"). The default is 4GB for 64 bit
systems and 2GB for 32 bit systems. This allows comparing ranges of
approximately 32,000 (16,000) commits - generous for real-world use cases
while preventing impractical operations.
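To see how a byte limit translates into a matrix dimension, the sketch below computes the largest n for which n × n cells still pass the check. It assumes 4-byte cells when called with `cell_size` 4 and that the check rejects sizes greater than or equal to the limit. `max_matrix_dim` is a hypothetical name. Note that n here is the combined number of commits in both ranges, since the matrix is sized from `a->nr + b->nr`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest n such that an n x n matrix of `cell_size`-byte cells stays
 * strictly below `max_memory` bytes.
 */
static uint64_t max_matrix_dim(uint64_t max_memory, uint64_t cell_size)
{
	uint64_t budget, lo = 0, hi = UINT32_MAX;

	assert(cell_size > 0);
	if (!max_memory)
		return 0;
	budget = (max_memory - 1) / cell_size; /* n * n must be <= budget */
	while (lo < hi) {
		uint64_t mid = lo + (hi - lo + 1) / 2;

		if (mid * mid <= budget) /* mid < 2^32, so no overflow */
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}
```

Under these assumptions, `max_matrix_dim(4ULL << 30, 4)` gives 32767 and `max_matrix_dim(2ULL << 30, 4)` gives 23170.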
When the limit is exceeded, range-diff now displays a clear error
message showing both the requested memory size and the maximum allowed,
formatted in human-readable units for better user experience.
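A rough idea of how such a message could be assembled outside of git is shown below. `humanise_bytes` here is a hypothetical stand-in for git's `strbuf_humanise_bytes()`, and its exact output format is an assumption. The message text is illustrative and uses the "limited to" wording suggested in review rather than the patch's literal string.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatter: binary units, truncated to two decimals. */
static void humanise_bytes(char *buf, size_t buflen, uintmax_t bytes)
{
	static const struct { unsigned shift; const char *unit; } units[] = {
		{ 30, "GiB" }, { 20, "MiB" }, { 10, "KiB" },
	};
	size_t i;

	assert(buf && buflen);
	for (i = 0; i < sizeof(units) / sizeof(units[0]); i++) {
		uintmax_t one = (uintmax_t)1 << units[i].shift;

		if (bytes >= one) {
			uintmax_t frac = ((bytes & (one - 1)) * 100) >> units[i].shift;

			snprintf(buf, buflen, "%" PRIuMAX ".%02" PRIuMAX " %s",
				 bytes >> units[i].shift, frac, units[i].unit);
			return;
		}
	}
	snprintf(buf, buflen, "%" PRIuMAX " bytes", bytes);
}

/* Build an illustrative error message with both sizes in two forms. */
static void format_limit_error(char *out, size_t outlen,
			       uintmax_t needed, uintmax_t limit)
{
	char need_str[32], limit_str[32];

	humanise_bytes(need_str, sizeof(need_str), needed);
	humanise_bytes(limit_str, sizeof(limit_str), limit);
	snprintf(out, outlen,
		 "cost matrix needs %s (%" PRIuMAX " bytes), "
		 "limited to %s (%" PRIuMAX " bytes)",
		 need_str, needed, limit_str, limit);
}
```

Printing the exact byte count next to the rounded figure keeps the message unambiguous when two sizes round to the same human-readable value.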
Example usage:
git range-diff --max-memory=1G branch1...branch2
git range-diff --max-memory=500M base..topic1 base..topic2
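For illustration, values such as "1G" or "500M" could be parsed along these lines. This is a standalone sketch with a hypothetical `parse_size` function; the patch itself relies on git's `git_parse_unsigned()`. The sketch assumes case-insensitive k/m/g suffixes meaning powers of 1024.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <inttypes.h>

/*
 * Parse a decimal number with an optional k/m/g suffix into *out.
 * Returns 0 on success, -1 on malformed input or a value above `max`.
 */
static int parse_size(const char *arg, uintmax_t max, uintmax_t *out)
{
	char *end;
	uintmax_t val, factor = 1;

	assert(arg && out);
	if (!isdigit((unsigned char)*arg))
		return -1; /* rejects empty strings, signs and whitespace */
	errno = 0;
	val = strtoumax(arg, &end, 10);
	if (errno == ERANGE)
		return -1;
	switch (tolower((unsigned char)*end)) {
	case 'k': factor = 1024; end++; break;
	case 'm': factor = 1024 * 1024; end++; break;
	case 'g': factor = 1024 * 1024 * 1024; end++; break;
	case '\0': break;
	default: return -1;
	}
	if (*end)
		return -1; /* trailing garbage after the suffix */
	if (val > max / factor)
		return -1; /* scaled value would exceed max */
	*out = val * factor;
	return 0;
}
```

Passing the largest acceptable value as `max` lets the caller reject inputs that would not fit the destination type before any multiplication happens.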
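This approach was chosen over alternatives:
- Pre-counting commits: Would require spawning additional git processes
  and reading all commits twice
- Limiting by commit count: Less precise than actual memory usage
- Streaming approach: Would require significant refactoring of the
  current algorithm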
This issue was previously discussed in:
https://lore.kernel.org/git/[email protected]/
[Acked-by: Johannes Schindelin [email protected]](#1958 (comment))
cc: Elijah Newren [email protected]