Skip to content

ignore: solve re.error on group name redefinition in pathspec 0.10.x #8663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions dvc/ignore.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ def __init__(self, pattern_list, dirname, sep):
]

self.ignore_spec = [
(ignore, re.compile("|".join(item[0] for item in group)))
(ignore, [re.compile(item[0]) for item in group])
for ignore, group in groupby(
self.regex_pattern_list, lambda x: x[1]
)
Expand Down Expand Up @@ -107,8 +107,8 @@ def matches(pattern, path, is_dir) -> bool:

result = False

for ignore, pattern in self.ignore_spec[::-1]:
if matches(pattern, path, is_dir):
for ignore, patterns in self.ignore_spec[::-1]:
if any(matches(pattern, path, is_dir) for pattern in patterns):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @hiroto7 I have a problem here, previously we concatenate these reg expressions mainly for the performance. And actually, in the very beginning, we use pathspaces's API to do these checks but it's really slow and the inner implementation is like this kind of nested for loop. Any better methods to solve this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 Oh, I didn't know that the regex concatenation is deliberate, and I have noticed you have previously mentioned a bottleneck in pathspec's API in #3869 (comment).

Although pathspec seems to try to check all regexes even if any pattern matches soonly, the any() function does short-circuit evaluation. So I think my PR never produces unnecessary regex matchings.

Copy link
Contributor

@karajan1001 karajan1001 Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 Oh, I didn't know that the regex concatenation is deliberate, and I have noticed you have previously mentioned a bottleneck in pathspec's API in #3869 (comment).

Although pathspec seems to try to check all regexes even if any pattern matches soonly, the any() function does short-circuit evaluation. So I think my PR never produces unnecessary regex matchings.

Yes, but your algorithm is still O(m) for m reg expressions. even if it would be faster than the pathspec's solution.
While the regex built a state machine inside of it. For example, you have both 1 and 10 patterns, in a long regex if 1 doesn't been matched, we also know 10 doesn't, while in the for loop it will still be checked.

Here is a simple test on my computer, I create 100 patterns for the dvc status benchmark.

$ cat tests/benchmarks/cli/commands/test_status_ignore.py                                                                                               [ins][18:34:34]
def test_status(bench_dvc, tmp_dir, dvc, make_dataset):
    dataset = make_dataset(files=True, dvcfile=True, cache=True)
    ignore_list =  '\n'.join([str(n) for n in range(100)])
    (tmp_dir/".dvcignore").write_text(ignore_list)

    bench_dvc("status")
    bench_dvc("status", name="noop")

    (dataset / "new").write_text("new")
    bench_dvc("status", name="changed")
    bench_dvc("status", name="changed-noop")

And the final benchmark shows that time cost after this PR increased from 280ms to 310ms

image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the benchmark result.

If that's the case, it seems that simply looping regexes one by one, as I did, is not a good idea.
Concatenating regexes is likely to be effective, unfortunately, but I have no idea about any other way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had two suggestions here,

  1. use some kind of hacky way to override the regex pattern created by pathspec. (replace <ps_d> with <ps_d1>', <ps_d2>', ... `<ps_dn>' )
  2. Another more radical choice is to contribute back to pathspec to improve its performance and then directly use the API of it.

result = ignore
break
return result
Expand Down