Skip to content

Conversation

eerhardt
Copy link
Member

@eerhardt eerhardt commented Feb 7, 2024

Since GetResultsInFullPath uses a struct (FilePatternMatch) in a Linq call, this is causing extra disk space on native AOT'd applications because of struct specialization. Remove the Linq call and instead use a simple loop to create the result.

Using the following program:

using Microsoft.Extensions.FileSystemGlobbing;

var matcher = new Matcher();
matcher.AddIncludePatterns(["*.txt", "*.asciidoc", "*.md"]);

string searchDirectory = @"C:\temp";
foreach (string file in matcher.GetResultsInFullPath(searchDirectory))
{
    Console.WriteLine(file);
}

win-x64 size on disk:

  • Before: 1.59 MB (1,675,776 bytes)
  • After: 1.56 MB (1,642,496 bytes)

The remaining System.Linq usages in the library don't seem beneficial to remove:

Since GetResultsInFullPath uses a struct (FilePatternMatch) in a Linq call, this is causing extra disk space on native AOT'd applications because of struct specialization. Remove the Linq call and instead use a simple loop to create the result.
@eerhardt eerhardt requested a review from stephentoub February 7, 2024 16:25
@ghost ghost assigned eerhardt Feb 7, 2024
@ghost
Copy link

ghost commented Feb 7, 2024

Tagging subscribers to this area: @dotnet/area-extensions-filesystem
See info in area-owners.md if you want to be subscribed.

Issue Details

Since GetResultsInFullPath uses a struct (FilePatternMatch) in a Linq call, this is causing extra disk space on native AOT'd applications because of struct specialization. Remove the Linq call and instead use a simple loop to create the result.

Using the following program:

using Microsoft.Extensions.FileSystemGlobbing;

var matcher = new Matcher();
matcher.AddIncludePatterns(["*.txt", "*.asciidoc", "*.md"]);

string searchDirectory = @"C:\temp";
foreach (string file in matcher.GetResultsInFullPath(searchDirectory))
{
    Console.WriteLine(file);
}

win-x64 size on disk:

  • Before: 1.59 MB (1,675,776 bytes)
  • After: 1.56 MB (1,642,496 bytes)

The remaining System.Linq usages in the library don't seem beneficial to remove:

Author: eerhardt
Assignees: -
Labels:

area-Extensions-FileSystem

Milestone: -

Comment on lines +60 to 66
IEnumerable<FilePatternMatch> matches = patternMatchingResult.Files;
List<string> result = matches is ICollection matchCollection ? new(matchCollection.Count) : new();
foreach (FilePatternMatch match in matches)
{
result.Add(Path.GetFullPath(Path.Combine(directoryPath, match.Path)));
}
return result;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% confident this is better. But it does reduce the size on disk. I'm assuming it also reduces JIT cost and in-memory code size as well since we no longer need to specialize the Linq code for this struct.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the ToArray/ToList needed at all? Are we expecting consumers to iterate it multiple times?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the ToArray/ToList needed at all?

I was trying to keep the behavior as close to the existing as possible. I assume you are suggesting possibly just using a yield return enumerator instead?

Are we expecting consumers to iterate it multiple times?

I don't think multiple times, but a common pattern I see (a little less than 50% from doing a few sample checks) is to call .ToList() or .ToImmutableArray() on the result of this method call. Having the result of this method return something that implements ICollection/Count seems to be beneficial in those cases.

examples:

https://github.com/dotnet/format/blob/e07ea4767a72b0f81457599e92d9a89ae9cf18bd/src/Workspaces/FolderWorkspace_FolderSolutionLoader.cs#L54

https://github.com/morganstanley/dotnet-please/blob/d9c78451823fc2109a7b27664748b3159d35d64a/DotNetPlease/Helpers/MSBuildHelper.cs#L131-L144

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you are suggesting possibly just using a yield return enumerator instead?

Or just returning the Select, in which case someone calling ToList on it will pick up Select's implementation of it. I'm not sure which aspect of your change was the focus and whether your explicit goal was to remove the Select.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure which aspect of your change was the focus and whether your explicit goal was to remove the Select.

Yes, it was to remove the Select over a struct (in this case FilePatternMatch). The main goal was to reduce the generic specialization code generated in NAOT.

image

}

IEnumerable<FilePatternMatch> matches = patternMatchingResult.Files;
List<string> result = matches is ICollection matchCollection ? new(matchCollection.Count) : new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matches is ICollection

ICollection<T> doesn't inherit ICollection; many of our collection types implement both, but it's not guaranteed. Is this just a "good enough" thing because we expect most inputs would be e.g. T[] or List<T>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Judging by the source, matches will always be a List<T> here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eerhardt is my understanding correct? If so, to reduce complexity a bit, can we remove the new() fallback and maybe add an "is ICollection" assertion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will merge the change without this since its still valuable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed the original question.

Since this is a public API and Matcher.Execute is virtual, the PatternMatchingResult can have any IEnumerable<FilePatternMatch> Files set on it. It isn't guaranteed to always be a List<T>.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a "good enough" thing because we expect most inputs would be e.g. T[] or List?

Yes, our implementation always returns a List<T>, so we typically expect this to be List<T> in most cases.

@jozkee jozkee merged commit 3725e24 into dotnet:main Apr 19, 2024
@eerhardt eerhardt deleted the ReduceLinqInFileSystemGlobbing branch April 20, 2024 21:55
matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
* Reduce Linq usage in FileSystemGlobbing

Since GetResultsInFullPath uses a struct (FilePatternMatch) in a Linq call, this is causing extra disk space on native AOT'd applications because of struct specialization. Remove the Linq call and instead use a simple loop to create the result.

* Use ToArray instead of ToList to reduce unnecessary allocations.

---------

Co-authored-by: David Cantú <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators May 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants