-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add RandomAccessFile.readMapped(Sync) to dart:io library #49810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Any specific reason you don't want to use FFI for this? With We have previously discussed this with dart2js team and provided the code for achieving this without any changes to the VM, e.g. older version of the code https://gist.github.com/mraleph/d8ffb32cd419e08e15b8fc245dbbeff7. I would probably rewrite it today to use |
Other have tried adding a dependency on FFI into Dart2JS before. The dependency broke some infrastructure and blocked the SDK from building. There wasn't an easy workaround or fix for the issue unfortunately. |
I think the main (the only?) issue was that dart2js appjit snapshot is trained on dart2js itself. One option to resolve the issue is just to use conditional imports, another is to just use AOT instead of appjit |
We already have environments where we use an AOT snapshot for dart2js internally, so that should be quite viable. Also, it may be easy to provide a separate dart2js entrypoint for training data. Our entrypoint is already layered, so it wont require much effort. In particular, in the past we didn't allow importing sdk/pkg/compiler/lib/src/dart2js.dart Line 731 in 85dee97
For unit tests we don't even use the main dart2js entrypoint, but a "memory compiler", that provides an in-memory SourceFileProvider implementation instead, see:
We could apply a similar strategy here to hide the dependency to |
@mraleph @a-siva @rakudrama @brianquinlan What are your thoughts on the general direction to take here. Assuming we can address the blocking issues with using ffi (e.g. we address the bootstrapping issue using a different entrypoint for the appjit training), what is your preference in the design space? Would you rather expose this as a public API (I presume we are not the only ones that may be interested in using mmap), or use ffi and keep the library internal to just dart2js for now? |
My 2¢: The original thread started by @davidmorgan in March 2021 (internal email) wanted to use mapped files for DDC and Analyzer too since that would reduce total memory for the persistent versions of those tools. So this is broader than dart2js (but I am not sure where or how those tools read binary data). We are in a better position with either route than we were last year. I would prefer we add a proper API to provide a read-only memory-mapped view of a file as a If we do go the route of using ffi, we should call the VM's OS-abstraction layer otherwise the memory-mapping won't work on some operating systems. It is not clear that we can call Prompt finalization might be important for reading codegen shards - dart2js reads these large binary files sequentially and drops them so GC can collect them. I don't really want to complicate the dart2js build with a platform-dependent native component to run a copy of the code that already mostly exists in the VM. |
I would prefer adding it to dart:io as this seems like a useful feature that others would want too. We recently did a breaking change to the File Create API and landed it, this breaking change may also be of a similar scale and manageable. |
I'd prefer a route that does not increase
With this API being in With
Do you really need to support this on other operating systems though? The case you are trying to improve is Linux centric. |
True the API to support needs some fine tuning. With regards to loss of flexibility, we would cover a big chunk of potential use case with a well defined API, I am sure there would be some outliers with some very specific cases and those could be handled through a custom FFI implementation. |
The immediate issue I'm trying to solve is Linux centric, yes. But this would be an improvement across all platforms if we could support them with this change. I would suspect DDC and especially the analyzer stand to gain even more from this being supported across all platforms. |
In my opinion any usage of this API requires good understanding of associated issues, tradeoffs and platform specific behaviours, e.g. on POSIX you can On POSIX when you use I feel fairly strongly that we should not include this into the This also goes against our desire to shrink |
Is this something we could reasonably expose through a published package? Are we in a place where we have a reasonable story for publishing the native bits for each platform with a single package? Do we have the infrastructure internally to support this as well? If yes to all of the above then I think it would be reasonable to just do it as a separate package, also less breaking to the ecosystem (not at all) and it would be easier to evolve the api in the future. |
Yes.
The package does not need actually need any native bits. You can write all necessary code in pure Dart. The only native bit the package likely needs is one C-function that encodes native finaliser (if we decide that we want GC driven synchronous finalisation of these mappings): but that can be achieved by embedding the small necessary piece of native code directly into the Dart code (see https://gist.github.com/mraleph/d8ffb32cd419e08e15b8fc245dbbeff7#file-test-dart-L119-L143 which illustrates how it is done for X64, similar approach can be ported to work on ARM/ARM64 if we wish to support those systems). |
…_PostCObject. Cf. 938a2c8 TEST=ci Bug: #49810 Bug: #49825 Change-Id: I8d4a574f12458e88b589d5ee02c68b1f436fb964 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/257925 Reviewed-by: Martin Kustermann <[email protected]> Commit-Queue: Ryan Macnak <[email protected]>
@biggs0125 is this breaking change still in the works and ready for the approval process? |
I've put this on hold for now. It seems likely we won't go with this approach but I'd like to revisit it once we have a bit more time on the Dart2JS team. I've removed the 'breaking-change-request' tag for now. |
After discussion the general consensus was that this functionality would first be added as a package and after the package has considerable usage and adjustments to the API are done we would reevaluate to see if it makes sense to move this functionality into dart:io |
I'v recently come across another use case for memory mapped file support, the new language models such as Llama, Falcon etc. are quite large, a quantized Falcon 40B model file for instance is 24GB in size. These have to be loaded and parsed by your instuction/chat app. I'm currently doing this using the stdlibc pub package. This works fine, however, going forward it may be better to bake this functionality into dart:io rather than depend on pub packages. |
@mraleph: Would it be possible to introduce this feature in a new Dart SDK library? IIRC Java has |
I really don't see any reason why this should not just a be a package. It is more aligned with our vision for the future of Dart then expanding |
For reference, after our discussions and consensus above, we created a internal package for this in the Dart SDK that we now use within dart2js. The API ended up being pretty natural and worked well for our purposes. You can find the current implementation at package:mmap. We didn't publish this in pub because we only did what was needed to meet our needs in dart2js at the moment (for us, that was only linux support, which is the only environment where the multi-stage compilation mode of dart2js is in use). For this to be a well supported package, it needs to be generalized to support for other OSs |
Proposal (draft change)
Add 2 methods,
readMapped
andreadMappedSync
, to the RandomAccessFile interface in the dart:io SDK package. These methods would expose the functionality of memory mapping files into the virtual address space (mmap). They would function similarly to the read/readSync methods but would avoid the allocation of a new Uint8List.For simplicity we would mark the address space as “read protected”. This would be enforced within the Dart world by returning unmodifiable views of the Uint8Lists from the new methods. Here is the proposed
file.dart
API:Goal
Reduce memory usage in Dart2JS running in sequential mode. In sequential mode Dart2JS serializes the results of its first 4 phases (kernel, closed world, global inference, codegen) after computing each of them. Then at the start of each phase, the results for all the previous phases are deserialized and loaded into memory. By the final phase we the serialized bytes for every phase of the compiler stored in memory. For large applications, this is a significant amount of memory stored in these byte arrays.
Justification
In some large applications we see hundreds of MBs saved in every phase of the Dart2JS compiler using this strategy. This peaks at around 700MB of memory saved in the final phase for these large applications. This also unblocks some further optimizations that can be made to preserve memory.
Breaking change overview
This change to the RandomAccessFile interface would be a breaking change as implementations of the interface would need to implement these new methods.
It is unlikely there are many implementers of this interface beyond the
dart:io
package itself. One known implementer ispackage:file
which has 2 places that would need to be updated with these methods.MemoryRandomAccessFile
andForwardingRandomAccessFile
would each have simple implementations of these methods: draft changeNote: The implementations in
package:file
are all marked as@override
with a lint ignore.dart:io
won't have the interface change yet. By landing that change first and setting the min SDK version ofpackage:file
to>= 2.19.0-dev
we can avoid ever actually breakingpackage:file
. These implementations will be valid overrides oncedart:io
is updated but before then they are just methods on those subclasses. The lint ignores will keep the analyzer from flagging these methods.The text was updated successfully, but these errors were encountered: