ROOT loads unneeded PCMs #13000
Info from @ktf: with 100 processes linked against ROOT running on these machines, that's an extra 70 MB × 100 = 7 GB used by (unneeded?) PCMs. |
Yes, and unfortunately that is as good as it gets with the current infrastructure. Here I've put a bit more detail about what could be improved to further reduce the memory footprint: cms-sw/cmssw#41810 (comment) Note that the loading of the pcm files is mmap-ed and is a no-op. The problem is that some sections of the PCMs are not lazy and are eagerly deserialized. Using a PCH won't make things much better, since the PCH still eagerly deserializes some sections. That is, the deficiency is not in ROOT but in Clang itself. Could you rerun using `ROOTDEBUG=7`? |
I will provide you with the output for `ROOTDEBUG=7`. Could you elaborate a bit on the mmap? I would expect that using MAP_SHARED would be enough of a workaround for our (ALICE) use case, because what is really killing us is the fact that the memory is not shared among the many processes we have. Any reason why that cannot be done? Did I misunderstand something? |
The pcm files are "mmapped" when being opened. That practically reserves some virtual memory with the promise that it won't increase your RSS unless something is actually needed from that file. Here is how this is achieved:
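In rough terms, the mechanism looks like this (a minimal sketch assuming POSIX mmap; illustrative only, not the actual LLVM `MemoryBuffer` code):

```cpp
// Sketch: mapping a PCM file reserves virtual address space but adds no RSS
// until a page is actually read. "Core.pcm" is a hypothetical file name.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  int fd = open("Core.pcm", O_RDONLY);
  if (fd < 0) return 1;
  struct stat st;
  if (fstat(fd, &st) < 0) { close(fd); return 1; }
  // MAP_SHARED file-backed pages live in the page cache, so N processes
  // mapping the same PCM share one physical copy of the untouched pages.
  void *buf = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  if (buf == MAP_FAILED) return 1;
  // Virtual memory has grown; RSS has not. Only when the deserializer reads
  // a byte does the kernel fault the corresponding page in:
  volatile char c = static_cast<const char *>(buf)[0]; // +1 page of RSS
  (void)c;
  munmap(buf, st.st_size);
  return 0;
}
```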
I believe your issue is that, due to some identifier lookup, we start loading pcm files which have sections that require eager deserialization, where the mmap manifests as a real RSS increase.
The reason is that the serialization in Clang has deficiencies and reads from disk when a pcm file is loaded. I've been hunting down these cases, and sometimes we could avoid them. That's why I was looking for some output that could help us do that. That being said, we could make some effort to split the startup phase of ROOT into loading and initialization. Then we could move the registration of pcm files and the setup of ROOT's runtime into the initialization step, but realistically, if you use ROOT for anything, you'd probably need both... |
OK, so you are saying that it's not the mmapping that causes the problem, but the subsequent allocations due to some deserialisation. OK, I checked and indeed mmap is called with MAP_SHARED... Find attached the ROOTDEBUG=7 log. This is only for the component which reads the flat ROOT files (i.e. only basic types). |
According to the log ROOT is looking up |
The behavior seems to have indeed changed, yet I still see the same amount of RSS being allocated in the same codepath. See attached log. |
Anything else which I can try here? |
@ktf,
Looks quite good. However, for some reason we pre-load a lot more modules than I expect. Do you have an up-to-date ROOTSYS/lib/modules.idx file? |
I have whatever is generated by building ROOT 6.28.4. I attached it. Is there a way to trim it down for a specific proxy? |
Any further idea? |
Anything else we can try here? I would really love to get rid of the overhead from ROOT. |
Apologies for being slow. I have been attending a workshop these past several days... This is what ROOT (without the suggested patch) does for me:

ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep Preloading | wc -l

Essentially it loads 56 pcm files at startup. That's still pretty bad, but not 100. /usr/bin/time -v root.exe -l -b -q shows still probably too much memory being used, but that's what the state-of-the-art modules technology allows us. The numbers are not great, but I cannot reproduce the pcm loads that you report. Can you prepare some debug environment for me to check it out? I will need a standard build of ROOT perhaps with |
TLDR: I repeated your tests with our environment; see below for the actual setup. Without the root file being opened, I get more preloaded dictionaries (119), but roughly the same memory footprint. Adding a ROOT file on the command line adds an additional 60 MB, and I see SOFIE and PyMVA being loaded on demand (see also #13055). Long story: you can set up the same environment by going to lxplus and doing:

→ ALIBUILD_ARCH_PREFIX="Packages" WORK_DIR=/cvmfs/alice.cern.ch/el7-x86_64 . /cvmfs/alice.cern.ch/el7-x86_64/Packages/ROOT/v6-28-04-9/etc/profile.d/init.sh
/cvmfs/alice.cern.ch/el7-x86_64/Packages/AliEn-Runtime/v2-19-le-136/etc/profile.d/init.sh:source:7: no such file or directory: /cvmfs/alice.cern.ch/el7-x86_64/Packages/ApMon-CPP/v2.2.8-alice5-40/etc/profile.d/init.sh

You can safely ignore the ApMon message. Without the root file, I still see over one hundred modules preloaded, while the "on demand" count and the memory usage are roughly the same.

# eulisse at lxplus707.cern.ch in ~ [9:39:17]
→ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep Preloading | wc -l
119
# eulisse at lxplus707.cern.ch in ~ [9:39:24]
→ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l
34
# eulisse at lxplus707.cern.ch in ~ [9:39:57]
→ /usr/bin/time -v root.exe -l -b -q
Command being timed: "root.exe -l -b -q"
User time (seconds): 0.22
System time (seconds): 0.26
Percent of CPU this job got: 62%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.78
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 139016
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 56647
Voluntary context switches: 5479
Involuntary context switches: 23
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

With the root file being loaded, the preloading stays almost the same, while "on demand" goes to 41. This also results in a 60 MB jump in memory:

→ ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep Preloading | wc -l
120
→ ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l
41
→ /usr/bin/time -v root.exe -l -b -q ~/public/AO2D.root
Attaching file /afs/cern.ch/user/e/eulisse/public/AO2D.root as _file0...
(TFile *) 0x3910f20
Command exited with non-zero status 255
Command being timed: "root.exe -l -b -q /afs/cern.ch/user/e/eulisse/public/AO2D.root"
User time (seconds): 0.42
System time (seconds): 0.30
Percent of CPU this job got: 72%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.01
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 207928
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 76766
Voluntary context switches: 9921
Involuntary context switches: 28
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 255

The difference indeed seems to come from loading SOFIE and PyMVA on demand (why?).

→ diff -Naur =(ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep Preloading) =(ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep Preloading)
--- /tmp/zsh3krahi 2023-06-28 09:47:35.214307973 +0200
+++ /tmp/zshDEGUqN 2023-06-28 09:47:35.843315522 +0200
@@ -117,4 +117,3 @@
Info in <TCling::__LoadModule>: Preloading module MultiProc.
Info in <TCling::__LoadModule>: Preloading module Imt.
Info in <TCling::__LoadModule>: Preloading module MathCore.
-Info in <TCling::__LoadModule>: Preloading module Tree.

→ diff -Naur =(ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq) =(ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq)
--- /tmp/zshD2FXJ0 2023-06-28 09:48:55.414270426 +0200
+++ /tmp/zsh1jCtJt 2023-06-28 09:48:56.293280974 +0200
@@ -1,5 +1,4 @@
'Cling_Runtime'
-'Cling_Runtime_Extra'
'Core'
'Foam'
'Genetic'
@@ -9,7 +8,6 @@
'Hist'
'HistFactory'
'Imt'
-'libc'
'MathCore'
'MathMore'
'Minuit'
@@ -17,7 +15,6 @@
'MultiProc'
'Net'
'NetxNG'
-'PyMVA'
'Rint'
'RIO'
'RooFitCore'
@@ -26,16 +23,12 @@
'RooStats'
'ROOTDataFrame'
'ROOT_Foundation_Stage1_NoRTTI'
-'ROOT_Rtypes'
-'ROOTTMVASofie'
-'ROOTTMVASofieParser'
'ROOTTreeViewer'
'ROOTVecOps'
'ROOTWebDisplay'
'Smatrix'
'std'
'Thread'
-'TMVA'
'Tree'
'TreePlayer'
'XMLIO'

The full list of modules being preloaded can be found below, while the full list of on-demand modules is:

'Cling_Runtime'
'Cling_Runtime_Extra'
'Core'
'Foam'
'Genetic'
'GenVector'
'Geom'
'Gpad'
'Hist'
'HistFactory'
'Imt'
'libc'
'MathCore'
'MathMore'
'Minuit'
'Minuit2'
'MultiProc'
'Net'
'NetxNG'
'PyMVA'
'Rint'
'RIO'
'RooFitCore'
'RooFitHS3'
'RooFitJSONInterface'
'RooStats'
'ROOTDataFrame'
'ROOT_Foundation_Stage1_NoRTTI'
'ROOT_Rtypes'
'ROOTTMVASofie'
'ROOTTMVASofieParser'
'ROOTTreeViewer'
'ROOTVecOps'
'ROOTWebDisplay'
'Smatrix'
'std'
'Thread'
'TMVA'
'Tree'
'TreePlayer'
'XMLIO'

The file used can be found in |
Ok, indeed SOFIE gets loaded because of the Experimental namespace, I guess. I do not see it with your patch. I will try to do a full build on CVMFS with it applied. |
Could it be that the issue is that TBranch now has `Experimental::Internal::TBulkBranchRead` as a member? |
Yes! Very likely. |
As described in root-project#13000, ROOT will happily load all the PCMs related to the Experimental namespace, just because such a namespace is found as a transient member of TBranch. This results in a non-negligible amount of private memory being used, due to the fact that a number of Experimental features will see their PCMs deserialised.
That's actually it. If I remove the Experimental namespace and simply use Internal::TBulkBranchRead, it does not load the other Experimental bits either. |
And I just realised that the "big modules" are not necessarily big; they just happen to trigger the resizing of some pool. 5280ade seems to give 4-5 MB: nothing dramatic. |
However, that in practice shows a more generic design problem in the clang modules system. That is, when we make a lookup of a namespace identifier, clang (rightfully) tries to collect all namespace partitions from all reachable modules. This has to do with things like overload resolution. Due to the autoloading system, ROOT essentially considers all modules reachable and thus loads them. The only reliable way to fix this is to make the clang module loads a no-op, which is a bit of a challenge… |
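To illustrate the namespace-partition point with plain C++ (a self-contained sketch; the two re-openings stand in for partitions that would live in separate modules):

```cpp
// Each re-opening of a namespace is a "partition". With modules, the two
// partitions below could live in two different PCMs; resolving the call
// still requires clang to see both candidate sets, so both modules get
// deserialized. Names here are hypothetical, for illustration only.
#include <cstdio>

namespace Experimental {           // partition 1 (imagine: module A)
void process(int) { std::puts("int overload"); }
}

namespace Experimental {           // partition 2 (imagine: module B)
void process(double) { std::puts("double overload"); }
}

int main() {
  // Overload resolution must consider every partition of Experimental:
  Experimental::process(3.14);     // prints "double overload"
  return 0;
}
```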
That pool resizing has to do with source locations, right? Can you show the stack trace? |
Yes, that's what I was talking about. That's the relevant line of code that we need to "fix" in clang, and we try hard to avoid this operation by delaying the module loads as long as possible with the global module index (modules.idx) file. However, the QualType thing is new to me. Do you think we go through here? One can see the module file contents by running in ROOT:
|
If I understand correctly, every single PCM we have has 137 SPECIAL_TYPES. I am now trying to do a debug build to check if we go through the line you suggested.
To be clear, as I said above, the profile changed; however, the total sum for cling initialisation is still at 70 MB. |
I meant to ping on the stats without a file.
Can you share the new profile? |
So I finally managed to get back to this and try a few things, and as a result in my tests I managed to go from 82 MB of overhead to 62 MB. See #13641 for the actual code. While there are still a few things to clean up, I think what I have is a fairly general and quite self-contained change, so I would appreciate feedback and, if people like it, instructions on how to proceed to make sure this ends up upstream. The solution is based on a newly introduced `PagedVector` class. Initially I developed the mechanism to be general enough to do complex initialisations of the elements; however, I later realised this is not actually needed in order to optimise the issue with the large vectors. A few caveats:
Comments? |
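The core idea behind such a lazily-allocated container, as a minimal sketch (illustrative names and a simplified grow-only interface; the actual class proposed in the PR and in llvm/llvm-project#66430 differs):

```cpp
// Sketch of a "paged" vector: elements are grouped into fixed-size pages and
// a page is only materialised (and its memory touched) on first access.
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t PageSize = 1024>
class PagedVectorSketch {
  std::vector<std::unique_ptr<T[]>> Pages; // a page stays null until touched
  std::size_t Size = 0;

public:
  // Grow-only "resize": records the new logical size and makes room for the
  // page pointers, but allocates no elements yet.
  void expand(std::size_t NewSize) {
    if (NewSize > Size) {
      Size = NewSize;
      Pages.resize((NewSize + PageSize - 1) / PageSize);
    }
  }
  T &operator[](std::size_t I) {
    assert(I < Size && "index out of range");
    auto &Page = Pages[I / PageSize];
    if (!Page) // first touch: materialise exactly one page, zero-initialised
      Page = std::make_unique<T[]>(PageSize);
    return Page[I % PageSize];
  }
  std::size_t size() const { return Size; }
};

int main() {
  PagedVectorSketch<int> V;
  V.expand(1000000); // cheap: only the page-pointer table grows
  V[42] = 7;         // touches (and allocates) a single page
  return V[42] == 7 ? 0 : 1;
}
```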
This looks awesome!
We did not come up with a better name. Ours was
If it is not needed, let's keep it simple. It would be easier to go through the upstream llvm reviews.
Can we make the API of the
The source locations offset would be a major source of improvement if this technique flies there. PS: It seems llvm has some facility along these lines: https://llvm.org/doxygen/classllvm_1_1SparseBitVector.html |
No, you would first need to register the range [100, 200) via addEmptyRange(). I could try to modify it for that behaviour, if you prefer. Maybe it would actually allow me to get rid of the logarithmic lookup (basically, one less vector to worry about).
Ok, I will clean it up a bit more.
Yes, users would need to call addEmptyRange where they now do resize(). I guess I could actually hide addEmptyRange inside resize. Is it allowed to use std::pmr::vector in the llvm codebase?
I couldn't find the source locations vector anymore. Could you point me to it? |
I will have a look. |
I don't think so. I could not find uses of it in clang.
From here you will need to jump to the |
@vgvassilev I have updated the PR to include a similar patch for the SourceManager, and at least the trivial test seems to work fine, including a nice 9 MB reduction in memory allocations. I am now testing with the ALICE analysis framework. I have also done a few of the cleanups, and it now only exposes an "expand" interface (basically resize without shrinking). AFAICT it's not worth implementing the full "resize" functionality, given that it's complicated and at least the places I fixed never shrink. I also think the calloc approach might not be a good idea, given that realloc does not guarantee that the new memory is zeroed, and besides, page ranges might be a tad too big for a vector of pointers, and so on. I would put that idea aside, at least for now. My current plan:
|
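The realloc caveat in concrete terms (a small sketch, not code from the PR):

```cpp
// realloc does not zero the bytes it adds, so a grow-only container that
// promises zero-initialised elements would still have to clear the new tail
// by hand, losing much of calloc's appeal.
#include <cstdlib>
#include <cstring>

int main() {
  int *p = static_cast<int *>(std::calloc(4, sizeof(int))); // p[0..3] zeroed
  if (!p) return 1;
  int *q = static_cast<int *>(std::realloc(p, 8 * sizeof(int)));
  if (!q) { std::free(p); return 1; }
  // q[4..7] are indeterminate here; they must be zeroed explicitly:
  std::memset(q + 4, 0, 4 * sizeof(int));
  std::free(q);
  return 0;
}
```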
For the record, I have done what I outlined above and updated the PR. I do see some drastic improvements for some of our workflows (250 MB out of 1 GB) where we have many processes initialising the interpreter. For others, where the usage of ROOT is limited to reading files in a single process, the improvement is not so obvious. Simply opening a file does show an improvement as well (I am down to 49 MB when I also patch the FileInfo vector in the HeaderSearch). The PR somehow seems to die with some old memory corruption which I am pretty sure I fixed and which I cannot reproduce anymore. Is there any need to clean some cache? Notice I have also submitted the patch to llvm itself, and it passes their CI (llvm/llvm-project#66430 (comment)) |
I cannot reproduce it either, but a number of Linux builds agree that the assertion can still be triggered. PRs should be built from a clean state (in Jenkins), but even if not, you are not changing the on-disk format, so incremental builds must also work correctly. This needs investigation.
Very likely the pre-merge CI does not test a large scale module setup. You will have to test this yourself and prove that the changes don't regress the case when modules are used (almost) completely. This will be needed to convince the wider community that it is solving an actual problem, or at least not regressing the "normal" compiler use case. |
What I think is happening is that the broken PR generated some broken file, and the new PR is not able to survive such a broken file. IMHO, this is a problem with the build setup (not being able to notice broken files and delete them) more than with my PR (which of course has no dependency on some intermediate state). Shall I just open a new PR and we see how that goes? |
I'll do what it takes, thank you for pointing that out. |
I'm telling you that this is not the case: there are no left-behind files (in Jenkins) that influence future PR runs. We know this from changing the on-disk format / upgrading LLVM / etc. There is a problem and it needs to be investigated. |
I think quite a few improvements landed in 6.30.04 (released) and master. @ktf, is it difficult for you to check whether the issue is fixed and, if so, close the item? If not, I can start from your repro and proceed. |
Indeed the patch I provided cuts half of the overhead; however, there is still 40 MB per process that I cannot really justify at the moment. In our case that translates to 4 GB of RSS. While I appreciate that being completely lazy in the PCM loading is probably complicated, maybe some tactical solution could be employed (like it was done for the PagedVector). For example, I am not convinced ReadSLocEntry needs to keep the buffer around. There are moreover a few more places where the PagedVector could be used effectively; I will try to propose a separate PR for that. The reproducer is as easy as opening a ROOT file, see the main issue. A new profile is: |
I am wondering if we are not pushing repetitive pragma diagnostic mappings. We have tons of pragmas coming from the Linkdef files; I am not sure they are needed in the module file itself... |
In theory, if the software stack (not just ROOT) uses modules the RSS should also go down. Maybe that's worth trying on your end... |
Any way I can confirm this? |
Not directly. Either we can run rootcling with |
Dear @ktf, I am closing the issue assuming that this is not an issue any more for ALICE when adopting ROOT 6.32.X. Please do not hesitate to re-open in case this is not the case, providing the context necessary for us to fix the problem. |
It's not an issue for adoption. The underlying issue is still there, though: ROOT still loads a bunch of unneeded PCMs when simply opening a file; it's just that the cost is half what it was before, and the reproducer is the same as above. |
Describe the bug
When doing a simple `TFile::Open()`, ROOT initialises the interpreter on the first `TEnv::GetValue` needed by the aforementioned method. This makes it load all the PCMs it can find, regardless of whether they are needed to read the file or not (in our case we only have branches with basic types, e.g. floats or ints or vectors thereof). This has the net result of adding 70 MB of RSS for no particular reason to any process which uses ROOT. You can find more details in the attached Instruments screenshot.
Moreover, single stepping with the debugger and printing out the LoadModule argument, it seems that the PCMs for SOFIE, PROOF and 3D graphics are particularly onerous, none of which we need or link.
What is the expected behaviour?
The best would be for ROOT to be smart enough to lazily load the PCMs only when they are actually needed, not upfront. A second-best option would be the ability to start the interpreter in a clean-slate state and load by hand only the PCMs needed to read a file.
How to reproduce?
On macOS using Instruments one can do:
and get the report in question. Using the ALICE environment should not matter, but in principle you should be able to load it from CVMFS.
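Equivalently, a minimal stand-alone reproducer (a sketch assuming a ROOT build is available; the file name is a placeholder):

```cpp
// Merely opening a ROOT file starts the interpreter and triggers the eager
// PCM deserialisation measured above. Compile with e.g.:
//   g++ repro.cxx $(root-config --cflags --libs) -o repro
#include <TFile.h>

int main() {
  // Any flat file with basic-type branches will do; the name is a placeholder.
  TFile *f = TFile::Open("AO2D.root");
  // By this point RSS is already tens of MB higher than the binary alone
  // would need, before any branch has been read.
  delete f;
  return 0;
}
```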
ROOT version
6.28.04, but we have been seeing it for a while.
How did you install ROOT?
ALICE build environment
Which operating system are you using?
macOS, Linux
Additional context
No response