Shouldn't -fmodules-embed-all-files be the default? #72383
@llvm/issue-subscribers-clang-modules Author: Boris Kolpackov (boris-kolpackov)
Consider these two translation units:
// hello.mxx
module;
#include <string>
#include <string_view>
export module hello;
export namespace hello
{
  std::string
  say_hello (const std::string_view& name)
  {
    return "Hello, " + std::string (name) + '!';
  }
}
// main.cxx
#include <string_view>
import hello;
void f (const std::string_view&) {}
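int
main ()
{
  hello::say_hello ("World");
}
If I compile hello.mxx with Clang 16, then remove hello.mxx, and attempt to compile main.cxx, I get an error.
This can be fixed by adding -Xclang -fmodules-embed-all-files when compiling hello.mxx. However, -fmodules-embed-all-files appears to no longer be necessary for this example if using Clang 17 or later.
But a minor change in main.cxx can bring its requirement back if compiling via -frewrite-includes:
// main.cxx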
#include <string_view>
import hello;
void f (const std::string_view&) {}
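int
main ()
{
  f (hello::say_hello ("World")); // <-- NOW CALLING f().
}
But:
And if I add -Xclang -fmodules-embed-all-files to the second command, then everything again compiles fine. I get exactly the same behavior with Clang 18.
I have two questions about this:
Is -fmodules-embed-all-files the default starting from Clang 17? If the answer is yes, then there seems to be a bug in the -fdirectives-only interaction.
If -fmodules-embed-all-files is not the default, then should it not be made so? A BMI that still has references to source files is quite brittle since it can be moved around. AFAIK, neither GCC nor MSVC have this restriction for their BMIs.
@iains @ChuanqiXu9 @dwblaikie |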
For myself, I think not embedding the source is the right default. (Without modules, the build already depends on scattered source files on disk; I'm OK with it depending on them with modules too, even indirectly through module files.) |
No.
I am happy to make it the default, at least for C++20 modules. It will be pretty helpful for distributed builds and sandbox-based builds. |
While I don't disagree with any of the points made (it's a tradeoff), I want to highlight one aspect that I think is overlooked: not embedding the sources prevents you from moving the BMI. The fact that Clang 17 now seems to be needing the sources in fewer cases might actually make the matter worse since you usually don't get an error even if you moved the BMI. It took me probably half a day yesterday to try to understand what on earth is going on. |
Here is one data point towards that: the |
The size of the embedded files is much smaller than the size of the BMI. Let's make the change in 2 weeks if no objections come in. |
Not sure I follow - is it 31MB without embedded files, or with them? How much does it change with/without?
Could we get more details on that? Do they embed the source files? Do they do something else? |
It is 31662KB with and 30574KB without.
I cannot say for sure whether they embed source files (my guess is that they do not, they just serialize the AST). But I can say for sure they don't have shallow references to source files since in |
Fair enough, that sounds alright.
I wouldn't necessarily conclude that - preprocessed files might still have |
I am not sure I follow the reasoning here: yes, the preprocessed
I am probably missing something here, but I don't think that GCC and MSVC not having such a requirement are doing something special. Rather, Clang needing access to the original source code in addition to |
GCC at least also quotes source code in diagnostics, as I understand it - so I'd be curious to know whether they have the same dependence, or they embed the source code? If they do embed source, that seems helpful to know/double-check our direction here. |
CC @iains for the details in GCC. |
But shouldn't the quoted source code in diagnostics come from the original (unpreprocessed) source file rather than from the preprocessed? Though I suppose if it's only partially preprocessed (with Just to make doubly sure, the setup is:
|
ah, that seems like a different concern than the one outlined at the start of the bug - perhaps other compilers only depend on the files referenced by #file directives, whereas clang depends on the actual input file? It'd be good to understand more about how other compilers deal with these cases - it doesn't seem obvious to me that an implementation would refer back to the original source files (how would that work when you are compiling the original code anyway? Maybe you only have access to the actual input file, not the files referenced in #file directives... - so depending on the real files seems like a problem too) |
Sent #74419 |
CC @iains and Gaby (I don't know why I can't CC Gaby directly. I'll try to send a private mail.) |
|
Thanks for pinging via email. No, MSVC does not embed the input source file in its IFC. There is a pending ask from EDG for the IFC spec to have some form of "resolved token streams" partition embedded in an IFC, but I think that is different. |
That is what I would expect. |
GCC has references to the source file paths (one in a section which is essentially human-readable information about how the BMI was built; but I do not believe that participates in the validation of whether a BMI is applicable). The other instance is related to .file directives, I think (you might be better to confirm the intent with @urnathan). As for embedding source text GCC does not do this AFAIK. If I remember previous conversations correctly, the reason for this in clang was to do with improving debug experience with clang modules (edit: e.g. when builds are distributed), @Bigcheese ? |
+1 |
It makes no sense that Clang requires the original file in order to compile, but the solution is to fix Clang to stop requiring that. I think it does not make sense to modify Clang to embed all the source code into the module by default. If Clang cannot read a source file, it is reasonable for diagnostics to skip printing the source-code context for that error, but why should there be any other issues? |
Source locations are pretty fundamental in Clang and are used for lots of things, but the actual content itself has two uses that I'm aware of. The first is for diagnostics, and it would be sad if using modules meant you got worse diagnostics. The other is that we store pointers into file buffers, so |
The good side of embedding source files is that it is easy to implement. The bad side is that it takes more space in the BMI files. Are there any other downsides? And if we don't want to embed source files, then, from my experience, we may need to touch the separate parts of the lexer that may read the source files, which may require a larger refactoring. |
Embedding all source files seems to move even farther from being able to generate a BMI which doesn't require a full rebuild of all transitive dependencies -- even for source changes which shouldn't affect the interface. I know we're already not able to do that today, but if we decide to embed entire sources, I'm worried that effectively closes the door on such an idea for good. |
But that is already the case today, isn't it? For example, the recorded source locations of declarations will change after we make almost any change to the current source file. Also, I feel embedding the source files may not prevent us from reducing the transitive dependencies. That said:
What we want is for the BMI of mod1.cppm to remain unchanged if we only touched |
If we now embed the source files by default (so that users don't have to specify |
I know next to nothing about how the PCM file format works, so what I am going to say here (from the IFC perspective) may not apply. If you embed the input source file in the PCM file, and a subsequent change to the source file does not result in any semantic change (e.g. an edit to comments or similar), in the sense that the BMI is otherwise the same as before, then all users of that PCM file would have to be recompiled just because of the change to the embedded source file. Whether that is a scenario that you deem important is up to you; but from the IFC perspective, it isn't something we would do by default. |
Yeah, it sounds like https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755. And I am curious how MSVC handles the changed diagnostic problem (the reported locations in diagnostic may not be accurate) and the debug information problem (the locations in the debug information may not be accurate) if we changed the source locations in a module unit but don't recompile the users? |
While this sounds plausible, I think it's somewhat theoretical: if the source file changes then the build system will recompile the BMI anyway and any consumers will also need to be recompiled unless there is some more elaborate, hash-based change tracking involved (in which case this more involved approach can also ignore/take into account the extent of changes to the embedded source code). |
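To make the "hash-based change tracking" mentioned above concrete, here is a minimal sketch of the idea, not a description of what any existing build system does. The names `content_hash` and `bmi_tracker` are hypothetical, and FNV-1a is only a stand-in for whatever hash a real tool would use. A consumer is rebuilt only when the bytes of the BMI it depends on actually change; note that if the BMI embeds source text, any edit to that text changes the bytes and therefore the hash.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// FNV-1a over the BMI bytes. This is only a stand-in content hash
// (an assumption); a real build system would likely use a stronger one.
inline std::uint64_t
content_hash (std::string_view data)
{
  std::uint64_t h = 14695981039346656037ull; // FNV offset basis.
  for (unsigned char c : data)
  {
    h ^= c;
    h *= 1099511628211ull; // FNV prime.
  }
  return h;
}

// Hypothetical record of the BMI hash each consumer was last built against.
class bmi_tracker
{
public:
  // Return true if the consumer has to be recompiled against this BMI,
  // and remember the hash it is now (re)built against.
  bool
  needs_rebuild (const std::string& consumer, std::string_view bmi_bytes)
  {
    std::uint64_t h = content_hash (bmi_bytes);
    auto [it, inserted] = seen_.try_emplace (consumer, h);

    if (inserted)
      return true;  // Never built before.

    if (it->second == h)
      return false; // Same BMI bytes as last time.

    it->second = h;
    return true;    // BMI bytes changed (e.g. embedded source edited).
  }

private:
  std::unordered_map<std::string, std::uint64_t> seen_;
};
```

A tracker along these lines could also hash only part of the BMI (for example, skipping an embedded-source section), which is the "ignore/take into account" option referred to above; how such a section would be identified is left open here.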
One man's theory is another man's practice. What I described is a concrete scenario observed in practice in dev inner loops.
The compiler is invoked to generate a new BMI, correct, BUT the on-disk file holding the actual BMI does not need to change if there is NO binary diff (say, as reported by a tool like |
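The "don't touch the on-disk BMI if nothing changed" behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `read_all` and `write_if_changed` are hypothetical helper names, and the freshly generated BMI is assumed to be available in memory as a byte string.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Read the whole file into memory as raw bytes.
inline std::string
read_all (const fs::path& p)
{
  std::ifstream in (p, std::ios::binary);
  return std::string (std::istreambuf_iterator<char> (in),
                      std::istreambuf_iterator<char> ());
}

// Replace the file only if the new bytes differ from what is on disk.
// Return true if the file was (re)written, false if it was left alone
// (in which case its modification time is also left alone).
inline bool
write_if_changed (const fs::path& p, const std::string& bytes)
{
  if (fs::exists (p) && read_all (p) == bytes)
    return false;

  std::ofstream out (p, std::ios::binary | std::ios::trunc);
  out.write (bytes.data (), static_cast<std::streamsize> (bytes.size ()));
  out.flush ();

  if (!out)
    throw std::runtime_error ("unable to write " + p.string ());

  return true;
}
```

With a scheme like this, a comment-only edit leaves the BMI file untouched only if the BMI bytes really are identical, which is why embedding the source text (or recording source locations that shift) matters for this scenario.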
It looks like this is the main concern, and it is closely related to my previous patches: #96453. My conclusion from that patch series is: because we encode source locations into the BMI, the BMI may always change after the corresponding source file changes. The idea in #96453 is to avoid making such changes transitive, and that job should be done. So the major concern looks resolved, and I would like to land the changes to enable |
I sent an RFC for this topic: https://discourse.llvm.org/t/rfc-modules-should-we-embed-sources-to-the-bmi/81029 |
So, just to clarify, the decision is to not make the |
yes |
Consider these two translation units:
If I compile hello.mxx with Clang 16, then remove hello.mxx, and attempt to compile main.cxx, I get an error:
This can be fixed by adding -Xclang -fmodules-embed-all-files when compiling hello.mxx.
However, -fmodules-embed-all-files appears to no longer be necessary for this example if using Clang 17 or later:
But a minor change in main.cxx can bring its requirement back if compiling via -frewrite-includes:
But:
And if I add -Xclang -fmodules-embed-all-files to the second command, then everything again compiles fine. I get exactly the same behavior with Clang 18.
I have two questions about this:
Is -fmodules-embed-all-files the default starting from Clang 17? If the answer is yes, then there seems to be a bug in the -fdirectives-only interaction.
If -fmodules-embed-all-files is not the default, then should it not be made so? A BMI that still has references to source files is quite brittle since it can be moved around. AFAIK, neither GCC nor MSVC have this restriction for their BMIs.
@iains @ChuanqiXu9 @dwblaikie