Skip to content

The extractor sometimes picks a non-package-local .cc file's command for header files #26

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
matta opened this issue Feb 7, 2022 · 11 comments

Comments

@matta
Copy link
Contributor

matta commented Feb 7, 2022

In most cases the commands extractor seems to generate compile commands that builds the "most closely related" .cc file with each header. E.g. a compile command for "a/thing.h" would actually build "a/thing.cc", and those are usually paired in a single cc_library rule in the //a package. This is unsurprising and intuitive; if I wanted to check if a/thing.h compiled I'd probably build //a:thing. This is also what bazel build --compile_one_dependency a/thing.h would do.

I have noticed that sometimes the bazel-compile-commands-extractor does something else -- it will pick a .cc file in some other package, when a .cc file in the same package is present and would be a better choice.

This was confusing to me in one situation, where I was changing an API and knew most of my project would be temporarily broken. However, I had fixed the local code, so (in the terms of the example above) bazel build //a/... worked and bazel test //a/... passed, but my clangd for a/thing.h was in a borked state because some other package in my project was not building. I can't remember exactly, but I may have also tried to run clang-tidy thing.h and had it fail for the same reason.

Here is one example:

I have a cc_library rule:

# In the //base package (base/BUILD.bazel)
cc_library(
    name = "test",
    srcs = ["test.cxx"],
    hdrs = ["test.hxx"],
    deps = [
        "//third_party/hash-library:sha256",
        ":base",
    ],
    visibility = ["//visibility:public"],
    linkopts = ["-lboost_program_options"],
)

But my compile_commands.json uses a .cxx file from some other package when building "base/test.hxx":

  "file": "base/test.hxx",
  "command": "/usr/lib/llvm-13/bin/clang [...] -c wavl/tree_test.cxx -o bazel-out/k8-fastbuild/bin/wavl/_objs/tree_test/tree_test.pic.o",

Here is the rule with wavl/tree_test.cxx:

# In the //wavl package (wavl/BUILD.bazel)
cc_test(
    name = "tree_test",
    srcs = ["tree_test.cxx"],
    deps = [
        "//base:test",  # wavl/tree_test.cxx includes base/test.hxx
        ":tree",
        "//base",
    ],
)

Bazel has some idea that base/test.cxx is a better choice, because bazel build --subcommands --compile_one_dependency base/test.hxx picks the //base:test target:

$ bazel build --subcommands --compile_one_dependency base/test.hxx
Loading: 
Loading: 0 packages loaded
Analyzing: target //base:test (0 packages loaded, 0 targets configured)
INFO: Analyzed target //base:test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
bazel: Entering directory `/home/matt/.cache/bazel/_bazel_matt/b06faae89ff42e5fae839cbe27f8d9e7/execroot/__main__/'
[0 / 4] [Prepa] BazelWorkspaceStatusAction stable-status.txt
SUBCOMMAND: # //base:test [action 'Compiling base/test.cxx', configuration: eacdf726293faba19ac503f2e5fbecb17c1cd58b5c40ea287666f3a8e3b68318, execution platform: @local_config_platform//:host]
(cd /home/matt/.cache/bazel/_bazel_matt/b06faae89ff42e5fae839cbe27f8d9e7/execroot/__main__ && \
  exec env - \
    PATH=blah blah blah \
    PWD=/proc/self/cwd \
  /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer '-std=c++0x' -MD -MF bazel-out/k8-fastbuild/bin/base/_objs/test/test.pic.d '-frandom-seed=bazel-out/k8-fastbuild/bin/base/_objs/test/test.pic.o' -fPIC -iquote . -iquote bazel-out/k8-fastbuild/bin -Wextra -Werror '-std=gnu++20' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c base/test.cxx -o bazel-out/k8-fastbuild/bin/base/_objs/test/test.pic.o)
# Configuration: eacdf726293faba19ac503f2e5fbecb17c1cd58b5c40ea287666f3a8e3b68318
# Execution platform: @local_config_platform//:host
bazel: Leaving directory `/home/matt/.cache/bazel/_bazel_matt/b06faae89ff42e5fae839cbe27f8d9e7/execroot/__main__/'
Target //base:test up-to-date:
  bazel-bin/base/libtest.a
  bazel-bin/base/libtest.so
INFO: Elapsed time: 1.899s, Critical Path: 1.85s
INFO: 2 processes: 1 internal, 1 linux-sandbox.
INFO: Build completed successfully, 2 total actions
INFO: Build completed successfully, 2 total actions
@cpsauer
Copy link
Contributor

cpsauer commented Feb 8, 2022

Hey, @matta!

Thanks as always for your concern and careful eye.

I definitely know what's happening here; we're currently just picking a command arbitrarily among those used to compile each header.

Recall that the header-command-finding mechanism is working by determining all the headers used in compiling a given source file. (This includes transitive inclusion.) Since headers aren't otherwise compiled, reversing the mapping lets us trace all the compile commands that parse each header!

We're then just picking one (arbitrary) command for each header, among the many that might compile it--and emitting that that command for the header in compile_commands.json. (Once upon a time, we used to include all commands for each header, but this blows up the compile_commands.json file size, since some headers are included numerous times. See #6. This was blocking some people, so I fixed it in 084957e) Clangd just picked an arbitrary one anyway, so, faster, smaller output, no functionality changed for users.

There's no logic here like currently like "prefer the command for the closest source file" -- on the theory that we just needed one representative command, used to compile the file in the targets specified.

FWIW, there may indeed be no close source file. Think templates, header-only libraries, or prebuilt libraries. Assuming the right source is nearby is what gets clangd's current heuristics in trouble, requiring this whole header workaround. compile_commands.json wasn't originally intended to contain source files in the first place! I'm very much hoping that clangd will fix the underlying issue relatively soon, and then we can put this necessary but fairly yucky workaround to rest.

Anyway, it sounds like the crux of your point is that picking the nearest file's command is most likely to be correct, so we should pick it instead of an arbitrary one. I get it, and it hadn't occurred to me before. Though like hopefully the point of the test suite is that they all build and run? More generally, it seems like what we really want is preferring ones where the command is in good shape, but it's not quite clear to me the best way to detect that. I should say that with our current solution, we'll at least (gracefully) fail to find headers when the command is sufficiently messed up, preventing those commands from being emitted for headers.

If you feel strongly that borked sub targets are a common enough case that we should defend against them, I'm definitely open to a fix that uses a heuristic to pick among header commands, rather than just taking whichever one happens to have come first. Code would mesh in parallel to the commit listed above that picks arbitrarily. Happy to help orient you more if you'd like more than that! Also happy to accept incremental heuristics as you think of them or encounter cases.

I should also say that I don't think this is currently on clangd's radar, so if you feel strongly about the importance of this, you might want to chime in on their issue. That way the behavior you like will hang around after we move off the workaround!

All the best,
Chris

P.S. The more we write, the more I'm getting the sense you might be a current or former Googler, too. Is that right?

@cpsauer cpsauer changed the title The extractor sometimes picks a non-package-local .cc file for when building a header file. The extractor sometimes picks a non-package-local .cc file's command for header files Feb 8, 2022
@matta
Copy link
Contributor Author

matta commented Feb 8, 2022

I think picking a "nearby" source file for each header is likely to make good sense in most cases, but clangd/clangd#123 (comment) makes an interesting point:

most of the time the user would like to see the header in context of the most recent file they were working in and existing solution guarantees that in most cases.

Related is Sam's admission here: clangd/clangd#45 (comment). In most code in Google even header files are "self contained" (https://google.github.io/styleguide/cppguide.html#Self_contained_Headers), and command like clang [flags] foo.h work! In this context, the question of which "associated .cc file" to use is about what compiler flags to pick, not which one to include in the compilation_commands.json command line(s).

Hm, might it work to support that expression of compile_commands.json, for code bases that enforce this style rule?

P.S. Yep, worked at Google for 13 years on all sorts of stuff, most recently the C++ library team, left 2.5 years ago. When I joined Chrome, iOS, Android, and Bazel did not exist, the Google BUILD files were just Python, and the build system spat out GNU Makefiles. :-)

@cpsauer
Copy link
Contributor

cpsauer commented Feb 9, 2022

:) Yep, agreed! And cool stuff! Curious what you're up to these days--but I'll connect separately.

@cpsauer
Copy link
Contributor

cpsauer commented Feb 9, 2022

Just landed Lukas' big Windows-prep refactor, so the merge-conflict coast is clear if you want to take a shot at this!
Logic would go near the set() "Only emit one entry per header" of the new refresh.template.py

@cpsauer
Copy link
Contributor

cpsauer commented Apr 7, 2022

Saw os.path.commonprefix while landing Windows support and thought back about this. Has this continued to be an issue @matta?

@cpsauer
Copy link
Contributor

cpsauer commented May 20, 2022

@matta, I'm tempted to close this, unless it's continuing to be an problem for you and/or worth your tossing up a quick PR to fix

@matta
Copy link
Contributor Author

matta commented May 20, 2022 via email

@cpsauer
Copy link
Contributor

cpsauer commented May 20, 2022

Will do. Curious what you've moved on to!

If anyone else reading this later has the same need--holler and we'll reopen.

@cpsauer cpsauer closed this as completed May 20, 2022
@sthornington
Copy link

sthornington commented Sep 5, 2024

I'm digging into this now, mostly because we also use compile_commands.json to drive a clang-tidy job, and while everything builds fine, the hedron stuff is picking exemplar cpp files which are actually generated ones (e.g. cython .cxx) that we do not care are not tidy. So I'm trying to find the spot in refresh.template.py where I can add an exclusion pattern to the source files that it chooses from when picking some arbitrary source file to drive the compilation of a given header.... pointers appreciated!

@sthornington
Copy link

@cpsauer could you point me to the spot in the python which would be where one would put these heuristics in? I could do what I need there, I think...

@sthornington
Copy link

sthornington commented Sep 6, 2024

Okay I found something that works, I think - in my branch, I added another parameter which contains a list of source file extensions to treat as ineligible for header file command generation, exactly as is done hard-coded for assembly stuff here https://github.com/hedronvision/bazel-compile-commands-extractor/blob/main/refresh.template.py#L670

PS - I am happy to update my PR with these various changes but I haven't seen any traction on it so for now I'm just carrying on in our internal mirror with this stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants