Skip to content

the first line of .d files is ambiguous due to clang's output format #2046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
andrewrk opened this issue Mar 11, 2019 · 17 comments
Open

the first line of .d files is ambiguous due to clang's output format #2046

andrewrk opened this issue Mar 11, 2019 · 17 comments
Labels
bug Observed behavior contradicts documented or intended behavior contributor friendly This issue is limited in scope and/or knowledge of Zig internals. upstream An issue with a third party project that Zig uses. zig cc Zig as a drop-in C compiler feature
Milestone

Comments

@andrewrk
Copy link
Member

zig/src/cache_hash.cpp

Lines 431 to 432 in fec4555

// skip first line
SplitIterator_next(&it);

But here's an example file:

test.o: test.c b/foo.h

In this example there are 2 files on the first line. The dep file parsing probably needs to gain a real tokenizer, and support any number of items on any line. For example it also doesn't look for multiple files on the same line if there are double quotes.

I also haven't managed to find documentation for this file format. It would be nice to find a reference. Supposedly it's "NMake or Jom" format.

Finally, audit the places that call cache_add_dep_file and make sure they don't redundantly add the main source file, since it's contained in the dep file as well.

@andrewrk andrewrk added bug Observed behavior contradicts documented or intended behavior contributor friendly This issue is limited in scope and/or knowledge of Zig internals. stage1 The process of building from source via WebAssembly and the C backend. labels Mar 11, 2019
@andrewrk andrewrk added this to the 0.4.0 milestone Mar 11, 2019
@surajbarkale
Copy link

Technically dep file must be a valid Makefile. In practice, compilers support a few different forms. Ninja build has a comprehensive parser for the format: https://github.com/ninja-build/ninja/blob/master/src/depfile_parser.in.cc

The supported constructs end up being:

  1. Simple
    out: in1 in2 in3
  1. Escaped newlines
    out: in1 \
         in2 \
         in3
  1. Separate declarations
    out: in1
    out: in2
    out: in3
  1. Ignore phony rules generated by the gcc -MP option
    out: in1 in2 in3
    in1:
    in2:
    in3:

@mikdusan
Copy link
Member

mikdusan commented Apr 1, 2019

I also haven't managed to find documentation for this file format

file clang/lib/Frontend/DependencyFile.cpp function PrintFilename() is probably the best we're going to get. what I can see right now is if someone is doing trickery with quotes inside an include file name, clang's emission doesn't handle it too well so they're really not all too formalized anyways.

Here is output from both styles, using 2 source files foo.c is simple, and odd.....name.c is including a bunch of specials. One important surprise is that the target name uses make/compatible format even if -MV is passed to clang. So I'm thinking it would probably just be easier to implement a tokenizer for make/compatible because NMake/Jom format would require a tokenizer that understood make/compatible anyways.

NMake/Jom format

clang -MV -MM -MF - -MG foo.c 'odd{ball} #$^! name.c'

foo.o: foo.c foo_01.h foo_02.h foo_03.h foo_04.h foo_05.h foo_06.h \
  foo_07.h foo_08.h foo_09.h foo_10.h
odd{ball}\ \#$$^!\ name.o: "odd{ball} #$^! name.c" "odd{01}.h" "odd$.h" "odd^.h" \
  "odd#.h" "odd!.h"

make/compatible format - notice esca

clang -MM -MF - -MG foo.c 'odd{ball} #$^! name.c'

foo.o: foo.c foo_01.h foo_02.h foo_03.h foo_04.h foo_05.h foo_06.h \
  foo_07.h foo_08.h foo_09.h foo_10.h
odd{ball}\ \#$$^!\ name.o: odd{ball}\ \#$$^!\ name.c odd{01}.h odd$$.h odd^.h \
  odd\#.h odd!.h

@andrewrk
Copy link
Member Author

andrewrk commented Apr 1, 2019

On Windows, the \ is used in file paths, so it can't be relied on for escaping. The make/compatible format is ambiguous on Windows.

Let's have a chat with the clang developers and make sure that this isn't a bug, both the ambiguous case for make/compatible format, and the case where the target name does not respect -MV.

@mikdusan
Copy link
Member

mikdusan commented Apr 1, 2019

Finally, audit the places that call cache_add_dep_file and make sure they don't redundantly add the main source file, since it's contained in the dep file as well.

is this to mean that in the following .d, source.c should not be passed to cache_add_file() because the build system already knows the dep?

zig-cache/tmp/h6XndQRnlXKV-source.o: source.c foo.h bar.h

@andrewrk
Copy link
Member Author

andrewrk commented Apr 1, 2019

Yep I think you have the correct understanding. Stated another way, it means that in this case cache_add_dep_file should be assumed to add source.c, foo.h, and bar.h, and so an additional cache_add_file for source.c would be redundant.

@mikdusan
Copy link
Member

mikdusan commented Apr 1, 2019

On Windows, the \ is used in file paths, so it can't be relied on for escaping. The make/compatible format is ambiguous on Windows.

Let's have a chat with the clang developers and make sure that this isn't a bug, both the ambiguous case for make/compatible format, and the case where the target name does not respect -MV.

hold off a bit. i've got what I think is a really good implementation that i'm fuzz-testing using default make/compatible .d format.

as it relates to make/compatible format emitted by clang, It took me a bit to get these nuggets and here is what my prwip is doing:

\ escape

  • if \ is followed by one of { <space>, # } then it is an escape so we translate:
    • \<space><space>
    • \##
  • otherwise \ is literal and passed through
    • note: this part is difficult, I don't have zig/clang on windows platform to test, but for macos clang is consistently rewriting literal \/ for .d generation. I don't know if this clang-rewrite happens on windows
  • nonetheless, prwip accepts literal \ unless used as an escape for <space> and #

$ escape

  • if $ is followed by exactly one $ then it is an escape and we translate:
    • $$$
  • it should be impossible to ever see a single $ in .d so it is an error if detected

@andrewrk
Copy link
Member Author

andrewrk commented Apr 1, 2019

Without -MV, windows paths don't get escaped.
bad-dep-file
This format is ambiguous.

@mikdusan
Copy link
Member

mikdusan commented Apr 4, 2019

This format is ambiguous.

So that answers a question. clang was re-writing \/ on macOS. I guess it's rewriting is turned off on windows as per test-obj.d file.

But -MV is still ambiguous too because (at least on macOS) it does this:

foo$$.o: "foo$.c" "foo$.h"

target does not obey -MV syntax. it's prereqs do. I understand that it's less likely a target will need quotation.

Right now it sounds the like the best way to go is indeed to switch prereq tokens (RHS of colon) to -MV format for my incoming pr. And leave make/compatible tokenization for target tokens (LHS of colon). Just be aware we still have that ambiguity with target.

@andrewrk
Copy link
Member Author

andrewrk commented Apr 4, 2019

I think it's definitely worth filing a bug report with clang. I can help with that.

@mikdusan
Copy link
Member

mikdusan commented Apr 4, 2019

find and xargs have a need to share a file list and they use -print0 and -0 to pipe nullz filenames. Piping aside (that's not a terrible idea either but feature creep), a new -M0 or -MZ option for clang to generate nullz filenames would suit: 3 targets represented fairly easily:

target,null
prereq,null
null
target,null
null
target,null
prereq,null
prereq,null

@andrewrk
Copy link
Member Author

andrewrk commented Apr 4, 2019

But -MV is still ambiguous too because (at least on macOS) it does this:

foo$$.o: "foo$.c" "foo$.h"

Is that actually ambiguous? Isn't the $ escaped with another $? We need to find out the behavior on Windows.

@andrewrk
Copy link
Member Author

andrewrk commented Apr 4, 2019

Upstream bug report: https://bugs.llvm.org/show_bug.cgi?id=41379

@andrewrk andrewrk added the upstream An issue with a third party project that Zig uses. label Apr 4, 2019
@mikdusan
Copy link
Member

mikdusan commented Apr 4, 2019

But -MV is still ambiguous too because (at least on macOS) it does this:

foo$$.o: "foo$.c" "foo$.h"

Is that actually ambiguous? Isn't the $ escaped with another $? We need to find out the behavior on Windows.

Yes, foo$$.o is not ambiguous, my intent was only to show that simultaneously, an emitted .d file has two kinds of escaping going on.

This is probably where it could be fixed upstream:
https://github.com/llvm/llvm-project/blob/ce2d45e7ba4d86a1f7b694deefddda4b60ff6fd7/clang/lib/Frontend/DependencyFile.cpp#L463-L464

@andrewrk
Copy link
Member Author

andrewrk commented Apr 4, 2019

Goodness. Did you see this comment?

https://github.com/llvm/llvm-project/blob/ce2d45e7ba4d86a1f7b694deefddda4b60ff6fd7/clang/lib/Frontend/DependencyFile.cpp#L352-L399

What a mess.

@mikdusan
Copy link
Member

mikdusan commented Apr 21, 2019

For posterity: Windows clang does the following with target pathnames; note that drive letter : is also never escaped. I'll have to get creative because it clashes with target-end delimiter.

literal pathname .d target number of literal spaces
C:\Users\mike\foo.o C:\Users\mike\foo.o:
C:\Users\mike\foo .o C:\Users\mike\foo\ .o: 1
C:\Users\mike\foo#.o C:\Users\mike\foo\#.o:
C:\Users\mike\foo$.o C:\Users\mike\foo$$.o:
C:\Users\mike\ foo.o C:\Users\mike\\\ foo.o: 1
C:\Users\mike\#foo.o C:\Users\mike\\#foo.o:
C:\Users\mike\$foo.o C:\Users\mike\$$foo.o:
C:\Users\mike\ foo.o C:\Users\mike\\\ \ \ \ \ foo.o: 5

mikdusan added a commit to mikdusan/zig that referenced this issue May 27, 2019
- wip for ziglang#2046
- clang .d output must be created with `clang -MV` switch
- implemented in Zig
- hybridized for zig stage0 and stage1
- zig test src-self-hosted/dep_tokenizer.zig
@andrewrk
Copy link
Member Author

Now that @mikdusan's new parser is merged, Zig looks at the first line of dep file parsing but this issue remains, because clang outputs ambiguous .d files. So this issue is open to follow up with the upstream bug report and get the issue fixed in clang.

One more item to solve, once the first line files is solved, is making sure that zig does not redundantly add the target file to the cache. This would be harmless, but a small waste of time.

@andrewrk andrewrk changed the title dep file parsing doesn't look at the first line, but the first line can have files on it the first line of .d files is ambiguous due to clang's output format May 30, 2019
@andrewrk andrewrk modified the milestones: 0.5.0, 0.6.0 Aug 27, 2019
@andrewrk andrewrk modified the milestones: 0.6.0, 0.7.0 Jan 3, 2020
@andrewrk andrewrk modified the milestones: 0.7.0, 0.8.0 Aug 13, 2020
@andrewrk andrewrk modified the milestones: 0.8.0, 0.9.0 Nov 6, 2020
@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 May 19, 2021
@andrewrk
Copy link
Member Author

Upstream bug report has moved to GitHub: llvm/llvm-project#40724

This does not appear to be a high priority for the LLVM team so will likely require us to send a patch in order to get addressed.

@andrewrk andrewrk added zig cc Zig as a drop-in C compiler feature and removed stage1 The process of building from source via WebAssembly and the C backend. labels Dec 27, 2022
@andrewrk andrewrk modified the milestones: 0.12.0, 1.1.0 Dec 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Observed behavior contradicts documented or intended behavior contributor friendly This issue is limited in scope and/or knowledge of Zig internals. upstream An issue with a third party project that Zig uses. zig cc Zig as a drop-in C compiler feature
Projects
None yet
Development

No branches or pull requests

3 participants