[Debugify] Improve reduction of debugify coverage build output #150212

SLTozer · 2025-07-23T11:45:07Z

In current DebugLoc coverage builds, the output for any reasonably large build can become very large if any missing DebugLocs are present; this happens because single errors in LLVM may result in many errors being reported in the output report. The main cause of this is that the empty locations attached to instructions may be propagated to other instructions in later passes, which will each be reported as new errors. This patch prevents this by adding an "unknown" annotation to instructions after reporting them once, ensuring that any other DebugLocs copied or derived from the original empty location will not be marked as new errors.

As a separate but related change, this patch updates the report generation script to deduplicate results using the recorded stacktrace if they are available, instead of the pass+instruction combination. This reduces the size of the reduction, but makes the reduction highly reliable, as the stacktrace allows us to very precisely identify when two bugs have originated from the same place.

llvmbot · 2025-07-23T11:45:38Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-debuginfo

Author: Stephen Tozer (SLTozer)

Changes

In current DebugLoc coverage builds, the output for any reasonably large build can become very large if any missing DebugLocs are present; this happens because single errors in LLVM may result in many errors being reported in the output report. The main cause of this is that the empty locations attached to instructions may be propagated to other instructions in later passes, which will each be reported as new errors. This patch prevents this by adding an "unknown" annotation to instructions after reporting them once, ensuring that any other DebugLocs copied or derived from the original empty location will not be marked as new errors.

As a separate but related change, this patch updates the report generation script to deduplicate results using the recorded stacktrace if they are available, instead of the pass+instruction combination. This reduces the size of the reduction, but makes the reduction highly reliable, as the stacktrace allows us to very precisely identify when two bugs have originated from the same place.

Full diff: https://github.com/llvm/llvm-project/pull/150212.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Utils/Debugify.cpp (+9)
(modified) llvm/utils/llvm-original-di-preservation.py (+33-17)

diff --git a/llvm/lib/Transforms/Utils/Debugify.cpp b/llvm/lib/Transforms/Utils/Debugify.cpp
index 4210ce6da1eb2..39ff6e94510fb 100644
--- a/llvm/lib/Transforms/Utils/Debugify.cpp
+++ b/llvm/lib/Transforms/Utils/Debugify.cpp
@@ -707,6 +707,15 @@ bool llvm::checkDebugInfoMetadata(Module &M,
       DILocsBefore, DILocsAfter, InstToDelete, NameOfWrappedPass,
       FileNameFromCU, ShouldWriteIntoJSON, Bugs);
 
+#if LLVM_ENABLE_DEBUGLOC_TRACKING_COVERAGE
+  // If we are tracking DebugLoc coverage, replace each empty DebugLoc with an
+  // annotated location now so that it does not show up in future passes even if
+  // it is propagated to other instructions.
+  for (const auto &L : DILocsAfter) {
+    if (!L.second)
+      L.first->setDebugLoc(DebugLoc::getUnknown());
+#endif
+
   bool ResultForVars = checkVars(DIVarsBefore, DIVarsAfter, NameOfWrappedPass,
                                  FileNameFromCU, ShouldWriteIntoJSON, Bugs);
 
diff --git a/llvm/utils/llvm-original-di-preservation.py b/llvm/utils/llvm-original-di-preservation.py
index b5ccd7a3224f8..cdca2ec8153b8 100755
--- a/llvm/utils/llvm-original-di-preservation.py
+++ b/llvm/utils/llvm-original-di-preservation.py
@@ -6,6 +6,7 @@
 from __future__ import print_function
 import argparse
 import os
+import re
 import sys
 from json import loads
 from collections import defaultdict
@@ -22,6 +23,14 @@ def __init__(self, origin, action, bb_name, fn_name, instr):
     def key(self):
         return self.action + self.bb_name + self.fn_name + self.instr
 
+    def reduced_key(self, bug_pass):
+        if self.origin is not None:
+            # If we have the origin stacktrace available, we can use it to efficiently deduplicate identical errors. We
+            # just need to remove the pointer values from the string first, so that we can deduplicate across files.
+            origin_no_addr = re.sub(r"0x[0-9a-fA-F]+", "", self.origin)
+            return origin_no_addr
+        return bug_pass + self.instr
+
     def to_dict(self):
         result = {
             "instr": self.instr,
@@ -42,6 +51,9 @@ def __init__(self, action, fn_name):
     def key(self):
         return self.action + self.fn_name
 
+    def reduced_key(self, bug_pass):
+        return bug_pass + self.fn_name
+
     def to_dict(self):
         return {
             "fn_name": self.fn_name,
@@ -58,6 +70,9 @@ def __init__(self, action, name, fn_name):
     def key(self):
         return self.action + self.name + self.fn_name
 
+    def reduced_key(self, bug_pass):
+        return bug_pass + self.name
+
     def to_dict(self):
         return {
             "fn_name": self.fn_name,
@@ -478,7 +493,11 @@ def get_json_chunk(file, start, size):
 # Parse the program arguments.
 def parse_program_args(parser):
     parser.add_argument("file_name", type=str, help="json file to process")
-    parser.add_argument("--reduce", action="store_true", help="create reduced report")
+    parser.add_argument(
+        "--reduce",
+        action="store_true",
+        help="create reduced report by deduplicating bugs within and across files",
+    )
 
     report_type_group = parser.add_mutually_exclusive_group(required=True)
     report_type_group.add_argument(
@@ -523,13 +542,10 @@ def Main():
     di_sp_bugs_summary = OrderedDict()
     di_var_bugs_summary = OrderedDict()
 
-    # Compress similar bugs.
-    # DILocBugs with same pass & instruction name.
-    di_loc_pass_instr_set = set()
-    # DISPBugs with same pass & function name.
-    di_sp_pass_fn_set = set()
-    # DIVarBugs with same pass & variable name.
-    di_var_pass_var_set = set()
+    # If we are using --reduce, use these sets to deduplicate similar bugs within and across files.
+    di_loc_reduced_set = set()
+    di_sp_reduced_set = set()
+    di_var_reduced_set = set()
 
     start_line = 0
     chunk_size = 1000000
@@ -585,9 +601,9 @@ def Main():
                     if not di_loc_bug.key() in di_loc_set:
                         di_loc_set.add(di_loc_bug.key())
                         if opts.reduce:
-                            pass_instr = bugs_pass + instr
-                            if not pass_instr in di_loc_pass_instr_set:
-                                di_loc_pass_instr_set.add(pass_instr)
+                            reduced_key = di_loc_bug.reduced_key(bugs_pass)
+                            if not reduced_key in di_loc_reduced_set:
+                                di_loc_reduced_set.add(reduced_key)
                                 di_loc_bugs.append(di_loc_bug)
                         else:
                             di_loc_bugs.append(di_loc_bug)
@@ -608,9 +624,9 @@ def Main():
                     if not di_sp_bug.key() in di_sp_set:
                         di_sp_set.add(di_sp_bug.key())
                         if opts.reduce:
-                            pass_fn = bugs_pass + name
-                            if not pass_fn in di_sp_pass_fn_set:
-                                di_sp_pass_fn_set.add(pass_fn)
+                            reduced_key = di_sp_bug.reduced_key(bugs_pass)
+                            if not reduced_key in di_sp_reduced_set:
+                                di_sp_reduced_set.add(reduced_key)
                                 di_sp_bugs.append(di_sp_bug)
                         else:
                             di_sp_bugs.append(di_sp_bug)
@@ -632,9 +648,9 @@ def Main():
                     if not di_var_bug.key() in di_var_set:
                         di_var_set.add(di_var_bug.key())
                         if opts.reduce:
-                            pass_var = bugs_pass + name
-                            if not pass_var in di_var_pass_var_set:
-                                di_var_pass_var_set.add(pass_var)
+                            reduced_key = di_var_bug.reduced_key(bugs_pass)
+                            if not reduced_key in di_var_reduced_set:
+                                di_var_reduced_set.add(reduced_key)
                                 di_var_bugs.append(di_var_bug)
                         else:
                             di_var_bugs.append(di_var_bug)

github-actions · 2025-07-23T11:48:33Z

✅ With the latest revision this PR passed the C/C++ code formatter.

SLTozer · 2025-08-13T17:20:55Z

Added a test for the new behaviour, where the identical* stacktraces are de-duplicated.

*Except for the instruction addresses.

djtodoro

LGTM

In current DebugLoc coverage builds, the output for any reasonably large build can become very large if any missing DebugLocs are present; this happens because single errors in LLVM may result in many errors being reported in the output report. The main cause of this is that the empty locations attached to instructions may be propagated to other instructions in later passes, which will each be reported as new errors. This patch prevents this by adding an "unknown" annotation to instructions after reporting them once, ensuring that any other DebugLocs copied or derived from the original empty location will not be marked as new errors. As a separate but related change, this patch updates the report generation script to deduplicate results using the recorded stacktrace if they are available, instead of the pass+instruction combination. This reduces the size of the reduction, but makes the reduction highly reliable, as the stacktrace allows us to very precisely identify when two bugs have originated from the same place.

Forward-fixes a compile error in bc216b0 (#150212) in specific build configurations, due to a missing const_cast.

SLTozer requested review from tru, jmorse, OCHyams, CarlosAlbertoEnciso and djtodoro July 23, 2025 11:45

SLTozer self-assigned this Jul 23, 2025

llvmbot added debuginfo llvm:transforms labels Jul 23, 2025

djtodoro approved these changes Aug 14, 2025

View reviewed changes

SLTozer added 4 commits August 15, 2025 12:09

Remove opening brace

691a6d2

Remove const

426675d

Add test for reducing based on 'origin'

c6da248

SLTozer force-pushed the dl-coverage-tracking-reduce-results branch from 1bd82a6 to c6da248 Compare August 15, 2025 11:09

SLTozer merged commit bc216b0 into llvm:main Aug 15, 2025
9 checks passed

SLTozer added a commit that referenced this pull request Aug 19, 2025

[Debugify] Fix compile error in tracking coverage build

5cedb01

Forward-fixes a compile error in bc216b0 (#150212) in specific build configurations, due to a missing const_cast.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Debugify] Improve reduction of debugify coverage build output #150212

[Debugify] Improve reduction of debugify coverage build output #150212

Uh oh!

SLTozer commented Jul 23, 2025

Uh oh!

llvmbot commented Jul 23, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Jul 23, 2025 •

edited

Loading

Uh oh!

SLTozer commented Aug 13, 2025

Uh oh!

djtodoro left a comment

Uh oh!

Uh oh!

Uh oh!

[Debugify] Improve reduction of debugify coverage build output #150212

[Debugify] Improve reduction of debugify coverage build output #150212

Uh oh!

Conversation

SLTozer commented Jul 23, 2025

Uh oh!

llvmbot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SLTozer commented Aug 13, 2025

Uh oh!

djtodoro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

llvmbot commented Jul 23, 2025 •

edited

Loading

github-actions bot commented Jul 23, 2025 •

edited

Loading