
Commit e8a9d08

malfet authored and pytorchmergebot committed
[DevX] Add tool and doc on partial debug builds (pytorch#116521)
Turned the command sequence mentioned in https://dev-discuss.pytorch.org/t/how-to-get-a-fast-debug-build/1597 and in various discussions into a tool that I use almost daily to debug crashes or correctness issues in the codebase. Essentially, it allows one to turn this:

```
Process 87729 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00000001023d55a8 libtorch_python.dylib`at::indexing::impl::applySelect(at::Tensor const&, long long, c10::SymInt, long long, c10::Device const&, std::__1::optional<c10::ArrayRef<c10::SymInt>> const&)
libtorch_python.dylib`at::indexing::impl::applySelect:
-> 0x1023d55a8 <+0>: sub sp, sp, #0xd0
0x1023d55ac <+4>: stp x24, x23, [sp, #0x90]
0x1023d55b0 <+8>: stp x22, x21, [sp, #0xa0]
0x1023d55b4 <+12>: stp x20, x19, [sp, #0xb0]
```

into this:

```
Process 87741 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
236 const at::Device& /*self_device*/,
237 const c10::optional<SymIntArrayRef>& self_sizes) {
238 // See NOTE [nested tensor size for indexing]
-> 239 if (self_sizes.has_value()) {
240 auto maybe_index = index.maybe_as_int();
241 if (maybe_index.has_value()) {
242 TORCH_CHECK_INDEX(
```

while retaining good performance for the rest of the codebase.

Pull Request resolved: pytorch#116521
Approved by: https://github.com/atalman
1 parent df85a92 commit e8a9d08
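The command sequence referred to in the message is spelled out in the header comment of `tools/build_with_debinfo.py`: a dry run of `ninja -j1 -v -n torch_python`, piped through `sed` to strip ninja's `[n/m]` progress prefix and swap `-O2`/`-O3` for `-g`, then through `sh`. A minimal Python sketch of just the `sed` step might look like this. The helper name `to_debinfo_command` and the sample compiler line are hypothetical, and unlike the `sed` expression this variant only rewrites `-O2`/`-O3` when they appear as whole command-line tokens.

```python
import re

# ninja's "[3/17] " progress prefix, the part stripped by
# sed -e 's#\[[0-9]\+\/[0-9]\+\] \+##'
PROGRESS_PREFIX = re.compile(r"^\[\d+/\d+\] +")
# -O2 or -O3 as a standalone token; sed's 's/-O[23]/-g/g' would also
# rewrite them inside longer tokens
OPT_FLAG = re.compile(r"(?<!\S)-O[23](?!\S)")


def to_debinfo_command(line: str) -> str:
    """Turn one line of `ninja -v -n` output into a shell command built with -g."""
    return OPT_FLAG.sub("-g", PROGRESS_PREFIX.sub("", line))


if __name__ == "__main__":
    # Hypothetical compile line, shaped like ninja's verbose dry-run output
    sample = "[1/2] /usr/bin/c++ -DNDEBUG -O3 -fPIC -c foo.cpp -o foo.cpp.o"
    print(to_debinfo_command(sample))
    # /usr/bin/c++ -DNDEBUG -g -fPIC -c foo.cpp -o foo.cpp.o
```

Replacing the optimization flag with `-g`, rather than adding `-g` next to it, means the affected file is typically compiled unoptimized, which keeps arguments and locals inspectable in the debugger while the rest of `libtorch_python` stays optimized.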

2 files changed: +176 -0 lines changed

CONTRIBUTING.md

Lines changed: 61 additions & 0 deletions
````diff
@@ -41,6 +41,7 @@ aspects of contributing to PyTorch.
 - [Use a faster linker](#use-a-faster-linker)
 - [Use pre-compiled headers](#use-pre-compiled-headers)
 - [Workaround for header dependency bug in nvcc](#workaround-for-header-dependency-bug-in-nvcc)
+- [Rebuild a few files with debug information](#rebuild-a-few-files-with-debug-information)
 - [C++ frontend development tips](#c-frontend-development-tips)
 - [GDB integration](#gdb-integration)
 - [C++ stacktraces](#c-stacktraces)
@@ -811,6 +812,66 @@ export CMAKE_CUDA_COMPILER_LAUNCHER="python;`pwd`/tools/nvcc_fix_deps.py;ccache"
 python setup.py develop
 ```
 
+### Rebuild a few files with debug information
+
+While debugging a problem, one often has to maintain a debug build in a separate folder.
+But often only a few files need to be rebuilt with debug info to get a symbolicated backtrace or enable source debugging.
+One can easily solve this with the help of `tools/build_with_debinfo.py`.
+
+For example, suppose one wants to debug what is going on while a tensor index is selected, which can be achieved by setting a breakpoint at the `applySelect` function:
+```
+% lldb -o "b applySelect" -o "process launch" -- python3 -c "import torch;print(torch.rand(5)[3])"
+(lldb) target create "python"
+Current executable set to '/usr/bin/python3' (arm64).
+(lldb) settings set -- target.run-args "-c" "import torch;print(torch.rand(5)[3])"
+(lldb) b applySelect
+Breakpoint 1: no locations (pending).
+WARNING: Unable to resolve breakpoint to any actual locations.
+(lldb) process launch
+2 locations added to breakpoint 1
+Process 87729 stopped
+* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
+frame #0: 0x00000001023d55a8 libtorch_python.dylib`at::indexing::impl::applySelect(at::Tensor const&, long long, c10::SymInt, long long, c10::Device const&, std::__1::optional<c10::ArrayRef<c10::SymInt>> const&)
+libtorch_python.dylib`at::indexing::impl::applySelect:
+-> 0x1023d55a8 <+0>: sub sp, sp, #0xd0
+0x1023d55ac <+4>: stp x24, x23, [sp, #0x90]
+0x1023d55b0 <+8>: stp x22, x21, [sp, #0xa0]
+0x1023d55b4 <+12>: stp x20, x19, [sp, #0xb0]
+Target 0: (python) stopped.
+Process 87729 launched: '/usr/bin/python' (arm64)
+```
+This is not very informative, but it can easily be remedied by rebuilding `python_variable_indexing.cpp` with debug information:
+```
+% ./tools/build_with_debinfo.py torch/csrc/autograd/python_variable_indexing.cpp
+[1 / 2] Building caffe2/torch/CMakeFiles/torch_python.dir/csrc/autograd/python_variable_indexing.cpp.o
+[2 / 2] Building lib/libtorch_python.dylib
+```
+Afterwards:
+```
+% lldb -o "b applySelect" -o "process launch" -- python3 -c "import torch;print(torch.rand(5)[3])"
+(lldb) target create "python"
+Current executable set to '/usr/bin/python3' (arm64).
+(lldb) settings set -- target.run-args "-c" "import torch;print(torch.rand(5)[3])"
+(lldb) b applySelect
+Breakpoint 1: no locations (pending).
+WARNING: Unable to resolve breakpoint to any actual locations.
+(lldb) process launch
+2 locations added to breakpoint 1
+Process 87741 stopped
+* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
+frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
+236 const at::Device& /*self_device*/,
+237 const c10::optional<SymIntArrayRef>& self_sizes) {
+238 // See NOTE [nested tensor size for indexing]
+-> 239 if (self_sizes.has_value()) {
+240 auto maybe_index = index.maybe_as_int();
+241 if (maybe_index.has_value()) {
+242 TORCH_CHECK_INDEX(
+Target 0: (python) stopped.
+Process 87741 launched: '/usr/bin/python3' (arm64)
+```
+This is much more useful, isn't it?
+
 ### C++ frontend development tips
 
 We have very extensive tests in the [test/cpp/api](test/cpp/api) folder. The
````
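The lldb session in the walkthrough can also be scripted, for instance to compare the output before and after running the tool. The sketch below only assembles the argument list. It assumes lldb's `--batch` flag (run the `-o` commands, then quit) and adds a `bt` command that the original session does not use; `lldb_batch_cmd` is a hypothetical helper, not part of the PyTorch tooling.

```python
import shlex
from typing import List


def lldb_batch_cmd(symbol: str, snippet: str, python: str = "python3") -> List[str]:
    """Build an lldb command line that breaks on `symbol`, prints a backtrace and quits."""
    return [
        "lldb",
        "--batch",  # run the -o commands below, then exit
        "-o", f"b {symbol}",  # stays pending until libtorch_python is loaded
        "-o", "process launch",
        "-o", "bt",  # backtrace of the thread that hit the breakpoint
        "--", python, "-c", snippet,
    ]


if __name__ == "__main__":
    cmd = lldb_batch_cmd("applySelect", "import torch;print(torch.rand(5)[3])")
    # Shell-quoted form of the command, suitable for pasting into a terminal
    print(shlex.join(cmd))
```

Passing the same list to `subprocess.run` would start the session non-interactively; with the breakpoint on `applySelect`, the output should resemble the transcripts in the diff.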

tools/build_with_debinfo.py

Lines changed: 115 additions & 0 deletions
```diff
@@ -0,0 +1,115 @@
+#!/usr/bin/env python3
+# Tool to quickly rebuild one or two files with debug info
+# Mimics the following behavior:
+# - touch file
+# - ninja -j1 -v -n torch_python | sed -e 's/-O[23]/-g/g' -e 's#\[[0-9]\+\/[0-9]\+\] \+##' |sh
+# - Symlink libs from build/lib into the torch/lib folder
+
+import subprocess
+import sys
+from pathlib import Path
+from typing import Any, List, Optional, Tuple
+
+PYTORCH_ROOTDIR = Path(__file__).resolve().parent.parent
+TORCH_DIR = PYTORCH_ROOTDIR / "torch"
+TORCH_LIB_DIR = TORCH_DIR / "lib"
+BUILD_DIR = PYTORCH_ROOTDIR / "build"
+BUILD_LIB_DIR = BUILD_DIR / "lib"
+
+
+def check_output(args: List[str], cwd: Optional[str] = None) -> str:
+    return subprocess.check_output(args, cwd=cwd).decode("utf-8")
+
+
+def parse_args() -> Any:
+    from argparse import ArgumentParser
+
+    parser = ArgumentParser(description="Incremental build PyTorch with debinfo")
+    parser.add_argument("--verbose", action="store_true")
+    parser.add_argument("files", nargs="*")
+    return parser.parse_args()
+
+
+def get_lib_extension() -> str:
+    if sys.platform == "linux":
+        return "so"
+    if sys.platform == "darwin":
+        return "dylib"
+    raise RuntimeError(f"Unsupported platform {sys.platform}")
+
+
+def create_symlinks() -> None:
+    """Creates symlinks from build/lib to torch/lib"""
+    if not TORCH_LIB_DIR.exists():
+        raise RuntimeError(f"Can't create symlinks as {TORCH_LIB_DIR} does not exist")
+    if not BUILD_LIB_DIR.exists():
+        raise RuntimeError(f"Can't create symlinks as {BUILD_LIB_DIR} does not exist")
+    for torch_lib in TORCH_LIB_DIR.glob(f"*.{get_lib_extension()}"):
+        if torch_lib.is_symlink():
+            continue
+        build_lib = BUILD_LIB_DIR / torch_lib.name
+        if not build_lib.exists():
+            raise RuntimeError(f"Can't find {build_lib} corresponding to {torch_lib}")
+        torch_lib.unlink()
+        torch_lib.symlink_to(build_lib)
+
+
+def has_build_ninja() -> bool:
+    return (BUILD_DIR / "build.ninja").exists()
+
+
+def is_devel_setup() -> bool:
+    output = check_output([sys.executable, "-c", "import torch;print(torch.__file__)"])
+    return output.strip() == str(TORCH_DIR / "__init__.py")
+
+
+def create_build_plan() -> List[Tuple[str, str]]:
+    output = check_output(
+        ["ninja", "-j1", "-v", "-n", "torch_python"], cwd=str(BUILD_DIR)
+    )
+    rc = []
+    for line in output.split("\n"):
+        if not line.startswith("["):
+            continue
+        line = line.split("]", 1)[1].strip()
+        if line.startswith(": &&") and line.endswith("&& :"):
+            line = line[4:-4]
+        line = line.replace("-O2", "-g").replace("-O3", "-g")
+        name = line.split("-o ", 1)[1].split(" ")[0]
+        rc.append((name, line))
+    return rc
+
+
+def main() -> None:
+    if sys.platform == "win32":
+        print("Not supported on Windows yet")
+        sys.exit(-95)
+    if not is_devel_setup():
+        print(
+            "Not a devel setup of PyTorch, please run `python3 setup.py develop --user` first"
+        )
+        sys.exit(-1)
+    if not has_build_ninja():
+        print("Only ninja build system is supported at the moment")
+        sys.exit(-1)
+    args = parse_args()
+    for file in args.files:
+        if file is None:
+            continue
+        Path(file).touch()
+    build_plan = create_build_plan()
+    if len(build_plan) == 0:
+        return print("Nothing to do")
+    if len(build_plan) > 100:
+        print("More than 100 items need to be rebuilt, run `ninja torch_python` first")
+        sys.exit(-1)
+    for idx, (name, cmd) in enumerate(build_plan):
+        print(f"[{idx + 1} / {len(build_plan)}] Building {name}")
+        if args.verbose:
+            print(cmd)
+        subprocess.check_call(["sh", "-c", cmd], cwd=BUILD_DIR)
+    create_symlinks()
+
+
+if __name__ == "__main__":
+    main()
```
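The last step of the tool, `create_symlinks()`, is what lets a develop install pick up the rebuilt library: regular files in `torch/lib` are replaced by symlinks into `build/lib`, so later rebuilds only need to update `build/lib`. The sketch below restates that logic with the directories and extension passed in as parameters, so it can be tried in a scratch directory instead of a PyTorch checkout. The function name `link_libs` and the stand-in library file are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def link_libs(torch_lib_dir: Path, build_lib_dir: Path, ext: str) -> List[str]:
    """Replace regular *.{ext} files in torch_lib_dir with symlinks into build_lib_dir.

    Returns the names of the files that were converted to symlinks.
    """
    linked = []
    for torch_lib in sorted(torch_lib_dir.glob(f"*.{ext}")):
        if torch_lib.is_symlink():
            continue  # already points into build_lib_dir from an earlier run
        build_lib = build_lib_dir / torch_lib.name
        if not build_lib.exists():
            raise RuntimeError(f"Can't find {build_lib} corresponding to {torch_lib}")
        torch_lib.unlink()
        torch_lib.symlink_to(build_lib)
        linked.append(torch_lib.name)
    return linked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        torch_lib_dir = root / "torch" / "lib"
        build_lib_dir = root / "build" / "lib"
        for d in (torch_lib_dir, build_lib_dir):
            d.mkdir(parents=True)
            # Stand-in "library" whose contents say which directory it lives in
            (d / "libtorch_python.so").write_text(f"copy in {d.parent.name}/lib")
        print(link_libs(torch_lib_dir, build_lib_dir, "so"))  # ['libtorch_python.so']
        # Reading through the symlink now yields the build/lib copy
        print((torch_lib_dir / "libtorch_python.so").read_text())  # copy in build/lib
```

Symlinking rather than copying means each later run only has to rebuild into `build/lib`, and the `is_symlink()` check makes the step safe to repeat.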
