Commit 3ed8330

Kernel args patch to show zero_init buffer (#1809)
Updated the kernel args print to indicate zero_init buffers, which explains the elementwise kernels that run before the fused kernel.

Changes from this
```
Reduction and semaphore buffers:
 Float [16]
 Long [1]
```
To
```
Reduction and semaphore buffers:
 Float [16] is_zero_initialized: 0
 Long [1] is_zero_initialized: 1
```
`is_zero_initialized: 1` on a given buffer means an extra init kernel is needed to fill that buffer.
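The new print layout can be reproduced outside of PyTorch with a minimal sketch. `BufferInfo` and `formatBufferSection` are hypothetical names introduced only for illustration; they stand in for `at::Tensor` metadata and the printing code in `FusionExecutor::runFusion`, and the single leading space per entry follows the string literal as it appears in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the metadata of one buffer tensor.
struct BufferInfo {
  std::string dtype;                // e.g. "Float", "Long"
  std::vector<std::int64_t> sizes;  // e.g. {16}
};

// Formats the buffer section in the layout shown in the commit message.
// zero_init[i] records whether buffers[i] was allocated zero-filled.
std::string formatBufferSection(
    const std::vector<BufferInfo>& buffers,
    const std::vector<bool>& zero_init) {
  // The two containers are parallel and must stay the same length.
  assert(buffers.size() == zero_init.size());

  std::ostringstream os;
  os << "Reduction and semaphore buffers:\n";
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    os << " " << buffers[i].dtype << " [";
    for (std::size_t d = 0; d < buffers[i].sizes.size(); ++d) {
      os << (d == 0 ? "" : ", ") << buffers[i].sizes[d];
    }
    // Copy into a plain bool so the stream prints 0 or 1.
    const bool flag = zero_init[i];
    os << "] is_zero_initialized: " << flag << "\n";
  }
  return os.str();
}
```

Calling `formatBufferSection` with a `Float [16]` entry flagged `false` and a `Long [1]` entry flagged `true` yields the "To" output above.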
1 parent 037a75a commit 3ed8330

2 files changed: 10 additions & 3 deletions
torch/csrc/jit/codegen/cuda/README.md

Lines changed: 1 addition & 1 deletion
@@ -187,7 +187,7 @@ There're a few debug dump that could be turned on via environment variables. Loo
 1. `dump_eff_bandwidth`: print out effective bandwidth of each generated kernel. This naively measure the kernel time divided by I/O buffer size and is a good/simple metric of performance for bandwidth bound kernels
 2. `cuda_kernel`: print out generated cuda kernels
 3. `launch_param`: print out launch config of generated kernels
-4. `print_args`: print out input output tensors of executed codegen kernels
+4. `kernel_args`: print out input/output/buffer tensors of all executed codegen kernels, note that for buffers, we indicate whether they are zero-initialized, which hints on an extra kernel to fill the tensor before codegen kernels.

 ### FAQs

torch/csrc/jit/codegen/cuda/executor.cpp

Lines changed: 9 additions & 2 deletions
@@ -790,13 +790,15 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
               at::TensorOptions()
                   .dtype(executor_entry->buffer_types[i])
                   .device(options_.device)));
+          global_buffers.zero_init.push_back(true);
         } else {
           global_buffers.buffers.push_back(at::native::empty_cuda(
               executor_entry->buffer_sizes[i],
               executor_entry->buffer_types[i],
               c10::nullopt,
               options_.device,
               c10::nullopt));
+          global_buffers.zero_init.push_back(false);
         }
       }
     }
@@ -984,9 +986,14 @@ std::vector<at::Tensor> FusionExecutor::runFusion(
                 << " (strides = " << output.strides() << ")" << std::endl;
     }
     std::cout << "Reduction and semaphore buffers:" << std::endl;
-    for (const auto& buffer : global_buffers.buffers) {
+    TORCH_INTERNAL_ASSERT(
+        global_buffers.buffers.size() == global_buffers.zero_init.size(),
+        "global_buffer buffer & zero_init container should have identical sizes");
+    for (const auto i : c10::irange(global_buffers.buffers.size())) {
+      const auto& buffer = global_buffers.buffers[i];
+      const auto& zero_init = global_buffers.zero_init[i];
       std::cout << " " << buffer.scalar_type() << " " << buffer.sizes()
-                << std::endl;
+                << " is_zero_initialized: " << zero_init << std::endl;
     }
   }
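The bookkeeping in the first `executor.cpp` hunk, where each allocation branch records a matching flag, can be sketched without libtorch. This is a simplified host-side analogy, not the executor's actual code: `GlobalBuffers` here is a hypothetical struct holding plain `float` arrays, value-initialization stands in for `at::zeros`, and default-initialization stands in for `at::native::empty_cuda`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the executor's buffer bookkeeping:
// two parallel containers with one entry per buffer.
struct GlobalBuffers {
  std::vector<std::unique_ptr<float[]>> buffers;
  std::vector<bool> zero_init;
};

// Allocates one buffer per entry of `sizes`. needs_zero[i] selects a
// zero-filled allocation (which costs an extra fill pass) or an
// uninitialized one, and the choice is recorded in zero_init.
GlobalBuffers allocateBuffers(
    const std::vector<std::size_t>& sizes,
    const std::vector<bool>& needs_zero) {
  assert(sizes.size() == needs_zero.size());

  GlobalBuffers out;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (needs_zero[i]) {
      // make_unique value-initializes the array: every element is 0.0f.
      out.buffers.push_back(std::make_unique<float[]>(sizes[i]));
      out.zero_init.push_back(true);
    } else {
      // Default-initialized array: contents are indeterminate.
      out.buffers.push_back(std::unique_ptr<float[]>(new float[sizes[i]]));
      out.zero_init.push_back(false);
    }
  }
  return out;
}
```

Pushing the flag in the same branch as the buffer keeps the two containers the same length, which is the invariant the `TORCH_INTERNAL_ASSERT` in the second hunk checks before printing.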
