author    Deven Desai <deven.desai.amd@gmail.com> 2020-08-05 01:46:34 +0000
committer Deven Desai <deven.desai.amd@gmail.com> 2020-08-05 01:46:34 +0000
commit    46f8a18567731925e06a7389a6c611e1dc420ea8 (patch)
tree      fd080850d5f3870c1e1bca80d62463fad76a5c18 /unsupported/Eigen/CXX11/src/Tensor/TensorScan.h
parent    21122498ecfaa394aeef9d6ca8d8659550be97fa (diff)
Adding an explicit launch_bounds(1024) attribute for GPU kernels.
Starting with ROCm 3.5, the HIP compiler changes from HCC to hip-clang. This compiler change introduces a change in the default value of the `__launch_bounds__` attribute associated with a GPU kernel (the default value is the value the compiler assumes for the `__launch_bounds__` attribute when it is not explicitly specified by the user). Currently (i.e. for HIP with ROCm 3.3 and older), the default value is 1024. That changes to 256 with ROCm 3.5 (i.e. the hip-clang compiler).

As a consequence of this change, if a GPU kernel with a `__launch_bounds__` attribute of 256 is launched at runtime with a threads_per_block value > 256, it leads to a runtime error. This is causing a couple of Eigen unit test failures with ROCm 3.5.

This commit adds an explicit `__launch_bounds__(1024)` attribute to every GPU kernel that currently does not have it explicitly specified (and hence would end up getting the default value of 256 with the change to hip-clang).
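For illustration, a minimal sketch of the attribute placement; the kernel below is hypothetical, not one of the kernels touched by this commit:

    template <typename T>
    __global__ __launch_bounds__(1024) void scale_kernel(T* data, T factor, int n) {
      // One thread per element; the guard keeps surplus threads in the last
      // block from writing out of bounds.
      int i = threadIdx.x + blockIdx.x * blockDim.x;
      if (i < n) data[i] *= factor;
    }

With hip-clang, omitting the attribute is equivalent to writing `__launch_bounds__(256)`, so any launch of such a kernel with more than 256 threads per block fails at runtime.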
Diffstat (limited to 'unsupported/Eigen/CXX11/src/Tensor/TensorScan.h')
-rw-r--r--  unsupported/Eigen/CXX11/src/Tensor/TensorScan.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/unsupported/Eigen/CXX11/src/Tensor/TensorScan.h b/unsupported/Eigen/CXX11/src/Tensor/TensorScan.h
index bef8d261f..9e3b1a0b9 100644
--- a/unsupported/Eigen/CXX11/src/Tensor/TensorScan.h
+++ b/unsupported/Eigen/CXX11/src/Tensor/TensorScan.h
@@ -334,7 +334,7 @@ struct ScanLauncher<Self, Reducer, ThreadPoolDevice, Vectorize> {
// parallel, but it would be better to use a parallel scan algorithm and
// optimize memory access.
template <typename Self, typename Reducer>
-__global__ void ScanKernel(Self self, Index total_size, typename Self::CoeffReturnType* data) {
+__global__ __launch_bounds__(1024) void ScanKernel(Self self, Index total_size, typename Self::CoeffReturnType* data) {
// Compute offset as in the CPU version
Index val = threadIdx.x + blockIdx.x * blockDim.x;
Index offset = (val / self.stride()) * self.stride() * self.size() + val % self.stride();
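For reference, a sketch of the launch side, assuming plain CUDA/HIP launch syntax; the block/grid variable names are hypothetical and Eigen's actual GPU launch path may differ:

    // A block size of 1024 is legal only because the kernel now carries an
    // explicit __launch_bounds__(1024); under hip-clang's implicit default
    // of 256, this launch would fail at runtime.
    const Index block_size = 1024;
    const Index num_blocks = (total_size + block_size - 1) / block_size;
    ScanKernel<Self, Reducer><<<num_blocks, block_size>>>(self, total_size, data);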