In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and NVIDIA K80 and P100 GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.
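For context, a minimal sketch of the MTTKRP kernel the abstract refers to, written against the public Kokkos API: for a three-way sparse tensor in coordinate form, each nonzero x(i,j,k) contributes x(i,j,k) times the elementwise product of row j of B and row k of C to row i of the mode-1 result, and because many nonzeros share the same result row, the accumulation uses atomic adds (the contention that the permuted traversal described above is designed to reduce). The view names and COO layout below are illustrative assumptions rather than the paper's actual data structures, the sketch assumes Kokkos has already been initialized, and it uses only flat parallelism over nonzeros instead of the multilevel team/vector mapping the paper develops.

#include <Kokkos_Core.hpp>

// Hypothetical mode-1 MTTKRP for a 3-way sparse tensor in coordinate (COO)
// form: for each nonzero x(i,j,k), add x * (B(j,:) .* C(k,:)) to M(i,:).
// View names and layout are illustrative, not the paper's data structures.
void mttkrp_mode1(const Kokkos::View<const double*>& vals,  // nonzero values
                  const Kokkos::View<const int*>& i_idx,    // mode-1 indices
                  const Kokkos::View<const int*>& j_idx,    // mode-2 indices
                  const Kokkos::View<const int*>& k_idx,    // mode-3 indices
                  const Kokkos::View<const double**>& B,    // J x R factor matrix
                  const Kokkos::View<const double**>& C,    // K x R factor matrix
                  const Kokkos::View<double**>& M)          // I x R result, pre-zeroed
{
  const int R = static_cast<int>(M.extent(1));
  Kokkos::parallel_for("mttkrp_mode1", vals.extent(0), KOKKOS_LAMBDA(const int n) {
    const int i = i_idx(n), j = j_idx(n), k = k_idx(n);
    const double x = vals(n);
    for (int r = 0; r < R; ++r)
      // Different nonzeros may update the same row of M concurrently,
      // so the accumulation must be atomic.
      Kokkos::atomic_add(&M(i, r), x * B(j, r) * C(k, r));
  });
}

The paper's implementation additionally exposes parallelism over the R factor-matrix columns via Kokkos team and vector policies, uses the compile-time polymorphic arrays mentioned above for thread-local workspace, and permutes the order in which nonzeros are visited; all of these refinements are omitted from this sketch.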
tensor decomposition, canonical polyadic (CP), MTTKRP, Kokkos, manycore, GPU
@article{PhKo19,
author = {Eric Phipps and Tamara G. Kolda},
title = {Software for Sparse Tensor Decomposition on Emerging Computing Architectures},
journal = {SIAM Journal on Scientific Computing},
volume = {41},
number = {3},
pages = {C269--C290},
pagetotal = {22},
month = {June},
year = {2019},
doi = {10.1137/18M1210691},
eprint = {1809.09175},
}