Abstract
This chapter covers two difficult problems frequently encountered by graphics processing unit (GPU) developers: optimizing memory access for kernels with complex, input-dependent access patterns, and mapping computations to a GPU or a CPU in composite applications with multiple dependent kernels. Both pose a formidable challenge, as they require dynamic adaptation and tuning of execution policies to achieve high performance across a wide range of inputs; failing to meet these requirements leads to a substantial performance penalty. The chapter describes a methodology for solving the memory optimization problem via software-managed caching that efficiently exploits the fast scratchpad memory. This technique outperforms the cache-less and texture memory-based approaches on pre-Fermi GPU architectures, as well as the approach that relies on the Fermi hardware cache alone. The chapter then presents an algorithm for minimizing the total running time of a complete application comprising multiple interdependent kernels. Either a GPU or a CPU can execute each kernel, but their performance varies greatly across inputs, calling for dynamic assignment of the computations to a GPU or a CPU at runtime. The communication overhead caused by the data dependencies between kernels makes greedy, per-kernel selection of the best-performing device suboptimal. The algorithm therefore optimizes the runtime of the complete application by evaluating all assignments jointly, including the overhead of data transfers between the devices.
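The joint-assignment idea can be illustrated with a small sketch. The following is not the chapter's actual algorithm; it is a minimal dynamic program, under assumed per-kernel timings and transfer costs, for a chain of dependent kernels where each kernel runs on either the CPU (device 0) or the GPU (device 1) and a transfer cost is paid whenever consecutive kernels run on different devices. The function name `assign_devices` and all cost figures are hypothetical.

```python
def assign_devices(kernel_times, transfer_cost):
    """Jointly choose a device for each kernel in a dependency chain.

    kernel_times: list of (cpu_time, gpu_time) pairs, one per kernel.
    transfer_cost: transfer_cost[k] is the cost of moving kernel k's
                   input between devices (transfer_cost[0] is unused).
    Returns (total_time, assignment), where assignment[k] is 0 (CPU)
    or 1 (GPU), minimizing the end-to-end running time.
    """
    n = len(kernel_times)
    # best[d] = (cost so far, assignment so far) with kernel k on device d
    best = {0: (kernel_times[0][0], [0]),
            1: (kernel_times[0][1], [1])}
    for k in range(1, n):
        new = {}
        for d in (0, 1):
            # Cheapest way to arrive at device d, paying a transfer
            # only if the previous kernel ran on the other device.
            cost, prev = min(
                (best[p][0] + (transfer_cost[k] if p != d else 0.0), p)
                for p in (0, 1)
            )
            new[d] = (cost + kernel_times[k][d], best[prev][1] + [d])
        best = new
    return min(best.values())
```

With illustrative timings `[(2.0, 1.0), (1.0, 2.0)]` and a transfer cost of 5.0 between the two kernels, greedy per-kernel selection picks the GPU then the CPU and pays the transfer (1.0 + 5.0 + 1.0 = 7.0), while the joint optimization keeps both kernels on one device for a total of 3.0, which is why the chapter evaluates all assignments together rather than greedily.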
| Original language | English |
| --- | --- |
| Title of host publication | GPU Computing Gems Jade Edition |
| Pages | 501-517 |
| Number of pages | 17 |
| DOIs | |
| State | Published - 2012 |
All Science Journal Classification (ASJC) codes
- General Computer Science