In:
Concurrency and Computation: Practice and Experience, Wiley, Vol. 23, No. 7 ( 2011-05), p. 681-693
Abstract:
State‐of‐the‐art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA (Compute Unified Device Architecture) increases their usability as high‐performance co‐processors for general‐purpose computing. Sorting is well investigated in Computer Science in general, but (because of this new field of application for GPUs) there is a demand for high‐performance parallel sorting algorithms that fit with the characteristics of the modern GPU‐architecture. We present a high‐performance in‐place implementation of Batcher's bitonic sorting networks for CUDA‐enabled GPUs. Therefore, we assigned compare/exchange operations to threads in a way that decreases low‐performance global‐memory access and makes efficient use of high‐performance shared memory. This greatly increases the performance of this in‐place, comparison‐based sorting algorithm. Our implementation outperforms all other algorithms in our tests when sorting 64‐bit keys. It is the fastest comparison‐based GPU sorting algorithm for 32‐bit keys, being only outperformed by (non‐comparison‐based) radix sort when sorting sequences larger than 2 23 . Copyright © 2011 John Wiley & Sons, Ltd.
Type of Medium:
Online Resource
ISSN:
1532-0626
,
1532-0634
Language:
English
Publisher:
Wiley
Publication Date:
2011
detail.hit.zdb_id:
2052606-4
SSG:
11
Bookmarklink