@INPROCEEDINGS{speculation,
AUTHOR="Gregory Diamos and Sudhakar Yalamanchili",
TITLE="Speculative Execution on {Multi-GPU} Systems",
BOOKTITLE="24th IEEE International Parallel \& Distributed Processing Symposium",
ADDRESS="Atlanta, Georgia, USA",
DAYS=19,
MONTH=4,
YEAR=2010,
KEYWORDS="Harmony; Heterogeneous; Runtimes; Compilers; CUDA; GPU; GPGPU",
ABSTRACT="The lag of parallel programming models and languages behind the advance of
heterogeneous many-core processors has left a gap between the computational
capability of modern systems and the ability of applications to exploit
them. Emerging programming models, such as CUDA and OpenCL, force
developers to explicitly partition applications into components (kernels)
and assign them to accelerators in order to utilize them effectively. An
accelerator is a processor with a different ISA and micro-architecture than
the main CPU. These static partitioning schemes are effective when
targeting a system with only a single accelerator. However, they are not
robust to changes in the number of accelerators or the performance
characteristics of future generations of accelerators.

In previous work, we presented the Harmony execution model for computing
on heterogeneous systems with several CPUs and accelerators. In this paper,
we extend Harmony to target systems with multiple accelerators using
control speculation to expose parallelism. We refer to this technique as
Kernel Level Speculation (KLS). We argue that dynamic parallelization
techniques such as KLS are sufficient to scale applications across several
accelerators based on the intuition that there will be fewer distinct
accelerators than cores within each accelerator. In this paper, we use a
complete prototype of the Harmony runtime that we developed to explore the
design decisions and trade-offs in the implementation of KLS. We show that
KLS improves parallelism to a sufficient degree while retaining a
sequential programming model. We accomplish this by demonstrating good
scaling of KLS on a highly heterogeneous system with three distinct
accelerator types and ten processors."
}


