PERSONAL WEBSITE
Thoughts and Software
Gregory Diamos
My name is Greg Diamos, I am a Research Scientist with cross-disciplinary experience in computer architecture, dynamic compilation, task scheduling runtimes, and parallel database systems. I am currently employed by NVIDIA corporation.
This is a personal website covering my research interests, open source projects, and contact information. It is not affiliated with any company or organization.
News
Single-Source Longest Paths on Control Flow Graphs
Monday April 18, 2012
I recently ran into an interesting problem that I feel like sharing. Given an arbitrary directed graph (like a control flow graph), find the set of longest acyclic paths from a single node to all other nodes.
This is not quite the same as the Hamiltonian path problem, since a subsection of a Hamiltonian path may not be the longest path to another node along the path. After not being able to find any obvious related work, here's what I came up with:
Algorithm
Use two data structures, an N-ary tree and a hash table that tracks the mapping from any node to its position in the tree. Start with the single node as the root of the tree, and visit it. For each successor of the node being visited, see if there is an entry for it in the hash table. If there is no entry, add it to the tree as a child of the current node, add an entry to the hash table, and visit it immediately (as in a DFS). If an entry already exists check to see if the length of the path to the current node + 1 is greater than the length of the existing path to its successor (the length should be stored with the node in the tree). If it is greater, then we have found a potentially longer path. As long as it is not part of a cycle.
To determine if it is, check to see if the successor is already a parent of the current node in the tree (by walking up the tree). If it isn't part of a cycle, take the entire subtree rooted at the successor and move it as a child under the current node, updating the distances of the subtree nodes and their hash table entries (which can be elided if the hash table entries are pointers).
Comments
At the end, the tree contains the longest path from the root to all other nodes. For anyone who is interested, there's an implementation of this algorithm that is used in the ThreadFrontierAnalysis in gpuocelot - thread-frontier-analysis
Geek points for anyone who emails me an equivalent weakly-scalable parallel algorithm.
Simulated SAXPY clocks in at 1.3 GIPS:
Monday April 18, 2011
As a side project, my wife and I have been toying with the idea of what it would take to build a complete software infrastructure on a GPU, including an architecture simulator, compiler, and runtime. Vanaheimr is a project that attempts just this, by implementing a simulator, compiler, and runtime in 100% CUDA, with NO host code at all other than to kick off the first kernel.
Rather than to build everything from the ground up, our approach was to start with a few small case studies, the first of which involved simulating a SAXPY program. The SAXPY kernel is written in assembly, and executed by a GPU simulator running CUDA on a C2050 GPU. The first results look very promising, with a functional simulator achieving 1.3 giga-instructions per second, about 1000x faster than existing CPU or FPGA based simulators.
This is only a simple example and not a full simulator, but it shows that the general approach is promising. The next steps involve building pieces of a compiler that can compile existing applications into Vanaheimr's new assembly language and flushing out the simulator to support the entire ISA.
Worst case complexity of computing DAG criticality:
Sunday March 13, 2011
I usually shy away from theory problems, but I also often find myself trying to design systems that are robust to worst-case behavior. This leads me to issues like the following:
Assume that we have a directed acyclic graph reresenting dependent tasks, and we want to schedule these tasks on infinite processors satisfying dependencies in the DAG. Also assume that we know the execution time of each task a-priori. A first step is to examine the complete set of solutions (the complete set of possible schedules). I think that a step towards bounding the space can be made by answering the following question:
I have a DAG with N vertices and an arbitrary number of non-duplicate edges. What is the maximum number of unique paths in such a graph, where a unique path is defined as a path that is not partially contained within any other path?
Ocelot 1.3.967 and 2.0.969 Released
Monday February 7, 2011
Ocelot 2.0.969 supports Fermi devices and introduces a series of new features listed below:
- Full PTX 2.2 support, including function pointers and recursion.
- Lazy just-in-time compilation for the x86 backend.
- Ocelot sever and remote device client.
Ocelot 1.3.967 is the last release that supports PTX 1.x devices (G80 and GT200 series).
Ocelot at GTC 2010
Monday September 27th, 2010
We recently gave an overview of Ocelot at GTC 2010. NVIDIA was kind enough to make a webcast of the talk available.
This presentation covers the high level vision of the project, what other researchers can use Ocelot for, and the currently active sub-projects. This concludes the current round of travel and presentations so I can devote the majority of my time to the project again.
Design and Implementation of Ocelot at PACT 2010
Wednesday June 2th, 2010
Our paper, Ocelot: A Dynamic Compiler for Bulk-Synchronous Applications in Heterogeneous Systems was accepted at The Nineteenth International Conference on Parallel Architectures and Compilation Techniques.
Also, I am trying to be more open with the development cycle for all of the work that I do. In that spirit I am going to start including the full reviews for all of my papers starting with this one. I may also go back and try to digg up old reviews depending on whether or not I can still find them. [Reviews]
Ocelot Overview at NVIDIA [pdf]
Tuesday May 18th, 2010
I finally brought this sever back up yesterday after a cross country journey from Atlanta to San Jose.
Also, I am giving a presentation on Ocelot to NVIDIA and NVIDIA Research this friday.
Abstract: "Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multi-threaded platforms. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications. This talk provides a high level overview of Ocelot, focusing on its use as a tool for performance analysis, workload characterization, CUDA debugging, architecture simulation, dynamic optimization, and transparent execution of CUDA applications on CPUs and GPUs."
Index of MSB Bug
Thursday March 18th, 2010
Stefanus Du Toit offered a potential solution as to why the previous code failed to compile with -O3. He writes "The most likely reason is the cast from &ff to an (unsigned long*). This breaks the strict aliasing rules in C/C++, which state that in general two pointers of different types never alias (with exceptions for char* and void*). GCC's -fno-strict-aliasing can be used to test this (it should work with that option enabled).There's no fully portable standards-compliant way to do this in C/C++, but all the major compilers accept the following, which is based on implementation-defined behavior related to unions:
template
D bitwise_cast(S s) {
union { D d; S s; } u;
u.s = s;
return u.d;
}
D bitwise_cast(S s) {
union { D d; S s; } u;
u.s = s;
return u.d;
}
Then you would do:
return (bitwise_cast(ff) >> 23) - 127;
While I'm staring at the code, you could avoid the branch with something like:
float ff = (v | 1) >> (v < 0x01000000 ? 0 : 16);
and a similar adjustment for the +16. On SIMD ISAs this should just result in some masking code. Haven't counted what that would do to the instruction count though.FWIW I plan to proposed a bitwise_cast like this in C++ post-C++11."
Modelling GPU-CPU Workloads and Systems at GPGPU-3
Thursday Feb 11th, 2010
Our paper, Modeling GPU-CPU Workloads and Systems was accepted at the Third Workshop on General-Purpose Computation on Graphics Procesing Units.
Ocelot 1.0.265-Alpha Released
Thursday December 17th, 2009
Ocelot Support for native execution of CUDA applications on GPUs and CPUs without recompilation.
Faster Index of MSB
Monday November 9, 2009
Steve Worley thinks that this is faster than the code I posted before. It relies on the exponent computation of an IEEE754 floating point number. Compiling this code produces 8 x86 instructions on the most likely path. My example used 22 instructions. However, Steve's code relies on two branches and two floating point conversion instructions, whereas my example included only shifts, comparisons, and conditional selects. The performance of these depends on the branch predictor accuracy and latency of FP conversion instructions, although I would imagine that Steve's code is faster. Also, I couldn't get Steve's code to produce correct results with -O3 optimization on, I need to take a look at the machine code to figure out why this is. I may also get around to benchmarking these if I have a bit of free time.
inline ulong highestBitIndex(ulong v)
{
if (v<0x01000000) {
float ff=(float)(v|1);
return ((*((unsigned long *)&ff))>>23)-127;
}
float ff=(float)(v>>16);
return ((*((unsigned long *)&ff))>>23)-127+16;
}
Fast Index of MSB
Tuesday, October 13th, 2009
I keep telling people that this is a fast way to find the position of the msb of an int.
unsigned int msb32(unsigned int x)
{
int position = 0;
int offset = ( x & 0xffff0000 ) ? 16 : 0; position = offset; x >>= offset;
offset = ( x & 0x0000ff00 ) ? 8 : 0; position += offset; x >>= offset;
offset = ( x & 0x000000f0 ) ? 4 : 0; position += offset; x >>= offset;
offset = ( x & 0x0000000c ) ? 2 : 0; position += offset; x >>= offset;
offset = ( x & 0x00000002 ) ? 1 : 0; position += offset; x >>= offset;
return position;
}
New Website Template
Tuesday, September 22th, 2009
I recently switched to a new website design that is hopefully clearner and more straightforward than the previous one. It may take a week or so before everything settles so please excuse any broken links.