comments: http://pact10.ac.upc.edu/submit/paper.php?p=297&ls=1#response
PACT'10 Paper #297
Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems
Submitted [PDF] 430kB, updated Sunday 28 Mar 2010 3:45:36am CEST | SHA-1 b59bfaa5089129ac3618a92ee645f6a89f4ac641
PC conflicts: Hyesoon Kim, Scott Mahlke
Other conflicts: Georgia Institute of Technology, Georgia Tech Research Institute

Abstract
Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core platforms. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86 and other ISAs. The binary translator is able to execute existing CUDA binaries without static recompilation from source and Ocelot can in fact dynamically switch between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications. This paper presents a high level overview of the implementation of our dynamic binary translator highlighting design decisions and trade-offs, and showcasing their effect on application performance.
We explore several novel code transformations that are applicable only when compiling explicitly parallel applications and revisit traditional dynamic compiler optimizations for this new class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.

Authors
Gregory Diamos (Georgia Institute of Technology)
Andrew Kerr (Georgia Institute of Technology)
Sudhakar Yalamanchili (Georgia Institute of Technology)
Nathan Clark (Georgia Institute of Technology)

Review scores
(OveMer = Overall merit, RevExp = Reviewer expertise, Nov = Novelty, WriQua = Writing quality, TecSou = Technical soundness, LevIntPAC = Level of interest to PACT attendees)

Review   OveMer  RevExp  Nov  WriQua  TecSou  LevIntPAC
#297A    5       3       3    4       4       4
#297B    6       3       4    4       4       4
#297C    6       2       4    4       3       4
#297D    5       3       3    4       2       4

Review #297A
Modified Saturday 1 May 2010 4:11:36am CEST
Overall merit: 5. Weak accept (just above the bar, but I won't object if it is rejected)
Reviewer expertise: 3. Knowledgeable
Novelty: 3. Incremental improvement
Writing quality: 4. Good
Technical soundness: 4. Technically sound and of good quality
Level of interest to PACT attendees: 4. High (likely of interest to most attendees)

Paper summary
This paper presents a dynamic compilation framework that translates programs written in NVIDIA CUDA to multicore x86 systems, along with the necessary runtime support. It also presents empirical results that show the important issues to be resolved for such systems.

Paper strengths
While this paper does not present any fundamentally new technology, it describes some of the engineering issues the authors encountered when implementing their dynamic compilation system and how those issues were resolved, which would be useful for other people who want to implement similar systems in practice.
Their experimental results, which show the important issues to consider (such as on-chip memory and context-switch overheads), are very insightful. The paper is relatively well written.

Paper weaknesses
This is more of a practitioner's report of design and implementation. Nothing described in the paper is really ground-breaking.

Comments to address in the response
1. When converting to full SSA form, the authors mention that they create a phi for each live-in register of each basic block. Wouldn't that create unnecessary phi instructions for a basic block that is not a merge point?
2. In your run-time design, instead of statically partitioning the work (CTAs) among the worker threads and using condition variables for the master thread to communicate with them, a simpler implementation might be a (synchronized) work queue to which the master thread appends all the CTAs at the beginning. Each worker thread then keeps grabbing CTAs from the queue. This way, the master thread doesn't have to worry about partitioning the CTAs among the worker threads to maintain load balance (and it is not as complicated as work stealing). Have you considered such an approach? (I understand that the work-queue approach does not consider data locality, but I was wondering how it performs empirically.)

Comments for author
1. You should use the full name (instead of the acronym) the first time you use PTX in the abstract and in the paper.
2. Inconsistent explanation of CTA: in Fig. 1 it is "cooperative thread array", but in the text (Section II) it is "concurrent thread array".
3. While there are a couple of examples that demonstrate different engineering problems in different steps, a single small, consistent example that shows how it is processed/translated in each step of the Ocelot translation process would help readers understand the technology better.

Review #297B
Modified Monday 3 May 2010 5:39:08am CEST
Overall merit: 6.
Accept (easily above the acceptance bar)
Reviewer expertise: 3. Knowledgeable
Novelty: 4. New contribution
Writing quality: 4. Good
Technical soundness: 4. Technically sound and of good quality
Level of interest to PACT attendees: 4. High (likely of interest to most attendees)

Paper summary
The paper describes a system called Ocelot, which is a JIT compiler that takes PTX as input and can generate code for diverse multi-core machines.

Paper strengths
The authors have done a nice job of enumerating all the issues that one encounters in translating PTX to the high-level LLVM IR. Interesting studies are performed, including a sensitivity study on the impact of LLVM optimizations on several Parboil programs.

Comments for author
This is an interesting paper that is well written and highly relevant to PACT, and it should spark interesting discussions at the conference. I only have a few minor comments.
In Table 1, all optimizations seem to have the same impact on TPACF. I believe it is likely that another optimization is causing this benefit, and that it is probably not any of the optimizations in the table. Perhaps it is one of the optimizations in the x86 code generator (e.g., the peephole optimizer); I would suggest further analysis here.
In Section D, Sensitivity Analysis, the authors state that their analysis presents new opportunities to use application characteristics to choose optimizations. However, this is not new, and there are several people working in this area.

Review #297C
Modified Tuesday 4 May 2010 1:08:16pm CEST
Overall merit: 6. Accept (easily above the acceptance bar)
Reviewer expertise: 2. Some familiarity
Novelty: 4. New contribution
Writing quality: 4. Good
Technical soundness: 3. Seems technically sound but didn't check completely
Level of interest to PACT attendees: 4.
High (likely of interest to most attendees)

Paper summary
Ocelot is a framework for dynamic compilation that maps data-parallel NVIDIA CUDA (binary) programs to multithreaded platforms, with a focus on shared-memory multi-core CPUs. The paper overviews the Ocelot dynamic compiler, discussing design trade-offs, novel code transformations for explicitly parallel code, and dynamic techniques. Some performance results are presented for microbenchmarks as well as for real applications.

Paper strengths
A broad overview of key compilation issues and their interrelations for architectures that are becoming mainstream.

Comments for author
Since this paper could also be of interest outside the compiler community, some jargon and acronyms ought to be explained. (Examples: IR, AST, CFG, LLVM.)

Review #297D
Modified Tuesday 18 May 2010 12:25:04am CEST
Overall merit: 5. Weak accept (just above the bar, but I won't object if it is rejected)
Reviewer expertise: 3. Knowledgeable
Novelty: 3. Incremental improvement
Writing quality: 4. Good
Technical soundness: 2. Incomplete or has some technical flaws
Level of interest to PACT attendees: 4. High (likely of interest to most attendees)

Paper summary
This paper presents the Ocelot dynamic compilation framework, designed to map the explicitly data-parallel execution model used by NVIDIA CUDA applications onto multicore x86 CPUs. The paper explains several issues in designing and implementing the compilation framework. Experiments on several microbenchmarks and full applications identified important issues such as on-chip memory pressure, context-switch overhead, and variable CTA execution time, which must be addressed when compiling highly parallel programs to systems with few hardware resources.
Paper strengths
This paper seems to present the first dynamic compiler framework from a bulk-synchronous programming model to an x86 many-core target processor, and it explains well several important problems that may arise during dynamic compilation and binary translation.

Paper weaknesses
Even though the proposed work seems to be the first dynamic compiler framework for adaptively executing bulk-synchronous programs on an x86 many-core platform, most of the core contributions come from previous work: 1) the solution for translating CUDA to multi-core x86 is effectively the same approach as MCUDA, 2) the capabilities for dynamic optimization and compilation are due to LLVM, and 3) many of the issues raised by this paper are addressed in previous work. It seems that the only new contribution of this paper is to implement and solve problems in binary translation from PTX to LLVM. Moreover, the experimental results of applying dynamic optimizations are not very promising. Overall, this paper presents a fair amount of work, and the proposed system can be used as a base framework for research on dynamic optimization of the explicitly data-parallel execution model for diverse multithreaded platforms.

Comments to address in the response
This paper fails to follow some of the basic paper-writing rules: describing a gap in the state of the art, presenting metrics to measure the gap, describing related work that improves the state of the art in terms of these metrics, and then showing quantitatively that the new contribution performs better than this related work. Is the gap in the existence of optimizations for current and future architectures? In designing compilers for such architectures? In the quality or performance of dynamic binary translators? In understanding the overheads in present applications? Please discuss.
Comments for author
In addition, in the results section I would expect a graph that shows that the newly presented technique is able to close the gap more than the best related work. This could be a very good paper, but the presentation needs improvement.

Response
The authors' response is intended to address reviewer concerns and correct misunderstandings. The response should be addressed to the program committee, who will consider it when making their decision. Don't try to augment the paper's content or form; the conference deadline has passed. Please keep the response short and to the point.

================================================================================================================
PACT Response:

Reviewer A:
1) Creating a phi for each live-in register for each basic block does indeed create unnecessary phi nodes. Originally we were not sure whether the SSA representation in LLVM required a phi for every register that is used in a given basic block, so these phis were conservatively left in place rather than pruned. It turns out that LLVM does allow them to be omitted, and we should probably go back and remove them.
2) We did consider this approach in the initial design. We discarded it based on the expectation that the single global lock required to synchronize a central queue would have a high overhead on systems with many cores. However, our experimental results show that this overhead is relatively low compared to the execution time of a CTA, so it is probably worth re-evaluating along with work stealing.

Reviewer B:
It is possible that optimizations being performed during TargetLowering in LLVM are responsible for the similar improvement in TPACF for all optimization levels. This would be relatively easy to verify by enabling optimization and then manually disabling all of the passes listed in the paper.
We could perform this experiment and update the final copy of the paper to reflect whether or not this was the case.

Reviewer C:
That is a good suggestion. We will go back and add better explanations for the terms that you identified.

Reviewer D:
You are right that this paper does not follow the conventional rules for conference paper writing. It does not introduce and quantitatively evaluate a new technique to address a gap in the state of the art. Instead, it covers the complete implementation and subsequent evaluation of state-of-the-art compilation/optimization techniques under new assumptions (a bulk-synchronous input representation and a shared-memory multi-core target). The main contribution of the paper is the identification of several new problems that the state of the art does not address (such as changes in memory access patterns after thread fusion resulting in reduced memory bandwidth). Following your example, it is identifying a new gap and explaining why it exists.

Reviewer A made a similar comment that "[the paper] is more of a practitioner's report of design and implementation." In fact, this paper was adapted from a technical report describing the design decisions and challenges during the implementation of Ocelot. It was converted into a paper after we decided that the lessons learned were significant enough, and the problems encountered challenging enough, to warrant future research projects and eventually papers that address them directly. We believe that many of these problems are not obvious, that they would be difficult to identify without performing a complete implementation like Ocelot, and that their identification would be valuable in and of itself to the PACT community, given the rising popularity of parallel architectures and the likelihood of other efforts running into the same problems.

We want to thank all of the reviewers for their comments. We welcome the opportunity to use them in improving the quality of this work.
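
Illustrative note on the central work queue (Reviewer A, item 2):
The following is a minimal Python sketch of the scheme Reviewer A describes: the master thread enqueues every CTA up front, and each worker thread repeatedly pulls the next CTA until the queue is empty. It is not Ocelot's runtime; the names run_kernel and execute_cta are hypothetical, and the lock inside queue.Queue merely stands in for the "single global lock" mentioned in the response.

```python
import queue
import threading


def run_kernel(num_ctas, num_workers, execute_cta):
    """Run every CTA exactly once via one shared, synchronized work queue.

    run_kernel and execute_cta are hypothetical names for this sketch only.
    Returns, for each worker, the list of CTA ids it executed.
    """
    work = queue.Queue()
    # Master thread: append all CTA ids before any worker starts.
    for cta_id in range(num_ctas):
        work.put(cta_id)

    executed = [[] for _ in range(num_workers)]  # per-worker log of CTA ids

    def worker(worker_id):
        while True:
            try:
                # Non-blocking pop; queue.Queue serializes access with an
                # internal lock, i.e. one global lock shared by all workers.
                cta_id = work.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            execute_cta(cta_id)
            executed[worker_id].append(cta_id)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed


if __name__ == "__main__":
    results = {}

    def fake_cta(cta_id):
        # Stand-in for executing one CTA's translated code.
        results[cta_id] = cta_id * cta_id

    log = run_kernel(num_ctas=64, num_workers=4, execute_cta=fake_cta)
    print("CTAs per worker:", [len(ids) for ids in log])
```

Because workers pull CTAs on demand, a worker that draws long-running CTAs simply takes fewer of them, so load balances without static partitioning. The cost is contention on the one lock and, as Reviewer A notes, no attention to data locality.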