Removing performance and programmability limitations of chip multiprocessor architectures
VTT'S FRONTIER PROJECT
KEYWORDS: Computer architecture, parallel computing, models of computation, parallel programming languages, compiling, optimizing, application software, performance measurement, FPGA prototyping, thread-level parallelism, instruction-level parallelism, general purpose computing
Current CMP architectures (SMP, NUMA, CC-NUMA, MP, VC) are tedious to program and often provide poor speedup compared to conventional sequential (single-core) processors. This is because they lack fast synchronization and latency-hiding mechanisms, i.e. they rely on weak models of computation.
The Removing Performance and Programmability Limitations of Chip Multiprocessor Architectures (REPLICA) project aims at developing a configurable emulated shared memory (CESM) architecture and methodology that enable radically easier programming and higher performance with the help of the PRAM model of computation.
REPLICA is a 3-year (2011-2013) project funded by VTT with a total budget of 1.4 M€. VTT collaborates with Linköping University, Sweden, and the University of Turku, Finland.
Fig. 1. The PRAM-NUMA model of computation of REPLICA.
REPLICA refers to replicating the processing resources and
programming/data structures in a smart way to provide radically better
performance and programmability than current multicore computers.
In REPLICA we are developing a configurable emulated shared memory machine (CESM) architecture and methodology that enable radically easier programming and higher performance with the help of a strong parallel random access machine (PRAM) model of computation. As a proof of concept, we are building a prototype machine with selected I/O devices based on FPGA technology, developing a programming language with compiling and optimization tools, and writing a comprehensive set of sample applications that demonstrate the performance and ease of use.
New techniques and ideas to be employed in REPLICA include, but are not limited to:
- an easy-to-program strong MCRCW PRAM model of computation implemented via multithreaded high-throughput computing
- combining threads within a processor to mimic non-uniform memory access (NUMA) and thereby support efficient execution of sequential/NUMA legacy code
- an efficient synchronization-wave technique that drops the cost of synchronization from O(100) down to O(1/100)
- support for multiple levels and models of parallelism: data, subgroup, and task parallelism at the high level and virtual instruction-level parallelism at the low level
- source-to-source translation, the low-level virtual machine (LLVM) framework, and virtual ILP optimization to implement an optimizing compiler
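The lock-step programming style that a synchronous CRCW PRAM model enables can be illustrated with a small sketch. The following Python program is not REPLICA code: it emulates with an explicit software barrier the synchronous steps that the REPLICA hardware provides implicitly, using a Hillis–Steele parallel prefix sum as the example algorithm.

```python
import threading

# Illustrative sketch only: emulating PRAM-style lock-step execution with an
# explicit software barrier. In a synchronous PRAM program every thread
# executes each step in lock-step; the barrier below stands in for the
# implicit machine-level synchronization that REPLICA provides in hardware.

P = 8                        # number of (virtual) threads
data = list(range(P))        # shared memory, one cell per thread
step = threading.Barrier(P)  # explicit stand-in for implicit lock-step sync

def prefix_sum(tid):
    # Hillis-Steele inclusive prefix sum: O(log P) synchronous steps.
    d = 1
    while d < P:
        val = data[tid - d] if tid >= d else 0
        step.wait()          # all reads complete before anyone writes
        data[tid] += val
        step.wait()          # all writes complete before the next reads
        d *= 2

threads = [threading.Thread(target=prefix_sum, args=(t,)) for t in range(P)]
for t in threads: t.start()
for t in threads: t.join()

print(data)  # inclusive prefix sums of 0..7 -> [0, 1, 3, 6, 10, 15, 21, 28]
```

On a conventional CMP each `step.wait()` is expensive, which is why this style is rarely practical there; the point of REPLICA is to make exactly this kind of synchronous algorithm cheap to run.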
The project produces new knowledge, solutions, architectures, and intellectual property for companies that design, manufacture, and exploit massively parallel computing solutions in their products.
The results of this project could have a major impact on industry and on the way future parallel computers are programmed.
Fig. 2. Dropping the cost of synchronization from O(100) down to O(1/100).
We are able to drop the cost of synchronization from O(100) down to O(1/100) with the help of throughput computing and the synchronization-wave technique (see Fig. 2).
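The arithmetic behind this claim can be sketched with a simple cost model. All constants below are illustrative assumptions, not REPLICA measurements: the point is that one synchronization wave is shared by all the interleaved threads of a processor, so the per-thread cost is the wave overhead divided by the thread count.

```python
# Back-of-envelope model of why amortized synchronization can cost O(1/100)
# cycles per thread. The numbers are illustrative assumptions, not measured
# REPLICA figures.

conventional_barrier_cycles = 100  # typical per-sync cost on a conventional CMP
wave_overhead_cycles = 4           # one synchronization wave through the machine
threads_per_processor = 512        # interleaved threads sharing that one wave

# Each wave synchronizes all threads at once, so its cost is amortized:
per_thread_cost = wave_overhead_cycles / threads_per_processor

print(f"per-thread sync cost: {per_thread_cost:.4f} cycles")
print(f"improvement over conventional: "
      f"{conventional_barrier_cycles / per_thread_cost:.0f}x")
```

With these assumed constants the per-thread cost lands below 0.01 cycles, i.e. in the O(1/100) regime the figure refers to.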
This web page describes the main part of the REPLICA project by introducing the ongoing work on the REPLICA architecture, the REPLICA programming language, its optimizing compiler, the hardware prototype, application software, publications, and the people behind REPLICA.