Embedded Chip-Level Integrated Parallel SupErcomputer (ECLIPSE) is an architectural framework for general-purpose chip multiprocessors (CMP) and multiprocessor systems on chip (MP-SOC), and is also extendable to multichip constellations [Forsell02]. It borrows many ideas from our early work on the Instruction-Level Parallel Shared Memory (IPSM) machine originally reported in [Forsell97], as well as from earlier PRAM realization research [Ranade91, Leppänen96] and network on chip (NOC) research [Jantsch03].
Unfortunately, the original ECLIPSE architecture supports only the exclusive read exclusive write (EREW) PRAM model, which cannot match the performance of the multioperation concurrent read concurrent write (MCRCW) PRAM: for a large class of parallel computational problems it requires logarithmically longer execution times even when optimal parallel algorithms are used. In addition, it fails to support efficient execution of low-TLP functionalities because, for organizational reasons, it features a relatively high minimum number of threads per processor, dropping the utilization of a core to as low as the reciprocal of that value when a functionality has only one thread.
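The logarithmic penalty of EREW execution can be illustrated with a broadcast: delivering one value to n threads. The sketch below is a software simulation, not REPLICA code; it only counts synchronous PRAM steps under the two access rules.

```python
# Under EREW each memory cell may be read by at most one thread per step,
# so a broadcast must spread the value by doubling the set of cells that
# hold it, taking ceil(log2(n)) steps. With concurrent reads (CRCW/CREW),
# all n threads read the same cell in a single step.

def erew_broadcast_steps(n):
    """Count EREW steps: the number of cells holding the value doubles each step."""
    have, steps = 1, 0
    while have < n:
        have *= 2      # each current holder copies the value to one new cell
        steps += 1
    return steps

def crcw_broadcast_steps(n):
    """With concurrent reads, one step suffices regardless of n."""
    return 1

for n in (8, 1024):
    print(n, erew_broadcast_steps(n), crcw_broadcast_steps(n))
# 8 threads: 3 EREW steps vs 1 CRCW step; 1024 threads: 10 vs 1
```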
Our recent proposal for a universal general-purpose CMP is the TOTAL ECLIPSE architecture, which realizes the arbitrary MCRCW PRAM model and supports NUMA execution for processor-wise thread bunches, making execution of low-TLP functionalities as efficient as on standard sequential processors using the NUMA convention [Forsell10, Forsell11]. The REPLICA architecture is an improved version of TOTAL ECLIPSE that implements the PRAM-NUMA model of computation with support for full NUMA operation, a better memory system employing local memories and a virtual/off-chip memory system, an I/O system, support for native floating-point operations, an improved communication network, improved memory modules with halved operating frequency, and various architectural techniques that reduce power consumption.
A REPLICA consists of P Tp-threaded (constituting in total T = P·Tp threads) F-functional-unit MBTAC processor cores with dedicated instruction memory and local data memory modules, P Tp-line step caches and scratchpads attached to the processors, P fast data memory modules, and a high-bandwidth multimesh interconnection network (see Figure 1).
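As a back-of-the-envelope illustration of the notation above, the total thread count is the product of the core count and the threads per core. The parameter values below are hypothetical, not a REPLICA specification:

```python
# Illustrative configuration (hypothetical values, chosen only to show
# the relation T = P * Tp from the text).
P = 64        # processor cores
Tp = 512      # hardware threads per core
T = P * Tp    # total threads across the chip
print(T)      # 32768
```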
Fig. 1. An early view of the REPLICA architecture.
New architectural techniques and ideas to be employed in REPLICA include, but are not limited to:
- implementation of an easy-to-program strong MCRCW PRAM model of computation via multithreaded high-throughput computing
- combining of threads within processors to mimic the NUMA model in order to support sequential/NUMA legacy code
- truly scalable latency hiding via high-throughput computing and high bandwidth
- efficient wave synchronization, dropping the cost of synchronization from O(100) down to O(1/100)
- concurrent memory access for advanced parallel algorithms
- multioperations for computing prefixes and reductions in constant time
- virtual instruction-level parallelism exploitation
- pipeline hazard elimination
- memory hashing for eliminating hot spots in intercommunication
- implicitly synchronous multithreaded execution
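The semantics of the multioperations mentioned above can be sketched with a multiprefix add: n threads target the same memory location in one step, each receives a running prefix of the contributions, and the location ends up holding the full reduction. This is a software simulation of the semantics only, not the constant-time hardware mechanism; the thread-index ordering of concurrent contributions is an assumption made for illustration.

```python
# Simulated MPADD-style multiprefix on a single memory location:
# thread i receives the old memory value plus the sum of contributions
# of threads 0..i-1, and memory is left holding the total reduction.

def multiprefix_add(memory, addr, contributions):
    results = []
    running = memory[addr]
    for c in contributions:        # thread order defines the prefix order here
        results.append(running)    # value returned to this thread
        running += c
    memory[addr] = running         # memory ends with the full reduction
    return results

mem = {0: 0}
print(multiprefix_add(mem, 0, [3, 1, 4]))  # [0, 3, 4]
print(mem[0])                              # 8
```

In hardware such an operation completes in constant time for all participating threads, which is what makes constant-time prefixes and reductions possible.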
[Forsell97] M. Forsell, Implementation of Instruction-Level and Thread-Level Parallelism in Computers, Dissertations 2, Department of Computer Science, University of Joensuu, Joensuu, 1997.
[Forsell02] M. Forsell, A Scalable High-Performance Computing Solution for Network on Chips, IEEE Micro 22, 5 (September-October 2002), 46-55.
[Forsell10] M. Forsell, TOTAL ECLIPSE—An Efficient Architectural Realization of the Parallel Random Access Machine, In Parallel and Distributed Computing Edited by Alberto Ros, IN-TECH, Vienna, 2010, 39-64.
[Forsell11] M. Forsell, A PRAM-NUMA Model of Computation for Addressing Low-TLP Workloads, International Journal of Networking and Computing 1, 1 (January 2011), 21-35.
[Jantsch03] A. Jantsch and H. Tenhunen (editors), Networks on Chip, Kluwer Academic Publishers, Boston, 2003, 173-192.
[Leppänen10] V. Leppänen, M. Penttonen and M. Forsell, Layouts for Sparse Networks Supporting Throughput Computing, In the proceedings of the 2010 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’10), July 12-15, 2010, Las Vegas, USA, 443-449.
[Ranade91] A. Ranade, How to Emulate Shared Memory, Journal of Computer and System Sciences 42 (1991), 307-326.
Fig. 2. Early block diagrams of Mc-way double acyclic multimesh network (top), superswitch (middle), and switch element (bottom) for a 64-processor REPLICA CMP.