This research is supported by Georgia Tech Research Foundation.
System-on-Chip (SoC) architectures are increasingly used in a diverse set of applications.
By enabling efficient component reuse and platform-based design,
true SoC design strategies and methodologies would
dramatically impact production cost and breadth of functionality.
As single-chip system design becomes a reality with
advances in deep sub-micron technology, managing the complexity of
designs with billions of transistors requires ingenious design strategies
for their implementation. For example, Intel recently developed an 80-core chip
(the TeraFLOP Chip).
The crux of SoC design embodying hundreds of functional units is the on-chip
network architecture and the orchestration of parallelism, which are becoming critical
bottlenecks in meeting the performance and power consumption budgets of the chip design.
They are also expected to play crucial roles in dictating IP block reuse and design time.
[Figure: Intel Developer Forum (IDF)]
SoC Development and Multicore Architecture
SoC designs should be able to provide integrated solutions to increasing circuit complexity at the optimum production cost, while reducing the time-to-market.
[Figure: IBM Cell Processor]
This approach is appealing for both development and manufacturing strategies in terms of performance, functionality and product-volume constraints at the GigaScale Research Center (GSRC), an inter-university research consortium sponsored by the SIA and DARPA. Thus, the design flow for incorporating numerous functional blocks requires a structured methodology, the result of a natural progression toward higher-level abstraction and partitioning, which is helpful for both hardware and software design.
[Figures: Billions of Transistors Era; Evolution of Design Abstract]
Modular design
Reuse concerns all the additional activities that have to be performed to generate an easy-to-use and flexible module. It is based on a hierarchical approach that partitions a system into modules, which requires compatibility and consistency. Proper system partitioning allows the design of different modules to proceed independently. The decomposition is generally guided by structuring rules aimed at hiding local design decisions so that only the interface of each module is visible. This kind of methodology is also called "modular design", and it consists of sound design rules in terms of timing constraints, hierarchical design and floor-planning. The overall modular approach optimizes the insertion of reusable components within the circuit.
Communication infrastructure
The modular design approach and component reuse make it necessary to address the complex requirements of such SoCs. The design of the SoC communication infrastructure, also called the on-chip network, therefore facilitates the development and deployment of reusable system components. Since wire delays for crossing the die approach tens of clock cycles, a network module using multiple packets and high-speed links is becoming a reality. This structured communication infrastructure yields reduced die size and cost with higher wire efficiency. The components of an on-chip network, such as the switching fabric, link circuitry, buffer and control logic, and interfaces, which are designed to be compatible with heterogeneous and homogeneous processing elements (PEs), should be interoperable and reusable.
[Figures: Global Wire Scaling Problem; Chip Cross Section]
Plug-and-play
If chip designers and system architects can effectively build an on-chip communication infrastructure for connecting many discrete building blocks, it would enable plug-and-play IP reuse and would scale with new generations of process technology.
[Figure: Chip Design Revolution: Network Centric Architectures]
On-Chip Networks
As a result of the increasing degree of integration on a silicon die, the SoC design paradigm is increasingly seen as the design of communication architectures for an exceedingly high number of pre-designed computational and storage blocks. This communication-centric SoC is a new design paradigm that suits many applications: it favors more thread-level parallelism (TLP), with less focus on instruction-level parallelism (ILP), and lightweight parallel processing agents.
[Figure: On-Chip Networks: Rethinking Systems and Scalability]
Network-on-chip (NoC) architectures are expected to be crucial in implementing complex and function-rich chips for platform-based design and manufacturing capabilities. As chip complexity continues to increase, a more systematic approach is required to effectively transport and manage on-chip traffic, optimize wire utilization, and allow designs to scale in size, complexity, and component reuse without compromising performance and reliability. Although a few researchers have investigated the NoC design space, a systematic, system-wide design paradigm has yet to evolve.
Area-equivalent design
An essential but difficult step in designing an on-chip interconnect is to decide
how much area should be devoted to the communication substrate.
This is a critical design parameter since one must strike a balance
between the processing tiles and the network for optimizing performance
and power consumption. A poor layout may result in underutilization
of either the PEs or the network.
Moreover, the allocated area for the network dictates many of the
design choices.
In earlier studies, anywhere from 7% to as much as 40-50% of the area has been devoted
to the on-chip network.
Therefore, I would like
to conduct an in-depth analysis of the on-chip network area requirement.
Deciding this area requirement is challenging since it depends
on the application communication characteristics, network topology,
expected performance, and technology constraints.
I investigate the network area requirement for a wide class of applications
by varying the network load
and observing the network switching activity needed to satisfy the required
performance goal.
For a given silicon budget in general purpose multicore architectures,
I have analyzed the buffer requirements,
placement of buffers at different ports, top level wiring constraints,
and many other CMOS implementation issues to obtain realistic area
and timing parameters.
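As a rough illustration of the area trade-off discussed above, the sketch below computes the buffer storage implied by a router configuration and the fraction of a tile taken by the network. All parameter names and numeric values are hypothetical and chosen only for the arithmetic; they are not the layout results referred to in the text. Only the 7% to 50% range comes from the text.

```python
def buffer_bits(ports, vcs_per_port, flits_per_vc, flit_width_bits):
    """Total buffer storage of one router, in bits (hypothetical parameters)."""
    return ports * vcs_per_port * flits_per_vc * flit_width_bits


def network_area_fraction(router_area_mm2, pe_area_mm2):
    """Share of one tile (PE plus router) that is devoted to the network."""
    return router_area_mm2 / (router_area_mm2 + pe_area_mm2)


def within_reported_range(fraction, low=0.07, high=0.50):
    """True if the fraction falls in the 7% to 50% range cited from earlier studies."""
    return low <= fraction <= high


if __name__ == "__main__":
    # Assumed example: 5-port router, 4 virtual channels, 4-flit buffers, 128-bit flits.
    print("buffer bits per router:", buffer_bits(5, 4, 4, 128))
    # Assumed example areas: 1 mm^2 router next to a 9 mm^2 processing element.
    share = network_area_fraction(1.0, 9.0)
    print("network share of tile:", share, within_reported_range(share))
```

Sweeping the buffer depth or the number of virtual channels in such a model shows how quickly the storage requirement, and therefore the network share of the tile, grows.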
On-Chip network workload characterization
Since SoCs are ideal candidates for dedicated applications, communication patterns of a wide range of applications should be used for analyzing the underlying interconnect design. Most prior studies have used only synthetic workloads, which may not represent the actual communication behavior.
[Figure: Multi-FPGA Board for Our Processor Core Emulation]
Analytical model
While most prior on-chip interconnect analyses are based on time-consuming simulation models, I have developed a queuing-theory-based model for quantifying the performance and energy behavior of on-chip networks in [ANCS 2005], in order to provide fast performance estimates during the design cycle. Although traditional wormhole-switched off-chip networks have been analyzed extensively, analytical models for NoCs that consider detailed architectural artifacts are almost nonexistent. My model differs from previous work in that I compute the average delay due to path contention, virtual channel arbitration and crossbar switch arbitration using a queuing theory approach, which I believe can capture the blocking phenomena of wormhole switching quite accurately. I first developed a performance analysis model for a generic wormhole-switched router. The model was then used to estimate power consumption by estimating the utilization of the router components and multiplying it by component-level power profiles obtained from actual circuit-level synthesis. Comparison with simulation results indicates that the proposed analytical model is quite accurate and can be used as an efficient design tool. An extension of the proposed analytical model is also shown to demonstrate the utility of the model for fault-tolerance studies. The model will be extended to capture other architectures and combined performance, energy and fault-tolerance evaluation.
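To give a feel for how a queuing-based estimate works, the sketch below uses a plain M/M/1 queue per hop. This is a deliberately simplified stand-in and not the model of [ANCS 2005], which also accounts for virtual channel and crossbar arbitration. The function names, the assumption that a channel is held for one cycle per flit, and the example numbers are all illustrative.

```python
def mm1_wait(arrival_rate, service_time):
    """Mean waiting time in an M/M/1 queue (time spent queued, excluding service).

    arrival_rate: packets per cycle offered to the channel
    service_time: mean cycles the channel is held per packet
    """
    rho = arrival_rate * service_time  # channel utilization
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho * service_time / (1.0 - rho)


def avg_packet_latency(hops, router_delay, packet_flits, arrival_rate):
    """Approximate average latency in cycles of a wormhole-switched packet.

    Assumes (for illustration only) identical load on every hop and that a
    channel is occupied for one cycle per flit of the packet.
    """
    service_time = float(packet_flits)
    contention = mm1_wait(arrival_rate, service_time)
    # Header pays router delay plus contention at each hop; the body flits
    # then follow in a pipeline, adding the serialization time once.
    return hops * (router_delay + contention) + packet_flits


if __name__ == "__main__":
    for rate in (0.02, 0.10, 0.18):
        print(rate, avg_packet_latency(hops=4, router_delay=2, packet_flits=5, arrival_rate=rate))
```

Even this crude model reproduces the characteristic behavior of the latency curve: nearly flat at low load and rising sharply as the channel utilization approaches one.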
Performance enhancement techniques
The NoC latency impacts the performance of many on-chip applications. As technology scales down, we expect that more area will be available for implementing several performance enhancement techniques in the network. For example, one could use more virtual channels (VCs), sophisticated scheduling schemes, adaptive routing algorithms using routing tables, and more efficient flow control mechanisms. Minimizing message latency by optimizing the intra-node delay and using an organized wiring layout with regular topologies has been a target of NoC designs. I have proposed a low-latency on-chip router supporting adaptivity (a Path-Sensitive Router architecture) and a gracefully degrading and energy-efficient modular router (a Row-column Decoupled Switch Architecture), designed with this objective, which consists of a two-stage pipelined model with look-ahead routing and path-sensitive-guided flit queuing.
Energy model
As many SoCs are targeted at embedded applications and portable devices, performance-driven design alone is not adequate for such systems. Energy-efficient design is equally important to minimize power usage. The generic estimate relates the energy consumption of each packet to the number of hop traversals per flit per packet, times the energy consumed for each flit per router. I took this estimate a step further, decomposing the router into individual components, which we laid out in 90nm/70nm technology at a given power supply voltage and clock frequency. Our team used the Berkeley Predictive Technology Model and measured the average dynamic power using 50% switching activity on the inputs. Additionally, we measured the leakage power of each component when no input activity is recorded. We imported these values into our architectural-level, cycle-accurate NoC simulator and simulated all individual components in unison to estimate both dynamic and leakage power in routing a flit. In doing so, I was able to identify the individual utilization rates of each component, thus accurately modeling the leakage and dynamic power consumption.
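The generic per-packet estimate and the utilization-weighted component power described above can be written out as a short sketch. The function names, the component list and every numeric value are illustrative assumptions; they are not the figures obtained from the 90nm/70nm layouts, and the way utilization scales dynamic power here is a simplification.

```python
def packet_energy(hops, flits_per_packet, energy_per_flit_per_router):
    """Generic estimate: every flit of the packet pays the router energy at every hop."""
    return hops * flits_per_packet * energy_per_flit_per_router


def router_power(components):
    """Average router power from per-component profiles.

    components maps a component name to a tuple
    (utilization, dynamic_power, leakage_power). Dynamic power is scaled by
    the fraction of time the component is active; leakage is always paid.
    """
    return sum(util * dynamic + leakage
               for util, dynamic, leakage in components.values())


if __name__ == "__main__":
    # Hypothetical profile in arbitrary units, for illustration only.
    profile = {
        "buffer":   (0.50, 10.0, 1.0),
        "crossbar": (0.25, 8.0, 0.5),
    }
    print("router power:", router_power(profile))
    print("packet energy:", packet_energy(hops=3, flits_per_packet=4,
                                          energy_per_flit_per_router=2.0))
```

Separating utilization from the component profile is what allows the same circuit-level numbers to be reused across workloads: only the utilization values change from one simulation to the next.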
Fault-tolerance
Reduced signal swing due to scaling of the supply voltage to minimize
energy consumption, shrinking geometries, increased wire density, faster
clock rates and higher integration will induce
transient and permanent failures resulting from accelerated aging effects,
manufacturing/testing challenges, process variation, crosstalk, power supply noise,
electromagnetic interference and leakage noise.
These noise sources will compromise the communication reliability of next-generation
SoCs.
Solving such problems is again a daunting task, since power consumption and
reliable communication, as well as performance and reliability, have conflicting requirements.
A comprehensive simulation tool
A simulation testbed is needed to conduct design space exploration, one that captures architectural details and an internal design conforming to the actual implementation. By integrating this simulator with the energy and reliability models, a comprehensive platform has been developed for multi-objective evaluation of SoC architectures.
[Figure: Simulation Tool]
Related Publications
ANCS 2005, DAC 2005, MICRO 2006, ISCA 2006, ISCA 2007, TPDS.
Memory-Based Solution
One of the biggest performance challenges for computer systems results from the conflict
between dramatically increasing CPU speeds and the enormous demands on memory and communication resources.
This temporal and spatial imbalance is becoming an ever more critical bottleneck. A memory-based system is an attempt
to solve this problem by combining processor macros with memory controllers. This paradigm shift can
provide large amounts of computing power in both scientific and media applications by supporting
high bandwidth and direct memory access. Fast processing in a simple processor core will
rely on fast access to memory (hence, for example, PIM) and on the ability
to support multiple simultaneous threads in
order to hide latency when communicating with remote processors or accessing remote data.
The design of 3D interconnects is closely intertwined with the design of the
architecture. We plan to explore the influence of alternate design styles for stacking
different components (processors, memory blocks, etc.) on top of each other and then
completely redesign individual components for multiple layers to improve communication
traffic across multiple layers and within a single layer.
3D Architectures
Single-chip systems that process video and audio streams demand highly integrated heterogeneous SoCs. I investigate multi-core processor architectures with generic processing elements (PEs), including DSPs and MCUs, and I plan to explore a set of dedicated processors for handling mission-critical tasks such as 3G baseband signal processing and encryption, real-time scheduling, and multimedia streaming. In building such a large SoC, NoC mapping and wire layout should place frequently communicating PEs close to each other. I will analyze parallel processing and stream processing, and develop a global hierarchy and a local pipelined organization as follows:
(i) Partitioning the overall functionality into several parallel tasks.
(ii) Performance characterization of interprocessor communication; designers can place and wire communicating operations in ways that minimize wire delay, minimize latency, and maximize bandwidth, considering that wire delays become relatively more significant with shrinking feature sizes and higher clock speeds.
(iii) Pipelined multimedia platform in sub-connections; the performance of a mobile multimedia SoC depends on efficiently streaming large amounts of variable data through the devices, and this data processing is pipelined by the dedicated processors.
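The idea of placing frequently communicating PEs close to each other can be made concrete with a simple cost function. The sketch below scores a placement on a 2D mesh as traffic volume times hop distance and, for very small systems, finds the cheapest placement by exhaustive search. The cost metric, the PE names and the traffic volumes are assumptions made for illustration, not the mapping method of this project.

```python
from itertools import permutations


def manhattan(a, b):
    """Hop distance between two (row, col) tiles on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_cost(traffic, placement):
    """Sum of traffic volume times hop distance over all communicating pairs.

    traffic:   {(src_pe, dst_pe): volume}
    placement: {pe: (row, col)}
    """
    return sum(volume * manhattan(placement[src], placement[dst])
               for (src, dst), volume in traffic.items())


def best_mapping(traffic, pes, tiles):
    """Exhaustive search; only practical for a handful of PEs."""
    best = None
    for perm in permutations(tiles, len(pes)):
        placement = dict(zip(pes, perm))
        cost = mapping_cost(traffic, placement)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


if __name__ == "__main__":
    # Hypothetical PEs and traffic volumes on a 2x2 mesh.
    pes = ["dsp", "mem", "mcu", "crypto"]
    tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
    traffic = {("dsp", "mem"): 100, ("mcu", "dsp"): 10, ("mcu", "crypto"): 5}
    print(best_mapping(traffic, pes, tiles))
```

For realistic core counts the search space grows factorially, so a heuristic such as greedy or simulated-annealing placement would replace the exhaustive loop while keeping the same cost function.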
[Figure: 3D Architectures]
As most embedded system applications are data-centric, a primary bottleneck in processor performance is communication with the memory subsystem. By allowing the processor and memory to be placed in adjacent layers, 3D chip design provides significant relief by reducing the communication latency.
Related Publications
ISCA 2007
Nanosystems Design
Single-walled carbon nanotubes (SWCNTs) have been proposed as a possible replacement for on-chip copper interconnects due to several favorable physical properties. SWCNTs are rolled graphitic sheets that can be either metallic or semiconducting depending on their chirality. With reported current densities as large as 10^9 A/cm^2, SWCNTs have significantly larger current-carrying capability than traditional metallic interconnects, which typically have current densities on the order of 10^5 A/cm^2. Moreover, due to their covalently bonded structure, carbon nanotubes are extremely resistant to electromigration and other sources of physical breakdown. While the advantages of SWCNTs are physical stability and electrical conductivity, the drawback is that individual nanotubes suffer from a large contact resistance that does not depend on the length of the nanotube. To alleviate this problem, bundles of SWCNTs connected in parallel have been proposed and physically demonstrated as a possible interconnect medium. Several studies have compared the relative performance of Cu and CNT bundles.
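Why bundling helps can be seen from a first-order resistance model: each tube contributes a length-independent contact term plus a length-dependent term, and N tubes in parallel divide the total by N. The sketch below is an assumption-laden illustration. The linear per-length term and the example values are hypothetical, and the quantum limit h/(4e^2), about 6.45 kOhm, is used only as an idealized lower bound for the length-independent term of a metallic SWCNT; real contacts add to it.

```python
# Exact SI values of the Planck constant and the elementary charge.
PLANCK_J_S = 6.62607015e-34
ELECTRON_CHARGE_C = 1.602176634e-19

# Idealized lower bound on the resistance of one metallic SWCNT: h / (4 e^2).
QUANTUM_RESISTANCE_OHM = PLANCK_J_S / (4 * ELECTRON_CHARGE_C ** 2)


def tube_resistance(length_um, contact_ohm, ohm_per_um):
    """Resistance of one nanotube: length-independent term plus a linear term."""
    return contact_ohm + ohm_per_um * length_um


def bundle_resistance(length_um, n_tubes, contact_ohm, ohm_per_um):
    """Resistance of n_tubes identical nanotubes connected in parallel."""
    if n_tubes < 1:
        raise ValueError("a bundle needs at least one tube")
    return tube_resistance(length_um, contact_ohm, ohm_per_um) / n_tubes


if __name__ == "__main__":
    print("quantum resistance (ohm):", QUANTUM_RESISTANCE_OHM)
    # Hypothetical 10 um wire with an assumed per-length resistance.
    for n in (1, 10, 100):
        print(n, bundle_resistance(10, n, QUANTUM_RESISTANCE_OHM, ohm_per_um=6500.0))
```

Because the contact term does not shrink with length, a single tube is a poor short wire; only the parallel combination brings the total down to values comparable with copper.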
Evolvable Hardware
Hardware Recycling and Self-Recovery System
The substitution of defective elements by healthy ones elsewhere in the system provides
a kind of virtual recycling bin, where functional components can be reused in other parts of
the implementation should the need arise. This scheme avoids the more traditional approach
in fault-tolerance, which resorts to replication of resources.
- Sub-component reuse; isolation minimum block
- Partial operation
- Virtualization
- Resource sharing
The cost of these techniques is minimal in terms of control logic, but some degradation
in performance is observed. Given that certain resources in hardware have similar roles which
are exercised at different time scales, faulty components can be bypassed by time multiplexing
the role of healthy components without significant impact on performance. The main driver
behind this effort is to fully utilize existing redundancies in architecture to combat reliability issues.
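The bypassing of faulty components by time multiplexing healthy ones can be pictured as a role reassignment. In the sketch below, each unit holds a list of roles, and the roles of a faulty unit are handed to the least-loaded healthy unit. The unit and role names and the least-loaded policy are hypothetical and serve only to illustrate the idea.

```python
def remap_roles(roles, faulty):
    """Move the roles of faulty units onto healthy units.

    roles:  {unit_name: [role, ...]} current assignment
    faulty: set of unit names that have failed
    Returns a new assignment that contains only healthy units. A unit that
    ends up with several roles must serve them by time multiplexing.
    """
    healthy = {unit: list(rs) for unit, rs in roles.items() if unit not in faulty}
    if not healthy:
        raise RuntimeError("no healthy unit left to take over")
    for unit in sorted(faulty):
        for role in roles.get(unit, []):
            # Pick the healthy unit with the fewest roles; ties broken by name.
            target = min(healthy, key=lambda u: (len(healthy[u]), u))
            healthy[target].append(role)
    return healthy


if __name__ == "__main__":
    # Hypothetical arbiters with similar roles in a router.
    assignment = {"arb0": ["vc_alloc"], "arb1": ["sw_alloc"], "arb2": ["route"]}
    print(remap_roles(assignment, faulty={"arb1"}))
```

The length of the longest role list is a crude proxy for the performance degradation, since that unit has to divide its time among all the roles it now serves.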
Decomposable Architectures and Fine-grained Parallelism
Design and analysis of System-on-Chip (SoC) architectures incorporating hundreds of functional units to solve real-world problems is an emerging and exciting research field. According to the International Technology Roadmap for Semiconductors (ITRS), by the end of the decade, SoCs designed in 45/32nm technology will have billions of transistors in hundreds of cores, and SoCs incorporating multiple technologies will dominate future semiconductor products. Functional blocks are based on decomposable structures that consist of a number of compact, distinct and independent modules, each operating within its own regime. Despite their independence, the modules work in unison to provide seamless operation when viewed from a system perspective. The goal of this design approach is to provide graceful degradation in the presence of failures, as well as energy conservation. Decomposable architectures exhibit inherent fault-tolerance, since a faulty module will not bring the whole system to a halt. Partial operation may continue in the healthy modules, which are impervious to the faulty module. By incorporating modularity in all components, decomposable architectures are also attractive for dealing with process variability (PV) problems.
[Figure: Experiment Tool]
Morphable Hardware
Extending the notion of decomposable architectures further, we envision the exploration of dynamically configurable structures. Hardware morphing can be triggered initially after manufacturing to cope with process variation and can also be performed periodically based on the monitored effects of aging mechanisms. Based on the probing findings, the system will morph its architecture to comply with the new underlying conditions. Morphing options that will be explored include dynamically truncating the size of components such as buffers to counter larger-than-nominal timing delays, activating additional pipeline stages to maintain frequency at the expense of additional latency, and activating booster buffers to maintain performance at the expense of increased power consumption.
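The morphing options listed above trade off different resources, which suggests a simple selection policy. The sketch below is purely illustrative: the decision order, the parameter names and the two boolean constraints are assumptions and do not describe an implemented controller.

```python
from enum import Enum


class Morph(Enum):
    NONE = "no change"
    EXTRA_PIPELINE_STAGE = "keep frequency, pay additional latency"
    BOOSTER_BUFFERS = "keep performance, pay additional power"
    TRUNCATE_BUFFER = "shrink component to shorten its timing path"


def choose_morph(measured_delay, clock_period, latency_critical, power_critical):
    """Pick a morphing option for a component whose delay was probed.

    measured_delay and clock_period share the same time unit. The two flags
    state which cost the running workload cannot afford (assumed inputs).
    """
    if measured_delay <= clock_period:
        return Morph.NONE                   # still meets timing
    if not latency_critical:
        return Morph.EXTRA_PIPELINE_STAGE   # latency can be sacrificed
    if not power_critical:
        return Morph.BOOSTER_BUFFERS        # power can be sacrificed
    return Morph.TRUNCATE_BUFFER            # neither can: shrink the component


if __name__ == "__main__":
    print(choose_morph(1.0, 1.2, latency_critical=True, power_critical=True))
    print(choose_morph(1.4, 1.2, latency_critical=False, power_critical=True))
    print(choose_morph(1.4, 1.2, latency_critical=True, power_critical=False))
    print(choose_morph(1.4, 1.2, latency_critical=True, power_critical=True))
```

Running the same policy once after manufacturing and then periodically matches the two triggers described in the text: process variation and monitored aging.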