CANDL



This research is supported by Georgia Tech Research Foundation.


System-on-Chip (SoC) architectures are increasingly used in a diverse set of applications. By enabling efficient component reuse and platform-based design, true SoC design strategies and methodologies would dramatically reduce production cost and broaden functionality.
As single-chip system design becomes a reality with advances in deep sub-micron technology, managing the complexity of designs with billions of transistors demands ingenious design strategies. Intel recently developed an 80-core chip (the TeraFLOP Chip). The crux of an SoC design embodying hundreds of functional units is the on-chip network architecture and the orchestration of parallelism, which are becoming critical bottlenecks in meeting the performance and power budgets of the chip. They are also expected to play crucial roles in dictating IP block reuse and design time.
Figure: TeraFLOP chip, Intel Developer Forum (IDF)
These Networks-on-Chip (NoC) and multiple data stream designs are required not only to provide ultra-low latency, but also to occupy a small footprint and consume as little energy as possible. Further, reliability is rapidly becoming a major challenge in nanoscale technologies due to the increased prominence of permanent faults resulting from accelerated aging effects and manufacturing/testing challenges. Toward the goal of designing area-constrained, low-latency, energy-efficient and reliable on-chip communication networks, we have developed simulation infrastructure and prototype designs to study scalable multicore SoC architectures. The modeled architectures are evaluated to demonstrate the effectiveness of NoCs and their energy-reliability-performance tradeoffs. The architecture was implemented in structural Register-Transfer Level (RTL) Verilog and then synthesized in Synopsys Design Compiler using a TSMC 90 nm cell library and 70 nm CMOS process technology.

SoC Development and Multicore Architecture

SoC designs should be able to provide integrated solutions to increasing circuit complexity at the optimum production cost, while reducing the time-to-market.
IBM Cell Processor
The semiconductor industry predicts that while manufacturing complex SoCs will be feasible at least down to the 45/32 nm technology scale, the cost of developing and implementing a comprehensive design will rise steeply. The economics of SoC design are simply not appealing in the ASIC framework because of the dramatically increasing cost of SoC design and manufacturing. To develop an economically feasible SoC design flow, the platform-based design (PBD) methodology has been proposed; it integrates IP core reuse, hardware/software co-design, and higher levels of abstraction.

This approach, studied at the GigaScale Research Center (GSRC), an inter-university research consortium sponsored by the SIA and DARPA, is appealing for both development and manufacturing strategies in terms of performance, functionality and product-volume constraints. A design flow that incorporates numerous functional blocks thus requires a structured methodology, the natural progression toward higher-level abstraction and partitioning, which is helpful for both hardware and software design.


Figures: Billions of Transistors Era; Evolution of Design Abstract

Platform-based design builds on stable microprocessor-based architectures, which are used as flexible design templates. These templates can be quickly extended, customized and delivered for a diverse set of applications by configuring, revising, or programming some components. By adopting such an IP and system-level reuse methodology, the whole system can be seen as a set of blocks designed with different methodologies and for different purposes, namely component reusability and stand-alone component design. A diverse set of components and modules can be ported and reused from previous designs according to the following features.

Modular design

Reuse concerns all additional activities that have to be performed to generate an easy-to-use and flexible module. It is based on a hierarchical approach that partitions a system into modules, which requires compatibility and consistency. Proper system partitioning allows independence between the design of different modules. The decomposition is generally guided by structuring rules aimed at hiding local design decisions in such a way that only the interface of each module is visible. This methodology is also called "modular design", and it consists of sound design rules in terms of timing constraints, hierarchical design and floor-planning. The overall modular approach optimizes the insertion of reusable components within the circuit.

Communication infrastructure

The modular design approach and component reuse make it necessary to address the complex communication requirements of such SoCs. The SoC communication infrastructure, also called the on-chip network, therefore facilitates the development and deployment of reusable system components. Since wire delays for crossing the die approach tens of clock cycles, a network module using multiple packets and high-speed links is becoming a reality. This structured communication infrastructure yields reduced die size and cost with higher wire efficiency. The components of an on-chip network, such as the switching fabric, link circuitry, buffers, control logic, and interfaces, which are designed to be compatible with heterogeneous and homogeneous processing elements (PEs), should be interoperable and reusable.


Figures: Global Wire Scaling Problem; Chip Cross Section

Plug-and-play

If chip designers and system architects can effectively build an on-chip communication infrastructure for connecting many discrete building blocks, it would enable plug-and-play of IP reuse, and would scale with new generations of process technology.



Chip Design Revolution: Network Centric Architectures


On-Chip Networks

As a result of the increasing degree of integration on a silicon die, SoC design is increasingly seen as the design of communication architectures for an exceedingly high number of pre-designed computational and storage blocks. This communication-centric SoC is a new design paradigm suited to many applications: it favors thread-level parallelism (TLP) over instruction-level parallelism (ILP) and relies on lightweight parallel processing agents.



On-Chip Networks: Rethinking Systems and Scalability

Network-on-chip (NoC) architectures are expected to be crucial in implementing complex and function-rich chips for platform based design and manufacturing capabilities. As chip complexity continues to increase, a more systematic approach is required to effectively transport and manage on-chip traffic, optimize wire utilization and allow designs to scale down in size, complexity, and component reuse without compromising performance and reliability. Although a few researchers have investigated the NoC design space, a systematic, system wide design paradigm is yet to evolve.

Area-equivalent design

An essential but difficult step in designing an on-chip interconnect is to decide how much area should be devoted to the communication substrate. This is a critical design parameter since one must strike a balance between the processing tiles and the network to optimize performance and power consumption. A poor layout may result in underutilization of either the PEs or the network. Moreover, the area allocated to the network dictates many of the design choices. Earlier studies have devoted anywhere from 7% to as much as 40-50% of the area to the on-chip network. I therefore conduct an in-depth analysis of the on-chip network area requirement. Deciding this area requirement is challenging since it depends on the application communication characteristics, network topology, expected performance, and technology constraints.
I investigate the network area requirement for a wide class of applications by varying the network load and observing the network switching activity needed to satisfy the required performance goal. For a given silicon budget in general-purpose multicore architectures, I have analyzed the buffer requirements, the placement of buffers at different ports, top-level wiring constraints, and many other CMOS implementation issues to obtain realistic area and timing parameters.

On-Chip network workload characterization

Since SoCs are ideal candidates for dedicated applications, communication patterns of a wide range of applications should be used for analyzing the underlying interconnect design. Most prior studies have used only synthetic workloads, which may not represent the actual communication behavior.
Multi-FPGA Board for Our Processor Core Emulation
Only a handful of researchers have used the characteristics of real applications, such as multimedia, DSP and scientific applications including bioinformatics. For multimedia applications, researchers observed a self-similar traffic pattern and showed that it differs from short-range-dependent autoregressive or Markovian processes. Performance analysis of NoCs with self-similar traffic is thus essential to understand the design tradeoffs for multimedia applications. Similarly, other applications may exhibit specific properties that I would like to use in my performance analysis. In my research, I analyze a variety of applications to understand their characteristics in terms of packet arrival distribution, packet size and destination distribution. Such an analysis would aid in developing a benchmark suite for conducting performance analysis.
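As a small illustration of the three characteristics named above (packet arrival, packet size and destination distribution), the sketch below summarizes a packet trace. The trace format, a list of (arrival_cycle, dest_node, size_flits) tuples, and the function name are assumptions made for this example; they are not taken from the lab's tools.

```python
from collections import Counter
from statistics import mean


def characterize(trace):
    """Summarize a packet trace.

    trace: list of (arrival_cycle, dest_node, size_flits) tuples.
    This tuple layout is a hypothetical format chosen for the example.
    """
    trace = sorted(trace)  # order packets by arrival cycle
    arrivals = [t for t, _, _ in trace]
    # Inter-arrival gaps between consecutive packets.
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    sizes = Counter(size for _, _, size in trace)
    dests = Counter(dest for _, dest, _ in trace)
    n = len(trace)
    return {
        "mean_interarrival": mean(gaps) if gaps else None,
        "size_distribution": {s: c / n for s, c in sizes.items()},
        "dest_distribution": {d: c / n for d, c in dests.items()},
    }


if __name__ == "__main__":
    demo = [(0, 1, 4), (10, 2, 4), (20, 1, 8), (30, 1, 4)]
    print(characterize(demo))
```

A mean inter-arrival time alone cannot reveal self-similarity; detecting it would require looking at the traffic over several time scales, which this sketch does not attempt.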


Analytical model

Most prior on-chip interconnect analyses are based on time-consuming simulation models. To provide fast performance estimates during the design cycle, I have developed a queuing-theory-based model for quantifying the performance and energy behavior of on-chip networks in [ANCS 2005]. Although traditional wormhole-switched off-chip networks have been analyzed extensively, analytical models for NoCs that consider detailed architectural artifacts are almost nonexistent. My model differs from previous work in that I compute the average delay due to path contention, virtual channel arbitration and crossbar switch arbitration using a queuing-theory approach, which I believe can capture the blocking phenomena of wormhole switching quite accurately. I first developed a performance model for a generic wormhole-switched router. The model was then used to estimate power consumption by estimating the utilization of the router components and multiplying it by component-level power profiles obtained from actual circuit-level synthesis. Comparison with simulation results indicates that the proposed analytical model is quite accurate and can be used as an efficient design tool. An extension of the model also demonstrates its utility for fault-tolerance studies. The model will be extended to capture other architectures and combined performance, energy and fault-tolerance evaluation.
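To give a feel for how a queuing-based estimate works, here is a deliberately simplified sketch. It is not the model from [ANCS 2005]: it treats every hop as an independent M/M/1 queue, ignores virtual channels and crossbar arbitration, and assumes a 2-cycle router delay. All function names and default values are hypothetical.

```python
def mm1_wait(arrival_rate, service_time):
    """Mean waiting time in queue for an M/M/1 server.

    arrival_rate: packets per cycle offered to the channel.
    service_time: mean cycles the channel is held per packet.
    """
    rho = arrival_rate * service_time  # channel utilization
    if not 0 <= rho < 1:
        raise ValueError("channel is saturated (utilization >= 1)")
    return rho * service_time / (1.0 - rho)


def packet_latency(hops, flits, arrival_rate, router_delay=2):
    """Rough average packet latency in cycles.

    Assumes each hop adds the router pipeline delay plus the M/M/1
    contention wait, and that the body flits follow the header one
    cycle apart (serialization of flits - 1 cycles).
    """
    # A channel is assumed to be held for one packet's worth of flits.
    wait = mm1_wait(arrival_rate, service_time=flits)
    return hops * (router_delay + wait) + (flits - 1)


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.1, 0.2):
        print(rate, round(packet_latency(hops=3, flits=4, arrival_rate=rate), 2))
```

Even this crude form shows the behavior an analytical model is meant to capture quickly: latency stays near its zero-load value at low injection rates and grows sharply as channel utilization approaches one.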

Performance enhancement techniques

NoC latency impacts the performance of many on-chip applications. As technology scales down, we expect more area to become available for implementing performance enhancement techniques in the network. For example, one could use more virtual channels (VCs), sophisticated scheduling schemes, adaptive routing algorithms using routing tables, and more efficient flow control mechanisms. NoC designs have targeted the minimization of message latency by optimizing the intra-node delay and by using an organized wiring layout with regular topologies. With this objective, I have proposed a low-latency on-chip router supporting adaptivity (a Path-Sensitive Router architecture) and a gracefully degrading, energy-efficient modular router (a Row-column Decoupled Switch Architecture); the design consists of a two-stage pipelined model with look-ahead routing and path-sensitive-guided flit queuing.
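The idea behind look-ahead routing can be sketched in a few lines: the current router already knows its own output port (it arrived in the flit header) and computes the port the flit will need at the next router, so route computation is taken off the critical path. The sketch below assumes a 2D mesh and deterministic XY routing purely for illustration; it does not reproduce the adaptive Path-Sensitive Router, and every name in it is hypothetical.

```python
# Coordinate change produced by leaving through each port (assumed 2D mesh).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1), "LOCAL": (0, 0)}


def xy_port(cur, dst):
    """Dimension-order (XY) routing: correct X first, then Y."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "LOCAL"


def lookahead(cur, dst):
    """Return (port used at this router, port precomputed for the next router)."""
    now = xy_port(cur, dst)
    sx, sy = STEP[now]
    next_router = (cur[0] + sx, cur[1] + sy)
    return now, xy_port(next_router, dst)


if __name__ == "__main__":
    print(lookahead((0, 0), (1, 2)))  # the flit leaves east, then turns north
```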

Energy model

As many SoCs are targeted at embedded applications and portable devices, a purely performance-driven design is not adequate for such systems. Energy-efficient design is equally important to minimize power usage. The generic estimate relates the energy consumed by each packet to the number of hops traversed, times the number of flits per packet, times the energy consumed by each flit per router. I took this estimate a step further, decomposing the router into individual components, which we laid out in 90nm/70nm technology at a given power supply voltage and clock frequency. Our team used the Berkeley Predictive Technology Model and measured the average dynamic power using 50% switching activity on the inputs. Additionally, we measured the leakage power of each component when no input activity is recorded. We imported these values into our architectural-level cycle-accurate NoC simulator and simulated all individual components in unison to estimate both the dynamic and leakage power of routing a flit. Doing so, I was able to identify the individual utilization rate of each component and thus accurately model the leakage and dynamic power consumption.
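The two levels of estimation described above can be written down compactly. The first function is the generic per-packet estimate (hops, times flits, times energy per flit per router); the second weights each component's dynamic power by its utilization and adds leakage, which is drawn regardless of activity. The component names and all numeric values are placeholders invented for the example, not measured data.

```python
def packet_energy(hops, flits, energy_per_flit_hop):
    """Generic estimate: every flit pays the per-router energy at each hop."""
    return hops * flits * energy_per_flit_hop


def router_power(components):
    """Utilization-weighted router power.

    components maps a component name to a tuple
    (utilization in [0, 1], dynamic power when active, leakage power).
    Dynamic power scales with utilization; leakage is always present.
    """
    return sum(u * dyn + leak for u, dyn, leak in components.values())


if __name__ == "__main__":
    # Placeholder numbers in arbitrary units, for illustration only.
    router = {
        "buffer": (0.5, 2.0, 0.25),
        "crossbar": (0.25, 4.0, 0.5),
    }
    print(packet_energy(hops=3, flits=4, energy_per_flit_hop=0.5))
    print(router_power(router))
```

Separating utilization from the per-component power profile is what lets an architectural simulator supply the first and circuit-level synthesis supply the second.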

Fault-tolerance

Reduced signal swing due to scaling of the supply voltage to minimize energy consumption, shrinking geometry, increased wire density, faster clock rates and higher integration will induce transient and permanent failures. These failures result from accelerated aging effects, manufacturing/testing challenges, process variation, crosstalk, power supply noise, electromagnetic interference and leakage noise. Such noise sources will compromise the communication reliability of next-generation SoCs. Solving these problems is again a daunting task, since power consumption and reliable communication, as well as performance and reliability, have conflicting requirements.

A Comprehensive simulation tool

I have built a simulation testbed for design space exploration that captures architectural details and an internal design conforming to the actual implementation. By integrating this simulator with the energy and reliability models, a comprehensive platform has been developed for multi-objective evaluation of SoC architectures.
Simulation Tool

Related Publications

ANCS 2005, DAC 2005, MICRO 2006, ISCA 2006, ISCA 2007, TPDS.


Memory-Based Solution

One of the biggest performance challenges for computer systems results from the conflict between dramatically increasing CPU speeds and the enormous demand for memory and communication resources. This temporal and spatial imbalance is becoming an ever more critical bottleneck. A memory-based system attempts to solve this problem by combining processor macros with memory controllers. This paradigm shift can provide large amounts of computing power in both scientific and media applications by supporting high bandwidth and direct memory access. Fast processing in a simple processor core will rely on fast access to memory (hence, for example, PIM) and on supporting multiple simultaneous threads in order to hide latency when communicating with remote processors or accessing remote data.
Design decisions for 3D interconnects are closely intertwined with the design of the architecture. We plan to explore the influence of alternate design styles for stacking different components (processors, memory blocks, etc.) on top of each other, and then to completely redesign individual components for multiple layers to improve communication traffic across multiple layers and within a single layer.


3D Architectures

A single chip that processes video and audio streams demands a highly integrated heterogeneous SoC. I investigate multi-core processor architectures with generic Processing Elements (PEs) including DSPs and MCUs, and I plan to explore a set of dedicated processors for handling mission-critical tasks such as 3G baseband signal processing and encryption, real-time scheduling, and multimedia streams. In building such a large SoC, NoC mapping and wire layout should place frequently communicating PEs close to each other. I will analyze parallel processing and stream processing, and develop a global hierarchy and a local pipelined organization as follows:

(i) Partitioning the overall functionality into several parallel tasks.
(ii) Performance characterization of interprocessor communications; designers can place and wire communicating operations in ways that minimize wire delay, minimize latency, and maximize bandwidth, considering that wire delays become relatively more significant with shrinking feature sizes and rising clock speeds.
(iii) Pipelined multimedia platform in sub-connections; the performance of a mobile multimedia SoC depends on efficiently streaming large amounts of variable data through the devices, and this data processing is pipelined by the dedicated processors.
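The goal of placing frequently communicating PEs close to each other can be expressed as a cost to minimize: traffic volume times hop distance, summed over all communicating pairs. The sketch below assumes a 2D mesh with Manhattan distance and searches exhaustively, which is only practical for a handful of PEs; larger designs would need a heuristic. The PE names and traffic volumes are invented for illustration.

```python
from itertools import permutations


def comm_cost(placement, traffic):
    """Sum of traffic volume x Manhattan distance over all PE pairs."""
    cost = 0
    for (a, b), volume in traffic.items():
        (ax, ay), (bx, by) = placement[a], placement[b]
        cost += volume * (abs(ax - bx) + abs(ay - by))
    return cost


def best_placement(pes, tiles, traffic):
    """Try every assignment of PEs to tiles and keep the cheapest one."""
    best = None
    for perm in permutations(tiles, len(pes)):
        placement = dict(zip(pes, perm))
        cost = comm_cost(placement, traffic)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


if __name__ == "__main__":
    pes = ["dsp", "mcu", "crypto", "video"]  # hypothetical PEs
    tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]  # 2x2 mesh
    traffic = {("dsp", "video"): 10, ("dsp", "mcu"): 5, ("mcu", "crypto"): 1}
    cost, placement = best_placement(pes, tiles, traffic)
    print(cost, placement)
```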


3D NoC

3D Architectures

As most embedded-system applications are data-centric, a primary bottleneck in processor performance is communication with the memory subsystem. By allowing the processor and memory to be placed in adjacent layers, 3D chip design provides significant relief by reducing the communication latency.

Related Publications

ISCA 2007

Nanosystems Design

Single-walled carbon nanotubes (SWCNTs) have been proposed as a possible replacement for on-chip copper interconnects due to several favorable physical properties. SWCNTs are rolled graphitic sheets that can be either metallic or semiconducting depending on their chirality. With reported current densities as large as 10^9 A/cm^2, SWCNTs have a significantly larger current-carrying capability than traditional metallic interconnects, which typically have current densities on the order of 10^5 A/cm^2. Moreover, due to their covalently bonded structure, carbon nanotubes are extremely resistant to electromigration and other sources of physical breakdown. While the advantages of SWCNTs are physical stability and electrical conductivity, the drawback is that individual nanotubes suffer from a large contact resistance that does not depend on the length of the nanotube. To alleviate this problem, bundles of SWCNTs connected in parallel have been proposed and physically demonstrated as a possible interconnect medium. Several studies have compared the relative performance of Cu and CNT bundles.



Recent comprehensive studies show that the performance of CNT bundles is influenced by various factors and that well-designed CNT bundles offer performance and power benefits for medium and long interconnects in an architecture. CNT bundles offer lower ohmic resistance than Cu wires of equivalent length. Several physical and structural factors, such as the bundle dimensions and the individual CNT diameter, greatly impact the relative improvement over copper.
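A first-order calculation shows why bundling helps and why the benefit appears at longer lengths. The sketch uses a commonly quoted simplified model in which one metallic SWCNT has a length-independent part (contact resistance plus the quantum resistance h/(4e^2), about 6.45 kOhm) and a part that grows with length over the electron mean free path; N tubes in parallel divide the total by N. The contact resistance, mean free path and copper resistivity defaults are assumed round numbers for illustration, not values taken from the studies mentioned above.

```python
R_QUANTUM = 6.45e3  # ohms, h / (4 e^2) for one metallic SWCNT


def swcnt_resistance(length_um, contact_ohm=20e3, mfp_um=1.0):
    """Resistance of one tube: length-independent part plus scattering part.

    contact_ohm and mfp_um are assumed illustrative values.
    """
    return contact_ohm + R_QUANTUM * (1.0 + length_um / mfp_um)


def bundle_resistance(length_um, n_tubes, **tube_params):
    """N identical conducting tubes connected in parallel."""
    return swcnt_resistance(length_um, **tube_params) / n_tubes


def copper_resistance(length_um, width_nm, height_nm, resistivity_ohm_m=2.2e-8):
    """Plain R = rho * L / A for a rectangular wire (assumed resistivity)."""
    area_m2 = (width_nm * 1e-9) * (height_nm * 1e-9)
    return resistivity_ohm_m * (length_um * 1e-6) / area_m2


if __name__ == "__main__":
    for length in (1, 10, 100):
        print(length,
              round(bundle_resistance(length, n_tubes=1000), 1),
              round(copper_resistance(length, width_nm=100, height_nm=200), 1))
```

Because the contact term is paid once regardless of length, it dominates short wires; this is consistent with the observation that bundles pay off mainly for medium and long interconnects.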


Evolvable Hardware

Hardware Recycling and Self-Recovery System

The substitution of defective elements by healthy ones elsewhere in the system provides a kind of virtual recycling bin, where functional components can be reused in other parts of the implementation should the need arise. This scheme avoids the more traditional approach in fault-tolerance, which resorts to replication of resources.

    Sub-component reuse; isolation minimum block
    Partial operation
    Virtualization
    Resource sharing

The cost of these techniques is minimal in terms of control logic, but some degradation in performance is observed. Given that certain hardware resources have similar roles that are exercised at different time scales, faulty components can be bypassed by time-multiplexing the role of healthy components without significant impact on performance. The main driver behind this effort is to fully utilize existing redundancies in the architecture to combat reliability issues.

Decomposable Architectures and Fine-grained Parallelism

Design and analysis of System-on-Chip (SoC) architectures incorporating hundreds of functional units to solve real-world problems is an emerging and exciting research field. Based on the International Technology Roadmap for Semiconductors (ITRS), by the end of the decade SoCs designed in 45/32nm technology will have billions of transistors in hundreds of cores, and SoCs incorporating multiple technologies will dominate future semiconductor products. Functional blocks are based on decomposable structures that consist of a number of compact, distinct and independent modules, each operating within its own regime. Despite their independence, the modules work in unison to provide seamless operation when viewed from a system perspective. The goal of this design approach is to provide graceful degradation in the presence of failures, along with energy conservation. Decomposable architectures exhibit inherent fault tolerance, since a faulty module will not bring the whole system to a halt. Partial operation may continue in the healthy modules, which are impervious to the faulty module. By incorporating modularity in all components, decomposable architectures are also attractive for dealing with process variability (PV) problems.



Experiment Tool

Morphable Hardware

Extending the notion of decomposable architectures further, we plan to explore dynamically configurable structures. Hardware morphing can be triggered initially after manufacturing to cope with process variation, and also performed periodically based on the monitored effects of aging mechanisms. Based on the probing findings, the system will morph its architecture to comply with the new underlying conditions. Morphing options that will be explored include dynamically truncating the size of components such as buffers to counter larger-than-nominal timing delays, activating additional pipeline stages to maintain frequency at the expense of additional latency, and activating booster buffers to maintain performance at the expense of increased power consumption.