Nam Khanh Dang, “Development of On-Chip Communication Fault-Resilient Adaptive Architectures and Algorithms for 3D-IC Technologies (3次元IC技術のための適応型耐障害チップ内通信アーキテクチャとアルゴリズムの開発)”, Ph.D. Thesis, Graduate School of Computer Science and Engineering, The University of Aizu, September 2017. [thesis.pdf] [slides.pdf]
Research Advisor: Prof. Abderazek Ben Abdallah
Multicore processing is predicted to be the backbone of future complex embedded architectures. By distributing tasks across multiple processing elements, the system’s frequency and operating voltage can be reduced, which lowers total power consumption. However, because of the high complexity of their organization, communication, and operation, multicore systems place pressing demands on scalability, bandwidth, and power efficiency. Notably, wires have overtaken gates as the dominant source of delay in the deep sub-micron era. Consequently, the power consumed by additional buffers and wires is considered a critical obstacle. Moreover, conventional communication paradigms (e.g., bus, point-to-point) also encounter several scalability and latency issues. In the past few years, the benefits of 3D Integrated Circuits (3D-ICs) and mesh-based Networks-on-Chip (NoCs) have been fused into a promising architecture called 3D-Network-on-Chip (3D-NoC). The scalability and parallelism of NoCs can be enhanced in the third dimension thanks to the short wire length and low power consumption of 3D-IC interconnects. As a result, the 3D-NoC paradigm is considered one of the most advanced and promising architectures for future IC designs, as it is capable of providing extremely high bandwidth, efficient scalability, and low-power interconnects. While the 3D-NoC paradigm has been gaining popularity, with several commercial chips, it is threatened by the decreasing reliability of aggressively scaled transistors as they approach fundamental physical limits. In deep sub-micron processes, gates have become more vulnerable to soft errors, which can corrupt the operation of control logic and buffers in NoC routers and thus lead to chip failure. 
In addition, low supply voltages enforce a very narrow noise margin, which makes the architecture more vulnerable and more sensitive to faults. In particular, hard faults, both permanent and intermittent, can occur during the manufacturing stage or under specific operating circumstances. Because intermittent faults do not permanently damage a given component, they can pass through several testing stages but still cause operating failures. Furthermore, by shifting to 3D-ICs, 3D-NoCs face a new major challenge: the high probability of Through-Silicon Via (TSV) defects. With high defect rates and the clustering effect, TSVs need a proper fault-tolerance methodology to ensure overall reliability. With all of these failure sources combined, the reliability of 3D-NoCs is expected to be one of the most critical issues in future System-on-Chip (SoC) designs. Because of the numerous types of faults, many studies have proposed solutions for individual aspects of on-chip reliability; however, a comprehensive approach to NoC reliability that encompasses soft errors, hard faults, and TSV defects has yet to emerge. In addition, error detection and diagnosis in 3D-NoC architectures have been studied thoroughly in the scope of offline testing. However, with soft errors and intermittent faults becoming a dominant failure mode in modern NoCs and VLSI systems in general, widespread deployment of online testing approaches has become crucial. In addition to the variety and complexity of failure modes, the rapid development of fault tolerance for NoCs faces a new challenge: NoC reliability needs to be evaluated and quantitatively assessed in the early design stages. Most existing evaluation methodologies use a simple fault-insertion and correctness-verification method. Such a method only ensures the functionality of a given technique. 
Moreover, this type of evaluation requires the complete design to be finished first, which carries the risk of significant redesign time. To solve this issue, early reliability assessment is needed. After the performance requirements are satisfied in the early assessment stage, the reliability of the design also needs to be fully simulated and analyzed. Starting from the above facts, this research develops fault-resilient adaptive architectures and algorithms for 3D-ICs, especially for 3D-NoC-based systems. With the aid of efficient detection, diagnosis, and recovery mechanisms and algorithms, the proposed system is capable of detecting and recovering from soft errors occurring in the routing pipeline stages. It also leverages configurable components to handle permanent faults in links, input buffers, and crossbars by adopting techniques from our previous work. To integrate these hard-fault tolerance techniques, a detection, diagnosis, and recovery mechanism is proposed. This mechanism analyzes the transmitting operation and its failure state to determine the fault’s position. Based on the position of the fault, it issues signals to handle it. Moreover, this work also proposes a dedicated fault-tolerant technique for TSV-cluster defects, since TSVs are the most vulnerable components of 3D-IC technology. From another important perspective, this work presents a reliability assessment platform for NoC systems. An analytical model is used to help designers quickly estimate the efficiency of potential fault-tolerant schemes. The result of this assessment indicates the reliability enhancement provided by the evaluated technique. Later, the complete architecture is put through a netlist-based simulation process to estimate other results. The reliability assessment, the fault-tolerant architecture, and the algorithms are integrated into a Design for Reliability flow. 
In this flow, the analytical model supports the early assessment of the proposed fault-tolerant techniques, and the netlist simulations are conducted to confirm the reliability of the design. The final goal of this dissertation is to propose a comprehensive fault-resilient architecture, algorithms, and a design methodology for the development of highly reliable 3D-NoC systems. In addition to providing fault-tolerance techniques that deal with soft errors, hard faults, and TSV defects, a working flow is also presented. The complete design stages are provided to help designers understand their proposals, approach the fault-tolerance challenge, and complete a robust and graceful design.
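As a rough illustration of what an early, analytical reliability estimate can look like, the following minimal sketch compares a single component with a redundant group of identical copies. It is not the analytical model developed in the thesis. It assumes a constant failure rate (exponential lifetime), independent failures, and active redundancy without repair; the function names and the failure-rate value are hypothetical and chosen only for illustration.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Probability that one component survives until time t.

    Assumption: constant failure rate, i.e. R(t) = exp(-lambda * t).
    """
    return math.exp(-failure_rate * t)


def redundant_reliability(failure_rate: float, t: float, copies: int) -> float:
    """Reliability of `copies` identical, independent components in parallel.

    The group fails only when every copy has failed.
    """
    r = reliability(failure_rate, t)
    return 1.0 - (1.0 - r) ** copies


def redundant_mttf(failure_rate: float, copies: int) -> float:
    """Mean time to failure of the parallel group.

    For n independent exponential components without repair:
    MTTF = (1 / lambda) * (1 + 1/2 + ... + 1/n).
    """
    return sum(1.0 / k for k in range(1, copies + 1)) / failure_rate


if __name__ == "__main__":
    lam = 1e-6  # hypothetical failure rate in failures per hour
    for n in (1, 2, 3):
        print(n, redundant_reliability(lam, 1e5, n), redundant_mttf(lam, n))
```

Under these assumptions, the ratio of the redundant MTTF to the baseline MTTF gives a quick first indication of how much a spare-based scheme could improve reliability before any netlist-level simulation is run.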
- Khanh N. Dang, Akram Ben Ahmed, Yuichi Okuyama, and Abderazek Ben Abdallah, “Scalable design methodology and online algorithm for TSV-cluster defects recovery in highly reliable 3D-NoC Systems”, IEEE Transactions on Emerging Topics in Computing, Special Issue on Reliability-aware Design and Analysis Methods for Digital Systems: from Gate to System Level, 2017 (in press). doi: 10.1109/TVLSI.2017.2736004
- Khanh N. Dang, Akram Ben Ahmed, Xuan-Tu Tran, Yuichi Okuyama, and Abderazek Ben Abdallah, “A Comprehensive Reliability Assessment of Fault-Resilient Network-on-Chip Using Analytical Model”, IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2017 (in press). doi: 10.1109/TVLSI.2017.2736004
- Khanh N. Dang, Michael Meyer, Yuichi Okuyama, and Abderazek Ben Abdallah, “A Low-overhead Soft-Hard Fault Tolerant Architecture, Design and Management Scheme for Reliable High-performance Many-core 3D-NoC Systems”, Journal of Supercomputing, Volume 73, Issue 6, pp. 2705–2729, 2017. doi: 10.1007/s11227-016-1951-0
- Khanh N. Dang, Michael Meyer, Yuichi Okuyama, and Abderazek Ben Abdallah, “Reliability Assessment and Quantitative Evaluation of Soft-Error Resilient 3D NoC System”, 25th IEEE Asian Test Symposium (ATS’16), November 21-24, 2016.
- Khanh N. Dang, Yuichi Okuyama, and Abderazek Ben Abdallah, “Soft-Error Resilient Network-on-Chip for Safety-Critical Applications”, 2016 IEEE International Conference on Integrated Circuit Design and Technology (ICICDT), June 27-29, 2016.
- Khanh N. Dang, Michael Meyer, Yuichi Okuyama, Abderazek Ben Abdallah, and Xuan-Tu Tran, “Soft-Error Resilient 3D Network-on-Chip Router”, Proc. of IEEE 7th International Conference on Awareness Science and Technology (iCAST 2015), pp. 84-90, Sep. 22-24, 2015.