Scaling a Reconfigurable Dataflow Accelerator
Author | : Yaqi Zhang |
Publisher | : |
Total Pages | : |
Release | : 2020 |
ISBN-10 | : OCLC:1157942980 |
ISBN-13 | : |
Rating | : 4/5 (80 Downloads) |
Download or read book Scaling a Reconfigurable Dataflow Accelerator written by Yaqi Zhang and published by . This book was released on 2020 with total page pages. Available in PDF, EPUB and Kindle. Book excerpt: With the slowdown of Moore's Law, specialized hardware accelerators are gaining traction for delivering 100-1000x performance improvement over general-purpose processors in a variety of applications domains, such as cloud computing, biocomputing, artificial intelligence, etc. As the performance scaling in multicores is coming to a limit, a new class of accelerators--reconfigurable dataflow architectures (RDAs)--offers high-throughput and energy-efficient acceleration that keeps up with the performance demand. Instead of dynamically fetching instructions as traditional processors do, RDAs have flexible data paths that can be statically configured to spatially parallelize and pipeline programs across distributed on-chip resources. The pipelined execution model and explicitly-managed scratchpad in RDAs eliminate the performance, area, and energy overhead from dynamic scheduling and conventional memory hierarchy. To adapt to the compute intensity in modern data-analytic workloads particularly in the deep learning domain, RDAs have increased to an unprecedented scale. With an area footprint of 133mm^2 at 28nm, Plasticine is a previously proposed large-scale RDA supplying 12.3 TFLOPs of computing power. Prior work has shown an up to 76x performance/watt benefit from Plasticine over a Stradix V FPGA due to an advantage in clock frequency and resource density. The increase in scale introduces new challenges in network-on-chip design to maintain the throughput and energy efficiency of an RDA. Furthermore, targeting and managing RDAs at this scale require new strategies in mapping, memory management, and flexible control to fully utilize their compute power. In this work, we focus on two aspects of the software-hardware co-design that impact the usability and scalability of the Plasticine accelerator. Although RDAs are flexible to support a wide range of applications, the biggest challenge that hinders the adoption of these accelerators is the required low-level knowledge in microarchitecture design and hardware constraints in order to efficiently map a new application. To address this challenge, we introduce a compiler stack--SARA--that raises the programming abstraction of Plasticine to an imperative-style domain-specific language with nested control flow for general spatial architectures. The abstraction is architecture-agnostic and contains explicit loop constructs that enable cross-kernel optimizations often not exploited on RDAs. SARA efficiently translates imperative control constructs to a streaming dataflow graph that scales performance with distributed on-chip resources. By virtualizing resources, SARA systematically handles hardware constraints, hiding the low-level architecture-specific restrictions from programmers. To address the scalability challenge with increasing chip size, we present a comprehensive study on the network-on-chip design space for RDAs. We found that network performance highly correlates with bandwidth instead of latency for RDAs with a streaming dataflow execution model. Lastly, we show that a static-dynamic hybrid network design can sustain performance in a scalable fashion with high energy efficiency.