
BR100 GPGPU Accelerating Datacenter Scale AI Computing

2022 IEEE Hot Chips 34 Symposium (HCS) | 978-1-6654-6028-6/22/$31.00 ©2022 IEEE | DOI: 10.1109/HCS55958.2022.9895604
壁仞™ (Biren Technology)
BR100 GPGPU:
Accelerating Datacenter Scale AI Computing
Mike Hong, Lingjie Xu
& Team
Hot Chips 34, Aug 2022
Authorized licensed use limited to: Southeast University. Downloaded on January 22,2024 at 07:17:50 UTC from IEEE Xplore. Restrictions apply.
Notice and Disclaimer
• Copyright © 2020-2022 Biren Technology. All rights reserved.
• Confidentiality. This document contains confidential information of Biren Technology, which shall not be disclosed to any third party by any means unless explicitly permitted by Biren Technology.
• Trademark. All trade names, trademarks, graphical marks and domain names in this document are the properties of Biren Technology. Without prior written consent, they may not be copied, reproduced, modified, published, uploaded, posted, transmitted, or distributed in any way.
• Forward-Looking Statements. Information in this document, other than statements or descriptions of historical fact, may contain forward-looking statements. The forward-looking statements are made based on certain assumptions, projections and calculations made by us with regard to the industry and management's expertise. These forward-looking statements are subject to significant risks and uncertainties, and our actual results may differ materially. Your business decisions shall not be made solely based on this information.
• Disclaimer. This material is provided "as is" without any express or implied warranty of any kind, including warranties of merchantability, title, non-infringement of intellectual property, or fitness for any particular purpose.
BIREN BR100
• Area: 1074 mm² @ 7nm
• Transistor count: 77 billion
• Host interface: PCIe Gen 5 x16 w/ CXL
• Peak performance:
  - 2048 TOPS @ INT8
  - 1024 TFLOPS @ BF16
  - 512 TFLOPS @ TF32+
  - 256 TFLOPS @ FP32
  - Also supports FP16, INT32, INT16 and other formats
• Memory: 64GB HBM2E
• Interconnect: 8 BLink™ ports, 2.3 TB/s external I/O bandwidth
• Form factor: OAM with 550W max TDP
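The peak-throughput figures halve at each step up in precision (INT8 → BF16 → TF32+ → FP32), consistent with one pool of tensor multiply hardware reused at progressively wider operand widths. A quick arithmetic check of the spec table (the interpretation is ours, not from the slides):

```python
# Peak tensor throughput per precision, from the BR100 spec table above.
peak = {
    "INT8":  2048,  # TOPS
    "BF16":  1024,  # TFLOPS
    "TF32+":  512,  # TFLOPS
    "FP32":   256,  # TFLOPS
}

# Each step up in operand precision halves peak throughput.
rates = list(peak.values())
ratios = [a / b for a, b in zip(rates, rates[1:])]
print(ratios)  # → [2.0, 2.0, 2.0]
```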
Accelerating Deep Learning Workloads
[Bar chart: benchmark performance measured in throughput, BIREN BR100 vs. NVIDIA A100 (baseline 1.0X), on ResNet-50, Bert-Large, Tacotron2, Yolo v5, and Transformer. BR100 speedups range from roughly 2.4X to 2.8X, with an average of ~2.6X.]
Purpose-built for Datacenter-Scale Computing
• Compute density
• Connectivity
• Cost of ownership
• Co-design of hardware and software
• Compatibility with datacenter infrastructure
One Tapeout, Multiple Products
• Breaks the reticle-size limit to integrate more transistors on chip
[Diagram: two compute tiles, each with two HBM2E stacks, joined by die-to-die links]
• One tapeout powers multiple SKUs
• Smaller dies for better yield, hence lower cost
• 896 GB/s high-speed die-to-die interconnect
• 30% more performance and 20% better yield compared with a monolithic design
BR100 Product Line
• BR100: 壁砺™ 100 OAM
• BR104: 壁砺™ 104 PCIe
• Hearten Server
OAM Server Interconnect Topology
[Diagram: eight OAM modules in an all-to-all BLink topology, connected through PCIe switches to two CPUs, each serving multiple NICs]
BR100 Architecture Diagram
[Diagram: two compute tiles joined by die-to-die (D2D) links. Each tile contains 16 SPCs (Streaming Processing Clusters) paired with distributed 8MB L2 cache/scratchpad slices, HBM2E memory subsystems, video encoders and decoders, PCIe, and BLink ports. Each SPC holds 16 EUs served by 64KB L1 caches (LSC).]
BR100 SPC Architecture
Building blocks of a BR100 SPC:
[Diagram: an SPC with 16 EUs, four 64KB L1 caches, and an 8MB L2 cache/scratchpad. Each EU shows an L0 instruction cache, a warp scheduler/dispatch unit (32 threads/clk), a 40KB TLR register file, V-cores, a T-core, SFUs, and a TDA.]
• 16 x EU (Execution Unit); each EU has:
  - 16 x streaming processing cores (V-cores), 1 x tensor engine (T-core)
  - 40KB TLR (Thread-Local Registers)
  - 4 x SFUs
  - TDA (Tensor Data Accelerator) for load/store
• 4 x 64KB L1 cache/LSC (Load & Store Cache)
• Up to 8MB distributed L2 cache
  - Holds shared data for all SPCs
  - Can be configured as scratchpad
  - Built-in reduction engine
Scalable CU Architecture
[Diagram: an SPC's 16 EUs partitioned into CUs of 4, 8, or 16 EUs]
• Multiple EUs form a CU (Compute Unit)
• Thread groups within a CU are synchronized
• Each CU can contain 4, 8, or 16 EUs
V-core: A General-Purpose SIMT Processor
[Diagram: EU internals — L0 instruction cache, warp scheduler/dispatch unit (32 threads/clk), 40KB TLR register file, 16 V-cores, one T-core, 4 SFUs, and a TDA]
Full ISA for general-purpose computing:
• 16 cores per EU, supporting FP32, FP16, INT32, INT16
• SFU
• Load/store
• Data preprocessing
• Manages the T-core through multiple sync channels
• Handles DL ops such as Batch Norm, ReLU, etc.
Enhanced SIMT model:
• 128K threads run across the 32 SPCs
• Cooperative warps
• Superscalar execution (static and dynamic)
Flexible V-Core Warp Control
• Standard Control Mode: warps run the same code and partition the register file.
• C-Warp/Kernel Coroutine Control Mode: warps run different code, collaborate with each other (producer-consumer), and exchange data through the register file.
  - Enhanced parallelism
  - Highly efficient for op fusion
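The C-Warp mode can be pictured as coroutine-style warps that hand data to one another instead of running in lock-step. A minimal host-side sketch of the producer-consumer pattern it enables (Python generators stand in for warps and local values for the shared register file; names are illustrative, not BIRENSUPA API):

```python
# Producer "warp": loads and preprocesses values (stand-in for op 1 of a fused kernel).
def producer(tiles):
    for t in tiles:
        yield t * 2          # e.g. a scale op

# Consumer "warp": accumulates whatever the producer hands over "in registers".
def consumer(stream):
    acc = 0
    for value in stream:
        acc += value         # e.g. a reduction op, fused with the scale
    return acc

# Fused pipeline: data flows producer -> consumer without a round trip to memory,
# which is the point of exchanging values through the register file.
result = consumer(producer([1, 2, 3, 4]))
print(result)  # → 20
```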
T-core: High-Level Overview
[Diagram: EU internals, highlighting the T-core alongside the V-cores, SFUs, and TDA]
Accelerates common AI operations:
• MMA (matrix multiply-add)
• Convolution
• …
Design goals:
• Speed up MMA and convolution, which account for the majority of deep-learning workloads
• Best reuse across different dimensions (batch, sample, channel, …)
• Model-level and data-level parallelism
• Tight coupling with the V-cores and L2 scratchpad to maximize utilization
SPC-Scale 2.5D GEMM Architecture
[Diagram: 16 T-cores arranged as a 2D systolic array; each T-core holds a grid of dot-product (dp) units]
Consists of:
• 16 T-cores in a 2D systolic array
• 2 groups of 8 x 8 dot-product (dp) operations per T-core (8 x 8 x dp8 3D MMA for BF16)
• Equivalent to a 64 x 64 matrix multiplication
  - Supports FP32, TF32+, BF16, INT16, INT8, and INT4 tensor formats
Benefits:
• Higher data-reuse rate with a large 2.5D GEMM
• Less memory bandwidth and cache occupancy
• Better throughput and energy efficiency
• Lower latency vs. a large 2D GEMM
• More scalable and easier routing vs. a 3D GEMM
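The building blocks above — 8 x 8 output tiles fed by 8-wide dot products (dp8) — compose into a full 64 x 64 GEMM, which can be sanity-checked in plain Python. How tiles are scheduled across the 16 T-cores is an illustrative assumption, not the documented mapping:

```python
# Tiled 64x64 GEMM sketch: outputs produced in 8x8 tiles, with the inner (k)
# dimension consumed in dp8 chunks, mirroring the 8 x 8 x dp8 MMA step.
N, TILE, DP = 64, 8, 8

def matmul_tiled(A, B):
    C = [[0.0] * N for _ in range(N)]
    for ti in range(0, N, TILE):           # 8x8 output tile rows
        for tj in range(0, N, TILE):       # 8x8 output tile cols
            for tk in range(0, N, DP):     # one dp8 chunk of the k dimension
                for i in range(ti, ti + TILE):
                    for j in range(tj, tj + TILE):
                        # one dp8: an 8-wide dot product, accumulated in place
                        C[i][j] += sum(A[i][tk + k] * B[tk + k][j] for k in range(DP))
    return C

A = [[(i + j) % 7 for j in range(N)] for i in range(N)]
B = [[(i * j) % 5 for j in range(N)] for i in range(N)]
ref = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)] for i in range(N)]
assert matmul_tiled(A, B) == ref
```

The tiling changes only the order of accumulation, not the result, which is why the hardware can distribute the 8x8 tiles over a systolic array while still computing an exact 64 x 64 product.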
TF32+ Tensor Data Type
[Diagram: bit layouts — FP32: s | E8 | M23; TF32+: s | E8 | M15; TF32: s | E8 | M10]
Why TF32+?
• E8M15, 24 bits in total
• 32x more precise than TF32 in AI training (15 vs. 10 mantissa bits)
• Reuses the BF16 multiplier (1+7 mantissa bits) and simplifies T-core design
• Kicks in automatically when data declared as FP32 is run through the tensor-acceleration libraries
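Since TF32+ shares FP32's sign and E8 exponent and keeps 15 of its 23 mantissa bits, the conversion can be modeled by clearing the low 8 mantissa bits of an FP32 value. A sketch of that model (truncation is an assumption; the hardware may round instead):

```python
import struct

def to_tf32plus(x: float) -> float:
    """Model TF32+ (E8M15) by zeroing the low 8 of FP32's 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0xFF          # drop the 8 least-significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Values with short mantissas survive exactly...
print(to_tf32plus(1.0))    # → 1.0
# ...while a long-mantissa value loses only its low-order bits
# (relative error bounded by 2**-15 near 1.0).
x = 1.0000001
print(abs(to_tf32plus(x) - x) < 2**-15)  # → True
```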
Tensor Data Accelerator (TDA)
[Diagram: the V-core and T-core issue load/store requests using tensor descriptors (tensors may be 1D-5D); the TDA handles address calculation and out-of-bound (OOB) checks and performs asynchronous memory copies against the memory system]
• TDAs in the T-core and V-core are dedicated to accelerating address calculation and OOB handling using tensor descriptors
• The TDA improves tensor-data fetch efficiency by offloading addressing overhead and supporting different tensor layouts
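The tensor-descriptor idea can be sketched as: the descriptor carries shape and strides, and the TDA turns a logical coordinate into a linear address, returning an out-of-bound flag instead of faulting. An illustrative Python model (field and function names are our assumptions, not the BR100 ISA):

```python
from dataclasses import dataclass

@dataclass
class TensorDescriptor:
    base: int            # base address in bytes
    shape: tuple         # extents, up to 5 dimensions
    strides: tuple       # per-dimension stride in bytes

def tda_address(desc: TensorDescriptor, coord: tuple):
    """Return (address, oob): the linear address for a coordinate, or an OOB flag."""
    if any(c < 0 or c >= n for c, n in zip(coord, desc.shape)):
        return None, True                     # out-of-bound: no fault, just a flag
    addr = desc.base + sum(c * s for c, s in zip(coord, desc.strides))
    return addr, False

# A 4 x 8 FP32 tensor, row-major: row stride 32 bytes, element stride 4 bytes.
d = TensorDescriptor(base=0x1000, shape=(4, 8), strides=(32, 4))
print(tda_address(d, (1, 2)))   # → (4136, False)   0x1000 + 1*32 + 2*4
print(tda_address(d, (4, 0)))   # → (None, True)    row index out of range
```

Swapping the strides in the descriptor expresses a different layout (e.g. column-major) without changing the access code, which is the sense in which the TDA "supports different tensor layouts".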
Memory Scheme: NUMA and UMA
[Diagram: each SPC has high-bandwidth local access to its own NUMA memory (HBM/L2); all SPCs reach shared UMA memory via NoC multicast]
NUMA memory scheme:
• Stores an SPC's private data, e.g. activations
• Local access with high bandwidth
• Data broadcast to the relevant SPCs in model parallelism
UMA memory scheme:
• Stores data shared by all SPCs, e.g. weights
• Accelerated by NoC multicasting
Near-Memory Computing
Near-memory engine: L2 reduction
• (1) Reduction operations in L2 issued by SPC instructions
• (2) Reduction by DMA
Embedding accelerator
• Embedding-table processing accounts for the major share of computation in recommendation systems
• Reduces the embedding table in L2 before it reaches the SPCs
• Offloads the SPCs and accelerates computation
[Diagram: SPCs issue reduction operations (1) against L2, which is shared by all SPCs and contains a reduction engine (+, -, x, /); DMA performs reductions (2) over the embedding table held in L2]
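The near-memory path can be pictured as pooling embedding rows inside L2 so that only the reduced vector crosses over to an SPC. A minimal model of sum-pooling an embedding bag (the interface is illustrative, not the BR100 ISA):

```python
# Embedding table "resident in L2": one 4-wide vector per row.
table = {
    0: [1.0, 0.0, 2.0, 0.0],
    1: [0.5, 0.5, 0.5, 0.5],
    2: [0.0, 1.0, 0.0, 3.0],
}

def l2_reduce(indices):
    """Near-memory sum-pooling: reduce the selected rows before they reach an SPC."""
    acc = [0.0] * 4
    for idx in indices:
        acc = [a + r for a, r in zip(acc, table[idx])]
    return acc   # only this reduced vector travels to the SPC, not every row

print(l2_reduce([0, 2]))  # → [1.0, 1.0, 2.0, 3.0]
```

The bandwidth saving scales with bag size: reducing a bag of k rows near memory moves one vector instead of k.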
BIRENSUPA™ Software Platform
[Stack diagram, top to bottom:]
• Application: application workflows, AutoStream, …
• Framework: suCTR, suInfer
• Library
• Programming platform (BIRENSUPA): programming model, tools, C++ extension & runtime API, compiler
• Driver/HAL: KMD, UMD, virtualization, firmware
• Hardware
Summary
• We introduced BR100, a GPGPU designed to accelerate datacenter-scale AI computing:
  - PFLOPS-level compute density, high connection bandwidth (on-die and off-die)
  - 7nm chiplet design with CoWoS packaging
  - 550W OAM form factor with an 8-card all-to-all interconnect topology
  - Over 300MB of on-chip SRAM for data caching and reuse
  - A general-purpose processor with 2.5D GEMM acceleration
  - A new TF32+ tensor data type
• We optimized data streaming by introducing the following features in BR100:
  - The C-Warp control mode
  - TDA (Tensor Data Accelerator)
  - NUMA/UMA memory scheme with NoC multicast
  - Near-memory computing
• The BIRENSUPA software platform lets developers program BR100 with ease