For the past year, since ISSCC in February 2021, Samsung has been strongly promoting its “Aquabolt-XL” Processor-In-Memory (PIM) devices. In this two-part post, The Memory Guy will explain the Aquabolt-XL architecture and its performance, review other companies’ similar devices, and discuss the PIM approach’s outlook for commercial success.
Processing in memory is not a new concept. The notion that memories have enormous internal bandwidth that could be harnessed for big jobs has always charmed engineers. The first I heard of the approach was in the late 1980s, when an inventor approached IDT (my employer at the time) to explore a working relationship. His vision was to add graphics processing elements to one of our SRAMs. Our highest-density SRAM at the time was internally organized as 512×512 bits, so his design would have given the device a 512-bit-wide internal data path running faster than signals that had to travel across the I/O pins.
The way to do this is to insert a processor into the memory chip itself, which is why I chose a wedge for this post’s graphic.
This is the same thing that the Aquabolt-XL HBM-PIM is designed to do, but in this case Samsung has added processors to a relatively standard High-Bandwidth Memory (HBM) die, which has a very wide data bus. This chip provides two 16-bit data paths to each of its 32 processors, for a total internal data width of 1,024 bits. Each of the 32 processors consists of two 16-bit floating-point multipliers and two 16-bit floating-point adders, plus three register files. All processors execute the same instruction at the same time, the way that simple GPUs work, in what’s called a Single-Instruction Multiple-Data (SIMD) architecture. The processor’s architecture was chosen to fit the needs of AI processing.
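To make the SIMD idea concrete, here is a toy sketch of one instruction being broadcast to many lanes, each working on its own operands. The lane count of 32 and the multiply-add operation come from the description above; everything else (the function name, the data) is purely illustrative and is not Samsung’s actual microarchitecture.

```python
# Toy SIMD model: one instruction, many data lanes.
# The 32 lanes mirror the 32 compute blocks described above; each lane
# performs a multiply-add, as the PCU's FP16 multipliers and adders do.
# This is an illustrative sketch, not Samsung's design.

NUM_LANES = 32

def simd_fma(a, b, acc):
    """Broadcast one fused multiply-add across all lanes: acc += a * b."""
    assert len(a) == len(b) == len(acc) == NUM_LANES
    return [acc[i] + a[i] * b[i] for i in range(NUM_LANES)]

# Every lane executes the same operation on different data:
a = [float(i) for i in range(NUM_LANES)]
b = [2.0] * NUM_LANES
acc = [0.0] * NUM_LANES
result = simd_fma(a, b, acc)
# Lane i now holds 2.0 * i; all lanes finished in one "instruction".
```

The point of the sketch is the control structure: a single instruction stream drives all lanes, which is what lets a simple in-DRAM processor stay small while still consuming the die’s wide internal data path.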
The wide data channel provides Samsung an architectural advantage that I will explain later in this post.
Samsung’s researchers designed the chip by removing some of the DRAM from a standard Samsung Aquabolt HBM DRAM die and replacing it with a processor. This simple approach was chosen merely to perform a feasibility study. The die layout appears below.
Each of the 32 numbered boxes is a compute block. The enlarged photo below of one of these blocks shows how they are organized. The Even and Odd Bank Arrays are simply the standard DRAM, using the same design as the standard Aquabolt HBM DRAM chip.
The Programmable Compute Unit uses an area that would contain DRAM in a standard commercial Aquabolt HBM DRAM die. This reduces the Aquabolt-XL chip’s DRAM density to 4Gb, half the 8Gb of the standard Aquabolt HBM DRAM die.
Samsung’s Aquabolt-XL Processor-in-Memory HBM stacks four of these PIM-DRAM chips with four standard Aquabolt DRAM chips in a standard HBM package. That’s four 32-processor chips, for a total of 128 processors. The company tells us that Aquabolt-XL has been architected as a drop-in replacement for a standard HBM2, and that it is fully compatible with JEDEC-compliant HBM2 memory controllers, so no changes are required of existing HBM2-compatible processors.
The company explained that the combined on-chip bandwidth available to all 128 of the HBM-PIM’s processors is four times that of the bandwidth available at the device’s 1,024 I/O pins: 4.92TB/s (terabytes per second) vs. the I/O pins’ 1.23TB/s.
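The arithmetic behind that four-times claim checks out directly from the two figures Samsung quotes; nothing below is assumed beyond those numbers.

```python
# Bandwidth figures quoted by Samsung, in TB/s:
internal_bw = 4.92  # combined on-chip bandwidth to the 128 PIM processors
io_bw = 1.23        # bandwidth at the package's 1,024 I/O pins

ratio = internal_bw / io_bw
print(f"internal / external bandwidth = {ratio:.0f}x")  # prints 4x
```

It is that four-fold gap between what the DRAM arrays can deliver internally and what the pins can carry that the in-memory processors are there to exploit.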
According to Samsung: “Requiring changes to the host processor and/or application code has been the biggest hurdle for wide adoption of PIM by industry.” With this in mind, the company is offering a complete supporting software solution, including device drivers, runtime software, a BLAS library, etc. The result, they say, is that TensorFlow source code will require no changes to operate with the Aquabolt-XL. Still, researchers admitted that the conversion requires a one-time recompile or dynamic linking.
How Well Does It Perform?
The Aquabolt-XL has been targeted at memory-bound workloads with low arithmetic intensity, like speech recognition and Natural Language Processing (NLP). Samsung tells us that the processors its standard HBMs normally serve are better suited to compute-bound workloads and are often idle when dealing with memory-bound ones. A 100TOPS (Trillion Operations per Second) processor may actually be able to run no faster than 10TOPS in such systems. The HBM-PIM offloads these memory-bound workloads, allowing the processor/HBM complex to perform at a higher level.
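A simple roofline-style calculation shows how a fast processor ends up throttled by low arithmetic intensity. The 100TOPS peak comes from the text above; the memory bandwidth of 1TB/s and the intensity of 10 operations per byte are assumptions I have chosen purely for illustration, picked so the result matches the 10TOPS figure.

```python
# Roofline sketch: attainable throughput is the lesser of the compute
# peak and (memory bandwidth x arithmetic intensity).
# The 100 TOPS peak is from the text; the bandwidth and intensity
# values below are illustrative assumptions, not Samsung's numbers.

def attainable_tops(peak_tops, mem_bw_tbs, ops_per_byte):
    """Return the roofline-limited throughput in TOPS."""
    return min(peak_tops, mem_bw_tbs * ops_per_byte)

peak = 100.0  # TOPS, from the text
bw = 1.0      # TB/s, assumed memory bandwidth
ai = 10.0     # operations per byte, an assumed low arithmetic intensity

print(attainable_tops(peak, bw, ai))  # 10.0 TOPS: memory-bound
```

With only 10 operations performed per byte fetched, the memory system, not the compute units, sets the ceiling, which is exactly the situation the HBM-PIM is meant to relieve.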
When teamed with a Xilinx Alveo U280 in an RNN-T AI application, Samsung found that the Aquabolt-XL improved performance by 2.49 times over the same Alveo FPGA with standard HBM DRAMs, while system energy dropped by 62%. (That makes sense, since a 2.49-times speed increase alone would yield a 60% energy reduction for any function as long as system power remained the same. Energy is trimmed further because data is processed without leaving the chip, saving the power normally burned driving the I/O pins.) Samsung says that, in this benchmark, the LSTM layer (Long Short-Term Memory, an RNN term), which would be performed in the PIM, consumes 90.7% of the application’s running time, while vector-matrix manipulation, running on the Alveo, uses 78.8% of the running time; since the two overlap in time, both the HBM-PIM and the FPGA see very little idle time in this application.
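The parenthetical energy argument above can be checked with a one-line calculation: at constant power, energy scales with run time, so the 2.49-times speedup alone predicts roughly a 60% saving, close to the 62% Samsung measured.

```python
# Energy = power x time. If the same job runs 2.49x faster at
# unchanged system power, energy falls to 1/2.49 of the baseline.
speedup = 2.49
energy_ratio = 1.0 / speedup        # run time (and energy) vs. baseline
saving = 1.0 - energy_ratio
print(f"predicted energy saving: {saving:.0%}")  # prints 60%
```

The extra two percentage points in the measured figure are consistent with the on-chip processing avoiding some I/O power as well.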
In a DeepSpeech II application using a commercial processor, the HBM-PIM doubled performance while reducing system energy by 70%.
In the second part of this series we will discuss earlier companies’ PIM products and the factors that are required for market success.