Samsung has been strongly promoting its “Aquabolt-XL” Processor-In-Memory (PIM) devices for the past year. In this second post of a two-part series, The Memory Guy will present other companies’ similar PIM devices and will discuss the PIM approach’s outlook for commercial success.
Part 1 of this series explains the concept of Processing in Memory (PIM), details Samsung’s Aquabolt-XL design, and shares some performance data.
Samsung’s Not the First PIM Maker
This is not at all the first time a company has produced a DRAM chip with an internal general-purpose processor. Although Samsung is the first to build a PIM in an HBM, other companies have added processors to more standard DDR DRAMs. Here’s a brief description of other efforts:
Venray: A start-up that has been promoting a PIM architecture since at least 2011, Venray bases its design on a standard DDR DRAM with a specialty processor designed to excel at matrix manipulation.
Micron Technology: Nearly a decade ago Micron introduced its Automata Processor, which combined a standard DDR DRAM with a processor architecture designed to exploit the DRAM’s extreme internal bandwidth for graph searches, a class of problems that benefits greatly from high bandwidth. Micron later spun the group off to form Natural Intelligence Semiconductors.
UPMEM: In 2019 a start-up named UPMEM (for microprocessor – µP – and memory) introduced a DIMM with a chip that combined a processor with a standard DDR DRAM. This has been covered in two prior posts on The Memory Guy.
Samsung argues that these efforts have not achieved commercial success because software must be reconfigured to use any of them. Since these PIMs are all based on standard DDR DRAM chips, they all have buses that are narrower than the processor bus they are attached to. This means that the processor must reconfigure the data before feeding it into the PIM, and that undermines the PIM’s speed advantage.
The company points out that in an HBM-PIM the data bus for the HBM is the same width as the data bus for the processor, so no such reconfiguration is required.
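The data reconfiguration Samsung describes for DDR-based PIMs can be illustrated with a small model. This is a minimal sketch under stated assumptions, not any vendor’s actual scheme: it assumes a 64-bit DIMM rank built from eight x8 chips, with byte lane i of every bus word landing in chip i. Written normally, each chip (and any processor inside it) sees only one byte of every operand, so the host must first transpose the data so that each chip receives whole words. The names `CHIPS`, `stripe_across_chips`, and `host_transpose` are hypothetical.

```python
CHIPS = 8  # assumption: eight x8 DRAM chips share one 64-bit data bus


def stripe_across_chips(bus_words):
    """Byte lane i of every 64-bit bus word lands in chip i (assumed lane order)."""
    chips = [[] for _ in range(CHIPS)]
    for word in bus_words:
        for lane in range(CHIPS):
            chips[lane].append((word >> (8 * lane)) & 0xFF)
    return chips


def words_seen_by_chip(chip_bytes):
    """Reassemble the little-endian 64-bit words a single chip actually holds."""
    return [int.from_bytes(bytes(chip_bytes[i:i + 8]), "little")
            for i in range(0, len(chip_bytes), 8)]


def host_transpose(words):
    """Hypothetical host-side step: build bus words so that, after striping,
    chip c holds the complete words of chunk c instead of one byte of each."""
    if len(words) % CHIPS:
        raise ValueError("word count must be a multiple of CHIPS")
    per_chip = len(words) // CHIPS
    # The byte sequence each chip should end up storing.
    targets = [b"".join(w.to_bytes(8, "little")
                        for w in words[c * per_chip:(c + 1) * per_chip])
               for c in range(CHIPS)]
    # Bus word k carries byte k of every chip's target, one byte per lane.
    return [sum(targets[lane][k] << (8 * lane) for lane in range(CHIPS))
            for k in range(per_chip * 8)]


data = list(range(1, 17))  # sixteen 64-bit operands

# Written as-is, chip 0 receives only the low byte of each operand.
print(words_seen_by_chip(stripe_across_chips(data)[0]))
# After the host-side transpose, chip 0 holds two complete operands.
print(words_seen_by_chip(stripe_across_chips(host_transpose(data))[0]))  # [1, 2]
```

The transpose is pure overhead performed by the host processor, which is the cost Samsung says an HBM-PIM avoids because its data bus matches the processor’s.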
Samsung’s argument may be valid, but another issue is the processor architectures: all of them, Samsung’s included, differ from the widely supported Intel instruction set. Samsung addresses this issue with its software support, but the code still needs to be recompiled. Time will tell whether this approach gives the company the advantage it needs to achieve market success.
There are three key market considerations that must be addressed with any PIM. The first is cost: the memories these devices are built around achieve a very low cost structure thanks to their high shipment volumes. Economies of scale drive down commodity DRAM production costs, but these economies are unavailable to a low-volume product, so PIMs cannot be profitably sold at prices that rival DRAM prices. We have recently seen this same dynamic lead to significant losses in Intel’s Optane product line.
That would not be an issue if customers were quick to recognize the value that PIMs can bring to their systems, but verifying that value requires a costly effort, so it is rarely done. This second issue has a lot to do with how these products are positioned: as turbo-boosted memories rather than as processors.
Look at a typical Intel processor and you will notice something relatively similar to a PIM: about half of the die area might be SRAM used as cache memory, as can be seen in the diagram below, a press photo of an Intel processor from a few years back.
Intel never uses the term “PIM” for its processors, and this allows the company to charge a much higher price than it might otherwise be able to. The same amount of commodity SRAM sells for a small fraction of the price of half of an Intel chip.
Still, the memory portion of this chip accounts for almost half the die area, and that’s about the same ratio as Samsung’s Aquabolt-XL.
The third and final market consideration has already been mentioned, and that’s software compatibility. Programmers put a lot of work into making their code work well in an established architecture, and will need to duplicate a portion of that effort to port the software to a new architecture and to test it to ensure that the port didn’t cause any mischief. This takes time and money that management may not easily be able to justify. Such issues have stopped other worthwhile projects, and these same issues are hard at work here.
Samsung claims to have worked around this issue to create a solution that requires no software changes. If they have indeed accomplished this then they will have a far easier time convincing customers to use the Aquabolt-XL HBM-PIM. Time will tell. I will be watching this effort carefully for the next few years.
5 thoughts on “Samsung’s Aquabolt-XL Processor-In-Memory (Part 2)”
PIMs versus IMPs
In passing, it is worth remembering that in some applications PIMs may face a competitor. Ohm’s law, together with the ability to vary at will the conductance of most emerging nonvolatile (NV) memory devices, gives access to in-memory processing (IMP), or in-memory computation. This provides the means to carry out sum-of-products calculations in memory and in parallel, removing that workload from the processor. IBM demonstrated this some time ago for phase-change memory (PCM), and more recently a team from Samsung’s Advanced Institute of Technology (SAIT) demonstrated it for MRAM.
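The sum-of-products idea above can be sketched numerically. This is an idealized model under stated assumptions, not IBM’s or SAIT’s actual design: weights are stored as cell conductances, inputs are applied as row voltages, each cell passes I = G·V (Ohm’s law), and each column wire sums its cells’ currents (Kirchhoff’s current law). Wire resistance, device nonlinearity, noise, and the DAC/ADC circuitry a real array needs are all ignored, and the function name and example values are hypothetical.

```python
def crossbar_column_currents(conductance, voltage):
    """Idealized crossbar read: one dot product per column.

    conductance[i][j]: cell conductance in siemens at row i, column j
    voltage[i]: read voltage in volts driven onto row i
    Each cell passes I = G * V (Ohm's law), and each column wire adds up
    the currents of its cells (Kirchhoff's current law).
    """
    n_cols = len(conductance[0])
    return [sum(g_row[j] * v for g_row, v in zip(conductance, voltage))
            for j in range(n_cols)]


# Weights stored as conductances, inputs applied as voltages (example values).
G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
V = [0.1, 0.2]
print(crossbar_column_currents(G, V))  # approximately [7e-07, 1e-06] amperes
```

In hardware all the columns settle at once, which is where the parallelism comes from; the Python loop only reproduces the arithmetic.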
Ron, Thanks for mentioning these.
Sometime soon I need to write a post that compares the different “In Memory” approaches. There are the ones I mentioned in this post, the ones that do linear processing in an array, like neural networks or IBM’s PCM hyperdimensional computing chip in your recent post, and then there is Gigascale’s Gemini APU that uses a special SRAM cell to do digital processing within each bit cell. They are all very different ways of harnessing the immense internal bandwidth of memory chips.
Thanks for the great post!
Regarding IMPs, prior works such as Ambit and ComputeDRAM have demonstrated how to exploit DRAM timing parameters to execute majority-of-three operations natively in DRAM. Other works, such as SIMDRAM, generalized this technique to execute complex operations such as addition and multiplication, achieving significantly higher throughput.
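How a majority-of-three (MAJ) primitive extends to logic and arithmetic can be shown with a bit-level model. This is a sketch under assumptions, not the Ambit, ComputeDRAM, or SIMDRAM implementation: each Python int stands for one DRAM row (one bit per bitline), `WIDTH` is a hypothetical row width, and the full adder uses one known MAJ/NOT formulation that may differ from the circuits those papers use. Presetting the third row to all 0s turns MAJ into AND, and all 1s turns it into OR; storing numbers bit-serially (row k holds bit k of every element) lets one sequence of row operations add `WIDTH` numbers at once.

```python
WIDTH = 16               # hypothetical: bitlines (columns) per simulated DRAM row
ONES = (1 << WIDTH) - 1  # a row with every bit set


def maj(a, b, c):
    """Bitwise majority of three rows, as produced by a triple-row activation."""
    return (a & b) | (b & c) | (a & c)


def bit_not(row):
    return ~row & ONES


def bit_and(a, b):
    return maj(a, b, 0)     # third row preset to all 0s


def bit_or(a, b):
    return maj(a, b, ONES)  # third row preset to all 1s


def to_vertical(values, nbits):
    """Bit-serial layout: row k holds bit k of every element, one element per column."""
    return [sum(((v >> k) & 1) << col for col, v in enumerate(values))
            for k in range(nbits)]


def from_vertical(rows):
    return [sum(((row >> col) & 1) << k for k, row in enumerate(rows))
            for col in range(WIDTH)]


def add_vertical(a_rows, b_rows):
    """Ripple-carry addition of WIDTH numbers at once, using only MAJ and NOT."""
    carry = 0
    out = []
    for a, b in zip(a_rows, b_rows):
        carry_out = maj(a, b, carry)
        # Sum bit = MAJ(NOT carry_out, MAJ(a, b, NOT carry_in), carry_in)
        out.append(maj(bit_not(carry_out), maj(a, b, bit_not(carry)), carry))
        carry = carry_out
    out.append(carry)  # final carry becomes the top sum bit
    return out


a = list(range(16))
b = [(3 * i) % 16 for i in range(16)]
sums = from_vertical(add_vertical(to_vertical(a, 4), to_vertical(b, 4)))
print(sums == [x + y for x, y in zip(a, b)])  # True
```

The cost of each addition grows with the operand bit width rather than with the number of elements, which is why this bit-serial style pays off when a row holds thousands of bitlines.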
Wow, Geraldo! Sounds like I could have done more homework before posting this one.
Let me look into those that you mentioned. Sounds like I might want to write an additional post about them.
Thanks for letting me know, and, of course, thanks for the compliment on the post.