This article is part of the Technology Insight series, made possible with funding from Intel.
The ubiquity of cloud computing, the growth of edge computing, and rapid innovations in AI are all driven by data: gathering it, storing it, moving it, processing it, and distilling it down into valuable insights. Each of those tasks is unlike the others. So, in today’s world of specialized applications, purpose-built processing engines work together to make the heavy lifting more manageable. This approach is commonly known as heterogeneous computing.
In such environments, many problems are solved faster by offloading to accelerators. Think graphics processors (GPUs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs). Each of these dissimilar devices is data hungry. Keeping them fed requires an interconnect that delivers high bandwidth and low latency.
- CXL is an open industry standard interconnect that builds on PCI Express 5.0’s infrastructure to reduce complexity and system cost.
- CXL’s protocols enable memory coherency, allowing more efficient resource sharing between host processors and accelerator devices.
- Host processors and accelerators with CXL support are expected in 2021.
Today, PCI Express is the most prevalent technology connecting host processors to accelerator devices. It’s an industry-standard, high-performance, general-purpose serial I/O interconnect designed for use in enterprise, desktop, mobile, communications, and embedded platforms.
But to really scale heterogeneous computing in the data center, compute-intensive workloads need an interconnect with more efficient data movement. The new Compute Express Link (CXL) builds upon PCI Express 5.0’s physical and electrical interface with protocols that address those demands by establishing coherency, simplifying the software stack, and maintaining compatibility with existing standards. More than 100 top companies, including Intel, Google, Facebook, Microsoft, and HP, have signed on as members.
Read on for a dive into how CXL works, which devices are likely to use it, and what to expect in 2021.
What is CXL?
CXL is a CPU-to-device interconnect that targets high-performance workloads and the heterogeneous compute engines driving them. It leverages a new feature in the PCI Express 5.0 specification that allows alternate protocols to use PCIe’s physical layer.
So when you plug a CXL-enabled accelerator into an x16 slot, the device starts negotiating with the host processor’s port at PCI Express 1.0 transfer rates (2.5 GT/s). If both sides support CXL, they switch over to the CXL transaction protocols. Otherwise, they operate as PCIe devices.
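The fallback rule described above can be sketched in a few lines. This is only an illustration of the decision, not of real link training: the names `LinkMode` and `negotiate_link_mode` are hypothetical, and the two boolean flags stand in for whatever capability exchange the hardware actually performs.

```python
from enum import Enum


class LinkMode(Enum):
    """Protocol family the link ends up using (hypothetical names)."""
    PCIE = "PCIe"
    CXL = "CXL"


def negotiate_link_mode(host_supports_cxl: bool, device_supports_cxl: bool) -> LinkMode:
    """Return the mode chosen after the initial 2.5 GT/s negotiation.

    Assumption: CXL is selected only when BOTH ends advertise support;
    in every other case the link behaves as ordinary PCIe.
    """
    if host_supports_cxl and device_supports_cxl:
        return LinkMode.CXL
    return LinkMode.PCIE


if __name__ == "__main__":
    for host in (True, False):
        for device in (True, False):
            mode = negotiate_link_mode(host, device)
            print(f"host CXL={host}, device CXL={device} -> {mode.value}")
```

The point of the sketch is that a CXL-capable card in a PCIe-only host, or a PCIe-only card in a CXL-capable host, still comes up as a working PCIe device.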
The alignment of CXL and PCI Express 5.0 means both device classes will transfer data at 32 GT/s (gigatransfers per second). That’s up to 64 GB/s in each direction over a 16-lane link. It’s also likely that the performance demands of CXL will be a driver for the adoption of the upcoming PCI Express 6.0 specification.
With the same bandwidth as PCIe 5.0, CXL carves out its advantage over PCIe with three dynamically multiplexed transaction layer protocols: CXL.io, CXL.cache, and CXL.memory. The first, CXL.io, is almost identical to PCI Express 5.0. It’s used for device discovery, configuration, register access, interrupts, virtualization, and bulk DMA, making it a mandatory ingredient. Although CXL.cache and CXL.memory are optional, they’re the special sauce that enables CXL’s coherency and low latency. The former allows an accelerator to cache system memory, while the latter gives a host processor access to memory attached to an accelerator.
“That accelerator-attached memory could be mapped into the coherent space of the CPU and be viewed as additional address space,” says Jim Pappas, chairman of the Compute Express Link Consortium. “It’d have performance similar to what you’d get from a dual-processor system going over a coherent interface between two CPUs.” PCI Express lacks this functionality. Prior to CXL, the CPU could go over PCIe to access the accelerator, but the accelerator’s memory would be uncached at best because PCIe is a noncoherent interface.
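The 64 GB/s figure follows directly from the transfer rate and lane count. The short calculation below shows the arithmetic; the helper name `link_bandwidth_gb_per_s` is made up for this example, and the second result assumes PCIe 5.0's 128b/130b line encoding, a detail the article itself does not go into.

```python
def link_bandwidth_gb_per_s(transfer_rate_gt_s: float, lanes: int,
                            encoding_efficiency: float = 1.0) -> float:
    """Per-direction bandwidth in GB/s.

    Each transfer moves one bit per lane, so the raw bit rate is
    rate * lanes; dividing by 8 converts bits to bytes.
    """
    return transfer_rate_gt_s * lanes * encoding_efficiency / 8


# Raw figure quoted in the text: 32 GT/s over 16 lanes.
raw = link_bandwidth_gb_per_s(32, 16)

# Assumption: 128b/130b encoding (128 payload bits per 130 bits on the wire).
after_encoding = link_bandwidth_gb_per_s(32, 16, encoding_efficiency=128 / 130)

print(f"raw: {raw:.1f} GB/s, after encoding overhead: {after_encoding:.2f} GB/s")
```

The raw result is the "up to 64 GB/s" ceiling; encoding and protocol overhead pull the usable number slightly below it.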
Pappas adds that coherency between the CPU memory space and memory on attached devices is especially important in heterogeneous computing. “Rather than doing DMA operations back and forth, the host processor or accelerator could read/write with memory operations directly into the other device’s memory system.” Accelerator manufacturers are shielded from much of the complexity that goes into enabling the benefits of coherency since CXL’s asymmetric design shifts most of the coherency management to the host processor’s home agent.
The CXL.cache and CXL.memory protocols are deliberately optimized for low latency. Pappas suggests that should allow them to match the performance of symmetric cache coherency links. Both protocols eschew CXL.io’s variable payload and the extra pipeline stages needed to accommodate that flexibility. Instead, they’re broken off into separate transaction and link layers, unencumbered by larger CXL.io transactions.
What devices stand to benefit most from CXL?
Mixing and matching CXL’s protocols yields a trio of use cases that show off the interconnect’s shiny new features.
The first, referred to by the CXL Consortium as a Type 1 device, includes accelerators with no local memory. This kind of device uses the CXL.io protocol (which, remember, is mandatory), along with CXL.cache to talk to the host processor’s DDR memory as if it were its own. An example might be a smart network interface card able to benefit from caching.
Type 2 devices include GPUs, ASICs, and FPGAs. Each has its own DDR memory or High Bandwidth Memory, requiring the CXL.memory protocol in addition to CXL.io and CXL.cache. Bringing all three protocols to bear makes the host processor’s memory locally available to the accelerator, and the accelerator’s memory locally available to the CPU. They also sit in the same cache coherent domain, giving heterogeneous workloads a big boost.
Memory expansion is a third use case enabled by the CXL.io and CXL.memory protocols. A buffer attached to the CXL bus might be used for DRAM capacity expansion, augmenting memory bandwidth, or adding persistent memory without tying up precious DRAM slots in high-performance workloads. High-speed, low-latency storage devices that would have previously displaced DRAM can instead complement it by way of CXL, opening the door to non-volatile technologies in add-in card, U.2, and EDSFF form factors.
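The three use cases above boil down to which protocols a device combines. A minimal sketch of that mapping follows; `classify_device` and the table are illustrative only, and the "Type 3" label for memory expansion is the CXL Consortium's name for that class rather than a term used in this article.

```python
from typing import Iterable

IO, CACHE, MEMORY = "CXL.io", "CXL.cache", "CXL.memory"

# Protocol combinations described in the text, keyed by the exact set used.
DEVICE_TYPES = {
    frozenset({IO, CACHE}): "Type 1: accelerator with no local memory (e.g. smart NIC)",
    frozenset({IO, CACHE, MEMORY}): "Type 2: accelerator with its own memory (GPU, ASIC, FPGA)",
    frozenset({IO, MEMORY}): "Type 3: memory expansion / buffer",
}


def classify_device(protocols: Iterable[str]) -> str:
    """Map a set of supported protocols to one of the three use cases."""
    supported = frozenset(protocols)
    if IO not in supported:
        # CXL.io handles discovery and configuration, so it is mandatory.
        raise ValueError("CXL.io is mandatory for every CXL device")
    if supported not in DEVICE_TYPES:
        raise ValueError(f"no use case defined for {sorted(supported)}")
    return DEVICE_TYPES[supported]


if __name__ == "__main__":
    print(classify_device([IO, CACHE]))
    print(classify_device([IO, CACHE, MEMORY]))
    print(classify_device([IO, MEMORY]))
```

Keying the table on the exact protocol set keeps the mandatory role of CXL.io explicit and makes unsupported combinations fail loudly rather than fall into a default.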
CXL’s future is bright
How seriously should you take CXL’s potential impact on your high-performance computational workload? Just look at the level of industry support behind the interconnect. “In a year, we went from nine companies invited to 115 members,” says Pappas. “That tells the story. And it’s not just the number of companies. Look at them. It is the industry.”
Right out of the gate, CXL is going to be a data center play. The cloud, analytics, AI, the edge—that’s where the scaling problems live. Pappas continues, “The performance of the device in my pocket is known. But when I ask Siri a question, and it hits a data center receiving millions of other questions at once, that’s a scalable problem. The data center operators need acceleration.”
Relief isn’t far off. In a post published last year, Navin Shenoy, general manager of the Data Center Group at Intel, said to expect products with CXL support, including Xeon processors, FPGAs, GPUs, and SmartNICs starting in 2021.
By that point, the CXL 2.0 specification will already be finalized. It remains to be seen what the next generation adds. However, the makeup of the CXL Consortium’s board of directors hints that switching may arrive at some point in the future. Shared pools of buffered memory, accessible by multiple host domains, would be especially attractive in a hyperconverged infrastructure affected by resource drift. Cloud providers could attach less memory to each node’s CPU and scale up using the pool.
Regardless of what the future holds, the CXL Consortium says it’s committed to evolving the interconnect in an open and collaborative manner. Given broad acceptance thus far, assured compatibility with PCIe 5.0-based platforms, and IP protections under the Adopter membership level, it’s only a matter of time before CXL finds its way into other types of computing, too.