d-Matrix announces SquadRack rack-scale solution to provide AI inference at data centre scale

Sree Ganesan, VP of Product at d-Matrix

Today, Santa Clara, California-based generative AI inference pioneer d-Matrix, in collaboration with AI infrastructure leaders Arista, Broadcom and Supermicro, is announcing SquadRack, the industry’s first blueprint for disaggregated, standards-based rack-scale solutions for ultra-low-latency batched inference. It continues the expansion of d-Matrix’s AI product portfolio, following the launch of d-Matrix JetStream, a custom IO accelerator designed from the ground up to tackle the IO bottlenecks hindering large-scale AI inference and deliver industry-leading, data centre–scale performance. With millions of people now using AI services – and the rise of agentic AI, reasoning, and multi-modal interactive content – the industry’s focus is quickly shifting from model training to deploying AI at ultra-low latency across multiple users.

d-Matrix is showcasing SquadRack at the Open Compute Project Global Summit this week. SquadRack comes at a time when cloud providers, sovereign clouds and enterprises are struggling to keep up with generative AI inference demand. It provides a reference architecture for turnkey solutions enabling blazing-fast agentic AI, reasoning and video generation, and d-Matrix says it delivers up to 3x better cost-performance, 3x higher energy efficiency, and up to 10x faster token generation than traditional accelerators.

So where does this approach come from? d-Matrix, which powers next-generation compute for generative AI inference, was founded in 2019 and spent its first years incubating its core technology.

“They broke into the hyperscaler market and developed a deep relationship with the Microsofts and Googles of the world,” said Sree Ganesan, VP of Product at d-Matrix. “They saw that the hyperscalers were driving a huge demand for networking, wanted to be in on the next wave, didn’t want to be a vision accelerator company, and saw a huge opportunity with AI inference.”

Even before the launch of ChatGPT ignited the AI revolution in late 2022, d-Matrix had identified an unfilled need for bigger and faster memory to serve large language models (LLMs). d-Matrix CEO and co-founder Sid Sheth was already predicting that the promising LLMs from OpenAI and Google, which were turning heads in the AI world and beyond, would drive a surge in AI inference workloads.

“Because they wanted to double down on inference, the first couple of years they investigated the pain points of transformer workloads, realizing that they needed to solve this memory bandwidth problem,” Ganesan stated. “So they built a chiplet-based solution, which worked with low latency, and raised a Series B funding round in October 2023 to productize it.”

The result of these efforts was Corsair, a PCIe compute accelerator built for generative AI. Corsair provides a unique digital in-memory compute architecture with integrated performance memory for speed and off-chip capacity memory for offline batched inference.

“Our strength is in compute infrastructure and Corsair allows you to get memory with very low latency,” Ganesan said. “That is our value proposition.”

Ganesan said this is all part and parcel of d-Matrix’s unique approach to AI inference acceleration.

“The old model was capacity bound, in that every word that is generated has to go in and out of memory and into compute, and so the GPUs are underutilized,” she said. “The in-memory computing approach we pioneered breaks things down to memory compute, instead of having separate blocks of memory and compute, so you don’t have to go back and forth between blocks.”

d-Matrix now has the world’s first Digital In-Memory Computing (DIMC) solution, which performs computations directly within the memory array, tightly integrating compute and memory to eliminate the bottlenecks associated with traditional architectures.
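
To see why that matters, consider a rough back-of-envelope model of autoregressive decoding (the model size and bandwidth figures below are illustrative assumptions, not d-Matrix benchmarks):

```python
# Why autoregressive token generation is memory-bound: for each generated
# token, roughly every model weight must stream from memory into the
# compute units. Figures are illustrative, not d-Matrix benchmarks.

model_bytes = 70e9 * 2       # e.g. a 70B-parameter model with 16-bit weights
bandwidth_bytes_s = 3.3e12   # ~3.3 TB/s, HBM-class off-chip bandwidth

# Upper bound on single-stream decode speed when weights live off-chip:
tokens_per_sec = bandwidth_bytes_s / model_bytes
print(f"~{tokens_per_sec:.0f} tokens/s per stream")  # ~24 tokens/s

# Computing inside the memory arrays removes this per-token round trip,
# which is the premise of DIMC.
```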

“Other startups try to use SRAM [static random-access memory] but it is very inefficient,” Ganesan indicated. “We can pack a lot more memory capacity into a card, so you can scale capacity without giving up on the bandwidth.”

Ganesan indicated that while d-Matrix started with Corsair and compute in solving AI inference at scale with very low latency, the company quickly had to move beyond that. While compute capabilities have advanced, IO has lagged. JetStream was built to close this gap by enhancing data movement between memory, compute, and interconnect systems as models scale.

“We decided we had to build a super-fast NIC to keep up with the compute, and that’s JetStream,” she said. “By having a very fast NIC, it does keep up, and we would put them next to Corsairs on the same server.”

SquadRack is the newest extension of the d-Matrix approach.

“With the launch of SquadRack, we’re enabling customers to scale inference the right way – with high efficiency, low latency, and standards-based deployment,” said Sid Sheth, CEO and Co-Founder, d-Matrix. “Corsair delivers the compute-memory acceleration, while JetStream delivers I/O acceleration. Combined with Supermicro’s AI servers, Arista’s Ethernet switches, and Broadcom’s PCIe and Ethernet switch chips, we’re delivering an AI inference rack that speeds up time to deployment. It’s a big step forward in making AI infrastructure commercially viable at scale.”

Configured with eight nodes in a single rack, SquadRack lets customers run generative AI models of up to 100 billion parameters at blazing-fast speed. For larger models or large-scale deployments, it uses industry-standard Ethernet to scale out to hundreds of nodes across multiple racks.
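
A quick sanity check on that eight-node figure (the weight precision here is an assumed value for illustration, not a published d-Matrix specification):

```python
# Rough sizing of a 100-billion-parameter model across an eight-node rack.
# The 8-bit weight assumption is illustrative, not a d-Matrix specification.

params = 100e9            # 100 billion parameters
bytes_per_param = 1       # assuming 8-bit quantized weights
model_gb = params * bytes_per_param / 1e9   # ~100 GB of weights in total

nodes = 8
print(f"Total weights: {model_gb:.0f} GB")
print(f"Per node:      {model_gb / nodes:.1f} GB")   # ~12.5 GB per node
```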

SquadRack’s key components include the d-Matrix Corsair inference accelerators and d-Matrix JetStream IO accelerators; a Supermicro X14 AI server platform integrated with the Corsair accelerators and JetStream NICs; Broadcom Atlas PCIe switches for scaling up within a single node; and Arista leaf Ethernet switches connected to the JetStream NICs, enabling high-performance, scalable, standards-based multi-node communication. The d-Matrix Aviator software stack makes it easy for customers to deploy Corsair and JetStream at scale and speed up time to inference.
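
Read as a topology, that component list maps onto a structure like the following sketch; the per-node card counts are hypothetical placeholders, since the announcement does not specify them:

```python
# A minimal sketch of the SquadRack reference topology described above.
# Per-node card counts are hypothetical placeholders, not published specs.

node = {
    "server": "Supermicro X14 AI server",
    "accelerators": ["d-Matrix Corsair"] * 8,    # compute-memory acceleration
    "nics": ["d-Matrix JetStream"] * 4,          # IO acceleration
    "scale_up": "Broadcom Atlas PCIe switch",    # intra-node fabric
}

rack = {
    "nodes": [dict(node) for _ in range(8)],     # eight nodes per rack
    "scale_out": "Arista leaf Ethernet switch",  # standards-based multi-node fabric
    "software": "d-Matrix Aviator stack",
}

print(f"{len(rack['nodes'])} nodes, scale-out via {rack['scale_out']}")
```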

“Supermicro is proud to collaborate with d-Matrix in delivering an efficient AI inference rack solution that combines compute acceleration, efficient networking, and server density in one integrated platform,” said Vik Malyala, President & Managing Director, EMEA and SVP Technology & AI, at Supermicro. “Our proven track record in rack-level integration, along with d-Matrix’s inference acceleration products, offers customers a practical path to scaling AI inference across the enterprise and cloud.”

“As a leader in high-performance PCIe and Ethernet connectivity, Broadcom is excited to see d-Matrix advancing AI infrastructure solutions,” stated Jas Tremblay, Vice President and General Manager, Data Center Solutions Group, at Broadcom. “d-Matrix is unlocking a new level of performance and efficiency in AI inference while leveraging the standards-based networking ecosystem that Broadcom has long supported.”

“Arista’s cloud networking fabric is designed to meet the rigorous demands of AI infrastructure,” said Vijay Vusirikala, Distinguished Lead, AI Systems and Networks, at Arista Networks. “JetStream’s ability to enable accelerator-to-accelerator communication over standard Ethernet pairs perfectly with Arista’s high-performance switches. Together, we’re demonstrating how AI inference can scale efficiently without requiring proprietary networking fabrics.”

SquadRack configurations will be available for purchase through Supermicro in Q1’26.

The plan is to eventually expand these routes to market beyond Supermicro; d-Matrix has already been working with the Liqid and GigaIO ecosystems on its other products.

“Our customers have preferred OEM partners, and we want them to work with preferred OEM and ODM partners,” Ganesan stated. “They purchase directly from our partners. We plan to expand to other OEM partners over time, but we are starting small as a startup. Our partners will provide efficient rack-scale solutions that bring the best together.”

“Hyperscalers tend to build their own servers and solutions,” Ganesan added. “We sell them our cards and they put them into their server SKUs. It’s validated to work in their data centres. For us, it speeds up the time to inference because it can go into existing data centres. You don’t need a greenfield environment. This lets our end customers get access to cutting-edge technology in the generative AI space. Agentic AI is the workflow that automates tasks, and it needs generative AI under the hood. We are building for that world because that’s where a lot of the pain points start emerging.”