Technology

Reducing attention and KV traffic at the memory interface

How It Works

The Problem

Modern AI accelerators are not limited by peak HBM bandwidth. They are limited by the bandwidth that workloads actually sustain under attention and KV cache traffic.

The gap between peak and sustained bandwidth is where real performance is lost.

As models scale and context windows grow, attention-driven data movement dominates, creating a bottleneck that additional compute cannot overcome.
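To put a scale on this data movement, here is a minimal sketch that estimates how many KV cache bytes a decoder reads to generate one token, using the standard accounting of one K and one V tensor per layer. The model shape, the 16-bit element size, and the sustained bandwidth figure are illustrative assumptions, not measurements of any specific accelerator or model.

```python
def kv_bytes_per_decode_step(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes of cached K and V read to generate one token for one sequence."""
    # Factor of 2: each layer stores one K tensor and one V tensor.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


def bandwidth_bound_tokens_per_s(sustained_bw_bytes_per_s, bytes_per_step):
    """Upper bound on decode rate if KV reads were the only memory traffic."""
    return sustained_bw_bytes_per_s / bytes_per_step


# Hypothetical model shape and context length (assumptions for illustration).
step_bytes = kv_bytes_per_decode_step(
    layers=32, kv_heads=8, head_dim=128, context_len=32768
)

# Hypothetical sustained bandwidth of 1 TB/s (assumption, not a vendor figure).
bound = bandwidth_bound_tokens_per_s(1e12, step_bytes)

print(f"KV bytes per decode step: {step_bytes / 2**30:.1f} GiB")
print(f"Bandwidth-bound decode rate: {bound:.0f} tokens/s per sequence")
```

Under these assumed numbers, a single 32K-token sequence reads 4 GiB of KV data per generated token. The read volume grows linearly with context length, so the bound on decode rate falls as context grows, however much compute is available.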

What SoftNMC Does

SoftNMC reduces attention and KV cache traffic at the memory interface before it traverses the interconnect. This converts a bandwidth-limited system into a throughput-scaled system, increasing effective tokens per second without requiring changes to the programming model.
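As a rough illustration of why reducing traffic raises effective tokens per second, the sketch below models a decode step that is purely bandwidth-bound. The function name, the bandwidth figure, and the `traffic_reduction` fraction are all hypothetical placeholders chosen for the example. They are not SoftNMC specifications or measured results.

```python
def tokens_per_s_with_reduction(sustained_bw_bytes_per_s, kv_bytes_per_step, traffic_reduction):
    """Bandwidth-bound decode rate after removing a fraction of KV traffic.

    traffic_reduction is a hypothetical fraction in [0, 1) of bytes that
    no longer cross the interconnect. It is an assumption, not a product figure.
    """
    if not 0.0 <= traffic_reduction < 1.0:
        raise ValueError("traffic_reduction must be in [0, 1)")
    remaining_bytes = kv_bytes_per_step * (1.0 - traffic_reduction)
    return sustained_bw_bytes_per_s / remaining_bytes


# Illustrative inputs only: 1 TB/s sustained, 4 GB of KV reads per step.
baseline = tokens_per_s_with_reduction(1e12, 4e9, 0.0)
reduced = tokens_per_s_with_reduction(1e12, 4e9, 0.5)

print(f"baseline: {baseline:.0f} tokens/s")
print(f"with 50% less traffic: {reduced:.0f} tokens/s")
print(f"speedup: {reduced / baseline:.2f}x")
```

In this simplified model the speedup is 1 / (1 - traffic_reduction), so halving the traffic doubles the bandwidth-bound rate while peak bandwidth stays unchanged. Real workloads also spend time on weight reads and compute, so the achieved gain would be smaller than this bound.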

Where It Sits

SoftNMC is implemented as a small chiplet adjacent to each HBM stack or integrated into the memory subsystem.

Operating directly at the memory interface, it transforms attention flows before they hit the bus, reducing pressure on the interconnect and downstream compute.

This approach scales naturally with multi-stack HBM architectures.

System Placement

Flow Representation:

Compute Fabric (Tensix / Dataflow)
↓ KV / Attention Traffic
SoftNMC (Memory Interface Layer)
↓ Reduced Traffic
HBM Controller → HBM

Why This Matters

Memory bandwidth limits sustained performance.

There is a gap between peak and sustained throughput.

SoftNMC reduces data movement at the source and improves effective bandwidth utilization.

Why CMOS Doesn’t Already Do This

Traditional CMOS logic is optimized for fixed-function execution.

When built from such logic, the structures required to dynamically adapt and compress attention flows become inefficient in area and power.

As a result, this function has not been practical at the memory interface.

What DRDCL Enables

DRDCL is a logic architecture designed to increase the utility of each transistor.

Integration

SoftNMC is delivered as characterized standard cells and hardened macros compatible with existing digital design flows.

SoftNMC does not increase peak bandwidth.

It enables systems to use the bandwidth they already have.