
Modern AI accelerators are rarely limited by peak HBM bandwidth; they are limited by the bandwidth workloads actually sustain under attention and KV-cache traffic.
The gap between peak and sustained bandwidth is where real performance is lost.
As models scale and context windows grow, attention-driven data movement dominates, creating a bottleneck that additional compute cannot overcome.
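A back-of-envelope sketch makes the scaling concrete. Every decoded token must stream the entire KV cache from HBM; the model shape below (layer count, KV heads, head dimension, fp16 storage) is an illustrative assumption, not a measurement of any specific accelerator.

```python
def kv_cache_read_bytes(context_len: int,
                        n_layers: int = 32,
                        n_kv_heads: int = 8,
                        head_dim: int = 128,
                        dtype_bytes: int = 2) -> int:
    """Bytes streamed from HBM to attend one new token over the cache.

    Each cached token contributes one K and one V vector per layer
    (grouped-query-attention shape; all parameters are illustrative).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return context_len * per_token

# 128 KiB of cache per token. At a 128K-token context, each decode
# step reads ~16 GiB, so a modest 100 tokens/s already demands
# ~1.6 TiB/s of *sustained* bandwidth -- before weights move at all.
per_step = kv_cache_read_bytes(128 * 1024)
print(per_step / 2**30, "GiB per decode step")
```

The point of the sketch is that this traffic grows linearly with context length while compute per token stays roughly flat, which is why adding FLOPs does not close the gap.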

SoftNMC is implemented as a small chiplet adjacent to each HBM stack or integrated into the memory subsystem.
Operating directly at the memory interface, it transforms attention flows before they hit the bus, reducing pressure on the interconnect and downstream compute.
This approach scales naturally with multi-stack HBM architectures.
Flow Representation:
Compute Fabric (Tensix / Dataflow)
        ↓  KV / attention traffic
SoftNMC (memory interface layer)
        ↓  reduced traffic
HBM Controller → HBM
Memory bandwidth, not compute, limits sustained performance, and the gap between peak and sustained throughput is where that limit bites. SoftNMC reduces data movement at the source, improving effective bandwidth utilization.
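The effect of reducing traffic at the source can be captured in a toy model: take a peak bandwidth, the fraction of it that attention traffic actually sustains, and a traffic-reduction ratio achieved at the memory interface. All three numbers below are hypothetical placeholders, not characterized results.

```python
def effective_throughput(peak_gbs: float,
                         sustained_frac: float,
                         reduction_ratio: float) -> float:
    """Logical data serviced per second (GB/s) when on-bus traffic
    shrinks by `reduction_ratio` while the link still sustains only
    `sustained_frac` of its peak rate."""
    return peak_gbs * sustained_frac * reduction_ratio

# Hypothetical HBM3 stack: 3200 GB/s peak, 60% sustained under
# attention traffic, and a 2x traffic reduction at the interface.
baseline = effective_throughput(3200, 0.60, 1.0)   # ~1920 GB/s
with_nmc = effective_throughput(3200, 0.60, 2.0)   # ~3840 GB/s
```

Note the multiplicative structure: halving on-bus traffic doubles the logical data serviced per sustained byte, which is why the same physical link can behave as if it had twice the bandwidth.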
Traditional CMOS logic is optimized for fixed-function execution; built that way, the structures required to dynamically adapt and compress attention flows are inefficient in both area and power. As a result, this capability has not been practical at the memory interface.
DRDCL is a logic architecture designed to increase the utility of each transistor, making that adaptation affordable where it was not before. Built on it, SoftNMC is delivered as characterized standard cells and hardened macros compatible with existing digital design flows.
It enables systems to use the bandwidth they already have.