A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC

Context

This note summarizes how a mobile/embedded NPU handles mixed precision, low latency, and always-on requirements within one architecture.
The main focus is on workloads that are hard to serve well because of low utilization or higher precision needs, such as shallow layers, depthwise convolution, and FP16 models.

This paper argues that a 4nm dual-core NPU can handle diverse mobile workloads more flexibly by combining data-movement techniques for higher HW utilization, a unified multi-precision datapath, and multiple operating modes.
The key idea is not to build separate datapaths for each workload. Instead, a single unified multi-precision datapath supports INT4/INT8/INT16/FP16, while operating modes such as low-latency and AON cover different mobile workload needs.
The paper reports that scatter-gather improves utilization by up to 4x in cases such as shallow layers, AON mode reduces power by 89% compared with standard mode, MobileNetEdgeTPU reaches 3433 inferences/s, and peak energy efficiency reaches 11.59 TOPS/W.

Mobile NPUs face different precision, latency, and power requirements across workloads, so no single fixed design point is optimal for all of them.
This paper is therefore a useful example of how one architecture can trade off precision, utilization, latency, and always-on behavior.

Each core has 4096 8-bit MAC units and 1MB of TCM; together, the two cores form the 8K-MAC (8192 MAC) configuration.
For data movement and utilization, zero skipping reduces idle MACs on sparse feature maps, scatter-gather mitigates low MAC utilization in shallow/depthwise layers with small input channel depth, and TCM-based reuse reduces DRAM traffic.
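
To see why small input channel depth hurts utilization, here is a toy model, not the paper's actual microarchitecture. It assumes a hypothetical MAC array whose dot-product lanes are filled along the input-channel axis (`LANES = 32` is an illustrative number, not taken from the paper) and models scatter-gather as packing several spatial positions into otherwise idle lanes. The `max_pack = 4` limit is chosen only to mirror the reported "up to 4x" ceiling.

```python
LANES = 32  # hypothetical dot-product width along input channels; not from the paper


def utilization(in_channels: int, pack: int = 1, lanes: int = LANES) -> float:
    """Fraction of lanes doing useful work in one cycle of the toy model.

    pack is the number of spatial positions gathered into one lane group
    (pack=1 means no scatter-gather). Tiling remainders for layers wider
    than the array are ignored.
    """
    if in_channels <= 0 or pack <= 0:
        raise ValueError("in_channels and pack must be positive")
    busy = min(lanes, in_channels * pack)
    return busy / lanes


def best_pack(in_channels: int, max_pack: int = 4, lanes: int = LANES) -> int:
    """Largest packing factor (at most max_pack) that still fits in the lanes."""
    return max(1, min(max_pack, lanes // in_channels))


for c in (3, 8, 32):
    p = best_pack(c)
    print(c, utilization(c), utilization(c, p))
# 3 0.09375 0.375
# 8 0.25 1.0
# 32 1.0 1.0
```

In this model the gain appears only when the input channel count is well below the lane width, which matches the note's point that the benefit is specific to shallow and depthwise layers.
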
For the unified multi-precision datapath, a fused dot-product structure supports INT4, INT8, INT16, and FP16, reducing area/power overhead versus separate datapaths.
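
One common way for a single datapath to serve several integer widths is to build wider multiplies from narrower partial products. The sketch below shows that idea for a signed 16-bit multiply composed from 8-bit pieces. Whether the paper's fused dot-product unit decomposes operands this way is an assumption, and all function names are illustrative.

```python
def split_int16(x: int) -> tuple:
    """Split a signed 16-bit value into (signed high byte, unsigned low byte)."""
    if not -(1 << 15) <= x < (1 << 15):
        raise ValueError("x must fit in signed 16 bits")
    lo = x & 0xFF        # unsigned, 0..255
    hi = (x - lo) >> 8   # signed, -128..127
    return hi, lo


def mul16_from_8bit(a: int, b: int) -> int:
    """16x16 multiply built from four byte-wide partial products.

    The partial products mix signed and unsigned bytes, so real hardware
    would need multipliers that handle both cases.
    """
    a_hi, a_lo = split_int16(a)
    b_hi, b_lo = split_int16(b)
    return ((a_hi * b_hi) << 16) + ((a_hi * b_lo + a_lo * b_hi) << 8) + a_lo * b_lo


def dot16(xs, ws) -> int:
    """INT16 dot product accumulated from the composed multiplies."""
    return sum(mul16_from_8bit(x, w) for x, w in zip(xs, ws))


print(dot16([1, -2, 300], [4, 5, -6]))  # -1806
```

The same multiplier array can then be used as four independent INT8 products or as one INT16 product, which is the kind of sharing that avoids a separate datapath per precision.
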
For operating modes, the design includes a low-latency cooperative mode where two cores exchange halo regions, plus a separate AON mode that reduces DRAM access and powers only essential blocks.
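
The halo exchange in the cooperative mode can be illustrated with a 1-D analogue: when a feature map is split between two cores, each tile needs a few boundary elements from the neighboring region so that outputs at the seam match the unsplit result. This is a sketch with illustrative names, not the paper's partitioning scheme. It uses a "valid" 1-D convolution for brevity, so the halo is one-sided; a centered 2-D kernel would need halo rows on both sides of the seam.

```python
def conv1d_valid(x, k):
    """Plain 'valid' 1-D correlation: output length is len(x) - len(k) + 1."""
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n)) for i in range(len(x) - n + 1)]


def split_with_halo(x, kernel_size):
    """Split x into two tiles that overlap by a halo of kernel_size - 1 elements."""
    halo = kernel_size - 1
    out_len = len(x) - halo    # number of outputs of the unsplit convolution
    mid = out_len // 2         # core 0 produces outputs [0, mid)
    tile0 = x[: mid + halo]    # core 0 inputs, including the halo borrowed from core 1
    tile1 = x[mid:]            # core 1 inputs
    return tile0, tile1


def two_core_conv(x, k):
    """Run each tile independently and concatenate the two output halves."""
    tile0, tile1 = split_with_halo(x, len(k))
    return conv1d_valid(tile0, k) + conv1d_valid(tile1, k)


x = list(range(10))
k = [1, 0, -1]
print(two_core_conv(x, k) == conv1d_valid(x, k))  # True
```

Only the halo elements cross between cores, so the exchanged data grows with kernel size and tile perimeter rather than with the size of the whole feature map.
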

As the paper acknowledges, the fused FP16 dot-product does not produce exactly the same numerical result as sequential floating-point accumulation.
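
A small experiment shows why a fused dot product can differ from sequential accumulation. The sketch below emulates FP16 rounding with the standard library's half-precision `struct` format and compares rounding after every operation against rounding once at the end. It is a model of the general effect only: the paper's unit may use a different internal width and rounding, and the inputs are assumed to be exactly representable in FP16.

```python
import struct
from fractions import Fraction


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_sequential_fp16(a, b):
    """Round after every multiply and every add, like a chain of FP16 operations."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_fused_fp16(a, b):
    """Sum all products exactly, then round once at the end."""
    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    # float() followed by to_fp16 rounds twice; that is harmless for this demo.
    return to_fp16(float(exact))


a = [2048.0, 1.0, 1.0]
b = [1.0, 1.0, 1.0]
print(dot_sequential_fp16(a, b))  # 2048.0
print(dot_fused_fp16(a, b))       # 2050.0
```

At 2048 the spacing between adjacent FP16 values is 2, so each `+ 1.0` is rounded away in the sequential version, while the fused version keeps both increments until the single final rounding.
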
The up-to-4x benefit from scatter-gather is measured on specific utilization bottlenecks such as shallow layers or depthwise convolution. For the full DeepLabV3 FP16 model, compute utilization improves from 37% to about 50% (roughly 1.35x), so the layer-level and whole-model results should be interpreted separately.
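
The gap between a large gain on bottleneck layers and a modest whole-model change is what Amdahl-style reasoning predicts. As a back-of-the-envelope sketch, assume utilization is inversely proportional to runtime for a fixed amount of work, and that a fraction `f` of baseline runtime is spent in layers that speed up by a factor `s`. Both are simplifying assumptions; the per-layer breakdown is not given in this note, so the resulting `f` is illustrative rather than a reported number.

```python
def overall_speedup(f: float, s: float) -> float:
    """Amdahl's law: fraction f of baseline time is accelerated by factor s."""
    if not 0.0 <= f <= 1.0 or s <= 0.0:
        raise ValueError("need 0 <= f <= 1 and s > 0")
    return 1.0 / ((1.0 - f) + f / s)


def implied_fraction(speedup: float, s: float) -> float:
    """Invert Amdahl's law: which f yields the observed overall speedup?"""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / s)


observed = 0.50 / 0.37                 # whole-model utilization ratio
f = implied_fraction(observed, s=4.0)  # assumes every affected layer gains the full 4x
print(round(observed, 2), round(f, 2))  # 1.35 0.35
```

Under these assumptions, roughly a third of baseline runtime would sit in layers that benefit. Because 4x is an upper bound, a smaller per-layer gain would imply a larger affected fraction.
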
More operating modes can increase runtime scheduling and software stack complexity, but the paper does not analyze this in depth.

Compared with [[Knowledge/Paper Reviews/An On-Device Generative AI Focused Neural Processing Unit in 4nm Flagship Mobile SoC with Fan-Out Wafer-Level Package]], this paper emphasizes data movement, a unified datapath, and operating modes, while the later paper emphasizes heterogeneous engines, memory hierarchy, and thermal/package co-design.
It would be useful to check whether scatter-gather and unified multi-precision datapaths remain recurring patterns in recent mobile NPUs.
