Context
- This note summarizes how a mobile/embedded NPU handles mixed precision, low latency, and always-on requirements within one architecture.
- The main focus is workloads with mismatched utilization and precision needs, such as shallow layers, depthwise convolution, and FP16 workloads.
What
- This paper argues that a 4nm dual-core NPU can handle diverse mobile workloads more flexibly by combining data-movement techniques for higher HW utilization, a unified multi-precision datapath, and multiple operating modes.
- The key idea is not to build separate datapaths for each workload. Instead, a single unified multi-precision datapath supports INT4/INT8/INT16/FP16, while operating modes such as low-latency and AON cover different mobile workload needs.
- The paper reports that scatter-gather improves utilization by up to 4x in cases such as shallow layers, AON mode reduces power by 89% compared with standard mode, MobileNetEdgeTPU reaches 3433 inference/s, and peak energy efficiency reaches 11.59 TOPS/W.
Why
- Mobile NPUs face different precision, latency, and power requirements across workloads, so a single fixed optimum is difficult to design around.
- This paper is therefore a useful example of how one architecture can trade off precision, utilization, latency, and always-on behavior.
How
- Each core has 4096 8b MAC units and 1MB TCM; together, the two cores form an 8K MAC configuration.
- For data movement and utilization, zero skipping reduces idle MACs on sparse feature maps, scatter-gather mitigates low MAC utilization in shallow/depthwise layers with small input channel depth, and TCM-based reuse reduces DRAM traffic.
- For the unified multi-precision datapath, a fused dot-product structure supports INT4, INT8, INT16, and FP16, reducing area/power overhead versus separate datapaths.
- For operating modes, the design includes a low-latency cooperative mode where two cores exchange halo regions, plus a separate AON mode that reduces DRAM access and powers only essential blocks.
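To make the utilization argument concrete, here is a toy model of why small input channel depth leaves MAC lanes idle and how packing several independent groups into the same lanes can recover them. Only the per-core MAC count and the core count come from the note. The lane width (`CHANNEL_LANES`), the packing cap (`MAX_GROUPS`), and the function `lane_utilization` are hypothetical: the note does not describe the real array shape or how scatter-gather is implemented, and the cap of 4 is chosen only so that the toy model's ceiling lines up with the reported "up to 4x".

```python
import math

MACS_PER_CORE = 4096  # from the note: 4096 8b MAC units per core
NUM_CORES = 2         # from the note: dual-core NPU
TOTAL_MACS = MACS_PER_CORE * NUM_CORES  # 8192, the "8K MAC" configuration

# Hypothetical parameters, not taken from the paper.
CHANNEL_LANES = 32  # assumed number of lanes reducing along input channels
MAX_GROUPS = 4      # assumed cap on how many groups can be packed together


def lane_utilization(c_in, lanes=CHANNEL_LANES, pack=False, max_groups=MAX_GROUPS):
    """Fraction of channel lanes doing useful work for a layer with c_in input channels.

    Without packing, a layer with c_in < lanes leaves (lanes - c_in) lanes idle.
    With packing, up to max_groups independent groups (for example, different
    spatial positions) share the lanes, which is the intuition behind gathering.
    """
    if c_in <= 0 or lanes <= 0 or max_groups <= 0:
        raise ValueError("c_in, lanes, and max_groups must be positive")
    if pack and c_in < lanes:
        groups = min(lanes // c_in, max_groups)
        return groups * c_in / lanes
    # Deep layers need several passes; only the last pass may be partly idle.
    passes = math.ceil(c_in / lanes)
    return c_in / (passes * lanes)


if __name__ == "__main__":
    for c_in in (3, 8, 16, 32, 48):
        base = lane_utilization(c_in)
        packed = lane_utilization(c_in, pack=True)
        print(f"c_in={c_in:2d}  baseline={base:.3f}  packed={packed:.3f}  gain={packed / base:.2f}x")
```

In this sketch the gain shrinks as the input channel depth approaches the lane width, which is consistent with the note's point that the benefit is concentrated in shallow and depthwise layers.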
Pitfalls
- As the paper acknowledges, the fused FP16 dot-product does not produce exactly the same numerical result as sequential floating-point accumulation.
- The 4x benefit from scatter-gather is measured on specific utilization bottlenecks such as shallow layers or depthwise convolution; for the full DeepLabV3 FP16 model, compute utilization improves from 37% to about 50% (roughly a 1.35x gain), so the two results should be interpreted separately.
- More operating modes can increase runtime scheduling and software stack complexity, but the paper does not deeply analyze this aspect.
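The first pitfall can be illustrated with a small, generic experiment. The note does not say how the fused unit aligns or rounds its operands internally, so the sketch below is not a model of the paper's hardware. It only contrasts two textbook strategies: rounding to FP16 after every multiply and add, versus summing exact products and rounding once at the end. The function names are hypothetical, and FP16 rounding is emulated with the standard library's `struct` half-precision format.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sequential_fp16_dot(a, b):
    """Dot product that rounds to FP16 after every multiply and every add."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)
    return acc


def single_rounding_dot(a, b):
    """Dot product that keeps products exact and rounds to FP16 only once.

    Products of two FP16 values are exact in float64, and math.fsum returns a
    correctly rounded float64 sum, so the only FP16 rounding is the final one
    (ignoring rare double-rounding effects from the float64 step).
    """
    products = [to_fp16(x) * to_fp16(y) for x, y in zip(a, b)]
    return to_fp16(math.fsum(products))


if __name__ == "__main__":
    # Near 2048 the spacing between adjacent FP16 values is 2, so adding 0.5
    # step by step is lost each time, while the exact sum 2050 is representable.
    a = [2048.0, 0.5, 0.5, 0.5, 0.5]
    b = [1.0, 1.0, 1.0, 1.0, 1.0]
    print(sequential_fp16_dot(a, b), single_rounding_dot(a, b))  # 2048.0 2050.0
```

Neither result is "wrong"; they come from different rounding schedules, which is why bit-exact comparison against a sequential FP16 reference should not be expected.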
Next steps
- Compared with [[Knowledge/Paper Reviews/An On-Device Generative AI Focused Neural Processing Unit in 4nm Flagship Mobile SoC with Fan-Out Wafer-Level Package]], this paper emphasizes data movement, a unified datapath, and operating modes, while the later paper emphasizes heterogeneous engines, memory hierarchy, and thermal/package co-design.
- It would be useful to check whether scatter-gather and unified multi-precision datapaths remain recurring patterns in recent mobile NPUs.
Related notes
- An On-Device Generative AI Focused Neural Processing Unit in 4nm Flagship Mobile SoC with Fan-Out Wafer-Level Package