Context
- This note summarizes an NPU architecture for handling on-device generative AI workloads in a mobile SoC.
- The main focus is how LLM/LVM workloads differ from conventional CNN-centered NPU requirements.
What
- This paper argues that the heterogeneous NPU in Samsung Exynos 2400 can practically support not only CNNs but also on-device generative AI workloads.
- The key idea is to co-optimize heterogeneous engine organization, memory hierarchy, tiling, and thermal path so the same NPU can handle both memory-intensive LLM decoding and compute-intensive vision generative models.
- The paper reports a 1.81x to 2.65x throughput improvement over prior work, 8.3 inference/s for Stable Diffusion U-Net, and 140.3 inference/s for EDSR.
Why
- Transformer-family workloads cannot be made efficient by increasing MAC count alone; DRAM traffic and non-linear operations such as softmax, normalization, and activation become major bottlenecks.
- This paper is therefore a useful example of why workload-aware architecture is needed beyond simply scaling a CNN-centered NPU.
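The memory-bound versus compute-bound distinction above can be made concrete with a rough arithmetic-intensity estimate. This is a minimal sketch, not taken from the paper: the function name, the 4096 x 4096 layer shape, the 512-token batch, and the 1-byte element size are all assumptions, and it idealizes memory traffic by counting each operand and the output exactly once.

```python
def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """MACs per byte moved for an (m x k) @ (k x n) product.

    Idealized: each input operand and the output cross the memory
    boundary exactly once (no re-fetching, no caching effects).
    """
    macs = m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return macs / bytes_moved


# Hypothetical layer: 4096 x 4096 weight matrix, 8-bit elements.
k = n = 4096
decode = gemm_arithmetic_intensity(1, k, n)     # one token: matrix-vector product
prefill = gemm_arithmetic_intensity(512, k, n)  # 512 tokens: matrix-matrix product

print(f"decode  (m=1):   {decode:.2f} MAC/byte")
print(f"prefill (m=512): {prefill:.2f} MAC/byte")
```

With one token, every weight is fetched for a single MAC, so intensity stays below 1 MAC/byte and adding MAC units does not help; with many tokens the same weights are reused and intensity rises by orders of magnitude.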
How
- The design includes both GTE (8K MAC) and STE (512 MAC), splitting high-reuse convolution / matrix-matrix multiplication from bandwidth-sensitive matrix-vector multiplication / depthwise convolution.
- The VE is not just an auxiliary block; it consists of four SIMD vector engine units that reduce Transformer/Gen-AI non-linear operation bottlenecks such as softmax, activation, and normalization, which are difficult to process efficiently with Tensor Engines alone.
- A 6MB NPUMEM scratchpad, simplified Q-cache, prefetching, skewness-curve-based tiling, and L1-tile pipelining between TE and VE reduce latency between linear and non-linear operations while improving data reuse and utilization.
- On the package side, FOWLP improves thermal resistance by 16%; including process/package changes, the paper reports up to a 30% higher NPU clock at the same power.
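The engine split described above can be pictured as a static routing table from operation kind to engine class. This is only an illustrative sketch: the op-kind labels, the `ROUTING` table, and `assign_engine` are hypothetical names, and the paper's actual scheduling may be more dynamic than a fixed lookup.

```python
from enum import Enum


class Engine(Enum):
    GTE = "GTE"  # 8K MAC tensor engine (per the note)
    STE = "STE"  # 512 MAC tensor engine (per the note)
    VE = "VE"    # SIMD vector engine units (per the note)


# Hypothetical op-kind labels; the grouping follows the note's description.
ROUTING = {
    "conv": Engine.GTE,            # high-reuse convolution
    "matmul": Engine.GTE,          # matrix-matrix multiplication
    "matvec": Engine.STE,          # bandwidth-sensitive matrix-vector multiplication
    "depthwise_conv": Engine.STE,  # bandwidth-sensitive depthwise convolution
    "softmax": Engine.VE,
    "activation": Engine.VE,
    "normalization": Engine.VE,
}


def assign_engine(op_kind: str) -> Engine:
    """Return the engine class the note associates with an op kind."""
    try:
        return ROUTING[op_kind]
    except KeyError:
        raise ValueError(f"unknown op kind: {op_kind!r}") from None


# Illustrative op sequence, loosely modeled on single-token decoding.
decode_step = ["matvec", "softmax", "matvec", "normalization", "activation"]
plan = [(op, assign_engine(op).value) for op in decode_step]
print(plan)
```

In this toy sequence no operation lands on GTE, which matches the note's point that memory-intensive decode paths lean on STE and VE rather than on the large MAC array.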
Pitfalls
- The architecture assumes workload-specific trade-offs. For example, GTE alone can have low utilization on memory-intensive operations, so a separate STE is needed.
- The benchmark results are impressive, but detailed power/throughput breakdowns are not provided for every workload.
- Because thermal/package benefits contribute materially to sustained performance, interpreting the architecture in isolation can overstate its generality.
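To get a feel for how much of the headroom could come from packaging alone, a first-order steady-state model (temperature rise = power x thermal resistance) is enough. This is a back-of-the-envelope sketch, not the paper's analysis: it assumes the reported 16% improvement means a 16% lower thermal resistance, and `power_headroom_ratio` is a hypothetical helper.

```python
def power_headroom_ratio(rth_reduction: float) -> float:
    """Ratio of allowable power after vs. before a fractional drop in
    thermal resistance, at an unchanged steady-state temperature rise.

    Model: delta_T = P * R_th, so P_new / P_old = 1 / (1 - rth_reduction).
    """
    if not 0.0 <= rth_reduction < 1.0:
        raise ValueError("rth_reduction must be in [0, 1)")
    return 1.0 / (1.0 - rth_reduction)


ratio = power_headroom_ratio(0.16)  # 16% figure from the note
print(f"allowable power at the same temperature rise: x{ratio:.3f}")
```

Under this model a 16% lower thermal resistance allows roughly 19% more power at the same temperature rise. How that translates into clock frequency depends on the voltage/frequency curve, which the note does not cover, so the result should not be read as explaining the reported clock gain.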
Next steps
- Compare this paper with [[Knowledge/Paper Reviews/A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC]]: the 2022 paper emphasizes data movement, unified datapath, and operating modes, while the 2025 paper emphasizes heterogeneous engines, memory hierarchy, and thermal/package co-design.
- It would be useful to collect more examples showing the importance of bandwidth-oriented engines like STE in LLM decode paths.
Related notes
- A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC