An On-Device Generative AI Focused Neural Processing Unit in 4nm Flagship Mobile SoC with Fan-Out Wafer-Level Package

Context

This note summarizes an NPU architecture for handling on-device generative AI workloads in a mobile SoC.
The main focus is how the requirements of LLM/LVM workloads differ from those of conventional CNN-centric NPUs.

This paper argues that the heterogeneous NPU in Samsung Exynos 2400 can practically support not only CNNs but also on-device generative AI workloads.
The key idea is to co-optimize heterogeneous engine organization, memory hierarchy, tiling, and thermal path so the same NPU can handle both memory-intensive LLM decoding and compute-intensive vision generative models.
The paper reports a 1.81x to 2.65x throughput improvement over prior work, 8.3 inferences/s for the Stable Diffusion U-Net, and 140.3 inferences/s for EDSR.

Transformer-family workloads cannot be made efficient by increasing MAC count alone; DRAM traffic and non-linear operations such as softmax, normalization, and activation become major bottlenecks.
This paper is therefore a useful example of why workload-aware architecture is needed beyond simply scaling a CNN-centered NPU.
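
The DRAM-traffic point can be made concrete with a rough arithmetic-intensity estimate. The sketch below is not from the paper: the function name, the 4096 x 4096 layer shape, the INT8 element size, and the assumption that every operand crosses DRAM exactly once are all illustrative choices.

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """MACs per byte moved for an (m x k) @ (k x n) product.

    Assumes (hypothetically) that both inputs and the output each cross
    DRAM exactly once, with no on-chip reuse beyond that.
    """
    macs = m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return macs / bytes_moved


# Hypothetical layer: a 4096 x 4096 INT8 weight matrix.
prefill = matmul_arithmetic_intensity(512, 4096, 4096)  # 512 tokens at once (matrix-matrix)
decode = matmul_arithmetic_intensity(1, 4096, 4096)     # one token per step (matrix-vector)

print(f"prefill: {prefill:.1f} MACs/byte, decode: {decode:.3f} MACs/byte")
```

Under these assumptions the single-token decode step performs just under one MAC per byte fetched, while the batched product performs several hundred. Adding MACs does not help the decode case unless memory traffic is addressed as well.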

The design includes both GTE (8K MAC) and STE (512 MAC), routing high-reuse convolution and matrix-matrix multiplication to the GTE and bandwidth-sensitive matrix-vector multiplication and depthwise convolution to the STE.
The VE is not just an auxiliary block; it consists of four SIMD vector engine units that reduce Transformer/Gen-AI non-linear operation bottlenecks such as softmax, activation, and normalization, which are difficult to process efficiently with Tensor Engines alone.
A 6MB NPUMEM scratchpad, simplified Q-cache, prefetching, skewness-curve-based tiling, and L1-tile pipelining between TE and VE reduce latency between linear and non-linear operations while improving data reuse and utilization.
On the package side, fan-out wafer-level packaging (FOWLP) improves (that is, lowers) thermal resistance by 16%; with process and package changes combined, the paper reports up to a 30% higher NPU clock at the same power.
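
To illustrate why tile-level pipelining between TE and VE reduces latency, here is a generic two-stage pipeline model. It is a sketch only: the note does not describe the paper's actual scheduler, and the function names, tile count, and per-tile times are hypothetical.

```python
def serialized_latency(num_tiles: int, te_time: float, ve_time: float) -> float:
    """Each tile finishes its linear op on the TE and its non-linear op
    on the VE before the next tile starts."""
    return num_tiles * (te_time + ve_time)


def pipelined_latency(num_tiles: int, te_time: float, ve_time: float) -> float:
    """The VE processes tile i while the TE already computes tile i + 1.

    Classic two-stage pipeline: fill time plus one slot of the slower
    stage for every remaining tile.
    """
    if num_tiles == 0:
        return 0.0
    return te_time + ve_time + (num_tiles - 1) * max(te_time, ve_time)


# Hypothetical numbers: 8 tiles, 10 time units on the TE, 4 on the VE.
print(serialized_latency(8, 10.0, 4.0))  # 112.0
print(pipelined_latency(8, 10.0, 4.0))   # 84.0
```

In this model the non-linear work is hidden behind the linear work whenever the VE stage is the faster of the two, which is one way to read the claim that pipelining cuts latency between linear and non-linear operations.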

The architecture assumes workload-specific trade-offs. For example, GTE alone can have low utilization on memory-intensive operations, so a separate STE is needed.
The benchmark results are impressive, but detailed power/throughput breakdowns are not provided for every workload.
Because thermal/package benefits contribute materially to sustained performance, interpreting the architecture in isolation can overstate its generality.
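
The GTE utilization trade-off mentioned above can be sketched with a simple bandwidth-bound model. Only the MAC counts come from this note (8K is taken to mean 8192). The function name, the 256 bytes/cycle memory bandwidth, and the MACs-per-byte figures are hypothetical values chosen for illustration.

```python
def mac_utilization(num_macs: int, macs_per_byte: float, bytes_per_cycle: float) -> float:
    """Fraction of MAC units kept busy when memory delivers
    bytes_per_cycle and each byte feeds macs_per_byte MAC operations."""
    sustainable_macs_per_cycle = macs_per_byte * bytes_per_cycle
    return min(1.0, sustainable_macs_per_cycle / num_macs)


GTE_MACS = 8192  # "8K MAC" in the note, assumed to mean 8192
STE_MACS = 512
BYTES_PER_CYCLE = 256  # hypothetical memory bandwidth

# Matrix-vector style op: roughly one MAC per INT8 weight byte (assumption).
print(mac_utilization(GTE_MACS, 1.0, BYTES_PER_CYCLE))   # 0.03125
print(mac_utilization(STE_MACS, 1.0, BYTES_PER_CYCLE))   # 0.5

# High-reuse op: 64 MACs per byte (assumption) saturates the large engine.
print(mac_utilization(GTE_MACS, 64.0, BYTES_PER_CYCLE))  # 1.0
```

Under these made-up numbers the large array sits almost idle on low-reuse work while the small one stays half busy, which matches the intuition for having a separate bandwidth-oriented engine. The real figures depend on details this note does not cover.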

Compare this paper with [[Knowledge/Paper Reviews/A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC]]: the linked 2022 paper emphasizes data movement, a unified datapath, and operating modes, while this 2025 paper emphasizes heterogeneous engines, memory hierarchy, and thermal/package co-design.
It would be useful to collect more examples showing the importance of bandwidth-oriented engines like STE in LLM decode paths.
