Paper summary

An On-Device Generative AI Focused Neural Processing Unit in 4nm Flagship Mobile SoC with Fan-Out Wafer-Level Package

This note summarizes an NPU architecture for handling on-device generative AI workloads in a mobile SoC.

Authors

Jun-Seok Park, Taehee Lee, Heonsoo Lee, Changsoo Park, Youngsang Cho, Mookyung Kang, Heeseok Lee, Jinwon Kang, Taeho Jeon, Dongwoo Lee, Yesung Kang, Kyungmok Kum, Geunwon Lee, Hongki Lee, Minkyu Kim, Suknam Kwon, Sung-beom Park, Dongkeun Kim, Chulmin Jo, HyukJun Chung, Ilryoung Kim, Jongyoul Lee

Publication

2025 IEEE International Solid-State Circuits Conference (ISSCC 2025), 2025-02-18

Context

  • The main focus is how the requirements of LLM/LVM workloads differ from those of a conventional CNN-centered NPU.

What

  • This paper argues that the heterogeneous NPU in Samsung Exynos 2400 can practically support not only CNNs but also on-device generative AI workloads.
  • The key idea is to co-optimize heterogeneous engine organization, memory hierarchy, tiling, and thermal path so the same NPU can handle both memory-intensive LLM decoding and compute-intensive vision generative models.
  • The paper reports a 1.81x to 2.65x throughput improvement over prior work, 8.3 inferences/s for the Stable Diffusion U-Net, and 140.3 inferences/s for EDSR.

Why

  • Transformer-family workloads cannot be made efficient by increasing MAC count alone; DRAM traffic and non-linear operations such as softmax, normalization, and activation become major bottlenecks.
  • This paper is therefore a useful example of why workload-aware architecture is needed beyond simply scaling a CNN-centered NPU.
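
The DRAM-traffic point can be made concrete with a rough arithmetic-intensity estimate. The sketch below is not from the paper: the layer size (4096 x 4096), the 1-byte element width, and the assumption that every operand crosses the DRAM boundary exactly once are all illustrative choices. It compares a single-token decode step (matrix-vector) with a 512-token batch (matrix-matrix) on the same weight matrix.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """MACs per byte moved for C[m,n] = A[m,k] @ B[k,n].

    Assumes each input and the output is transferred exactly once
    (ideal on-chip reuse), which is an illustrative simplification.
    """
    macs = m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return macs / bytes_moved


# Hypothetical layer size, chosen only for illustration.
K, N = 4096, 4096

decode = arithmetic_intensity(1, K, N)     # one token: matrix-vector
prefill = arithmetic_intensity(512, K, N)  # 512 tokens: matrix-matrix

print(f"decode  (M=1):   {decode:.2f} MAC/byte")
print(f"prefill (M=512): {prefill:.2f} MAC/byte")
```

Under these assumptions the decode step performs about 1 MAC per byte of traffic while the batched case performs about 410, so adding MACs helps the second case but leaves the first limited by memory bandwidth.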

How

  • The design includes both GTE (8K MAC) and STE (512 MAC) tensor engines: the GTE handles high-reuse convolution and matrix-matrix multiplication, while the STE handles bandwidth-sensitive matrix-vector multiplication and depthwise convolution.
  • The VE is not just an auxiliary block; it consists of four SIMD vector engine units that reduce Transformer/Gen-AI non-linear operation bottlenecks such as softmax, activation, and normalization, which are difficult to process efficiently with Tensor Engines alone.
  • A 6MB NPUMEM scratchpad, simplified Q-cache, prefetching, skewness-curve-based tiling, and L1-tile pipelining between TE and VE reduce latency between linear and non-linear operations while improving data reuse and utilization.
  • On the package side, the fan-out wafer-level package (FOWLP) lowers thermal resistance by 16%; with process and package changes combined, the paper reports up to a 30% higher NPU clock at the same power.
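
The benefit of L1-tile pipelining between TE and VE can be illustrated with a simple two-stage timing model. This is a minimal sketch, not the paper's scheduler: the function names, the per-tile times, and the assumption that the buffer between the two stages never stalls the TE are all hypothetical.

```python
def serial_latency(te_times, ve_times):
    """TE finishes every tile of the layer before VE starts any of them."""
    return sum(te_times) + sum(ve_times)


def pipelined_latency(te_times, ve_times):
    """Two-stage pipeline: VE processes tile i while TE computes tile i+1.

    Assumes enough buffering between the stages that TE never stalls.
    """
    te_done = 0.0
    ve_done = 0.0
    for te, ve in zip(te_times, ve_times):
        te_done += te                         # TE runs tiles back to back
        ve_done = max(ve_done, te_done) + ve  # VE waits for its input tile
    return ve_done


# Hypothetical per-tile times in arbitrary units: 8 tiles, linear then non-linear.
te = [4.0] * 8
ve = [3.0] * 8

print(serial_latency(te, ve))     # 56.0
print(pipelined_latency(te, ve))  # 35.0
```

With these numbers the VE work is almost entirely hidden behind the TE work; only the last tile's non-linear step remains exposed.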

Pitfalls

  • The architecture assumes workload-specific trade-offs. For example, GTE alone can have low utilization on memory-intensive operations, so a separate STE is needed.
  • The benchmark results are impressive, but detailed power/throughput breakdowns are not provided for every workload.
  • Because thermal/package benefits contribute materially to sustained performance, interpreting the architecture in isolation can overstate its generality.
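
A first-order way to see why the package matters for sustained performance is the steady-state thermal relation below. This is an illustrative model, not a derivation from the paper: the symbols ΔT (allowed temperature rise), P (dissipated power), and R_θ (junction-to-ambient thermal resistance) are assumptions, and the 16% figure is read here as a 16% reduction in R_θ.

```latex
\Delta T = P \cdot R_{\theta}
\quad\Rightarrow\quad
\frac{P_{\text{new}}}{P_{\text{old}}}
  = \frac{R_{\theta,\text{old}}}{R_{\theta,\text{new}}}
  = \frac{1}{1 - 0.16} \approx 1.19
```

Under this model the same temperature limit allows roughly 19% more sustained power, headroom that comes from the package rather than from the NPU microarchitecture.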

Next steps

  • Compare this paper with [[Knowledge/Paper Reviews/A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC]]: the 2022 paper emphasizes data movement, unified datapath, and operating modes, while the 2025 paper emphasizes heterogeneous engines, memory hierarchy, and thermal/package co-design.
  • It would be useful to collect more examples showing the importance of bandwidth-oriented engines like STE in LLM decode paths.

Related notes

  • A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC