Paper summary

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

This note summarizes the softmax bottleneck in Transformer inference from a hardware/software co-design perspective.

Authors

Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan

Publication

arXiv, 2021-03-16

Context

  • The main focus is redesigning softmax into a low-precision, hardware-friendly form instead of keeping a conventional floating-point-centered implementation.

What

  • This paper argues that Softermax can preserve Transformer accuracy while achieving much higher energy efficiency by reformulating softmax for hardware.
  • The key idea is not a single softmax approximation but an end-to-end approach that combines equation transformation, low-precision arithmetic, and downstream fine-tuning so the method can be used in inference.
  • The paper reports 2.35x the energy efficiency of its baseline, with very small accuracy loss and even improvements on some tasks.

Why

  • As sequence length grows, the cost of softmax inside self-attention becomes non-trivial. Unlike in many conventional DNNs, where softmax is a minor operation, in Transformers it can become a real bottleneck.
  • This paper therefore shows why Transformer accelerators should treat softmax as an independent co-design target rather than a peripheral operation.
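
To make the sequence-length argument concrete, the following minimal sketch counts how many softmax elements (one exponentiation each) a single self-attention layer needs. It assumes one seq_len x seq_len score matrix per head; the function name and the choice of 12 heads (as in BERT-Base) are illustrative assumptions, not figures taken from the paper.

```python
def softmax_elements_per_layer(seq_len: int, num_heads: int) -> int:
    """Count the attention scores that pass through softmax in one layer.

    Assumption: each head produces a seq_len x seq_len score matrix,
    and every entry needs one exponentiation.
    """
    return num_heads * seq_len * seq_len


# Hypothetical configuration: 12 heads, two sequence lengths.
short = softmax_elements_per_layer(seq_len=128, num_heads=12)  # 196608
long = softmax_elements_per_layer(seq_len=512, num_heads=12)   # 3145728
print(long // short)  # 16: a 4x longer sequence means 16x more softmax work
```

The count grows quadratically with sequence length, which is why softmax stops being a negligible cost at longer sequences.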

How

  • First, e^x is replaced with base-2 exponentiation (2^x), and exponentiation, accumulation, and division are all performed in low-precision fixed point.
  • Online normalization folds the max computation into the same pass as the accumulation, avoiding a separate max-computation pass, while using an integer max lets the renormalization correction be done with shift operations.
  • On the software side, Softermax-aware fine-tuning recovers downstream-task accuracy loss caused by approximation and quantization.
  • The work then evaluates area and energy trade-offs at the unnormed softmax unit, normalization unit, and full processing element levels.
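
The first two steps above can be sketched in software as follows. This is a floating-point illustration under stated assumptions, not the paper's hardware design: the low-precision fixed-point quantization is omitted, and the function names are hypothetical. It shows a single pass that keeps a running integer max and a running sum, so every correction is a multiplication by a power of two (a shift in fixed point).

```python
import math


def base2_online_softmax(scores):
    """Single-pass base-2 softmax with an integer running max (float sketch)."""
    running_max = math.ceil(scores[0])  # integer max: corrections are powers of two
    running_sum = 0.0
    unnormed = []  # (unnormalized value, max in effect when it was computed)
    for x in scores:
        new_max = max(running_max, math.ceil(x))
        # Rescale the sum to the new max; the exponent is a non-positive
        # integer, so in fixed point this is a right shift.
        running_sum = math.ldexp(running_sum, running_max - new_max)
        value = 2.0 ** (x - new_max)
        running_sum += value
        unnormed.append((value, new_max))
        running_max = new_max
    # Renormalize each stored value to the final max (again a shift), then divide.
    return [math.ldexp(v, m - running_max) / running_sum for v, m in unnormed]


def reference_base2_softmax(scores):
    """Two-pass base-2 softmax, used only as a correctness check."""
    m = max(scores)
    exps = [2.0 ** (x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.5, 2.25, -1.0, 3.75, 3.0]
online = base2_online_softmax(scores)
reference = reference_base2_softmax(scores)
print(all(abs(a - b) < 1e-9 for a, b in zip(online, reference)))
```

Note that a base-2 softmax equals the standard softmax applied to inputs scaled by ln 2, so it is not numerically identical to the base-e version on the same inputs. That gap, together with quantization error, is what the fine-tuning step is meant to absorb.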

Pitfalls

  • The benefit assumes custom hardware support, so the same efficiency should not be expected on commodity hardware.
  • Accuracy preservation partly depends on Softermax-aware fine-tuning, so this is not a training-free drop-in replacement.
  • The evaluation mainly uses BERT-Base, BERT-Large, GLUE, and SQuAD, so broader generality across model families needs more validation.

Next steps

  • It would be useful to survey how often Transformer accelerator papers treat softmax approximation as an independent co-design target.
  • Later work should compare Softermax more clearly against approximation/co-design methods that target the full attention path.