Vision Transformer Encoder Implementation

Course: 498 NSU - VLSI for Machine Learning

Author: Dev Patel

This project focuses on implementing the encoder layer of a Vision Transformer (ViT) in Verilog. The encoder consists of three main components: Layer Normalization, Multi-Head Self-Attention (MHSA), and a Feed-Forward Network (FFN). The implementation uses 8-bit fixed-point arithmetic throughout for hardware efficiency.

Design Highlights

- Layer Normalization: Computes the per-token mean and variance to stabilize the inputs to each sub-block.
- Multi-Head Self-Attention (MHSA): Uses a 2x2 systolic array for matrix multiplication.
- Feed-Forward Network (FFN): Two-layer MLP with a ReLU activation between the layers.
- Optimization: 8-bit fixed-point arithmetic throughout for efficient hardware execution (see the sketch after this list).
- Implementation: Functional blocks tested in Verilog and cross-checked with Python verification.
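
The 8-bit format underpins every block, so it is worth pinning down. Below is a minimal sketch of the multiply step, assuming a Q4.4 split (4 integer bits, 4 fractional bits); the format choice and the module name are illustrative, not taken from the project source.

```verilog
// Signed 8-bit Q4.4 multiply: the exact product is Q8.8, so the result
// must be realigned by dropping 4 fractional bits. Saturation is omitted
// for brevity, so out-of-range products wrap.
module fxp_mul_q44 (
    input  signed [7:0] a,   // Q4.4 operand
    input  signed [7:0] b,   // Q4.4 operand
    output signed [7:0] p    // Q4.4 product (truncated)
);
    wire signed [15:0] full = a * b;  // exact Q8.8 intermediate
    assign p = full[11:4];            // realign to Q4.4
endmodule
```

The same realignment pattern recurs after any fixed-point multiply: a wider intermediate holds the exact product before it is narrowed back to 8 bits.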

Implementation Details

The encoder layer was divided into modular blocks:

- Layer Normalization
- Multi-Head Self-Attention (MHSA)
- Feed-Forward Network (FFN)

Because synthesizing the complete encoder as a single design proved challenging, a hybrid approach was used: each Verilog block was validated individually against a Python simulation of the same computation, as sketched below.
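
As one way such a hybrid flow can look, the sketch below is a self-checking testbench that replays stimulus and expected outputs produced by a Python golden model (not shown) from hex files. The file names, vector count, and the choice of DUT (the fxp_mul_q44 module above) are assumptions for illustration.

```verilog
// Self-checking testbench: stim.hex and expected.hex are assumed to be
// written by the Python golden model; each stimulus word packs {a, b}.
module tb_fxp_mul;
    reg  signed [7:0] a, b;
    wire signed [7:0] p;
    reg  [15:0] stim  [0:15];   // packed {a, b} input pairs
    reg  [7:0]  exp_p [0:15];   // expected outputs from Python
    integer i, errors;

    fxp_mul_q44 dut (.a(a), .b(b), .p(p));

    initial begin
        $readmemh("stim.hex", stim);       // generated by the Python model
        $readmemh("expected.hex", exp_p);
        errors = 0;
        for (i = 0; i < 16; i = i + 1) begin
            {a, b} = stim[i];
            #1;                            // let the combinational DUT settle
            if (p !== $signed(exp_p[i])) begin
                $display("mismatch at %0d: got %0d, expected %0d",
                         i, p, $signed(exp_p[i]));
                errors = errors + 1;
            end
        end
        if (errors == 0) $display("all 16 vectors passed");
        $finish;
    end
endmodule
```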

Feed-Forward Network (FFN)

The FFN in the ViT encoder consists of two dense layers with a ReLU activation in between. Each layer performs a matrix-vector multiplication followed by a bias addition, and the ReLU activation is applied to the output of the first layer: FFN(x) = ReLU(x W1 + b1) W2 + b2.

Both layers use 8-bit fixed-point precision to balance speed and resource usage, which keeps the design hardware-friendly while preserving model accuracy. A sketch of a single neuron in this structure follows.
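
As a hedged sketch of how one output element of a dense layer can map to hardware, the module below streams in one (activation, weight) pair per cycle, accumulates in a wide register, then applies the bias and ReLU. The Q4.4 format, port names, and handshake are assumptions carried over from the earlier sketch, not the project's actual interface.

```verilog
// One FFN neuron: sequential multiply-accumulate over the input vector,
// then bias addition and ReLU on the running result.
module ffn_neuron (
    input               clk,
    input               clr,    // pulse high to start a new dot product
    input               valid,  // one (x, w) pair accepted per cycle
    input  signed [7:0] x,      // Q4.4 activation
    input  signed [7:0] w,      // Q4.4 weight
    input  signed [7:0] bias,   // Q4.4 bias
    output signed [7:0] y       // Q4.4 output after bias and ReLU
);
    reg signed [19:0] acc;                  // wide Q12.8 accumulator,
                                            // sized for short dot products
    always @(posedge clk) begin
        if (clr)        acc <= 20'sd0;
        else if (valid) acc <= acc + x * w; // Q8.8 partial products
    end
    wire signed [19:0] biased = acc + (bias <<< 4);  // align bias to Q8.8
    wire signed [7:0]  scaled = biased[11:4];        // back to Q4.4 (wraps)
    assign y = scaled[7] ? 8'sd0 : scaled;           // ReLU: clamp negatives
endmodule
```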

Systolic Matrix Multiplication

The Multi-Head Self-Attention (MHSA) module uses a 2x2 systolic array for its matrix multiplications. In a systolic array, operands stream between neighbouring processing elements (PEs) so that many multiply-accumulate operations proceed in parallel, which is crucial for high-speed operation in hardware.

This parallel approach significantly reduces computation time and is well suited to high-performance embedded deployment. A single processing element is sketched below.
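
A 2x2 array is four processing elements wired in a grid. The sketch below shows one output-stationary PE, in which activations flow right, weights flow down, and each PE accumulates one element of the product matrix. The names and the output-stationary dataflow are assumptions; the project's actual PE may differ.

```verilog
// Output-stationary systolic processing element: registers pass operands
// to the right and downward neighbours while the local accumulator
// collects one element of the product matrix.
module systolic_pe (
    input                   clk,
    input                   rst,
    input  signed [7:0]     a_in,   // activation from the left neighbour
    input  signed [7:0]     b_in,   // weight from the top neighbour
    output reg signed [7:0] a_out,  // registered pass-through to the right
    output reg signed [7:0] b_out,  // registered pass-through downward
    output reg signed [19:0] acc    // accumulated partial sum
);
    always @(posedge clk) begin
        if (rst) begin
            a_out <= 0;
            b_out <= 0;
            acc   <= 0;
        end else begin
            a_out <= a_in;
            b_out <= b_in;
            acc   <= acc + a_in * b_in;  // one MAC per cycle
        end
    end
endmodule
```

In a 2x2 array, the second row and second column of inputs enter one cycle later than the first, so that each PE always multiplies a correctly aligned operand pair.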

Challenges and Future Work

Challenges:

- Synthesizing the complete encoder as a single design proved difficult, which motivated the hybrid flow of validating individual Verilog blocks against Python simulation.

Future Work:

- Integrate and synthesize the full encoder pipeline in hardware, replacing the hybrid Verilog/Python flow with an end-to-end Verilog implementation.