Vision Transformer Encoder Implementation

Course: 498 NSU - VLSI for Machine Learning

Author: Dev Patel

This project focuses on implementing the encoder layer of a Vision Transformer (ViT) in Verilog. The encoder consists of three main components: Layer Normalization, Multi-Head Self-Attention (MHSA), and a Feed-Forward Network (FFN). The implementation uses 8-bit fixed-point arithmetic throughout for hardware efficiency.

Design Highlights

- Layer Normalization: Computes the per-token mean and variance to stabilize the inputs to each sub-block.
- Multi-Head Self-Attention (MHSA): Uses a 2x2 systolic array for matrix multiplication.
- Feed-Forward Network (FFN): Two-layer MLP with a ReLU activation between the layers.
- Optimization: 8-bit fixed-point arithmetic throughout for efficient hardware execution (see the sketch after this list).
- Implementation: Functional blocks tested in Verilog and cross-checked with Python verification.
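
The 8-bit format underpins every block, so it is worth pinning down. Below is a minimal sketch of the multiply step, assuming a Q4.4 split (4 integer bits, 4 fractional bits); the format choice and the module name are illustrative, not taken from the project source.

```verilog
// Signed 8-bit Q4.4 multiply: the exact product is Q8.8, so the result
// must be realigned by dropping 4 fractional bits. Saturation is omitted
// for brevity, so out-of-range products wrap.
module fxp_mul_q44 (
    input  signed [7:0] a,   // Q4.4 operand
    input  signed [7:0] b,   // Q4.4 operand
    output signed [7:0] p    // Q4.4 product (truncated)
);
    wire signed [15:0] full = a * b;  // exact Q8.8 intermediate
    assign p = full[11:4];            // realign to Q4.4
endmodule
```

The same realignment pattern recurs after any fixed-point multiply: a wider intermediate holds the exact product before it is narrowed back to 8 bits.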

Implementation Details

The encoder layer was divided into modular blocks:

- Layer Normalization
- Multi-Head Self-Attention (MHSA)
- Feed-Forward Network (FFN)

Because synthesizing the complete encoder as a single design proved challenging, a hybrid approach was used: each Verilog block was validated individually against a Python simulation of the same computation, as sketched below.
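
As one way such a hybrid flow can look, the sketch below is a self-checking testbench that replays stimulus and expected outputs produced by a Python golden model (not shown) from hex files. The file names, vector count, and the choice of DUT (the fxp_mul_q44 module above) are assumptions for illustration.

```verilog
// Self-checking testbench: stim.hex and expected.hex are assumed to be
// written by the Python golden model; each stimulus word packs {a, b}.
module tb_fxp_mul;
    reg  signed [7:0] a, b;
    wire signed [7:0] p;
    reg  [15:0] stim  [0:15];   // packed {a, b} input pairs
    reg  [7:0]  exp_p [0:15];   // expected outputs from Python
    integer i, errors;

    fxp_mul_q44 dut (.a(a), .b(b), .p(p));

    initial begin
        $readmemh("stim.hex", stim);       // generated by the Python model
        $readmemh("expected.hex", exp_p);
        errors = 0;
        for (i = 0; i < 16; i = i + 1) begin
            {a, b} = stim[i];
            #1;                            // let the combinational DUT settle
            if (p !== $signed(exp_p[i])) begin
                $display("mismatch at %0d: got %0d, expected %0d",
                         i, p, $signed(exp_p[i]));
                errors = errors + 1;
            end
        end
        if (errors == 0) $display("all 16 vectors passed");
        $finish;
    end
endmodule
```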

Feed-Forward Network (FFN)

The FFN in the ViT encoder consists of two dense layers with a ReLU activation in between. Each layer performs a matrix-vector multiplication followed by a bias addition, and the ReLU activation is applied to the output of the first layer: FFN(x) = ReLU(x W1 + b1) W2 + b2.

Both layers use 8-bit fixed-point precision to balance speed and resource usage, which keeps the design hardware-friendly while preserving model accuracy. A sketch of a single neuron in this structure follows.
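
As a hedged sketch of how one output element of a dense layer can map to hardware, the module below streams in one (activation, weight) pair per cycle, accumulates in a wide register, then applies the bias and ReLU. The Q4.4 format, port names, and handshake are assumptions carried over from the earlier sketch, not the project's actual interface.

```verilog
// One FFN neuron: sequential multiply-accumulate over the input vector,
// then bias addition and ReLU on the running result.
module ffn_neuron (
    input               clk,
    input               clr,    // pulse high to start a new dot product
    input               valid,  // one (x, w) pair accepted per cycle
    input  signed [7:0] x,      // Q4.4 activation
    input  signed [7:0] w,      // Q4.4 weight
    input  signed [7:0] bias,   // Q4.4 bias
    output signed [7:0] y       // Q4.4 output after bias and ReLU
);
    reg signed [19:0] acc;                  // wide Q12.8 accumulator,
                                            // sized for short dot products
    always @(posedge clk) begin
        if (clr)        acc <= 20'sd0;
        else if (valid) acc <= acc + x * w; // Q8.8 partial products
    end
    wire signed [19:0] biased = acc + (bias <<< 4);  // align bias to Q8.8
    wire signed [7:0]  scaled = biased[11:4];        // back to Q4.4 (wraps)
    assign y = scaled[7] ? 8'sd0 : scaled;           // ReLU: clamp negatives
endmodule
```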

Systolic Matrix Multiplication

The Multi-Head Self-Attention (MHSA) module uses a 2x2 systolic array for its matrix multiplications. In a systolic array, operands stream between neighbouring processing elements (PEs) so that many multiply-accumulate operations proceed in parallel, which is crucial for high-speed operation in hardware.

This parallel approach significantly reduces computation time and is well suited to high-performance embedded deployment. A single processing element is sketched below.
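
A 2x2 array is four processing elements wired in a grid. The sketch below shows one output-stationary PE, in which activations flow right, weights flow down, and each PE accumulates one element of the product matrix. The names and the output-stationary dataflow are assumptions; the project's actual PE may differ.

```verilog
// Output-stationary systolic processing element: registers pass operands
// to the right and downward neighbours while the local accumulator
// collects one element of the product matrix.
module systolic_pe (
    input                   clk,
    input                   rst,
    input  signed [7:0]     a_in,   // activation from the left neighbour
    input  signed [7:0]     b_in,   // weight from the top neighbour
    output reg signed [7:0] a_out,  // registered pass-through to the right
    output reg signed [7:0] b_out,  // registered pass-through downward
    output reg signed [19:0] acc    // accumulated partial sum
);
    always @(posedge clk) begin
        if (rst) begin
            a_out <= 0;
            b_out <= 0;
            acc   <= 0;
        end else begin
            a_out <= a_in;
            b_out <= b_in;
            acc   <= acc + a_in * b_in;  // one MAC per cycle
        end
    end
endmodule
```

In a 2x2 array, the second row and second column of inputs enter one cycle later than the first, so that each PE always multiplies a correctly aligned operand pair.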

Challenges and Future Work

Challenges:

- Synthesizing the complete encoder as a single design proved difficult, which motivated the hybrid flow of validating individual Verilog blocks against Python simulation.

Future Work:

- Integrate and synthesize the full encoder pipeline in hardware, replacing the hybrid Verilog/Python flow with an end-to-end Verilog implementation.