Vision Transformer Encoder Implementation
Course: 498 NSU - VLSI for Machine Learning
Author: Dev Patel
This project implements the encoder layer of a Vision Transformer (ViT) in Verilog. The encoder consists of three main components: Layer Normalization, Multi-Head Self-Attention (MHSA), and a Feed-Forward Network (FFN). The implementation uses 8-bit fixed-point arithmetic throughout for hardware efficiency.
Design Highlights
- Layer Normalization: Computes per-token mean and variance to stabilize the input distribution.
- Multi-Head Self-Attention (MHSA): Uses systolic 2x2 matrix multiplication.
- Feed-Forward Network (FFN): Two-layer MLP with ReLU activation.
- Optimization: 8-bit fixed-point arithmetic for efficient hardware execution (see the format sketch below).
- Implementation: Functional blocks written in Verilog and checked against a Python reference model.
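As a concrete reference for the arithmetic, here is a minimal Verilog sketch of the 8-bit fixed-point multiply. The Q4.4 split (4 integer bits, 4 fractional bits) is assumed for illustration; the project's actual scaling may differ.

```verilog
// Signed Q4.4 fixed-point multiply (format assumed for illustration).
// Two 8-bit Q4.4 operands give a 16-bit Q8.8 product; selecting bits
// [11:4] rescales back to Q4.4. No saturation is applied in this sketch.
module fxp_mul_q44 (
    input  signed [7:0] a,   // Q4.4 operand
    input  signed [7:0] b,   // Q4.4 operand
    output signed [7:0] y    // Q4.4 product, truncated
);
    wire signed [15:0] prod = a * b;  // Q8.8 intermediate
    assign y = prod[11:4];            // keep the Q4.4 window
endmodule
```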

Implementation Details
The encoder layer was divided into modular blocks:
- Layer Normalization: Normalizes input tokens to zero mean and unit variance (a sketch of the statistics stage follows this list).
- MHSA: Implements dot-product attention with systolic 2x2 matrix multiplication.
- FFN: Processes tokens through two dense layers with ReLU activation.
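To make the normalization step concrete, the sketch below computes the mean and variance for a 4-element token; the token length and port widths are assumptions for illustration. With a power-of-two length, division by N reduces to an arithmetic shift, and the 1/sqrt(variance) scaling with the learned gain/bias would follow in a later stage.

```verilog
// Statistics stage of Layer Normalization for one 4-element token
// (dimensions assumed for illustration). Outputs feed a later stage
// that applies the 1/sqrt(variance) scaling.
module layernorm_stats (
    input  signed [7:0]  x0, x1, x2, x3,  // token elements, fixed-point
    output signed [15:0] mean,
    output signed [31:0] variance
);
    wire signed [15:0] sum = x0 + x1 + x2 + x3;
    assign mean = sum >>> 2;              // divide by N = 4 via shift

    // Centered differences, squared and averaged.
    wire signed [15:0] d0 = x0 - mean;
    wire signed [15:0] d1 = x1 - mean;
    wire signed [15:0] d2 = x2 - mean;
    wire signed [15:0] d3 = x3 - mean;
    assign variance = (d0*d0 + d1*d1 + d2*d2 + d3*d3) >>> 2;
endmodule
```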
To work around synthesis challenges, a hybrid approach was used in which the Verilog blocks were validated against a Python reference model.
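One way to realize that flow is a testbench that replays test vectors exported by the Python model, as in the sketch below; the file name, vector count, and packing are placeholders, and the fixed-point multiplier from the earlier sketch stands in for the unit under test.

```verilog
// Hypothetical hybrid-verification bench: the Python model writes
// {a, b, expected_y} triples to vectors.hex (name is a placeholder),
// and the bench replays them against the DUT.
module check_tb;
    reg  signed [7:0]  a, b;
    wire signed [7:0]  y;
    reg        [23:0]  vec [0:63];   // packed {a, b, expected_y}
    integer i, errors;

    fxp_mul_q44 dut (.a(a), .b(b), .y(y));

    initial begin
        $readmemh("vectors.hex", vec);
        errors = 0;
        for (i = 0; i < 64; i = i + 1) begin
            {a, b} = vec[i][23:8];
            #1;                      // let combinational logic settle
            if (y !== vec[i][7:0]) begin
                errors = errors + 1;
                $display("vec %0d: got %h, want %h", i, y, vec[i][7:0]);
            end
        end
        $display("done: %0d mismatches", errors);
        $finish;
    end
endmodule
```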
Feed-Forward Network (FFN)
The FFN in the ViT encoder consists of two dense layers with ReLU activation in between. Each layer performs matrix-vector multiplication followed by bias addition, and the ReLU activation is applied to the output of the first layer:
- First Layer: The input token is multiplied by a weight matrix, and a bias vector is added to produce the first output.
- ReLU Activation: The output of the first layer passes through a ReLU, introducing non-linearity into the model.
- Second Layer: The ReLU output is multiplied by a second weight matrix, and a second bias is added to produce the final FFN output.
Both layers use 8-bit fixed-point precision to balance speed and resource usage, keeping the design hardware-friendly while preserving model accuracy.
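As an illustration, the sketch below computes one output element of a dense layer, ReLU(w·x + b), for a 4-element input in the same assumed Q4.4 format; the real layer dimensions and rescaling in the project may differ.

```verilog
// One dot-product lane of an FFN layer: multiply-accumulate, bias add,
// ReLU. Widths and the Q4.4 format are assumptions for illustration.
module ffn_neuron (
    input  signed [7:0] x0, x1, x2, x3,   // input token elements
    input  signed [7:0] w0, w1, w2, w3,   // one row of the weight matrix
    input  signed [7:0] bias,
    output signed [7:0] y                 // ReLU(w.x + b) in Q4.4
);
    // Accumulate in wide precision, then rescale the Q8.8 sum to Q4.4.
    wire signed [17:0] acc = x0*w0 + x1*w1 + x2*w2 + x3*w3
                           + (bias <<< 4);   // align Q4.4 bias to Q8.8
    wire signed [17:0] scaled = acc >>> 4;   // back to Q4.4
    // ReLU clamps negatives to zero; no overflow saturation here.
    assign y = (scaled < 0) ? 8'sd0 : scaled[7:0];
endmodule
```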
Systolic Matrix Multiplication
The Multi-Head Self-Attention (MHSA) module uses a systolic array architecture for efficient matrix multiplication. Systolic arrays process the multiplication in parallel, which is key to achieving high throughput in hardware.
- Matrix Partitioning: The large matrix is broken into smaller sub-matrices, which can be processed concurrently in a systolic array.
- Parallelism: Each processing element in the systolic array computes part of the matrix multiplication, with data flowing synchronously through the array in a pipelined fashion.
- 2x2 Matrix Multiplier: The systolic array in this design is built from 2x2 matrix multipliers that perform the dot-product calculations for the attention mechanism, keeping each tile small and hardware-efficient.
This parallel approach significantly reduces computation time and is well-suited for deployment in high-performance embedded systems.
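A single processing element of such an array might look like the sketch below (interface assumed): each cell multiplies the operands arriving from its left and top neighbours, accumulates locally, and forwards the operands onward; four such cells form one 2x2 tile.

```verilog
// Output-stationary systolic processing element (interface assumed).
// Each cycle: multiply incoming operands, accumulate locally, and pass
// the operands to the right/down neighbours in a pipelined fashion.
module systolic_pe (
    input                     clk,
    input                     rst,
    input  signed [7:0]       a_in,   // flows left -> right
    input  signed [7:0]       b_in,   // flows top  -> bottom
    output reg signed [7:0]   a_out,
    output reg signed [7:0]   b_out,
    output reg signed [19:0]  acc     // local partial dot product
);
    always @(posedge clk) begin
        if (rst) begin
            a_out <= 0;
            b_out <= 0;
            acc   <= 0;
        end else begin
            acc   <= acc + a_in * b_in;  // multiply-accumulate
            a_out <= a_in;               // forward operands
            b_out <= b_in;
        end
    end
endmodule
```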
Challenges and Future Work
Challenges:
- Ensuring precision in 8-bit fixed-point calculations.
- Interfacing Verilog modules with Python simulation.
- Debugging Vivado synthesis issues for systolic array operations.
Future Work:
- Developing a seamless Python-Verilog integration framework.
- Extending the design to multiple encoder layers.
- Implementing optimizations like quantization-aware training.