This GitHub repository presents the Free-pipeline Fast Inner Product (FFIP), an algorithm and hardware architecture for machine learning accelerators that needs nearly half as many multiplier units for the same performance by trading multiplications for low-bitwidth additions. It includes the complete source code for the FFIP algorithm and architecture, which aim to improve the compute efficiency of ML accelerators.
Main Points
FFIP Algorithm and Architecture
The repository delivers a novel algorithm (FFIP) alongside a hardware architecture that enhances the compute efficiency of ML accelerators by reducing the number of necessary multiplications.
Applicability and Performance of FFIP
The FFIP algorithm is applicable across various machine learning model layers and has been shown to outperform existing solutions in throughput and compute efficiency.
Comprehensive Source Code for Implementation
The source code covers the full implementation flow: a compiler, RTL descriptions, simulation scripts, and testbenches.
Insights
Introduction of a novel algorithm and architecture
The authors introduce a new algorithm, the Free-pipeline Fast Inner Product (FFIP), and its hardware architecture, which improve on an under-explored fast inner-product algorithm (FIP) proposed by Winograd in 1968.
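As a rough illustration of the FIP idea that FFIP builds on, the sketch below computes an inner product with the identity from Winograd's 1968 algorithm for even-length vectors. It is not taken from the repository: the function name `fip_inner_product` is hypothetical, and the FFIP-specific restructuring of the algorithm is not shown.

```python
def fip_inner_product(x, y):
    """Inner product of two equal, even-length integer sequences
    using Winograd's 1968 fast inner-product identity (illustrative only)."""
    if len(x) != len(y) or len(x) % 2 != 0:
        raise ValueError("x and y must have the same even length")
    half = len(x) // 2

    # Correction terms: each depends on only one operand, so in a matrix
    # multiplication they can be computed once per row or column and reused.
    x_corr = sum(x[2 * j] * x[2 * j + 1] for j in range(half))
    y_corr = sum(y[2 * j] * y[2 * j + 1] for j in range(half))

    # Main term: n/2 multiplications, each fed by two additions.
    main = sum(
        (x[2 * j] + y[2 * j + 1]) * (x[2 * j + 1] + y[2 * j])
        for j in range(half)
    )
    return main - x_corr - y_corr


print(fip_inner_product([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```

Expanding one pair shows why this works: (x0 + y1)(x1 + y0) = x0*x1 + x0*y0 + x1*y1 + y0*y1, and subtracting the two correction products leaves x0*y0 + x1*y1.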
Potential impact on ML accelerators
FFIP can be seamlessly incorporated into traditional fixed-point systolic array ML accelerators to achieve the same throughput with half the number of multiply-accumulate (MAC) units, or it can double the maximum systolic array size that can fit onto devices with a fixed hardware budget.
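The "half the multipliers" figure can be sanity-checked with a back-of-the-envelope count of the multiplications in a matrix product. The sketch below assumes Winograd's FIP identity is applied to every output element of an (m x k) by (k x n) product with even k, and that the per-row and per-column correction terms are computed once and reused. The function names are hypothetical, and the count is an illustration rather than the repository's hardware accounting.

```python
def baseline_mults(m, k, n):
    """Multiplications in a conventional (m x k) @ (k x n) matrix product."""
    return m * k * n


def fip_mults(m, k, n):
    """Multiplications when each output element uses Winograd's FIP identity.

    Assumes k is even and that correction terms are shared across outputs.
    """
    if k % 2 != 0:
        raise ValueError("k must be even")
    main = m * n * (k // 2)   # k/2 multiplications per output element
    row_corr = m * (k // 2)   # one correction term per row of the left matrix
    col_corr = n * (k // 2)   # one correction term per column of the right matrix
    return main + row_corr + col_corr


m = k = n = 64
print(fip_mults(m, k, n) / baseline_mults(m, k, n))  # 0.515625
```

Under these assumptions the ratio approaches 1/2 as the matrix dimensions grow, which is consistent with the "nearly half" wording used for the multiplier count.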
Technical approach and implementation
The repository contains source code for ML hardware architectures that require nearly half the number of multiplier units to achieve the same performance by executing alternative inner-product algorithms. It includes a compiler for parsing Python model descriptions into accelerator instructions, synthesizable SystemVerilog RTL for the baseline, FIP, and FFIP systolic array architectures, and additional utilities for development.