CN113344179B - IP core of binary convolution neural network algorithm based on FPGA - Google Patents

IP core of binary convolution neural network algorithm based on FPGA

Info

Publication number
CN113344179B
CN113344179B
Authority
CN
China
Prior art keywords
data
convolution
binarization
module
control module
Prior art date
Legal status
Active
Application number
CN202110599962.3A
Other languages
Chinese (zh)
Other versions
CN113344179A (en)
Inventor
Feng Jiawei
Shi Qingwen
Current Assignee
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110599962.3A priority Critical patent/CN113344179B/en
Publication of CN113344179A publication Critical patent/CN113344179A/en
Application granted granted Critical
Publication of CN113344179B publication Critical patent/CN113344179B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08 Learning methods


Abstract

An IP core of an FPGA-based binarized convolutional neural network algorithm, belonging to the field of digital circuit design, comprises a data input cache module, a weight cache module, a convolution control module, a pooling control module, a binarization module, a full-connection layer control module and a global control module. The global control module controls the operation of the other modules through the following steps: 1) storing the image data, the binarized network weights and the bias coefficients; 2) inputting and caching the image data; 3) performing binarization compression on the image data; 4) performing the convolution operation; 5) performing pooling; 6) performing accumulation and outputting the final result. The invention rapidly deploys and accelerates a binarized convolutional neural network on an FPGA, provides an IP core that can be deployed quickly on different FPGA platforms, and can deploy an algorithm of a specified scale and complete its computation with low resource occupation.

Description

IP core of binary convolution neural network algorithm based on FPGA
Technical Field
The invention belongs to the field of digital circuit design, and particularly relates to an IP core of a binary convolution neural network algorithm based on an FPGA (field programmable gate array).
Background
At present, with the vigorous development of artificial intelligence, research on artificial intelligence and deep learning is rising rapidly. Artificial neural networks are abstract models of biological neural networks. By simulating the working mechanism of the visual nervous system and extracting input features with convolution kernels, convolutional neural networks are widely applied in fields such as image analysis, classification and autonomous driving.
In the development of neural networks, achieving higher accuracy generally requires a more complex network, and with this complexity the data storage resources and computation consumed by the network increase. In an FPGA environment, computing and storage resources are limited, so computation is constrained when the neural network is too complex and large. Network quantization has been proposed to reduce the resource consumption caused by growing network size. Research has shown that the computational accuracy of the network can still be kept within a reasonable range when the data bit width is reduced. Among the various quantization methods, binarization is the most extreme: the 32-bit floating-point data in the network are forcibly converted into +1 and 0 with a bit width of 1 bit, which reduces the storage occupation of the network by about 32 times compared with that before quantization. Meanwhile, because of the reduced data bit width, the multiplications in the network can be converted into bit operations that are faster and occupy fewer resources and less area, so the computation speed of the network is greatly improved.
An IP core is a pre-designed, and often pre-verified, integrated circuit, device or component with a defined function. As FPGAs/CPLDs grow in size, designs become more and more complex, and the primary task of the designer is to complete a complex design within a specified time. Calling an IP core avoids repeated labor and greatly reduces the burden on engineers, so the use of IP cores is a development trend.
With the application of deep learning on various small chips, the FPGA, as a programmable device, has the advantages of flexibility, configurability and a short development cycle, and is suitable as an early verification platform. However, an overly large neural network puts great pressure on the FPGA's own resources, deployment involves many repeated operations because of the characteristics of the FPGA's internal structure, and porting between different platforms is relatively difficult.
Disclosure of Invention
In view of the above technical problems, the present invention provides an IP core of an FPGA-based binarized convolutional neural network algorithm, including: a data input cache module, a weight cache module, a convolution control module, a pooling control module, a binarization module, a full-connection layer control module and a global control module;
the data input cache module consists of an input shift register, an input data counter and an output data counter;
the input shift register consists of FIFO cache units;
the weight cache module is used for reading the binarized network weights and bias coefficients;
the convolution calculation module comprises a convolution shift register, a weight register, a pipeline adder and a control counter;
the pooling control module comprises a pooling shift register, a two-stage pipeline comparator and a control output counter;
the pooling shift register consists of 2 groups of FIFO buffer units; the number of FIFO buffer units in each group is consistent with the size of the pooling sliding window;
the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules;
the full-connection layer control module is realized by adopting an adder tree;
the global control module is a network overall control module, controls the start and the end of other modules, and ensures correct input and output of data, and specifically comprises the following steps:
step 1, storing the image data, the binarized network weights and the bias coefficients;
the image data is processed by an upper computer and then stored in an SD card; the binarized network weights and bias coefficients are also stored in the SD card;
the binarized network weights comprise the weight data of the convolution control module and the full-connection layer control module;
step 2, inputting and caching image data;
after receiving a starting command transmitted by an upper computer, the global control module is converted into an input state, and image data are read into the data input cache module from the SD card; at the same time, the input data counter starts counting;
step 3, performing binarization compression on the image data;
the global control module is converted into a reading state, the output data counter starts counting, and the image data is read out from the FIFO cache unit; the read image data is subjected to binarization compression through the binarization module to obtain binarization data, and the binarization data is serially input into the convolution calculation module;
in the binarization compression process, because the output value of each module is a signed number, the sign bit is compared directly during binarization compression to judge whether the value is positive or negative: numbers greater than or equal to 0 are compressed into the 1-bit value +1, and numbers smaller than 0 are compressed into 0, specifically:
the binarization compression is carried out by adopting a sign function, and the following binarization formula is utilized:
sign(x) = +1, if x ≥ 0; -1, if x < 0
wherein x is a signal or weight in the network to be binarized; sign(x) is the signal or weight after binarization compression; in the FIFO cache unit, the -1 is simplified to 0 instead, and the binarization formula becomes:
sign(x) = +1, if x ≥ 0; 0, if x < 0
step 4, performing convolution operation;
step 4.1, selecting binary data of the image data in a convolution sliding window to obtain cache data and storing the cache data in the convolution shift register;
in the selection process, the control counter counts the number of windows of the convolution sliding window, and sets the redundant window data to 0 according to the counting result;
step 4.2, performing an XNOR convolution operation on the cache data and the convolution kernel of the convolution control module, and taking the operation result as the convolution operation data;
convolution is used to discuss the change of a signal after passing through a linear system, and is expressed by the following formula:
y(t) = ∫ x(τ)h(t - τ)dτ
in the formula, x(τ) represents the input signal, namely the cache data; h(t - τ) represents the linear system; y(t) represents the output signal, namely the convolution operation data;
the convolution operation is changed from the original multiply-accumulate operation into an XNOR-accumulate operation, and the result is converted with the following formula:
Y = 2*Y_xnor - N
wherein Y is the output result of the multiply-accumulate operation, Y_xnor is the output result of the XNOR-accumulate operation, and N is the number of convolution kernel weights;
step 4.3, the convolution operation data of the cache data are accumulated through a four-stage control counter, the accumulated result is expanded by one bit to form 6-bit signed convolution result data, and the convolution result data are serially input into the pooling control module;
step 5, performing pooling;
step 5.1, storing convolution result data into two groups of FIFO cache units of the pooling control module; after the two groups of FIFO buffer units are full, the convolution result data stored in the FIFO buffer units are read out, and meanwhile, the subsequent convolution result data are continuously written into the two groups of FIFO buffer units;
step 5.2, the read convolution result data is used as cache data, the cache data is selected in a pooling sliding window to be output, then convolution selection data is obtained and stored in the pooling shift register;
in the selection process, the control output counter counts the number of windows of the pooling sliding window, and sets the redundant window data to 0 according to the counting result;
step 5.3, the convolution selection data stored in the 4 pooling shift registers are transmitted into the two-stage pipeline comparator; the two-stage pipeline comparator takes the maximum value of the 4 convolution selection data as the pooled data;
step 5.4, the pooled data is a 6-bit signed number; the highest bit of the pooled data is the sign bit, where 0 represents a positive number and 1 represents a negative number; the other five bits represent the number itself, with positive numbers in true form and negative numbers in complement form; the binarization module directly extracts the highest bit of the pooled data and judges whether it is 1: if the highest bit is 1, the 1-bit number 0 is output, and if the highest bit is 0, the 1-bit number 1 is output, and this output is taken as the pooling result; the pooling result is then transmitted into the full-connection layer control module;
step 6, the full-connection layer control module performs accumulation and calculation through an adder tree, and the method comprises the following steps:
step 6.1, calculating the full connection of the hidden layer;
the weight cache module reads the weight data and the bias coefficient of the full-connection layer control module stored in the SD card for standby;
the fully connected layers are the last layers of the network, and each neuron in such a layer is connected to all neurons of the previous layer, so a large number of weights and calculations are required;
the full connection layer formula is as follows:
X_k = f(W_k * X_(k-1) + b_k)
wherein X_k is the output activation value of the k-th layer, W_k represents the k-th layer weight data, X_(k-1) is the output activation value of the previous layer, b_k is the bias coefficient of the k-th layer, and f() is the activation function; if the previous layer is a pooling layer, the output activation value is the pooling result; if the previous layer is also a fully connected layer, the output activation value is the result of that layer's activation function;
the activation function is a ReLU linear rectification function:
f(x)=max(0,x)
the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, and the fully connected layer formula is converted into the following form:
X_k = f((W_k ^~ X_(k-1)) + b_k)
the result calculated by the formula is input into an output layer after being subjected to binarization compression by a binarization module again;
step 6.2, output layer full connection calculation;
the layer is the last layer of the network, the calculation mode is the same as the step 6.1, and the final result is output.
The data input buffer module and the convolution calculation module are both realized by RTL-level codes.
And the weight of each neuron of the full connection layer control module is stored by adopting a ROM (read only memory).
The adder tree is composed of a plurality of 1-bit adders connected in parallel in a multi-stage pipeline.
The step 1 of storing the image data specifically includes:
the image data comprises three lines of pixel values, the upper computer processes the size of the image data into 28 × 28, and the image data is stored in the SD card.
The FIFO cache unit consists of an RAM memory on an FPGA chip; the depth of the FIFO buffer unit is the same as the width of the image data; each channel of the input shift register corresponds to three sets of FIFO buffer units.
The caching of the image data in the step 2 specifically includes:
the data input cache module stores the three rows of pixel values of the image data into the three groups of FIFO cache units of the input shift register respectively, completing the caching of the image data.
In the convolution operation in the step 4, the size of the convolution sliding window is 3 x 3, and 9 pieces of one-bit cache data are generated;
the size of a convolution kernel of the convolution operation is 3 x 3, and the convolution kernel corresponds to 9 weight data and is read from the SD card through a weight cache module; the read 9 weight data are distributed and stored in corresponding 9 weight registers;
in the pooling process of the step 5, the size of the pooling sliding window is 2 x 2.
The convolution operation is an integration process, and the image is processed by two-dimensional convolution:
a convolution kernel of a certain size is overlaid on the image and moved in a determined direction and with a determined step size, and each time the kernel is moved, one convolution integration is performed over the area covered by the kernel.
The final result is not compressed by a binarization module, but is directly output in the form of a full-precision result; and the final result is sequentially output through a UART interface.
The invention has the beneficial effects that:
(1) the weight of each neuron of the full connection layer control module is stored by adopting a ROM (read only memory), so that programmable logic resources in the FPGA (field programmable gate array) are saved;
(2) the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules; this design, established after analyzing the maximum output values of the sub-modules inside the network, reduces the resource occupancy;
(3) to achieve higher speed, the full-connection layer control module is not implemented as an accumulator; instead, a multi-stage pipeline of many parallel 1-bit adders forms an adder tree that realizes the accumulation inside the neurons of the full-connection layer control module;
(4) the convolution operation is changed from the original multiply-accumulate operation into an XNOR operation; this reduces the dependence on hardware multipliers, and because the multiplications are replaced by bit operations, the operation speed is increased;
because the data involved are all 1-bit and there are many input connections, a purpose-built accumulator would be too time-consuming; therefore, as a trade-off, the invention uses a large number of parallel adders to form an adder tree that completes the accumulation of the inputs;
and the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, which greatly saves logic resources.
The invention has reasonable design, easy realization and good practical value.
Drawings
FIG. 1 is a basic design framework of an IP core of the FPGA-based binarization convolutional neural network algorithm in the embodiment of the invention;
FIG. 2 is a diagram illustrating a comparison between before and after the convolution binarization in the embodiment of the present invention;
FIG. 3 is a diagram of an embodiment of the pooled sliding window in accordance with an embodiment of the present invention;
FIG. 4 is a diagram illustrating simulation results of the pooling module in an embodiment of the present invention;
FIG. 5 is a diagram illustrating simulation results of the binarization module in the embodiment of the present invention;
FIG. 6 is a diagram illustrating an adder tree according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to rapidly deploy and accelerate a binary convolution neural network on an FPGA (field programmable gate array), and provides an IP core capable of rapidly deploying the binary convolution neural network on different FPGA platforms, wherein the IP core can deploy an algorithm of a specified scale and complete calculation under the condition of low resource occupation.
In order to achieve the above object, the present invention provides an IP core of a binary convolution neural network algorithm based on an FPGA, comprising: the system comprises a global control module, a data input cache module, a weight cache module, a convolution control module, a pooling control module, a binarization module and a full-connection layer control module;
the global control module is a network overall control module and consists of a large-scale state machine, and is used for controlling the start and the end of other modules and ensuring correct input and output of data;
the weight cache module is used for reading the prepared binarized network weights and bias coefficients stored in the SD card;
the weight of each neuron of the full connection layer control module is stored by adopting a ROM (read only memory) so as to save programmable logic resources in the FPGA;
the data input cache module and the convolution calculation module are both realized by RTL-level codes;
the data input cache module consists of an input shift register, an input data counter and an output data counter;
each channel of the input shift register corresponds to three sets of FIFO buffer units; the FIFO cache unit consists of an RAM memory on an FPGA chip; the depth of the FIFO buffer unit is the same as the width of the image data;
the convolution calculation module comprises a convolution shift register, a weight register, a pipeline adder and a control counter;
the pooling control module comprises a pooling shift register, a two-stage pipeline comparator and a control output counter;
the pooling shift register consists of 2 groups of FIFO buffer units; the number of FIFO buffer units in each group is consistent with the size of the pooling sliding window;
the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules; this design, established after analyzing the maximum output values of the sub-modules inside the network, reduces the resource occupancy;
the full-connection layer control module is realized by adopting an adder tree;
the adder tree consists of a plurality of 1-bit adders connected in parallel in a multi-stage pipeline;
to achieve higher speed, the full-connection layer control module is not implemented as an accumulator; instead, a multi-stage pipeline of many parallel 1-bit adders forms an adder tree that realizes the accumulation inside the neurons of the full-connection layer control module;
to accelerate development, relatively mature and stable IP cores provided by the manufacturer are selected; however, to facilitate porting across platforms, the memory is described in Verilog HDL and associated with the .mif file in which the weights are stored;
the basic design framework of the IP core of the FPGA-based binarization convolutional neural network algorithm is shown in FIG. 1 and is represented as follows:
the global control module is a network overall control module, controls the start and the end of other modules, and ensures correct input and output of data, and specifically comprises the following steps:
step 1, storing the image data, the binarized network weights and the bias coefficients;
the image data comprises three rows of pixel values; the upper computer processes the image data to a size of 28 × 28 and stores it in the SD card; the binarized network weights and bias coefficients are also stored in the SD card;
the binarized network weights comprise the weight data of the convolution control module and the full-connection layer control module;
step 2, inputting and caching image data;
after receiving a starting command transmitted by an upper computer, the global control module is converted into an input state, and image data are read into the data input cache module from the SD card; at the same time, the input data counter starts counting;
the data input buffer module respectively stores three lines of pixel values of the image data into three groups of FIFO buffer units of the input shift register to finish the buffer storage of the image data;
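As an illustration only (not the RTL itself), the behavior of these three FIFO groups can be modeled in software roughly as follows; the deque-based buffers, the function name stream_windows and the synthetic test image are assumptions made for this sketch:

from collections import deque

IMG_W = 28   # FIFO depth equals the image width

def stream_windows(image):
    # three FIFO groups each hold one full image row so that a 3x3 window can be formed
    rows = [deque(maxlen=IMG_W) for _ in range(3)]
    for r, row in enumerate(image):
        rows[r % 3].clear()
        rows[r % 3].extend(row)
        if r >= 2:                       # three rows buffered: slide the 3x3 window
            r0 = list(rows[(r - 2) % 3])
            r1 = list(rows[(r - 1) % 3])
            r2 = list(rows[r % 3])
            for c in range(IMG_W - 2):
                yield [r0[c:c + 3], r1[c:c + 3], r2[c:c + 3]]

img = [[(x + y) % 2 for x in range(IMG_W)] for y in range(IMG_W)]
first = next(stream_windows(img))        # first 3x3 window of the image
assert len(first) == 3 and len(first[0]) == 3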
step 3, performing binarization compression on the image data;
the global control module is converted into a reading state, the output data counter starts counting, and the image data is read out from the FIFO cache unit; the read image data is subjected to binarization compression through the binarization module to obtain binarization data, and the binarization data is serially input into the convolution calculation module;
in the binarization compression process, because the output value of each module is a signed number, the sign bit is compared directly during binarization compression to judge whether the value is positive or negative: numbers greater than or equal to 0 are compressed into the 1-bit value +1, and numbers smaller than 0 are compressed into 0, specifically:
the binarization compression is carried out by adopting a sign function, and the following binarization formula is utilized:
sign(x) = +1, if x ≥ 0; -1, if x < 0
wherein x is a signal or weight in the network to be binarized; sign(x) is the signal or weight after binarization compression; in the FIFO cache unit, in order to reduce the complexity of operation and storage, the -1 is simplified to 0 instead, and the binarization formula becomes:
sign(x) = +1, if x ≥ 0; 0, if x < 0
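As a minimal illustrative sketch (not part of the hardware description; the function name binarize is an assumption), the binarization compression above behaves like the following Python model, where values greater than or equal to 0 are stored as the 1-bit code 1 and negative values as 0:

def binarize(x: int) -> int:
    # values >= 0 map to the 1-bit code 1 (representing +1); values < 0 map to 0
    return 1 if x >= 0 else 0

# example: -5 and -1 are stored as 0, while 0 and 3 are stored as 1
assert [binarize(v) for v in (-5, -1, 0, 3)] == [0, 0, 1, 1]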
step 4, performing convolution operation;
step 4.1, selecting binary data of the image data in a convolution sliding window to obtain 9 one-bit cache data and storing the cache data into the convolution shift register; the size of the convolution sliding window is 3 x 3;
in the selection process, the control counter counts the number of windows of the convolution sliding window, and sets the redundant window data to 0 according to the counting result;
step 4.2, performing an XNOR convolution operation on the cache data and the convolution kernel of the convolution control module, and taking the operation result as the convolution operation data;
the size of a convolution kernel of the convolution operation is 3 x 3, the convolution kernel corresponds to 9 weight data, and the weight data is read from the SD card through the weight cache module; the read 9 weight data are distributed and stored in corresponding 9 weight registers;
the invention adopts a VGG16 network structure, and the convolution operation is an integration process; in general, an image is processed by two-dimensional convolution: a convolution kernel of a certain size is overlaid on the image and moved in a determined direction and with a determined step size, and each time the kernel is moved, one convolution integration is performed over the area covered by the kernel;
in the field of images, digital images are two-dimensional discrete signals; in the signal field, convolution is used to discuss the change of a signal after the signal passes through a linear system, and is generally expressed by the following formula:
y(t) = ∫ x(τ)h(t - τ)dτ
in the formula, x(τ) represents the input signal, namely the cache data; h(t - τ) represents the linear system; y(t) represents the output signal, namely the convolution operation data;
as shown in fig. 2, the convolution operation is changed from the original multiply-accumulate operation into an XNOR operation; this reduces the dependence on hardware multipliers, and because the multiplications are replaced by bit operations, the operation speed is increased; however, since the result of the XNOR-accumulate operation is not identical to the result of the original multiply-accumulate operation, the following formula is used for conversion:
Y = 2*Y_xnor - N
wherein Y is the output result of the multiply-accumulate operation, Y_xnor is the output result of the XNOR-accumulate operation, and N is the number of convolution kernel weights; that is, after conversion, when Y_xnor is 6 and the number of convolution kernel weights N is 9, the output result calculated by the formula is 3, which is consistent with the output result Y of the original multiply-accumulate operation;
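The equivalence behind this conversion can be checked with a small software sketch (an illustration under the assumption that +1/-1 values are coded as 1/0; the helper names are invented for the sketch and do not appear in the patent):

def mac_pm1(a, w):
    # multiply-accumulate with +1/-1 operands (the original convolution)
    return sum(x * y for x, y in zip(a, w))

def xnor_popcount(a_bits, w_bits):
    # count the positions where the 1-bit codes agree (XNOR result is 1)
    return sum(1 for x, y in zip(a_bits, w_bits) if x == y)

a = [+1, +1, +1, +1, -1, -1, +1, +1, +1]    # window values under a 3x3 kernel
w = [+1, +1, +1, -1, -1, +1, +1, -1, +1]    # kernel weights
a_bits = [1 if v >= 0 else 0 for v in a]
w_bits = [1 if v >= 0 else 0 for v in w]

N = len(w)                                   # N = 9 convolution kernel weights
y_xnor = xnor_popcount(a_bits, w_bits)       # Y_xnor = 6 for this data
assert mac_pm1(a, w) == 2 * y_xnor - N == 3  # matches the worked example above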
step 4.3, the convolution operation data obtained from the 9 one-bit cache data are accumulated through a four-stage control counter, the accumulated result is expanded by one bit to form 6-bit signed convolution result data, and the convolution result data are serially input into the pooling control module;
step 5, performing pooling;
step 5.1, storing convolution result data into two groups of FIFO cache units of the pooling control module; after the two groups of FIFO buffer units are full, reading out the convolution result data stored therein, and continuously writing the subsequent convolution result data into the two groups of FIFO buffer units, as shown in FIG. 3;
step 5.2, the read convolution result data is used as cache data, the cache data is selected in a pooling sliding window to be output, then convolution selection data is obtained and stored in the pooling shift register; the size of the pooling sliding window is 2 x 2;
in the selection process, the control output counter counts the number of windows of the pooling sliding window, the redundant window data is set to be 0 according to the counting result, and the module simulation result is shown in fig. 4;
step 5.3, the convolution selection data stored in the 4 pooling shift registers are transmitted into the two-stage pipeline comparator; the two-stage pipeline comparator takes the maximum value of the 4 convolution selection data as the pooled data;
step 5.4, the pooled data is a 6-bit signed number; the highest bit of the pooled data is the sign bit, where 0 represents a positive number and 1 represents a negative number; the other five bits represent the number itself, with positive numbers in true form and negative numbers in complement form; therefore the binarization module directly extracts the highest bit of the pooled data and judges whether it is 1: if the highest bit is 1, the 1-bit number 0 is output, and if the highest bit is 0, the 1-bit number 1 is output, and this output is taken as the pooling result, with the module simulation result shown in fig. 5; the pooling result is then transmitted into the full-connection layer control module;
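A hedged software sketch of steps 5.3 and 5.4 (the function names and the two's-complement handling are illustrative assumptions; the hardware uses the two-stage pipeline comparator and the binarization module):

def max_pool_2x2(values):
    # first comparator stage compares two pairs, second stage compares the winners
    a = max(values[0], values[1])
    b = max(values[2], values[3])
    return max(a, b)

def binarize_6bit_signed(value):
    # take the 6-bit two's-complement code and inspect only the sign bit (bit 5)
    code = value & 0x3F
    sign_bit = (code >> 5) & 1
    return 0 if sign_bit == 1 else 1   # sign bit 1 (negative) -> 0, otherwise -> 1

pooled = max_pool_2x2([-7, 3, -1, 2])      # maximum of four convolution results -> 3
assert binarize_6bit_signed(pooled) == 1   # non-negative value binarizes to 1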
step 6, the full-connection layer control module performs accumulation and calculation through an adder tree, and the method comprises the following steps:
step 6.1, calculating the full connection of the hidden layer;
the weight cache module reads the weight data and the bias coefficient of the full-connection layer control module stored in the SD card for standby;
the fully connected layers are the last layers of the network, and each neuron in such a layer is connected to all neurons of the previous layer, so a large number of weights and calculations are required;
the full connection layer formula is as follows:
X_k = f(W_k * X_(k-1) + b_k)
wherein X_k is the output activation value of the k-th layer, W_k represents the k-th layer weight data, X_(k-1) is the output activation value of the previous layer, b_k is the bias coefficient of the k-th layer, and f() is the activation function; if the previous layer is a pooling layer, the output activation value is the pooling result; if the previous layer is also a fully connected layer, the output activation value is the result of that layer's activation function;
the activation function is a ReLU linear rectification function:
f(x)=max(0,x)
because the data involved are all 1-bit and there are many input connections, a purpose-built accumulator would be too time-consuming; therefore, as a trade-off, the invention uses an adder tree formed by a large number of parallel adders to complete the accumulation of the inputs, as shown in FIG. 6;
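A minimal sketch of such an adder tree reduction (a software analogy of FIG. 6, assuming pairwise addition per pipeline stage; the function name is illustrative):

def adder_tree_sum(bits):
    # sum 1-bit inputs by pairwise addition, one tree level per pipeline stage
    level = list(bits)
    while len(level) > 1:
        if len(level) % 2:           # pad an odd level so every operand has a partner
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0

# seven 1-bit inputs are reduced in three tree stages instead of seven accumulator cycles
assert adder_tree_sum([1, 0, 1, 1, 1, 0, 1]) == 5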
the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, which greatly saves logic resources; thus, the fully connected layer formula is converted into the following form:
X_k = f((W_k ^~ X_(k-1)) + b_k)
the result calculated by the formula is input into an output layer after being subjected to binarization compression by a binarization module again;
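For illustration, one neuron of this layer can be modeled as follows (a sketch only: the function name, the use of Python's sum in place of the hardware adder tree, and the conversion back to the +1/-1 domain before adding the bias are assumptions made to keep the example self-contained):

def fully_connected_neuron(x_bits, w_bits, bias):
    # XNOR of binarized activations and weights: 1 where the bits agree
    xnor_bits = [1 if x == w else 0 for x, w in zip(x_bits, w_bits)]
    y_xnor = sum(xnor_bits)            # accumulated by the adder tree in hardware
    y = 2 * y_xnor - len(w_bits)       # same conversion as for the convolution
    return max(0, y + bias)            # ReLU activation f(x) = max(0, x)

# four binarized inputs, four weights, bias of 1: y_xnor = 2, y = 0, output = 1
assert fully_connected_neuron([1, 0, 1, 1], [1, 1, 1, 0], bias=1) == 1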
step 6.2, output layer full connection calculation;
the layer is the last layer of the network, the calculation mode is the same as the step 6.1, but the final result is not compressed by a binarization module, but is directly output in the form of a full-precision result;
the final result is output sequentially through a UART interface (the interface configuration table is shown in the original figure).

Claims (10)

1. An IP core of an FPGA-based binarized convolutional neural network algorithm, characterized by comprising: a data input cache module, a weight cache module, a convolution calculation module, a convolution control module, a pooling control module, a binarization module, a full-connection layer control module and a global control module;
the data input cache module consists of an input shift register, an input data counter and an output data counter;
the input shift register consists of FIFO cache units;
the weight cache module is used for reading the binarized network weights and bias coefficients;
the convolution calculation module comprises a convolution shift register, a weight register, a pipeline adder and a control counter;
the pooling control module comprises a pooling shift register, a two-stage pipeline comparator and a control output counter;
the pooling shift register consists of 2 groups of FIFO buffer units; the number of FIFO buffer units in each group is consistent with the size of the pooling sliding window;
the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules;
the full-connection layer control module is realized by adopting an adder tree;
the global control module is a network overall control module, controls the start and the end of other modules, and ensures correct input and output of data, and specifically comprises the following steps:
step 1, storing the image data, the binarized network weights and the bias coefficients;
the image data is processed by an upper computer and then stored in an SD card; the binarized network weights and bias coefficients are also stored in the SD card;
the binarized network weights comprise the weight data of the convolution control module and the full-connection layer control module;
step 2, inputting and caching image data;
after receiving a starting command transmitted by an upper computer, the global control module is converted into an input state, and image data are read into the data input cache module from the SD card; at the same time, the input data counter starts counting;
step 3, performing binarization compression on the image data;
the global control module is converted into a reading state, the output data counter starts counting, and the image data is read out from the FIFO cache unit; performing binarization compression on the read image data through the binarization module to obtain binarization data, and inputting the binarization data into the convolution calculation module in series;
in the binarization compression process, because the output value of each module is a signed number, the sign bit is compared directly during binarization compression to judge whether the value is positive or negative: numbers greater than or equal to 0 are compressed into the 1-bit value +1, and numbers smaller than 0 are compressed into 0, specifically:
the binarization compression is carried out by adopting a sign function, and the following binarization formula is utilized:
sign(x) = +1, if x ≥ 0; -1, if x < 0
wherein x is a signal or weight in the network to be binarized; sign(x) is the signal or weight after binarization compression; in the FIFO cache unit, the -1 is simplified to 0 instead, and the binarization formula becomes:
sign(x) = +1, if x ≥ 0; 0, if x < 0
step 4, performing convolution operation;
step 4.1, selecting binary data of the image data in a convolution sliding window to obtain cache data and storing the cache data in the convolution shift register;
in the selection process, the control counter counts the number of windows of the convolution sliding window, and sets the redundant window data to 0 according to the counting result;
step 4.2, performing an XNOR convolution operation on the cache data and the convolution kernel of the convolution control module, and taking the operation result as the convolution operation data;
convolution is used to discuss the change of a signal after passing through a linear system, and is expressed by the following equation:
y(t) = ∫ x(τ)h(t - τ)dτ
in the formula, x(τ) represents the input signal, namely the cache data; h(t - τ) represents the linear system; y(t) represents the output signal, namely the convolution operation data;
the convolution operation is changed from the original multiply-accumulate operation into an XNOR-accumulate operation, and the result is converted with the following formula:
Y = 2*Y_xnor - N
wherein Y is the output result of the multiply-accumulate operation, Y_xnor is the output result of the XNOR-accumulate operation, and N is the number of convolution kernel weights;
step 4.3, the convolution operation data of the cache data are accumulated through a four-stage control counter, the accumulated result is expanded by one bit to form 6-bit signed convolution result data, and the convolution result data are serially input into the pooling control module;
step 5, performing pooling;
step 5.1, storing convolution result data into two groups of FIFO cache units of the pooling control module; after the two groups of FIFO buffer units are full, the convolution result data stored in the FIFO buffer units are read out, and meanwhile, the subsequent convolution result data are continuously written into the two groups of FIFO buffer units;
step 5.2, the read convolution result data is used as cache data, the cache data is selected in a pooling sliding window to be output, then convolution selection data is obtained and stored in the pooling shift register;
in the selection process, the control output counter counts the number of windows of the pooling sliding window, and sets the redundant window data to 0 according to the counting result;
step 5.3, the convolution selection data stored in the 4 pooling shift registers are transmitted into the two-stage pipeline comparator; the two-stage pipeline comparator takes the maximum value of the 4 convolution selection data as the pooled data;
step 5.4, the pooled data is a 6-bit signed number; the highest bit of the pooled data is the sign bit, where 0 represents a positive number and 1 represents a negative number; the other five bits represent the number itself, with positive numbers in true form and negative numbers in complement form; the binarization module directly extracts the highest bit of the pooled data and judges whether it is 1: if the highest bit is 1, the 1-bit number 0 is output, and if the highest bit is 0, the 1-bit number 1 is output, and this output is taken as the pooling result; the pooling result is then transmitted into the full-connection layer control module;
step 6, the full-connection layer control module performs accumulation and calculation through an adder tree, and the method comprises the following steps:
step 6.1, calculating the full connection of the hidden layer;
the weight cache module reads the weight data and the bias coefficient of the full-connection layer control module stored in the SD card for standby;
the fully connected layers are the last layers of the network, and each neuron in such a layer is connected to all neurons of the previous layer, so a large number of weights and calculations are required;
the formula of the full connection layer is as follows:
X_k = f(W_k * X_(k-1) + b_k)
wherein X_k is the output activation value of the k-th layer, W_k represents the k-th layer weight data, X_(k-1) is the output activation value of the previous layer, b_k is the bias coefficient of the k-th layer, and f() is the activation function; if the previous layer is a pooling layer, the output activation value is the pooling result; if the previous layer is also a fully connected layer, the output activation value is the result of that layer's activation function;
the activation function is a ReLU linear rectification function:
f(x)=max(0,x)
the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, and the fully connected layer formula is converted into the following form:
X_k = f((W_k ^~ X_(k-1)) + b_k)
the result calculated by the formula is input into an output layer after being subjected to binarization compression by a binarization module again;
step 6.2, output layer full connection calculation;
the layer is the last layer of the network, the calculation mode is the same as the step 6.1, and the final result is output.
2. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein the data input buffer module and the convolution calculation module are both realized by RTL level code.
3. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein each neuron weight of the full link layer control module is stored by adopting a ROM memory.
4. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein the adder tree is composed of a plurality of 1-bit adders connected in parallel in a multi-stage pipeline.
5. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein the storing of the image data in the step 1 specifically is:
the image data comprises three lines of pixel values, the upper computer processes the size of the image data into 28 × 28, and the image data is stored in the SD card.
6. The IP core of the FPGA-based binarization convolutional neural network algorithm of the claim 5, wherein the FIFO buffer unit is composed of a RAM memory on an FPGA chip; the depth of the FIFO buffer unit is the same as the width of the image data; each channel of the input shift register corresponds to three sets of FIFO buffer units.
7. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 6, wherein the image data cached in the step 2 specifically comprises:
and the data input cache module stores the three rows of pixel values of the image data into the three groups of FIFO cache units of the input shift register respectively, completing the caching of the image data.
8. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 7, wherein in the convolution operation of the step 4, the size of the convolution sliding window is 3 x 3, and 9 pieces of one-bit cache data are generated;
the size of a convolution kernel of the convolution operation is 3 x 3, the convolution kernel corresponds to 9 weight data, and the weight data is read from the SD card through the weight cache module; the read 9 weight data are distributed and stored in corresponding 9 weight registers;
during the pooling of step 5, the size of the pooling sliding window was 2 x 2.
9. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1,
the convolution operation is an integration process, and the image is processed by two-dimensional convolution:
a convolution kernel of a certain size is overlaid on the image and moved in a determined direction and with a determined step size, and each time the kernel is moved, one convolution integration is performed over the area covered by the kernel.
10. The IP core of the FPGA-based binarization convolutional neural network algorithm as claimed in claim 1, wherein the final result is not compressed by a binarization module, but is directly output in a form of full-precision result; and the final result is sequentially output through a UART interface.
CN202110599962.3A 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA Active CN113344179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110599962.3A CN113344179B (en) 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110599962.3A CN113344179B (en) 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA

Publications (2)

Publication Number Publication Date
CN113344179A CN113344179A (en) 2021-09-03
CN113344179B true CN113344179B (en) 2022-06-14

Family

ID=77472489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110599962.3A Active CN113344179B (en) 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA

Country Status (1)

Country Link
CN (1) CN113344179B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462587B (en) * 2022-02-10 2023-04-07 电子科技大学 FPGA implementation method for photoelectric hybrid computation neural network
CN115660057B (en) * 2022-12-13 2023-05-12 至讯创新科技(无锡)有限公司 Control method for realizing convolution operation of NAND flash memory
CN116863490B (en) * 2023-09-04 2023-12-12 之江实验室 Digital identification method and hardware accelerator for FeFET memory array

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 Convolutional neural networks module based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110543939A (en) * 2019-06-12 2019-12-06 电子科技大学 hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN111563582A (en) * 2020-05-06 2020-08-21 哈尔滨理工大学 Method for realizing and optimizing accelerated convolution neural network on FPGA (field programmable Gate array)
CN111582451A (en) * 2020-05-08 2020-08-25 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6658033B2 (en) * 2016-02-05 2020-03-04 富士通株式会社 Arithmetic processing circuit and information processing device
CN108073549B (en) * 2016-11-14 2021-04-27 耐能股份有限公司 Convolution operation device and method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 Convolutional neural networks module based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110543939A (en) * 2019-06-12 2019-12-06 电子科技大学 hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN111563582A (en) * 2020-05-06 2020-08-21 哈尔滨理工大学 Method for realizing and optimizing accelerated convolution neural network on FPGA (field programmable Gate array)
CN111582451A (en) * 2020-05-08 2020-08-25 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Solution to Optimize Multi-Operand Adders in CNN Architecture on FPGA; Fasih Ud Din Farrukh et al.; 2019 IEEE International Symposium on Circuits and Systems; 2019-09-18; 1-4 *
High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression; Adrien P. et al.; ACM Transactions on Reconfigurable Technology and Systems; 2018-12-12; Vol. 11, No. 3; 1-24 *
Research on binarization of convolutional neural networks and FPGA acceleration; Cheng Yalang; China Excellent Master's Theses Full-text Database, Information Science and Technology; 2021-05-15 (No. 05); I135-409 *

Also Published As

Publication number Publication date
CN113344179A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN113344179B (en) IP core of binary convolution neural network algorithm based on FPGA
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US20190087713A1 (en) Compression of sparse deep convolutional network weights
US20180218518A1 (en) Data compaction and memory bandwidth reduction for sparse neural networks
CN110780923B (en) Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN107423816B (en) Multi-calculation-precision neural network processing method and system
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN110991631A (en) Neural network acceleration system based on FPGA
CN111626403B (en) Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN108647184B (en) Method for realizing dynamic bit convolution multiplication
Li et al. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN114742225A (en) Neural network reasoning acceleration method based on heterogeneous platform
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN112036475A (en) Fusion module, multi-scale feature fusion convolutional neural network and image identification method
CN113392973A (en) AI chip neural network acceleration method based on FPGA
Xiao et al. FPGA implementation of CNN for handwritten digit recognition
CN113792621A (en) Target detection accelerator design method based on FPGA
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN111008691A (en) Convolutional neural network accelerator architecture with weight and activation value both binarized

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Li Shu

Inventor after: Feng Jiawei

Inventor after: Shi Qingwen

Inventor before: Feng Jiawei

Inventor before: Shi Qingwen

CB03 Change of inventor or designer information