CN113344179B - IP core of binary convolution neural network algorithm based on FPGA - Google Patents

IP core of binary convolution neural network algorithm based on FPGA

Info

Publication number
CN113344179B
CN113344179B
Authority
CN
China
Prior art keywords
data
convolution
binarization
module
control module
Prior art date
Legal status
Active
Application number
CN202110599962.3A
Other languages
Chinese (zh)
Other versions
CN113344179A (en)
Inventor
Feng Jiawei
Shi Qingwen
Current Assignee
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110599962.3A priority Critical patent/CN113344179B/en
Publication of CN113344179A publication Critical patent/CN113344179A/en
Application granted granted Critical
Publication of CN113344179B publication Critical patent/CN113344179B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08 Learning methods


Abstract

An IP core of an FPGA-based binarized convolutional neural network algorithm, belonging to the field of digital circuit design, comprises a data input cache module, a weight cache module, a convolution control module, a pooling control module, a binarization module, a full-connection layer control module and a global control module. The global control module controls the operation of the other modules through the following steps: 1) storing the image data, the binarized network weights and the bias coefficients; 2) inputting and caching the image data; 3) performing binarization compression on the image data; 4) performing the convolution operation; 5) performing pooling; 6) performing accumulation and outputting the final result. The invention rapidly deploys and accelerates a binarized convolutional neural network on an FPGA, provides an IP core that can be deployed quickly on different FPGA platforms, and can deploy an algorithm of a specified scale and complete its computation with low resource occupation.

Description

IP core of binary convolution neural network algorithm based on FPGA
Technical Field
The invention belongs to the field of digital circuit design, and particularly relates to an IP core of a binary convolution neural network algorithm based on an FPGA (field programmable gate array).
Background
At present, with the vigorous development of artificial intelligence, research on artificial intelligence and deep learning is rising rapidly. Artificial neural networks are abstract models of biological neural networks. By simulating the working mechanism of the visual nervous system and extracting input features with convolution kernels, convolutional neural networks are widely applied in fields such as image analysis, classification and autonomous driving.
In the development of neural networks, achieving higher accuracy generally requires a more complex network, and with this complexity the data storage resources and computation consumed by the network increase. In an FPGA environment, computing and storage resources are limited, so computation is constrained when the neural network is too complex and large. Network quantization has been proposed to reduce the resource consumption caused by growing network size. Research has shown that the computational accuracy of the network can still be kept within a reasonable range when the data bit width is reduced. Among the various quantization methods, binarization is the most extreme: the 32-bit floating-point data in the network are forcibly converted into +1 and 0 with a bit width of 1 bit, which reduces the storage occupation of the network by about 32 times compared with that before quantization. Meanwhile, because of the reduced data bit width, the multiplications in the network can be converted into bit operations that are faster and occupy fewer resources and less area, so the computation speed of the network is greatly improved.
An IP core is a pre-designed, and often pre-verified, integrated circuit, device or component with a defined function. As FPGAs/CPLDs grow in size, designs become more and more complex, and the primary task of the designer is to complete a complex design within a specified time. Calling an IP core avoids repeated labor and greatly reduces the burden on engineers, so the use of IP cores is a development trend.
With the application of deep learning on various small chips, the FPGA, as a programmable device, has the advantages of flexibility, configurability and a short development cycle, and is suitable as an early verification platform. However, an overly large neural network puts great pressure on the FPGA's own resources, deployment involves many repeated operations because of the characteristics of the FPGA's internal structure, and porting between different platforms is relatively difficult.
Disclosure of Invention
In view of the above technical problems, the present invention provides an IP core of an FPGA-based binarized convolutional neural network algorithm, including: a data input cache module, a weight cache module, a convolution control module, a pooling control module, a binarization module, a full-connection layer control module and a global control module;
the data input cache module consists of an input shift register, an input data counter and an output data counter;
the input shift register consists of FIFO cache units;
the weight cache module is used for reading the binarized network weights and bias coefficients;
the convolution calculation module comprises a convolution shift register, a weight register, a pipeline adder and a control counter;
the pooling control module comprises a pooling shift register, a two-stage pipeline comparator and a control output counter;
the pooling shift register consists of 2 groups of FIFO buffer units; the number of FIFO buffer units in each group is consistent with the size of the pooling sliding window;
the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules;
the full-connection layer control module is realized by adopting an adder tree;
the global control module is a network overall control module, controls the start and the end of other modules, and ensures correct input and output of data, and specifically comprises the following steps:
step 1, storing the image data, the binarized network weights and the bias coefficients;
the image data is processed by an upper computer and then stored in an SD card; the binarized network weights and bias coefficients are also stored in the SD card;
the binarized network weights comprise the weight data of the convolution control module and the full-connection layer control module;
step 2, inputting and caching image data;
after receiving a starting command transmitted by an upper computer, the global control module is converted into an input state, and image data are read into the data input cache module from the SD card; at the same time, the input data counter starts counting;
step 3, performing binarization compression on the image data;
the global control module is converted into a reading state, the output data counter starts counting, and the image data is read out from the FIFO cache unit; the read image data is subjected to binarization compression through the binarization module to obtain binarization data, and the binarization data is serially input into the convolution calculation module;
in the binarization compression process, because the output value of each module is a signed number, the sign bit is compared directly during binarization compression to judge whether the value is positive or negative: numbers greater than or equal to 0 are compressed into the 1-bit value +1, and numbers smaller than 0 are compressed into 0, specifically:
the binarization compression is carried out by adopting a sign function, and the following binarization formula is utilized:
sign(x) = +1, if x ≥ 0; -1, if x < 0
wherein x is a signal or weight in the network to be binarized; sign(x) is the signal or weight after binarization compression; in the FIFO cache unit, the -1 is simplified to 0 instead, and the binarization formula becomes:
sign(x) = +1, if x ≥ 0; 0, if x < 0
step 4, performing convolution operation;
step 4.1, selecting binary data of the image data in a convolution sliding window to obtain cache data and storing the cache data in the convolution shift register;
in the selection process, the control counter counts the number of windows of the convolution sliding window, and sets the redundant window data to 0 according to the counting result;
step 4.2, performing an XNOR convolution operation on the cache data and the convolution kernel of the convolution control module, and taking the operation result as the convolution operation data;
convolution is used to discuss the change of a signal after passing through a linear system, and is expressed by the following formula:
y(t) = ∫ x(τ)h(t - τ)dτ
in the formula, x(τ) represents the input signal, namely the cache data; h(t - τ) represents the linear system; y(t) represents the output signal, namely the convolution operation data;
the convolution operation is changed from the original multiply-accumulate operation into an XNOR-accumulate operation, and the result is converted with the following formula:
Y = 2*Y_xnor - N
wherein Y is the output result of the multiply-accumulate operation, Y_xnor is the output result of the XNOR-accumulate operation, and N is the number of convolution kernel weights;
step 4.3, the convolution operation data of the cache data are accumulated through a four-stage control counter, the accumulated result is expanded by one bit to form 6-bit signed convolution result data, and the convolution result data are serially input into the pooling control module;
step 5, performing pooling;
step 5.1, storing convolution result data into two groups of FIFO cache units of the pooling control module; after the two groups of FIFO buffer units are full, the convolution result data stored in the FIFO buffer units are read out, and meanwhile, the subsequent convolution result data are continuously written into the two groups of FIFO buffer units;
step 5.2, the read convolution result data is used as cache data, the cache data is selected in a pooling sliding window to be output, then convolution selection data is obtained and stored in the pooling shift register;
in the selection process, the control output counter counts the number of windows of the pooling sliding window, and sets the redundant window data to 0 according to the counting result;
step 5.3, the convolution selection data stored in the 4 pooling shift registers are transmitted into the two-stage pipeline comparator; the two-stage pipeline comparator takes the maximum value of the 4 convolution selection data as the pooled data;
step 5.4, the pooled data is a 6-bit signed number; the highest bit of the pooled data is the sign bit, where 0 represents a positive number and 1 represents a negative number; the other five bits represent the number itself, with positive numbers in true form and negative numbers in complement form; the binarization module directly extracts the highest bit of the pooled data and judges whether it is 1: if the highest bit is 1, the 1-bit number 0 is output, and if the highest bit is 0, the 1-bit number 1 is output, and this output is taken as the pooling result; the pooling result is then transmitted into the full-connection layer control module;
step 6, the full-connection layer control module performs accumulation and calculation through an adder tree, and the method comprises the following steps:
step 6.1, calculating the full connection of the hidden layer;
the weight cache module reads the weight data and the bias coefficient of the full-connection layer control module stored in the SD card for standby;
the fully connected layers are the last layers of the network, and each neuron in such a layer is connected to all neurons of the previous layer, so a large number of weights and calculations are required;
the full connection layer formula is as follows:
X_k = f(W_k * X_(k-1) + b_k)
wherein X_k is the output activation value of the k-th layer, W_k represents the k-th layer weight data, X_(k-1) is the output activation value of the previous layer, b_k is the bias coefficient of the k-th layer, and f() is the activation function; if the previous layer is a pooling layer, the output activation value is the pooling result; if the previous layer is also a fully connected layer, the output activation value is the result of that layer's activation function;
the activation function is a ReLU linear rectification function:
f(x)=max(0,x)
the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, and the fully connected layer formula is converted into the following form:
X_k = f((W_k ^~ X_(k-1)) + b_k)
the result calculated by the formula is input into an output layer after being subjected to binarization compression by a binarization module again;
step 6.2, output layer full connection calculation;
the layer is the last layer of the network, the calculation mode is the same as the step 6.1, and the final result is output.
The data input buffer module and the convolution calculation module are both realized by RTL-level codes.
And the weight of each neuron of the full connection layer control module is stored by adopting a ROM (read only memory).
The adder tree is composed of a plurality of 1-bit adders connected in parallel in a multi-stage pipeline.
The step 1 of storing the image data specifically includes:
the image data comprises three lines of pixel values, the upper computer processes the size of the image data into 28 × 28, and the image data is stored in the SD card.
The FIFO cache unit consists of an RAM memory on an FPGA chip; the depth of the FIFO buffer unit is the same as the width of the image data; each channel of the input shift register corresponds to three sets of FIFO buffer units.
The caching of the image data in the step 2 specifically includes:
the data input cache module stores the three rows of pixel values of the image data into the three groups of FIFO cache units of the input shift register respectively, completing the caching of the image data.
In the convolution operation in the step 4, the size of the convolution sliding window is 3 x 3, and 9 pieces of one-bit cache data are generated;
the size of a convolution kernel of the convolution operation is 3 x 3, and the convolution kernel corresponds to 9 weight data and is read from the SD card through a weight cache module; the read 9 weight data are distributed and stored in corresponding 9 weight registers;
in the pooling process of the step 5, the size of the pooling sliding window is 2 x 2.
The convolution operation is an integration process, and the image is processed by two-dimensional convolution:
a convolution kernel of a certain size is overlaid on the image and moved in a determined direction and with a determined step size, and each time the kernel is moved, one convolution integration is performed over the area covered by the kernel.
The final result is not compressed by a binarization module, but is directly output in the form of a full-precision result; and the final result is sequentially output through a UART interface.
The invention has the beneficial effects that:
(1) the weight of each neuron of the full connection layer control module is stored by adopting a ROM (read only memory), so that programmable logic resources in the FPGA (field programmable gate array) are saved;
(2) the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules; this design, established after analyzing the maximum output values of the sub-modules inside the network, reduces the resource occupancy;
(3) to achieve higher speed, the full-connection layer control module is not implemented as an accumulator; instead, a multi-stage pipeline of many parallel 1-bit adders forms an adder tree that realizes the accumulation inside the neurons of the full-connection layer control module;
(4) the convolution operation is changed from the original multiply-accumulate operation into an XNOR operation; this reduces the dependence on hardware multipliers, and because the multiplications are replaced by bit operations, the operation speed is increased;
because the data involved are all 1-bit and there are many input connections, a purpose-built accumulator would be too time-consuming; therefore, as a trade-off, the invention uses a large number of parallel adders to form an adder tree that completes the accumulation of the inputs;
and the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, which greatly saves logic resources.
The invention has reasonable design, easy realization and good practical value.
Drawings
FIG. 1 is a basic design framework of an IP core of the FPGA-based binarization convolutional neural network algorithm in the embodiment of the invention;
FIG. 2 is a diagram illustrating a comparison between before and after the convolution binarization in the embodiment of the present invention;
FIG. 3 is a diagram of an embodiment of the pooled sliding window in accordance with an embodiment of the present invention;
FIG. 4 is a diagram illustrating simulation results of the pooling module in an embodiment of the present invention;
FIG. 5 is a diagram illustrating simulation results of the binarization module in the embodiment of the present invention;
FIG. 6 is a diagram illustrating an adder tree according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to rapidly deploy and accelerate a binary convolution neural network on an FPGA (field programmable gate array), and provides an IP core capable of rapidly deploying the binary convolution neural network on different FPGA platforms, wherein the IP core can deploy an algorithm of a specified scale and complete calculation under the condition of low resource occupation.
In order to achieve the above object, the present invention provides an IP core of a binary convolution neural network algorithm based on an FPGA, comprising: the system comprises a global control module, a data input cache module, a weight cache module, a convolution control module, a pooling control module, a binarization module and a full-connection layer control module;
the global control module is a network overall control module and consists of a large-scale state machine, and is used for controlling the start and the end of other modules and ensuring correct input and output of data;
the weight cache module is used for reading the prepared binarized network weights and bias coefficients stored in the SD card;
the weight of each neuron of the full connection layer control module is stored by adopting a ROM (read only memory) so as to save programmable logic resources in the FPGA;
the data input cache module and the convolution calculation module are both realized by RTL-level codes;
the data input cache module consists of an input shift register, an input data counter and an output data counter;
each channel of the input shift register corresponds to three sets of FIFO buffer units; the FIFO cache unit consists of an RAM memory on an FPGA chip; the depth of the FIFO buffer unit is the same as the width of the image data;
the convolution calculation module comprises a convolution shift register, a weight register, a pipeline adder and a control counter;
the pooling control module comprises a pooling shift register, a two-stage pipeline comparator and a control output counter;
the pooling shift register consists of 2 groups of FIFO buffer units; the number of FIFO buffer units in each group is consistent with the size of the pooling sliding window;
the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules; this design, established after analyzing the maximum output values of the sub-modules inside the network, reduces the resource occupancy;
the full-connection layer control module is realized by adopting an adder tree;
the adder tree consists of a plurality of 1-bit adders connected in parallel in a multi-stage pipeline;
to achieve higher speed, the full-connection layer control module is not implemented as an accumulator; instead, a multi-stage pipeline of many parallel 1-bit adders forms an adder tree that realizes the accumulation inside the neurons of the full-connection layer control module;
to accelerate development, relatively mature and stable IP cores provided by the manufacturer are selected; however, to facilitate porting across platforms, the memory is described in Verilog HDL and associated with the .mif file in which the weights are stored;
the basic design framework of the IP core of the FPGA-based binarization convolutional neural network algorithm is shown in FIG. 1 and is represented as follows:
the global control module is a network overall control module, controls the start and the end of other modules, and ensures correct input and output of data, and specifically comprises the following steps:
step 1, storing the image data, the binarized network weights and the bias coefficients;
the image data comprises three rows of pixel values; the upper computer processes the image data to a size of 28 × 28 and stores it in the SD card; the binarized network weights and bias coefficients are also stored in the SD card;
the binarized network weights comprise the weight data of the convolution control module and the full-connection layer control module;
step 2, inputting and caching image data;
after receiving a starting command transmitted by an upper computer, the global control module is converted into an input state, and image data are read into the data input cache module from the SD card; at the same time, the input data counter starts counting;
the data input buffer module respectively stores three lines of pixel values of the image data into three groups of FIFO buffer units of the input shift register to finish the buffer storage of the image data;
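As an illustration only (not the RTL itself), the behavior of these three FIFO groups can be modeled in software roughly as follows; the deque-based buffers, the function name stream_windows and the synthetic test image are assumptions made for this sketch:

from collections import deque

IMG_W = 28   # FIFO depth equals the image width

def stream_windows(image):
    # three FIFO groups each hold one full image row so that a 3x3 window can be formed
    rows = [deque(maxlen=IMG_W) for _ in range(3)]
    for r, row in enumerate(image):
        rows[r % 3].clear()
        rows[r % 3].extend(row)
        if r >= 2:                       # three rows buffered: slide the 3x3 window
            r0 = list(rows[(r - 2) % 3])
            r1 = list(rows[(r - 1) % 3])
            r2 = list(rows[r % 3])
            for c in range(IMG_W - 2):
                yield [r0[c:c + 3], r1[c:c + 3], r2[c:c + 3]]

img = [[(x + y) % 2 for x in range(IMG_W)] for y in range(IMG_W)]
first = next(stream_windows(img))        # first 3x3 window of the image
assert len(first) == 3 and len(first[0]) == 3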
step 3, performing binarization compression on the image data;
the global control module is converted into a reading state, the output data counter starts counting, and the image data is read out from the FIFO cache unit; the read image data is subjected to binarization compression through the binarization module to obtain binarization data, and the binarization data is serially input into the convolution calculation module;
in the binarization compression process, because the output value of each module is a signed number, the sign bit is compared directly during binarization compression to judge whether the value is positive or negative: numbers greater than or equal to 0 are compressed into the 1-bit value +1, and numbers smaller than 0 are compressed into 0, specifically:
the binarization compression is carried out by adopting a sign function, and the following binarization formula is utilized:
sign(x) = +1, if x ≥ 0; -1, if x < 0
wherein x is a signal or weight in the network to be binarized; sign(x) is the signal or weight after binarization compression; in the FIFO cache unit, in order to reduce the complexity of operation and storage, the -1 is simplified to 0 instead, and the binarization formula becomes:
sign(x) = +1, if x ≥ 0; 0, if x < 0
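As a minimal illustrative sketch (not part of the hardware description; the function name binarize is an assumption), the binarization compression above behaves like the following Python model, where values greater than or equal to 0 are stored as the 1-bit code 1 and negative values as 0:

def binarize(x: int) -> int:
    # values >= 0 map to the 1-bit code 1 (representing +1); values < 0 map to 0
    return 1 if x >= 0 else 0

# example: -5 and -1 are stored as 0, while 0 and 3 are stored as 1
assert [binarize(v) for v in (-5, -1, 0, 3)] == [0, 0, 1, 1]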
step 4, performing convolution operation;
step 4.1, selecting binary data of the image data in a convolution sliding window to obtain 9 one-bit cache data and storing the cache data into the convolution shift register; the size of the convolution sliding window is 3 x 3;
in the selection process, the control counter counts the number of windows of the convolution sliding window, and sets the redundant window data to 0 according to the counting result;
step 4.2, performing an XNOR convolution operation on the cache data and the convolution kernel of the convolution control module, and taking the operation result as the convolution operation data;
the size of a convolution kernel of the convolution operation is 3 x 3, the convolution kernel corresponds to 9 weight data, and the weight data is read from the SD card through the weight cache module; the read 9 weight data are distributed and stored in corresponding 9 weight registers;
the invention adopts a VGG16 network structure, and the convolution operation is an integration process; in general, an image is processed by two-dimensional convolution: a convolution kernel of a certain size is overlaid on the image and moved in a determined direction and with a determined step size, and each time the kernel is moved, one convolution integration is performed over the area covered by the kernel;
in the field of images, digital images are two-dimensional discrete signals; in the signal field, convolution is used to discuss the change of a signal after the signal passes through a linear system, and is generally expressed by the following formula:
y(t) = ∫ x(τ)h(t - τ)dτ
in the formula, x(τ) represents the input signal, namely the cache data; h(t - τ) represents the linear system; y(t) represents the output signal, namely the convolution operation data;
as shown in fig. 2, the convolution operation is changed from the original multiply-accumulate operation into an XNOR operation; this reduces the dependence on hardware multipliers, and because the multiplications are replaced by bit operations, the operation speed is increased; however, since the result of the XNOR-accumulate operation is not identical to the result of the original multiply-accumulate operation, the following formula is used for conversion:
Y = 2*Y_xnor - N
wherein Y is the output result of the multiply-accumulate operation, Y_xnor is the output result of the XNOR-accumulate operation, and N is the number of convolution kernel weights; that is, after conversion, when Y_xnor is 6 and the number of convolution kernel weights N is 9, the output result calculated by the formula is 3, which is consistent with the output result Y of the original multiply-accumulate operation;
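The equivalence behind this conversion can be checked with a small software sketch (an illustration under the assumption that +1/-1 values are coded as 1/0; the helper names are invented for the sketch and do not appear in the patent):

def mac_pm1(a, w):
    # multiply-accumulate with +1/-1 operands (the original convolution)
    return sum(x * y for x, y in zip(a, w))

def xnor_popcount(a_bits, w_bits):
    # count the positions where the 1-bit codes agree (XNOR result is 1)
    return sum(1 for x, y in zip(a_bits, w_bits) if x == y)

a = [+1, +1, +1, +1, -1, -1, +1, +1, +1]    # window values under a 3x3 kernel
w = [+1, +1, +1, -1, -1, +1, +1, -1, +1]    # kernel weights
a_bits = [1 if v >= 0 else 0 for v in a]
w_bits = [1 if v >= 0 else 0 for v in w]

N = len(w)                                   # N = 9 convolution kernel weights
y_xnor = xnor_popcount(a_bits, w_bits)       # Y_xnor = 6 for this data
assert mac_pm1(a, w) == 2 * y_xnor - N == 3  # matches the worked example above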
step 4.3, the convolution operation data obtained from the 9 one-bit cache data are accumulated through a four-stage control counter, the accumulated result is expanded by one bit to form 6-bit signed convolution result data, and the convolution result data are serially input into the pooling control module;
step 5, performing pooling;
step 5.1, storing convolution result data into two groups of FIFO cache units of the pooling control module; after the two groups of FIFO buffer units are full, reading out the convolution result data stored therein, and continuously writing the subsequent convolution result data into the two groups of FIFO buffer units, as shown in FIG. 3;
step 5.2, the read convolution result data is used as cache data, the cache data is selected in a pooling sliding window to be output, then convolution selection data is obtained and stored in the pooling shift register; the size of the pooling sliding window is 2 x 2;
in the selection process, the control output counter counts the number of windows of the pooling sliding window, the redundant window data is set to be 0 according to the counting result, and the module simulation result is shown in fig. 4;
step 5.3, the convolution selection data stored in the 4 pooling shift registers are transmitted into the two-stage pipeline comparator; the two-stage pipeline comparator takes the maximum value of the 4 convolution selection data as the pooled data;
step 5.4, the pooled data is a 6-bit signed number; the highest bit of the pooled data is the sign bit, where 0 represents a positive number and 1 represents a negative number; the other five bits represent the number itself, with positive numbers in true form and negative numbers in complement form; therefore the binarization module directly extracts the highest bit of the pooled data and judges whether it is 1: if the highest bit is 1, the 1-bit number 0 is output, and if the highest bit is 0, the 1-bit number 1 is output, and this output is taken as the pooling result, with the module simulation result shown in fig. 5; the pooling result is then transmitted into the full-connection layer control module;
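A hedged software sketch of steps 5.3 and 5.4 (the function names and the two's-complement handling are illustrative assumptions; the hardware uses the two-stage pipeline comparator and the binarization module):

def max_pool_2x2(values):
    # first comparator stage compares two pairs, second stage compares the winners
    a = max(values[0], values[1])
    b = max(values[2], values[3])
    return max(a, b)

def binarize_6bit_signed(value):
    # take the 6-bit two's-complement code and inspect only the sign bit (bit 5)
    code = value & 0x3F
    sign_bit = (code >> 5) & 1
    return 0 if sign_bit == 1 else 1   # sign bit 1 (negative) -> 0, otherwise -> 1

pooled = max_pool_2x2([-7, 3, -1, 2])      # maximum of four convolution results -> 3
assert binarize_6bit_signed(pooled) == 1   # non-negative value binarizes to 1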
step 6, the full-connection layer control module performs accumulation and calculation through an adder tree, and the method comprises the following steps:
step 6.1, calculating the full connection of the hidden layer;
the weight cache module reads the weight data and the bias coefficient of the full-connection layer control module stored in the SD card for standby;
the fully connected layers are the last layers of the network, and each neuron in such a layer is connected to all neurons of the previous layer, so a large number of weights and calculations are required;
the full connection layer formula is as follows:
X_k = f(W_k * X_(k-1) + b_k)
wherein X_k is the output activation value of the k-th layer, W_k represents the k-th layer weight data, X_(k-1) is the output activation value of the previous layer, b_k is the bias coefficient of the k-th layer, and f() is the activation function; if the previous layer is a pooling layer, the output activation value is the pooling result; if the previous layer is also a fully connected layer, the output activation value is the result of that layer's activation function;
the activation function is a ReLU linear rectification function:
f(x)=max(0,x)
because the data involved are all 1-bit and there are many input connections, a purpose-built accumulator would be too time-consuming; therefore, as a trade-off, the invention uses an adder tree formed by a large number of parallel adders to complete the accumulation of the inputs, as shown in FIG. 6;
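A minimal sketch of such an adder tree reduction (a software analogy of FIG. 6, assuming pairwise addition per pipeline stage; the function name is illustrative):

def adder_tree_sum(bits):
    # sum 1-bit inputs by pairwise addition, one tree level per pipeline stage
    level = list(bits)
    while len(level) > 1:
        if len(level) % 2:           # pad an odd level so every operand has a partner
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0

# seven 1-bit inputs are reduced in three tree stages instead of seven accumulator cycles
assert adder_tree_sum([1, 0, 1, 1, 1, 0, 1]) == 5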
the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, which greatly saves logic resources; thus, the fully connected layer formula is converted into the following form:
X_k = f((W_k ^~ X_(k-1)) + b_k)
the result calculated by the formula is input into an output layer after being subjected to binarization compression by a binarization module again;
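For illustration, one neuron of this layer can be modeled as follows (a sketch only: the function name, the use of Python's sum in place of the hardware adder tree, and the conversion back to the +1/-1 domain before adding the bias are assumptions made to keep the example self-contained):

def fully_connected_neuron(x_bits, w_bits, bias):
    # XNOR of binarized activations and weights: 1 where the bits agree
    xnor_bits = [1 if x == w else 0 for x, w in zip(x_bits, w_bits)]
    y_xnor = sum(xnor_bits)            # accumulated by the adder tree in hardware
    y = 2 * y_xnor - len(w_bits)       # same conversion as for the convolution
    return max(0, y + bias)            # ReLU activation f(x) = max(0, x)

# four binarized inputs, four weights, bias of 1: y_xnor = 2, y = 0, output = 1
assert fully_connected_neuron([1, 0, 1, 1], [1, 1, 1, 0], bias=1) == 1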
step 6.2, output layer full connection calculation;
the layer is the last layer of the network, the calculation mode is the same as the step 6.1, but the final result is not compressed by a binarization module, but is directly output in the form of a full-precision result;
the final result is output sequentially through a UART interface (the interface configuration table is shown in the original figure).

Claims (10)

1. An IP core of an FPGA-based binarized convolutional neural network algorithm, characterized by comprising: a data input cache module, a weight cache module, a convolution calculation module, a convolution control module, a pooling control module, a binarization module, a full-connection layer control module and a global control module;
the data input cache module consists of an input shift register, an input data counter and an output data counter;
the input shift register consists of FIFO cache units;
the weight cache module is used for reading the binarized network weights and bias coefficients;
the convolution calculation module comprises a convolution shift register, a weight register, a pipeline adder and a control counter;
the pooling control module comprises a pooling shift register, a two-stage pipeline comparator and a control output counter;
the pooling shift register consists of 2 groups of FIFO buffer units; the number of FIFO buffer units in each group is consistent with the size of the pooling sliding window;
the input bit width of the binarization module is one bit more than the bit width occupied by the sum of the maximum output values of the other modules;
the full-connection layer control module is realized by adopting an adder tree;
the global control module is a network overall control module, controls the start and the end of other modules, and ensures correct input and output of data, and specifically comprises the following steps:
step 1, storing the image data, the binarized network weights and the bias coefficients;
the image data is processed by an upper computer and then stored in an SD card; the binarized network weights and bias coefficients are also stored in the SD card;
the binarized network weights comprise the weight data of the convolution control module and the full-connection layer control module;
step 2, inputting and caching image data;
after receiving a starting command transmitted by an upper computer, the global control module is converted into an input state, and image data are read into the data input cache module from the SD card; at the same time, the input data counter starts counting;
step 3, performing binarization compression on the image data;
the global control module is converted into a reading state, the output data counter starts counting, and the image data is read out from the FIFO cache unit; performing binarization compression on the read image data through the binarization module to obtain binarization data, and inputting the binarization data into the convolution calculation module in series;
in the binarization compression process, because the output value of each module is a signed number, the sign bit is compared directly during binarization compression to judge whether the value is positive or negative: numbers greater than or equal to 0 are compressed into the 1-bit value +1, and numbers smaller than 0 are compressed into 0, specifically:
the binarization compression is carried out by adopting a sign function, and the following binarization formula is utilized:
sign(x) = +1, if x ≥ 0; -1, if x < 0
wherein x is a signal or weight in the network to be binarized; sign(x) is the signal or weight after binarization compression; in the FIFO cache unit, the -1 is simplified to 0 instead, and the binarization formula becomes:
sign(x) = +1, if x ≥ 0; 0, if x < 0
step 4, performing convolution operation;
step 4.1, selecting binary data of the image data in a convolution sliding window to obtain cache data and storing the cache data in the convolution shift register;
in the selection process, the control counter counts the number of windows of the convolution sliding window, and sets the redundant window data to 0 according to the counting result;
step 4.2, performing an XNOR convolution operation on the cache data and the convolution kernel of the convolution control module, and taking the operation result as the convolution operation data;
convolution is used to discuss the change of a signal after passing through a linear system, and is expressed by the following equation:
y(t) = ∫ x(τ)h(t - τ)dτ
in the formula, x(τ) represents the input signal, namely the cache data; h(t - τ) represents the linear system; y(t) represents the output signal, namely the convolution operation data;
the convolution operation is changed from the original multiply-accumulate operation into an XNOR-accumulate operation, and the result is converted with the following formula:
Y = 2*Y_xnor - N
wherein Y is the output result of the multiply-accumulate operation, Y_xnor is the output result of the XNOR-accumulate operation, and N is the number of convolution kernel weights;
step 4.3, the convolution operation data of the cache data are accumulated through a four-stage control counter, the accumulated result is expanded by one bit to form 6-bit signed convolution result data, and the convolution result data are serially input into the pooling control module;
step 5, performing pooling;
step 5.1, storing convolution result data into two groups of FIFO cache units of the pooling control module; after the two groups of FIFO buffer units are full, the convolution result data stored in the FIFO buffer units are read out, and meanwhile, the subsequent convolution result data are continuously written into the two groups of FIFO buffer units;
step 5.2, the read convolution result data is used as cache data, the cache data is selected in a pooling sliding window to be output, then convolution selection data is obtained and stored in the pooling shift register;
in the selection process, the control output counter counts the number of windows of the pooling sliding window, and sets the redundant window data to 0 according to the counting result;
step 5.3, the convolution selection data stored in the 4 pooling shift registers are transmitted into the two-stage pipeline comparator; the two-stage pipeline comparator takes the maximum value of the 4 convolution selection data as the pooled data;
step 5.4, the pooled data is a 6-bit signed number; the highest bit of the pooled data is the sign bit, where 0 represents a positive number and 1 represents a negative number; the other five bits represent the number itself, with positive numbers in true form and negative numbers in complement form; the binarization module directly extracts the highest bit of the pooled data and judges whether it is 1: if the highest bit is 1, the 1-bit number 0 is output, and if the highest bit is 0, the 1-bit number 1 is output, and this output is taken as the pooling result; the pooling result is then transmitted into the full-connection layer control module;
step 6, the full-connection layer control module performs accumulation and calculation through an adder tree, and the method comprises the following steps:
step 6.1, calculating the full connection of the hidden layer;
the weight cache module reads the weight data and the bias coefficient of the full-connection layer control module stored in the SD card for standby;
the fully connected layers are the last layers of the network, and each neuron in such a layer is connected to all neurons of the previous layer, so a large number of weights and calculations are required;
the formula of the full connection layer is as follows:
X_k = f(W_k * X_(k-1) + b_k)
wherein X_k is the output activation value of the k-th layer, W_k represents the k-th layer weight data, X_(k-1) is the output activation value of the previous layer, b_k is the bias coefficient of the k-th layer, and f() is the activation function; if the previous layer is a pooling layer, the output activation value is the pooling result; if the previous layer is also a fully connected layer, the output activation value is the result of that layer's activation function;
the activation function is a ReLU linear rectification function:
f(x)=max(0,x)
the multiplication of the binarization-compressed output activation value of the previous layer with the weight w is replaced by a bitwise XNOR, and the fully connected layer formula is converted into the following form:
X_k = f((W_k ^~ X_(k-1)) + b_k)
the result calculated by the formula is input into an output layer after being subjected to binarization compression by a binarization module again;
step 6.2, output layer full connection calculation;
the layer is the last layer of the network, the calculation mode is the same as the step 6.1, and the final result is output.
2. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein the data input buffer module and the convolution calculation module are both realized by RTL level code.
3. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein each neuron weight of the full link layer control module is stored by adopting a ROM memory.
4. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein the adder tree is composed of a plurality of 1-bit adders connected in parallel in a multi-stage pipeline.
5. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1, wherein the storing of the image data in the step 1 specifically is:
the image data comprises three lines of pixel values, the upper computer processes the size of the image data into 28 × 28, and the image data is stored in the SD card.
6. The IP core of the FPGA-based binarization convolutional neural network algorithm of the claim 5, wherein the FIFO buffer unit is composed of a RAM memory on an FPGA chip; the depth of the FIFO buffer unit is the same as the width of the image data; each channel of the input shift register corresponds to three sets of FIFO buffer units.
7. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 6, wherein the image data cached in the step 2 specifically comprises:
and the data input cache module stores the three rows of pixel values of the image data into the three groups of FIFO cache units of the input shift register respectively, completing the caching of the image data.
8. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 7, wherein in the convolution operation of the step 4, the size of the convolution sliding window is 3 x 3, and 9 pieces of one-bit cache data are generated;
the size of a convolution kernel of the convolution operation is 3 x 3, the convolution kernel corresponds to 9 weight data, and the weight data is read from the SD card through the weight cache module; the read 9 weight data are distributed and stored in corresponding 9 weight registers;
during the pooling of step 5, the size of the pooling sliding window was 2 x 2.
9. The IP core of the FPGA-based binarization convolutional neural network algorithm of claim 1,
the convolution operation is an integration process, and the image is processed by two-dimensional convolution:
a convolution kernel of a certain size is overlaid on the image and moved in a determined direction and with a determined step size, and each time the kernel is moved, one convolution integration is performed over the area covered by the kernel.
10. The IP core of the FPGA-based binarization convolutional neural network algorithm as claimed in claim 1, wherein the final result is not compressed by a binarization module, but is directly output in a form of full-precision result; and the final result is sequentially output through a UART interface.
CN202110599962.3A 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA Active CN113344179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110599962.3A CN113344179B (en) 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110599962.3A CN113344179B (en) 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA

Publications (2)

Publication Number Publication Date
CN113344179A CN113344179A (en) 2021-09-03
CN113344179B true CN113344179B (en) 2022-06-14

Family

ID=77472489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110599962.3A Active CN113344179B (en) 2021-05-31 2021-05-31 IP core of binary convolution neural network algorithm based on FPGA

Country Status (1)

Country Link
CN (1) CN113344179B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462587B (en) * 2022-02-10 2023-04-07 电子科技大学 FPGA implementation method for photoelectric hybrid computation neural network
CN115660057B (en) * 2022-12-13 2023-05-12 至讯创新科技(无锡)有限公司 Control method for realizing convolution operation of NAND flash memory
CN116863490B (en) * 2023-09-04 2023-12-12 之江实验室 Digital identification method and hardware accelerator for FeFET memory array

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 Convolutional neural networks module based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110543939A (en) * 2019-06-12 2019-12-06 电子科技大学 hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN111563582A (en) * 2020-05-06 2020-08-21 哈尔滨理工大学 Method for realizing and optimizing accelerated convolution neural network on FPGA (field programmable Gate array)
CN111582451A (en) * 2020-05-08 2020-08-25 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6658033B2 (en) * 2016-02-05 2020-03-04 富士通株式会社 Arithmetic processing circuit and information processing device
CN108073549B (en) * 2016-11-14 2021-04-27 耐能股份有限公司 Convolution operation device and method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN109711533A (en) * 2018-12-20 2019-05-03 西安电子科技大学 Convolutional neural networks module based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110543939A (en) * 2019-06-12 2019-12-06 电子科技大学 hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN111563582A (en) * 2020-05-06 2020-08-21 哈尔滨理工大学 Method for realizing and optimizing accelerated convolution neural network on FPGA (field programmable Gate array)
CN111582451A (en) * 2020-05-08 2020-08-25 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Solution to Optimize Multi-Operand Adders in CNN Architecture on FPGA; Fasih Ud Din Farrukh et al.; 2019 IEEE International Symposium on Circuits and Systems; 2019-09-18; 1-4 *
High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression; Adrien P. et al.; ACM Transactions on Reconfigurable Technology and Systems; 2018-12-12; Vol. 11, No. 3; 1-24 *
Research on binarization of convolutional neural networks and FPGA acceleration; Cheng Yalang; China Excellent Master's Theses Full-text Database, Information Science and Technology; 2021-05-15 (No. 05); I135-409 *

Also Published As

Publication number Publication date
CN113344179A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN113344179B (en) IP core of binary convolution neural network algorithm based on FPGA
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US20190087713A1 (en) Compression of sparse deep convolutional network weights
US20180218518A1 (en) Data compaction and memory bandwidth reduction for sparse neural networks
CN110780923B (en) Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN107423816B (en) Multi-calculation-precision neural network processing method and system
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN110991631A (en) Neural network acceleration system based on FPGA
CN111626403B (en) Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN108647184B (en) Method for realizing dynamic bit convolution multiplication
Li et al. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN114742225A (en) Neural network reasoning acceleration method based on heterogeneous platform
CN111768458A (en) Sparse image processing method based on convolutional neural network
CN112036475A (en) Fusion module, multi-scale feature fusion convolutional neural network and image identification method
CN113392973A (en) AI chip neural network acceleration method based on FPGA
Xiao et al. FPGA implementation of CNN for handwritten digit recognition
CN113792621A (en) Target detection accelerator design method based on FPGA
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN111008691A (en) Convolutional neural network accelerator architecture with weight and activation value both binarized

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Li Shu

Inventor after: Feng Jiawei

Inventor after: Shi Qingwen

Inventor before: Feng Jiawei

Inventor before: Shi Qingwen

CB03 Change of inventor or designer information