CN109447241A - Dynamic reconfigurable convolutional neural network accelerator architecture for the field of the Internet of Things - Google Patents

Dynamic reconfigurable convolutional neural network accelerator architecture for the field of the Internet of Things

Info

Publication number
CN109447241A
CN109447241A (application CN201811149741.0A; granted as CN109447241B)
Authority
CN
China
Prior art keywords
data
module
input
output
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811149741.0A
Other languages
Chinese (zh)
Other versions
CN109447241B (en)
Inventor
杨晨
王逸洲
王小力
耿莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201811149741.0A priority Critical patent/CN109447241B/en
Publication of CN109447241A publication Critical patent/CN109447241A/en
Application granted granted Critical
Publication of CN109447241B publication Critical patent/CN109447241B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention discloses a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things, comprising a cache architecture, a processing unit array, a computing module and a controller. The cache architecture is used to store data from external memory or data generated during computation; after organizing and arranging the data, it transmits them to the processing unit array in a preset data structure for computation. The processing unit array receives data from the cache architecture, performs convolution operations, and stores the results back in the cache architecture. The computing module receives data from the processing unit array, selectively performs one of three operations (pooling, normalization or an activation function), and stores the output data in the cache architecture. The controller sends commands to the cache architecture, the processing unit array and the computing module, and is designed with an external interface for communicating with an external system. By designing a processing unit array with high parallelism and high utilization and a cache architecture that raises the data reuse rate, the present invention improves the performance of the convolutional neural network accelerator and reduces power consumption.

Description

Dynamic reconfigurable convolutional neural network accelerator architecture for the field of the Internet of Things
Technical Field
The invention belongs to the field of neural network accelerators, and particularly relates to a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things.
Background
Artificial intelligence is currently one of the most active areas of computer science, and deep learning has developed rapidly as the main route to realizing it. The computational complexity of a model grows exponentially with the number of network layers and the number of neurons per layer, so the training and inference speed of deep learning algorithms depends more and more on large-scale computing platforms such as cloud computing. For hardware acceleration of deep learning algorithms there are currently three common implementation approaches, namely multi-core CPUs, GPUs and FPGAs, whose common characteristic is that they can realize highly parallel computation. However, the existing hardware implementations suffer from high power consumption and low energy efficiency (performance per watt) and cannot be applied to smart mobile terminals such as smartphones, wearable devices or autonomous vehicles. Against this background, the reconfigurable processor has proved to be a parallel computing architecture with high flexibility and high energy efficiency: an appropriate resource configuration strategy can be selected for different model sizes, which widens the application range of a special-purpose processor and improves processing performance. It is one of the solutions to the limitations that constrain the further development of multi-core CPU and FPGA technology, and is likely to become one of the approaches to realizing highly efficient deep learning SoCs in the future.
A convolutional neural network accelerator must first satisfy the requirements of reconfigurability and configurability, supporting the continuous evolution of network structures at the algorithm level and serving rich and diverse application scenarios. Second, it must satisfy the requirements of high performance and low energy consumption, which demands overcoming the limitation of storage bandwidth and making full use of hardware resources.
Disclosure of Invention
The invention aims to provide a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things. By designing a processing unit array with high parallelism and high utilization and a cache architecture capable of raising the data reuse rate, it improves the performance of the convolutional neural network accelerator and reduces power consumption, while offering a degree of configurability that makes it suitable for a variety of application scenarios.
The invention is realized by adopting the following technical scheme:
a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things comprises a cache architecture, a processing unit array, a computing module and a controller; wherein,
the cache architecture is used for storing data from an external memory or data generated in the calculation process, organizing and arranging the data, and transmitting the data to the processing unit array for calculation according to a preset data structure; the processing unit array is used for receiving data from the cache architecture, performing convolution operation processing and storing the results in the cache architecture; the computing module is used for receiving data from the processing unit array, selectively performing one of three operations (pooling, normalization or an activation function), and storing the output data in the cache architecture; the controller is used for sending commands to the cache architecture, the processing unit array and the computing module, and is designed with an external interface for communicating with an external system.
The invention has the further improvement that the cache architecture consists of an input data cache, a convolution kernel cache and an output data cache; the output end of the controller is connected with the input end of the input data cache, the input end of the convolution kernel cache and the input end of the output data cache, the output end of the input data cache and the output end of the convolution kernel cache are connected with the input end of the processing unit array, the output end of the processing unit array is connected with the input end of the calculation module, and the output end of the calculation module is connected with the input end of the output data cache;
the input data cache is used for receiving input image data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array for operation; the convolution kernel cache is used for receiving convolution kernel data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array for operation; the output data cache is used for storing intermediate data generated by the computing module or the processing unit array and transmitting the data to an external system through the controller.
The invention has the further improvement that the processing unit array consists of 20 processing units, and each processing unit consists of an input data conversion module, a convolution kernel conversion module, a multiplier, an output data conversion module and a channel accumulation module;
the input data conversion module is used for simultaneously converting the multi-channel input image data from the input data cache on a plurality of processing units; the convolution kernel conversion module is used for simultaneously converting the multi-channel convolution kernel data from the convolution kernel cache on a plurality of processing units; the multiplier is used for multiplying the output data from the input data conversion module and the convolution kernel conversion module; the output data conversion module is used for converting the output result of the multiplier; the channel accumulation module is used for summing the multi-channel data to obtain data of one channel.
The invention is further improved in that the processing unit array is used for receiving data from an input data cache, an output data cache or a convolution kernel cache, and outputting the data to the output data cache; each processing unit implements the Winograd algorithm with a window of 5 × 5, wherein the formula of the Winograd algorithm is as follows:
U = G F Gᵀ (1)
V = Bᵀ In B (2)
Out = Aᵀ [U ⊙ V] A (3)
equation (1) represents the conversion of the convolution kernel: the F matrix is the convolution kernel, the G matrix is a conversion matrix, and the U matrix is the result of the convolution kernel conversion; equation (2) represents the conversion of the input data: the In matrix is the input data, B is a conversion matrix, and V is the result of the input data conversion; equation (3) represents the output data conversion: A is a conversion matrix, ⊙ denotes element-wise multiplication, and Out is the final output result.
The invention has the further improvement that the calculation module consists of a pooling module, an activation function module, a data normalization module, an input selection module and an output selection module;
the output end of the controller is connected with the input end of the input selection module and the input end of the output selection module, the output end of the processing unit array is connected with the input end of the input selection module, the output end of the input selection module is connected with the input end of the pooling module, the input end of the activation function module and the input end of the data normalization module, the output end of the pooling module, the output end of the activation function module and the output end of the data normalization module are connected with the input end of the output selection module, and the output end of the output selection module is connected with the input end of the output data cache;
the input selection module is used for selecting whether data from the processing unit array are pooled, normalized or passed through the activation function; the pooling module is used for realizing the pooling operation; the data normalization module is used for realizing the normalization operation; the activation function module is used for realizing the ReLU activation function operation; the output selection module is used for selecting the output of one of the pooling module, the data normalization module or the activation function module as the result to be output to the cache architecture.
In a further development of the invention, the activation function module executes a ReLU activation function, whose expression is shown in equation (4):
f(x) = x if x > 0; f(x) = 0 if x ≤ 0 (4)
wherein x in equation (4) represents the input of the ReLU activation function, i.e. the output result of the channel accumulation module, and f(x) represents the output of the activation function module.
The invention has the following beneficial technical effects:
1. The accelerator architecture is oriented to lightweight networks such as ShuffleNet, which offer relatively high accuracy with simpler network structures and fewer network parameters.
2. The accelerator adopts the Winograd algorithm to accelerate convolution operations, which reduces the number of multiplications, increases the accelerator's speed and improves system energy efficiency.
3. The accelerator supports convolution, pooling, activation function, normalization and fully connected operations. It can complete these operations in one pass, reducing external memory accesses.
4. The accelerator is fully configurable, including the numbers of input and output channels, the input image size, the convolution kernel size and the convolution stride.
5. The accelerator has three on-chip caches. The cache architecture introduces an operation that exchanges the input and output data caches, and reuses input image data (rows and columns) and intermediate data, greatly reducing external memory reads and on-chip cache reads and improving energy efficiency.
6. The accelerator offers high parallelism at several levels: across input and output channels, within the input image, and within the convolution window; the degree of parallelism can be adjusted to the available bandwidth.
7. The accelerator can be integrated on a general-purpose SoC platform, where various network structures can be configured conveniently through a software platform.
In summary, the invention is directed at the field of Internet of Things mobile terminals. By designing a processing unit array with high parallelism and high utilization and an appropriate cache architecture, it meets the requirements of low power consumption and high performance while providing a degree of configurability, and is suitable for a variety of lightweight convolutional neural networks and some deep neural networks.
Drawings
Fig. 1 is a schematic diagram of a dynamic reconfigurable convolutional neural network processor architecture oriented to the field of internet of things.
Fig. 2 is a schematic diagram of the off-chip operation mode of the input data cache.
Fig. 3 is a schematic diagram of the on-chip operation mode of the output data cache.
Fig. 4 is a schematic diagram of a processing unit structure.
Description of reference numerals:
1 is a cache architecture, 10 is an input data cache, 11 is a convolution kernel cache, and 12 is an output data cache;
2 is a processing unit array, 21 is an input data conversion module, 22 is a convolution kernel conversion module, 23 is a multiplier, 24 is an output data conversion module, and 25 is a channel accumulation module;
3 is a calculation module, 31 is a pooling module, 32 is a data standardization module, 33 is an activation function module, 34 is an input selection module, and 35 is an output selection module;
4 is a controller.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the dynamic reconfigurable convolutional neural network accelerator architecture for the field of internet of things provided by the invention is composed of four parts, namely a cache architecture 1, a processing unit array 2, a computing module 3 and a controller 4.
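The cooperation of the four parts can be sketched at the top level as follows (a hypothetical Python model for illustration only; the class name Accelerator and its methods are illustrative and not from the patent):

```python
import numpy as np

class Accelerator:
    """Hypothetical top-level model of the four parts."""
    def __init__(self, pe_array, compute_module):
        self.cache = {}                      # stands in for caches 10, 11 and 12
        self.pe_array = pe_array             # part 2: convolution
        self.compute = compute_module        # part 3: pooling / ReLU / normalization

    def run_layer(self, op):                 # part 4: the controller sequences this
        conv = self.pe_array(self.cache["in"], self.cache["kernels"])
        self.cache["out"] = self.compute(conv, op)

# Toy usage with stand-in computations:
acc = Accelerator(pe_array=lambda x, k: x * k.sum(),
                  compute_module=lambda x, op: np.maximum(x, 0) if op == "relu" else x)
acc.cache.update({"in": np.random.randn(4, 4), "kernels": np.ones((3, 3))})
acc.run_layer("relu")
```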
The cache architecture 1 provided by the invention is composed of an input data cache 10, an output data cache 12 and a convolution kernel cache 11. The cache architecture 1 serves to store input data, temporary intermediate data, output data and the like; it increases the reuse rate of data and reduces the high energy consumption and high latency caused by external memory accesses. The sizes of the input data cache, the output data cache and the convolution kernel cache are determined by the size of the convolutional neural network to be implemented. The cache architecture as a whole has two working modes:
1) Off-chip data mode of operation
This mode applies to the first layer of the convolutional neural network: input data are stored in the input data cache, output data are stored in the output data cache, and the convolution kernels are stored in the convolution kernel cache. The specific operation mode is shown in fig. 2.
2) On-chip data mode of operation
This mode applies to the middle-layer operations of the convolutional neural network: input data come from the output data cache, the convolution kernels come from the convolution kernel cache, and the output temporary data overwrite the old data in the output data cache. The temporary data therefore never need to be stored off chip during these operations, which reduces external memory accesses. The precondition is that the temporary data are of a suitable size; for lightweight networks such as ShuffleNet the temporary data are small, so this mode is feasible. The specific operation mode is shown in fig. 3.
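A minimal scheduling sketch of the two modes (a hypothetical Python model; function and parameter names such as run_network and external_read are illustrative, not from the patent):

```python
import numpy as np

def run_network(layers, external_read, external_write):
    """Hypothetical model of the two cache working modes: layer 0 streams its
    input from external memory (off-chip mode); every middle layer reads its
    input directly from the output data cache and overwrites the old contents
    (on-chip mode), so no intermediate tensor leaves the chip."""
    output_cache = None
    for i, layer in enumerate(layers):
        x = external_read() if i == 0 else output_cache  # off-chip vs on-chip source
        output_cache = layer(x)                          # new results overwrite old data
    external_write(output_cache)                         # only the final result goes off-chip

# Toy usage with stand-in "layers":
layers = [lambda x: x * 2, lambda x: x + 1, lambda x: x.clip(min=0)]
run_network(layers, external_read=lambda: np.ones((4, 4)),
            external_write=lambda y: print(y.shape))
```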
The cache architecture 1 provided by the invention is composed of the following modules, and the specific functions and implementation are as follows:
1) Input data caching
The input data cache 10 receives data, generally image data, from external memory and outputs them to the processing unit array 2 according to a predefined structure. Its function is to convert large blocks of image data into small blocks that are easy to process, and to improve data reuse by overlapping the "row and column" data between blocks, as shown in fig. 2. The input data cache 10 is composed of two memories that accept data in a "ping-pong" manner: while one memory receives data, the other outputs data, and vice versa.
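The "ping-pong" scheme can be modeled as follows (a hypothetical sketch; the class and method names are illustrative):

```python
class PingPongBuffer:
    """Hypothetical model of the input data cache 10: two memories swap roles
    each round, so one can be filled from external memory while the other
    feeds the processing unit array."""
    def __init__(self):
        self.banks = [[], []]
        self.write_sel = 0                  # bank currently being filled

    def fill(self, block):                  # receive one data block
        self.banks[self.write_sel] = block

    def drain(self):                        # output the other bank to the PE array
        return self.banks[1 - self.write_sel]

    def swap(self):                         # exchange roles after each round
        self.write_sel = 1 - self.write_sel

buf = PingPongBuffer()
buf.fill([1, 2, 3]); buf.swap()
buf.fill([4, 5, 6])                         # filling bank 1 while...
assert buf.drain() == [1, 2, 3]             # ...bank 0 streams to the PE array
```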
2) Output data caching
The output data cache 12 receives data from the processing unit array 2 or the computing module 3; in the cache architecture's on-chip processing mode it returns the output data to the processing unit array 2 or the computing module 3 for the next batch of processing. The output data cache 12 is formed by several groups of memories in parallel, their number depending on the convolutional neural network to be implemented.
3) Convolution kernel caching
The convolution kernel cache 11 stores the convolution kernels of each layer of the convolutional neural network and outputs data to the processing unit array according to a predefined structure. The convolution kernel cache 11 is formed by several groups of memories in parallel, their number determined by the convolutional neural network to be implemented. Since the convolution kernels of each layer of the convolutional neural network are different, the data in the convolution kernel cache must be re-imported after each batch of operations.
The processing unit array 2 provided by the invention is composed of 20 identical processing units. The processing unit array 2 receives data from the input data cache 10, the output data cache 12 or the convolution kernel cache 11, and outputs data to the output data cache 12. The processing unit array 2 is used for receiving input image data or intermediate temporary data and performing convolution operations with different strides, different convolution kernel sizes and different image sizes. Each processing unit implements the Winograd algorithm with a 5 × 5 window. The Winograd algorithm is expressed as follows:
U = G F Gᵀ (1)
V = Bᵀ In B (2)
Out = Aᵀ [U ⊙ V] A (3)
equation (1) represents the conversion of the convolution kernel: the F matrix is the convolution kernel, the G matrix is a conversion matrix, and the U matrix is the result of the convolution kernel conversion; equation (2) represents the conversion of the input data: the In matrix is the input data, B is a conversion matrix, and V is the result of the input data conversion; equation (3) represents the output data conversion: A is a conversion matrix, ⊙ denotes element-wise multiplication, and Out is the final output result.
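The patent does not reproduce the transform matrices for its 5 × 5 window, so the three-equation flow is illustrated below with the widely published Winograd F(2×2, 3×3) matrices (a minimal numpy sketch under that substitution, not the patent's actual transforms):

```python
import numpy as np

# Transform matrices for Winograd F(2x2, 3x3); the patent's 5x5-window
# matrices are analogous but larger.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    """One 4x4 input tile d and one 3x3 kernel g -> one 2x2 output tile."""
    U = G @ g @ G.T               # equation (1): convolution kernel conversion
    V = B_T @ d @ B_T.T           # equation (2): input data conversion
    return A_T @ (U * V) @ A_T.T  # equation (3): element-wise product, then output conversion

def direct_tile(d, g):
    """Reference: direct sliding-window convolution as used in CNN layers."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out

d, g = np.random.randn(4, 4), np.random.randn(3, 3)
assert np.allclose(winograd_tile(d, g), direct_tile(d, g))
# Here 16 element-wise multiplies replace the 36 of the direct method; with
# the patent's 5x5 window, 25 multiplies replace 81.
```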
The structure of each processing unit is shown in fig. 4, and is composed of five modules, namely an input data conversion module 21, a convolution kernel conversion module 22, a multiplier 23, an output data conversion module 24 and a channel accumulation module 25. Wherein:
1) Input data conversion module
The input data conversion module 21 implements equation (2), converting the data output by the input data cache; after five cycles, a 5 × 5 result matrix is obtained.
2) Convolution kernel conversion module
The convolution kernel conversion module 22 implements equation (1), converting the data output by the convolution kernel cache column by column; after five cycles, a 5 × 5 result matrix is obtained.
3) Multiplier
Each processing unit has 25 multipliers 23. The multipliers 23 take the converted input data and the converted convolution kernel and perform element-wise multiplication of the two 5 × 5 matrices, yielding a 5 × 5 matrix.
4) Output data conversion module
The output data conversion module 24 implements equation (3), converting the 5 × 5 result output by the multipliers 23 column by column to obtain a new 5 × 5 matrix.
5) Channel accumulation module
For each input channel, the output data conversion module 24 generates a 5 × 5 matrix; the channel accumulation module 25 sums the matrices output for all input channels to obtain the final convolution result.
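A minimal sketch of this accumulation step (hypothetical; the function name is illustrative):

```python
import numpy as np

def channel_accumulate(per_channel_tiles):
    """Hypothetical model of channel accumulation module 25: each input
    channel's output data conversion yields one 5x5 matrix; summing across
    channels gives the final single-channel convolution tile."""
    return np.sum(np.stack(per_channel_tiles), axis=0)

tiles = [np.random.randn(5, 5) for _ in range(8)]  # e.g. 8 input channels
assert channel_accumulate(tiles).shape == (5, 5)
```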
The computing module 3 proposed by the present invention comprises a pooling module 31, a data normalization module 32, an activation function module 33, an input selection module 34 and an output selection module 35. The computing module 3 implements the operations of the layers other than the convolutional layers in the neural network; its submodules correspond to the pooling layer, the ReLU activation function layer and the Batch Normalization layer respectively. The computing module 3 is connected to the processing unit array 2 through the input selection module and to the output data cache 12 through the output selection module, and is used for processing the data after convolution. Wherein:
1) Pooling module
The pooling module 31 performs the pooling operation, i.e. taking the maximum or the average of the data in a window.
2) Activation function module
The activation function module 33 performs a ReLU activation function, the expression of which is shown in equation (4):
f(x) = x if x > 0; f(x) = 0 if x ≤ 0 (4)
In equation (4), x represents the input of the ReLU activation function, i.e. the output result of the channel accumulation module 25, and f(x) represents the output of the activation function module 33.
3) Data standardization module
The data normalization module 32 normalizes the output data of each layer and transmits the result to the output data buffer 12.
4) Input selection module and output selection module
The input selection module 34 selects whether the output data of the processing unit array 2 enter the pooling module 31, the activation function module 33 or the data normalization module 32, and the output selection module 35 selects which module's output is transferred to the output data cache 12.
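This select-one-of-three behavior can be sketched as follows (a hypothetical Python model; the operation names "pool", "relu" and "norm" and the parameter defaults are illustrative assumptions):

```python
import numpy as np

def compute_module(x, op, pool=2, eps=1e-5):
    """Hypothetical model of computing module 3: the input selector routes the
    PE-array output x to one of pooling / ReLU / normalization, and the output
    selector returns that single result to the output data cache."""
    if op == "pool":   # max pooling over non-overlapping pool x pool windows
        h, w = x.shape[0] // pool, x.shape[1] // pool
        return x[:h * pool, :w * pool].reshape(h, pool, w, pool).max(axis=(1, 3))
    if op == "relu":   # equation (4): f(x) = x for x > 0, else 0
        return np.maximum(x, 0)
    if op == "norm":   # Batch-Normalization-style standardization
        return (x - x.mean()) / np.sqrt(x.var() + eps)
    raise ValueError(f"unknown operation: {op}")

y = compute_module(np.random.randn(4, 4), "relu")  # route one tile through ReLU
```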
The controller 4 of the present invention sends control signals to the input data cache 10, the output data cache 12, the convolution kernel cache 11, the input selection module 34 and the output selection module 35: it informs the input data cache and the convolution kernel cache when to receive data from external memory and when to transmit data to the processing unit array; informs the output data cache when to accept data from the computing module or the processing unit array; and informs the input and output selection modules which layer of the convolutional neural network to select for operation. In addition, a slave interface connected to an external bus is provided, and the accelerator's internal caches and registers are uniformly addressed.
The performance of the invention was tested as follows:
the evaluation indexes of the convolutional neural network accelerator are mainly resources, speed and power consumption. In the test, a lightweight convolutional neural network ShuffleNet is selected as a target and is mapped to an accelerator. In performance and power consumption tests, input data and a convolution kernel are read into an input data cache and a convolution kernel cache in an accelerator, the time of a final output result is counted, and the speed of the accelerator can be obtained by dividing the time according to the complexity of a ShuffleNet network. The power consumption depends on the implementation platform, and Xilinx Zynq XC7Z102 is selected as the implementation platform. The resources of the accelerator are shown in the following table:
In addition, the speed and power-consumption indexes of the invention are compared with those of the prior art as follows (table not reproduced here):
as can be seen from the above table, in terms of the operating frequency being only 150MHz, but having more practical accelerator calculation speed, the invention achieves 2137.2GOP/S far exceeding the comparison objects of the same type, and simultaneously the power consumption is kept within an acceptable range, and the unit energy consumption speed reaches 82.39GOPS/W, which is also better than that of other comparison objects. The on-chip cache size is only 0.781 MB.
Examples
Regarding the speed index, the advantages of the present invention come from the design of the processing unit array and the cache architecture. First, the processing unit adopts the Winograd convolution acceleration algorithm: for example, for a convolution with 5 × 5 input data, a 3 × 3 convolution kernel and a stride of 1, traditional convolution requires 81 multiplication operations, whereas in the present disclosure each processing unit requires only 25 multiplications, as the sketch below checks. In addition, the processing unit array processes the input channels and output channels of the convolutional network with a certain degree of parallelism, which further increases the speed of the convolution operations. On the other hand, the cache architecture has two working modes; in the on-chip working mode, the data generated by the middle layers of the convolutional neural network need not be stored off chip and can be passed directly to the next layer of the network, because in IoT-oriented convolutional neural network models such as ShuffleNet the middle-layer data volume is not very large and can be kept on chip.
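The multiplication counts can be verified directly (a sketch of the arithmetic only):

```python
# Direct convolution: 5x5 input, 3x3 kernel, stride 1 -> 3x3 output,
# with 9 multiplications per output element.
direct_multiplies = (3 * 3) * (3 * 3)  # 81
# Winograd with a 5x5 window: one element-wise product of two 5x5 matrices.
winograd_multiplies = 5 * 5            # 25
assert (direct_multiplies, winograd_multiplies) == (81, 25)
```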
Regarding the resource and power-consumption indexes: because the Winograd convolution acceleration algorithm is adopted, multiplier resources are greatly saved; each module inside a processing unit consists of simple logic, the granularity of a processing unit is large, and each processing unit can complete the operation of one input channel, so only 20 processing units are used in total, saving a large amount of resources. The accelerator's low operating frequency and small on-chip cache also save a large amount of power.

Claims (6)

1. A dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things, characterized by comprising a cache architecture (1), a processing unit array (2), a computing module (3) and a controller (4); wherein,
the cache architecture (1) is used for storing data from an external memory or data generated in the calculation process, organizing and arranging the data, and transmitting the data to the processing unit array (2) for calculation according to a preset data structure; the processing unit array (2) is used for receiving data from the cache architecture (1), performing convolution operation processing and storing the results in the cache architecture; the computing module (3) is used for receiving data from the processing unit array (2), selectively performing one of three operations (pooling, normalization or an activation function), and storing the output data in the cache architecture (1); the controller (4) is used for sending commands to the cache architecture (1), the processing unit array (2) and the computing module (3), and is designed with an external interface for communicating with an external system.
2. The Internet of things field-oriented dynamic reconfigurable convolutional neural network accelerator architecture as claimed in claim 1, wherein the cache architecture (1) is composed of an input data cache (10), a convolutional kernel cache (11) and an output data cache (12); the output end of the controller (4) is connected with the input end of the input data cache (10), the input end of the convolution kernel cache (11) and the input end of the output data cache (12), the output end of the input data cache (10) and the output end of the convolution kernel cache (11) are connected with the input end of the processing unit array (2), the output end of the processing unit array (2) is connected with the input end of the calculation module (3), and the output end of the calculation module (3) is connected with the input end of the output data cache (12);
the input data cache (10) is used for receiving input image data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array (2) for operation; the convolution kernel cache (11) is used for receiving convolution kernel data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array (2) for operation; the output data cache (12) is used for storing intermediate data generated by the computing module (3) or the processing unit array (2) and transmitting the data to an external system through the controller (4).
3. The architecture of the accelerator of the dynamically reconfigurable convolutional neural network oriented to the field of internet of things as claimed in claim 2, wherein the processing unit array is composed of 20 processing units, and each processing unit is composed of an input data conversion module (20), a convolutional kernel conversion module (21), a multiplier (22), an output data conversion module (23) and a channel accumulation module (24);
the input data conversion module (20) is used for simultaneously converting the multi-channel input image data from the input data cache (10) on a plurality of processing units; the convolution kernel conversion module (21) is used for simultaneously converting the multi-channel convolution kernel data from the convolution kernel buffer (11) on a plurality of processing units; the multiplier (22) is used for multiplying the output data from the input data conversion module (20) and the convolution kernel conversion module (21); the output data conversion module (23) is used for converting the output result of the multiplier (22); the channel accumulation module (24) is used for summing the multi-channel data to obtain data of one channel.
4. The accelerator architecture of the dynamically reconfigurable convolutional neural network oriented to the field of internet of things as claimed in claim 3, wherein the processing unit array (2) receives data from an input data buffer (10), an output data buffer (12) or a convolutional kernel buffer (11), and outputs the data to the output data buffer (12); each processing unit implements the Winograd algorithm with a window of 5 × 5, wherein the formula of the Winograd algorithm is as follows:
U = G F Gᵀ (1)
V = Bᵀ In B (2)
Out = Aᵀ [U ⊙ V] A (3)
equation (1) represents the conversion of the convolution kernel: the F matrix is the convolution kernel, the G matrix is a conversion matrix, and the U matrix is the result of the convolution kernel conversion; equation (2) represents the conversion of the input data: the In matrix is the input data, B is a conversion matrix, and V is the result of the input data conversion; equation (3) represents the output data conversion: A is a conversion matrix, ⊙ denotes element-wise multiplication, and Out is the final output result.
5. The architecture of the accelerator of the dynamically reconfigurable convolutional neural network oriented to the field of internet of things as claimed in claim 3, wherein the computing module (3) is composed of a pooling module (31), an activation function module (32), a data normalization module (33), an input selection module (34) and an output selection module (35);
the output end of the controller (4) is connected with the input end of the input selection module (34) and the input end of the output selection module (35), the output end of the processing unit array (2) is connected with the input end of the input selection module (34), the output end of the input selection module (34) is connected with the input end of the pooling module (31), the input end of the activation function module (32) and the input end of the data normalization module (33), the output end of the pooling module (31), the output end of the activation function module (32) and the output end of the data normalization module (33) are connected with the input end of the output selection module (35), and the output end of the output selection module (35) is connected with the input end of the output data cache (12);
the input selection module (34) is used for selecting whether data from the processing unit array are pooled, normalized or passed through the activation function; the pooling module (31) is used for realizing the pooling operation; the data normalization module (33) is used for realizing the normalization operation; the activation function module (32) is used for realizing the ReLU activation function operation; the output selection module (35) is used for selecting the output of one of the pooling module (31), the data normalization module (33) or the activation function module (32) as the result to be output into the cache architecture (1).
6. The architecture of claim 5, wherein the activation function module (32) executes a ReLU activation function, whose expression is shown in equation (4):
f(x) = x if x > 0; f(x) = 0 if x ≤ 0 (4)
wherein x in equation (4) represents the input of the ReLU activation function, namely the output result of the channel accumulation module (24), and f(x) represents the output of the activation function module (32).
CN201811149741.0A 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things Active CN109447241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811149741.0A CN109447241B (en) 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811149741.0A CN109447241B (en) 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things

Publications (2)

Publication Number Publication Date
CN109447241A true CN109447241A (en) 2019-03-08
CN109447241B CN109447241B (en) 2022-02-22

Family

ID=65546038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811149741.0A Active CN109447241B (en) 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things

Country Status (1)

Country Link
CN (1) CN109447241B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110225067A (en) * 2019-07-24 2019-09-10 上海戎磐网络科技有限公司 A kind of Internet of Things safety pre-warning system
CN110276444A (en) * 2019-06-04 2019-09-24 北京清微智能科技有限公司 Image processing method and device based on convolutional neural networks
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN112199036A (en) * 2019-07-08 2021-01-08 爱思开海力士有限公司 Data storage device, data processing system and acceleration device thereof
WO2021082746A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Operation apparatus and related product
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
WO2021217502A1 (en) * 2020-04-27 2021-11-04 西安交通大学 Computing architecture
CN114064331A (en) * 2020-07-29 2022-02-18 中国科学院深圳先进技术研究院 Fault-tolerant computing method, fault-tolerant computing device, storage medium, and computer apparatus
CN115329951A (en) * 2022-09-13 2022-11-11 北京工商大学 FPGA (field programmable Gate array) framework for fast convolution operation of convolution neural network
US11876514B2 (en) 2021-04-29 2024-01-16 Nxp Usa, Inc Optocoupler circuit with level shifter
US11995442B2 (en) 2021-04-23 2024-05-28 Nxp B.V. Processor having a register file, processing unit, and instruction sequencer, and operable with an instruction set having variable length instructions and a table that maps opcodes to register file addresses

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203621A (en) * 2016-07-11 2016-12-07 姚颂 The processor calculated for convolutional neural networks
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108229656A (en) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 Neural network computing device and method
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203621A (en) * 2016-07-11 2016-12-07 姚颂 The processor calculated for convolutional neural networks
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN108229656A (en) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 Neural network computing device and method
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bao Xianliang: "Design and Implementation of a High-Performance CNN-Dedicated Convolution Accelerator", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276444A (en) * 2019-06-04 2019-09-24 北京清微智能科技有限公司 Image processing method and device based on convolutional neural networks
CN110288086B (en) * 2019-06-13 2023-07-21 天津大学 Winograd-based configurable convolution array accelerator structure
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd
CN112199036A (en) * 2019-07-08 2021-01-08 爱思开海力士有限公司 Data storage device, data processing system and acceleration device thereof
CN112199036B (en) * 2019-07-08 2024-02-27 爱思开海力士有限公司 Data storage device, data processing system and acceleration device thereof
CN110225067B (en) * 2019-07-24 2021-08-24 上海戎磐网络科技有限公司 Internet of things safety early warning system
CN110225067A (en) * 2019-07-24 2019-09-10 上海戎磐网络科技有限公司 A kind of Internet of Things safety pre-warning system
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
WO2021082746A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Operation apparatus and related product
WO2021217502A1 (en) * 2020-04-27 2021-11-04 西安交通大学 Computing architecture
US11886347B2 (en) 2020-04-27 2024-01-30 Xi'an Jiaotong University Large-scale data processing computer architecture
CN114064331A (en) * 2020-07-29 2022-02-18 中国科学院深圳先进技术研究院 Fault-tolerant computing method, fault-tolerant computing device, storage medium, and computer apparatus
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
US11995442B2 (en) 2021-04-23 2024-05-28 Nxp B.V. Processor having a register file, processing unit, and instruction sequencer, and operable with an instruction set having variable length instructions and a table that maps opcodes to register file addresses
US11876514B2 (en) 2021-04-29 2024-01-16 Nxp Usa, Inc Optocoupler circuit with level shifter
CN115329951A (en) * 2022-09-13 2022-11-11 北京工商大学 FPGA (field programmable Gate array) framework for fast convolution operation of convolution neural network
CN115329951B (en) * 2022-09-13 2023-09-15 北京工商大学 FPGA architecture for convolutional neural network fast convolutional operation

Also Published As

Publication number Publication date
CN109447241B (en) 2022-02-22

Similar Documents

Publication Publication Date Title
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN110852428B (en) Neural network acceleration method and accelerator based on FPGA
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN110390383A (en) A kind of deep neural network hardware accelerator based on power exponent quantization
CN111445012A (en) FPGA-based packet convolution hardware accelerator and method thereof
CN112418396B (en) Sparse activation perception type neural network accelerator based on FPGA
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN111860773B (en) Processing apparatus and method for information processing
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN115329260A (en) Transformer accelerator based on offset diagonal matrix
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant