CN109447241A - Dynamic reconfigurable convolutional neural network accelerator architecture for the field of the Internet of Things - Google Patents

Dynamic reconfigurable convolutional neural network accelerator architecture for the field of the Internet of Things

Info

Publication number
CN109447241A
CN109447241A (application CN201811149741.0A; granted as CN109447241B)
Authority
CN
China
Prior art keywords
data
module
input
output
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811149741.0A
Other languages
Chinese (zh)
Other versions
CN109447241B (en)
Inventor
杨晨
王逸洲
王小力
耿莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201811149741.0A priority Critical patent/CN109447241B/en
Publication of CN109447241A publication Critical patent/CN109447241A/en
Application granted granted Critical
Publication of CN109447241B publication Critical patent/CN109447241B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention discloses a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things, comprising a cache architecture, a processing unit array, a computing module and a controller. The cache architecture is used to store data from external memory or data generated during computation; after organizing and arranging the data, it transmits them to the processing unit array in a preset data structure for computation. The processing unit array receives data from the cache architecture, performs convolution operations, and stores the results back in the cache architecture. The computing module receives data from the processing unit array, selectively performs one of three operations (pooling, normalization or an activation function), and stores the output data in the cache architecture. The controller sends commands to the cache architecture, the processing unit array and the computing module, and is designed with an external interface for communicating with an external system. By designing a processing unit array with high parallelism and high utilization and a cache architecture that raises the data reuse rate, the present invention improves the performance of the convolutional neural network accelerator and reduces power consumption.

Description

Dynamic reconfigurable convolutional neural network accelerator architecture for the field of the Internet of Things
Technical Field
The invention belongs to the field of neural network accelerators, and particularly relates to a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things.
Background
Artificial intelligence is currently one of the most active areas of computer science, and deep learning has developed rapidly as the main route to realizing it. The computational complexity of a model grows exponentially with the number of network layers and the number of neurons per layer, so the training and inference speed of deep learning algorithms depends more and more on large-scale computing platforms such as cloud computing. For hardware acceleration of deep learning algorithms there are currently three common implementation approaches, namely multi-core CPUs, GPUs and FPGAs, whose common characteristic is that they can realize highly parallel computation. However, the existing hardware implementations suffer from high power consumption and low energy efficiency (performance per watt) and cannot be applied to smart mobile terminals such as smartphones, wearable devices or autonomous vehicles. Against this background, the reconfigurable processor has proved to be a parallel computing architecture with high flexibility and high energy efficiency: an appropriate resource configuration strategy can be selected for different model sizes, which widens the application range of a special-purpose processor and improves processing performance. It is one of the solutions to the limitations that constrain the further development of multi-core CPU and FPGA technology, and is likely to become one of the approaches to realizing highly efficient deep learning SoCs in the future.
A convolutional neural network accelerator must first satisfy the requirements of reconfigurability and configurability, supporting the continuous evolution of network structures at the algorithm level and serving rich and diverse application scenarios. Second, it must satisfy the requirements of high performance and low energy consumption, which demands overcoming the limitation of storage bandwidth and making full use of hardware resources.
Disclosure of Invention
The invention aims to provide a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things. By designing a processing unit array with high parallelism and high utilization and a cache architecture capable of raising the data reuse rate, it improves the performance of the convolutional neural network accelerator and reduces power consumption, while offering a degree of configurability that makes it suitable for a variety of application scenarios.
The invention is realized by adopting the following technical scheme:
a dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things comprises a cache architecture, a processing unit array, a computing module and a controller; wherein,
the cache architecture is used for storing data from an external memory or data generated in the calculation process, organizing and arranging the data, and transmitting the data to the processing unit array for calculation according to a preset data structure; the processing unit array is used for receiving data from the cache architecture, performing convolution operation processing and storing the results in the cache architecture; the computing module is used for receiving data from the processing unit array, selectively performing one of three operations (pooling, normalization or an activation function), and storing the output data in the cache architecture; the controller is used for sending commands to the cache architecture, the processing unit array and the computing module, and is designed with an external interface for communicating with an external system.
The invention has the further improvement that the cache architecture consists of an input data cache, a convolution kernel cache and an output data cache; the output end of the controller is connected with the input end of the input data cache, the input end of the convolution kernel cache and the input end of the output data cache, the output end of the input data cache and the output end of the convolution kernel cache are connected with the input end of the processing unit array, the output end of the processing unit array is connected with the input end of the calculation module, and the output end of the calculation module is connected with the input end of the output data cache;
the input data cache is used for receiving input image data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array for operation; the convolution kernel cache is used for receiving convolution kernel data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array for operation; the output data cache is used for storing intermediate data generated by the computing module or the processing unit array and transmitting the data to an external system through the controller.
The invention has the further improvement that the processing unit array consists of 20 processing units, and each processing unit consists of an input data conversion module, a convolution kernel conversion module, a multiplier, an output data conversion module and a channel accumulation module;
the input data conversion module is used for simultaneously converting the multi-channel input image data from the input data cache on a plurality of processing units; the convolution kernel conversion module is used for simultaneously converting the multi-channel convolution kernel data from the convolution kernel cache on a plurality of processing units; the multiplier is used for multiplying the output data from the input data conversion module and the convolution kernel conversion module; the output data conversion module is used for converting the output result of the multiplier; the channel accumulation module is used for summing the multi-channel data to obtain data of one channel.
The invention is further improved in that the processing unit array is used for receiving data from an input data cache, an output data cache or a convolution kernel cache, and outputting the data to the output data cache; each processing unit implements the Winograd algorithm with a window of 5 × 5, wherein the formula of the Winograd algorithm is as follows:
U = G F Gᵀ (1)
V = Bᵀ In B (2)
Out = Aᵀ [U ⊙ V] A (3)
equation (1) represents the conversion of the convolution kernel: the F matrix is the convolution kernel, the G matrix is a conversion matrix, and the U matrix is the result of the convolution kernel conversion; equation (2) represents the conversion of the input data: the In matrix is the input data, B is a conversion matrix, and V is the result of the input data conversion; equation (3) represents the output data conversion: A is a conversion matrix, ⊙ denotes element-wise multiplication, and Out is the final output result.
The invention has the further improvement that the calculation module consists of a pooling module, an activation function module, a data normalization module, an input selection module and an output selection module;
the output end of the controller is connected with the input end of the input selection module and the input end of the output selection module, the output end of the processing unit array is connected with the input end of the input selection module, the output end of the input selection module is connected with the input end of the pooling module, the input end of the activation function module and the input end of the data normalization module, the output end of the pooling module, the output end of the activation function module and the output end of the data normalization module are connected with the input end of the output selection module, and the output end of the output selection module is connected with the input end of the output data cache;
the input selection module is used for selecting whether data from the processing unit array are pooled, normalized or passed through the activation function; the pooling module is used for realizing the pooling operation; the data normalization module is used for realizing the normalization operation; the activation function module is used for realizing the ReLU activation function operation; the output selection module is used for selecting the output of one of the pooling module, the data normalization module or the activation function module as the result to be output to the cache architecture.
In a further development of the invention, the activation function module executes a ReLU activation function, whose expression is shown in equation (4):
f(x) = x if x > 0; f(x) = 0 if x ≤ 0 (4)
wherein x in equation (4) represents the input of the ReLU activation function, i.e. the output result of the channel accumulation module, and f(x) represents the output of the activation function module.
The invention has the following beneficial technical effects:
1. The accelerator architecture is oriented to lightweight networks such as ShuffleNet, which offer relatively high accuracy with simpler network structures and fewer network parameters.
2. The accelerator adopts the Winograd algorithm to accelerate convolution operations, which reduces the number of multiplications, increases the accelerator's speed and improves system energy efficiency.
3. The accelerator supports convolution, pooling, activation function, normalization and fully connected operations. It can complete these operations in one pass, reducing external memory accesses.
4. The accelerator is fully configurable, including the numbers of input and output channels, the input image size, the convolution kernel size and the convolution stride.
5. The accelerator has three on-chip caches. The cache architecture introduces an operation that exchanges the input and output data caches, and reuses input image data (rows and columns) and intermediate data, greatly reducing external memory reads and on-chip cache reads and improving energy efficiency.
6. The accelerator offers high parallelism at several levels: across input and output channels, within the input image, and within the convolution window; the degree of parallelism can be adjusted to the available bandwidth.
7. The accelerator can be integrated on a general-purpose SoC platform, where various network structures can be configured conveniently through a software platform.
In summary, the invention is directed at the field of Internet of Things mobile terminals. By designing a processing unit array with high parallelism and high utilization and an appropriate cache architecture, it meets the requirements of low power consumption and high performance while providing a degree of configurability, and is suitable for a variety of lightweight convolutional neural networks and some deep neural networks.
Drawings
Fig. 1 is a schematic diagram of a dynamic reconfigurable convolutional neural network processor architecture oriented to the field of internet of things.
Fig. 2 is a schematic diagram of the off-chip operation mode of the input data cache.
Fig. 3 is a schematic diagram of the on-chip operation mode of the output data cache.
Fig. 4 is a schematic diagram of a processing unit structure.
Description of reference numerals:
1 is a cache architecture, 10 is an input data cache, 11 is a convolution kernel cache, and 12 is an output data cache;
2 is a processing unit array, 21 is an input data conversion module, 22 is a convolution kernel conversion module, 23 is a multiplier, 24 is an output data conversion module, and 25 is a channel accumulation module;
3 is a calculation module, 31 is a pooling module, 32 is a data standardization module, 33 is an activation function module, 34 is an input selection module, and 35 is an output selection module;
4 is a controller.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the dynamic reconfigurable convolutional neural network accelerator architecture for the field of internet of things provided by the invention is composed of four parts, namely a cache architecture 1, a processing unit array 2, a computing module 3 and a controller 4.
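The cooperation of the four parts can be sketched at the top level as follows (a hypothetical Python model for illustration only; the class name Accelerator and its methods are illustrative and not from the patent):

```python
import numpy as np

class Accelerator:
    """Hypothetical top-level model of the four parts."""
    def __init__(self, pe_array, compute_module):
        self.cache = {}                      # stands in for caches 10, 11 and 12
        self.pe_array = pe_array             # part 2: convolution
        self.compute = compute_module        # part 3: pooling / ReLU / normalization

    def run_layer(self, op):                 # part 4: the controller sequences this
        conv = self.pe_array(self.cache["in"], self.cache["kernels"])
        self.cache["out"] = self.compute(conv, op)

# Toy usage with stand-in computations:
acc = Accelerator(pe_array=lambda x, k: x * k.sum(),
                  compute_module=lambda x, op: np.maximum(x, 0) if op == "relu" else x)
acc.cache.update({"in": np.random.randn(4, 4), "kernels": np.ones((3, 3))})
acc.run_layer("relu")
```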
The cache architecture 1 provided by the invention is composed of an input data cache 10, an output data cache 12 and a convolution kernel cache 11. The cache architecture 1 serves to store input data, temporary intermediate data, output data and the like; it increases the reuse rate of data and reduces the high energy consumption and high latency caused by external memory accesses. The sizes of the input data cache, the output data cache and the convolution kernel cache are determined by the size of the convolutional neural network to be implemented. The cache architecture as a whole has two working modes:
1) Off-chip data mode of operation
This mode applies to the first layer of the convolutional neural network: input data are stored in the input data cache, output data are stored in the output data cache, and the convolution kernels are stored in the convolution kernel cache. The specific operation mode is shown in fig. 2.
2) On-chip data mode of operation
This mode applies to the middle-layer operations of the convolutional neural network: input data come from the output data cache, the convolution kernels come from the convolution kernel cache, and the output temporary data overwrite the old data in the output data cache. The temporary data therefore never need to be stored off chip during these operations, which reduces external memory accesses. The precondition is that the temporary data are of a suitable size; for lightweight networks such as ShuffleNet the temporary data are small, so this mode is feasible. The specific operation mode is shown in fig. 3.
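A minimal scheduling sketch of the two modes (a hypothetical Python model; function and parameter names such as run_network and external_read are illustrative, not from the patent):

```python
import numpy as np

def run_network(layers, external_read, external_write):
    """Hypothetical model of the two cache working modes: layer 0 streams its
    input from external memory (off-chip mode); every middle layer reads its
    input directly from the output data cache and overwrites the old contents
    (on-chip mode), so no intermediate tensor leaves the chip."""
    output_cache = None
    for i, layer in enumerate(layers):
        x = external_read() if i == 0 else output_cache  # off-chip vs on-chip source
        output_cache = layer(x)                          # new results overwrite old data
    external_write(output_cache)                         # only the final result goes off-chip

# Toy usage with stand-in "layers":
layers = [lambda x: x * 2, lambda x: x + 1, lambda x: x.clip(min=0)]
run_network(layers, external_read=lambda: np.ones((4, 4)),
            external_write=lambda y: print(y.shape))
```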
The cache architecture 1 provided by the invention is composed of the following modules, and the specific functions and implementation are as follows:
1) Input data caching
The input data cache 10 receives data, generally image data, from external memory and outputs them to the processing unit array 2 according to a predefined structure. Its function is to convert large blocks of image data into small blocks that are easy to process, and to improve data reuse by overlapping the "row and column" data between blocks, as shown in fig. 2. The input data cache 10 is composed of two memories that accept data in a "ping-pong" manner: while one memory receives data, the other outputs data, and vice versa.
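The "ping-pong" scheme can be modeled as follows (a hypothetical sketch; the class and method names are illustrative):

```python
class PingPongBuffer:
    """Hypothetical model of the input data cache 10: two memories swap roles
    each round, so one can be filled from external memory while the other
    feeds the processing unit array."""
    def __init__(self):
        self.banks = [[], []]
        self.write_sel = 0                  # bank currently being filled

    def fill(self, block):                  # receive one data block
        self.banks[self.write_sel] = block

    def drain(self):                        # output the other bank to the PE array
        return self.banks[1 - self.write_sel]

    def swap(self):                         # exchange roles after each round
        self.write_sel = 1 - self.write_sel

buf = PingPongBuffer()
buf.fill([1, 2, 3]); buf.swap()
buf.fill([4, 5, 6])                         # filling bank 1 while...
assert buf.drain() == [1, 2, 3]             # ...bank 0 streams to the PE array
```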
2) Output data caching
The output data cache 12 receives data from the processing unit array 2 or the computing module 3; in the cache architecture's on-chip processing mode it returns the output data to the processing unit array 2 or the computing module 3 for the next batch of processing. The output data cache 12 is formed by several groups of memories in parallel, their number depending on the convolutional neural network to be implemented.
3) Convolution kernel caching
The convolution kernel cache 11 stores the convolution kernels of each layer of the convolutional neural network and outputs data to the processing unit array according to a predefined structure. The convolution kernel cache 11 is formed by several groups of memories in parallel, their number determined by the convolutional neural network to be implemented. Since the convolution kernels of each layer of the convolutional neural network are different, the data in the convolution kernel cache must be re-imported after each batch of operations.
The processing unit array 2 provided by the invention is composed of 20 identical processing units. The processing unit array 2 receives data from the input data cache 10, the output data cache 12 or the convolution kernel cache 11, and outputs data to the output data cache 12. The processing unit array 2 is used for receiving input image data or intermediate temporary data and performing convolution operations with different strides, different convolution kernel sizes and different image sizes. Each processing unit implements the Winograd algorithm with a 5 × 5 window. The Winograd algorithm is expressed as follows:
U = G F Gᵀ (1)
V = Bᵀ In B (2)
Out = Aᵀ [U ⊙ V] A (3)
equation (1) represents the conversion of the convolution kernel: the F matrix is the convolution kernel, the G matrix is a conversion matrix, and the U matrix is the result of the convolution kernel conversion; equation (2) represents the conversion of the input data: the In matrix is the input data, B is a conversion matrix, and V is the result of the input data conversion; equation (3) represents the output data conversion: A is a conversion matrix, ⊙ denotes element-wise multiplication, and Out is the final output result.
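The patent does not reproduce the transform matrices for its 5 × 5 window, so the three-equation flow is illustrated below with the widely published Winograd F(2×2, 3×3) matrices (a minimal numpy sketch under that substitution, not the patent's actual transforms):

```python
import numpy as np

# Transform matrices for Winograd F(2x2, 3x3); the patent's 5x5-window
# matrices are analogous but larger.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    """One 4x4 input tile d and one 3x3 kernel g -> one 2x2 output tile."""
    U = G @ g @ G.T               # equation (1): convolution kernel conversion
    V = B_T @ d @ B_T.T           # equation (2): input data conversion
    return A_T @ (U * V) @ A_T.T  # equation (3): element-wise product, then output conversion

def direct_tile(d, g):
    """Reference: direct sliding-window convolution as used in CNN layers."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out

d, g = np.random.randn(4, 4), np.random.randn(3, 3)
assert np.allclose(winograd_tile(d, g), direct_tile(d, g))
# Here 16 element-wise multiplies replace the 36 of the direct method; with
# the patent's 5x5 window, 25 multiplies replace 81.
```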
The structure of each processing unit is shown in fig. 4, and is composed of five modules, namely an input data conversion module 21, a convolution kernel conversion module 22, a multiplier 23, an output data conversion module 24 and a channel accumulation module 25. Wherein:
1) Input data conversion module
The input data conversion module 21 implements equation (2), converting the data output by the input data cache; after five cycles, a 5 × 5 result matrix is obtained.
2) Convolution kernel conversion module
The convolution kernel conversion module 22 implements equation (1), converting the data output by the convolution kernel cache column by column; after five cycles, a 5 × 5 result matrix is obtained.
3) Multiplier
Each processing unit has 25 multipliers 23. The multipliers 23 take the converted input data and the converted convolution kernel and perform element-wise multiplication of the two 5 × 5 matrices, yielding a 5 × 5 matrix.
4) Output data conversion module
The output data conversion module 24 implements equation (3), converting the 5 × 5 result output by the multipliers 23 column by column to obtain a new 5 × 5 matrix.
5) Channel accumulation module
For each input channel, the output data conversion module 24 generates a 5 × 5 matrix; the channel accumulation module 25 sums the matrices output for all input channels to obtain the final convolution result.
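A minimal sketch of this accumulation step (hypothetical; the function name is illustrative):

```python
import numpy as np

def channel_accumulate(per_channel_tiles):
    """Hypothetical model of channel accumulation module 25: each input
    channel's output data conversion yields one 5x5 matrix; summing across
    channels gives the final single-channel convolution tile."""
    return np.sum(np.stack(per_channel_tiles), axis=0)

tiles = [np.random.randn(5, 5) for _ in range(8)]  # e.g. 8 input channels
assert channel_accumulate(tiles).shape == (5, 5)
```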
The computing module 3 proposed by the present invention comprises a pooling module 31, a data normalization module 32, an activation function module 33, an input selection module 34 and an output selection module 35. The computing module 3 implements the operations of the layers other than the convolutional layers in the neural network; its submodules correspond to the pooling layer, the ReLU activation function layer and the Batch Normalization layer respectively. The computing module 3 is connected to the processing unit array 2 through the input selection module and to the output data cache 12 through the output selection module, and is used for processing the data after convolution. Wherein:
1) Pooling module
The pooling module 31 performs the pooling operation, i.e. taking the maximum or the average of the data in a window.
2) Activation function module
The activation function module 33 performs a ReLU activation function, the expression of which is shown in equation (4):
f(x) = x if x > 0; f(x) = 0 if x ≤ 0 (4)
In equation (4), x represents the input of the ReLU activation function, i.e. the output result of the channel accumulation module 25, and f(x) represents the output of the activation function module 33.
3) Data standardization module
The data normalization module 32 normalizes the output data of each layer and transmits the result to the output data buffer 12.
4) Input selection module and output selection module
The input selection module 34 selects whether the output data of the processing unit array 2 enter the pooling module 31, the activation function module 33 or the data normalization module 32, and the output selection module 35 selects which module's output is transferred to the output data cache 12.
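This select-one-of-three behavior can be sketched as follows (a hypothetical Python model; the operation names "pool", "relu" and "norm" and the parameter defaults are illustrative assumptions):

```python
import numpy as np

def compute_module(x, op, pool=2, eps=1e-5):
    """Hypothetical model of computing module 3: the input selector routes the
    PE-array output x to one of pooling / ReLU / normalization, and the output
    selector returns that single result to the output data cache."""
    if op == "pool":   # max pooling over non-overlapping pool x pool windows
        h, w = x.shape[0] // pool, x.shape[1] // pool
        return x[:h * pool, :w * pool].reshape(h, pool, w, pool).max(axis=(1, 3))
    if op == "relu":   # equation (4): f(x) = x for x > 0, else 0
        return np.maximum(x, 0)
    if op == "norm":   # Batch-Normalization-style standardization
        return (x - x.mean()) / np.sqrt(x.var() + eps)
    raise ValueError(f"unknown operation: {op}")

y = compute_module(np.random.randn(4, 4), "relu")  # route one tile through ReLU
```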
The controller 4 of the present invention sends control signals to the input data cache 10, the output data cache 12, the convolution kernel cache 11, the input selection module 34 and the output selection module 35: it informs the input data cache and the convolution kernel cache when to receive data from external memory and when to transmit data to the processing unit array; informs the output data cache when to accept data from the computing module or the processing unit array; and informs the input and output selection modules which layer of the convolutional neural network to select for operation. In addition, a slave interface connected to an external bus is provided, and the accelerator's internal caches and registers are uniformly addressed.
The performance of the invention was tested as follows:
the evaluation indexes of the convolutional neural network accelerator are mainly resources, speed and power consumption. In the test, a lightweight convolutional neural network ShuffleNet is selected as a target and is mapped to an accelerator. In performance and power consumption tests, input data and a convolution kernel are read into an input data cache and a convolution kernel cache in an accelerator, the time of a final output result is counted, and the speed of the accelerator can be obtained by dividing the time according to the complexity of a ShuffleNet network. The power consumption depends on the implementation platform, and Xilinx Zynq XC7Z102 is selected as the implementation platform. The resources of the accelerator are shown in the following table:
In addition, the speed and power-consumption indexes of the invention are compared with those of the prior art as follows (table not reproduced here):
as can be seen from the above table, in terms of the operating frequency being only 150MHz, but having more practical accelerator calculation speed, the invention achieves 2137.2GOP/S far exceeding the comparison objects of the same type, and simultaneously the power consumption is kept within an acceptable range, and the unit energy consumption speed reaches 82.39GOPS/W, which is also better than that of other comparison objects. The on-chip cache size is only 0.781 MB.
Examples
Regarding the speed index, the advantages of the present invention come from the design of the processing unit array and the cache architecture. First, the processing unit adopts the Winograd convolution acceleration algorithm: for example, for a convolution with 5 × 5 input data, a 3 × 3 convolution kernel and a stride of 1, traditional convolution requires 81 multiplication operations, whereas in the present disclosure each processing unit requires only 25 multiplications, as the sketch below checks. In addition, the processing unit array processes the input channels and output channels of the convolutional network with a certain degree of parallelism, which further increases the speed of the convolution operations. On the other hand, the cache architecture has two working modes; in the on-chip working mode, the data generated by the middle layers of the convolutional neural network need not be stored off chip and can be passed directly to the next layer of the network, because in IoT-oriented convolutional neural network models such as ShuffleNet the middle-layer data volume is not very large and can be kept on chip.
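The multiplication counts can be verified directly (a sketch of the arithmetic only):

```python
# Direct convolution: 5x5 input, 3x3 kernel, stride 1 -> 3x3 output,
# with 9 multiplications per output element.
direct_multiplies = (3 * 3) * (3 * 3)  # 81
# Winograd with a 5x5 window: one element-wise product of two 5x5 matrices.
winograd_multiplies = 5 * 5            # 25
assert (direct_multiplies, winograd_multiplies) == (81, 25)
```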
Regarding the resource and power-consumption indexes: because the Winograd convolution acceleration algorithm is adopted, multiplier resources are greatly saved; each module inside a processing unit consists of simple logic, the granularity of a processing unit is large, and each processing unit can complete the operation of one input channel, so only 20 processing units are used in total, saving a large amount of resources. The accelerator's low operating frequency and small on-chip cache also save a large amount of power.

Claims (6)

1. A dynamically reconfigurable convolutional neural network accelerator architecture oriented to the field of the Internet of Things, characterized by comprising a cache architecture (1), a processing unit array (2), a computing module (3) and a controller (4); wherein,
the cache architecture (1) is used for storing data from an external memory or data generated in the calculation process, organizing and arranging the data, and transmitting the data to the processing unit array (2) for calculation according to a preset data structure; the processing unit array (2) is used for receiving data from the cache architecture (1), performing convolution operation processing and storing the results in the cache architecture; the computing module (3) is used for receiving data from the processing unit array (2), selectively performing one of three operations (pooling, normalization or an activation function), and storing the output data in the cache architecture (1); the controller (4) is used for sending commands to the cache architecture (1), the processing unit array (2) and the computing module (3), and is designed with an external interface for communicating with an external system.
2. The Internet of things field-oriented dynamic reconfigurable convolutional neural network accelerator architecture as claimed in claim 1, wherein the cache architecture (1) is composed of an input data cache (10), a convolutional kernel cache (11) and an output data cache (12); the output end of the controller (4) is connected with the input end of the input data cache (10), the input end of the convolution kernel cache (11) and the input end of the output data cache (12), the output end of the input data cache (10) and the output end of the convolution kernel cache (11) are connected with the input end of the processing unit array (2), the output end of the processing unit array (2) is connected with the input end of the calculation module (3), and the output end of the calculation module (3) is connected with the input end of the output data cache (12);
the input data cache (10) is used for receiving input image data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array (2) for operation; the convolution kernel cache (11) is used for receiving convolution kernel data of multiple input channels and simultaneously transmitting the data of the multiple input channels to the processing unit array (2) for operation; the output data cache (12) is used for storing intermediate data generated by the computing module (3) or the processing unit array (2) and transmitting the data to an external system through the controller (4).
3. The architecture of the accelerator of the dynamically reconfigurable convolutional neural network oriented to the field of internet of things as claimed in claim 2, wherein the processing unit array is composed of 20 processing units, and each processing unit is composed of an input data conversion module (20), a convolutional kernel conversion module (21), a multiplier (22), an output data conversion module (23) and a channel accumulation module (24);
the input data conversion module (20) is used for simultaneously converting the multi-channel input image data from the input data cache (10) on a plurality of processing units; the convolution kernel conversion module (21) is used for simultaneously converting the multi-channel convolution kernel data from the convolution kernel buffer (11) on a plurality of processing units; the multiplier (22) is used for multiplying the output data from the input data conversion module (20) and the convolution kernel conversion module (21); the output data conversion module (23) is used for converting the output result of the multiplier (22); the channel accumulation module (24) is used for summing the multi-channel data to obtain data of one channel.
4. The accelerator architecture of the dynamically reconfigurable convolutional neural network oriented to the field of internet of things as claimed in claim 3, wherein the processing unit array (2) receives data from an input data buffer (10), an output data buffer (12) or a convolutional kernel buffer (11), and outputs the data to the output data buffer (12); each processing unit implements the Winograd algorithm with a window of 5 × 5, wherein the formula of the Winograd algorithm is as follows:
U = G F Gᵀ (1)
V = Bᵀ In B (2)
Out = Aᵀ [U ⊙ V] A (3)
equation (1) represents the conversion of the convolution kernel: the F matrix is the convolution kernel, the G matrix is a conversion matrix, and the U matrix is the result of the convolution kernel conversion; equation (2) represents the conversion of the input data: the In matrix is the input data, B is a conversion matrix, and V is the result of the input data conversion; equation (3) represents the output data conversion: A is a conversion matrix, ⊙ denotes element-wise multiplication, and Out is the final output result.
5. The architecture of the accelerator of the dynamically reconfigurable convolutional neural network oriented to the field of internet of things as claimed in claim 3, wherein the computing module (3) is composed of a pooling module (31), an activation function module (32), a data normalization module (33), an input selection module (34) and an output selection module (35);
the output end of the controller (4) is connected with the input end of the input selection module (34) and the input end of the output selection module (35), the output end of the processing unit array (2) is connected with the input end of the input selection module (34), the output end of the input selection module (34) is connected with the input end of the pooling module (31), the input end of the activation function module (32) and the input end of the data normalization module (33), the output end of the pooling module (31), the output end of the activation function module (32) and the output end of the data normalization module (33) are connected with the input end of the output selection module (35), and the output end of the output selection module (35) is connected with the input end of the output data cache (12);
the input selection module (34) is used for selecting whether data from the processing unit array are pooled, normalized or passed through the activation function; the pooling module (31) is used for realizing the pooling operation; the data normalization module (33) is used for realizing the normalization operation; the activation function module (32) is used for realizing the ReLU activation function operation; the output selection module (35) is used for selecting the output of one of the pooling module (31), the data normalization module (33) or the activation function module (32) as the result to be output into the cache architecture (1).
6. The architecture of claim 5, wherein the activation function module (32) executes a ReLU activation function, whose expression is shown in equation (4):
f(x) = x if x > 0; f(x) = 0 if x ≤ 0 (4)
wherein x in equation (4) represents the input of the ReLU activation function, namely the output result of the channel accumulation module (24), and f(x) represents the output of the activation function module (32).
CN201811149741.0A 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things Active CN109447241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811149741.0A CN109447241B (en) 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811149741.0A CN109447241B (en) 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things

Publications (2)

Publication Number Publication Date
CN109447241A true CN109447241A (en) 2019-03-08
CN109447241B CN109447241B (en) 2022-02-22

Family

ID=65546038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811149741.0A Active CN109447241B (en) 2018-09-29 2018-09-29 Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things

Country Status (1)

Country Link
CN (1) CN109447241B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110225067A (en) * 2019-07-24 2019-09-10 上海戎磐网络科技有限公司 A kind of Internet of Things safety pre-warning system
CN110276444A (en) * 2019-06-04 2019-09-24 北京清微智能科技有限公司 Image processing method and device based on convolutional neural networks
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN112199036A (en) * 2019-07-08 2021-01-08 爱思开海力士有限公司 Data storage device, data processing system and acceleration device thereof
WO2021082746A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Operation apparatus and related product
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
WO2021217502A1 (en) * 2020-04-27 2021-11-04 西安交通大学 Computing architecture
CN114064331A (en) * 2020-07-29 2022-02-18 中国科学院深圳先进技术研究院 Fault-tolerant computing method, fault-tolerant computing device, storage medium, and computer apparatus
CN115329951A (en) * 2022-09-13 2022-11-11 北京工商大学 FPGA (field programmable Gate array) framework for fast convolution operation of convolution neural network
US11876514B2 (en) 2021-04-29 2024-01-16 Nxp Usa, Inc Optocoupler circuit with level shifter
US11995442B2 (en) 2021-04-23 2024-05-28 Nxp B.V. Processor having a register file, processing unit, and instruction sequencer, and operable with an instruction set having variable length instructions and a table that maps opcodes to register file addresses

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203621A (en) * 2016-07-11 2016-12-07 姚颂 The processor calculated for convolutional neural networks
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108229656A (en) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 Neural network computing device and method
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203621A (en) * 2016-07-11 2016-12-07 姚颂 The processor calculated for convolutional neural networks
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN108229656A (en) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 Neural network computing device and method
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108537331A (en) * 2018-04-04 2018-09-14 清华大学 A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bao Xianliang: "Design and Implementation of a High-Performance CNN-Dedicated Convolution Accelerator", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276444A (en) * 2019-06-04 2019-09-24 北京清微智能科技有限公司 Image processing method and device based on convolutional neural networks
CN110288086B (en) * 2019-06-13 2023-07-21 天津大学 Winograd-based configurable convolution array accelerator structure
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd
CN112199036A (en) * 2019-07-08 2021-01-08 爱思开海力士有限公司 Data storage device, data processing system and acceleration device thereof
CN112199036B (en) * 2019-07-08 2024-02-27 爱思开海力士有限公司 Data storage device, data processing system and acceleration device thereof
CN110225067B (en) * 2019-07-24 2021-08-24 上海戎磐网络科技有限公司 Internet of things safety early warning system
CN110225067A (en) * 2019-07-24 2019-09-10 上海戎磐网络科技有限公司 A kind of Internet of Things safety pre-warning system
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
WO2021082746A1 (en) * 2019-11-01 2021-05-06 中科寒武纪科技股份有限公司 Operation apparatus and related product
WO2021217502A1 (en) * 2020-04-27 2021-11-04 西安交通大学 Computing architecture
US11886347B2 (en) 2020-04-27 2024-01-30 Xi'an Jiaotong University Large-scale data processing computer architecture
CN114064331A (en) * 2020-07-29 2022-02-18 中国科学院深圳先进技术研究院 Fault-tolerant computing method, fault-tolerant computing device, storage medium, and computer apparatus
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
US11995442B2 (en) 2021-04-23 2024-05-28 Nxp B.V. Processor having a register file, processing unit, and instruction sequencer, and operable with an instruction set having variable length instructions and a table that maps opcodes to register file addresses
US11876514B2 (en) 2021-04-29 2024-01-16 Nxp Usa, Inc Optocoupler circuit with level shifter
CN115329951A (en) * 2022-09-13 2022-11-11 北京工商大学 FPGA (field programmable Gate array) framework for fast convolution operation of convolution neural network
CN115329951B (en) * 2022-09-13 2023-09-15 北京工商大学 FPGA architecture for convolutional neural network fast convolutional operation

Also Published As

Publication number Publication date
CN109447241B (en) 2022-02-22

Similar Documents

Publication Publication Date Title
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN110852428B (en) Neural network acceleration method and accelerator based on FPGA
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN110348574B (en) ZYNQ-based universal convolutional neural network acceleration structure and design method
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN110390383A (en) A kind of deep neural network hardware accelerator based on power exponent quantization
CN111445012A (en) FPGA-based packet convolution hardware accelerator and method thereof
CN112418396B (en) Sparse activation perception type neural network accelerator based on FPGA
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN111860773B (en) Processing apparatus and method for information processing
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN115329260A (en) Transformer accelerator based on offset diagonal matrix
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant