CN113807509A - Neural network acceleration device, method and communication equipment


Info

Publication number
CN113807509A
CN113807509A (application number CN202111071508.7A; granted publication CN113807509B)
Authority
CN
China
Prior art keywords
data
neural network
convolution
convolution kernel
characteristic
Prior art date
Legal status
Granted
Application number
CN202111071508.7A
Other languages
Chinese (zh)
Other versions
CN113807509B (en)
Inventor
王赟
张官兴
郭蔚
黄康莹
张铁亮
Current Assignee
Shanghai Ewa Intelligent Technology Co ltd
Shaoxing Ewa Technology Co Ltd
Original Assignee
Shanghai Ewa Intelligent Technology Co ltd
Shaoxing Ewa Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Ewa Intelligent Technology Co ltd and Shaoxing Ewa Technology Co Ltd
Priority to CN202111071508.7A
Publication of CN113807509A
Application granted
Publication of CN113807509B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a neural network acceleration device, a neural network acceleration method and communication equipment in the field of data processing. The device comprises a main memory, a main controller, a data caching module, a neural network computing module and an accumulator. The main memory receives and stores the feature map data and weight data of an image to be processed. The main controller generates configuration information and operation instructions according to the structural parameters of the neural network. The data caching module comprises a feature data caching unit for caching feature row data extracted from the feature map data, and a convolution kernel caching unit for caching convolution kernel data extracted from the weight data. The data controller adjusts the data paths according to the configuration information and instruction information and controls the data streams extracted by the data extractor to be loaded into the corresponding neural network computing units; each computing unit completes the convolution of at least one convolution kernel with the feature map data and accumulates a plurality of convolution results in at least one cycle, thereby realizing circuit reconfiguration and data reuse. The accumulator accumulates the convolution results and outputs the output feature map data corresponding to each convolution kernel.

Description

Neural network acceleration device, method and communication equipment
Technical Field
The invention relates to the field of data processing, in particular to a neural network accelerating device, a neural network accelerating method and communication equipment.
Background
A convolutional neural network is composed of an input layer, an arbitrary number of hidden layers serving as intermediate layers, and an output layer. The input layer has a plurality of input nodes (neurons). The output layer has as many output nodes (neurons) as there are objects to be identified.
A convolution kernel is a small window sliding over the hidden-layer input, in which the weight parameters are stored. The kernel slides over the input image according to the stride and performs a multiply-accumulate operation with the input features of the corresponding region; that is, the weight parameters in the kernel are multiplied by the corresponding input values and the products are summed. A traditional convolution acceleration device must first use the img2col method to expand the input feature map data and convolution kernel data into matrix form according to the kernel size and stride, and then operate on the expanded matrices, so that the convolution can be accelerated under matrix multiplication rules. However, once the feature data matrix is expanded, a larger on-chip cache is needed and the off-chip main memory must be read more often; data that cannot be efficiently reused occupies the off-chip read/write bandwidth, which increases hardware power consumption. Moreover, the img2col-based approach does not lend itself to a hardware logic circuit that handles convolution kernels of different sizes and strides. During the operation of the convolutional network, each input channel must perform convolution matrix operations with a plurality of convolution kernels, so the feature map data must be fetched many times; and because all the feature map data of every channel is fully cached in the buffer, the data volume is huge. When the convolution matrix is computed, the feature data size after matrix conversion far exceeds the original feature data size, wasting on-chip storage resources, so that operations on large data volumes cannot be executed.
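To make the cache blow-up concrete, here is a minimal im2col sketch (pure NumPy with illustrative shapes; it is not the patent's circuit): unrolling an 8 × 8 single-channel feature map for a 3 × 3 kernel at stride 1 multiplies the buffered data volume by roughly k × k.

```python
import numpy as np

def im2col(x, k, stride=1):
    """Unroll k x k patches of a 2-D feature map into matrix rows."""
    h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            cols[i * out_w + j] = patch.ravel()
    return cols

x = np.arange(64, dtype=np.float32).reshape(8, 8)
print(x.size, im2col(x, 3).size)  # 64 vs 324: ~5x more buffer after expansion
```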
Disclosure of Invention
Accordingly, to overcome the above-mentioned shortcomings of the prior art, the present invention provides a neural network acceleration apparatus, method and communication device.
In order to achieve the above object, the present invention provides a neural network acceleration apparatus, including: a main memory for receiving and storing the feature map data and weight data of an image to be processed; a main controller for parsing the compiled neural network program instructions and generating configuration information and operation instructions according to the structural parameters of the neural network; a data caching module comprising a feature data caching unit for caching feature row data extracted from the feature map data and a convolution kernel caching unit for caching convolution kernel data extracted from the weight data; a neural network computing module comprising a data controller, a data extractor and neural network computing units, wherein the data controller adjusts the data paths according to the configuration information and instruction information and controls the data streams extracted by the data extractor to be loaded into the corresponding neural network computing units, and each computing unit completes the convolution of at least one convolution kernel with the feature map data and accumulates a plurality of convolution results in at least one cycle, so as to realize circuit reconfiguration and data reuse; and an accumulator for accumulating the convolution results over the plurality of input-channel feature maps obtained by the convolution operation units and outputting the output feature map data corresponding to each convolution kernel.
In one embodiment, the neural network computing unit includes a plurality of neural network acceleration slices, each neural network acceleration slice includes a plurality of convolution operation multiply-add arrays, each neural network acceleration slice at least completes convolution operation of feature map data of one input channel and convolution kernel data, and a plurality of neural network acceleration slices complete convolution operation of feature map data of a plurality of input channels and convolution kernel data.
In one embodiment, the plurality of neural network acceleration slices form a first neural network operational matrix, and the plurality of first neural network operational matrices are coupled in parallel to form a second neural network acceleration matrix; the first neural network operation matrix in the second neural network acceleration matrix is used for completing convolution operation of a plurality of input channel characteristic data and a convolution kernel, and the plurality of second neural network acceleration matrices complete parallel convolution operation of the plurality of input channel characteristic data and the plurality of convolution kernels.
In one embodiment, each group of convolution operation multiplication and addition arrays obtains characteristic line data in a parallel input mode; and each group of convolution operation multiplication and addition arrays obtains convolution kernel data in a serial input mode.
In one embodiment, the neural network acceleration slice comprises a plurality of first multiplexers and a plurality of second multiplexers, the first multiplexers being coupled in parallel with the convolution multiply-add arrays in a one-to-one correspondence, and the second multiplexers being coupled in series with the convolution multiply-add arrays in a one-to-one correspondence. The first multiplexer obtains, through a data selection signal, the feature row data corresponding to its multiply-add array and inputs it in parallel to every stage of that array; the second multiplexer obtains the corresponding convolution kernel row data and inputs it serially to every stage of the array to complete the convolution multiply-add operation.
In one embodiment, the neural network computing module further includes a first shift register group and a second shift register group, and the neural network computing unit includes a multiply-add subunit and a partial-sum buffer subunit. The first shift register group operates in serial-in, parallel-out mode and outputs the feature row data to the multiply-add subunit through a first multiplexer. The second shift register group operates in serial-in mode with one output tap selected according to the stride, and outputs the convolution kernel data through a second multiplexer to the next-stage convolution multiply-add array and to the multiply-add subunit. The multiply-add subunit multiplies the incoming feature row data by the convolution kernel data and accumulates the product with the row partial-sum data held in the partial-sum buffer subunit; when the convolution of the kernel row data with the feature row data of the corresponding convolution window is complete, the several row convolution results of the window are accumulated, realizing one sliding-window convolution of a kernel. The different-stage multiply-add arrays of each group output their row results to the accumulator once per convolution row period, and the accumulator sums, through an adder tree, the row results output by the corresponding stages across all rows of the current convolution kernel, thereby realizing the convolution operation of one kernel.
In one embodiment, the data controller is configured to obtain, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing units, to instruct the multiplexers to switch on and off so as to adjust the data paths, and to input the feature data and corresponding weight data into the corresponding computing units according to the instruction information. The data extractor comprises a feature extractor and a convolution kernel extractor: the feature extractor extracts feature row data from the feature data cache unit according to the instruction information, and the convolution kernel extractor extracts convolution kernel data from the convolution kernel cache unit according to the instruction information and transmits it to the neural network computing module.
In one embodiment, the feature data buffer unit includes a plurality of feature data buffer groups; each buffer group caches part of the feature data of one input channel and is coupled with at least one neural network acceleration slice. The plural acceleration slices share one convolution kernel data buffer unit: each slice obtains the feature map data of one input channel from its corresponding feature data buffer group, while the same convolution kernel data is distributed to all of the slices.
In one embodiment, the data controller obtains, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit; it instructs the multiplexers to output the feature data from the first shift register group to the computing unit in serial-in, parallel-out mode, and to output the convolution kernel data from the second shift register group to the current and next convolution multiply-add arrays of the computing unit in serial-in mode with one output tap selected according to the stride.
The invention also provides a neural network acceleration method, comprising the following steps: receiving and storing the feature map data and weight data of an image to be processed in a main memory; parsing the compiled neural network program instructions with a main controller and generating configuration information and operation instructions according to the structural parameters of the neural network; caching feature row data extracted from the feature map data in the feature data caching unit of a data caching module, and caching convolution kernel data extracted from the weight data in the convolution kernel caching unit of the data caching module; using the data controller of a neural network computing module to make and break the data paths according to the configuration information and instruction information, and controlling the data streams extracted by the data extractor of the computing module to flow into the corresponding neural network computing units according to the instruction information, each computing unit completing the convolution of at least one convolution kernel with the feature map data and accumulating a plurality of convolution results in at least one cycle, so as to realize circuit reconfiguration and data reuse; and accumulating, with an accumulator, the convolution results over the plurality of input-channel feature maps obtained by the convolution operation units, and outputting the output feature map data corresponding to each convolution kernel.
The invention also provides communication equipment comprising a central processing unit (CPU), a DDR SDRAM memory and the neural network acceleration apparatus described above, all communicatively connected; the CPU controls the acceleration apparatus to start the convolution operation, and the DDR SDRAM inputs the feature map data and weight data to the data caching module of the acceleration apparatus.
Compared with the prior art, the invention has the following advantages. The values fetched by the convolution kernel from the input image are converted into row (column) operations, so the convolution of the current data is completed with only one read of the data from main memory and without any matrix-form expansion; this reduces memory accesses and improves the energy efficiency of data access, while temporal reuse of feature data optimizes the operation speed. A blocking technique is further adopted: the feature map data of each input channel is processed block by block, i.e., only part of the feature map data is fetched at a time and the cached data is updated after its computation finishes, which reduces the on-chip caching requirement for feature map data and keeps the compute cores running without stalls.
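A hedged sketch of the row-blocking idea follows (a software stand-in with assumed names; the patent realizes this with an on-chip buffer rather than a Python generator): only a band of feature rows is resident at a time, and each band is consumed by all kernels before being evicted.

```python
import numpy as np

def row_bands(feature, k, stride=1):
    """Yield successive k-row bands of one input channel's feature map."""
    out_h = (feature.shape[0] - k) // stride + 1
    for i in range(out_h):
        yield feature[i * stride : i * stride + k]  # the resident on-chip block

feature = np.arange(80.0).reshape(10, 8)  # toy 10 x 8 feature map
for band in row_bands(feature, k=3):
    # run every convolution kernel against this resident band here,
    # then let the buffer refill with the next rows
    pass
```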
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a block diagram of a neural network acceleration device in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a neural network acceleration device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a neural network acceleration device loading characteristic line data and convolution kernel line data according to an embodiment of the present invention;
FIG. 4 is a diagram of a register bank loaded with eigen row data and convolution kernel row data in an embodiment of the invention;
FIG. 5 is a timing diagram of a convolution operation performed on a PU by a 3 × 3 convolution kernel with a step size of 1 in another embodiment of the present invention;
FIG. 6 is a timing diagram of a convolution operation performed on a PU by a 5 × 5 convolution kernel with a step size of 1 in another embodiment of the present invention;
FIG. 7 is a timing diagram of a convolution operation performed on a PU by a 5 × 5 convolution kernel with a step size of 2 in another embodiment of the present invention;
FIG. 8 is a flowchart illustrating a neural network acceleration method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present application, one skilled in the art should appreciate that an aspect described herein may be implemented independently of any other aspect and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present application, and the drawings only show the components related to the present application rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
The embodiment of the application provides communication equipment whose hardware architecture comprises a central processing unit (CPU), a DDR SDRAM memory and a neural network acceleration apparatus, all communicatively connected. The CPU controls the acceleration apparatus to start the convolution operation, and the DDR SDRAM inputs the convolution data and convolution parameters (feature map data and weight data) to the data caching module of the acceleration apparatus. The acceleration apparatus then completes the convolution operation on the obtained convolution data and parameters, writes the operation result back to the memory address designated in the DDR SDRAM, and notifies the CPU that the convolution operation is complete.
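The handshake described above can be summarized in a host-side sketch; every class and method name below is a hypothetical stand-in for illustration, not an API defined by the patent.

```python
class DDRSDRAM:
    """Toy stand-in for the shared DDR SDRAM."""
    def __init__(self):
        self.mem = {}
    def write(self, addr, data):
        self.mem[addr] = data
    def read(self, addr):
        return self.mem[addr]

class Accelerator:
    """Toy stand-in for the neural network acceleration apparatus."""
    def __init__(self, ddr):
        self.ddr = ddr
    def start(self, feat_addr, wt_addr, result_addr):
        feat, wt = self.ddr.read(feat_addr), self.ddr.read(wt_addr)
        result = f"conv({feat}, {wt})"        # convolution happens on-device
        self.ddr.write(result_addr, result)   # write back to the designated address
        return True                           # completion notification to the CPU

ddr = DDRSDRAM()
ddr.write(0x1000, "feature map"); ddr.write(0x2000, "weights")
done = Accelerator(ddr).start(0x1000, 0x2000, 0x3000)  # CPU kicks it off
print(done, ddr.read(0x3000))
```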
As shown in fig. 1 and fig. 2, an embodiment of the present application provides a neural network acceleration apparatus 100 including a main memory 102, a main controller 104, a data cache module 106, a neural network computation module 108, and an accumulator 110.
The main memory 102 is used for receiving and storing feature map data and weight data of an image to be processed. The main memory 102 may receive and store the image to be processed and the weight data.
The master controller 104 is configured to parse the compiled neural network program instructions and generate configuration information and operation instructions according to the structural parameters of the neural network. The main controller 104 may load the feature map data and weight parameters into the corresponding cache units according to these structural parameters, and at the same time send the execution instructions and the configuration information of the PE matrix queues in the PU slices to the data controller of the neural network computing module 108.
The data buffer module 106 includes a feature data buffer unit and a convolution kernel buffer unit. The feature data buffer unit caches feature row data extracted from the feature map data; the convolution kernel buffer unit caches convolution kernel data extracted from the weight data. After the PUs receive the instruction, the interconnection circuit paths on the PU slices are configured according to the configuration and instruction information, and the feature map data is input row by row into the PE accelerators of the multiple PU slices according to the preset dataflow rules to complete the subsequent convolution operation (the configuration determines, for example, which PU slice and which PE accelerator the feature data and weights enter, in what data arrangement, over which data path, and with what block size).
The neural network computing module 108 includes a data controller 1081, a data extractor 1082, and a neural network computing unit 1083, the data controller adjusts data paths according to configuration information and instruction information, controls data streams extracted by the data extractor to be loaded to the corresponding neural network computing unit according to the instruction information, and the neural network computing unit at least completes convolution operation of one convolution kernel and feature map data, and completes accumulation of a plurality of convolution results in at least one period, thereby implementing circuit reconstruction and data multiplexing.
The accumulator 110 is configured to accumulate convolution results on the multiple input channel feature maps obtained by the convolution kernel operation unit, and output feature map data corresponding to a convolution kernel.
The neural network accelerating device can also be provided with an activation/pooling unit, an output buffer unit and the like, which cooperate with the neural network computing module 108 to complete the subsequent processing of the convolutional neural network.
With the above neural network accelerating device, the data need not be expanded into matrix form: the convolution and accumulation of the current data is completed after reading the feature data and convolution kernel data from main memory only once, which reduces memory access bandwidth and storage space, improves the energy efficiency of data access, achieves efficient feature map data reuse, and optimizes the operation speed. A blocking technique is further adopted: the feature map data of each input channel is processed block by block, i.e., only part of the feature map data is fetched at a time and the cached data is updated after its computation finishes, which reduces the on-chip caching requirement for feature map data and keeps the compute cores running without stalls.
In one embodiment, as shown in fig. 2, the neural network computing unit 1083 includes a plurality of neural network acceleration slices 1084, each of which includes a plurality of convolution operation multiply-add arrays, each of which performs convolution operation of at least one input channel feature map data and one convolution kernel data, and a plurality of neural network acceleration slices performs convolution operation of a plurality of input channel feature map data and one convolution kernel data.
A neural network acceleration slice (PU slice) may be composed of a plurality of PE acceleration processing units (PE accelerators). Multiple PU slices can realize data-parallel computation along different dimensions through system configuration. An NN neural network acceleration circuit may include a plurality of PU slices, activation/pooling circuits, an accumulation unit, and so on. The PE accelerator is the most basic unit of neural network acceleration processing: each unit contains at least a multiplier, an adder, and a partial-sum/result buffer, can complete at least one convolution operation between a weight parameter and input feature data, and can accumulate a plurality of convolution results in at least one cycle. In this application, a PU slice may contain PE accelerators arranged in an array. The feature data cache unit comprises a plurality of feature data cache groups; each cache group buffers part of the feature data of one input channel and is coupled with at least one PU, i.e., each PU obtains the feature map data of one input channel from its corresponding feature data cache group. Meanwhile, multiple PUs share one convolution kernel data cache unit, i.e., the row data of the same convolution kernel is broadcast to multiple PUs, realizing parallel multi-channel-input, single-channel-output convolution.
In order to realize data reuse and reduce the read/write bandwidth pressure on main memory, each PU slice may first compute, along the row direction of the input feature map, the convolution results of the current single-channel input feature row data (the data fetched by a convolution kernel along the row direction of the image to be processed is the feature row data) with each of a plurality of convolution kernels; the kernel data is then updated and the convolution restarted until the single feature row has been convolved with all kernels. After the feature row data of the single input channel has been convolved with all kernels, the input feature row is updated, i.e., the convolution window moves down, and the above steps repeat until the current input feature map has been convolved with all kernels and its multi-channel output feature map is output. After the convolution results of the single channel's input feature map with all kernels are finished, the feature map input channel is updated. This operation order can be flexibly configured according to the actual situation and the configuration or instruction information.
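Written as a loop nest, that scheduling order might look like the following functional model (NumPy, illustrative names, no padding assumed); the hardware replaces these loops with caches and PE arrays, but the traversal order and the reuse it buys are the same.

```python
import numpy as np

def scheduled_conv(feature, kernels, stride=1):
    """feature: (C_in, H, W); kernels: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = kernels.shape
    out_h = (feature.shape[1] - k) // stride + 1
    out_w = (feature.shape[2] - k) // stride + 1
    out = np.zeros((c_out, out_h, out_w))
    for ci in range(c_in):                    # input channel updated last
        for i in range(out_h):                # one resident band of k feature rows
            band = feature[ci, i * stride : i * stride + k]
            for co in range(c_out):           # every kernel reuses the band
                for j in range(out_w):
                    window = band[:, j * stride : j * stride + k]
                    out[co, i, j] += np.sum(window * kernels[co, ci])
    return out

x = np.random.rand(2, 6, 6)
w = np.random.rand(3, 2, 3, 3)
print(scheduled_conv(x, w).shape)  # (3, 4, 4)
```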
In one embodiment, a plurality of neural network acceleration slices form a first neural network operation matrix, and a plurality of first operation matrices are coupled in parallel to form a second neural network acceleration matrix; each first operation matrix within a second acceleration matrix completes the convolution of a plurality of input-channel feature data with one convolution kernel, and a plurality of second acceleration matrices complete the parallel convolution of the plural input-channel feature data with a plurality of kernels. A single acceleration slice computes the accelerated convolution between one feature map and one convolution kernel. The PU slices of the same second acceleration matrix can simultaneously compute the convolution of the same kernel with different input feature data, those slices sharing one convolution kernel cache unit; the PU slices of different second acceleration matrices can simultaneously compute the convolution of several different kernels with the same input feature data, those slices sharing one feature data cache unit.
The plurality of PUs form a first sub-neural network operation PU matrix, and the plurality of first sub-neural network operation PU matrices form a second neural network PU matrix; each submatrix in the second neural network PU matrix is used for completing convolution operation of a plurality of input channel characteristic data and one convolution kernel, and the plurality of submatrixes can complete parallel convolution operation of the plurality of input channel characteristic data and the plurality of convolution kernels.
In one embodiment, each group of convolution operation multiplication and addition arrays obtains characteristic line data in a parallel input mode; and each group of convolution operation multiplication and addition arrays obtains convolution kernel data in a serial input mode. The PU comprises at least one group of PEs, and each group of PEs is responsible for convolution operation of convolution kernel line data and corresponding feature map line data; the multiple groups of PEs can realize convolution operation of multiple convolution kernel rows and multiple corresponding characteristic map row data, namely each group of PEs forms one row, and the multiple rows of PEs complete convolution operation of at least one convolution kernel row and corresponding characteristic data. Each group of PEs obtains characteristic line data in a parallel input mode, namely each metadata in each characteristic line data is simultaneously broadcast to each level of PE in the current group; meanwhile, each group of PEs obtains convolution kernel line data in a serial input mode, namely, the convolution kernel line metadata flows from the first-level PE to the next-level PE in each clock cycle.
In one embodiment, as shown in FIG. 3, the neural network acceleration slice 1084 comprises a plurality of first multiplexers 1085 coupled in parallel with the convolution multiply-add array in a one-to-one correspondence, and a plurality of second multiplexers 1086 coupled in series with the convolution multiply-add array in a one-to-one correspondence; the first multiplexer acquires characteristic line data corresponding to the convolution operation multiplication and addition array through a data selection signal and inputs the characteristic line data to the corresponding convolution operation multiplication and addition array at each stage in parallel, and the second multiplexer acquires convolution kernel line data corresponding to the convolution operation multiplication and addition array and inputs the convolution kernel line data to the convolution operation multiplication and addition array at each stage in series to complete convolution multiplication and addition operation.
The first multiplexers are respectively coupled in parallel with the corresponding PE groups, and each PE group can select at least one of two feature row data through a selection signal; the six PE groups in fig. 3 can select data from 6 different rows. The second multiplexers are coupled in series with the corresponding PE groups, and each PE group can select at least one of two convolution kernel row data through a selection signal; the six PE groups in fig. 3 can select 6 different convolution kernel rows. The first multiplexer obtains the corresponding feature row data through a data selection signal (provided by the configuration information or data loading instructions) and inputs it in parallel to every PE stage of the corresponding PE group, while the second multiplexer selects the corresponding convolution kernel row data and inputs it serially to every PE stage to complete the convolution multiply-add operation.
The remaining idle PE groups can obtain, through the multiplexers, the feature row data and convolution kernel row data used for convolution in the column direction of the feature map, thereby reusing the input data. For example, for a 3 × 3 convolution kernel with stride 1, taking the matrix of fig. 3 as an example, the first three PE groups complete the parallel accelerated convolution of one kernel in the row direction (PE00, PE10, PE20 form one group; PE01, PE11, PE21 one group; PE02, PE12, PE22 one group). The fourth PE group can reuse the feature data of the second feature extractor 1 together with the kernel data of the first weight fetcher 0, the fifth PE group can reuse the feature data of the third feature extractor 2 with the kernel data of the second weight fetcher 1, and the sixth PE group can reuse the feature data of the fourth feature extractor 3 with the kernel data of the third weight fetcher 2. These data pairs are fed into the corresponding PE groups to complete the sliding convolution of the window in the column direction of the feature map, which improves feature map data reuse, reduces the occupied read/write bandwidth, and keeps the whole PE array fully loaded and operating efficiently.
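Assuming the indices suggested by fig. 3, the routing in this 3 × 3, stride-1 example can be tabulated as follows; groups 3-5 repeat the row pairing one feature row lower, which is exactly the window slid down one row in the column direction.

```python
# (feature extractor index, weight fetcher index) feeding each PE group
pe_group_routing = {
    0: (0, 0),  # kernel row 0 x feature row 0  -- window at row 0
    1: (1, 1),  # kernel row 1 x feature row 1
    2: (2, 2),  # kernel row 2 x feature row 2
    3: (1, 0),  # kernel row 0 x feature row 1  -- window slid down one row
    4: (2, 1),  # kernel row 1 x feature row 2
    5: (3, 2),  # kernel row 2 x feature row 3
}
for group, (feat, wt) in pe_group_routing.items():
    print(f"PE group {group}: feature extractor {feat} x weight fetcher {wt}")
```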
In one embodiment, the neural network computing module further comprises a first shift register group and a second shift register group, and the neural network computing unit comprises a multiply-add subunit and a partial-sum buffer subunit. The first shift register group operates in serial-in, parallel-out mode and outputs the feature row data to the multiply-add subunit through the first multiplexer. The second shift register group receives the convolution kernel data serially and, according to the stride signal, outputs it through the second multiplexer to the next-stage convolution multiply-add array and the multiply-add subunit. The multiply-add subunit multiplies the incoming feature row data by the kernel data and accumulates the product with the row partial-sum data held in the partial-sum buffer subunit; when the convolution of the kernel row data with the feature row data of the corresponding convolution window is complete, the several row convolution results of the window are accumulated, realizing one sliding-window convolution of a kernel. The different-stage multiply-add arrays PE of each group output their row results to the accumulator once per convolution kernel row period, and the accumulator sums, through an adder tree, the row convolution results output by the same-stage arrays across all rows of the current kernel, thereby realizing one window convolution of a kernel. PEs of different stages compute in parallel several sliding convolutions of the current kernel row along the row direction of the feature map. During the convolution, the feature map rows are fed in parallel, in order, into every stage of every PE group, while the corresponding kernel row data is loaded serially into every stage in a periodic cycle.
As shown in fig. 4, the first shift register group 1087 is serially connected with parallel outputs and feeds the feature map row data through the multiplexer to every PE stage in parallel; the current convolution kernel metadata is fed into the multiply-add unit at the same time as it is fed into the first-stage shift register.

The second shift register group 1088 likewise receives the convolution kernel data serially and outputs it through the multiplexer to the next-stage multiply-add unit.

The feature row data and kernel row data fed into the multiply-add unit are multiplied correspondingly, and the product is accumulated with the row partial-sum data held in the partial-sum buffer unit. After one kernel row has finished convolving with the feature row data of the corresponding convolution window, the partial-sum output of that row is summed with the convolution results of the kernel's other rows, realizing one sliding-window convolution of the kernel.
The feature row data (X00, …, X0n) are fed continuously and in row order, in parallel, into the PE groups, while the kernel row data (F00/F01/F02) are fed in serially in the cyclic order F00/F01/F02-F00/F01/F02-F00/F01/F02 for the convolution; every PE stage outputs the partial sum of the current kernel row once per kernel row period (for a kernel row of size 3, the period is 3 cycles). PEs of different stages output the partial sums of the kernel row's sliding convolution along the feature map row according to the stride s. After each group's same-stage PEs output their partial sums within one convolution row period, the partial-sum results output by the same-stage PEs of the groups corresponding to all rows of the current kernel are accumulated through the adder tree, realizing the convolution of one kernel and yielding the convolution computation shown in fig. 5.
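Functionally, the per-row partial sums and the adder-tree step amount to this short sketch (NumPy, toy weights; one row_partials call stands in for one PE group, and the final sum stands in for the adder tree):

```python
import numpy as np

def row_partials(feature_row, kernel_row, stride=1):
    """Partial sums of one kernel row slid along one feature row."""
    k = len(kernel_row)
    n = (len(feature_row) - k) // stride + 1
    return np.array([feature_row[i * stride : i * stride + k] @ kernel_row
                     for i in range(n)])

feat = np.arange(24.0).reshape(3, 8)     # 3 feature rows of one window band
kern = np.array([[1.0, 0.0, -1.0]] * 3)  # 3 kernel rows (toy weights)
# adder tree: sum the same-window partials contributed by each kernel row
out_row = sum(row_partials(feat[r], kern[r]) for r in range(3))
print(out_row)                           # one output row of the 3x3 convolution
```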
In fig. 5, PE00 outputs in three consecutive cycles the partial sum of the first kernel row convolved with the corresponding window's feature row data; PE10 outputs in three consecutive cycles the partial sum of the first kernel row convolved with the window adjacent to PE00's; and PE20 outputs in three consecutive cycles the partial sum of the first kernel row convolved with the window adjacent to PE10's. Multiple PE groups realize the accelerated convolution of different kernel rows with the corresponding feature rows, which is equivalent to the kernel sliding and traversing along the row direction of the feature map. The first shift register group, according to the stride parameter s, selects the output of the corresponding shift register as the input of the next PE stage, realizing the traversing convolution of the window along the row direction with stride s; the other PE groups, by reusing part of the input feature map data and kernel data, realize the window-sliding convolution acceleration along the column direction of the feature map according to the stride.
In one embodiment, the data controller is used for acquiring the storage addresses of the feature data and the corresponding weight data loaded into the neural network computing unit according to the configuration information and the instruction information, and simultaneously instructing the multiplexer to switch on and off to adjust a data path, and inputting the feature data and the corresponding weight data into the corresponding neural network computing unit according to the instruction information; and the data extractor comprises a feature extractor and a convolution kernel extractor, the feature extractor is used for extracting feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor is used for extracting convolution kernel data from the convolution kernel cache unit according to the instruction information so as to transmit the convolution kernel data to the neural network computing module.
In one embodiment, the feature data cache unit includes a plurality of feature data cache groups; each cache group buffers part of the feature data of one input channel and is coupled with at least one neural network acceleration slice. The plural acceleration slices share one convolution kernel data cache unit: each slice obtains the feature map data of one input channel from its corresponding cache group, and the same convolution kernel data is distributed to all of the slices. The data controller connects weight fetcher 0 to PE column 0 and also broadcasts its data to PE column 4; that is, the first-row weight parameters perform a row convolution with the 1st feature row data and simultaneously with the 2nd feature row data (the 2nd feature row data being routed to PE column 4 through the multiplexer), i.e., the convolution kernel operates on adjacent rows at each pass. The data controller of the neural network computing module 108 thus uses multiple acceleration slices to compute the convolution of single-channel input feature data with a convolution kernel along the row direction of the image to be processed.
In one embodiment, the data controller acquires, according to the configuration information and instruction information, the storage addresses of the feature data and corresponding weight data to be loaded into the neural network computing unit; it instructs the multiplexers to output the feature data from the first shift register group to the computing unit in serial-in, parallel-out mode, and to output the convolution kernel data from the second shift register group to the current and next convolution multiply-add arrays in serial-in mode with one output selected per stride step.
As shown in fig. 6, when the kernel row size (e.g. 5) is larger than the number of PEs in each PE group (e.g. 3), the feature-row address accesses of the PE stages may conflict within a kernel row period: every stage synchronously receives the same feature metadata input, and at the next kernel row cycle the feature data address required by some PEs conflicts with the buffer address currently accessed by the feature extractor. However, the data those PEs need has already been accessed in a previous cycle and is buffered in the first shift register group. The data controller therefore obtains the storage addresses of the feature data and corresponding weight data according to the configuration information and instruction information, instructs the multiplexers to switch the data paths, inputs the feature data and corresponding weights into the computing units according to the instruction information, and resolves the conflict between some PEs' access addresses and the feature extractor's current access address by selecting the output of the appropriate register in the first shift register group. The PE unit can thus obtain conflicting data under the data controller's multiplexer control, which makes the circuit applicable to kernels of different sizes and prevents PE stalls caused by address-access conflicts when the accessed data is not aligned.
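A toy model of that conflict resolution, assuming the depth-4 feature shift register mentioned below in the fig. 6 walkthrough (values and indices are illustrative):

```python
from collections import deque

# Depth-4 feature shift register; register 1 holds the newest broadcast element.
shift_reg = deque(maxlen=4)
for x in ["X01", "X02", "X03", "X04"]:
    shift_reg.appendleft(x)          # register 1 <- newest element

# A PE now needs X03 while the feature extractor's address pointer has moved
# on; instead of re-addressing the cache, the multiplexer taps register 2:
print(shift_reg[1])                  # -> X03 (register 2, 1-based numbering)
```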
As shown in fig. 4, the feature extractor may load data into the PE array in row order according to the data addresses. The PE array is provided with 6 feature extractors, which may be FIFO units or addressing circuits (addressing the cache units in order and loading the data into the PEs). If, say, two of the FIFO units are empty, the currently idle feature extractors can be gated off to reduce power consumption; the weight fetchers behave similarly. For 5 × 5 or 7 × 7 kernels, all the feature extractors and weight fetchers can be in a fully loaded working state. The number of feature extractors can be flexibly configured, i.e., reduced or increased, with the PE array shrinking or growing correspondingly.
A multi-input multiplexer selects the input channel according to the data controller's configuration information, realizing the convolution of a kernel with the corresponding feature row data. For example, for a 3 × 3 kernel, the 4th, 5th and 6th PE columns reuse the data output by the 2nd, 3rd and 4th feature data extractors, respectively. For a 5 × 5 kernel, the 4th and 5th PE columns are connected to the corresponding 4th and 5th feature data extractors, and the 6th PE column to the 2nd feature data extractor. Because the kernel size is not aligned with the number of PE columns, the input path of the input multiplexer must be reconfigured after one row of convolution is completed. For example, in the first row period the convolution of 5+1 rows is completed at once (the 5 × 5 kernel performs one row period of convolution on the feature map: the kernel's first row convolves on the feature map, while the second window, shifted down one row relative to PE columns 1-4, convolves only one further row of feature data); so in the second row period, kernel weight rows 2-5 perform the convolution on PE columns 1-4 with the data of feature extractors 1-4, and the 6th PE column works on the data of feature extractors 1 and 2. Fig. 5 shows only some of these connections; the rest are not shown.
As shown in fig. 5, the weight parameters flow serially into each PE unit along the column direction (three PE units are arranged per column to accommodate a 3 × 3 kernel), and the PE units of each row in the same column are linked through a multiplexer by a group of delay register sets; this delay circuit structure adjusts the sliding stride of the convolution window. Each PE column receives the feature data of a specific row through the multiplexer, output in parallel to every row of PEs in that column. Because the weight parameters flow serially into each PE, in the initial state the feature data must be delayed by 1 and 2 cycles before reaching the 2nd-row and 3rd-row PEs respectively, and is then convolved with the corresponding weight parameters. A group of feature data delay circuits is also arranged inside the PE unit: first, it aligns the feature data with the weight parameters in the initial state; second, when the kernel exceeds 3 × 3, the corresponding feature data is reused through the multi-input selection circuit, preventing address conflicts during data reads.
The feature data extractor fetches at least 1 row of continuous feature row data from the feature cache circuit on the PU slice (since this embodiment uses a 3 × 3 kernel, the feature extraction circuit requests 4 continuous rows of feature data from the cache at one time in order to fully use the multiply-add resources of the PE array). Meanwhile, the weight fetcher fetches at least 1 row of weight parameters at one time (the three rows of weight parameters are fetched directly and loaded into the weight fetchers). With 4 rows of feature data and 3 rows of kernel data as input, at least two column-direction convolution results are output in each kernel row cycle.
In this embodiment the feature extractors and weight fetchers form the data extractor, which may be a FIFO structure or an addressing circuit. Each data extractor extracts, row by row, the first row of the weights (F00, F01, F02) and the first feature row data (X00, X01, X02, X03, X04, X05, X06 …), and extracts the feature data of the other rows after each row period.
In the initial state, in the first cycle F00 is fed into register 1 and the multiplier of PE00, and feature data X00 is fed into PE00, PE01 and PE02; X00 is multiplied by F00 in PE00 and the result is sent into the row accumulation BUFFER, F00 moves on to PE01 in the next cycle, and X00 is stored in feature register 1 of each PE. In the second cycle X01 is sent to PE00, PE01 and PE02; X00 shifts to feature register 2 and X01 enters feature register 1; at the same time F01 is sent to PE00 and the F00 in PE00 is passed on to PE01; X01 and F01 are multiplied in PE00 and the result is sent to the adder to be accumulated with the X00 × F00 result buffered in the accumulation BUFFER in the previous cycle, while X01 in PE01 is multiplied by F00 and the result sent into its accumulation buffer. In the third cycle F02 enters PE00, the F01 in PE00 enters PE01, the F00 in PE01 enters PE02, and X02 is transmitted synchronously to every PE and multiply-accumulated with the corresponding weight parameters. After these three cycles, the convolution of one row of weights with the corresponding feature data is realized; after four cycles, the kernel row has effectively slid once along the feature data row; and after eight cycles, the convolution results of the kernel row sliding 6 times along the corresponding feature row are obtained. Since the 3 rows of the 3 × 3 kernel are processed in parallel in the column direction, the results of 6 convolution sliding windows can be output within eight cycles; after the initialization state, the whole multiply-add unit is fully loaded.
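That timing can be reproduced with a few lines of simulation (3 PE stages, a 3-element kernel row flowing serially, features broadcast in parallel; the model is an illustrative assumption, not the patent's RTL):

```python
STAGES, K = 3, 3  # 3 PE stages per group, kernel row size 3

for cycle in range(8):
    x = f"X0{cycle}"                 # feature element broadcast this cycle
    ops = []
    for p in range(STAGES):
        t = cycle - p                # kernel element reaches stage p after p cycles
        if t >= 0:
            ops.append(f"PE0{p}: {x}*F0{t % K}")
    print(f"cycle {cycle}: " + ", ".join(ops))
# cycle 0: PE00: X00*F00
# cycle 1: PE00: X01*F01, PE01: X01*F00
# cycle 2: PE00: X02*F02, PE01: X02*F01, PE02: X02*F00
# from cycle 3, F00 re-enters PE00: the kernel row begins its next sliding window
```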
In another embodiment, as shown in fig. 6, the convolution kernel is 5 × 5 with a step size of 1. The 3-stage PE array performs parallel accelerated operation on the convolution-kernel-row operations of three convolution windows along the feature map row direction. After PE00 completes the convolution of one convolution kernel row, it starts the kernel-row operation of the fourth convolution window (the second and third kernel rows are handled likewise by PE10 and PE20). At this time the access address of the PE00 feature data must start from X03, while PE01 and PE02 need to access X05, so the feature data extractor would face conflicting addressing of X03 and X05. To avoid the conflict, the feature data register selection circuit (the feature data X04, X03, X02, X01 of the current 4 clock cycles have been buffered) is used to obtain X03, i.e., X03 is read from register 2. Meanwhile, when the register selection signals of the 3 PEs are consistent, the address pointer of the feature data extractor is updated again: for example, it is updated to X05 in the 8th cycle (each PE stage needs to access X05, and the addresses accessed in the two subsequent cycles are the same), and after the next convolution kernel row cycle (for example, the 11th clock cycle) the feature data selection circuit is reconfigured. Through this circuit configuration, full-load computation of the 5 × 5 convolution kernel is achieved flexibly.
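The register-selection trick can be read as a short local history of the broadcast stream: a PE that must restart at an earlier element (here X03) reads it from its own registers instead of re-addressing the shared feature cache. A minimal sketch follows, with the depth of 4 and the register numbering taken from the example above and everything else (class and method names) assumed:

class FeatureHistory:
    """Per-PE shift registers holding the most recent broadcast values."""

    DEPTH = 4                    # X01..X04 buffered over 4 clock cycles

    def __init__(self):
        self.regs = []           # regs[0] = newest value ("register 1")

    def push(self, value):       # shift in one broadcast value per cycle
        self.regs = ([value] + self.regs)[: self.DEPTH]

    def read(self, index):       # index 1 = "register 2" in 1-based terms
        return self.regs[index]

history = FeatureHistory()
for v in ["X01", "X02", "X03", "X04"]:   # the 4 values buffered so far
    history.push(v)
print(history.read(1))   # 'X03' recovered locally, no cache re-addressing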
In another embodiment, as shown in fig. 7, the convolution kernel is 5 × 5 with a step size of 2. After 5 cycles every PE has loaded data, but PE00, having completed the kernel-row operation of one convolution window, next needs the row data of the following window starting at address X06, while the address pointer of the feature extractor is occupied by PE10 and PE20 and points to X05 (the X05 data must be retrieved from the feature data cache circuit). PE00 therefore idles for one cycle and, when the address pointer of the feature extractor reaches X06, performs the kernel-row convolution of the fourth convolution window, i.e. X06 × F00. Similarly, whenever the feature data loaded by one PE stage conflicts with the current access address of another PE's feature extractor, the PE unit with the conflicting access address is set to the idle state for the current operation cycle and acquires the corresponding feature data for multiplication in the next cycle according to the access address of the feature extractor. It follows that some PE units are not fully loaded in certain cycles and sit idle during access-address conflicts; nevertheless, for the 5 × 5 convolution kernel with step size 2, the PE utilization is (number of PE units × iteration period − idle count per iteration period) / (number of PE units × iteration period) = (3 × 6 − 3) / (3 × 6) × 100% ≈ 83%. Here the iteration period runs from the first PE idle after initialization, cycle 6, to cycle 11 (that is, PE00 idles once every 6 cycles), and the idle count per iteration period is the total number of idle PE-cycles of the group of PEs within that period. The larger the convolution kernel, the higher the overall utilization of the PE units.
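The quoted utilization figure follows directly from the counts in the text; as a quick arithmetic check (nothing assumed beyond the stated counts):

num_pe = 3    # PE stages
period = 6    # iteration period in cycles (cycle 6 through cycle 11)
idle = 3      # idle PE-cycles per iteration period

utilization = (num_pe * period - idle) / (num_pe * period)
print(f"{utilization:.0%}")   # -> 83%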
As shown in fig. 8, the present embodiment further provides a neural network acceleration method, including the following steps:
step 602, receiving and storing feature map data and weight data of an image to be processed by adopting a main memory;
step 604, analyzing the neural network program compiling instruction by adopting a main controller, and generating configuration information and an operation instruction according to the structural parameters of the neural network;
step 606, caching the feature row data extracted from the feature map data by adopting a feature data cache unit of the data cache module, and caching the convolution kernel data extracted from the weight data by adopting a convolution kernel cache unit of the data cache module;
step 608, switching the data path on and off according to the configuration information and instruction information by adopting a data controller of the neural network computing module, and controlling the data stream extracted by a data extractor of the neural network computing module to flow into the corresponding neural network computing unit according to the instruction information, wherein the neural network computing unit completes at least the convolution operation of one convolution kernel with the feature map data and completes the accumulation of a plurality of convolution results in at least one cycle, so that circuit reconstruction and data multiplexing are realized;
and step 610, accumulating the convolution results over the plurality of input channel feature maps obtained by the convolution kernel operation unit by adopting an accumulator, and outputting the output feature map data corresponding to the convolution kernel.
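Functionally, steps 602 to 610 amount to a multi-channel convolution followed by a cross-channel accumulation. The reference model below mirrors that arithmetic in plain Python; it is a correctness model only, under assumed names, and represents none of the buffering, routing or reconfiguration of the hardware:

def conv_accelerate_reference(feature_maps, kernel):
    """Compute one output feature map as accumulated in step 610.

    feature_maps : C input channels, each an H x W list of lists
    kernel       : the matching C x k x k convolution kernel
    Valid (no-padding) convolution with step size 1 is assumed.
    """
    C, H, W = len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0])
    k = len(kernel[0])
    out_h, out_w = H - k + 1, W - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for c in range(C):                   # per-channel convolution (step 608)
        fm, kc = feature_maps[c], kernel[c]
        for i in range(out_h):
            for j in range(out_w):
                out[i][j] += sum(        # accumulate channels (step 610)
                    fm[i + u][j + v] * kc[u][v]
                    for u in range(k)
                    for v in range(k)
                )
    return out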
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A neural network acceleration device, comprising:
the main memory is used for receiving and storing feature map data and weight data of the image to be processed;
the main controller is used for analyzing the neural network program compiling instruction and generating configuration information and an operation instruction according to the structural parameters of the neural network;
the data caching module comprises a characteristic data caching unit for caching characteristic line data extracted from the characteristic diagram data and a convolution kernel caching unit for caching convolution kernel data extracted from the weight data;
the neural network computing module comprises a data controller, a data extractor and a neural network computing unit, wherein the data controller adjusts a data channel according to configuration information and instruction information, controls data streams extracted by the data extractor to be loaded to the corresponding neural network computing unit according to the instruction information, and the neural network computing unit at least completes convolution operation of one convolution kernel and the characteristic diagram data and completes accumulation of a plurality of convolution results in at least one period so as to realize circuit reconstruction and data multiplexing;
and the accumulator is used for accumulating convolution results on the plurality of input channel characteristic graphs obtained by the convolution kernel operation unit and outputting output characteristic graph data corresponding to the convolution kernels.
2. The neural network acceleration device according to claim 1, wherein the neural network computing unit comprises a plurality of neural network acceleration slices, each neural network acceleration slice comprising a plurality of convolution operation multiply-add arrays; each neural network acceleration slice performs the convolution operation of at least one input channel of feature map data with one convolution kernel, and the plurality of neural network acceleration slices perform the convolution operations of a plurality of input channels of feature map data with one convolution kernel.
3. The neural network acceleration device of claim 2, wherein the plurality of neural network acceleration slices constitute a first neural network operational matrix, and a plurality of first neural network operational matrices are coupled in parallel to constitute a second neural network acceleration matrix; the first neural network operation matrix in the second neural network acceleration matrix is used for completing convolution operation of a plurality of input channel characteristic data and a convolution kernel, and the plurality of second neural network acceleration matrices complete parallel convolution operation of the plurality of input channel characteristic data and the plurality of convolution kernels.
4. The neural network acceleration device according to claim 2, wherein each set of convolution multiply-add arrays obtains characteristic line data by parallel input; and each group of convolution operation multiplication and addition arrays obtains convolution kernel data in a serial input mode.
5. The neural network acceleration device according to claim 2, wherein the neural network acceleration slice comprises a plurality of first multiplexers and a plurality of second multiplexers, the first multiplexers being coupled in parallel with the convolution operation multiply-add arrays in a one-to-one correspondence, and the second multiplexers being coupled in series with the convolution operation multiply-add arrays in a one-to-one correspondence; the first multiplexer acquires the characteristic line data corresponding to the convolution operation multiplication and addition array through a data selection signal and inputs the characteristic line data to the corresponding convolution operation multiplication and addition array at each stage in parallel, and the second multiplexer acquires the convolution kernel line data corresponding to the convolution operation multiplication and addition array and inputs the convolution kernel line data to the convolution operation multiplication and addition array at each stage in series to complete convolution multiplication and addition operation.
6. The neural network accelerator according to claim 5, wherein the neural network computing module further comprises a first shift register set and a second shift register set, the neural network computing unit comprises a multiply-add subunit and a partial sum buffer subunit,
the first shift register group adopts a serial input and parallel output mode, and outputs the characteristic line data to the multiplication and addition subunit through a first multiplexer; the second shift register group adopts a mode of serial input and selecting one output according to step length, and outputs the convolution kernel data to the next convolution operation multiply-add array and the multiply-add subunit through a second multiplexer,
the multiply-add subunit multiplies the input characteristic line data with the corresponding convolution kernel line data, accumulates the multiply-add result with the convolution-row partial-sum data in the partial sum buffer subunit, and, when the convolution operation of the convolution kernel line data with the characteristic line data corresponding to the convolution window is completed, accumulates the partial sums of the plurality of row convolution results of the convolution window, thereby realizing one sliding-window convolution operation of a convolution kernel;
the convolution operation multiplication and addition arrays of each group of different stages output row operation results to the accumulator in a convolution row period, and the accumulator accumulates the row operation results output by the convolution operation multiplication and addition arrays of each group of stages corresponding to all rows of the current convolution kernel through an addition tree, so that the convolution operation of one convolution kernel is realized.
7. The neural network acceleration device according to claim 1, wherein the data controller is configured to obtain, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing units, and to instruct the multiplexer to switch the data path on and off, so that the feature data and the corresponding weight data are input into the corresponding neural network computing units according to the instruction information;
and the data extractor comprises a feature extractor and a convolution kernel extractor, the feature extractor is used for extracting feature line data from the feature data cache unit according to the instruction information, and the convolution kernel extractor is used for extracting convolution kernel data from the convolution kernel cache unit according to the instruction information so as to transmit the convolution kernel data to the neural network computing module.
8. The neural network acceleration device of claim 7, wherein the feature data buffer unit comprises a plurality of feature data buffer sets, each feature data buffer set buffers a portion of the feature data of one input channel and is coupled to at least one neural network acceleration slice, and a plurality of neural network acceleration slices share one convolution kernel data buffer unit,
each neural network acceleration fragment acquires feature map data of one input channel from the corresponding feature data cache group, and the same convolution kernel data is distributed to a plurality of neural network acceleration fragments.
9. The neural network acceleration device according to claim 7, wherein the data controller obtains, according to the configuration information and the instruction information, the storage addresses of the feature data and the corresponding weight data to be loaded into the neural network computing unit, and simultaneously instructs the multiplexer to output the feature data from the first shift register group to the neural network computing unit in a serial-in parallel-out manner, and instructs the multiplexer to output the convolution kernel data from the second shift register group to the current convolution operation multiplication and addition array and the next convolution operation multiplication and addition array of the neural network computing unit in a serial-input, step-size-selected output manner.
10. A neural network acceleration method, comprising the steps of:
receiving and storing feature map data and weight data of an image to be processed by adopting a main memory;
analyzing a neural network program compiling instruction by adopting a main controller, and generating configuration information and an operation instruction according to the structural parameters of the neural network;
caching the characteristic line data extracted from the characteristic graph data by adopting a characteristic data caching unit of a data caching module, and caching the convolution kernel data extracted from the weight data by adopting a convolution kernel caching unit of the data caching module;
switching the data path on and off according to the configuration information and instruction information by adopting a data controller of a neural network computing module, and controlling the data streams extracted by a data extractor of the neural network computing module to flow into the corresponding neural network computing units according to the instruction information, wherein the neural network computing units complete at least the convolution operation of one convolution kernel with the characteristic diagram data and complete the accumulation of a plurality of convolution results in at least one period, so that circuit reconstruction and data multiplexing are realized;
and accumulating convolution results on the multiple input channel characteristic graphs obtained by the convolution kernel operation unit by adopting an accumulator, and outputting output characteristic graph data corresponding to the convolution kernels.
11. A communication device, comprising a central processing unit CPU, a memory DDR SDRAM and the neural network accelerator of any one of claims 1 to 9, which are communicatively connected, wherein the CPU is configured to control the neural network accelerator to start a convolution operation, and the DDR SDRAM is configured to input feature map data and weight data to the data cache module of the neural network accelerator.
CN202111071508.7A 2021-09-14 2021-09-14 Neural network acceleration device, method and communication equipment Active CN113807509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111071508.7A CN113807509B (en) 2021-09-14 2021-09-14 Neural network acceleration device, method and communication equipment

Publications (2)

Publication Number Publication Date
CN113807509A true CN113807509A (en) 2021-12-17
CN113807509B CN113807509B (en) 2024-03-22

Family

ID=78941220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111071508.7A Active CN113807509B (en) 2021-09-14 2021-09-14 Neural network acceleration device, method and communication equipment

Country Status (1)

Country Link
CN (1) CN113807509B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110751280A (en) * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural network
WO2021072732A1 (en) * 2019-10-18 2021-04-22 北京希姆计算科技有限公司 Matrix computing circuit, apparatus and method
WO2021088563A1 (en) * 2019-11-04 2021-05-14 北京希姆计算科技有限公司 Convolution operation circuit, apparatus and method
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN111832718A (en) * 2020-06-24 2020-10-27 上海西井信息科技有限公司 Chip architecture
CN112487750A (en) * 2020-11-30 2021-03-12 西安微电子技术研究所 Convolution acceleration computing system and method based on memory computing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Junyang et al.: "Design and Implementation of Two-Dimensional Matrix Convolution in a Vector Processor", Journal of National University of Defense Technology, vol. 40, no. 3, pages 69-75 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330656A (en) * 2021-12-24 2022-04-12 杭州菲数科技有限公司 Convolution operation hardware accelerator and data processing method
CN114429203A (en) * 2022-04-01 2022-05-03 浙江芯昇电子技术有限公司 Convolution calculation method, convolution calculation device and application thereof
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN114997392B (en) * 2022-08-03 2022-10-21 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN115719088A (en) * 2022-11-17 2023-02-28 晶铁半导体技术(广东)有限公司 Intermediate cache scheduling circuit device supporting memory CNN
CN115719088B (en) * 2022-11-17 2023-06-27 晶铁半导体技术(广东)有限公司 Intermediate cache scheduling circuit device supporting in-memory CNN
CN115600652A (en) * 2022-11-29 2023-01-13 深圳市唯特视科技有限公司(Cn) Convolutional neural network processing device, high-speed target detection method and equipment
CN115640494A (en) * 2022-12-14 2023-01-24 北京登临科技有限公司 Convolution calculation unit, AI operation array and related equipment

Also Published As

Publication number Publication date
CN113807509B (en) 2024-03-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant