CN110807170B - Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network

Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network

Info

Publication number
CN110807170B
CN110807170B (granted from application CN201911000690.XA)
Authority
CN
China
Prior art keywords
data
matrix
vector
calculation
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911000690.XA
Other languages
Chinese (zh)
Other versions
CN110807170A (en
Inventor
刘仲
陈小文
陈海燕
田希
鲁建壮
***
吴立
马媛
曹坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201911000690.XA
Publication of CN110807170A
Application granted
Publication of CN110807170B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Software Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network, which comprises the following steps: step 1: storing the data of the input feature data set in a sample-dimension-priority manner, and storing the data of the convolution kernels in a manner giving priority to the number dimension of the convolution kernels; step 2: dividing the input feature data set data matrix into a plurality of matrix blocks by columns; step 3: transmitting the convolution kernel data matrix to the scalar memory SM of each core each time, transmitting submatrices extracted by rows from the input feature data matrix to the vector array memory AM of each core, and performing vectorized matrix multiplication and parallelized matrix multiplication, with 0-element padding performed during the calculation; step 4: storing the output feature matrix calculation result in an off-chip memory; step 5: repeating steps 3 to 4 until all calculations are completed. The method can realize vectorization of the Same convolution, and has the advantages of simple operation, high execution efficiency and precision, small bandwidth requirement and the like.

Description

Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network
Technical Field
The invention relates to the technical field of vector processors, and in particular to a method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network.
Background
In recent years, deep learning models based on deep convolutional neural networks have achieved remarkable results in image recognition and classification, target detection, video analysis, and other areas, have become research hotspots in academia and industry, and have promoted the rapid development of related technologies such as artificial intelligence, big data processing, and processors. Convolutional neural networks (Convolutional Neural Networks, CNN) are a class of feedforward neural networks (Feedforward Neural Networks) that contain convolution calculations and have a deep structure, and are one of the representative algorithms of deep learning. The input layer of a convolutional neural network can process multidimensional data; since convolutional neural networks are most widely applied in the field of computer vision, the network structure is usually designed for three-dimensional input data, namely image pixel data on a two-dimensional plane plus RGB channels.
Convolutional neural networks mainly use two types of convolution, Valid convolution and Same convolution: Valid convolution does not fill 0 elements into the input image, while Same convolution fills 0 elements around the edges of the input image. Since Valid convolution fills no 0 elements, convolving a preH×preW image with a kernelH×kernelW filter yields a (preH−kernelH+1)×(preW−kernelW+1) output image. One disadvantage of Valid convolution is therefore that the image shrinks after every convolution operation, and when the number of layers is large the final image becomes very small. Another disadvantage is that image edge information is used less often than interior information, so most of the edge information is lost. Neural network models therefore typically combine Valid convolution with Same convolution, and Same convolution is widely employed in practical network models.
Same convolution enlarges the image by filling 0 elements before the convolution operation is performed, so that the output image after convolution keeps the same size as the input image. In general, Same convolution is computed by allocating a new memory area in advance according to the enlarged image size, filling 0 elements into the padded region, copying the original image element values into the remaining region, and then applying to the new image the same calculation method as Valid convolution (a sketch of this conventional approach is given after the list below). However, this type of method has the following problems:
(1) At least double the memory overhead is required;
(2) The storage position of the 0 element is discontinuous, so that the operation cost for supplementing the 0 element is high;
(3) The time overhead for copying the original image data is large.
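For illustration only (not part of the patent text), the following minimal NumPy sketch shows this conventional pad-and-copy Same convolution for a single channel and a single kernel, assuming odd kernel sizes; all names are illustrative. The extra padded buffer and the element copy correspond directly to problems (1) and (3) above.

    import numpy as np

    def same_conv_naive(image, kernel):
        # Conventional approach: allocate a padded buffer, copy the image in,
        # then run an ordinary Valid convolution over the enlarged image.
        preH, preW = image.shape
        kernelH, kernelW = kernel.shape
        pH, pW = (kernelH - 1) // 2, (kernelW - 1) // 2
        # Extra (preH+2pH) x (preW+2pW) buffer: the "at least double" memory overhead.
        padded = np.zeros((preH + 2 * pH, preW + 2 * pW), dtype=image.dtype)
        padded[pH:pH + preH, pW:pW + preW] = image  # costly copy of the original data
        out = np.empty((preH, preW), dtype=image.dtype)
        for r in range(preH):
            for c in range(preW):
                out[r, c] = np.sum(padded[r:r + kernelH, c:c + kernelW] * kernel)
        return out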
The vector processor is a novel architecture that offers strong computing power while keeping power consumption low, and it is particularly suitable for accelerating the computation of large convolutional neural networks. As shown in FIG. 1, a vector processor typically includes a scalar processing unit (Scalar Processing Unit, SPU) and a vector processing unit (Vector Processing Unit, VPU). The SPU is responsible for scalar task computation and flow control; the VPU is responsible for vector computation and provides the main computing power, comprising a number of vector processing elements (Vector Processing Element, VPE), each containing several arithmetic functional units such as MAC, ALU, and BP. A data transmission and exchange mechanism between the SPU and the VPU realizes the sharing and communication of scalar and vector data. The vector data access unit supports Load/Store of vector data and provides a large-capacity dedicated vector Array Memory (AM); the off-chip memory is shared for scalar and vector data.
Aiming at the architectural characteristics of vector processors, various vectorization implementation methods for convolution computation already exist, for example the vectorization method for convolutional neural network operations on a vector processor disclosed in Chinese patent application 201810687639.X, the multi-core parallel computing method of convolutional neural networks for GPDSP disclosed in application 201810689646.3, and the vectorization implementation method of two-dimensional matrix convolution on a vector processor disclosed in application 201710201589.5. These schemes all complete the convolution computation by loading the weight data into the vector array memory AM and loading the input image feature data into the scalar memory SM, and they mostly implement the vectorization following the third-dimension (channel) order of data that has not been reordered. However, the following problems arise when such conventional schemes are applied to the vectorization of Same convolution:
1. The weight data cannot be effectively shared, so storage bandwidth is wasted and the computing efficiency of the vector processor cannot be fully exploited.
2. The third dimension is indeterminate and generally does not match the number of processing elements of the vector processor; since the third dimension differs between convolutional neural network models and between convolutional layers, the data-loading efficiency of the above schemes is greatly affected, and they lack generality.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems existing in the prior art, the invention provides a method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network that is simple to operate, has high execution efficiency and precision, and has a small bandwidth requirement, and that can fully exploit the computing performance of a vector processor to realize vectorization of the Same convolution of a multi-sample multi-channel convolutional neural network.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
A method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network comprises the following steps:
step 1: storing data of an input characteristic data set for convolutional neural network calculation in a sample dimension priority mode, and storing data of convolutional kernels in a quantity dimension priority mode of the convolutional kernels;
step 2: the vector processor divides the input characteristic data set data matrix into a plurality of matrix blocks according to columns to obtain a plurality of input characteristic data matrices;
step 3: the vector processor transmits the convolution kernel data matrix to the scalar memory SM of each core each time, and transmits a submatrix composed of V rows of data extracted by rows from the input feature data matrix to the vector array memory AM of each core, where 0 < V <= K and K is the number of pixel data of a single convolution kernel; vectorized matrix multiplication and parallelized matrix multiplication are performed on each core, with the 0-padding operation performed during the calculation, to obtain the output feature data matrix calculation result;
step 4: storing the output feature matrix calculation result in an off-chip memory of a vector processor;
step 5: repeating the steps 3 and 4 until all input characteristic data matrix calculation is completed.
Further, in the step 1, storing the input feature data set data for the convolutional neural network calculation in the sample-dimension-priority manner includes: during the first-layer convolutional neural network calculation, the input feature data set data is reordered so that it is stored contiguously in the off-chip memory of the vector processor as an N×M matrix; the input feature data matrix of every other layer is the output feature matrix of the previous layer's calculation result and is already stored in the off-chip memory of the vector processor in the sample-dimension-priority manner, where M is the total number of samples of the data set and N is the number of input features of a single sample.
Further, storing the data of the convolution kernels in the step 1 in the manner giving priority to the number dimension of the convolution kernels includes: storing the data of the convolution kernels contiguously in the off-chip memory of the vector processor as a K×nextC-order matrix, where nextC is the number of convolution kernels, K = kernelH×kernelW×preC is the number of pixel data of a single convolution kernel, and preC is the number of channels.
Further, in step 2, the input feature data set data matrix is specifically divided into num matrix blocks, where the size of each matrix block is N×MB order, with MB = q×p and M = num×MB, q being the number of cores of the target vector processor and p being the number of vector processing elements VPE of each core.
Further, in the step 3, the total number of times of extracting submatrices composed of V rows of data by rows is nextH×nextW, where:
nextH=(preH+2pH-kernelH+1),nextW=(preW+2pW-kernelW+1)
N=preH*preW*preC,K=kernelH*kernelW*preC
where K is the number of pixel data of a single convolution kernel, N is the number of input features of a single sample, nextH is the height of the output image data, nextW is the width of the output image data, preH and preW are respectively the image height and image width of the two-dimensional image input data of the convolutional neural network of the current calculation layer, and pH and pW are the numbers of 0 elements filled in the height and width directions respectively.
Further, in the step 3, the specific steps of extracting the submatrix and performing the 0-padding operation during the extraction are as follows:
step 3.1.1: construct a vector Z of length K that records whether the row corresponding to each of the K convolution kernel data is a 0 element row; let the execution count t = nextW×r0 + c0, where 0 <= r0 < nextH and 0 <= c0 < nextW, so that the value range of t is {0,1,2,...,nextH×nextW−1};
step 3.1.2: let h0=r0;
step 3.1.3: judging whether h0< (r0+kernelH), if so, turning to step 3.1.4, otherwise, ending and exiting;
step 3.1.4: let w0=c0;
step 3.1.5: judging whether w0 < (c0+kernelW); if so, turning to step 3.1.6, and if not, turning to step 3.1.8;
step 3.1.6: judging whether (h0 > (pH−1) && w0 > (pW−1) && h0 < (preH+pH) && w0 < (preW+pW)) holds; if so, contiguously extract preC rows starting from row pos of the input feature data matrix (pos being the row index of image element (h0−pH, w0−pW) under the storage order of step 1), and at the same time, letting k0 = ((h0−r0)×kernelW + (w0−c0))×preC, set the preC consecutive elements of vector Z starting at position k0 to 1; if not, do not extract from the input feature data matrix, and, with the same k0, set the preC consecutive elements of vector Z starting at position k0 to 0;
step 3.1.7: increasing w0 by 1, and turning to step 3.1.5;
step 3.1.8: h0 is increased by 1, and the process is changed to step 3.1.3.
Further, in the step 3, the specific steps of performing the vectorized matrix multiplication calculation and the parallelized matrix multiplication calculation of each core include:
step 3.2.1: the vector processor transmits the input feature data matrix to an input feature data buffer preset in the vector array memory AM of each core of the vector processor; the size of the input feature data matrix transmitted to each core is V×p order;
step 3.2.2: the vector processor transmits the convolution kernel data matrix to a convolution kernel data buffer preset in the scalar memory SM of each core of the vector processor; the size of the convolution kernel data matrix transmitted to each core is K×nextC order;
step 3.2.3: the scalar processing unit SPU of each core of the vector processor sequentially reads one convolution kernel data element by columns from the convolution kernel data buffer into a scalar register, and judges whether the input feature data matrix row corresponding to the read convolution kernel data is a 0 element row; if so, the calculation result is a 0 vector formed by directly assigning 0 elements, and reading continues with the next element of the column; if not, the element is broadcast to a vector register through a scalar broadcast instruction;
step 3.2.4: the vector processing unit VPU of each core of the vector processor sequentially reads one row of input feature data from the input feature data buffer into a vector register and performs a multiply-accumulate calculation with the vector register obtained in step 3.2.3;
step 3.2.5: judging whether all K element data in one column of the convolution kernel data matrix have been traversed; if not, jump to step 3.2.3, moving the reading position in step 3.2.3 to the next element and the reading position in step 3.2.4 to the next row; if so, each core has completed the calculation of the p output feature data corresponding to that column, p being the number of vector processing elements VPE of each core, and the process jumps to step 3.2.6;
step 3.2.6: judging whether all nextC columns of the convolution kernel data have been traversed; if not, jump to step 3.2.3, moving the reading position in step 3.2.3 to the head address of the next column and returning the reading position in step 3.2.4 to the initial address of the input feature data buffer; if so, all nextC columns have been traversed and the vector processor has completed the calculation of the nextC×MB-order output feature data.
Further, in the step 3.2.1, when V < K the V×p-order input feature data matrix transmitted to each core is short by (K−V) rows; these (K−V) rows are rows of 0 elements, and the corresponding calculation result in step 3.2.3 is a 0 vector formed by directly assigning 0 elements; in the step 3.2.2, the input convolution kernel data matrix is thus effectively multiplied with a K×p matrix.
Further, two data buffers are specifically provided in the step 3.2.1 and/or the step 3.2.2; while calculation is performed on one of the two data buffers, data transmission is performed on the other.
Further, in step 4, the output feature matrix calculation result is stored in an off-chip memory of the vector processor in a sample dimension priority manner.
Compared with the prior art, the invention has the advantages that:
1. The method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network can realize this vectorization based on the structural characteristics of the vector processor and the characteristics of the Same convolution: the input feature data set is stored in the sample-dimension-priority manner, the data of the convolution kernels is stored in the manner giving priority to the number dimension of the convolution kernels, and the convolution kernel data of each core is transmitted through the scalar memory and broadcast to the vector processing unit for calculation; this realizes sharing of the convolution kernel data, greatly reduces the amount of transmitted calculation data, significantly reduces the bandwidth requirement of the convolution kernel data, and at the same time shortens the transmission time of the convolution kernel data;
2. According to the method, no copy of the original image data is needed; the complex multi-loop Same convolution operation is converted into efficient vectorized and parallelized matrix multiplications, which fully exploits the high efficiency of the vector processor for matrix multiplication and its suitability for vectorized and parallelized calculation, so that the computing performance of the vector processor is brought into full play to realize the Same convolution vectorization;
3. According to the method, aiming at the characteristics of the Same convolution, the 0 elements are supplied during the calculation process, so they never need to be actually stored and no real 0-padding write or read operations are needed; thus no extra memory overhead is incurred and the calculations corresponding to the 0 elements are reduced;
4. According to the method, each convolution kernel data element is expanded into vector data and vector multiply-accumulate calculations are performed simultaneously with all input feature data, which fully exploits the SIMD and inter-core parallelism of the vector processor and greatly improves the calculation efficiency of the convolutional neural network;
5. According to the method, all input feature data of the same sample are stored in one column, and all multiply-accumulate calculations between the convolution kernel data and that input feature data run on the same VPE processing element, which avoids reduction summation across multiple processing elements and improves the overall calculation efficiency of the vector processor;
6. The method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network can balance efficiency and accuracy, and also supports convenient and flexible setting of the Mini-batch size.
Drawings
FIG. 1 is a general architecture diagram of a vector processor.
Fig. 2 is a detailed flowchart of a method for implementing the Same convolution vectorization of the multi-sample multi-channel convolution neural network according to this embodiment.
Fig. 3 is a schematic diagram of the present invention in a specific application embodiment for reordering input feature dataset data.
FIG. 4 is a detailed flowchart of the step 3 sub-matrix extraction and 0-filling operation in an embodiment of the present invention.
Detailed Description
The invention is further described below in connection with the drawings and the specific preferred embodiments, but the scope of protection of the invention is not limited thereby.
Let the number of cores of the target vector processor be q, the number of vector processing elements VPE of each core be p, the total number of samples of the data set be M, and the Mini-batch size be MB, where MB = q×p, M = num×MB, and num is a positive integer. Let the two-dimensional image input data of the convolutional neural network of the current calculation layer be preH×preW, where preH is the image height and preW is the image width, the number of channels be preC, the convolution kernel size be kernelH×kernelW×preC with kernelH and kernelW both odd, the number of convolution kernels be nextC, and the step size of the convolution calculation be 1.
In order to keep the output image the same size as the input image after the convolution operation, let the numbers of 0 elements filled in the height and width directions be pH and pW respectively, with pH > 0 and pW > 0. The preH×preW image is enlarged to (preH+2pH)×(preW+2pW), and the output image is (preH+2pH−kernelH+1)×(preW+2pW−kernelW+1). Requiring the output and input sizes to be equal, (preH+2pH−kernelH+1) = preH gives pH = (kernelH−1)/2, and (preW+2pW−kernelW+1) = preW gives pW = (kernelW−1)/2. In this embodiment kernelH and kernelW take odd values such as 1, 3, 5, 7, and 11 (i.e., 1×1, 3×3, 5×5, 7×7, 11×11 kernels), so the output image can always be kept the same size as the input image as long as the proper filling size is selected.
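As a quick illustration (not part of the patent text), the padding derivation can be checked numerically; the concrete values below are arbitrary examples assuming odd kernel sizes:

    # Check that pH = (kernelH-1)/2 and pW = (kernelW-1)/2 keep the output
    # size equal to the input size; all concrete numbers are illustrative.
    kernelH, kernelW, preH, preW = 3, 5, 32, 32
    pH, pW = (kernelH - 1) // 2, (kernelW - 1) // 2
    assert preH + 2 * pH - kernelH + 1 == preH   # output height == input height
    assert preW + 2 * pW - kernelW + 1 == preW   # output width  == input width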
As shown in fig. 2, the detailed steps of the method for implementing the multiple-sample multi-channel convolutional neural network Same convolution vectorization in this embodiment include:
step 1: the data of the input characteristic data set used for the calculation of the convolution neural network are stored in a sample dimension priority mode, and the data of the convolution kernels are stored in a quantity dimension priority mode.
Storing the input feature data set data for the convolutional neural network calculation in the sample-dimension-priority manner specifically comprises the following: during the first-layer convolutional neural network calculation, the input feature data set data is reordered so that it is stored contiguously in the off-chip memory of the vector processor as an N×M matrix; the input feature data matrix of every other layer is the output feature matrix of the previous layer's calculation result and is already stored in the off-chip memory of the vector processor in the sample-dimension-priority manner, where M is the total number of samples of the data set and N is the number of input features of a single sample.
In a specific application embodiment (preH=2, preW=2, number of channels preC=3, total number of samples M), the input feature data set data is reordered during the first-layer convolutional neural network calculation as shown in fig. 3, where fig. 3 (a) shows the input feature data of the M samples before reordering and fig. 3 (b) shows the reordered input feature data set matrix; the rearranged input feature data set data is stored in the sample-dimension-priority manner.
Since the input features of a single sample are stored in the sample-dimension-priority manner, each column of the N×M-order matrix stores the input features of a single sample, and the storage order within a column is channel (preC) direction first, then image width (preW) direction, and finally image height (preH) direction. An element of the input feature data matrix is denoted x[i][m], where the column coordinate m denotes the (m+1)-th sample, with value range {0,1,2,...,M−1}, and the row coordinate i denotes the (i+1)-th input feature value of the sample, with value range {0,1,2,...,preH×preW×preC−1}.
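For illustration only, a minimal NumPy sketch of this reordering follows; the array shapes and names are assumptions, matching the Fig. 3 example of preH=preW=2, preC=3:

    import numpy as np

    def reorder_inputs(samples):
        # samples: (M, preH, preW, preC); flattening each sample in C order makes
        # the channel direction fastest, then width, then height, as stated above.
        M = samples.shape[0]
        return samples.reshape(M, -1).T   # N x M matrix; column m holds sample m

    X = reorder_inputs(np.arange(2 * 2 * 2 * 3).reshape(2, 2, 2, 3))
    print(X.shape)   # (12, 2): N = preH*preW*preC = 12 features per sample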
Storing the data of the convolution kernels in the manner giving priority to the number dimension of the convolution kernels specifically comprises the following: storing the data of the convolution kernels contiguously in the off-chip memory of the vector processor as a K×nextC-order matrix, where nextC is the number of convolution kernels, K = kernelH×kernelW×preC is the number of pixel data of a single convolution kernel, and preC is the number of channels.
Since the pixel data of a single convolution kernel is stored in the manner giving priority to the number dimension of the convolution kernels, each column of the K×nextC-order matrix stores the pixel data of a single convolution kernel, and the storage order within a column is channel (preC) direction first, then kernel width (kernelW) direction, and finally kernel height (kernelH) direction. An element of the convolution kernel data matrix is denoted w[j][c], where the column coordinate c denotes the (c+1)-th convolution kernel, with value range {0,1,2,...,nextC−1}, and the row coordinate j denotes the (j+1)-th pixel data value of the kernel, with value range {0,1,2,...,kernelH×kernelW×preC−1}. An element of the bias data column vector is denoted b[c], where the coordinate c corresponds to the (c+1)-th convolution kernel, with value range {0,1,2,...,nextC−1}; each convolution kernel has one bias data value.
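The analogous sketch for the kernel matrix (again illustrative, with assumed array shapes):

    import numpy as np

    def reorder_kernels(kernels):
        # kernels: (nextC, kernelH, kernelW, preC); column c of the result holds
        # kernel c with channel fastest, then kernel width, then kernel height.
        nextC = kernels.shape[0]
        return kernels.reshape(nextC, -1).T   # K x nextC matrix

    W = reorder_kernels(np.zeros((4, 3, 3, 3)))
    print(W.shape)   # (27, 4): K = kernelH*kernelW*preC = 27, nextC = 4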
Step 2: the vector processor divides the input characteristic data set data matrix into a plurality of matrix blocks according to columns to obtain a plurality of input characteristic data matrices.
Specifically, the input feature data set data matrix is divided into num matrix blocks, where the size of each matrix block is N×MB order, with MB = q×p and M = num×MB, yielding num N×MB-order input feature data matrices.
Step 3: the vector processor transmits the convolution kernel data matrix to the scalar memory SM of each kernel at a time, and transmits a sub-matrix consisting of V data extracted from the input feature data matrix by rows to the vector array memory AM of each kernel, wherein 0<V < = K, and K is the number of pixel data of a single convolution kernel; and performing vectorization matrix multiplication calculation and parallelization matrix multiplication calculation of each core, and performing 0 supplementing operation in the calculation process to obtain an output characteristic data matrix calculation result.
Specifically, V rows of data are extracted by rows from the input feature data matrix each time to form a submatrix, which is transmitted to the vector array memory AM of each core, where 0 < V <= K, and the 0-element padding operation is performed during the calculation. The vector processor transmits the K×nextC-order convolution kernel data matrix to the scalar memory SM of each core each time, extracts by rows from the N×MB-order input feature data set data matrix a V×MB-order submatrix of input feature data set data composed of V rows, and transmits it to the vector array memory AM of each core; the 0-element padding operation is realized during the calculation, and the nextC×MB-order output feature data matrix is then obtained through the scalar-vector cooperative vectorized matrix multiplication and the parallelized matrix multiplication on each core; this nextC×MB-order output feature data matrix is added in parallel with the bias data column vector, and the calculation result is the nextC×MB-order output feature matrix.
The addition of the nextC×MB-order output feature data matrix and the bias data column vector is specifically performed by adding, element by element, the bias data column vector to each column vector of the nextC×MB-order output feature matrix.
The total number of times a V×MB-order submatrix of input feature data set data composed of V rows is extracted by rows from the N×MB-order input feature data set data matrix is nextH×nextW, where:
nextH=(preH+2pH-kernelH+1),nextW=(preW+2pW-kernelW+1)
N=preH*preW*preC,K=kernelH*kernelW*preC
where K is the number of pixel data of a single convolution kernel, N is the number of input features of a single sample, nextH is the height of the output image data, and nextW is the width of the output image data.
Step 4: the output feature matrix calculation results are stored in an off-chip memory of the vector processor.
The output feature matrix calculation result is stored in an off-chip memory of the vector processor in a sample dimension priority mode.
The nextC×MB-order output feature matrix calculation result obtained in step 3 is stored in the off-chip memory of the vector processor in the sample-dimension-priority manner.
Step 5: repeating the steps 3 and 4 until all input characteristic data matrix calculation is completed.
The output feature data matrix obtained after the above steps are completed is an S×M-order matrix, where M is the total number of samples of the data set and S = nextH×nextW×nextC is the number of output features of a single sample. The S×M-order output feature matrix is stored contiguously in the off-chip memory of the vector processor in the sample-dimension-priority manner, i.e., each column of the S×M-order matrix stores the output features of a single sample, and the storage order within a column is channel (nextC) direction first, then image width (nextW) direction, and finally image height (nextH) direction. An element of the output feature data matrix is denoted a[j][m], where the column coordinate m denotes the (m+1)-th sample, with value range {0,1,2,...,M−1}, and the row coordinate j denotes the (j+1)-th output feature value of the sample, with value range {0,1,2,...,nextH×nextW×nextC−1}.
In this embodiment, after steps 1 to 5 are executed, the obtained result is the calculation result of the current layer of the convolutional neural network; its storage keeps the sample-dimension-priority manner required by step 1, so the output feature data matrix directly provides the input feature data matrix for the convolutional neural network calculation of the subsequent layer.
According to the method, the data of the input feature data set is stored in the sample-dimension-priority manner, the data of the convolution kernels is stored in the manner giving priority to the number dimension of the convolution kernels, and the convolution kernel data of each core is transmitted through the scalar memory and broadcast to the vector processing unit for calculation. This realizes sharing of the convolution kernel data, greatly reduces the amount of transmitted calculation data, significantly reduces the bandwidth requirement of the convolution kernel data, and at the same time shortens its transmission time. No copy of the original image data is needed: the complex multi-loop convolution calculation of the convolutional neural network is converted into nextH×nextW efficient vectorized and parallelized matrix multiplications, which fully exploits the high efficiency of the vector processor for matrix multiplication and its suitability for vectorized and parallelized calculation. Aiming at the characteristics of the Same convolution, the 0 elements are supplied during the calculation process, so the 0 elements never need to be actually stored and no real 0-padding write or read operations are needed; thus no extra memory overhead is incurred, and the calculations corresponding to the 0 elements are avoided.
According to the method, each convolution kernel data element is expanded into vector data and a vector multiply-accumulate calculation is performed simultaneously with all input feature data, which fully exploits the SIMD and inter-core parallelism of the vector processor and greatly improves the calculation efficiency of the convolutional neural network. All input feature data of the same sample are stored in one column, and all multiply-accumulate calculations between the convolution kernel data and that input feature data run on the same VPE processing element, which avoids reduction summation across multiple processing elements and improves the overall calculation efficiency of the vector processor; the method can balance efficiency and accuracy, and also supports convenient and flexible setting of the Mini-batch size.
As shown in fig. 4, in a specific application embodiment, the steps of extracting the submatrix in the step 3 and performing the 0-padding operation during the extraction are as follows:
step 3.1.1: construct a vector Z of length K that records whether the row corresponding to each of the K convolution kernel data is a 0 element row; let the execution count t = nextW×r0 + c0, where 0 <= r0 < nextH and 0 <= c0 < nextW, so that the value range of t is {0,1,2,...,nextH×nextW−1};
step 3.1.2: let h0=r0;
step 3.1.3: judging whether h0< (r0+kernelH), if so, turning to the step 3.1.4, otherwise, ending;
step 3.1.4: let w0=c0;
step 3.1.5: judging whether w0 < (c0+kernelW); if so, turning to step 3.1.6, and if not, turning to step 3.1.8;
step 3.1.6: judging whether (h0 > (pH−1) && w0 > (pW−1) && h0 < (preH+pH) && w0 < (preW+pW)) holds; if so, contiguously extract preC rows starting from row pos of the input feature data matrix (pos being the row index of image element (h0−pH, w0−pW) under the storage order of step 1), and at the same time, letting k0 = ((h0−r0)×kernelW + (w0−c0))×preC, set the preC consecutive elements of vector Z starting at position k0 to 1; if not, do not extract from the input feature data matrix, and, with the same k0, set the preC consecutive elements of vector Z starting at position k0 to 0;
step 3.1.7: increasing w0 by 1, and turning to step 3.1.5;
step 3.1.8: h0 is increased by 1, and the process is changed to step 3.1.3.
Through the above steps, V rows of data are extracted by rows from the N×MB-order input feature data set data matrix to form the V×MB-order submatrix of input feature data set data, and the 0-element padding operation is completed during the calculation. Because the 0 elements are supplied during the calculation process, they never need to be actually stored and no real 0-padding write or read operations are needed, thereby avoiding extra memory overhead and reducing the calculations corresponding to the 0 elements.
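For illustration only, a NumPy sketch of steps 3.1.1 to 3.1.8 for one execution count t follows. The names and the formulas for k0 and pos are reconstructions from the storage order stated in step 1 and are assumptions, not the patent's literal text; for clarity the sketch materializes the 0 rows in the submatrix, whereas the method itself avoids storing them:

    import numpy as np

    def extract_submatrix(X, t, preH, preW, preC, kernelH, kernelW, pH, pW):
        # X: N x MB input feature block from step 2. Returns the K x MB
        # submatrix (0 rows included for clarity) and the mask vector Z.
        MB = X.shape[1]
        K = kernelH * kernelW * preC
        nextW = preW + 2 * pW - kernelW + 1
        r0, c0 = divmod(t, nextW)               # t = nextW*r0 + c0 (step 3.1.1)
        Z = np.zeros(K, dtype=np.int8)           # Z[k] = 0 marks a 0 element row
        sub = np.zeros((K, MB), dtype=X.dtype)
        for h0 in range(r0, r0 + kernelH):       # steps 3.1.2, 3.1.3, 3.1.8
            for w0 in range(c0, c0 + kernelW):   # steps 3.1.4, 3.1.5, 3.1.7
                # Assumed index formulas implied by the channel-then-width-
                # then-height storage order:
                k0 = ((h0 - r0) * kernelW + (w0 - c0)) * preC
                if pH - 1 < h0 < preH + pH and pW - 1 < w0 < preW + pW:
                    pos = ((h0 - pH) * preW + (w0 - pW)) * preC
                    sub[k0:k0 + preC, :] = X[pos:pos + preC, :]   # real data
                    Z[k0:k0 + preC] = 1
                # else: the preC rows stay 0 and Z stays 0 (no storage needed)
        return sub, Z

    X = np.arange(2 * 2 * 3 * 4).reshape(12, 4).astype(float)   # N=12, MB=4
    sub, Z = extract_submatrix(X, t=0, preH=2, preW=2, preC=3,
                               kernelH=3, kernelW=3, pH=1, pW=1)
    print(sub.shape, int(Z.sum()))   # (27, 4) and 12 real (non-0) rows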
In a specific application embodiment, in the step 3, the specific steps of performing vectorization matrix multiplication computation and parallelization matrix multiplication computation of each core include:
step 3.2.1: the vector processor transmits the input characteristic data matrix to an input characteristic data buffer area preset in a vector array memory AM of each core of the vector processor, and the size of the input characteristic data matrix transmitted by each core is V x p order;
step 3.2.2: the vector processor transmits the convolution kernel data matrix to a convolution kernel data buffer preset in the scalar memory SM of each core of the vector processor respectively; the size of the convolution kernel data matrix transmitted to each core is K×nextC order;
step 3.2.3: the scalar processing unit SPU of each core of the vector processor sequentially reads one convolution kernel data element by columns from the convolution kernel data buffer into a scalar register, and judges whether the input feature data matrix row corresponding to the read convolution kernel data is a 0 element row; if so, the calculation result is a 0 vector formed by directly assigning 0 elements, and reading continues with the next element of the column; if not, the element is broadcast to a vector register through a scalar broadcast instruction;
step 3.2.4: the vector processing unit VPU of each core of the vector processor sequentially reads one row of input feature data from the input feature data buffer into a vector register and performs a multiply-accumulate calculation with the vector register obtained in step 3.2.3;
step 3.2.5: judging whether all K element data in one column of the convolution kernel data matrix have been traversed; if not, jump to step 3.2.3, moving the reading position in step 3.2.3 to the next element and the reading position in step 3.2.4 to the next row; if so, each core has completed the calculation of the p output feature data corresponding to that column, and the process jumps to step 3.2.6;
step 3.2.6: judging whether all nextC columns of the convolution kernel data have been traversed; if not, jump to step 3.2.3, moving the reading position in step 3.2.3 to the head address of the next column and returning the reading position in step 3.2.4 to the initial address of the input feature data buffer; if so, all nextC columns have been traversed and the vector processor has completed the calculation of the nextC×MB-order output feature data.
In the step 3.2.1, the input feature data matrix transmitted to each core is of V×p order; when V < K, the missing (K−V) rows are rows of 0 elements, the corresponding calculation needs no actual computation, and the calculation result obtained in step 3.2.3 is a 0 vector formed by directly assigning 0 elements; in step 3.2.2, the input convolution kernel data matrix is thus effectively multiplied with a K×p matrix.
In this embodiment, two data buffers may specifically be provided for each of the steps 3.2.1 and 3.2.2; while calculation is performed on one data buffer, data transmission is performed on the other, so that data transmission overlaps with calculation time and calculation efficiency is further improved.
In the step 3.2.3 of this embodiment, whether the input feature data matrix row corresponding to the convolution kernel data is a 0 element row is determined according to the vector Z, specifically as follows: the judgment is made from the element values of the vector Z; letting k be the position of the read convolution kernel data within its column, if Z[k] = 0 the corresponding input feature data matrix row is a 0 element row, and if Z[k] = 1 it is not a 0 element row.
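For illustration only, a NumPy sketch of the per-core loop of steps 3.2.3 to 3.2.6 follows, reusing the Z mask from the extraction sketch above; the function and variable names are assumptions, and skipping Z[k] == 0 stands in for the direct assignment of a 0 vector:

    import numpy as np

    def core_matmul(Wk, sub, Z, b):
        # Wk: K x nextC kernel matrix in SM; sub: K x p input block in AM
        # (0 rows included); Z: length-K mask; b: length-nextC bias vector.
        K, nextC = Wk.shape
        p = sub.shape[1]                         # one output lane per VPE
        out = np.zeros((nextC, p), dtype=sub.dtype)
        for c in range(nextC):                   # step 3.2.6: all nextC columns
            acc = np.zeros(p, dtype=sub.dtype)   # vector register of p lanes
            for k in range(K):                   # step 3.2.5: K elements/column
                if Z[k] == 0:                    # step 3.2.3: 0 row contributes 0
                    continue
                acc += Wk[k, c] * sub[k, :]      # scalar broadcast * vector row
            out[c, :] = acc + b[c]               # parallel addition of the bias
        return out

    out = core_matmul(np.ones((27, 4)), np.ones((27, 8)),
                      np.ones(27, dtype=np.int8), np.zeros(4))
    print(out.shape)   # (4, 8): nextC rows, p columns (one per VPE)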
In a specific application embodiment, if the vector processor provides SIMD instructions that process multiple data elements in parallel, the method further includes adjusting the value of p to d×p according to the number of data bits of the image elements to be calculated, with the corresponding MB likewise becoming d times its original value, where d denotes the number of image data elements each VPE can process simultaneously through a SIMD instruction. When determining the value of d: for a 64-bit processor with a word length of 64 bits, if the image elements to be calculated have 64, 32, 16, or 8 data bits, the corresponding d values are 64/64 = 1, 64/32 = 2, 64/16 = 4, and 64/8 = 8 respectively; for a 32-bit processor with a word length of 32 bits, if the image elements have 32, 16, or 8 data bits, the corresponding d values are 32/32 = 1, 32/16 = 2, and 32/8 = 4 respectively.
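The adjustment above is, in sketch form, simple integer arithmetic (illustrative values only):

    # d = processor word length in bits / image element width in bits
    word_bits = 64
    for elem_bits in (64, 32, 16, 8):
        print(elem_bits, "->", word_bits // elem_bits)   # 1, 2, 4, 8 as stated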
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention shall fall within the scope of the technical solution of the present invention.

Claims (9)

1. A method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network, characterized by comprising the following steps:
step 1: storing data of an input characteristic data set for convolutional neural network calculation in a sample dimension priority mode, and storing data of convolutional kernels in a quantity dimension priority mode of the convolutional kernels;
step 2: the vector processor divides the input characteristic data set data matrix into a plurality of matrix blocks according to columns to obtain a plurality of input characteristic data matrices;
step 3: the vector processor transmits the convolution kernel data matrix to the scalar memory SM of each core each time, and transmits a submatrix composed of V rows of data extracted by rows from the input feature data matrix to the vector array memory AM of each core, where 0 < V <= K and K is the number of pixel data of a single convolution kernel; vectorized matrix multiplication and parallelized matrix multiplication are performed on each core, with the 0-padding operation performed during the calculation, to obtain the output feature data matrix calculation result;
step 4: storing the output characteristic data matrix calculation result in an off-chip memory of a vector processor;
step 5: repeating the steps 3 and 4 until all input characteristic data matrix calculation is completed;
in the step 3, the specific steps of performing vectorized matrix multiplication calculation and parallelized matrix multiplication calculation of each core include:
step 3.2.1: the vector processor transmits the input feature data matrix to an input feature data buffer preset in the vector array memory AM of each core of the vector processor; the size of the input feature data matrix transmitted to each core is V×p order, p being the number of vector processing elements VPE of each core;
step 3.2.2: the vector processor transmits the convolution kernel data matrix to a convolution kernel data buffer preset in the scalar memory SM of each core of the vector processor; the size of the convolution kernel data matrix transmitted to each core is K×nextC order, nextC being the number of convolution kernels;
step 3.2.3: the scalar processing unit SPU of each core of the vector processor sequentially reads one convolution kernel data element by columns from the convolution kernel data buffer into a scalar register, and judges whether the input feature data matrix row corresponding to the read convolution kernel data is a 0 element row; if so, the calculation result is a 0 vector formed by directly assigning 0 elements, and reading continues with the next element of the column; if not, the element is broadcast to a vector register through a scalar broadcast instruction;
step 3.2.4: the vector processing unit VPU of each core of the vector processor sequentially reads one row of input feature data from the input feature data buffer into a vector register and performs a multiply-accumulate calculation with the vector register obtained in step 3.2.3;
step 3.2.5: judging whether all K element data in one column of the convolution kernel data matrix have been traversed; if not, jump to step 3.2.3, moving the reading position in step 3.2.3 to the next element and the reading position in step 3.2.4 to the next row; if so, each core has completed the calculation of the p output feature data corresponding to that column, p being the number of vector processing elements VPE of each core, and the process jumps to step 3.2.6;
step 3.2.6: judging whether all nextC columns of the convolution kernel data have been traversed; if not, jump to step 3.2.3, moving the reading position in step 3.2.3 to the head address of the next column and returning the reading position in step 3.2.4 to the initial address of the input feature data buffer; if so, all nextC columns have been traversed and the vector processor has completed the calculation of the nextC×MB-order output feature data.
2. The method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network according to claim 1, wherein storing the input feature data set data for the convolutional neural network calculation in the sample-dimension-priority manner in the step 1 comprises: during the first-layer convolutional neural network calculation, the input feature data set data is reordered so that it is stored contiguously in the off-chip memory of the vector processor as an N×M matrix; the input feature data matrix of every other layer is the output feature matrix of the previous layer's calculation result and is already stored in the off-chip memory of the vector processor in the sample-dimension-priority manner, where M is the total number of samples of the data set and N is the number of input features of a single sample.
3. The method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network according to claim 2, wherein storing the data of the convolution kernels in the manner giving priority to the number dimension of the convolution kernels in step 1 comprises: storing the data of the convolution kernels contiguously in the off-chip memory of the vector processor as a K×nextC-order matrix, where K = kernelH×kernelW×preC is the number of pixel data of a single convolution kernel and preC is the number of channels.
4. The method of claim 3, wherein in step 2 the input feature data set data matrix is specifically divided into num matrix blocks, the size of each matrix block being N×MB order, where MB = q×p, M = num×MB, q is the number of cores of the target vector processor, and p is the number of vector processing elements VPE of each core.
5. The method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network according to any one of claims 1-4, wherein in the step 3 the total number of times of extracting submatrices composed of V rows of data by rows is nextH×nextW, wherein:
nextH=(preH+2pH-kernelH+1),nextW=(preW+2pW-kernelW+1)
N=preH*preW*preC,K=kernelH*kernelW*preC
where K is the number of pixel data of a single convolution kernel, N is the number of input features of a single sample, nextH is the height of the output image data, nextW is the width of the output image data, preH and preW are respectively the image height and image width of the two-dimensional image input data of the convolutional neural network of the current calculation layer, and pH and pW are the numbers of 0 elements filled in the height and width directions respectively.
6. The method for implementing the multiple-sample multiple-channel convolutional neural network Same convolutional vectorization according to any one of claims 1-4, wherein in the step 3, the specific step of extracting the submatrix is as follows:
step 3.1.1: construct a vector Z of length K that records whether the row corresponding to each of the K convolution kernel data is a 0 element row; let the execution count t = nextW×r0 + c0, where 0 <= r0 < nextH and 0 <= c0 < nextW, so that the value range of t is {0,1,2,...,nextH×nextW−1};
step 3.1.2: let h0=r0;
step 3.1.3: judging whether h0< (r0+kernelH), if so, turning to step 3.1.4, otherwise, ending and exiting;
step 3.1.4: let w0=c0;
step 3.1.5: judging whether w0 < (c0+kernelW); if so, turning to step 3.1.6, and if not, turning to step 3.1.8;
step 3.1.6: judging whether (h0 > (pH−1) && w0 > (pW−1) && h0 < (preH+pH) && w0 < (preW+pW)) holds; if so, contiguously extract preC rows starting from row pos of the input feature data matrix (pos being the row index of image element (h0−pH, w0−pW) under the storage order of step 1), and at the same time, letting k0 = ((h0−r0)×kernelW + (w0−c0))×preC, set the preC consecutive elements of vector Z starting at position k0 to 1; if not, do not extract from the input feature data matrix, and, with the same k0, set the preC consecutive elements of vector Z starting at position k0 to 0;
step 3.1.7: increasing w0 by 1, and turning to step 3.1.5;
step 3.1.8: h0 is increased by 1, and the process is changed to step 3.1.3.
7. The method according to claim 1, wherein in the step 3.2.1, when V < K the V×p-order input feature data matrix transmitted to each core is short by (K−V) rows; these (K−V) rows are rows of 0 elements, and the corresponding calculation result in step 3.2.3 is a 0 vector formed by directly assigning 0 elements; in the step 3.2.2, the input convolution kernel data matrix is thus effectively multiplied with a K×p matrix.
8. The method for realizing Same convolution vectorization of a multi-sample multi-channel convolutional neural network according to claim 1, wherein two data buffers are specifically provided in the step 3.2.1 and/or the step 3.2.2, and while calculation is performed on one of the two data buffers, data transmission is performed on the other.
9. The method for implementing the multiple-sample multiple-channel convolutional neural network Same convolutional vectorization according to any one of claims 1-4, wherein in step 4, the output feature matrix calculation result is stored in an off-chip memory of a vector processor in a sample dimension priority manner.
CN201911000690.XA 2019-10-21 2019-10-21 Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network Active CN110807170B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911000690.XA CN110807170B (en) 2019-10-21 2019-10-21 Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911000690.XA CN110807170B (en) 2019-10-21 2019-10-21 Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network

Publications (2)

Publication Number Publication Date
CN110807170A CN110807170A (en) 2020-02-18
CN110807170B (en) 2023-06-27

Family

ID=69488692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911000690.XA Active CN110807170B (en) 2019-10-21 2019-10-21 Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network

Country Status (1)

Country Link
CN (1) CN110807170B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539526B (en) * 2020-04-24 2022-12-06 苏州浪潮智能科技有限公司 Neural network convolution method and device
CN113743602B (en) * 2020-05-27 2024-05-03 合肥君正科技有限公司 Method for improving post-processing speed of model
CN112465932A (en) * 2020-12-10 2021-03-09 上海眼控科技股份有限公司 Image filling method, device, equipment and storage medium
CN113469350B (en) * 2021-07-07 2023-03-24 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU
CN114048839A (en) * 2021-10-28 2022-02-15 深圳云天励飞技术股份有限公司 Acceleration method and device for convolution calculation and terminal equipment
CN113722669B (en) * 2021-11-03 2022-01-21 海光信息技术股份有限公司 Data processing method, device, equipment and storage medium
WO2023155136A1 (en) * 2022-02-18 2023-08-24 Intel Corporation Dynamic triplet convolution for convolutional neural networks

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440121A (en) * 2013-08-20 2013-12-11 中国人民解放军国防科学技术大学 Triangular matrix multiplication vectorization method of vector processor
WO2019119301A1 (en) * 2017-12-20 2019-06-27 华为技术有限公司 Method and device for determining feature image in convolutional neural network model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860812B (en) * 2016-04-29 2024-03-01 中科寒武纪科技股份有限公司 Apparatus and method for performing convolutional neural network training
CN106295678B (en) * 2016-07-27 2020-03-06 北京旷视科技有限公司 Neural network training and constructing method and device and target detection method and device
CN106970896B (en) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
CN109325589B (en) * 2017-07-31 2021-06-15 华为技术有限公司 Convolution calculation method and device
CN108205702B (en) * 2017-12-29 2020-12-01 中国人民解放军国防科技大学 Parallel processing method for multi-input multi-output matrix convolution
KR102065672B1 (en) * 2018-03-27 2020-01-13 에스케이텔레콤 주식회사 Apparatus and method for convolution operation
CN108985450B (en) * 2018-06-28 2019-10-29 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
CN109086244A (en) * 2018-07-11 2018-12-25 中国人民解放军国防科技大学 Matrix convolution vectorization implementation method based on vector processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440121A (en) * 2013-08-20 2013-12-11 中国人民解放军国防科学技术大学 Triangular matrix multiplication vectorization method of vector processor
WO2019119301A1 (en) * 2017-12-20 2019-06-27 华为技术有限公司 Method and device for determining feature image in convolutional neural network model

Also Published As

Publication number Publication date
CN110807170A (en) 2020-02-18

Similar Documents

Publication Publication Date Title
CN110807170B (en) Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network
CN110796235B (en) Vectorization implementation method for Valid convolution of convolutional neural network
US10936937B2 (en) Convolution operation device and convolution operation method
CN110796236B (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
US20180121388A1 (en) Symmetric block sparse matrix-vector multiplication
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
CN113076521B (en) Reconfigurable architecture method based on GPGPU and computing system
US11188331B2 (en) Processor instruction specifying indexed storage region holding control data for swizzle operation
US11164032B2 (en) Method of performing data processing operation
EP3844610B1 (en) Method and system for performing parallel computation
US20220004840A1 (en) Convolutional neural network-based data processing method and device
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
GB2602708A (en) Processing data of a neural network
CN112005251A (en) Arithmetic processing device
CN110414672B (en) Convolution operation method, device and system
KR20220134035A (en) Processing-in-memory method for convolution operations
US20230376733A1 (en) Convolutional neural network accelerator hardware
CN110766157B (en) Multi-sample neural network forward propagation vectorization implementation method
KR101672539B1 (en) Graphics processing unit and caching method thereof
CN116051345A (en) Image data processing method, device, computer equipment and readable storage medium
CN114330687A (en) Data processing method and device and neural network processing device
CN113392959A (en) Method for reconstructing architecture in computing system and computing system
CN112241509A (en) Graphics processor and method for accelerating the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant