WO2022067508A1 - Neural network accelerator, acceleration method and apparatus - Google Patents

Neural network accelerator, acceleration method and apparatus

Info

Publication number
WO2022067508A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
feature map
transformation
winograd
neural network
Prior art date
Application number
PCT/CN2020/118832
Other languages
English (en)
French (fr)
Inventor
辛晨
袁宏辉
李震桁
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to PCT/CN2020/118832 priority Critical patent/WO2022067508A1/zh
Priority to CN202080105218.7A priority patent/CN116113941A/zh
Priority to EP20955534.1A priority patent/EP4213070A4/en
Publication of WO2022067508A1 publication Critical patent/WO2022067508A1/zh
Priority to US18/191,134 priority patent/US20230236891A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/144Prime factor Fourier transforms, e.g. Winograd transforms, number theoretic transforms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • G06F17/153Multidimensional correlation or convolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/78Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0495Quantised networks; Sparse networks; Compressed networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of neural networks, and in particular, to a neural network accelerator, an acceleration method, and an apparatus.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that responds in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • Neural network belongs to the connectionism school in the field of artificial intelligence. It is a mathematical model that uses a structure similar to the synaptic connection of the brain for information processing.
  • the computations involved in the neural network mainly include convolution operations, activation operations, and pooling operations, among which the convolution operations take up most of the processing time of the neural network.
  • A convolution operation method based on the winograd algorithm can complete an equivalent convolution operation task by performing specific matrix transformations on the input feature map and the weights, greatly reducing the number of multiplications in the convolution process.
  • current accelerators that incorporate the winograd algorithm for acceleration generally require substantial modifications to core operation modules such as the matrix operation module and the vector operation module in the neural network, and the design is complicated.
  • the embodiment of the present application provides a neural network accelerator.
  • the neural network accelerator is based on the winograd algorithm, and the winograd algorithm can be applied to the neural network by using the conventional matrix operation module and vector operation module in the neural network.
  • the number of multiplications can be greatly reduced, and the performance and energy efficiency ratio of the accelerator can be improved.
  • a first aspect of the present application provides a neural network accelerator, comprising: a preprocessing module for performing a first winograd forward transformation on a target matrix corresponding to an input feature map to obtain a transformed target matrix. Performing the first winograd forward transformation on the target matrix can be understood as left-multiplying the target matrix by the B^T matrix and right-multiplying it by the B matrix to obtain the transformed target matrix.
  • the preprocessing module is also used to perform the second winograd forward transformation on the convolution kernel to obtain the transformed convolution kernel.
  • the second winograd forward transformation on the convolution kernel can be understood as left-multiplying the convolution kernel by the G matrix and right-multiplying it by the G^T matrix to obtain the transformed convolution kernel.
  • the matrix operation module is used to perform matrix multiplication operation on the first matrix and the second matrix to obtain the multiplication result.
  • the first matrix is constructed according to the transformed target matrix
  • the second matrix is constructed according to the transformed convolution kernel.
  • the vector operation module is used to perform winograd inverse transformation on the multiplication result to obtain the output feature map, and the process of performing the winograd inverse transformation on the matrix multiplication result is equivalent to performing vector addition and subtraction operations on the matrix multiplication result.
  • the first matrix and the second matrix are respectively constructed from the target matrix and the convolution kernel after the winograd forward transformation; the existing matrix operation module in the neural network accelerator is then used to perform the matrix multiplication operation on the first matrix and the second matrix, and the existing vector operation module in the neural network accelerator is used to perform the winograd inverse transformation on the multiplication result. This avoids the need to modify core operation modules such as the matrix operation module and the vector operation module in the neural network, so the design is simple, and it avoids adding a dedicated module to the neural network accelerator for performing the dot multiplication operation on the target matrix and the convolution kernel after the winograd forward transformation, which improves the efficiency of the neural network accelerator in performing winograd calculations.
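  • As a minimal illustration (not part of the patent text) of why the winograd inverse transformation reduces to vector additions and subtractions: the commonly used F(2×2, 3×3) inverse-transform matrices A^T and A contain only the values 0 and ±1, so computing A^T·S·A for a 4×4 multiplication result S requires no multiplications. A sketch in numpy:

```python
import numpy as np

# Commonly used F(2x2, 3x3) inverse-transform matrix; entries are only 0 and +/-1.
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

S = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for a 4x4 multiplication result

# Inverse transformation written as matrix algebra ...
Y = A_T @ S @ A_T.T

# ... and the same 2x2 result written purely as additions and subtractions,
# which is the kind of work a vector operation module can do element-wise.
y00 = S[0:3, 0:3].sum()
y01 = (S[0:3, 1] - S[0:3, 2] - S[0:3, 3]).sum()
y10 = (S[1, 0:3] - S[2, 0:3] - S[3, 0:3]).sum()
y11 = (S[1, 1] - S[1, 2] - S[1, 3]) - (S[2, 1] - S[2, 2] - S[2, 3]) - (S[3, 1] - S[3, 2] - S[3, 3])

assert np.allclose(Y, [[y00, y01], [y10, y11]])
```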
  • the preprocessing module is further configured to traverse the input feature map through a sliding window to obtain a target matrix corresponding to the input feature map. It can be seen from the first possible implementation of the first aspect that a specific way of obtaining the target matrix is provided: the input feature map is traversed through a sliding window, and the target matrix is the portion of the input feature map in the area covered by the sliding window.
  • the input feature map is an input feature map that has undergone a padding operation, and the size of the input feature map is W×H×k, where W and H are both even numbers not less than 4, k is an integer greater than 1, W is the number of rows of the input feature map, H is the number of columns of the input feature map, and k is the number of channels of the input feature map.
  • Padding can be understood as adding some pixels on the periphery of the input feature map, for example, initializing these pixels to 0 or other specified values.
  • pixels can be added to the periphery of the input feature map during padding, so that the rows and columns of the input feature map are even numbers not less than 4.
  • It can be seen from the first possible implementation manner of the first aspect that a specific method for determining the target matrix of the input feature map is provided, which increases the diversity of the scheme.
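  • As a small sketch of the padding step (assuming zero padding added on one side only; the patent does not fix the padding value or its placement), an input feature map of shape (H, W, k) can be extended so that its rows and columns are even numbers not less than 4:

```python
import numpy as np

def pad_to_even(x):
    """Zero-pad a feature map of shape (H, W, k) so that H and W are even and >= 4."""
    h, w, _ = x.shape
    pad_h = max(4 - h, h % 2)   # bring H up to at least 4 and make it even
    pad_w = max(4 - w, w % 2)   # bring W up to at least 4 and make it even
    return np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)))
```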
  • the size of the convolution kernel is 3×3×k×n, the stride of the convolution kernel is 1, n is the number of channels of the output feature map, and n is an integer greater than 1.
  • the first matrix includes the i-th element in the transformed target matrix, and i is a positive integer not greater than 16.
  • the first matrix is a matrix with m rows and k columns, m is ((W-2)(H-2)/4)
  • the second matrix includes the i-th element of the transformed convolution kernel, and the second matrix is a matrix with k rows and n columns; the multiplication result is used to determine the output feature map. It can be seen from the third possible implementation manner of the first aspect that a specific way of constructing the first matrix and the second matrix is given.
  • the vector operation module is specifically configured to: perform the inverse winograd transformation on the multiplication result to obtain the third matrix.
  • the elements in the third matrix are reordered according to a preset reordering rule to obtain an output feature map. It can be seen from the second possible implementation of the first aspect that if the input feature map is processed in the convolution layer, the results of the 16 matrix multiplications are processed by the vector operation module, and then the processed results are reordered to obtain the output feature map.
  • the vector operation module is specifically configured to: perform the inverse winograd transform on the multiplication result to output the third matrix.
  • the elements in the third matrix are summed to obtain the output feature map. It can be seen from the third possible implementation manner of the first aspect that if the input feature map is processed in the pooling layer, the output feature map can be obtained by summing or taking the maximum of the elements in the third matrix.
  • the second winograd forward transformation includes the third winograd forward transformation and the fourth winograd forward transformation.
  • the neural network accelerator further includes a storage module, and the storage module is configured to store the first transformation result obtained by performing the third winograd forward transformation on the convolution kernel through the third matrix.
  • the matrix transformation unit is specifically used to perform the fourth winograd forward transformation on the first transformation result through the fourth matrix to obtain the transformed convolution kernel.
  • the third matrix and the fourth matrix are obtained by decomposing the transformation matrix of the second winograd forward transformation
  • the values of the elements in the third matrix are 0 or ⁇ 1
  • the fourth matrix is a matrix other than the third matrix in the decomposed matrix. It can be known from the fourth possible implementation manner of the first aspect that, in order to reduce the calculation amount of the matrix transformation unit in the accelerator, part of the process of the second winograd forward transformation may be performed offline.
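  • The patent does not reproduce the decomposition itself here. As an illustrative assumption consistent with the description, the commonly used F(2×2, 3×3) kernel-transform matrix G can be factored into a matrix whose entries are only 0 or ±1 (handling the addition/subtraction part of the transform) and a diagonal scaling matrix, for example:

```python
import numpy as np

G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])

# One possible decomposition G = D @ G1, where G1 only contains 0 and +/-1
# (a hypothetical "third matrix") and D is a diagonal scaling ("fourth matrix").
G1 = np.array([[1, 0, 0],
               [1, 1, 1],
               [1, -1, 1],
               [0, 0, 1]], dtype=float)
D = np.diag([1.0, 0.5, 0.5, 1.0])
assert np.allclose(G, D @ G1)

g = np.random.rand(3, 3)              # one 3x3 slice of a convolution kernel
first_result = G1 @ g @ G1.T          # additions/subtractions only
transformed = D @ first_result @ D    # remaining per-element scaling (D is diagonal)
assert np.allclose(transformed, G @ g @ G.T)
```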
  • the matrix transformation unit is further configured to: acquire M elements of multiple transformed target matrices, where M is an integer greater than 1.
  • the M elements are processed according to the first preset formula to output a plurality of first matrices.
  • N elements of multiple transformed convolution kernels are acquired, where N is an integer greater than 1, and the N elements are processed according to the second preset formula to output a plurality of second matrices. It can be seen from the fifth possible implementation manner of the first aspect that in order to further improve the performance of the accelerator, multiple elements in each transformed target matrix may be extracted at one time, and multiple first matrices may be output at one time. It is also possible to extract multiple elements of the transformed convolution kernels at one time and output multiple second matrices at one time.
  • the vector operation module is further configured to perform inverse quantization processing on the multiplication result.
  • the vector operation module is specifically used to perform inverse winograd transformation on the multiplication result after inverse quantization processing.
  • the vector operation module is also used for quantizing the output feature map to obtain the quantized output feature map. It can be seen from the sixth possible implementation manner of the first aspect that, in order to meet the requirements of fixed-point number operations, a quantization operation and an inverse quantization operation may be added.
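  • As a conceptual sketch of the order of operations (the scale values, rounding and clipping policy below are assumptions for illustration, not taken from the patent): the integer matrix-multiplication result is inverse-quantized, the winograd inverse transformation is then applied, and the output is re-quantized.

```python
import numpy as np

def dequant_invtransform_quant(s_int32, in_scale, w_scale, out_scale):
    """s_int32: a 4x4 block of integer matrix-multiplication results for one tile/channel."""
    a_t = np.array([[1, 1, 1, 0],
                    [0, 1, -1, -1]], dtype=float)
    s = s_int32.astype(float) * (in_scale * w_scale)   # inverse quantization
    y = a_t @ s @ a_t.T                                # winograd inverse transformation
    return np.clip(np.round(y / out_scale), -128, 127).astype(np.int8)  # quantize output
```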
  • the vector operation module is further configured to: perform an offset operation on the multiplication result. It can be seen from the seventh possible implementation manner of the first aspect that the solution provided in this application performs an offset operation on the multiplication result, which may be equivalent to performing an offset operation on the output feature map.
  • a second aspect of the present application provides an acceleration method, including: performing a first winograd forward transformation on a target matrix corresponding to an input feature map to obtain a transformed target matrix.
  • a second winograd forward transformation is performed on the convolution kernel to obtain the transformed convolution kernel.
  • a matrix multiplication operation is performed on the first matrix and the second matrix to obtain the multiplication result, the first matrix is constructed according to the transformed target matrix, and the second matrix is constructed according to the transformed convolution kernel. Perform inverse winograd transformation on the multiplication result to get the output feature map.
  • the input feature map is traversed through a sliding window to obtain a target matrix corresponding to the input feature map.
  • the input feature map is an input feature map that has undergone a padding operation, and the size of the input feature map is W×H×k, where W and H are both even numbers not less than 4, k is an integer greater than 1, W is the number of rows of the input feature map, H is the number of columns of the input feature map, and k is the number of channels of the input feature map.
  • the input feature map is traversed through a sliding window with stride 2 and size 4 ⁇ 4 to obtain (((W-2)(H-2)/4) ⁇ k) target matrices.
  • a padding operation is performed on the input feature map, so that the size of the input feature map is W×H×k, where W and H are both even numbers not less than 4, k is an integer greater than 1, W is the number of rows of the input feature map, H is the number of columns of the input feature map, and k is the number of channels of the input feature map.
  • the input feature map is traversed through a sliding window with stride 2 and size 4 ⁇ 4 to obtain (((W-2)(H-2)/4) ⁇ k) target matrices.
  • the size of the convolution kernel is 3×3×k×n, the stride of the convolution kernel is 1, n is the number of channels of the output feature map, and n is an integer greater than 1.
  • the first matrix includes the i-th element in the transformed target matrix, and i is a positive integer not greater than 16.
  • the first matrix is a matrix with m rows and k columns, m is ((W-2)(H-2)/4)
  • the second matrix includes the i-th element of the transformed convolution kernel, the second matrix is a matrix with k rows and n columns, and the multiplication result is used to determine the output feature map.
  • performing the inverse winograd transform on the multiplication result to obtain the output feature map includes: performing the inverse winograd transformation on the multiplication result to obtain a third matrix.
  • the elements in the third matrix are reordered according to a preset reordering rule to obtain an output feature map.
  • performing the inverse winograd transform on the multiplication result to obtain the output feature map includes: performing the inverse winograd transformation on the multiplication result to output a third matrix. The elements in the third matrix are summed to obtain the output feature map.
  • the second winograd forward transformation includes the third winograd forward transformation and the fourth winograd forward transformation. Performing the second winograd forward transformation on the convolution kernel with a size of 3×3×k×n and a stride of 1 to obtain the transformed convolution kernel includes: performing the third winograd forward transformation on the convolution kernel through the third matrix to obtain the first transformation result.
  • the fourth winograd forward transformation is performed on the first transformation result through the fourth matrix to obtain the transformed convolution kernel.
  • the third matrix and the fourth matrix are matrices obtained by decomposing the transformation matrix of the second winograd forward transformation. The values of the elements in the third matrix are 0 or ±1, and the fourth matrix is a matrix other than the third matrix in the decomposed matrices.
  • the method further includes: acquiring M elements of multiple transformed target matrices, where M is an integer greater than 1; processing the M elements according to the first preset formula to output a plurality of first matrices; acquiring N elements of multiple transformed convolution kernels, where N is an integer greater than 1; and processing the N elements according to the second preset formula to output a plurality of second matrices.
  • the method further includes performing an inverse quantization process on the multiplication result to obtain an inverse quantized multiplication result.
  • the performing inverse winograd transformation on the multiplication result to obtain the output feature map includes: performing inverse winograd transformation on the inverse quantized multiplication result to obtain the output feature map.
  • the method further includes: quantizing the output feature map to obtain the quantized output feature map.
  • the method further includes: performing an offset operation on the multiplication result.
  • a third aspect of the present application provides a neural network device, where the neural network device includes a neural network accelerator, where the neural network accelerator is the neural network accelerator described in the first aspect or any possible implementation manner of the first aspect.
  • a fourth aspect of the present application provides a chip system, the chip system includes a processor and a communication interface, the processor obtains program instructions through the communication interface, and when the program instructions are executed by the processor, the method described in the second aspect or any possible implementation manner of the second aspect is implemented.
  • a fifth aspect of the present application provides a chip system, the chip system includes a processor and a memory, the memory stores a program, and when the program instructions stored in the memory are executed by the processor, the method described in the second aspect or any possible implementation manner of the second aspect is implemented.
  • a sixth aspect of the present application provides a computer-readable storage medium, comprising a program that, when executed by a processing unit, executes the method described in the second aspect or any possible implementation manner of the second aspect.
  • a seventh aspect of the present application provides a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the method described in the second aspect or any possible implementation manner of the second aspect.
  • FIG. 1 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present invention.
  • FIG. 3-a is a schematic diagram of the structure of a neural network accelerator based on the winograd algorithm provided by an embodiment of the application;
  • FIG. 3-b is a schematic diagram of the structure of a neural network accelerator based on the winograd algorithm provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of the traversal unit 3012 traversing the input feature map in the accelerator provided by an embodiment of this application;
  • FIG. 5 is a schematic diagram of performing winograd forward transformation on a convolution kernel in an accelerator provided by an embodiment of the present application
  • FIG. 6-a is a schematic diagram of a first matrix in an accelerator provided by an embodiment of the present application.
  • FIG. 6-b is a schematic diagram of a first matrix in an accelerator provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a second matrix in an accelerator provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of obtaining 16 multiplication results in an accelerator provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of reordering elements in a third matrix according to a preset reordering rule in an embodiment of the present application.
  • FIG. 10 is a schematic diagram that the values of some elements in the transformed target matrix in the embodiment of the application can be calculated in parallel;
  • FIG. 11-a is a schematic diagram that the values of some elements in the transformed target matrix can be calculated in parallel in an embodiment of the present application;
  • FIG. 11-b is a schematic diagram showing that the values of some elements in the transformed target matrix can be calculated in parallel in an embodiment of the present application;
  • FIG. 11-c is a schematic diagram that the values of some elements in the transformed target matrix can be calculated in parallel in an embodiment of the present application;
  • FIG. 13 is a schematic diagram of an offset operation in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of an associated operation in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a matrix transformation unit, a matrix operation module and a vector operation module that can be executed in parallel through a pipeline in an embodiment of the present application;
  • FIG. 16 is a schematic diagram of obtaining an output feature map by performing multiple operations in a solution provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the naming or numbering of the steps in this application does not mean that the steps in the method flow must be executed in the time/logical sequence indicated by the naming or numbering; the execution order of the named or numbered process steps can be changed according to the technical purpose to be achieved, as long as the same or similar technical effects can be achieved.
  • the division of modules in this application is a logical division. In practical applications, there may be other ways of division; for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some ports, and the indirect coupling or communication connection between modules may be electrical or in other similar forms, which is not restricted in this application.
  • modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed into multiple circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this application.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s and an intercept 1 as inputs; the output of the operation unit can be f(∑_s W_s·x_s + b), where
  • s = 1, 2, …, n, and n is a natural number greater than 1
  • Ws is the weight of xs
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • in a convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neurons in the adjacent layer.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • CNN is a very common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture.
  • a deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230 .
  • the input layer 210 can acquire the data to be processed, which involves graphics, images, voice and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature and humidity.
  • the data to be processed is an image to be processed
  • the obtained image to be processed is processed by the convolutional layer/pooling layer 220 and the subsequent neural network layer 230 to obtain the processing result of the image.
  • the internal layer structure in the CNN 200 in Figure 1 is described in detail below.
  • the convolutional/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel or a convolution kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually slid along the horizontal direction of the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the multiple weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" weight matrices described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • the multiple weight matrices have the same size (row × column), and the convolution feature maps extracted by the multiple weight matrices of the same size also have the same size; the multiple extracted convolution feature maps of the same size are then combined to form the output of the convolution operation.
  • the weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
  • the features extracted by the initial convolutional layer (e.g., 221) are relatively simple, while the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved.
  • a pooling layer may follow a convolutional layer; that is, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • After being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one output or a set of outputs of the desired number of classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 1) and the output layer 240, and the parameters contained in the multiple hidden layers may be pre-trained based on relevant training data of specific task types; for example, the task types can include image recognition, image classification, image super-resolution reconstruction and so on.
  • After the multiple hidden layers in the neural network layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (as shown in FIG. 1, the propagation from the direction 210 to 240 is the forward propagation) is completed, the back propagation (as shown in FIG. 1, the propagation from the direction 240 to 210 is the back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230 .
  • Compared with FIG. 1, the multiple convolutional layers/pooling layers in the convolutional layer/pooling layer 220 in FIG. 2 are parallel, and the separately extracted features are all input to the neural network layer 230 for processing.
  • the convolutional neural networks shown in FIG. 1 and FIG. 2 are only examples of two possible convolutional neural networks in the embodiments of the present application; the convolutional neural networks in the embodiments of the present application can also exist in the form of other network models.
  • the neural network provided by the embodiments of the present application may also be a deep convolutional neural network (DCNN), a recurrent neural network (RNN), and the like.
  • the computations involved in the neural network mainly include convolution operations, activation operations, and pooling operations, among which the convolution operations take up most of the processing time of the neural network.
  • convolutional layers with a kernel size of 3×3 (row × column) and a stride of 1 occupy a large proportion of the convolution calculation, so there is great value in speeding up this type of convolutional layer.
  • by using the winograd algorithm, the number of multiplications of a 3×3 convolutional layer with a stride of 1 can be greatly reduced, which is beneficial to the improvement of hardware performance and energy efficiency ratio. In order to better understand the scheme, the winograd algorithm is introduced below.
  • the input signal D can be regarded as a 4×4 matrix, as shown in formula 1-1 below, and the convolution kernel K can be regarded as a 3×3 matrix, as shown in formula 1-2 below.
  • the matrix multiplication form of the convolution of D and K can be expressed by formula 1-3 below. Since transforming the convolution operation according to the winograd algorithm is prior art, the derivation is not carried out in this application, and only the derivation result is listed.
  • Formula 1-3 represents that the matrix D of the input signal is left-multiplied by the B^T matrix and right-multiplied by the B matrix to obtain the transformed matrix U.
  • This process is the process of performing winograd forward transformation on the input signal.
  • the size of the matrix U is 4×4. The matrix K corresponding to the convolution kernel is left-multiplied by the G matrix and right-multiplied by the G^T matrix to obtain the transformed matrix V; this is the process of performing the winograd forward transformation on the convolution kernel.
  • the size of the matrix V is 4×4.
  • A dot multiplication operation is performed on the matrix U and the matrix V to obtain the U⊙V matrix, and the U⊙V matrix is then left-multiplied by the A^T matrix and right-multiplied by the A matrix to obtain the matrix corresponding to the final output signal. This is the process of the winograd inverse transformation.
  • B^T is expressed by formula 1-4, B by formula 1-5, G by formula 1-6, G^T by formula 1-7, A^T by formula 1-8, and A by formula 1-9.
  • the output signal is a 2 ⁇ 2 matrix, which is represented by formula 2-0 in this application.
  • the number of multiplications can be reduced from 36 to 16. If it is extended to a neural network with a convolution kernel of 3 ⁇ 3, the energy efficiency ratio can be improved.
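  • The bodies of formulas 1-1 to 1-9 are not reproduced in this text. For reference, the commonly used F(2×2, 3×3) winograd transform matrices, together with a numerical check that the transformed computation matches a direct 3×3, stride-1 convolution on a single 4×4 tile, are sketched below (an illustration of the standard algorithm, not the patent's exact notation):

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

D = np.random.rand(4, 4)   # input signal (the 4x4 matrix of formula 1-1)
K = np.random.rand(3, 3)   # convolution kernel (the 3x3 matrix of formula 1-2)

U = B_T @ D @ B_T.T        # forward transformation of the input signal
V = G @ K @ G.T            # forward transformation of the convolution kernel
Y = A_T @ (U * V) @ A_T.T  # element-wise product, then inverse transformation

# Direct 3x3 convolution with stride 1 over the 4x4 tile yields the same 2x2 output.
ref = np.array([[np.sum(D[i:i + 3, j:j + 3] * K) for j in range(2)] for i in range(2)])
assert np.allclose(Y, ref)
```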
  • the present application comprehensively considers the deficiencies of the existing methods, and utilizes a conventional matrix operation module (matrix unit) and a vector operation module (vector unit) to realize the use of the winograd algorithm in the accelerator of the neural network. No major modifications to the core computing unit, nor dedicated hardware support are required.
  • in the winograd algorithm the input signal D is a 4×4 matrix, whereas the input feature map of a convolutional neural network can be of any size.
  • the input feature map can be traversed through a sliding window of size 4 ⁇ 4, and the area corresponding to each sliding window is a 4 ⁇ 4 matrix.
  • the area corresponding to the sliding window is called the target matrix.
  • since the stride of the convolution kernel convolved with it is 1 and the obtained output signal is a 2×2 matrix, the stride of the 4×4 sliding window is set to 2.
  • the rows and columns of the input feature map should be even numbers in order to obtain an integer number of sliding windows. If the rows and columns of the input feature map are not even, a padding operation can first be performed on the input feature map so that both the rows and columns of the input feature map are even.
  • since the input signal D is a 4×4 matrix, in order to use the winograd algorithm in this application, the rows and columns of the input feature map should also be even numbers not less than 4.
  • a matrix transformation unit can be added.
  • winograd forward transformation can be performed on each target matrix to obtain the transformed target matrix.
  • the process of performing the winograd forward transformation on the target matrix can be understood by referring to the process of performing the forward transformation on the input signal in the winograd algorithm, that is, left-multiplying the target matrix by the B^T matrix and right-multiplying it by the B matrix to obtain the transformed target matrix.
  • the matrix transformation unit can also perform winograd forward transformation on each convolution kernel to obtain the transformed convolution kernel.
  • the process of performing the winograd forward transformation on the convolution kernel can be understood by referring to the process of performing the forward transformation on the convolution kernel in the winograd algorithm, that is, left-multiplying the convolution kernel by the G matrix and right-multiplying it by the G^T matrix to obtain the transformed convolution kernel.
  • the input feature map includes multiple image channels, that is, the input feature map has an additional dimension compared to the input signal in the winograd algorithm, and the increased dimension is the number of input channels.
  • in addition to the dimension of the number of input channels, the convolution kernel also includes the dimension of the number of output channels (that is, the number of convolution kernels); that is, compared with the convolution kernel in the winograd algorithm, the convolution kernel in the convolutional neural network adds two dimensions, namely the number of input channels and the number of output channels. In the winograd algorithm, it is necessary to perform a dot product operation on the matrix U and the matrix V.
  • since the input feature map adds an input channel dimension and the convolution kernel adds an input channel dimension and an output channel dimension, the winograd algorithm cannot be directly applied to the convolutional neural network.
  • the existing technology generally requires more modifications to the core computing unit of the convolutional neural network, or requires special hardware to support it.
  • the solution provided in this application converts the dot multiplication operation into a matrix multiplication operation on the basis of the transformed target matrix and the transformed convolution kernel. With the solution provided in this application, it is only necessary to add a matrix transformation unit, and then the conventional matrix operation module and vector operation module in the convolutional neural network can be used to apply the winograd algorithm to the convolutional neural network.
  • the present application converts the dot multiplication operation into the multiplication of the first matrix and the second matrix by constructing the first matrix and the second matrix.
  • the first matrix includes the i-th element in each transformed target matrix, i is a positive integer not greater than 16, the first matrix is a matrix with m rows and k columns, and m is ((W-2)(H-2)/4); the second matrix includes the i-th element of each transformed convolution kernel, the second matrix is a matrix with k rows and n columns, and the multiplication result is used to determine the output feature map.
  • 16 first matrices and 16 second matrices can be obtained, and 16 first matrices and 16 second matrices are multiplied in one-to-one correspondence to obtain 16 multiplication results.
  • the first matrix includes the first element of each transformed target matrix, the second matrix includes the first element of each transformed convolution kernel, and the first matrix and the second matrix are multiplied to obtain the first multiplication result.
  • the first matrix includes the second element of each transformed target matrix, the second matrix includes the second element of each transformed convolution kernel, and the first matrix and the second matrix are multiplied to obtain the second multiplication result, and so on, until the 16th multiplication result is obtained.
  • the multiplication result is sometimes referred to as the matrix multiplication result.
  • the result of the matrix multiplication is subjected to the winograd inverse transformation through the vector operation module; the process of the winograd inverse transformation of the matrix multiplication result is to left-multiply the matrix multiplication result by the A^T matrix and right-multiply it by the A matrix. Since this scheme adopts the method of constructing the first matrix and the second matrix, the result of the dot multiplication operation is converted into 16 matrix multiplication results, and performing the winograd inverse transformation on the matrix multiplication results is equivalent to performing vector addition and subtraction operations on the 16 matrix multiplication results, which can be implemented by a conventional vector operation module; the specific process will be introduced below. After the results of the 16 matrix multiplications are processed by the vector operation module, the processed results are reordered, or the processed results are summed or accumulated, to obtain the output feature map corresponding to the input feature map.
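  • For the reordering step in the convolution-layer case, a sketch (array layout and tile ordering are illustrative assumptions) in which each tile's 2×2 inverse-transform result is written back to its spatial position in the output feature map:

```python
import numpy as np

def reorder_to_output(tile_outputs, H, W):
    """tile_outputs: (m, 2, 2, n) inverse-transformed results, one 2x2 block per tile,
    in the same row-major tile order used when traversing the input feature map.
    Returns an output feature map of shape (H-2, W-2, n)."""
    m, _, _, n = tile_outputs.shape
    out = np.zeros((H - 2, W - 2, n))
    tiles_per_row = (W - 2) // 2
    for t in range(m):
        r, c = divmod(t, tiles_per_row)
        out[2 * r:2 * r + 2, 2 * c:2 * c + 2, :] = tile_outputs[t]
    return out
```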
  • the solution provided in this application splits the process of the forward transformation of the convolution kernel into two parts: one part of the process is performed offline, and the other part is performed on-chip. Alternatively, the result of the forward transformation of the convolution kernel is obtained entirely by offline calculation.
  • the data format of the input feature map and the convolution kernel may be fixed-point numbers.
  • the solution provided in this application can support inverse quantization and quantization processing. The inverse quantization process can be performed before the inverse transformation operation, which can save bit width and increase computing power.
  • the solution provided in this application performs an offset operation on a multiplication result, which can be equivalent to performing an offset operation on the output feature map.
  • the matrix transformation unit, the matrix operation module and the vector operation module in the solution provided by this application can be executed in parallel through the pipeline.
  • the winograd inverse transformation can be computed on the fly in the process of moving data from the matrix operation module to the vector operation module.
  • FIG. 3-a is a schematic diagram of the structure of a neural network accelerator based on the winograd algorithm provided by an embodiment of the present application.
  • a neural network accelerator provided by this application includes a preprocessing module 301 , a matrix operation module 302 and a vector operation module 303 .
  • the neural network accelerator provided by the present application only needs to add a preprocessing module to apply the winograd algorithm to the neural network.
  • the preprocessing module 301 is configured to perform the first winograd forward transformation on the target matrix corresponding to the input feature map to obtain the transformed target matrix.
  • the preprocessing module 301 is further configured to perform a second winograd forward transformation on the convolution kernel to obtain a transformed convolution kernel.
  • the matrix operation module 302 is configured to perform a matrix multiplication operation on the first matrix and the second matrix to obtain a multiplication result.
  • the first matrix is constructed from the transformed target matrix
  • the second matrix is constructed from the transformed convolution kernel.
  • the matrix operation module 302 includes multiple processing units (process engines, PEs).
  • matrix operation module 302 is a two-dimensional systolic array.
  • the matrix operation module 302 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition.
  • the matrix operation module 302 may be a general-purpose matrix processor. For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The matrix operation module fetches the data corresponding to matrix B from the memory and buffers it on each PE in the matrix operation module. The matrix operation module then fetches the data of matrix A from the memory and performs a matrix operation with matrix B.
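  • As a purely illustrative software analogy of that description (not a hardware design from the patent), buffering the weight matrix B per processing element and streaming matrix A through to accumulate results can be sketched as:

```python
import numpy as np

def pe_array_matmul(A, B):
    """Illustrative weight-stationary matrix multiplication: each column of B is
    held fixed (as if buffered on a PE column) while rows of A stream through."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for col in range(n):              # weight column buffered on a PE column
        w = B[:, col]
        for row in range(m):          # rows of A stream through and accumulate
            C[row, col] = np.dot(A[row, :], w)
    return C
```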
  • the vector operation module 303 is configured to perform the winograd inverse transformation on the multiplication result to obtain the output feature map. It includes multiple operation processing units and, if necessary, further processes the output of the matrix operation module, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, size comparison and so on. It is mainly used for non-convolutional/fully-connected layer network computations in neural networks, such as batch normalization, pixel-level summation and upsampling of feature planes.
  • the preprocessing module 301 may include an acquisition unit 3011 , a traversal unit 3012 and a matrix transformation unit 3013 .
  • the obtaining unit 3011 is used to obtain the input feature map after padding has been performed. The size of the input feature map is W×H×k, where W and H are both even numbers not less than 4, k is a positive integer, W is the number of rows of the input feature map, H is the number of columns of the input feature map, and k is the number of channels of the input feature map.
  • the number of channels of the input feature map is sometimes abbreviated as the number of input channels or the number of input channels, and they have the same meaning.
  • Padding can be understood as adding some pixels on the periphery of the input feature map and initializing these pixels to 0 or other specified values. For input feature maps whose rows and columns are not even numbers not less than 4, pixels can be added to the periphery of the input feature map during the padding process, so that the rows and columns of the input feature map are even numbers not less than 4.
  • the traversal unit 3012 is used to traverse the input feature map through a sliding window with a step size of 2 and a size of 4×4 to obtain (((W-2)(H-2)/4) × k) target matrices; the target matrix is the portion of the input feature map in the region corresponding to the sliding window.
  • FIG. 4 is a schematic diagram of the traversal unit 3012 traversing the input feature map in the accelerator provided by an embodiment of this application.
  • the rows and columns of the input feature map are W and H respectively. By traversing the input feature map with a sliding window with a step size of 2 and a size of 4×4, ((W-2)(H-2)/4) regions corresponding to the sliding window, that is, ((W-2)(H-2)/4) 4×4 matrices, can be obtained. Referring to FIG. 4, each such matrix is the region corresponding to one sliding window, and the matrix is a target matrix. Considering that each target matrix also includes the dimension of the number of input channels, after the traversal unit 3012 traverses the input feature map, (((W-2)(H-2)/4) × k) target matrices can be obtained.
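  • A minimal sketch of this traversal (an (H, W, k) array layout is assumed for illustration):

```python
import numpy as np

def extract_target_matrices(x):
    """x: padded input feature map of shape (H, W, k), with H and W even and >= 4.
    Returns an array of shape (m, 4, 4, k), where m = (W-2)*(H-2)/4."""
    H, W, k = x.shape
    tiles = [x[i:i + 4, j:j + 4, :]          # 4x4 region covered by the sliding window
             for i in range(0, H - 2, 2)     # stride 2 down the rows
             for j in range(0, W - 2, 2)]    # stride 2 across the columns
    return np.stack(tiles)                   # m target matrices, each 4x4 per channel
```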
  • the matrix transformation unit 3013 is configured to perform the first winograd forward transformation on the target matrix to obtain the transformed target matrix. FIG. 4 also shows the process of performing the first winograd forward transformation on a target matrix to obtain the transformed target matrix, that is, left-multiplying the target matrix by the B^T matrix and right-multiplying it by the B matrix to obtain the transformed target matrix.
  • the matrix transformation unit 3013 is also used to perform the second winograd forward transformation on the convolution kernel with a size of 3 ⁇ 3 ⁇ k ⁇ n and a stride of 1 to obtain a transformed convolution kernel, where n is the number of channels of the output feature map .
  • FIG. 5 shows the process of performing the second winograd forward transformation on a convolution kernel to obtain the transformed convolution kernel, that is, left-multiplying the convolution kernel by the G matrix and right-multiplying it by the G^T matrix to obtain the transformed convolution kernel.
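  • A sketch (shapes and loop order are illustrative assumptions) of applying the first winograd forward transformation to every target matrix and the second winograd forward transformation to every convolution kernel:

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)

def transform_targets(tiles):
    """tiles: (m, 4, 4, k) target matrices -> (m, 4, 4, k) transformed target matrices,
    i.e. B^T . tile . B applied per tile and per input channel."""
    return np.einsum('ab,mbcq,cd->madq', B_T, tiles, B_T.T)

def transform_kernels(kernels):
    """kernels: (3, 3, k, n) convolution kernels -> (4, 4, k, n) transformed kernels,
    i.e. G . kernel . G^T applied per (input channel, output channel) pair."""
    return np.einsum('ab,bcqn,cd->adqn', G, kernels, G.T)
```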
  • the matrix operation module 302 is used to determine the multiplication result of the first matrix and the second matrix. The first matrix includes the i-th element of each transformed target matrix, i is a positive integer not greater than 16, and the first matrix is a matrix with m rows and k columns, where m is ((W-2)(H-2)/4); the second matrix includes the i-th element of each transformed convolution kernel, and the second matrix is a matrix with k rows and n columns; the multiplication result is used to determine the output feature map.
  • a dot multiplication operation should be performed on the transformed convolution kernel and the transformed target matrix, and the dot multiplication operation of the transformed convolution kernel and the transformed target matrix in this application is converted into a multiplication operation between the two matrices , so that the winograd algorithm can be applied to the convolutional neural network by only using the conventional matrix operation module 302 through such a design.
  • The idea of how to construct the first matrix and the second matrix is described below.
  • The i-th element of each transformed target matrix is extracted to form a matrix with m rows and k columns; this matrix is the first matrix.
  • The k dimension of the target matrix is not shown in FIG. 4; that is, the figure does not show that the input feature map includes multiple input channels. When constructing the first matrix, it is taken into account that each element of each transformed target matrix contains multiple input channels.
  • Referring to FIG. 6-a, take i = 1 as an example: when i is 1, the first matrix includes the first element of each transformed target matrix, and considering that the input feature map also includes the input-channel dimension, the first matrix is a matrix with m rows and k columns.
  • It should be noted that the number of rows and columns of the first matrix shown in FIG. 6-a is only an exemplary illustration; the value of k should be determined according to the number of input channels of the input feature map, and the value of m should be determined according to the number of rows and columns of the input feature map, specifically m is ((W-2)(H-2)/4), which is not repeated in this application.
  • For a better understanding, take i = 5 as an example. Referring to FIG. 6-b, when i is 5, the first matrix includes the fifth element of each transformed target matrix, and the first matrix is a matrix with m rows and k columns. Since each transformed target matrix includes 16 elements, a total of 16 first matrices can be obtained.
  • The construction of each first matrix can be understood with reference to FIG. 6-a and FIG. 6-b.
  • The i-th element of each transformed convolution kernel is extracted to form a matrix with K rows and n columns; this matrix is the second matrix.
  • Referring to FIG. 7, take i = 1 as an example: when i is 1, the second matrix includes the first element of each transformed convolution kernel, and considering that the input feature map also includes the input-channel dimension, the second matrix is a matrix with K rows and n columns. It should be noted that the number of rows and columns of the second matrix shown in FIG. 7 is only an exemplary illustration; the value of n should be determined according to the number of output channels, in other words, according to the number of convolution kernels, which is not repeated in this application. Since each transformed convolution kernel includes 16 elements, a total of 16 second matrices can be obtained. The construction of each second matrix can be understood with reference to FIG. 7.
  • In the above manner, the dot multiplication between the transformed target matrices and the transformed convolution kernels can be converted into multiplications of the first matrices and the second matrices; referring to FIG. 8, the result of the dot multiplication operation is equivalent to the results of 16 matrix multiplications.
  • Assume the results of the 16 matrix multiplications are matrix S1, matrix S2, matrix S3, matrix S4, matrix S5, matrix S6, matrix S7, matrix S8, matrix S9, matrix S10, matrix S11, matrix S12, matrix S13, matrix S14, matrix S15 and matrix S16.
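  • A compact sketch of this reorganization and of the 16 matrix multiplications S1…S16 (stacked here into one batched multiplication) is given below; the variable names and the row-major ordering of the 16 positions are assumptions of the example.

```python
import numpy as np

def winograd_matmuls(U, V):
    # U: transformed target matrices, shape (m, 4, 4, k) -> 16 first matrices of shape (m, k)
    # V: transformed kernels,         shape (4, 4, k, n) -> 16 second matrices of shape (k, n)
    m, _, _, k = U.shape
    n = V.shape[-1]
    first = U.reshape(m, 16, k).transpose(1, 0, 2)   # (16, m, k): slice i is the i-th first matrix
    second = V.reshape(16, k, n)                     # (16, k, n): slice i is the i-th second matrix
    return np.matmul(first, second)                  # (16, m, n): the results S1 ... S16

U = np.random.rand(9, 4, 4, 3).astype(np.float32)
V = np.random.rand(4, 4, 3, 8).astype(np.float32)
print(winograd_matmuls(U, V).shape)  # (16, 9, 8)
```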
  • The accelerator provided by the present application also includes a vector operation module 303. Because the elements of the winograd inverse-transformation matrices A^T and A are 0 or ±1, performing the winograd inverse transformation on the multiplication results is equivalent to performing element-wise operations on the results of the 16 matrix multiplications through the vector operation module 303.
  • Element wise refers to performing operations on corresponding elements of at least two matrices, for example performing an operation on the i-th element of one matrix and the i-th element of another matrix, where the operations can include addition operations, subtraction operations, and so on.
  • Q1, Q2, Q3 and Q4 can be used to determine the output feature map corresponding to the input feature map.
  • It can be seen that performing the winograd inverse transformation on the 16 multiplication results can be converted into adding or subtracting the results of the 16 matrix multiplications through the conventional vector operation module 303 to output a third matrix, and the third matrix may include Q1, Q2, Q3 and Q4.
  • the third matrix can be processed to obtain an output feature map.
  • In a possible implementation, if the input feature map is processed in a pooling layer, since the common operations of a pooling layer usually include max pooling and mean pooling, the four matrices Q1, Q2, Q3 and Q4 included in the third matrix can be processed to find their maximum value or sum: mean pooling outputs (Q1+Q2+Q3+Q4)/4, and max pooling outputs MAX(Q1, Q2, Q3, Q4).
  • The data output in this way, such as (Q1+Q2+Q3+Q4)/4 and MAX(Q1, Q2, Q3, Q4), can be used as one expression form of the output feature map.
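  • A sketch of these two pooling outputs, assuming Q1…Q4 are numpy arrays of the same shape (for example m rows and n columns):

```python
import numpy as np

def pool_outputs(Q1, Q2, Q3, Q4, mode="mean"):
    # Mean pooling outputs (Q1+Q2+Q3+Q4)/4; max pooling outputs MAX(Q1, Q2, Q3, Q4).
    if mode == "mean":
        return (Q1 + Q2 + Q3 + Q4) / 4
    return np.maximum.reduce([Q1, Q2, Q3, Q4])
```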
  • In a possible implementation, if the input feature map is processed in a convolution layer, the elements in the third matrix also need to be reordered according to a preset reordering rule to obtain the output feature map.
  • Referring to FIG. 9, the i-th element of the Q1 matrix, the i-th element of the Q2 matrix, the i-th element of the Q3 matrix and the i-th element of the Q4 matrix are extracted to form a 2 × 2 matrix, and the output feature map is obtained after reordering. For example, the first element Q1.1 of the Q1 matrix, the first element Q2.1 of the Q2 matrix, the first element Q3.1 of the Q3 matrix and the first element Q4.1 of the Q4 matrix are extracted to form a 2 × 2 matrix; then the second elements of the Q1, Q2, Q3 and Q4 matrices are extracted to form the next 2 × 2 matrix, and so on, until the elements in the four matrices Q1, Q2, Q3 and Q4 have all been reordered.
  • In a possible implementation, the elements in the third matrix may be rearranged within rows by the vector operation module 303, and then rearranged between rows by the vector operation module 303.
  • In another possible implementation, the vector operation module 303 can perform the in-row rearrangement of the elements in the third matrix, and the inter-row rearrangement can then be performed through direct memory access (DMA) transfer. It should be noted that each element in the third matrix includes multiple output channels.
  • The principle of reordering the elements in the third matrix according to the preset reordering rule to obtain the output feature map is explained below. If the first element of each of the matrices S1 to S16 is extracted to form a matrix, for example matrix 1, then after the winograd inverse transformation is performed on matrix 1, a 2 × 2 matrix can be output, and each element in this 2 × 2 matrix includes a number of output channels, that is, each element has the output-channel dimension.
  • The 2 × 2 matrix corresponding to matrix 1 is the output feature map corresponding to the input feature map of the region where the first sliding window is located.
  • Similarly, if the second element of each of the matrices S1 to S16 is extracted to form a matrix, for example matrix 2, then after the winograd inverse transformation is performed on matrix 2, a 2 × 2 matrix can be output, and each element in this 2 × 2 matrix also includes a number of output channels.
  • The 2 × 2 matrix corresponding to matrix 2 is the output feature map corresponding to the input feature map of the region where the second sliding window is located, and the second sliding window refers to the position reached after the sliding window with a stride of 2 has slid once.
  • The operation flow for obtaining the i-th element of the 2 × 2 matrix corresponding to matrix 1 is the same as that for obtaining the i-th element of the 2 × 2 matrix corresponding to matrix 2, and so on: the operation flow for obtaining the i-th element of the 2 × 2 matrix corresponding to matrix i is the same in every case, where matrix i is the matrix formed by extracting the i-th element of each of the matrices S1 to S16.
  • Therefore, performing the winograd inverse transformation on the 16 multiplication results outputs Q1, Q2, Q3 and Q4, where Q1 includes the first element of each of matrix 1 to matrix 16, Q2 includes the second element of each of matrix 1 to matrix 16, Q3 includes the third element of each of matrix 1 to matrix 16, and Q4 includes the fourth element of each of matrix 1 to matrix 16. Therefore, after Q1, Q2, Q3 and Q4 are obtained, the elements in the third matrix need to be reordered according to the preset reordering rule to obtain the output feature map. The manner of reordering can be understood with reference to FIG. 9.
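  • Under the standard inverse-transformation matrix A^T (whose entries are 0 or ±1), the additions/subtractions and the reordering described above can be sketched as follows; the row-major tile order over a (W-2)/2 × (H-2)/2 grid assumed here is an illustrative choice.

```python
import numpy as np

A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def inverse_and_reorder(S, out_rows, out_cols):
    # S: the 16 multiplication results, shape (16, m, n), where m = out_rows * out_cols tiles.
    m, n = S.shape[1], S.shape[2]
    M = S.reshape(4, 4, m, n)
    # Q = A^T * M * A, realized with additions/subtractions only (vector-module friendly).
    Q = np.einsum('ia,abmn,jb->ijmn', A_T, M, A_T)   # shape (2, 2, m, n): contains Q1..Q4
    # Reordering: tile t produces the 2x2 output patch at grid position (t // out_cols, t % out_cols).
    out = np.zeros((2 * out_rows, 2 * out_cols, n), dtype=S.dtype)
    for t in range(m):
        r, c = divmod(t, out_cols)
        out[2 * r:2 * r + 2, 2 * c:2 * c + 2, :] = Q[:, :, t, :]
    return out
```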
  • An accelerator provided by the embodiment of the present application has been introduced above.
  • The solution provided by the present application realizes the application of the winograd algorithm to the convolutional neural network by using the conventional matrix operation module and vector operation module of a general convolutional neural network.
  • For a 3 × 3 convolutional layer or a pooling layer with a stride of 1, the number of multiplications can be greatly reduced, improving the performance and energy efficiency of the accelerator.
  • As mentioned above, the i-th element of each transformed target matrix is extracted to form a matrix with m rows and k columns, which is the first matrix. To further improve the performance of the accelerator, multiple elements of each transformed target matrix may be extracted at one time, and multiple first matrices may be output at one time. Exemplarily, this is described below in conjunction with several specific implementations.
  • The way in which each target matrix is converted into a transformed target matrix by the winograd forward transformation can be expressed by formula 2-2, whose elements are as follows.
  • m00 = P00 - P20 - P02 + P22, m10 = P10 + P20 - P12 - P22, m20 = P20 - P10 - P22 + P12, m30 = P10 - P30 - P12 + P32. It can be seen that the operations for m00, m10, m20 and m30 all use the first and third columns of the target matrix.
  • m01 = P01 - P21 + P02 - P22, m11 = P11 + P21 + P12 + P22, m21 = P21 - P11 + P22 - P12, m31 = P11 - P31 + P12 - P32. It can be seen that the operations for m01, m11, m21 and m31 all use the second and third columns of the target matrix.
  • m02 = P02 - P22 - P01 + P21, m12 = P22 + P12 - P11 - P21, m22 = P22 - P12 - P21 + P11, m32 = P12 - P32 - P11 + P31. It can be seen that the operations for m02, m12, m22 and m32 all use the second and third columns of the target matrix.
  • m03 = P01 - P21 - P03 + P23, m13 = P11 + P21 - P13 - P23, m23 = P21 - P11 - P23 + P13, m33 = P11 - P31 - P13 + P33. It can be seen that the operations for m03, m13, m23 and m33 all use the second and fourth columns of the target matrix.
  • Referring to FIG. 10, it shows more intuitively that the values of some elements in the transformed target matrix can be calculated in parallel.
  • When multiple elements of the target matrices are acquired at one time, multiple first matrices, or some elements of multiple first matrices, may be output according to the acquired elements.
  • For example, the elements of the first and third columns corresponding to each sliding window are acquired; when the sliding window slides once, three such columns of elements are obtained, from which the first-column elements of two transformed target matrices can be output. For another example, all the elements corresponding to each sliding window are acquired; when the sliding window slides once, all the elements of two target matrices are obtained.
  • Performing a unified operation on all the elements of the two target matrices allows the two transformed target matrices to be output at the same time. To maximize the utilization of the matrix operation module, the number of transformed target matrices output by the matrix transformation unit each time can be determined according to the actual bandwidth and the storage capacity of the matrix operation module; for example, the matrix transformation unit may output 1, 2, 4, 8 or 16 transformed target matrices at a time.
  • Several embodiments are described below. As shown above, the calculation of the elements in the first column of the transformed target matrix uses the first and third columns of the target matrix. Referring to FIG. 11-a, the elements of the odd-numbered columns of multiple target matrices can therefore be acquired, and the first-column elements of one or more transformed target matrices can be determined from the odd-numbered-column elements acquired in one or more passes. For example, as shown in FIG. 11-a, if three odd-numbered columns of the target matrices are acquired, the first-column elements of two transformed target matrices can be obtained.
  • Similarly, the calculation of the elements in the second and third columns of the transformed target matrix uses the elements in the second and third columns of the target matrix. Referring to FIG. 11-b, multiple columns of elements of multiple target matrices can be acquired, and the second- and third-column elements of one or more transformed target matrices can be determined from the columns acquired in one or more passes. For example, as shown in FIG. 11-b, if the elements of 4 columns of the target matrices are acquired, the second- and third-column elements of two transformed target matrices can be obtained.
  • Likewise, the calculation of the elements in the fourth column of the transformed target matrix uses the elements in the second and fourth columns of the target matrix. Referring to FIG. 11-c, the elements of the even-numbered columns of multiple target matrices can be acquired, and the fourth-column elements of one or more transformed target matrices can be determined from the even-numbered-column elements acquired in one or more passes.
  • As shown in FIG. 11-a to FIG. 11-c, after the elements of 4 rows and 6 columns have been acquired, two transformed target matrices can be output according to these elements.
  • The input-channel dimension is not shown in FIG. 11-a to FIG. 11-c, but it should be clear that each element of each target matrix includes multiple input channels, and each element of each transformed target matrix also includes multiple input channels.
  • To further improve the performance of the accelerator, multiple elements of each transformed convolution kernel can also be extracted at one time, and multiple second matrices can be output at one time. There are overlaps in the calculation of the elements of the transformed convolution kernel, which is described below in conjunction with formula 2-3.
  • q00 = k′00, q10 = (k′00 + k′10 + k′20)/2, q20 = (k′00 - k′10 + k′20)/2, q30 = k′20. It can be seen that the operations for q00, q10, q20 and q30 all use the first column of the convolution kernel.
  • q01 = (k′00 + k′01 + k′02)/2, q11 = (k′00 + k′01 + k′02 + k′10 + k′11 + k′12 + k′20 + k′21 + k′22)/4, q21 = (k′00 + k′01 + k′02 - k′10 - k′11 - k′12 + k′20 + k′21 + k′22)/4, q31 = (k′20 + k′21 + k′22)/2. It can be seen that the operations for q01, q11, q21 and q31 use every column of the convolution kernel.
  • q02 = (k′00 - k′01 + k′02)/2, q12 = (k′00 - k′01 + k′02 + k′10 - k′11 + k′12 + k′20 - k′21 + k′22)/4, q22 = (k′00 - k′01 + k′02 - k′10 + k′11 - k′12 + k′20 - k′21 + k′22)/4, q32 = (k′20 - k′21 + k′22)/2. It can be seen that the operations for q02, q12, q22 and q32 use every column of the convolution kernel.
  • q03 = k′02, q13 = (k′02 + k′12 + k′22)/2, q23 = (k′02 - k′12 + k′22)/2, q33 = k′22. It can be seen that the operations for q03, q13, q23 and q33 all use the third column of the convolution kernel.
  • Each transformed convolution kernel, or some of its elements, can thus be output by vector additions and subtractions over the elements of the convolution kernel. To improve parallelism, each element carries all or part of the input channels and output channels.
  • In a possible implementation, in order to reduce the computation load of the matrix transformation unit in the accelerator, the second winograd forward transformation can be performed offline. That is, the accelerator provided by this application further includes a storage module, and the storage module is used to store the result of the second winograd forward transformation; the other modules in the accelerator can directly use the second winograd forward transformation result pre-stored in the storage module.
  • In another possible implementation, part of the second winograd forward transformation may be performed on-chip, while the remaining part is performed offline. An example is given below.
  • The second winograd forward transformation includes a third winograd forward transformation and a fourth winograd forward transformation, and the neural network accelerator further includes a storage module, where the storage module is used to store a first transformation result obtained by performing the third winograd forward transformation on the convolution kernel through the third matrix.
  • The matrix transformation unit is specifically configured to perform the fourth winograd forward transformation on the first transformation result through the fourth matrix to obtain the transformed convolution kernel. The third matrix and the fourth matrix are matrices obtained by decomposing the transformation matrix of the second winograd forward transformation; the values of the elements in the third matrix are 0 or ±1, and the fourth matrix is the matrix other than the third matrix among the decomposed matrices.
  • For example, G × K × G^T = V can be rewritten as formula 2-4: V = GL × (GR × K × GR^T) × GL^T = GL × Wm × GL^T, where Wm = GR × K × GR^T can be computed offline and the result pre-stored in the storage module, while GL × Wm × GL^T is executed on-chip. Here the transformation matrix G of the second winograd forward transformation is split into a 3 × 3 matrix GR (formula 2-5) and a 4 × 3 matrix GL (formula 2-6). It should be noted that other splitting methods are possible, as long as all the elements of one matrix of the split transformation matrix are 0 or ±1.
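  • The matrices of formulas 2-5 and 2-6 are given in the figures and are not reproduced here; the sketch below shows one split of the standard G into a 3 × 3 matrix with entries 0 or ±1 and a 4 × 3 matrix, chosen only to illustrate the offline/on-chip division (the patent's own GR and GL may differ).

```python
import numpy as np

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

# One possible split: GR has only 0/+1/-1 entries, GL carries the 1/2 factors, and GL @ GR == G.
GR = np.array([[1.0,  0.0, 0.0],
               [1.0,  1.0, 1.0],
               [1.0, -1.0, 1.0]])
GL = np.array([[ 1.0, 0.0, 0.0],
               [ 0.0, 0.5, 0.0],
               [ 0.0, 0.0, 0.5],
               [-1.0, 0.5, 0.5]])
assert np.allclose(GL @ GR, G)

K = np.random.rand(3, 3)
Wm = GR @ K @ GR.T   # third winograd forward transformation: can be computed offline and stored
V = GL @ Wm @ GL.T   # fourth winograd forward transformation: performed on-chip
assert np.allclose(V, G @ K @ G.T)
```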
  • the solution provided in this application can support inverse quantization and quantization processing.
  • the vector operation module may support inverse quantization (De-quantization) and quantization (Quantization) operations to meet the requirements of fixed-point number operations.
  • Inverse quantization can be used to convert fixed-point numbers into floating-point numbers or into other fixed-point formats that are convenient for the vector operation module to operate on, for example s32->f16 or s32->s16; quantization is used to convert the rearranged results of the vector operation module into the fixed-point input of the next layer, for example s16->s8 or f16->s8.
  • Inverse quantization may be performed before the winograd inverse transformation, and quantization may be performed after the winograd inverse transformation. Performing the inverse quantization before the inverse transformation operation can save bit width and allow greater computing power. It should be noted that the embodiments of the present application do not limit the specific manners of quantization and inverse quantization.
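  • A rough sketch of the inverse-quantization and quantization steps around the inverse transformation is given below; the scales and the specific format choices (here s32->f16 and then ->s8, two of the conversions mentioned above) are illustrative assumptions.

```python
import numpy as np

def dequantize(x_int32, scale):
    # Convert accumulated fixed-point results to a wider format before the inverse transformation.
    return x_int32.astype(np.float16) * np.float16(scale)

def quantize(x_float, scale):
    # Convert the rearranged output back to a narrow fixed-point format for the next layer.
    q = np.rint(x_float / scale)
    return np.clip(q, -128, 127).astype(np.int8)
```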
  • Referring to FIG. 12, it is a schematic structural diagram of an accelerator provided by the present application.
  • The accelerator provided by this application is based on a conventional matrix operation module and vector operation module, and realizes the application of the winograd algorithm to the acceleration of the neural network with only minor architectural modifications.
  • The accelerator performs traversal processing, through the traversal unit and the matrix transformation unit, on the input feature map obtained by the obtaining unit, and performs winograd forward transformation processing to output 16 first matrices.
  • The accelerator performs winograd forward transformation processing on the convolution kernel through the matrix transformation unit to output 16 second matrices.
  • the manner and principle of acquiring the first matrix and the second matrix have been described above, and will not be repeated here.
  • The vector operation module is used for post-processing.
  • The post-processing includes a data rearrangement operation, a summation operation, or an accumulation operation. If the input feature map is processed in a convolution layer, the data can be rearranged through the data-moving function of the vector operation module to obtain the output image features; if the input feature map is processed in a pooling layer, the data can be summed or accumulated to obtain the output image features.
  • The accelerator supports different data formats such as floating point and fixed point. When the calculation process involves fixed-point operations, the vector operation module can perform inverse quantization and quantization operations to support fixed-point convolution operations.
  • In a possible implementation, an offset operation may be performed on at least one multiplication result.
  • Performing an offset operation on one multiplication result can be equivalent to performing an offset operation on the output feature map. This is demonstrated below:
  • b represents the offset, and one of the values of c can be obtained through formula 2-7 above.
  • It can be seen that performing an offset operation on the fifth multiplication result can be equivalent to performing an offset operation on the output feature map.
  • Referring to FIG. 13, it is a schematic diagram of a possible way of performing an offset operation on one multiplication result, which can be equivalent to performing an offset operation on the output feature map.
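  • A small numerical check of this kind of equivalence is sketched below, using the standard F(2×2, 3×3) matrices. The offset is added here to the winograd-domain element whose inverse-transform coefficient is 1 for every output position; treating this element as the multiplication result referred to in the text is an assumption of the illustration.

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G   = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
b = 0.7  # offset (bias)

# Direct stride-1 'valid' convolution of the 4x4 tile with the 3x3 kernel, plus the offset.
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)]) + b

# Winograd path: add the offset to one element of the element-wise product before the inverse transform.
M = (G @ g @ G.T) * (B_T @ d @ B_T.T)
M[1, 1] += b  # the coefficient of M[1,1] in A^T M A is 1 for all four output positions
out = A_T @ M @ A_T.T
assert np.allclose(out, ref)
```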
  • In a possible implementation, some of the operations of the matrix transformation unit and the vector operation module may be performed as on-the-fly operations along the data path.
  • For example, the function of the matrix transformation unit can be solidified into an instruction to be called, and the matrix transformation unit can be placed on the data transfer path from the upper-layer memory to the matrix operation module; that is, the data is processed while being transferred from the upper-layer memory to the matrix operation module, and the processing can be understood with reference to the operations performed by the matrix transformation unit.
  • The offset operation, the inverse quantization operation, or a part of the winograd inverse transformation of the vector operation module can also be completed as on-the-fly operations. Referring to FIG. 14, it is a schematic diagram showing the position of the on-the-fly operations in the overall operation flow of the solution provided by the present application.
  • the offset operation, the inverse quantization operation, or a part of the winograd inverse transformation can be performed along the way during the transfer process from the matrix operation module to the vector operation module.
  • The matrix transformation unit, the matrix operation module and the vector operation module can be executed in parallel through a pipeline to improve operation efficiency. That is, when the matrix transformation unit obtains a part of the result of the winograd forward transformation, it can send this part of the result to the matrix operation module; when the matrix operation module obtains a part of the multiplication results, it sends this part of the multiplication results to the vector operation module, so that the vector operation module can perform the winograd inverse transformation on this part of the multiplication results.
  • The number of matrices output by the matrix transformation unit each time can be determined according to the bandwidth and the storage capacity of the matrix operation module, so that one or more first matrices or second matrices are output each time, which is not repeated here.
  • The following description is given in conjunction with pseudocode. Assuming that the size of the input feature map is 56 × 56 × k and the size of the convolution kernel is 3 × 3 × k × n, the following corresponds to the case in which the matrix transformation unit outputs only 4 first matrices and 4 second matrices at a time.
  • Referring to FIG. 16, it is a schematic diagram of slicing the input feature map and the convolution kernel to obtain the output feature map in multiple operations.
  • The specific process can be understood with reference to the pseudocode, and is not repeated here.
  • An embodiment of the present application further provides an acceleration method, which may include the following steps: performing a first winograd forward transformation on a target matrix corresponding to an input feature map to obtain a transformed target matrix.
  • A second winograd forward transformation is performed on the convolution kernel to obtain the transformed convolution kernel.
  • A matrix multiplication operation is performed on the first matrix and the second matrix to obtain the multiplication results, where the first matrix is constructed according to the transformed target matrix and the second matrix is constructed according to the transformed convolution kernel. A winograd inverse transformation is then performed on the multiplication results to obtain the output feature map.
  • Optionally, the method further includes performing a padding operation on the input feature map, so that the size of the input feature map is W × H × k, where both W and H are even numbers not less than 4, k is an integer greater than 1, W is the row of the input feature map, H is the column of the input feature map, and k is the number of channels of the input feature map. The input feature map is traversed with a sliding window of stride 2 and size 4 × 4 to obtain (((W-2)(H-2)/4) × k) target matrices.
  • In another possible implementation, a padding operation is performed on the input feature map so that the size of the input feature map is W × H × k, where W and H are both even numbers not less than 4, k is an integer greater than 1, W is the row of the input feature map, H is the column of the input feature map, and k is the number of channels of the input feature map. The input feature map is traversed with a sliding window of stride 2 and size 4 × 4 to obtain (((W-2)(H-2)/4) × k) target matrices.
  • Optionally, the size of the convolution kernel is 3 × 3 × k × n, the stride of the convolution kernel is 1, n is the number of channels of the output feature map, and n is an integer greater than 1.
  • Optionally, the first matrix includes the i-th element of the transformed target matrix, where i is a positive integer not greater than 16; the first matrix is a matrix with m rows and k columns, and m is ((W-2)(H-2)/4); the second matrix includes the i-th element of the transformed convolution kernel and is a matrix with K rows and n columns; the multiplication results are used to determine the output feature map.
  • Optionally, performing the winograd inverse transformation on the multiplication results to obtain the output feature map includes: performing the winograd inverse transformation on the multiplication results to obtain the third matrix, and reordering the elements in the third matrix according to a preset reordering rule to obtain the output feature map.
  • Optionally, performing the winograd inverse transformation on the multiplication results to obtain the output feature map includes: performing the winograd inverse transformation on the multiplication results to output the third matrix, and performing a summation operation on the elements in the third matrix to obtain the output feature map.
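  • Putting these steps together, a compact numpy-only reference sketch (assuming the standard F(2×2, 3×3) matrices) is given below; it checks equivalence with a direct stride-1 3 × 3 convolution on a small padded input, which is an illustrative check rather than the accelerator's actual dataflow.

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G   = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_conv(x, w):
    # x: (W, H, k) padded input; w: (3, 3, k, n) kernels applied with stride 1.
    W, H, k = x.shape
    n = w.shape[-1]
    V = np.einsum('ia,abkn,jb->ijkn', G, w, G)               # second winograd forward transformation
    out = np.zeros((W - 2, H - 2, n))
    for i in range(0, W - 2, 2):
        for j in range(0, H - 2, 2):
            d = x[i:i+4, j:j+4, :]
            U = np.einsum('ia,abk,jb->ijk', B_T, d, B_T)     # first winograd forward transformation
            M = np.einsum('ijk,ijkn->ijn', U, V)             # the 16 per-position multiplications
            out[i:i+2, j:j+2, :] = np.einsum('ia,abn,jb->ijn', A_T, M, A_T)  # inverse transformation
    return out

x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 4)
ref = np.zeros((6, 6, 4))
for i in range(6):
    for j in range(6):
        ref[i, j] = np.einsum('abk,abkn->n', x[i:i+3, j:j+3, :], w)
assert np.allclose(winograd_conv(x, w), ref)
```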
  • Optionally, the second winograd forward transformation includes a third winograd forward transformation and a fourth winograd forward transformation, and performing the second winograd forward transformation on the convolution kernel with a size of 3 × 3 × k × n and a stride of 1 to obtain the transformed convolution kernel includes: obtaining a first transformation result by performing the third winograd forward transformation on the convolution kernel through the third matrix, and performing the fourth winograd forward transformation on the first transformation result through the fourth matrix to obtain the transformed convolution kernel. The third matrix and the fourth matrix are matrices obtained by decomposing the transformation matrix of the second winograd forward transformation; the values of the elements in the third matrix are 0 or ±1, and the fourth matrix is the matrix other than the third matrix among the decomposed matrices.
  • Optionally, the method further includes: acquiring M elements of multiple transformed target matrices, where M is an integer greater than 1, and processing the M elements according to a first preset formula to output multiple first matrices; and acquiring N elements of multiple transformed convolution kernels, where N is an integer greater than 1, and processing the N elements according to a second preset formula to output multiple second matrices.
  • the method further includes: performing an offset operation on a multiplication result.
  • The embodiments of the present application also provide a computer-readable storage medium, where a program for acceleration is stored in the computer-readable storage medium, and when the program runs on a computer, the computer is caused to execute the steps performed by the neural network accelerator in the methods described in the embodiments shown in the aforementioned FIG. 3-a to FIG. 15.
  • The neural network accelerator in this application can also be implemented by a digital processing chip or a chip. The chip includes a processing unit and a communication interface; the processing unit acquires program instructions through the communication interface, and the program instructions are executed by the processing unit, the processing unit being used to execute the method steps performed by the neural network accelerator shown in any of the embodiments in FIG. 3-a to FIG. 15.
  • the embodiments of the present application also provide a digital processing chip.
  • the digital processing chip implements the actions performed by the neural network accelerator in the above embodiment according to the program codes stored in the external memory.
  • Embodiments of the present application also provide a computer program product that, when run on a computer, causes the computer to execute the steps performed by the neural network accelerator in the methods described in the embodiments shown in the foregoing FIG. 3-a to FIG. 15.
  • the neural network accelerator provided in this embodiment of the present application may be a chip, and the chip includes a processing unit and a communication unit.
  • the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the server executes the steps performed by the neural network accelerator described in the embodiments shown in FIG. 3-a to FIG. 15 .
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • The storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • The aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general purpose processor may be a microprocessor or it may be any conventional processor or the like.
  • FIG. 17 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip can be represented as a neural network processor NPU.
  • The NPU is mounted on the host CPU (Host CPU) as a co-processor, and tasks are allocated by the Host CPU.
  • the core part of the NPU is the matrix operation module 302, which is controlled by the controller 308 to extract the matrix data in the memory and perform multiplication operations. It should be noted that the controller 308 can also control other modules in the NPU.
  • the specific steps performed by the matrix operation module 302 can be understood from the steps performed by the matrix operation module 302 described in any of the embodiments in FIG. 3-a to FIG. 15 .
  • It also includes a preprocessing module 301, and the specific steps performed by the preprocessing module can be understood from the steps performed by the preprocessing module described in any of the embodiments in FIG. 3-a to FIG. 15 .
  • the preprocessing module can be understood with reference to the actions performed by the acquisition unit 3011, the traversal unit 3012 and the matrix transformation unit 3013 in FIG. 3-a to FIG. 15 .
  • A bus interface unit (BIU) 310 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 309.
  • The bus interface unit 310 is used for the instruction fetch memory 309 to obtain instructions from the external memory, and is also used for the storage unit access controller 306 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the specific steps performed by the vector operation module 303 can be understood from the steps performed by the vector operation module 303 described in any of the embodiments in FIG. 3-a to FIG. 15 .
  • the vector operation module 303 can store the vector of processed outputs to the unified memory 307 .
  • The vector operation module 303 may apply a linear function and/or a non-linear function to the output of the matrix operation module 302, for example linear interpolation of the feature planes extracted by the convolutional layers, or, for example, a vector of accumulated values, to generate activation values.
  • the vector operation module 303 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to matrix operation module 302, eg, for use in subsequent layers in a neural network.
  • An instruction fetch buffer 309 connected to the controller 308 is used to store the instructions used by the controller 308 .
  • the unified memory 307, the input memory 305, the weight memory 304 and the instruction fetch memory 309 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • The operation of each layer in a recurrent neural network can be performed by the matrix operation module 302 or the vector operation module 303.
  • The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the above-mentioned methods in FIG. 3-a to FIG. 15.
  • The data flow is as follows: data, which may include input feature maps and weights, is obtained from the external memory through the bus interface unit 310, and the obtained data is stored in the unified memory.
  • The storage unit access controller controls the unified memory so that the data in the unified memory is transmitted onward.
  • The data output by the matrix transformation unit is transmitted to the weight memory 304 and the input memory; the weight memory 304 and the input memory output data to the matrix operation module; the data output by the matrix operation module is transmitted to the vector operation module; the output results of the vector operation module are stored in the unified memory, and the results can be output to the external bus.
  • The device embodiments described above are only schematic; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • The instructions enable a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media.
  • The usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks (SSDs)), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Discrete Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)
  • Feedback Control In General (AREA)

Abstract

一种神经网络加速器,包括:预处理模块(301),用于对输入特征图对应的目标矩阵进行第一winograd正变换,以得到变换的目标矩阵。预处理模块(301),还用于对卷积核进行第二winograd正变换,以得到变换的卷积核。矩阵运算模块(302),用于对第一矩阵和第二矩阵进行矩阵乘法运算,以得到相乘结果,第一矩阵是根据变换的目标矩阵构建的,第二矩阵是根据变换的卷积核构建的。向量运算模块(303),用于对相乘结果进行winograd反变换,以得到输出特征图。利用常规的矩阵运算模块(302)以及向量运算模块(303)可以实现将winograd算法应用到神经网络中。

Description

一种神经网络加速器、加速方法以及装置 技术领域
本申请涉及神经网络领域,尤其涉及一种神经网络加速器、加速方法以及装置。
背景技术
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用***。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
神经网络隶属于人工智能领域的连接主义学派,是一种应用类似于大脑神经突触连接的结构进行信息处理的数学模型。神经网络中涉及的计算主要包括卷积操作、激活操作和池化操作等,其中,卷积操作占用了神经网络处理的大部分时间。为了在有限的面积内获得较高的性能和能效比,目前研究者们提出了基于winograd算法的卷积运算方式,该方式通过对输入特征图与权值进行特定的矩阵转换,能够完成等效的卷积运算任务并大量减少卷积运算过程的乘法运算。
但是,当前合入了winograd算法进行加速的加速器,一般需要对神经网络中的矩阵运算模块、向量运算模块等核心运算模块进行较多的修改,设计复杂。
发明内容
本申请实施例提供一种神经网络加速器,该神经网络加速器基于winograd算法,利用神经网络中常规的矩阵运算模块以及向量运算模块就可以实现将winograd算法应用到神经网络中。针对3×3(行×列),步长为1的卷积层或者池化层,可以大量减少其中的乘法次数,提升加速器的性能以及能效比。
为达到上述目的,本申请实施例提供如下技术方案:
本申请第一方面提供一种神经网络加速器,包括:预处理模块,用于对输入特征图对应的目标矩阵进行第一winograd正变换,以得到变换的目标矩阵,对目标矩阵进行第一winograd正变换可以理解为对目标矩阵左乘B T矩阵,并右乘B矩阵,以得到变换后的目标矩阵。预处理模块,还用于对卷积核进行第二winograd正变换,以得到变换的卷积核,对卷积核进行第二winograd正变换可以理解为对卷积核左乘G矩阵,右乘G T矩阵,以得到变换后的卷积核。矩阵运算模块,用于对第一矩阵和第二矩阵进行矩阵乘法运算,以得到相乘结果,第一矩阵是根据变换的目标矩阵构建的,第二矩阵是根据变换的卷积核构建的。向量运算模块,用于对相乘结果进行winograd反变换,以得到输出特征图,对矩阵乘结果进行winograd反变换的过程,相当于对矩阵乘的结果进行向量的加减运算,可以通过常规的向量运算模块实现。
由第一方面可知,对执行过winograd正变换后的目标矩阵和卷积核来分别构建第一矩阵和第二矩阵,然后利用神经网络加速器中已有的矩阵运算模块来对第一矩阵和第二矩阵 执行矩阵乘法运算,利用神经网络加速器中已有的向量运算模块对相乘结果进行winograd反变换,避免了对神经网络中的矩阵运算模块和向量运算模块等核心运算模块进行修改,设计简单,且避免了在神经网络加速器中增加用于对执行过winograd正变换后的目标矩阵和卷积核进行点乘运算的模块,提高了神经网络加速器执行winograd计算的效率。
可选地,结合上述第一方面,在第一种可能的实施方式中,预处理模块还用于通过滑窗遍历输入特征图,以得到输入特征图对应的目标矩阵。由第一方面第一种可能的实施方式可知,给出了一种具体的获取目标矩阵的方式,可以通过滑窗遍历输入特征图,目标矩阵是滑窗对应区域的输入特征图。
可选地,结合上述第一方面第一种可能的实施方式中,在第二种可能的实现方式中,输入特征图是执行过填充padding操作的输入特征图,输入特征图的尺寸为W×H×k,W和H均为不小于4的偶数,k为大于1的整数,W为输入特征图的行,H为输入特征图的列,k为输入特征图的通道数目。可以将padding理解为在输入特征图的***补充一些像素点,比如,把这些像素点初始化为0或者其他指定的数值。对于行和列不是不小于4的偶数的输入特征图,可以在padding处理的过程中,通过在输入特征图的***补充像素点的方式,使输入特征图的行和列均为不小于4的偶数。通过步长为2,尺寸为4×4的滑窗遍历输入特征图,以得到(((W-2)(H-2)/4)×k)个目标矩阵,目标矩阵是滑窗对应区域的输入特征图。由第一方面第一种可能的实现方式可知,给出了一种具体的确定输入特征图的目标矩阵的方式,增加了方案的多样性。
可选地,结合上述第一方面第二种可能的实施方式,在第三种可能的实施方式中,卷积核的尺寸为3×3×k×n,卷积核的步长为1,n为输出特征图的通道数目,n为大于1的整数。本申请提供的方案,针对3×3(行×列),步长为1的卷积层或者池化层,可以大量减少其中的乘法次数,提升加速器的性能以及能效比。
可选地,结合上述第一方面第三种可能的实施方式,在第四种可能的实施方式中,第一矩阵包括变换的目标矩阵中的第i个元素,i为不大于16的正整数,第一矩阵是m行k列的矩阵,m为((W-2)(H-2)/4),第二矩阵包括变换的卷积核的第i个元素,第二矩阵是K行n列的矩阵,相乘结果用于确定输出特征图。由第一方面第三种可能的实施方式可知,给出了一种具体的构建第一矩阵和第二矩阵的方式。
可选地,结合上述第一方面或第一方面第一种至第一方面第四种可能的实施方式,在第五种可能的实施方式中,向量运算模块,具体用于:对相乘结果进行winograd反变换,以得到第三矩阵。通过预设的重排序规则对第三矩阵中的元素进行重排序,以得到输出特征图。由第一方面第二种可能的实施方式可知,如果在卷积层对输入特征图进行处理,通过向量运算模块对16个矩阵乘的结果进行处理后,再对处理后的结果进行重排序,以得到输出特征图。
可选地,结合上述第一方面或第一方面第一种至第一方面第四种可能的实施方式,在第六种可能的实施方式中,向量运算模块,具体用于:对相乘结果进行winograd反变换,以输出第三矩阵。对第三矩阵中的元素进行求和运算,以得到输出特征图。由第一方面第三种可能的实施方式可知,如果在池化层对输入特征图进行处理,可以通过对第三矩阵中 的元素进行求和或者求最大值运算,以得到输出特征图。
可选地,结合上述第一方面或第一方面第一种至第一方面第六种可能的实施方式,在第七种可能的实施方式中,第二winograd正变换包括第三winograd正变换以及第四winograd正变换,神经网络加速器还包括存储模块,存储模块,用于存储通过第三矩阵对卷积核进行第三winograd正变换的第一变换结果。矩阵变换单元,具体用于通过第四矩阵对第一变换结果进行第四winograd正变换,以得到变换的卷积核,第三矩阵和第四矩阵是对第二winograd正变换的变换矩阵进行分解后得到的矩阵,第三矩阵中的元素的取值为0或者±1,第四矩阵是分解后的矩阵中除第三矩阵之外的矩阵。由第一方面第四种可能的实施方式可知,为了减少加速器中的矩阵变换单元的计算量,可以将第二winograd正变换的部分过程离线执行。
可选地,结合上述第一方面或第一方面第一种至第一方面第七种可能的实施方式,在第八种可能的实施方式中,矩阵变换单元,还用于:获取多个变换的目标矩阵的M个元素,M为大于1的整数。按照第一预设公式对M个元素进行处理,以输出多个第一矩阵。获取多个变换的卷积核的N个元素,N为大于1的整数。按照第二预设公式对N个元素进行处理,以输出多个第二矩阵。由第一方面第五种可能的实施方式可知,为了能够进一步的提升加速器的性能,可以一次提取每个变换的目标矩阵中的多个元素,一次输出多个第一矩阵。也可以一次提取变换的卷积核中的多个元素,一次输出多个第二矩阵。
可选地,结合上述第一方面第一种至第一方面第八种可能的实施方式,在第九种可能的实施方式中,向量运算模块,还用于对相乘结果进行反量化处理。向量运算模块,具体用于对进行了反量化处理后的相乘结果进行winograd反变换。向量运算模块,还用于对输出特征图进行量化处理,以得到量化后的输出特征图。由第一方面第六种可能的实施方式可知,为了满足定点数运算的需求,可以增加量化操作和反量化操作。
可选地,结合上述第一方面或第一方面第一种至第一方面第九种可能的实施方式,在第十种可能的实施方式中,向量运算模块,还用于:对至少一个相乘结果进行偏移操作。由第一方面第七种可能的实施方式可知,本申请提供的方案对一个相乘结果进行偏移操作,可以等效为对输出特征图进行偏移操作。
本申请第二方面提供一种加速方法,包括:对输入特征图对应的目标矩阵进行第一winograd正变换,以得到变换的目标矩阵。对卷积核进行第二winograd正变换,以得到变换的卷积核。对第一矩阵和第二矩阵进行矩阵乘法运算,以得到相乘结果,第一矩阵是根据变换的目标矩阵构建的,第二矩阵是根据变换的卷积核构建的。对相乘结果进行winograd反变换,以得到输出特征图。
可选地,结合上述第二方面,在第一种可能的实施方式中,通过滑窗遍历输入特征图,以得到所述输入特征图对应的目标矩阵。
可选地,结合上述第二方面第一种可能的实施方式,在第二种可能的实施方式中,输入特征图为执行过填充padding操作的输入特征图,输入特征图的尺寸为W×H×k,W和H均为不小于4的偶数,k为大于1的整数,W为输入特征图的行,H为输入特征图的列,k为输入特征图的通道数目。通过步长为2,尺寸为4×4的滑窗遍历输入特征图,以得到(((W-2)(H-2)/4)×k)个目标矩阵。
可选地,结合上述第二方面第二种可能的实施方式,在第三种可能的实施方式中,对输入特征图执行填充padding操作,以使输入特征图的尺寸为W×H×k,W和H均为不小于4的偶数,k为大于1的整数,W为输入特征图的行,H为输入特征图的列,k为输入特征图的通道数目。通过步长为2,尺寸为4×4的滑窗遍历输入特征图,以得到(((W-2)(H-2)/4)×k)个目标矩阵。
可选地,结合上述第二方面第三种可能的实施方式,在第四种可能的实施方式中,卷积核的尺寸为3×3×k×n,卷积核的步长为1,n为输出特征图的通道数目,n为大于1的整数。
可选地,结合上述第二方面第四种可能的实施方式,在第五种可能的实施方式中,第一矩阵包括变换的目标矩阵中的第i个元素,i为不大于16的正整数,第一矩阵是m行k列的矩阵,m为((W-2)(H-2)/4),第二矩阵包括变换的卷积核的第i个元素,第二矩阵是K行n列的矩阵,相乘结果用于确定输出特征图。
可选地,结合上述第二方面或第二方面第一种至第二方面第五种可能的实施方式,在第六种可能的实施方式中,对相乘结果进行winograd反变换,以得到输出特征图,包括:对相乘结果进行winograd反变换,以得到第三矩阵。通过预设的重排序规则对第三矩阵中的元素进行重排序,以得到输出特征图。
可选地,结合上述第二方面或第二方面第一种至第二方面第六种可能的实施方式,在第七种可能的实施方式中,对相乘结果进行winograd反变换,以得到输出特征图,包括:对相乘结果进行winograd反变换,以输出第三矩阵。对第三矩阵中的元素进行求和运算,以得到输出特征图。
可选地,结合上述第二方面或第二方面第一种至第二方面第七种可能的实施方式,在第八种可能的实施方式中,第二winograd正变换包括第三winograd正变换以及第四winograd正变换,对尺寸为3×3×k×n,步长为1的卷积核进行第二winograd正变换,以得到变换的卷积核,包括:通过第三矩阵对卷积核进行第三winograd正变换的第一变换结果。通过第四矩阵对第一变换结果进行第四winograd正变换,以得到变换的卷积核,第三矩阵和所第四矩阵是对第二winograd正变换的变换矩阵进行分解后得到的矩阵,第三矩阵中的元素的取值为0或者±1,第四矩阵是分解后的矩阵中除第三矩阵之外的矩阵。
可选地,结合上述第二方面或第二方面第一种至第二方面第八种可能的实施方式,在第九种可能的实施方式中,还包括:获取多个变换的目标矩阵的M个元素,M为大于1的整数。按照第一预设公式对M个元素进行处理,以输出多个第一矩阵。获取多个变换的卷积核的N个元素,N为大于1的整数。按照第二预设公式对N个元素进行处理,以输出多个第二矩阵。
可选地,结合上述第二方面或第一方面第一种至第二方面第八种可能的实施方式,该方法还包括对所述相乘结果进行反量化处理,以得到反量化的相乘结果。所述对所述相乘结果进行winograd反变换,以得到输出特征图,包括:对所述反量化的相乘结果进行winograd反变换,以得到所述输出特征图。所述方法还包括:对所述输出特征图进行量化处理,以得到量化的所述输出特征图。
可选地,结合上述第二方面或第二方面第一种至第二方面第九种可能的实施方式,在第十一种可能的实施方式中,还包括:对相乘结果进行偏移操作。
本申请第三方面提供一种神经网络装置,该神经网络装置包括神经网络加速器,该神经网络加速器为第一方面或第一方面任意一种可能的实施方式中描述的神经网络加速器。
本申请第四方面提供一种芯片***,该芯片***包括处理器和通信接口,所述处理器通过所述通信接口获取程序指令,当所述程序指令被所述处理器执行时实现第二方面或第二方面任意一种可能的实施方式中描述的方法。
本申请第五方面提供一种芯片***,该芯片***包括处理器和存储器,所述存储器存储有程序,当所述存储器存储的程序指令被所述处理器执行时实现第二方面或第二方面任意一种可能的实施方式中描述的方法。
本申请第六方面提供一种计算机可读存储介质,包括程序,当其被处理单元所执行时,执行如第二方面或第二方面任意一种可能的实施方式中描述的方法。
本申请第七方面提供一种计算机程序产品,当所述计算机程序产品在计算机上运行时,使得计算机执行如第二方面或第二方面任意一种可能的实施方式中描述的方法。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本发明实施例提供的一种卷积神经网络结构示意图;
图2为本发明实施例提供的一种卷积神经网络结构示意图;
图3-a为本申请实施例提供的一种基于winograd算法的神经网络加速器的结构的示意图;
图3-b为本申请实施例提供的一种基于winograd算法的神经网络加速器的结构的示意图;
图4为本申请实例提供的加速器中遍历单元3012遍历输入特征图的示意图;
图5为本申请实施例提供的加速器中对卷积核进行winograd正变换的示意图;
图6-a为本申请实施例提供的加速器中第一矩阵的示意图;
图6-b为本申请实施例提供的加速器中第一矩阵的示意图;
图7为本申请实施例提供的加速器中第二矩阵的示意图;
图8为本申请实施例提供的加速器中获取16个相乘结果的示意图;
图9为本申请实施例中根据预设的重排序规则对第三矩阵中的元素进行重排序的示意图;
图10为本申请实施例中变换的目标矩阵中的部分元素的取值可以并行计算的示意图;
图11-a为本申请实施例中变换的目标矩阵中的部分元素的取值可以并行计算的示意图;
图11-b为本申请实施例中变换的目标矩阵中的部分元素的取值可以并行计算的示意图;
图11-c为本申请实施例中变换的目标矩阵中的部分元素的取值可以并行计算的示意图;
图12为本申请提供的一种加速器的结构示意图;
图13为本申请实施例中偏移操作的示意图;
图14为本申请实施例中随路运算的示意图;
图15为本申请实施例中矩阵变换单元,矩阵运算模块以及向量运算模块可以通过流水线并行执行的示意图;
图16为本申请实施例提供的方案分多次运算得到输出特征图的示意图;
图17为本申请实施例提供的芯片的一种结构示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”,“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程,方法,***,产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程,方法,产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。本申请中所出现的模块的划分,是一种逻辑上的划分,实际应用中实现时可以有另外的划分方式,例如多个模块可以结合成或集成在另一个***中,或一些特征可以忽略,或不执行,另外,所显示的或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些端口,模块之间的间接耦合或通信连接可以是电性或其他类似的形式,本申请中均不作限定。并且,作为分离部件说明的模块或子模块可以是也可以不是物理上的分离,可以是也可以不是物理模块,或者可以分布到多个电路模块中,可以根据实际的需要选择其中的部分或全部模块来实现本申请方案的目的。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对神经网络的相关概念进行介绍。
神经网络可以是由神经单元组成的,神经单元可以是指以xs和截距1为输入的运算单元,该运算单元的输出可以为:
Figure PCTCN2020118832-appb-000001
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输 入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
由于CNN是一种非常常见的神经网络,下面结合图1重点对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
本申请实施例涉及的神经网络的结构可以如图1所示。在图1中,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。其中,输入层210可以获取待处理数据,该数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有***的业务数据以及力、位移、液位、温度、湿度等感知数据。以下以该待处理数据是待处理图像进行说明,并将获取到的待处理图像交由卷积层/池化层220以及后面的神经网络层230进行处理,可以得到图像的处理结果。下面对图1中的CNN 200中内部的层结构进行详细的介绍。
卷积层/池化层220:
卷积层:
如图1所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核或者卷积核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩 阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的卷积特征图的尺寸也相同,再将提取到的多个尺寸相同的卷积特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图1中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层隐含层(如图1所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦 整个卷积神经网络200的前向传播(如图1由210至240方向的传播为前向传播)完成,反向传播(如图1由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
本申请实施例的神经网络的结构可以如图2所示。在图2中,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。与图1相比,图2中的卷积层/池化层220中的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层230进行处理。
需要说明的是,图1和图2所示的卷积神经网络仅作为一种本申请实施例的两种可能的卷积神经网络的示例,在具体的应用中,本申请实施例的卷积神经网络还可以以其他网络模型的形式存在。此外,本申请实施例提供的神经网络还可以是深度卷积神经网络(deep convolutional neural networks,DCNN),循环神经网络(recurrent neural network,RNNS)等等。
神经网络中涉及的计算主要包括卷积操作、激活操作和池化操作等,其中,卷积操作占用了神经网络处理的大部分时间。并且,卷积核的尺寸为3×3(行×列),步长为1的卷积层在卷积计算中占有很大的比重。因此,对这种类型的卷积层进行加速具有较大价值。使用winograd算法,可以将3×3,步长为1的卷积层的算法的乘法次数大量减少,有利于硬件性能提升以及能效比的提升。为了更好的理解方案,下面对winograd算法进行介绍。
对于winograd算法,可以将输入的信号D看做是4×4的矩阵,如下面的公式1-1所示,将卷积核K看做是3×3的矩阵,如下面的公式1-2所示。
Figure PCTCN2020118832-appb-000002
Figure PCTCN2020118832-appb-000003
根据winograd算法,D和K的卷积的矩阵乘形式可以通过如下公式1-3表示。由于根据winograd算法对卷积运算进行变换是现有技术,本申请不再进行推导,只列出推导后的结果。公式1-3表示对输入信号的矩阵D左乘B T矩阵,并右乘B矩阵,得到变换后矩阵U,这一过程为对输入信号进行winograd正变换的过程。矩阵U的尺寸为4×4,对卷积核对应的矩阵K左乘G矩阵,右乘G T矩阵,以得到变换后的矩阵V,矩阵V的尺寸为4×4,这一过程,为对卷积核进行winograd正变换的过程。对矩阵U和矩阵V进行点乘运算,得到U×V矩阵,再将U×V矩阵左乘A T矩阵,右乘A矩阵,以得到最后的输出信号对应的矩阵,这一过程为winograd反变换的过程。
Figure PCTCN2020118832-appb-000004
其中,B T通过公式1-4表示,B通过公式1-5表示,G通过公式1-6表示,G T通过公式1-7表示,A T通过公式1-8表示,A通过公式1-9表示。输出的信号为2×2的矩阵,本申请 通过公式2-0表示。
Figure PCTCN2020118832-appb-000005
Figure PCTCN2020118832-appb-000006
Figure PCTCN2020118832-appb-000007
Figure PCTCN2020118832-appb-000008
Figure PCTCN2020118832-appb-000009
Figure PCTCN2020118832-appb-000010
Figure PCTCN2020118832-appb-000011
经过winograd变换,乘法次数可以从36降低至16,如果推广到卷积核为3×3的神经网络中,可以提升能效比。
当前大多数基于矩阵运算的CNN加速器未合入2D winograd进行加速,在能效比和算力上存在瓶颈。合入了winograd算法进行加速的加速器,一般需要对核心计算单元进行较多修改,比如,要对神经网络中的矩阵运算模块,向量运算模块进行较多修改,或者需要专用的硬件支持,必要需要增加执行点乘运算的硬件模块。目前将winograd算法用于神经网络的加速器中的方案均不理想。
本申请综合考虑现有方法的不足,利用常规的矩阵运算模块(matrix unit)以及向量运算模块(vector unit),实现将winograd算法用于神经网络的加速器中。不需要对核心计算单元进行较多修改,也不需要专用的硬件支持。
为了便于更好的理解本申请,下面具体阐述本申请所描述的技术方案的研究思路:
通过上面对winograd算法的介绍可知,在winograd算法中,输入信号D是4×4的矩阵,而在实际应用中,输入特征图可以是任意尺寸。为了解决这一问题,可以通过尺寸为4×4的滑窗遍历该输入特征图,则每一个滑窗对应的区域都是4×4的矩阵,本申请中将滑窗对应的区域称为目标矩阵。此外,在winograd算法中,对于尺寸为4×4的输入信号, 与其卷积的卷积核的步长为1,以得到输出信号,该输出信号是一个2×2的矩阵,则本方案中,为了输出与该输入特征图对应的输出特征图,将该尺寸为4×4的滑窗的步长设置为2。确定滑窗的步长为2后,输入特征图的行和列应当是偶数,以获得整数个滑窗,如果输入特征图的行和列不是偶数,则可以先对输入特征图执行填充(padding)操作,以使输入特征图的行和列均是偶数。由于在winograd算法中,输入信号D是4×4的矩阵,本申请为了利用winograd算法,输入特征图的行和列还应当是不小于4的偶数。
在本申请提供的方案中,可以增加矩阵变换单元。通过矩阵变换单元可以对每一个目标矩阵进行winograd正变换,以得到变换后的目标矩阵。其中,对目标矩阵进行winograd正变换的过程可以参照winograd算法中对输入信号进行正变换的过程进行理解,即对目标矩阵左乘B T矩阵,并右乘B矩阵,得到变换后的目标矩阵。通过矩阵变换单元还可以对每一个卷积核进行winograd正变换,以得到变换后的卷积核。其中,对卷积核进行winograd正变换的过程可以参照winograd算法中对卷积核进行正变换的过程进行理解,即对卷积核左乘G矩阵,右乘G T矩阵,以得到变换后的卷积核。
此外,在卷积神经网络中,输入特征图包括多个图像通道,即输入特征图相比于winograd算法中的输入信号还增加了一个维度,该增加的维度是输入通道数目。在卷积神经网络中,卷积核包括输入通道数目的维度,卷积核还包括输出通道数目的维度(也就是卷积核的个数),即相比于winograd算法中的卷积核,卷积神经网络中的卷积核还增加了两个维度,分别是输入通道数和输出通道数。在winograd算法中,需要对矩阵U和矩阵V进行点乘运算,由于在卷积神经网络中,输入特征图增加了一个输入通道数维度,卷积核增加了输入通道数维度和输出通道数维度,所以无法将winograd算法直接应用到卷积神经网络中,现有技术一般需要对卷积神经网络的核心计算单元进行较多的修改,或者需要专用的硬件进行支持,本申请提供的方案则在得到变换后的目标矩阵,变换后的卷积核的基础上,将点乘运算的过程转换为矩阵乘运算。通过本申请提供的方案,仅需要增加矩阵变换单元,再利用卷积神经网络中常规的矩阵运算模块以及向量运算模块就可以实现将winograd算法应用到卷积神经网络中。其中,关于如何将点乘运算转换为矩阵乘运算,本申请通过构建第一矩阵和第二矩阵的方式,将点乘运算转换为第一矩阵和第二矩阵的相乘。第一矩阵包括每个变换的目标矩阵中的第i个元素,i为不大于16的正整数,第一矩阵是m行k列的矩阵,m为((W-2)(H-2)/4),第二矩阵包括每个变换的卷积核的第i个元素,第二矩阵是K行n列的矩阵,相乘结果用于确定输出特征图。通过上述过程,可以得到16个第一矩阵和16个第二矩阵,16个第一矩阵和16个第二矩阵一一对应相乘,以得到16个相乘结果。比如i为1时,第一矩阵包括每个变换的目标矩阵中的第1个元素,第二矩阵包括每个变换的卷积核的第1个元素,第一矩阵和第二矩阵相乘,以得到第1个相乘结果,i为2时,第一矩阵包括每个变换的目标矩阵中的第2个元素,第二矩阵包括每个变换的卷积核的第2个元素,第一矩阵和第二矩阵相乘,以得到第2个相乘结果,依次类推,i为16时,可以得到第16个相乘结果,本申请有时也将相乘结果称为矩阵乘结果,二者表示相同的意思。再通过向量运算模块对矩阵乘的结果进行winograd反变换,对矩阵乘的结果进行winograd反变换的过程即对矩阵乘的结果左乘A T矩阵,右乘A矩阵。由于本方案采用了构建第一矩阵和第二矩阵的方式,将点乘运算后的结果转换为16个矩阵乘结果,对矩阵乘结果进行 winograd反变换的过程,相当于对16个矩阵乘的结果进行向量的加减运算,可以通过常规的向量运算模块实现,具体的过程将在下文展开介绍。通过向量运算模块对16个矩阵乘的结果进行处理后,再对处理后的结果进行重排序,或者对处理后的结果进行求和或者求累加和即可得到与该输入特征图对应的输出特征图。
此外,在上述研究思路的基础上,为了减小加速器的面积,本申请提供的方案将对卷积核的正变换的过程拆分为两部分,一部分过程离线执行,另一部分过程在片上执行,或者卷积核的正变换的结果均通过离线计算获取。此外,输入特征图以及卷积核的数据格式可能是定点数,为了满足定点数的卷积运算的需求,本申请提供的方案可以支持反量化和量化处理,其中,反量化的过程可以是在反变换运算之前执行的,可以节省位宽,算力更大。此外,本申请提供的方案对一个相乘结果进行偏移操作,可以等效为对输出特征图进行偏移操作。此外,为了提升加速器的运算效率,本申请提供的方案中的矩阵变换单元,矩阵运算模块以及向量运算模块可以通过流水线并行执行,本申请提供的方案中的一部分计算可以是随路运算,比如一部分的winograd反变换可以在从矩阵运算模块到向量运算模块的搬运过程中随路运算(on-the-fly calculation)完成。
基于上面的研究思路,下面对本申请提供的技术方案进行具体的介绍。
参阅图3-a,为本申请实施例提供的一种基于winograd算法的神经网络加速器的结构的示意图。本申请提供的一种神经网络加速器包括预处理模块301,矩阵运算模块302以及向量运算模块303。
本申请提供的神经网络加速器,相比于现有技术中已有的神经网络加速器,仅需要增加预处理模块,既可以实现将winograd算法应用到神经网络中。
预处理模块301,用于对输入特征图对应的目标矩阵进行第一winograd正变换,以得到变换的目标矩阵。
预处理模块301,还用于对卷积核进行第二winograd正变换,以得到变换的卷积核。
矩阵运算模块302,用于对第一矩阵和第二矩阵进行矩阵乘法运算,以得到相乘结果。第一矩阵是根据变换的目标矩阵构建的,第二矩阵是根据变换的卷积核构建的。在一些实现中,矩阵运算模块302内部包括多个处理单元(process engine,PE)。在一些实现中,矩阵运算模块302是二维脉动阵列。矩阵运算模块302还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,矩阵运算模块302是通用的矩阵处理器。举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。矩阵运算模块从存储器中取矩阵B相应的数据,并缓存在矩阵运算模块中每一个PE上。矩阵运算模块从存储器中取矩阵A数据与矩阵B进行矩阵运算。
向量运算模块303,用于对相乘结果进行winograd反变换,以得到输出特征图。包括多个运算处理单元,在需要的情况下,对矩阵运算模块的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如批归一化(batch normalization),像素级求和,对特征平面进行上采样等。
在一些可能的实施方式中,参阅图3-b,预处理模块301可以包括获取单元3011,遍历单元3012以及矩阵变换单元3013。
获取单元3011,用于获取执行过填充padding后的输入特征图,输入特征图的尺寸为 W×H×k,W和H均为不小于4的偶数,k为正整数,W为输入特征图的行,H为输入特征图的列,k为输入特征图的通道数目。本申请有时也将输入特征图的通道数目简称为输入通道数或者输入通道数目,他们表示相同的意思。
可以将padding理解为在输入特征图的***补充一些像素点,把这些像素点初始化为0或者其他指定的数值。对于行和列不是不小于4的偶数的输入特征图,可以在padding处理的过程中,通过在输入特征图的***补充像素点的方式,使输入特征图的行和列均为不小于4的偶数。
需要说明的是,相关技术中关于padding的计算方式本申请实施例均可以采用。
遍历单元3012,用于通过步长为2,尺寸为4×4的滑窗遍历输入特征图,以得到(((W-2)(H-2)/4)×k)个目标矩阵,目标矩阵是滑窗对应区域的输入特征图。
参阅图4,为本申请实例提供的加速器中遍历单元3012遍历输入特征图的示意图。在图4中只展示了行和列两个维度,并未展示输入通道数目这一维度。由于输入特征图的行和列分别是W和H,通过步长为2,尺寸为4×4的滑窗遍历输入特征图,可以获得((W-2)(H-2)/4)个滑窗对应的区域,即可以获取((W-2)(H-2)/4)个4×4的矩阵,参阅图4,可以认为矩阵
Figure PCTCN2020118832-appb-000012
为一个滑窗对应的区域,矩阵
Figure PCTCN2020118832-appb-000013
即为一个目标矩阵。如果考虑到每个目标矩阵还包括输入通道数这一维度,则通过遍历单元3012,遍历输入特征图后,可以得到(((W-2)(H-2)/4)×k)个目标矩阵。
矩阵变换单元3013,用于对目标矩阵进行第一winograd正变换,以得到变换的目标矩阵。参阅图4,展示了对目标矩阵
Figure PCTCN2020118832-appb-000014
进行第一winograd正变换,得到变换的目标矩阵
Figure PCTCN2020118832-appb-000015
的过程。即对目标矩阵
Figure PCTCN2020118832-appb-000016
左乘B T矩阵,右乘B矩阵,以得到变换的目标矩阵。
矩阵变换单元3013,还用于对尺寸为3×3×k×n,步长为1的卷积核进行第二winograd正变换,以得到变换的卷积核,n为输出特征图的通道数目。参阅图5,给出了一种对卷积 核
Figure PCTCN2020118832-appb-000017
进行第二winograd正变换,得到变换的卷积核
Figure PCTCN2020118832-appb-000018
的过程。即对卷积核
Figure PCTCN2020118832-appb-000019
左乘G矩阵,右乘G T矩阵,得到变换的卷积核。
矩阵运算模块302,用于确定第一矩阵和第二矩阵的相乘结果,第一矩阵包括每个变换的目标矩阵中的第i个元素,i为不大于16的正整数,第一矩阵是m行k列的矩阵,m为((W-2)(H-2)/4),第二矩阵包括每个变换的卷积核的第i个元素,第二矩阵是K行n列的矩阵,相乘结果用于确定输出特征图。
在winograd算法中,应该对变换的卷积核和变换的目标矩阵进行点乘运算,本申请的将变换的卷积核和变换的目标矩阵的点乘运算转换为两个矩阵之间的乘法运算,以通过这样的设计仅采用常规的矩阵运算模块302就可以实现将winograd算法应用到卷积神经网络中。下面对如何构建第一矩阵和第二矩阵的思路进行说明。
将每个变换的目标矩阵中的第i个元素提取出来,形成m行k列的矩阵,该矩阵为第一矩阵。在图4的介绍中没有展示目标矩阵的k维度,即没有展示输入特征图包括多个输入通道,在构建第一矩阵的过程中考虑到输入特征图包括多个输入通道,则每个变换的目标矩阵中的每个元素应当包括多个输入通道。参阅图6-a,以i为1为例进行说明,i为1时,第一矩阵包括每个变换的目标矩阵中的第1个元素,考虑到输入特征图还包括输入通道数目这一维度,第一矩阵是m行k列的矩阵。需要说明的是,图6-a中展示的第一矩阵中的行数和列数仅为示例性的说明,k的取值应该根据输入特征图的输入通道确定,m的取值应当根据输入特征图的行数和列数确定,具体的m为((W-2)(H-2)/4),本申请对此不再重复说明。为了更好的理解方案,下面在以i为5为例进行说明,参阅图6-b,i为5时,第一矩阵包括每个变换的目标矩阵中的第5个元素,第一矩阵是m行k列的矩阵。由于每个变换的目标矩阵都会包括16个元素,所以一共可以获得16个第一矩阵。其中每个第一矩阵的构建方式可以参照图6-a和图6-b进行理解。
将每个变换的卷积核中的第i个元素提取出来,形成K行n列的矩阵,该矩阵为第二矩阵。参阅图7,以i为1为例进行说明,i为1时,第二矩阵包括每个变换的卷积核中的第1个元素,考虑到输入特征图还包括输入通道数目这一维度,第二矩阵是K行n列的矩阵。需要说明的是,图7展示的第二矩阵的行数和列数仅为示例性的说明,n的取值应当根据输出通道数确定,换句话说,应当根据卷积核的数目确定,本申请对此不再重复说明。由于每个变换的卷积核都会包括16个元素,所以一共可以获得16个第二矩阵。其中每个第二矩阵的构建方式可以参照图7进行理解。
通过以上的方式,变换后的目标矩阵与变换后的卷积核之间的点乘运算可以转换为第一矩阵和第二矩阵的乘法,参阅图8,即点乘运算的结果相当于16个矩阵乘的相乘结果。假设16个矩阵乘的相乘结果分别是矩阵S1,矩阵S2,矩阵S3,矩阵S4,矩阵S5,矩阵S6,矩阵S7,矩阵S8,矩阵S9,矩阵S10,矩阵S11,矩阵S12,矩阵S13,矩阵S14,矩 阵S15和矩阵S16。本申请提供的加速器还包括向量运算模块303,因为winograd反变换的变换矩阵A T和A中的元素为0或者±1,所以对相乘结果进行winograd反变换相当于通过向量运算模块对16个矩阵乘的相乘结果进行element wise操作。其中A T和A通过如下公式表示:
Figure PCTCN2020118832-appb-000020
Figure PCTCN2020118832-appb-000021
其中,element wise是指对至少两个矩阵中对应的元素进行运算,比如将一个矩阵中的第i个元素和另一个矩阵中的第i元素进行运算,其中运算可以包括加法运算,减法运算等等。
具体的,对16个相乘结果进行相加或者相减通过winograd反变换公式可以确定Q1=P1+P2+P3,Q2=P2-P3-P4,Q3=P5+P6+P7,Q4=P6-P7-P8,其中P1=S0+S4+S8,P2=S1+S5+S9,P3=S2+S6+S10,P4=S3+S7+S11,P5=S4-S8-S12,P6=S5-S9-S13,P7=S6-S10-S14,P8=S7-S11-S15。
其中,Q1,Q2,Q3以及Q4可以用于确定与输入特征图对应的输出特征图。
可见,对16个相乘结果进行winograd反变换可以转换为通过常规的向量运算模块303对16个矩阵乘的相乘结果进行相加或者相减运算,以输出第三矩阵,第三矩阵可以包括Q1,Q2,Q3以及Q4。可以对第三矩阵进行处理,以获得输出特征图。
在一个可能的实施方式中,如果在池化层对输入特征图进行处理,由于池化层的常见操作通常包含最大值池化,均值池化,所以可以对第三矩阵中包括的Q1,Q2,Q3以及Q4四个矩阵求最大值或者求和,均值池化时输出(Q1+Q2+Q3+Q4)/4,最大值池化时输出MAX(Q1,Q2,Q3,Q4)。根据本申请提供的方案中输出的数据,比如(Q1+Q2+Q3+Q4)/4以及MAX(Q1,Q2,Q3,Q4)可以作为输出特征图的一种表达形式。
在一个可能的实施方式中,如果在卷积层对输入特征图进行处理,还需要根据预设的重排序规则对第三矩阵中的元素进行重排序,以得到输出特征图。参阅图9,将Q1矩阵中的第i个元素,Q2矩阵中的第i个元素,Q3矩阵中的第i个元素以及Q4矩阵中的第i个元素提取出来组成2×2的矩阵,重排序后得到输出特征图。参阅图9举例说明,将Q1矩阵中的第1个元素Q1.1,Q2矩阵中的第1个元素2.1,Q3矩阵中的第1个元素3.1以及Q4矩阵中的第1个元素Q4.1提取出来组成2×2的矩阵,将Q1矩阵中的第2个元素Q1.1,Q2矩阵中的第2个元素2.1,Q3矩阵中的第2个元素3.1以及Q4矩阵中的第2个元素Q4.1提取出来组成2×2的矩阵,以此类推,直到将Q1,Q2,Q3以及Q4四个矩阵中的元素都按照进行了重排序。在一个可能的实施方式中,可以通过向量运算模块303对第三矩阵中的元素进行行内重排,再通过向量运算模块303对第三矩阵中的元素进行行间重排。在一个可能的实施方式中,可以通过向量运算模块303对第三矩阵中的元素进行行内重排吗,然后通过直接内存存取(direct memory access,DMA)搬运进行行间重排。需要说明的是,第三矩阵中的每个元素均包括多个输出通道。
下面对根据预设的重排序规则对第三矩阵中的元素进行重排序,以得到输出特征图的原理进行说明。将每个变换的目标矩阵中的第1个元素提取出来,形成的m行k列的第一矩阵,将每个变换的卷积核中的第1个元素提取出来,形成K行n列的第二矩阵,i为1时,第一矩阵和第二矩阵的相乘结果是S1,将每个变换的目标矩阵中的第2个元素提取出来,形成的m行k列的第一矩阵,将每个变换的卷积核中的第2个元素提取出来,形成K行n列的第二矩阵,i为2时,第一矩阵和第二矩阵的相乘结果是S2,以此类推。如果将S1至S16中的每个矩阵中的第1个元素都提取出来组成矩阵,比如组成矩阵1,则对该矩阵1进行winograd反变换后,可以输出2×2的矩阵,且该2×2的矩阵中的每个元素包括多个输出通道数,即每个元素都具有输出通道数这一维度。该2×2的矩阵1即为第一个滑窗所在区域的输入特征图对应的输出特征图。再比如,如果将S1至S16中的每个矩阵中的第2个元素都提取出来组成矩阵,比如组成矩阵2,则对该矩阵2进行winograd反变换后,可以输出2×2的矩阵,且该2×2的矩阵中的每个元素包括多个输出通道数。该2×2的矩阵2即为第二个滑窗所在区域的输入特征图对应的输出特征图,第二个滑窗是指步长为2的滑窗滑动一次。获取矩阵1对应的2×2的矩阵中的第i个元素,以及获取矩阵2对应的2×2的矩阵中的第i个元素的运算流程相同,依次类推,获取矩阵i对应的2×2的矩阵中的第i个元素的运算流程都相同,矩阵i是将S1至S16中的每个矩阵中的第i个元素都提取出来组成的矩阵。所以,对16个相乘结果进行winograd反变换以输出Q1,Q2,Q3以及Q4,Q1包括矩阵1至矩阵16中的第1个元素,Q2包括矩阵1至矩阵16中的第2个元素,Q3包括矩阵1至矩阵16中的第3个元素,Q4包括矩阵1至矩阵16中的第4个元素。所以得到Q1,Q2,Q3以及Q4后,需要根据预设的重排序规则对第三矩阵中的元素进行重排序,以得到输出特征图,其中重排序的方式可以参照图9进行理解。
以上对本申请实施例提供的一种加速器进行了介绍,本申请提供的方案利用通用卷积神经网络中常规的矩阵运算模块以及向量运算模块就可以实现将winograd算法应用到卷积神经网络中,针对3×3,步长为1的卷积层或者池化层,可以大量减少其中的乘法次数,提升加速器的性能以及能效比。
上文提到将每个变换的目标矩阵中的第i个元素提取出来,形成m行k列的矩阵,该矩阵为第一矩阵。为了能够进一步的提升加速器的性能,可以一次提取每个变换的目标矩阵中的多个元素,一次输出多个第一矩阵。示例性的,下面结合几个具体的实施方式进行说明。
对于每一个目标矩阵进行winograd正变换,将其转换为变换的目标矩阵的方式,都可以通过如下公式2-2表示。
$$B^{T}\,P\,B=\begin{bmatrix}m_{00}&m_{01}&m_{02}&m_{03}\\ m_{10}&m_{11}&m_{12}&m_{13}\\ m_{20}&m_{21}&m_{22}&m_{23}\\ m_{30}&m_{31}&m_{32}&m_{33}\end{bmatrix},\qquad B^{T}=\begin{bmatrix}1&0&-1&0\\0&1&1&0\\0&-1&1&0\\0&1&0&-1\end{bmatrix}\tag{2-2}$$
其中，P为目标矩阵，B为B^T的转置，m_{ij}为变换的目标矩阵中的元素。
其中，m00=P00-P20-P02+P22,m10=P10+P20-P12-P22,m20=P20-P10-P22+P12,m30=P10-P30-P12+P32,可见，m00,m10,m20以及m30的运算均用到了目标矩阵的第一列和第三列。m01=P01-P21+P02-P22,m11=P11+P21+P12+P22,m21=P21-P11+P22-P12,m31=P11-P31+P12-P32,可见，m01,m11,m21以及m31的运算均用到了目标矩阵的第二列和第三列。m02=P02-P22-P01+P21,m12=P22+P12-P11-P21,m22=P22-P12-P21+P11,m32=P12-P32-P11+P31,可见，m02,m12,m22以及m32的运算均用到了目标矩阵的第二列和第三列。m03=P01-P21-P03+P23,m13=P11+P21-P13-P23,m23=P21-P11-P23+P13,m33=P11-P31-P13+P33,可见，m03,m13,m23以及m33的运算均用到了目标矩阵的第二列和第四列。参阅图10，其更直观地体现了变换的目标矩阵中的部分元素的取值可以并行计算。当一次性获取目标矩阵中的多个元素时，可以根据获取的多个元素输出多个第一矩阵，或者输出多个第一矩阵中的部分元素。比如获取每个滑窗对应的第一列和第三列元素，当滑窗滑动一次时，可以获得三列元素；根据获取的三列元素，可以分别输出两个变换的目标矩阵的第一列元素。再比如，获取每个滑窗对应的全部元素，当滑窗滑动一次时，可以获得两个目标矩阵中的全部元素；对两个目标矩阵中的全部元素进行统一的运算，可以同时输出两个变换的目标矩阵。可以认为，为了让矩阵运算模块的利用率最大化，矩阵变换单元每次输出的变换的目标矩阵的个数可以根据实际带宽、矩阵运算模块的存储量来确定，比如矩阵变换单元每次输出1个变换的目标矩阵，2个变换的目标矩阵，4个变换的目标矩阵，8个变换的目标矩阵或者16个变换的目标矩阵等等。
下面结合几个实施例进行说明。根据上文的介绍可知，变换的目标矩阵中的各个元素的计算过程存在交叉部分，比如变换的目标矩阵中的第一列元素的计算都用到了目标矩阵中的第一列元素和第三列元素，则参阅图11-a，可以获取多个目标矩阵中的奇数列的元素，根据一次或者多次获取到的目标矩阵中的奇数列的元素，确定一个或者多个变换的目标矩阵中的第一列元素。比如，如图11-a所示，获取了目标矩阵中的三列奇数列元素，可以得到两个变换的目标矩阵中的第一列元素。再比如，变换的目标矩阵中的第二列和第三列元素的计算都用到了目标矩阵中的第二列和第三列元素，则参阅图11-b，可以获取多个目标矩阵中的多列元素，根据一次或者多次获取到的目标矩阵中的多列元素，确定一个或者多个变换的目标矩阵中的第二列和第三列元素。比如，如图11-b所示，获取了目标矩阵中的4列元素，可以得到两个变换的目标矩阵中的第二列和第三列元素。再比如，变换的目标矩阵中的第四列元素的计算都用到了目标矩阵中的第二列元素和第四列元素，则参阅图11-c，可以获取多个目标矩阵中的偶数列的元素，根据一次或者多次获取到的目标矩阵中的偶数列的元素，确定一个或者多个变换的目标矩阵中的第四列元素。如图11-a至11-c所示，当获取4行6列元素后，可以根据该4行6列元素输出两个变换的目标矩阵。需要说明的是，在图11-a至图11-c中并未展示输入通道这一维度，但是应当明确每个目标矩阵的每个元素均包括多个输入通道，每个变换的目标矩阵的每个元素也包括多个输入通道。
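为直观起见，下面用一段示意性代码验证公式2-2的计算过程（未画出输入通道维度，B^T取公式2-2中的变换矩阵），其中先对列做加减得到中间结果、再对行做加减，体现了多个输出元素可以复用目标矩阵的列：

```python
import numpy as np

BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float32)

P = np.random.randn(4, 4).astype(np.float32)   # 一个目标矩阵（示例数据）

tmp = BT @ P          # 先对列做加减，可被多个输出元素复用
M = tmp @ BT.T        # 变换的目标矩阵，即 B^T · P · B
assert np.isclose(M[0, 0], P[0, 0] - P[2, 0] - P[0, 2] + P[2, 2])   # 对应文中的 m00
```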
为了能够进一步的提升加速器的性能,可以一次提取每个卷积核中的多个元素,一次输出多个第二矩阵。变换的卷积核中的各个元素的计算过程存在交叉部分,下面结合公式2-3进行说明。
$$G\,k'\,G^{T}=\begin{bmatrix}q_{00}&q_{01}&q_{02}&q_{03}\\ q_{10}&q_{11}&q_{12}&q_{13}\\ q_{20}&q_{21}&q_{22}&q_{23}\\ q_{30}&q_{31}&q_{32}&q_{33}\end{bmatrix},\qquad G=\begin{bmatrix}1&0&0\\ \tfrac{1}{2}&\tfrac{1}{2}&\tfrac{1}{2}\\ \tfrac{1}{2}&-\tfrac{1}{2}&\tfrac{1}{2}\\ 0&0&1\end{bmatrix}\tag{2-3}$$
其中，k′为卷积核，q_{ij}为变换的卷积核中的元素。
其中，q00=k′00，q10=(k′00+k′10+k′20)/2，q20=(k′00-k′10+k′20)/2，q30=k′20，可见，q00，q10，q20以及q30的运算均用到了卷积核的第一列。q01=(k′00+k′01+k′02)/2，q11=(k′00+k′01+k′02+k′10+k′11+k′12+k′20+k′21+k′22)/4，q21=(k′00+k′01+k′02-k′10-k′11-k′12+k′20+k′21+k′22)/4，q31=(k′20+k′21+k′22)/2，可见，q01，q11，q21以及q31的运算均用到了卷积核的每一列。q02=(k′00-k′01+k′02)/2，q12=(k′00-k′01+k′02+k′10-k′11+k′12+k′20-k′21+k′22)/4，q22=(k′00-k′01+k′02-k′10+k′11-k′12+k′20-k′21+k′22)/4，q32=(k′20-k′21+k′22)/2，可见，q02，q12，q22以及q32的运算均用到了卷积核的每一列。q03=k′02，q13=(k′02+k′12+k′22)/2，q23=(k′02-k′12+k′22)/2，q33=k′22，可见，q03，q13，q23以及q33的运算均用到了卷积核的第三列。
对每一个卷积核进行winograd正变换，将其转换为变换的卷积核的方式，都可以通过公式2-3表示。变换的卷积核中的各个元素的计算过程存在交叉部分，可以通过对卷积核的各个元素之间进行向量加减来进行运算，以输出多个变换的卷积核，或者输出多个变换的卷积核中的部分元素。为提高并行度，每个点（元素）可以带有全部或部分的输入通道和输出通道这两个维度。
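对应地，卷积核的第二winograd正变换（公式2-3）可以用如下示意性代码验证，其中未画出输入/输出通道维度：

```python
import numpy as np

G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float32)

kernel = np.random.randn(3, 3).astype(np.float32)   # 一个 3x3 卷积核（示例数据）
q = G @ kernel @ G.T                                 # 4x4 的变换的卷积核

assert np.isclose(q[0, 0], kernel[0, 0])                                      # q00 = k'00
assert np.isclose(q[1, 0], (kernel[0, 0] + kernel[1, 0] + kernel[2, 0]) / 2)  # q10
```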
需要说明的是,对应不同的带宽和存储需求,在输出16个第一矩阵,或者16个第二矩阵时,可以有多种计算顺序。
在一个可能的实施方式中,为了减少加速器中的矩阵变换单元的计算量,可以将第二winograd正变换的过程离线执行,即本申请提供的加速器中还包括存储模块,该存储模块用于存储第二winograd正变换的结果,加速器中的其他模块可以直接调用存储模块中预先存储的第二winograd正变换的结果。在一个可能的实施方式中,也可以部分第二winograd正变换的过程在片上执行,部分第二winograd正变换的过程是离线执行。下面对此进行举例说明。
第二winograd正变换包括第三winograd正变换以及第四winograd正变换，神经网络加速器还包括存储模块，存储模块用于存储通过第三矩阵对卷积核进行第三winograd正变换的第一变换结果。矩阵变换单元，具体用于通过第四矩阵对第一变换结果进行第四winograd正变换，以得到变换的卷积核，第三矩阵和第四矩阵是对第二winograd正变换的变换矩阵进行分解后得到的矩阵，第三矩阵中的元素的取值为0或者±1，第四矩阵是分解后的矩阵中除第三矩阵之外的矩阵。下面举例说明，G×K×G^T=V可以转换为公式2-4：
V=G×K×G^T=GL×(GR×K×GR^T)×GL^T=GL×Wm×GL^T  (2-4)
其中，Wm=GR×K×GR^T可以离线执行，该结果可以预先存储在存储模块中，GL×Wm×GL^T可以在片上执行。将第二winograd正变换的变换矩阵G拆分为一个3×3的矩阵GR（公式2-5）和一个4×3的矩阵GL（公式2-6）。需要说明的是，还可以有其他的拆分方式，使拆分后的变换矩阵中的一个矩阵中的全部元素均为0或者±1。
Figure PCTCN2020118832-appb-000025
Figure PCTCN2020118832-appb-000026
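下面给出一种可能的拆分方式的示意性代码草图，其中GL、GR的具体取值仅为满足“GL×GR=G且GR中元素均为0或±1”的一个示例性假设，并不一定与公式2-5、2-6中给出的矩阵相同：

```python
import numpy as np

G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
GR = np.array([[1, 0, 0], [1, 1, 1], [0, 0, 1]])                    # 3x3，元素均为 0 或 ±1
GL = np.array([[1, 0, 0], [0, 0.5, 0], [1, -0.5, 1], [0, 0, 1]])    # 4x3
assert np.allclose(GL @ GR, G)                                      # GL x GR 恢复出 G

K = np.random.randn(3, 3)                                           # 一个 3x3 卷积核（示例数据）
Wm = GR @ K @ GR.T                                                  # 可离线执行并预先存储
V  = GL @ Wm @ GL.T                                                 # 片上执行
assert np.allclose(V, G @ K @ G.T)                                  # 与公式 2-4 一致
```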
为了满足定点数的卷积运算的需求，本申请提供的方案可以支持反量化和量化处理。在一个可能的实施方式中，向量运算模块可以支持反量化（De-quantization）和量化（Quantization）操作，以满足定点数运算的需求。其中，反量化可以用于把定点数转为浮点数或其他利于向量运算模块运算的定点数，例如：s32->f16、s32->s16；量化用于把向量运算模块重排后的结果转为下一层运算的定点数输入，例如：s16->s8、f16->s8。在一个可能的实施方式中，反量化可以位于Winograd反变换之前，量化可以位于Winograd反变换之后。其中，反量化的过程在反变换运算之前执行，可以节省位宽，获得更大的算力。需要说明的是，关于量化和反量化的具体方式本申请实施例并不进行限定。
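反量化与量化的一种常见做法可以用如下示意性代码表示，其中scale的取值与确定方式均为示例性假设，本申请对具体量化方案不作限定：

```python
import numpy as np

# 矩阵运算模块输出的定点累加结果（示例数据）
acc_s32 = np.random.randint(-2**20, 2**20, size=(6, 5), dtype=np.int32)

dequant_scale = 0.001
x_f16 = (acc_s32 * dequant_scale).astype(np.float16)    # 反量化：s32 -> f16，位于winograd反变换之前

# ……此处省略winograd反变换与重排……

quant_scale = 16.0
y_s8 = np.clip(np.round(x_f16.astype(np.float32) * quant_scale), -128, 127).astype(np.int8)  # 量化：f16 -> s8
```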
参阅图12，为本申请提供的一种加速器的结构示意图。本申请提供的加速器基于常规矩阵运算模块和向量运算模块，通过较少的架构修改，实现了将winograd算法应用到神经网络的加速算法中。该加速器通过遍历单元和矩阵变换单元对获取单元获取的输入特征图进行遍历处理和winograd正变换处理，以输出16个第一矩阵；通过矩阵变换单元对卷积核进行winograd正变换处理，以输出16个第二矩阵。关于获取第一矩阵和第二矩阵的方式以及原理已经在上文进行了说明，这里不再赘述。在矩阵运算模块中进行16个独立的矩阵乘法运算，生成16个相乘结果。16个相乘结果在向量运算模块中进行winograd反变换处理，生成4个矩阵结果，最后通过向量运算模块进行后处理，后处理包括数据重排操作、求和操作或者求最大值操作。如果在卷积层对输入特征图进行处理，可以通过向量运算模块的数据搬移功能对数据进行重排操作，以得到输出特征图；如果在池化层对输入特征图进行处理，可以对数据进行求和运算或者求最大值运算，以得到输出特征图。此外，该加速器支持浮点、定点等不同数据格式。当计算过程涉及定点运算时，向量运算模块可以进行反量化和量化（Quantization）运算，用于支持定点数卷积运算。
在一个可能的实施方式中,可以对至少一个相乘结果进行偏移操作。在本申请提供的方案中,对一个相乘结果进行偏移操作,可以等效为对输出特征图进行偏移操作。下面对此进行证明:
Figure PCTCN2020118832-appb-000027
上述公式中b代表偏置，通过上述公式2-7可以得到c的其中一个取值为
Figure PCTCN2020118832-appb-000028
可见,对第5个相乘结果进行偏移操作,可以等效为对输出特征图进行偏移操作。参阅图13,为对相乘结果进行偏移操作的一种可能方式的示意图,对一个相乘结果进行偏移操作,可以等效为对输出特征图进行偏移操作。
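下面用一段示意性代码验证“对其中一个相乘结果加偏置，可等效为对输出特征图加偏置”的结论。此处选取4×4相乘结果中行主序下标为(1,1)的那个结果加偏置，具体选取哪个相乘结果以公式2-7、2-8为准，这里仅为示例性假设：

```python
import numpy as np

AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=np.float64)   # winograd反变换矩阵

S = np.random.randn(4, 4)        # 16 个相乘结果按 4x4 排布（示例数据）
b = 0.7                          # 偏置

Y0 = AT @ S @ AT.T               # 不加偏置的 2x2 输出
S_biased = S.copy()
S_biased[1, 1] += b              # 只对一个相乘结果加偏置
Y1 = AT @ S_biased @ AT.T

assert np.allclose(Y1, Y0 + b)   # 等效于对输出特征图整体加偏置
```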
在一个可能的实施方式中，为了减小加速器的计算时间，矩阵变换单元、向量运算模块中的运算可以是随路运算。比如，矩阵变换单元的功能可以被固化为一个指令以供调用，矩阵变换单元可以包含在从上层存储器到矩阵运算模块的数据搬运过程中，即在上层存储器的数据搬运到矩阵运算模块的过程中，对数据进行处理，处理的过程可以参照矩阵变换单元执行的操作进行理解。再比如，向量运算模块的偏移操作、反量化操作、或者一部分的winograd反变换可以通过随路运算完成。参阅图14，展示了本申请提供的方案中的随路运算在整个运算流程中所处位置的示意图。如图14所示，偏移操作、反量化操作、或者一部分的winograd反变换可以在从矩阵运算模块到向量运算模块的搬运过程中随路运算完成。
在一个可能的实施方式中，如图15所示，矩阵变换单元、矩阵运算模块以及向量运算模块可以通过流水线并行执行，以提高运算效率。即矩阵变换单元获取了一部分的winograd正变换的结果后，可以将这部分结果发送至矩阵运算模块，使矩阵运算模块获取一部分的相乘结果；矩阵运算模块获取了一部分的相乘结果后，可以将这一部分的相乘结果发送至向量运算模块，使向量运算模块可以对这一部分的相乘结果进行winograd反变换。前文已经介绍了，矩阵变换单元每次输出的矩阵的数目可以根据带宽和矩阵运算模块的存储量确定，每次可以输出1个或者多个第一矩阵或者第二矩阵，这里不再重复说明。示例性的，下面结合伪代码进行说明，假设输入特征图的尺寸为56×56×k，卷积核的尺寸为3×3×k×n，以下为矩阵变换单元每次只输出4个第一矩阵、4个第二矩阵时的伪代码。
Figure PCTCN2020118832-appb-000029
Figure PCTCN2020118832-appb-000030
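下面给出一段与上述描述对应的示意性Python代码草图，其中的分组大小、数据排布与循环写法均为示例性假设，并非对上述伪代码的逐行复原：

```python
import numpy as np

W, H, K, N = 56, 56, 16, 32                        # 输入特征图 56x56xK，卷积核 3x3xKxN
m = (W - 2) * (H - 2) // 4                         # 滑窗（目标矩阵）的个数

transformed_targets = np.random.randn(m, K, 4, 4)  # 变换的目标矩阵（示例数据）
transformed_kernels = np.random.randn(K, N, 4, 4)  # 变换的卷积核（示例数据）

GROUP = 4                                          # 每次只输出 4 个第一矩阵、4 个第二矩阵
products = [None] * 16
for i0 in range(0, 16, GROUP):                     # 分组处理，便于与矩阵运算模块流水并行
    for i in range(i0, i0 + GROUP):
        r, c = divmod(i, 4)
        first = transformed_targets[:, :, r, c]    # m x K 的第一矩阵
        second = transformed_kernels[:, :, r, c]   # K x N 的第二矩阵
        products[i] = first @ second               # 矩阵运算模块中的一次独立矩阵乘
```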
参阅图16，其为考虑到实际带宽以及矩阵运算模块的存储量，对输入特征图和卷积核进行切块处理、以分多次运算得到输出特征图的示意图，具体过程可以参照上述伪代码进行理解，这里不再赘述。
本申请实施例还提供一种加速方法,可以包括以下步骤:对输入特征图对应的目标矩阵进行第一winograd正变换,以得到变换的目标矩阵。对卷积核进行第二winograd正变换,以得到变换的卷积核。对第一矩阵和第二矩阵进行矩阵乘法运算,以得到相乘结果,第一矩阵是根据变换的目标矩阵构建的,第二矩阵是根据变换的卷积核构建的。对相乘结果进行winograd反变换,以得到输出特征图。
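为便于整体理解上述加速方法，下面给出一段端到端的示意性代码草图，按“正变换—点乘/矩阵乘—反变换—重排”的顺序走一遍流程，并与直接卷积结果对比以验证等价性。其中的数据排布与循环写法均为示例性假设，并非加速器的实际实现：

```python
import numpy as np

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

W, H, k, n = 6, 8, 3, 2                      # 已执行过padding的输入特征图：W行H列k通道
x = np.random.randn(W, H, k)
w = np.random.randn(3, 3, k, n)              # 3x3 卷积核，步长为 1

# 直接卷积，作为参考结果
ref = np.zeros((W - 2, H - 2, n))
for r in range(W - 2):
    for c in range(H - 2):
        ref[r, c] = np.einsum('ijk,ijkn->n', x[r:r + 3, c:c + 3, :], w)

# winograd 流程
out = np.zeros_like(ref)
Uk = np.einsum('ab,bckn,dc->adkn', G, w, G)  # 变换的卷积核：4x4xkxn
for r in range(0, W - 2, 2):                 # 步长为 2、尺寸为 4x4 的滑窗
    for c in range(0, H - 2, 2):
        d = x[r:r + 4, c:c + 4, :]
        V = np.einsum('ab,bck,dc->adk', BT, d, BT)   # 变换的目标矩阵：4x4xk
        M = np.einsum('abk,abkn->abn', V, Uk)        # 对应 16 次“第一矩阵×第二矩阵”的乘加
        Y = np.einsum('pa,abn,qb->pqn', AT, M, AT)   # winograd 反变换：2x2xn
        out[r:r + 2, c:c + 2, :] = Y                 # 重排到输出特征图

assert np.allclose(out, ref)                 # 与直接卷积结果一致
```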
在一个可能的实施方式中,该方法还包括对输入特征图执行填充padding操作,以使输入特征图的尺寸为W×H×k,W和H均为不小于4的偶数,k为大于1的整数,W为输入特征图的行,H为输入特征图的列,k为输入特征图的通道数目。通过步长为2,尺寸为4×4的滑窗遍历输入特征图,以得到(((W-2)(H-2)/4)×k)个目标矩阵。
在一个可能的实施方式中,卷积核的尺寸为3×3×k×n,卷积核的步长为1,n为输出特征图的通道数目,n为大于1的整数。
在一个可能的实施方式中,第一矩阵包括变换的目标矩阵中的第i个元素,i为不大于16的正整数,第一矩阵是m行k列的矩阵,m为((W-2)(H-2)/4),第二矩阵包括变换的卷积核的第i个元素,第二矩阵是K行n列的矩阵,相乘结果用于确定输出特征图。
在一个可能的实施方式中,对相乘结果进行winograd反变换,以得到输出特征图,包括:对相乘结果进行winograd反变换,以得到第三矩阵。通过预设的重排序规则对第三矩阵中的元素进行重排序,以得到输出特征图。
在一个可能的实施方式中,对相乘结果进行winograd反变换,以得到输出特征图,包括:对相乘结果进行winograd反变换,以输出第三矩阵。对第三矩阵中的元素进行求和运算,以得到输出特征图。
在一个可能的实施方式中，第二winograd正变换包括第三winograd正变换以及第四winograd正变换，对尺寸为3×3×k×n，步长为1的卷积核进行第二winograd正变换，以得到变换的卷积核，包括：获取通过第三矩阵对卷积核进行第三winograd正变换得到的第一变换结果。通过第四矩阵对第一变换结果进行第四winograd正变换，以得到变换的卷积核，第三矩阵和第四矩阵是对第二winograd正变换的变换矩阵进行分解后得到的矩阵，第三矩阵中的元素的取值为0或者±1，第四矩阵是分解后的矩阵中除第三矩阵之外的矩阵。
在一个可能的实施方式中,还包括:获取多个变换的目标矩阵的M个元素,M为大于1的整数。按照第一预设公式对M个元素进行处理,以输出多个第一矩阵。获取多个变换的卷积核的N个元素,N为大于1的整数。按照第二预设公式对N个元素进行处理,以输出多个第二矩阵。
在一个可能的实施方式中,还包括:对一个相乘结果进行偏移操作。
本申请实施例中还提供一种计算机可读存储介质，该计算机可读存储介质中存储有用于加速的程序，当其在计算机上运行时，使得计算机执行如前述图3-a至图15所示实施例描述的神经网络加速器执行的步骤。
本申请中的神经网络加速器也可以通过数字处理芯片或者芯片实现，芯片包括处理单元和通信接口，处理单元通过通信接口获取程序指令，程序指令被处理单元执行，处理单元用于执行前述图3-a至图15中任一实施例所示的神经网络加速器执行的方法步骤。
本申请实施例还提供一种数字处理芯片。该数字处理芯片根据外置的存储器中存储的程序代码来实现上述实施例中神经网络加速器执行的动作。
本申请实施例中还提供一种计算机程序产品，当其在计算机上运行时，使得计算机执行如前述图3-a至图15所示实施例描述的方法中神经网络加速器所执行的步骤。
本申请实施例提供的神经网络加速器可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使服务器内的芯片执行上述图3-a至图15所示实施例描述的神经网络加速器所执行的步骤。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体地，前述的处理单元或者处理器可以是中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者也可以是任何常规的处理器等。
请参阅图17,图17为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU,NPU作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为矩阵运算模块302,通过控制器308控制矩阵运算模块302提取存储器中的矩阵数据并进行乘法运算。需要说明的是,控制器308还可以控制NPU中的其他模块。
矩阵运算模块302具体执行的步骤可以参照图3-a至图15中任一实施例所描述的矩阵运算模块302执行的步骤进行理解。
还包括预处理模块301，预处理模块具体执行的步骤可以参照图3-a至图15中任一实施例所描述的预处理模块执行的步骤进行理解。比如，可以参照图3-a至图15中的获取单元3011，遍历单元3012以及矩阵变换单元3013执行的动作进行理解。
总线接口单元(bus interface unit,BIU)310,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)309的交互。
总线接口单元310(bus interface unit,BIU),用于取指存储器309从外部存储器获取指令,还用于存储单元访问控制器306从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
向量运算模块303具体执行的步骤可以参照图3-a至图15中任一实施例所描述的向量运算模块303执行的步骤进行理解。
在一些实现中,向量运算模块303能将经处理的输出的向量存储到统一存储器307。例如,向量运算模块303可以将线性函数和/或非线性函数应用到矩阵运算模块302的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量运算模块303生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到矩阵运算模块302的激活输入,例如用于在神经网络中的后续层中的使用。
控制器308连接的取指存储器(instruction fetch buffer)309,用于存储控制器308使用的指令。
统一存储器307,输入存储器305,权重存储器304以及取指存储器309均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中，神经网络中各层的运算可以由矩阵运算模块302或向量运算模块303执行。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述图3-a至图15的方法的程序执行的集成电路。
数据流为：通过总线接口单元310从外部存储器获取数据（可以包括输入特征图和权重），并将获取到的数据存储到统一存储器中；存储单元访问控制器控制统一存储器，使统一存储器中的数据传输至矩阵变换单元；矩阵变换单元输出的数据传输至权重存储器304和输入存储器；权重存储器304和输入存储器输出数据至矩阵运算模块；矩阵运算模块输出的数据传输至向量运算模块；向量运算模块的输出结果存储在统一存储器中，结果可以输出至外部总线。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述，所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现，当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下，凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现，而且，用来实现同一功能的具体硬件结构也可以是多种多样的，例如模拟电路、数字电路或专用电路等。但是，对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在可读取的存储介质中，如计算机的软盘、U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。

Claims (27)

  1. 一种神经网络加速器,其特征在于,包括:
    预处理模块,用于对输入特征图对应的目标矩阵进行第一winograd正变换,以得到变换的目标矩阵;
    所述预处理模块,还用于对卷积核进行第二winograd正变换,以得到变换的卷积核;
    矩阵运算模块,用于对第一矩阵和第二矩阵进行矩阵乘法运算,以得到相乘结果,所述第一矩阵是根据所述变换的目标矩阵构建的,所述第二矩阵是根据所述变换的卷积核构建的;
    向量运算模块,用于对所述相乘结果进行winograd反变换,以得到输出特征图。
  2. 根据权利要求1所述的神经网络加速器,其特征在于,所述预处理模块,还用于:
    通过滑窗遍历所述输入特征图,以得到所述输入特征图对应的目标矩阵。
  3. 根据权利要求2所述的神经网络加速器,其特征在于,所述输入特征图为执行过填充操作的输入特征图,所述输入特征图的尺寸为W×H×k,所述W和所述H均为不小于4的偶数,所述k为大于1的整数,所述W为所述输入特征图的行,所述H为所述输入特征图的列,所述k为所述输入特征图的通道数目,
    所述预处理模块,具体用于:通过步长为2,尺寸为4×4的滑窗遍历所述输入特征图,以得到(((W-2)(H-2)/4)×k)个目标矩阵。
  4. 根据权利要求3所述的神经网络加速器,其特征在于,所述卷积核的尺寸为3×3×k×n,所述卷积核的步长为1,所述n为输出特征图的通道数目,所述n为大于1的整数。
  5. 根据权利要求4所述的神经网络加速器,其特征在于,所述第一矩阵包括所述变换的目标矩阵中的第i个元素,所述i为不大于16的正整数,所述第一矩阵是m行k列的矩阵,所述m为((W-2)(H-2)/4),所述第二矩阵包括所述变换的卷积核的第i个元素,所述第二矩阵是K行n列的矩阵。
  6. 根据权利要求1至5任一项所述的神经网络加速器,其特征在于,所述向量运算模块,具体用于:
    对所述相乘结果进行所述winograd反变换,以得到第三矩阵;
    通过预设的重排序规则对所述第三矩阵中的元素进行重排序,以得到所述输出特征图。
  7. 根据权利要求1至5任一项所述的神经网络加速器,其特征在于,所述向量运算模块,具体用于:
    对所述相乘结果进行所述winograd反变换,以得到第三矩阵;
    对所述第三矩阵中的元素进行求和运算或者求最大值运算,以得到所述输出特征图。
  8. 根据权利要求1至7任一项所述的神经网络加速器,其特征在于,所述第二winograd正变换包括第三winograd正变换以及第四winograd正变换,所述神经网络加速器还包括存储模块,
    所述存储模块,用于存储通过第三矩阵对所述卷积核进行第三winograd正变换的第一变换结果;
    所述预处理模块,具体用于通过第四矩阵对所述第一变换结果进行第四winograd正变换,以得到所述变换的卷积核,所述第三矩阵和所述第四矩阵是对所述第二winograd正变换的变换矩阵进行分解后得到的矩阵,所述第三矩阵中的元素的取值为0或者±1,所述第四矩阵是所述分解后的矩阵中除所述第三矩阵之外的矩阵。
  9. 根据权利要求1至8任一项所述的神经网络加速器,其特征在于,所述预处理模块,还用于:
    获取多个所述变换的目标矩阵的M个元素,所述M为大于1的整数;
    按照第一预设公式对所述M个元素进行处理,以输出多个所述第一矩阵;
    获取多个所述变换的卷积核的N个元素,所述N为大于1的整数;
    按照第二预设公式对所述N个元素进行处理,以输出多个所述第二矩阵。
  10. 根据权利要求1至9任一项所述的神经网络加速器,其特征在于,
    所述向量运算模块,还用于对所述相乘结果进行反量化处理,以得到反量化的相乘结果;
    所述向量运算模块，具体用于对所述反量化的相乘结果进行所述winograd反变换，以得到所述输出特征图；
    所述向量运算模块，还用于对所述输出特征图进行量化处理，以得到量化的所述输出特征图。
  11. 根据权利要求1至10任一项所述的神经网络加速器,其特征在于,所述向量运算模块,还用于:
    对所述相乘结果进行偏移操作。
  12. 一种加速方法,其特征在于,包括:
    对输入特征图对应的目标矩阵进行第一winograd正变换,以得到变换的目标矩阵;
    对卷积核进行第二winograd正变换,以得到变换的卷积核;
    对第一矩阵和第二矩阵进行矩阵乘法运算,以得到相乘结果,所述第一矩阵是根据所述变换的目标矩阵构建的,所述第二矩阵是根据所述变换的卷积核构建的;
    对所述相乘结果进行winograd反变换,以得到输出特征图。
  13. 根据权利要求12所述的加速方法,其特征在于,所述方法还包括:
    通过滑窗遍历所述输入特征图,以得到所述输入特征图对应的目标矩阵。
  14. 根据权利要求13所述的加速方法,其特征在于,所述方法还包括:
    所述输入特征图为执行过填充操作的输入特征图,所述输入特征图的尺寸为W×H×k,所述W和所述H均为不小于4的偶数,所述k为大于1的整数,所述W为所述输入特征图的行,所述H为所述输入特征图的列,所述k为所述输入特征图的通道数目;
    通过步长为2,尺寸为4×4的滑窗遍历所述输入特征图,以得到(((W-2)(H-2)/4)×k)个目标矩阵。
  15. 根据权利要求14所述的加速方法,其特征在于,所述卷积核的尺寸为3×3×k×n,所述卷积核的步长为1,所述n为输出特征图的通道数目,所述n为大于1的整数。
  16. 根据权利要求15所述的加速方法，其特征在于，所述第一矩阵包括所述变换的目标矩阵中的第i个元素，所述i为不大于16的正整数，所述第一矩阵是m行k列的矩阵，所述m为((W-2)(H-2)/4)，所述第二矩阵包括所述变换的卷积核的第i个元素，所述第二矩阵是K行n列的矩阵，所述相乘结果用于确定所述输出特征图。
  17. 根据权利要求12至16任一项所述的加速方法,其特征在于,所述对所述相乘结果进行winograd反变换,以得到输出特征图,包括:
    对所述相乘结果进行所述winograd反变换,以得到第三矩阵;
    通过预设的重排序规则对所述第三矩阵中的元素进行重排序,以得到所述输出特征图。
  18. 根据权利要求12至16任一项所述的加速方法,其特征在于,所述对所述相乘结果进行winograd反变换,以得到输出特征图,包括:
    对所述相乘结果进行所述winograd反变换,以得到第三矩阵;
    对所述第三矩阵中的元素进行求和运算或者求最大值运算,以得到所述输出特征图。
  19. 根据权利要求12至18任一项所述的加速方法,其特征在于,所述第二winograd正变换包括第三winograd正变换以及第四winograd正变换,所述对卷积核进行第二winograd正变换,以得到变换的卷积核,包括:
    通过第四矩阵对预先存储的第一变换结果进行第四winograd正变换,以得到所述变换的卷积核,所述第一变换结果是通过第三矩阵对所述卷积核进行第三winograd正变换的结果,所述第三矩阵和所述第四矩阵是对所述第二winograd正变换的变换矩阵进行分解后得到的矩阵,所述第三矩阵中的元素的取值为0或者±1,所述第四矩阵是所述分解后的矩阵中除所述第三矩阵之外的矩阵。
  20. 根据权利要求12至19任一项所述的加速方法,其特征在于,所述方法还包括:
    获取多个所述变换的目标矩阵的M个元素,所述M为大于1的整数;
    按照第一预设公式对所述M个元素进行处理,以输出多个所述第一矩阵;
    获取多个所述变换的卷积核的N个元素,所述N为大于1的整数;
    按照第二预设公式对所述N个元素进行处理,以输出多个所述第二矩阵。
  21. 根据权利要求12至20任一项所述的加速方法,其特征在于,所述方法还包括:
    对所述相乘结果进行反量化处理,以得到反量化的相乘结果;
    所述对所述相乘结果进行winograd反变换,以得到输出特征图,包括:
    对所述反量化的相乘结果进行所述winograd反变换,以得到所述输出特征图;
    所述方法还包括:
    对所述输出特征图进行量化处理,以得到量化的所述输出特征图。
  22. 根据权利要求12至21任一项所述的加速方法,其特征在于,所述方法还包括:
    对一个所述相乘结果进行偏移操作。
  23. 一种神经网络装置,其特征在于,所述神经网络装置包括神经网络加速器,所述神经网络加速器为权利要求1至11任一项所述的神经网络加速器。
  24. 一种芯片系统，其特征在于，所述芯片系统包括处理器和通信接口，所述处理器通过所述通信接口获取程序指令，当所述程序指令被所述处理器执行时实现权利要求12至22中任一项所述的方法。
  25. 一种芯片系统，其特征在于，所述芯片系统包括处理器和存储器，所述存储器存储有程序，当所述存储器存储的程序指令被所述处理器执行时实现权利要求12至22中任一项所述的方法。
  26. 一种计算机可读存储介质,其特征在于,包括程序,当其被处理单元所执行时,执行如权利要求12至22中任一项所述的方法。
  27. 一种计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得计算机执行如权利要求12至22中任一项所述的方法。
PCT/CN2020/118832 2020-09-29 2020-09-29 一种神经网络加速器、加速方法以及装置 WO2022067508A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2020/118832 WO2022067508A1 (zh) 2020-09-29 2020-09-29 一种神经网络加速器、加速方法以及装置
CN202080105218.7A CN116113941A (zh) 2020-09-29 2020-09-29 一种神经网络加速器、加速方法以及装置
EP20955534.1A EP4213070A4 (en) 2020-09-29 2020-09-29 ACCELERATOR OF A NEURONAL NETWORK AND ACCELERATION METHOD AND DEVICE
US18/191,134 US20230236891A1 (en) 2020-09-29 2023-03-28 Neural network accelerator, acceleration method, and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/118832 WO2022067508A1 (zh) 2020-09-29 2020-09-29 一种神经网络加速器、加速方法以及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/191,134 Continuation US20230236891A1 (en) 2020-09-29 2023-03-28 Neural network accelerator, acceleration method, and apparatus

Publications (1)

Publication Number Publication Date
WO2022067508A1 true WO2022067508A1 (zh) 2022-04-07

Family

ID=80949248

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118832 WO2022067508A1 (zh) 2020-09-29 2020-09-29 一种神经网络加速器、加速方法以及装置

Country Status (4)

Country Link
US (1) US20230236891A1 (zh)
EP (1) EP4213070A4 (zh)
CN (1) CN116113941A (zh)
WO (1) WO2022067508A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114904901A (zh) * 2022-06-09 2022-08-16 清华大学 稳定化材料选择方法、装置、计算机设备、介质和产品
CN114995782A (zh) * 2022-08-03 2022-09-02 上海登临科技有限公司 数据处理方法、装置、设备和可读存储介质
CN115391727A (zh) * 2022-08-18 2022-11-25 上海燧原科技有限公司 一种神经网络模型的计算方法、装置、设备及存储介质
CN115600062A (zh) * 2022-12-14 2023-01-13 深圳思谋信息科技有限公司(Cn) 卷积处理方法、电路、电子设备及计算机可读存储介质
CN116152520A (zh) * 2023-04-23 2023-05-23 深圳市九天睿芯科技有限公司 用于神经网络加速器的数据处理方法、芯片及电子设备
CN116167424A (zh) * 2023-04-23 2023-05-26 深圳市九天睿芯科技有限公司 基于cim的神经网络加速器、方法、存算处理***与设备
WO2023231559A1 (zh) * 2022-05-31 2023-12-07 华为技术有限公司 一种神经网络加速器、加速方法以及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767000A (zh) * 2019-01-16 2019-05-17 厦门美图之家科技有限公司 基于Winograd算法的神经网络卷积方法及装置
CN110533164A (zh) * 2019-08-05 2019-12-03 西安交通大学 一种面向卷积神经网络加速器的Winograd卷积拆分方法
US20200019851A1 (en) * 2018-07-10 2020-01-16 The George Washington University Optical convolutional neural network accelerator
CN110807513A (zh) * 2019-10-23 2020-02-18 中国人民解放军国防科技大学 一种基于Winograd稀疏算法的卷积神经网络加速器

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3557484B1 (en) * 2016-12-14 2021-11-17 Shanghai Cambricon Information Technology Co., Ltd Neural network convolution operation device and method
CN108765247B (zh) * 2018-05-15 2023-01-10 腾讯科技(深圳)有限公司 图像处理方法、装置、存储介质及设备
CN111260020B (zh) * 2018-11-30 2024-04-16 深圳市海思半导体有限公司 卷积神经网络计算的方法和装置
KR20200091623A (ko) * 2019-01-23 2020-07-31 삼성전자주식회사 위노그라드 변환에 기반한 뉴럴 네트워크의 컨볼루션 연산을 수행하는 방법 및 장치

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019851A1 (en) * 2018-07-10 2020-01-16 The George Washington University Optical convolutional neural network accelerator
CN109767000A (zh) * 2019-01-16 2019-05-17 厦门美图之家科技有限公司 基于Winograd算法的神经网络卷积方法及装置
CN110533164A (zh) * 2019-08-05 2019-12-03 西安交通大学 一种面向卷积神经网络加速器的Winograd卷积拆分方法
CN110807513A (zh) * 2019-10-23 2020-02-18 中国人民解放军国防科技大学 一种基于Winograd稀疏算法的卷积神经网络加速器

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4213070A4 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023231559A1 (zh) * 2022-05-31 2023-12-07 华为技术有限公司 一种神经网络加速器、加速方法以及装置
CN114904901A (zh) * 2022-06-09 2022-08-16 清华大学 稳定化材料选择方法、装置、计算机设备、介质和产品
CN114904901B (zh) * 2022-06-09 2024-01-12 清华大学 稳定化材料选择方法、装置、计算机设备、介质和产品
CN114995782A (zh) * 2022-08-03 2022-09-02 上海登临科技有限公司 数据处理方法、装置、设备和可读存储介质
CN115391727A (zh) * 2022-08-18 2022-11-25 上海燧原科技有限公司 一种神经网络模型的计算方法、装置、设备及存储介质
CN115391727B (zh) * 2022-08-18 2023-08-18 上海燧原科技有限公司 一种神经网络模型的计算方法、装置、设备及存储介质
CN115600062A (zh) * 2022-12-14 2023-01-13 深圳思谋信息科技有限公司(Cn) 卷积处理方法、电路、电子设备及计算机可读存储介质
CN115600062B (zh) * 2022-12-14 2023-04-07 深圳思谋信息科技有限公司 卷积处理方法、电路、电子设备及计算机可读存储介质
CN116152520A (zh) * 2023-04-23 2023-05-23 深圳市九天睿芯科技有限公司 用于神经网络加速器的数据处理方法、芯片及电子设备
CN116167424A (zh) * 2023-04-23 2023-05-26 深圳市九天睿芯科技有限公司 基于cim的神经网络加速器、方法、存算处理***与设备
CN116152520B (zh) * 2023-04-23 2023-07-07 深圳市九天睿芯科技有限公司 用于神经网络加速器的数据处理方法、芯片及电子设备
CN116167424B (zh) * 2023-04-23 2023-07-14 深圳市九天睿芯科技有限公司 基于cim的神经网络加速器、方法、存算处理***与设备

Also Published As

Publication number Publication date
EP4213070A4 (en) 2023-10-25
CN116113941A (zh) 2023-05-12
CN116113941A8 (zh) 2024-05-24
US20230236891A1 (en) 2023-07-27
EP4213070A1 (en) 2023-07-19

Similar Documents

Publication Publication Date Title
WO2022067508A1 (zh) 一种神经网络加速器、加速方法以及装置
CN105512723B (zh) 一种用于稀疏连接的人工神经网络计算装置和方法
WO2020221200A1 (zh) 神经网络的构建方法、图像处理方法及装置
CN109903221B (zh) 图像超分方法及装置
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
WO2021190127A1 (zh) 一种数据处理方法和数据处理设备
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
WO2021018163A1 (zh) 神经网络的搜索方法及装置
WO2023010244A1 (zh) 神经网络加速器及神经网络加速器的数据处理方法
WO2022111617A1 (zh) 一种模型训练方法及装置
WO2023231794A1 (zh) 一种神经网络参数量化方法和装置
CN112789627B (zh) 一种神经网络处理器、数据处理方法及相关设备
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
WO2022179588A1 (zh) 一种数据编码方法以及相关设备
CN113065997B (zh) 一种图像处理方法、神经网络的训练方法以及相关设备
WO2022111002A1 (zh) 用于训练神经网络的方法、设备和计算机可读存储介质
CN113627163A (zh) 一种注意力模型、特征提取方法及相关装置
WO2022227024A1 (zh) 神经网络模型的运算方法、训练方法及装置
WO2020042770A9 (zh) 图像识别处理方法和装置
WO2023109748A1 (zh) 一种神经网络的调整方法及相应装置
WO2023122896A1 (zh) 一种数据处理方法和装置
WO2021120036A1 (zh) 数据处理装置和数据处理方法
CN115146757A (zh) 一种神经网络模型的训练方法及装置
WO2023231559A1 (zh) 一种神经网络加速器、加速方法以及装置
WO2024078376A1 (zh) 一种模型剪枝方法及相关装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20955534

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020955534

Country of ref document: EP

Effective date: 20230412

NENP Non-entry into the national phase

Ref country code: DE