WO2023231559A1 - Neural network accelerator, and acceleration method and apparatus - Google Patents

Neural network accelerator, and acceleration method and apparatus

Info

Publication number
WO2023231559A1
WO2023231559A1 · PCT/CN2023/085884 · CN2023085884W
Authority
WO
WIPO (PCT)
Prior art keywords
operation circuit
circuit
pooling
convolution
input
Prior art date
Application number
PCT/CN2023/085884
Other languages
French (fr)
Chinese (zh)
Inventor
肖延南 (Xiao Yannan)
刘根树 (Liu Genshu)
张怡浩 (Zhang Yihao)
左文明 (Zuo Wenming)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023231559A1 publication Critical patent/WO2023231559A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of neural networks, and in particular, to a neural network accelerator, acceleration method and device.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new class of intelligent machines that can respond in a manner similar to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • Neural network belongs to artificial intelligence and is a mathematical model that uses a structure similar to synaptic connections in the brain for information processing. Neural networks involve a large amount of calculations, mainly including convolution operations, activation operations, pooling operations and quantization operations, etc., which take up most of the time of neural network processing.
  • This application provides a neural network accelerator, an acceleration method, a device, a computer-readable storage medium, and a computer program product.
  • the solution provided by this application can reduce the power consumption generated by the neural network during operation and improve the processing performance of the neural network.
  • a neural network accelerator which includes a first arithmetic circuit and a second arithmetic circuit.
  • the second operation circuit includes at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit.
  • the first operation circuit is used to perform a convolution operation on the input of the first operation circuit.
  • the first operation circuit includes two inputs and one output.
  • the two inputs may include a convolution kernel and an input image.
  • the convolution kernel may be represented by a weight matrix.
  • the input image may be represented by an input image matrix or by an input image vector.
  • the output end of the first operation circuit is directly connected to the input end of the second operation circuit. That is, after performing the convolution operation, the first operation circuit directly inputs the output to the input interface of the second operation circuit through the output interface.
  • no memory or buffer for caching the output results needs to be provided at the output terminal.
  • the activation operation circuit is used to perform an activation operation on the input of the activation operation circuit, and the input of the activation operation circuit is obtained from the first operation circuit, the quantization operation circuit, or the pooling operation circuit. For example, if the input of the activation operation circuit is a 1*N vector, the output after activation processing is still a 1*N vector.
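As an illustration of the shape-preserving activation step described above, a minimal Python sketch (illustrative only; the patent does not specify an implementation) of a ReLU applied to a 1*N vector:

```python
def relu_vector(vec):
    """Apply ReLU element-wise: a 1*N input yields a 1*N output."""
    return [x if x > 0.0 else 0.0 for x in vec]

# length is preserved: activation changes values, not shape
out = relu_vector([-1.5, 0.0, 2.5])
```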
  • the second operation circuit includes a quantization operation circuit
  • the quantization operation circuit is used to perform a quantization operation on the input of the quantization operation circuit.
  • the input of the quantization operation circuit is obtained from the first operation circuit, the activation operation circuit, or the pooling operation circuit.
  • the input of the quantization operation circuit is a 1*N vector
  • the output is still a 1*N vector after quantization processing.
  • the data format of the output vector includes but is not limited to fp16, s16, s8 and s4.
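As an illustrative sketch of one of these output formats, per-tensor s8 quantization under an assumed uniform round-and-clamp scheme (the patent does not specify the quantization scheme, and the `scale` parameter here is hypothetical):

```python
def quantize_s8(vec, scale):
    """Uniform quantization to the signed 8-bit range [-128, 127].

    `scale` is an assumed per-tensor scale factor; this sketch only
    illustrates the s8 case among the formats listed above.
    """
    return [max(-128, min(127, round(x / scale))) for x in vec]

q = quantize_s8([0.5, -1.0, 100.0], 0.5)
```

Values outside the representable range are clamped rather than wrapped, which is the usual choice for inference-time quantization.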
  • the pooling operation circuit is used to perform a pooling operation on the input of the pooling operation circuit, and the input of the pooling operation circuit is obtained from the first operation circuit, the activation operation circuit, or the quantization operation circuit. That is, the first operation circuit inputs the output result to the second operation circuit after performing the convolution operation.
  • the current operation circuit is any one of the second operation circuits, and the previous operation circuit is the first operation circuit or the second operation circuit.
  • the previous arithmetic circuit inputs the operation result to the current arithmetic circuit after performing the corresponding operation.
  • the multiple operations performed by the neural network are configured as path-dependent operations: each time one operation is completed, its result does not need to be stored in memory but is used directly by the next operation. Because a result never has to be written to memory after an operation and then read back when it is needed, the power consumption of those memory writes and reads is saved, which helps to improve the performance of the neural network.
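The path-dependent configuration described above can be sketched as a chain in which each stage consumes the previous stage's output directly, with no memory writes in between; the stage functions here are hypothetical stand-ins for the circuits:

```python
def run_path(x, stages):
    """Feed each stage's output directly into the next stage.

    Nothing is written back to a memory between stages, mirroring the
    path-dependent configuration; `stages` are toy stand-ins.
    """
    for stage in stages:
        x = stage(x)
    return x

conv = lambda v: [2 * e for e in v]      # stand-in for convolution
act = lambda v: [max(0, e) for e in v]   # stand-in for activation
result = run_path([-1, 3], [conv, act])
```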
  • when the input end of the activation operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs a convolution operation and then inputs the output data to the input end of the activation operation circuit; when the input end of the activation operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit performs a quantization operation and then inputs the output data to the input end of the activation operation circuit;
  • or, when the input end of the activation operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit performs the pooling operation and then inputs the output data to the input end of the activation operation circuit;
  • when the input end of the quantization operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs a convolution operation and then inputs the output data to the input end of the quantization operation circuit; when the input end of the quantization operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit performs the activation operation and then inputs the output data to the input end of the quantization operation circuit; or, when the input end of the quantization operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit performs the pooling operation and then inputs the output data to the input end of the quantization operation circuit;
  • when the input end of the pooling operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs a convolution operation and then inputs the output data to the input end of the pooling operation circuit; when the input end of the pooling operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit performs the activation operation and then inputs the output data to the input end of the pooling operation circuit; or, when the input end of the pooling operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit performs a quantization operation and then inputs the output data to the input end of the pooling operation circuit.
  • each circuit in the second operation circuit has multiple connection methods, which can be adapted to most neural networks that need quantization, activation, or pooling, providing path-dependent acceleration for most neural networks and improving their operating efficiency.
  • the first operation circuit is specifically configured to use a convolution kernel to traverse the feature map, performing convolution operations between the elements in the convolution kernel and the elements of the feature map in the traversed area to obtain multiple convolution results.
  • the pooling operation circuit is specifically configured to obtain the multiple convolution results in the order in which it performs pooling operations on them.
  • the order in which convolution kernels perform convolution operations in convolution operations is determined based on the order in which pooling windows perform pooling operations in pooling operations. In other words, in order to ensure that the pooling operation circuit can normally perform the path-dependent operation, data needs to be input to the pooling operation circuit in the order in which the pooling window performs the pooling operation.
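The ordering requirement above can be illustrated with a toy streaming max pool: if convolution results arrive in pooling-window order, each window can be reduced as soon as it is complete, without buffering a whole feature map (the flat `window_size` grouping is an illustrative simplification of 2-D windows):

```python
def stream_pool_max(conv_results, window_size):
    """Consume convolution results in pooling order.

    A max is emitted as soon as one pooling window's worth of results
    has arrived, so no full feature map ever needs to be buffered.
    """
    out, buf = [], []
    for value in conv_results:
        buf.append(value)
        if len(buf) == window_size:
            out.append(max(buf))
            buf.clear()
    return out

pooled = stream_pool_max([1, 3, 2, 4, 5, 0, 7, 6], 4)
```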
  • the first operation circuit is further configured to perform addition operations, subtraction operations, and multiplication operations on elements at corresponding positions of the two tensors input to the first operation circuit.
  • convolution operations and element-wise operations are performed by the same circuit.
  • the first operation circuit can also be used to perform element-wise operations. This is because the essence of the convolution operation is element multiplication and accumulation (element-wise multiply, element-wise add), and the essence of an element-wise operation is to add, subtract, multiply, divide, take the maximum of, or take the minimum of the elements. In essence, therefore, the two operations overlap and can be performed by one piece of hardware; that is, the hardware resources of the two can be reused, reducing the area overhead of the hardware and the complexity of the design.
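A toy sketch of the overlap described above, in which the multiply-accumulate of convolution is built from the same element-wise multiply used by element-wise operations (illustrative only, not the circuit's actual design):

```python
def eltwise(a, b, op):
    """Element-wise add/sub/mul/max/min over two equal-length vectors."""
    ops = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y,
           "mul": lambda x, y: x * y, "max": max, "min": min}
    return [ops[op](x, y) for x, y in zip(a, b)]

def mac(weights, patch):
    """Multiply-accumulate: the element-wise "mul" followed by a sum."""
    return sum(eltwise(weights, patch, "mul"))
```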
  • the input of the activation operation circuit is specifically obtained from the first operation circuit
  • the input of the quantization operation circuit is specifically obtained from the activation operation circuit
  • the input of the pooling operation circuit is specifically obtained from the quantization operation circuit.
  • a specific accelerator structure is given, which can support the operation processes of most different structures in neural networks applied in the wearable field.
  • the size of the pooling window in the pooling operation is w ⁇ h
  • the pooling step size is stride, where w, h, and stride have the same value, and w is a positive integer greater than 1.
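With w, h, and stride equal, the pooling windows tile the feature map without overlap. A minimal sketch under that assumption:

```python
def pool2d_max(fmap, w):
    """Max pooling where window width, height, and stride all equal w.

    Because stride == w, the windows tile the feature map without
    overlap; fmap dimensions are assumed to be multiples of w.
    """
    return [[max(fmap[i + di][j + dj]
                 for di in range(w) for dj in range(w))
             for j in range(0, len(fmap[0]), w)]
            for i in range(0, len(fmap), w)]

fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
```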
  • the accelerator is applied to a convolutional neural network CNN.
  • the accelerator is applied to a recurrent neural network RNN.
  • the accelerator is deployed on a wearable device.
  • the applicant found that by deploying the same hardware device in a wearable device and configuring the various operations performed by the neural network running on the hardware device as path-dependent operations, the hardware device can support most neural networks of different structures applied in the wearable field. Through this solution, one set of hardware devices can not only support multiple neural networks of different structures performing multiple operations, but also improve the performance of the neural network.
  • the activation operation is implemented through a sigmoid function, a tanh function, a PReLU function, a Leaky ReLU function, or a ReLU function.
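Hedged sketches of a few of the listed activation functions (the Leaky ReLU slope used here is an illustrative default, not specified by the source):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # `slope` is an assumed value; the source does not fix one
    return x if x > 0.0 else slope * x
```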
  • the pooling operation includes a maximum pooling operation or an average pooling operation.
  • the convolution operation includes a depthwise separable convolution operation, a convolution implemented as matrix-vector multiplication (GEMV), or a convolution implemented as matrix-matrix multiplication (GEMM).
  • embodiments of the present application provide an acceleration method, including: performing a convolution operation on the input of the first operation circuit.
  • the output interface of the first operation circuit is directly connected to the input interface of the second operation circuit.
  • the first operation circuit directly inputs the output to the input of the second operation circuit through the output interface of the first operation circuit.
  • the second operation circuit includes at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit.
  • the second operation circuit includes an activation operation circuit, the activation operation is performed on the input of the activation operation circuit, and the input of the activation operation circuit is obtained from the first operation circuit, the quantization operation circuit, or the pooling operation circuit.
  • the quantization operation is performed on the input of the quantization operation circuit, and the input of the quantization operation circuit is obtained from the first operation circuit, the activation operation circuit, or the pooling operation circuit.
  • the second operation circuit includes a pooling operation circuit
  • the pooling operation is performed on the input of the pooling operation circuit, and the input of the pooling operation circuit is obtained from the first operation circuit, the activation operation circuit, or the quantization operation circuit. That is, the first operation circuit inputs the output result to the second operation circuit after performing the convolution operation.
  • the current operation circuit is any one of the second operation circuits, and the previous operation circuit is the first operation circuit or the second operation circuit.
  • the previous arithmetic circuit inputs the operation result to the current arithmetic circuit after performing the corresponding operation.
  • performing a convolution operation on the input of the first operation circuit includes: using a convolution kernel to traverse the feature map, convolving the elements in the convolution kernel with the elements of the feature map in the traversed area to obtain multiple convolution results.
  • performing a pooling operation on the input of the pooling operation circuit includes: obtaining the multiple convolution results in the order in which the pooling operation circuit performs pooling operations on them.
  • the method further includes: performing addition, subtraction, multiplication, division, maximum-value, or minimum-value operations on elements at corresponding positions of the two tensors input to the first operation circuit.
  • the input of the activation operation circuit is specifically obtained from the first operation circuit
  • the input of the quantization operation circuit is specifically obtained from the activation operation circuit
  • the input of the pooling operation circuit is specifically obtained from the quantization operation circuit.
  • the size of the pooling window in the pooling operation is w ⁇ h
  • the pooling step size is stride, where w, h, and stride have the same value, and w is a positive integer greater than 1.
  • the activation operation is implemented through any one of the sigmoid function, tanh function, PReLU function, Leaky ReLU function, or ReLU function.
  • the pooling operation includes a maximum pooling operation or an average pooling operation.
  • the convolution operation includes depthwise separable convolution, convolution implemented as general matrix-to-matrix multiplication (GEMM), or convolution implemented as general matrix-to-vector multiplication (GEMV).
  • the present application provides an acceleration device, including: a processor and a memory, where the processor and the memory are interconnected through lines, and the processor calls the program code in the memory to perform the processing-related functions in the acceleration method of the second aspect or any of its possible implementations.
  • the acceleration device may be a chip.
  • embodiments of the present application provide a computer-readable storage medium that includes instructions that, when run on a computer cluster, cause the computer cluster to execute the method described in the second aspect or any possible implementation of the second aspect.
  • embodiments of the present application provide a computer program product containing instructions that, when run on a computer cluster, cause the computer cluster to execute the method described in the second aspect or any possible implementation of the second aspect.
  • embodiments of the present application provide a chip system, including: a processor, where the processor is configured to call from the memory and run the computer program stored in the memory, to execute the method provided in the second aspect or any of its possible implementations.
  • embodiments of the present application provide a wearable device, on which the neural network accelerator described in the first aspect or any possible implementation of the first aspect is deployed.
  • the wearable device may include at least one of glasses, a television, a vehicle-mounted device, a watch, or a bracelet.
  • Figure 1 is a schematic structural diagram of the convolutional neural network CNN
  • FIG. 2 is a schematic structural diagram of the recurrent neural network RNN
  • Figure 3 is a schematic structural diagram of a neural network accelerator provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of another neural network accelerator provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of another neural network accelerator provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of the operation flow of the first operation circuit provided by the embodiment of the present application.
  • Figure 7a is a schematic diagram of the relu activation function
  • Figure 7b is a schematic diagram of the sigmoid activation function
  • Figure 7c is a schematic diagram of the tanh activation function
  • Figure 8a is a schematic diagram of an operation flow of the pooling operation circuit provided by the embodiment of the present application.
  • Figure 8b is a schematic diagram of another operation flow of the pooling operation circuit provided by the embodiment of the present application.
  • Figure 9 is a schematic diagram of inputting data to the pooling operation circuit provided by the embodiment of the present application.
  • Figure 10a is a schematic diagram of an operation flow of the convolution operation circuit provided by an embodiment of the present application.
  • Figure 10b is a schematic diagram of another operation flow of the convolution operation circuit provided by the embodiment of the present application.
  • Figure 10c is a schematic diagram of another operation flow of the convolution operation circuit provided by the embodiment of the present application.
  • Figure 10d is a schematic diagram of another operation flow of the convolution operation circuit provided by the embodiment of the present application.
  • Figure 11 is a schematic diagram of another operation flow of the convolution operation circuit provided by the embodiment of the present application.
  • Figure 12 is a schematic diagram of the operation flow of the neural network accelerator provided by the embodiment of the present application.
  • Figure 13 is a flow chart of configuring a neural network accelerator through instructions, provided by an embodiment of the present application
  • Figure 14 is a schematic structural diagram of another neural network accelerator provided by an embodiment of the present application.
  • Figure 15 is a flow chart of configuring a neural network accelerator through instructions, provided by an embodiment of the present application
  • Figure 16 is a schematic flowchart of an acceleration method provided by an embodiment of the present application.
  • Figure 17 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the embodiments of this application provide a neural network accelerator, acceleration method and device.
  • the neural network can be regarded as a machine learning model composed of neural units.
  • the neural unit can refer to an operation unit that takes x_s and an intercept of 1 as inputs.
  • the output of the operation unit can be:
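The formula itself appears to have been lost in extraction. A standard single-neuron form consistent with the surrounding description (inputs x_s, an intercept of 1, and an activation function) is, as a reconstruction rather than the original text:

    h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)

where W_s are the weights, b is the bias corresponding to the intercept input of 1, and f is the activation function.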
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • CNN convolutional neural network
  • RNN recurrent neural network
  • DNN deep neural network
  • DCNN deep convolutional neural network
  • Different types of neural networks often have different structures. Even the same type of neural network has many different structures. As an example, the neural network is introduced below with several typical neural network structures.
  • the input data of the CNN network can involve images, text, voice, and IoT data, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • the following description takes the input data as an image as an example.
  • the CNN network passes the acquired image to the convolution layer, pooling layer and subsequent neural network layer (not shown in the figure) for processing, and the image processing result can be obtained.
  • the convolution layer (such as the convolution layer 101, the convolution layer 102 and the convolution layer 103 shown in Figure 1) can include many convolution operators.
  • the convolution operator is also called a kernel or a convolution kernel.
  • the role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is typically slid over the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) along the horizontal direction, to complete the extraction of a specific feature from the image.
  • the size of the weight matrix should be related to the size of the image.
  • the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the weight matrix extends through the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension; in most cases, however, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image.
  • the dimension here can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features in the image.
  • one weight matrix is used to extract edge information of the image
  • another weight matrix is used to extract specific colors of the image
  • another weight matrix is used to remove unnecessary noise in the image. Perform blurring, etc.
  • the multiple weight matrices have the same size (rows × columns), so the convolution feature maps they extract also have the same size; the multiple extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • the essence of the convolution operation is to perform a multiplication and accumulation operation on the elements in the matrix. Specifically, the elements in the weight matrix and the elements in the input image matrix are multiplied and accumulated.
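The multiply-accumulate essence described above can be sketched as a plain "valid" 2-D convolution (stride 1, no padding; illustrative only, not the circuit's actual dataflow):

```python
def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution as sliding multiply-accumulate."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    # multiply an element pair, accumulate into acc
                    acc += kernel[di][dj] * image[i + di][j + dj]
            row.append(acc)
        out.append(row)
    return out

result = conv2d_valid([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                      [[1, 0], [0, 1]])
```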
  • the output of the previous convolutional layer needs to be activated before serving as the input of the next convolutional layer. If no activation function were introduced, the input of each convolutional layer would be a linear function of the output of the layer above it; no matter how many convolutional layers the neural network had, it would remain equivalent to a single linear transformation.
  • a purely linear network's approximation ability is quite limited, so the expressive power of the neural network is increased by introducing a nonlinear function as the activation function; the neural network can then approximate arbitrary nonlinear functions and thus be applied to many nonlinear models.
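A one-dimensional toy example of why a nonlinearity is needed: composing two linear layers always collapses to a single linear layer (the weights and biases here are arbitrary illustrative values):

```python
def linear(w, b):
    """A 1-D linear layer x -> w*x + b (toy stand-in for a conv layer)."""
    return lambda x: w * x + b

f = linear(2.0, 1.0)
g = linear(3.0, -1.0)
# composing the two collapses to one linear layer: w = 3*2, b = 3*1 - 1
collapsed = linear(6.0, 2.0)
```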
  • element-wise operations can be introduced after the convolutional layer.
  • an element-wise operation essentially acts on two tensors, operating on the corresponding elements of those tensors.
  • the corresponding elements can be added (add), subtracted (sub), multiplied (mul), divided (div), or combined by taking the maximum (max), the minimum (min), the absolute value (abs), and so on.
  • Two elements are said to correspond if they occupy the same position within a tensor. The position is determined by the index used to locate each element.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator (which can be understood as performing an average pooling operation or a maximum pooling operation) to sample the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
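A minimal sketch of the two per-region operators just described, where each output pixel summarizes one sub-region of the input (illustrative only):

```python
def pool_region(pixels, mode):
    """Summarize one sub-region: its average or its maximum pixel value."""
    if mode == "avg":
        return sum(pixels) / len(pixels)
    if mode == "max":
        return max(pixels)
    raise ValueError(mode)
```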
  • Figure 1 shows only one possible structure of the CNN network.
  • the CNN network may also include other structures. For example, it may not include a pooling layer, or a pooling layer may be connected after each convolutional layer. layer and so on.
  • the CNN network may include more layers. For example, after processing by the convolutional and pooling layers, the convolutional neural network is not yet able to output the required information, because as mentioned above, the convolutional and pooling layers only extract features and reduce the number of parameters brought by the input image.
  • the convolutional neural network needs to use neural network layers to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer can include multiple hidden layers, whose parameters can be pre-trained on relevant training data for a specific task type.
  • the task type can include image recognition, image classification, image super-resolution reconstruction, etc.
  • after the multiple hidden layers in the neural network layer, that is, as the last layer of the entire convolutional neural network, an output layer can also be included; it has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error.
  • FIG 2 is a schematic structural diagram of a recurrent neural network RNN (hereinafter referred to as RNN network).
  • the RNN network may include multiple structures shown in Figure 2.
  • the structure shown in Figure 2 is called the RNN module.
  • the RNN module shown in Figure 2 includes three inputs and two outputs. The three inputs are c_{t-1}, x_t, and the output of the convolution layer (not shown in Figure 2).
  • the two outputs are c_t and the final output h_t of this RNN module, where c_t serves as the c_{t-1} input of the next RNN module, and x_t is a constant.
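The chaining of c_t into the next module's c_{t-1} can be sketched as follows; the `cell` function is a hypothetical stand-in, since the source does not specify the internal cell computation:

```python
def run_rnn(inputs, c0, cell):
    """Chain RNN modules: c_t from one step is c_{t-1} of the next.

    `cell` maps (c_prev, x) -> (c, h); each step's h_t is collected.
    """
    c, hs = c0, []
    for x in inputs:
        c, h = cell(c, x)
        hs.append(h)
    return hs

accumulate = lambda c, x: (c + x, c + x)  # toy cell for illustration
outputs = run_rnn([1, 2, 3], 0, accumulate)
```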
  • the final output of all RNN modules is used to obtain the output results of the RNN network.
  • a neural network includes a convolution module and a pooling module.
  • the output of the convolution module is used as the input of the pooling module.
  • the convolution module is used to perform convolution operations and can be understood with reference to the convolution layer described above.
  • the pooling module is used to perform pooling operations and can be understood with reference to the pooling layer described above. If the convolution module is configured as non-path-dependent, each time the convolution module completes a convolution operation, the result is stored in memory; the pooling module then has to retrieve the convolution result from memory before pooling it. If the convolution module is configured as path-dependent, the output of the convolution module is used directly as the input of the pooling module, and there is no need to write the output of the convolution module to memory.
  • the solution provided by the embodiments of the present application configures various operations performed by the neural network running on the hardware device as path-dependent operations.
  • the embodiments of the present application can also reuse hardware resources, which makes it possible to take performance into account while reducing area and meeting strict power consumption constraints, effectively improving the computing speed of neural networks.
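The contrast between the two configurations can be sketched as follows; the operations are stand-ins (not the actual circuits), chosen only to show that the fused, path-dependent arrangement skips the memory round-trip.

```python
def convolve(x):          # stand-in for the convolution module
    return [v * 2 for v in x]

def pool(x):              # stand-in for the pooling module (max over pairs)
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]

# Non-path-dependent: the convolution result is staged in memory first.
memory = {}
memory["conv_out"] = convolve([1, 3, 2, 4])
staged = pool(memory["conv_out"])

# Path-dependent (as in this application): the convolution output feeds
# the pooling module directly, with no intermediate store.
fused = pool(convolve([1, 3, 2, 4]))

assert staged == fused    # same result, but the fused path skips memory
```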
  • a neural network accelerator provided by an embodiment of the present application may include a first arithmetic circuit and a second arithmetic circuit.
  • the output end of the first operation circuit is directly connected to the input end of the second operation circuit.
  • the output of the first operation circuit is the input of the second operation circuit.
  • the first operation circuit is used to perform the convolution operation, and after performing the convolution operation, its output is directly fed to the input end of the second operation circuit through the interface connected to the second operation circuit.
  • the second operation circuit is configured to execute, along the path, at least one of the activation operation, the quantization operation and the pooling operation.
  • the second operation circuit may include at least one of the following circuits: a quantization operation circuit, an activation operation circuit, and a pooling operation circuit.
  • the quantization operation circuit is used to perform quantization operations
  • the activation operation circuit is used to perform activation operations
  • the pooling operation circuit is used to perform pooling operations.
  • the quantization operation circuit, the activation operation circuit, and the pooling operation circuit are all configured to perform path-following operations; that is, the interconnected operation circuits are directly connected to each other.
  • after performing its corresponding operation, the previous operation circuit directly inputs the output result to the next operation circuit; there is no need to set up additional modules or storage media for storing data.
  • the aforementioned direct connection means that the output interface of the previous operation circuit is directly electrically connected to the input interface of the next operation circuit. After the previous operation circuit performs the corresponding operation, the output is directly input to the input interface of the next operation circuit through the output interface.
  • data can be transmitted directly between computing circuits without being stored in a memory or buffer, and the on-path configuration of each computing circuit can be realized.
  • the first operation circuit inputs the output result to the second operation circuit after performing the convolution operation.
  • the current operation circuit is any one of the second operation circuits
  • the previous operation circuit is the first operation circuit or the second operation circuit.
  • the input end of the current arithmetic circuit is directly connected to the output end of the previous arithmetic circuit.
  • the previous arithmetic circuit directly inputs the operation result to the current arithmetic circuit without going through additional modules or storage media used to store data, achieving more efficient data transmission.
  • the arithmetic circuits in the neural network accelerator can be built from a variety of components.
  • addition circuits, bitwise AND circuits, OR circuits and the like can be realized through combinations of transistors, resistors, diodes and other electronic components, and these circuits can in turn be combined to implement the aforementioned operations such as convolution, pooling, quantization, or activation.
  • the specific arrangement of electronic components can be connected according to the actual required circuits, which will not be described in detail in this application.
  • each circuit included in the second operation circuit can be configured in a variety of ways.
  • the output of the first operation circuit is specifically the input of the quantization operation circuit
  • the output of the quantization operation circuit is the input of the activation operation circuit
  • the output of the activation operation circuit is the input of the pooling operation circuit.
  • the second operation circuit is configured to first perform the quantization operation along the path, then perform the activation operation along the path, and finally perform the pooling operation along the path.
  • the output of the first operation circuit is specifically the input of the activation operation circuit
  • the output of the activation operation circuit is the input of the pooling operation circuit
  • the output of the pooling operation circuit is the input of the quantization operation circuit.
  • the second operation circuit is configured to first perform the activation operation along the path, then perform the pooling operation along the path, and finally perform the quantization operation along the path.
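The two wiring orders described above can be sketched as configurable pipelines. The stage operations are simplified stand-ins, not the actual circuit arithmetic; the point is only that the processing order is fixed by how the output/input ports are connected.

```python
def quantize(v):  return [round(x) for x in v]          # quantization circuit
def activate(v):  return [max(0, x) for x in v]         # activation circuit (relu-like)
def pool(v):      return [max(v)]                       # pooling circuit (global max)

# Two of the orders described above:
order_a = [quantize, activate, pool]   # quantize -> activate -> pool
order_b = [activate, pool, quantize]   # activate -> pool -> quantize

def run(pipeline, conv_out):
    data = conv_out
    for stage in pipeline:
        data = stage(data)   # each stage feeds the next directly
    return data

conv_out = [1.4, -2.6, 3.7]   # example output of the first operation circuit
```

For example, `run(order_a, conv_out)` and `run(order_b, conv_out)` both process the same convolution output, but through differently ordered circuits.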
  • the convolution operation circuit (for the convenience of describing the solution, the first operation circuit is called the convolution operation circuit in Figures 3 to 12) includes two inputs and one output.
  • the two inputs may include a convolution kernel and an input image, where the convolution kernel is represented by a weight matrix
  • the input image can be represented by an input image matrix
  • the input image can also be represented by an input image vector.
  • the following introduction takes the case where the input image is represented as an input image vector as an example. Refer to Figure 6.
  • the input of the convolution operation circuit includes a vector and a matrix.
  • the convolution operation circuit mainly completes the multiplication and accumulation operations.
  • the output is still a vector.
  • K and N shown in Figure 6 are both positive integers greater than 0.
  • the embodiments of the present application do not limit the types of convolution operations.
  • the types of convolution operations include but are not limited to depthwise separable convolution, matrix-matrix multiplication convolution (general matrix to matrix multiplication, GEMM) and matrix-vector multiplication convolution (general matrix to vector multiplication, GEMV).
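The multiply-accumulate behavior of the convolution operation circuit in Figure 6 (a 1×K vector times a K×N matrix giving a 1×N vector) can be sketched as a plain GEMV; the dimensions and values below are illustrative.

```python
def conv_gemv(x, w):
    """Multiply-accumulate of a 1xK input vector with a KxN weight matrix,
    producing a 1xN output vector (GEMV-style convolution as in Figure 6)."""
    K, N = len(w), len(w[0])
    assert len(x) == K
    out = [0] * N
    for n in range(N):
        for k in range(K):
            out[n] += x[k] * w[k][n]   # multiply-accumulate
    return out

x = [1, 2, 3]                      # 1x3 input vector
w = [[1, 0], [0, 1], [1, 1]]       # 3x2 weight matrix (convolution kernel)
y = conv_gemv(x, w)                # output is still a vector (1x2)
```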
  • the quantization operation circuit is used to perform quantization operations and can be configured to perform any kind of quantization operations.
  • the embodiments of this application do not limit the method of quantization. Assume that the input of the quantization operation circuit is a 1*N vector; after quantization processing the output is still a 1*N vector.
  • the data format of the output vector includes but is not limited to fp16, s16, s8 and s4.
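As a sketch of one possible quantization, the example below maps a 1×N float vector to the s8 format with a simple symmetric scale-and-clamp scheme; the application does not fix a particular method, so the scheme and the `scale` parameter are assumptions.

```python
def quantize_s8(v, scale):
    """Quantize a 1xN float vector to signed 8-bit integers with a given
    scale (a simple symmetric scheme chosen for illustration)."""
    out = []
    for x in v:
        q = int(round(x / scale))
        q = max(-128, min(127, q))   # clamp to the s8 range [-128, 127]
        out.append(q)
    return out

v = [0.5, -1.25, 3.0]
q = quantize_s8(v, scale=0.25)     # still a 1xN vector, now in s8 format
```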
  • the activation operation circuit performs an activation operation through an activation function and can be configured to use any activation function.
  • the embodiment of the present application does not limit the type of activation function.
  • Exemplary activation functions include but are not limited to relu, sigmoid, and tanh. Referring to Figure 7a to Figure 7c, schematic diagrams of the relu activation function, the sigmoid activation function and the tanh activation function are given, in which the horizontal axis represents the input of the activation operation circuit and the vertical axis represents the output of the activation operation circuit.
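The three exemplary activation functions can be written directly; for each, the input plays the role of the horizontal axis in Figures 7a to 7c and the returned value the vertical axis.

```python
import math

def relu(x):    return max(0.0, x)                    # Figure 7a
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))     # Figure 7b
def tanh(x):    return math.tanh(x)                   # Figure 7c

# Negative inputs are zeroed by relu; sigmoid is centered at 0.5;
# tanh passes through the origin.
assert relu(-2.0) == 0.0 and relu(3.0) == 3.0
assert abs(sigmoid(0.0) - 0.5) < 1e-9
assert abs(tanh(0.0)) < 1e-9
```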
  • the pooling operation circuit is used to perform pooling operations and can be configured to perform any kind of pooling operations.
  • the embodiments of the present application do not limit the pooling method. Examples include but are not limited to maximum pooling (max pooling) and average pooling (average pooling). Assume that the size of the pooling window in the pooling operation is w×h, the pooling step size is stride, and w, h, and stride are all positive integers greater than 1. In a preferred implementation, in order to save calculation, the values of w, h, and stride can be made the same. To better understand the advantages of this preferred implementation, an example is given below in conjunction with Figure 8a and Figure 8b.
  • in Figure 8a, the values of w, h, and stride of the pooling window are not all the same.
  • Figure 8a uses 3×3×2 as an example, where the values of w and h are 3 and the value of stride is 2. It can be seen that there are duplicate elements in the objects of the two pooling operations. Specifically, the elements processed by the first pooling operation include 1, 1, 4, 4, 3, 5, 2, 2, 8, 7; the elements processed by the second pooling operation include 4, 5, 3, 5, 7, 8, 7, 8, 9; the two pooling operations share the duplicate elements 4, 5, and 7.
  • in Figure 8b, the values of w, h, and stride are made the same.
  • Figure 8b takes 3×3×3 as an example.
  • the values of w, h, and stride are all 3. It can be seen that there are no duplicate elements in the objects of the two pooling operations. Specifically, the elements processed by the first pooling operation include 1, 4, 4, 3, 5, 2, 8, 7, and the elements processed by the second pooling operation include 5, 3, 3, 7, 8, 1, 8, 9, 2.
  • the pooling operation on the input of the pooling operation circuit can be completed faster, saving calculation amount and improving performance.
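The advantage of w = h = stride can be seen in a short max-pooling sketch: the windows tile the input without overlap, so no element is read twice. The input matrix below is illustrative.

```python
def max_pool(mat, w, h, stride):
    """Max pooling over a 2-D list; when w == h == stride the windows
    tile the input without overlap, so no element is processed twice."""
    rows, cols = len(mat), len(mat[0])
    out = []
    for i in range(0, rows - h + 1, stride):
        row = []
        for j in range(0, cols - w + 1, stride):
            row.append(max(mat[r][c]
                           for r in range(i, i + h)
                           for c in range(j, j + w)))
        out.append(row)
    return out

mat = [[1, 1, 4, 4],
       [3, 5, 2, 2],
       [8, 7, 4, 5],
       [3, 5, 7, 8]]
pooled = max_pool(mat, w=2, h=2, stride=2)   # 2x2x2: disjoint windows
```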
  • the order in which the convolution kernel performs convolution operations in the convolution operation is determined based on the order in which the pooling window in the pooling operation performs pooling operations.
  • data needs to be input to the pooling operation circuit in the order in which the pooling window performs the pooling operation. Refer to Figure 9, which illustrates this using the maximum pooling method as an example. As shown in Figure 9, assume that a 4×4×N tensor is obtained after the convolution operation, the size of the pooling window is 2×2, and the step size is 2.
  • the elements processed by the pooling window for the first time are 1,1,5,6; the elements processed for the second time are 2,4,7,8; the elements processed for the third time are 3,2,4,9;
  • the elements processed for the fourth time are 8,3,0,5.
  • the elements/data are input to the pooling operation circuit.
  • Figure 9 can be referred to for understanding. It should be noted that the embodiments of the present application do not limit the input order of the elements within each pooling process.
  • for example, the elements processed for the first time by the pooling window are 1,1,5,6; elements/data can be input to the pooling operation circuit in the order 1,1,5,6, or in the order 1,5,1,6, or in the order 6,5,1,1, and so on.
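The window-ordered delivery of Figure 9 can be sketched as follows: given the 4×4 convolution result (one channel shown, values taken from the figure), the elements are grouped one pooling window at a time, matching the order in which the pooling circuit consumes them.

```python
def window_order(mat, win, stride):
    """Return the elements of each pooling window, grouped in the order
    the pooling circuit processes the windows (the order of elements
    inside a window is not fixed by the embodiments)."""
    batches = []
    for i in range(0, len(mat) - win + 1, stride):
        for j in range(0, len(mat[0]) - win + 1, stride):
            batches.append([mat[r][c]
                            for r in range(i, i + win)
                            for c in range(j, j + win)])
    return batches

# The 4x4 result of Figure 9 (one channel shown).
conv_result = [[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 8, 3],
               [4, 9, 0, 5]]
batches = window_order(conv_result, win=2, stride=2)
# Results arrive one window at a time, as in the figure:
assert batches[0] == [1, 1, 5, 6]
assert batches[1] == [2, 4, 7, 8]
```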
  • the input of the convolution operation circuit needs to be padded. There are already mature technical means on how to perform padding, which will not be explained in the embodiments of this application.
  • the shaded area in Figure 10a can be referred to for understanding.
  • the size of the pooling window is 2 ⁇ 2 and the stride is 2.
  • the input of the convolution operation circuit after padding is referred to below as the feature matrix.
  • if convolution proceeded in the usual order, the convolution result obtained by the third convolution would be element 2 in the first row and third column in Figure 9, and the first pooling process of the pooling window could not be guaranteed to receive its data first.
  • the principle of reading 1xK elements each time is to read elements at other positions of the feature matrix within the current convolution kernel, until all elements of the feature matrix within the current convolution kernel have been read.
  • the convolution result obtained after the fourth convolution should be element 6 in the second row and second column.
  • the embodiment of the present application uses the pooling window as a unit to determine the order in which data is input to the pooling operation circuit.
  • the embodiment of the present application does not limit the order in which the pooling window performs the pooling operation. This is illustrated with an example below.
  • the order in which the pooling window performs pooling operations may differ from the order shown in Figure 9. Specifically, the elements processed by the pooling window for the first time may be 1,1,5,6; the elements processed for the second time 3,2,4,9; the elements processed for the third time 2,4,7,8; and the elements processed for the fourth time 8,3,0,5.
  • the second operation circuit is configured to perform at least one of the activation operation, quantization operation and pooling operation along the path. Specifically, this includes but is not limited to the following cases:
  • Case 1: the second operation circuit is configured to perform the activation operation, quantization operation and pooling operation along the path.
  • Case 2: the second operation circuit is configured to perform the activation operation and quantization operation along the path.
  • Case 3: the second operation circuit is configured to perform the quantization operation along the path.
  • Case 4: the second operation circuit is configured to perform the activation operation along the path.
  • Case 5: the second operation circuit is configured to perform the quantization operation and pooling operation along the path.
  • the second operation circuit includes a quantization operation circuit, an activation operation circuit and a pooling operation circuit. Whether to activate the quantization operation circuit, the activation operation circuit and the pooling operation circuit can be configured through instructions to suit different situations.
  • a possible instruction configuration method is given below:
  • the instruction can configure 4 different parameters:
  • the parameter [Xd] is 32 bits and indicates the cache destination address where the final calculation result of the accelerator is stored.
  • the parameter [Xn] is 32 bits and indicates the starting cache address of one of the inputs of the first operation circuit.
  • the parameter [Xm] is 32 bits and indicates the starting cache address of the other input of the first operation circuit.
  • the parameter [Xt] is 32 bits (Xt[31:0]) and carries configuration information including but not limited to the following:
  • the type of convolution operation, for example, used to indicate that the type of convolution operation is one of depthwise, GEMV and GEMM.
  • instruction configuration method is only an example. In fact, other instruction configuration methods can also be used to ensure the normal operation of the first operation circuit and the second operation circuit, such as:
  • FM_cfg configuration information and Weight_cfg configuration information can also be introduced, where the FM_cfg configuration information indicates the size and number of channels of the input image matrix, and the Weight_cfg configuration information indicates the size, step size and other information of the convolution kernel.
  • Deq_cfg configuration information can also be introduced to indicate information such as quantization targets.
  • Pooling_cfg configuration information can also be introduced to indicate the size of the pooling window, the pooling step size and other information.
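One way to picture the four 32-bit instruction parameters is the sketch below. The field names (Xd, Xn, Xm, Xt) and widths follow the text, but the bit layout of Xt and the encoding of the convolution type are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorInstr:
    """Sketch of the four 32-bit instruction parameters described above.
    The bit layout of xt is an assumed encoding, not the actual one."""
    xd: int   # cache destination address of the final result
    xn: int   # cache start address of one input of the first operation circuit
    xm: int   # cache start address of the other input
    xt: int   # configuration word (convolution type, circuit enables, ...)

    CONV_TYPE = {0: "depthwise", 1: "GEMV", 2: "GEMM"}   # assumed encoding

    def conv_type(self):
        return self.CONV_TYPE[self.xt & 0x3]   # assumed: low bits select type

instr = AcceleratorInstr(xd=0x1000, xn=0x2000, xm=0x3000, xt=0x2)
assert instr.conv_type() == "GEMM"
```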
  • the configuration information related to the convolution operation circuit can then be loaded, such as the FM_cfg configuration information and Weight_cfg configuration information introduced above.
  • if the quantization operation circuit needs to be started, the configuration instruction starts the quantization operation circuit and loads the configuration information related to it, such as the Deq_cfg configuration information introduced above; if it does not need to be started, the configuration instruction does not start the quantization operation circuit.
  • it is then determined whether the activation operation circuit needs to be started. If so, the configuration instruction starts the activation operation circuit and loads the configuration information related to it; if not, the configuration instruction does not start the activation operation circuit.
  • next, it is determined whether the pooling operation circuit needs to be started. If so, the configuration instruction starts the pooling operation circuit and loads the configuration information related to it, such as the Pooling_cfg configuration information introduced above; if not, the configuration instruction does not start the pooling operation circuit.
  • the flow of designing configuration instructions shown in Figure 13 is only an exemplary illustration. In fact, the order in which the circuits are configured can vary. For example, it can first be determined whether the activation operation circuit needs to be started, then whether the quantization operation circuit needs to be started, and then whether the pooling operation circuit needs to be started, etc.
  • more or fewer configuration instructions can be designed according to actual scene requirements. For example, as mentioned above, some neural networks also need to perform element-wise operations, and a configuration instruction can likewise be used to determine whether to start the element-wise operation circuit, that is, whether to perform element-wise operations, etc.
  • the convolution operation or the element wise operation can be performed by the same circuit.
  • the first operation circuit in addition to performing convolution operations, can also be used to perform element wise operations.
  • the essence of the convolution operation is element multiplication and accumulation (element multiplication, element addition), and the essence of the element-wise operation is to add (add), subtract (sub), multiply (mul), divide (div), or take the maximum or minimum of corresponding elements.
  • when the element-wise operation is performed by the first operation circuit (herein referred to as the element-wise operation circuit),
  • the input of the element-wise operation circuit includes one vector and another vector, and the output is still a vector.
  • FIG. 14 provides a schematic structural diagram of the accelerator when the convolution operation circuit and the element-wise operation circuit multiplex hardware resources.
  • the second operation circuit can be understood with reference to the above description of the second operation circuit, which is not repeated here.
  • the first operation circuit can be configured through instructions to perform a convolution operation or an element-wise operation. For example, the parameter Xt[31:0] introduced above can be used to indicate whether to perform a convolution operation or an element-wise operation.
  • the parameter Xt[31:0] introduced above can also be used to indicate the type of element-wise operation, such as one of addition (add), subtraction (sub), multiplication (mul), division (div), maximum value (max) and minimum value (min).
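The six configurable element-wise operation types can be sketched as a dispatch over two equal-length vectors; the operation names match the list above, while the dispatch mechanism itself is only illustrative.

```python
def elementwise(op, a, b):
    """Element-wise operation on two equal-length vectors; op is one of
    add, sub, mul, div, max, min, matching the configurable types above."""
    ops = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mul": lambda x, y: x * y,
        "div": lambda x, y: x / y,
        "max": max,
        "min": min,
    }
    f = ops[op]
    return [f(x, y) for x, y in zip(a, b)]   # output is still a vector

a, b = [1, 4, 9], [2, 2, 3]
assert elementwise("add", a, b) == [3, 6, 12]
assert elementwise("max", a, b) == [2, 4, 9]
```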
  • if the convolution operation needs to be performed, the configuration instruction starts the first operation circuit to perform the convolution operation and loads the configuration information related to the convolution operation, such as the FM_cfg configuration information and Weight_cfg configuration information introduced above. If the convolution operation does not need to be performed, the configuration instruction starts the first operation circuit to perform the element-wise operation and loads the configuration information related to the element-wise operation. It can then be determined whether the quantization operation circuit needs to be started.
  • if the quantization operation circuit needs to be started, the configuration instruction starts it and loads the related configuration information, such as the Deq_cfg configuration information introduced above; otherwise it is not started. It is then determined whether the activation operation circuit needs to be started; if so, the configuration instruction starts it and loads the related configuration information, otherwise it is not started. Next, it is determined whether the pooling operation circuit needs to be started; if so, the configuration instruction starts it and loads the related configuration information, such as the Pooling_cfg configuration information introduced above; otherwise it is not started.
  • a memory may also be introduced.
  • the first operation circuit is used to perform a convolution operation, and the obtained convolution result is written into the memory.
  • the convolution result can again be used as the input of the first operation circuit; for example, the first operation circuit can retrieve the convolution result from memory and perform element-wise operations using it.
  • the activation operation circuit, the pooling operation circuit, and the quantization operation circuit can also write the output results into the memory.
  • a schematic flowchart of an acceleration method includes: performing a convolution operation 1601 on the input of the first operation circuit.
  • the output interface of the first operation circuit is directly connected to the input interface of the second operation circuit.
  • the first operation circuit directly feeds its output to the input interface of the second operation circuit through its own output interface.
  • the second operation circuit includes at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit.
  • an activation operation 1602 is performed on the input of the activation operation circuit, and the input of the activation operation circuit is obtained from the first operation circuit, the quantization operation circuit, or the pooling operation circuit.
  • the quantization operation 1603 is performed on the input of the quantization operation circuit, and the input of the quantization operation circuit is obtained from the first operation circuit or the activation operation circuit or the pooling operation circuit.
  • a pooling operation 1604 is performed on the input of the pooling operation circuit, and the input of the pooling operation circuit is obtained from the first operation circuit or the activation operation circuit or the quantization operation circuit.
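Steps 1601 to 1604 can be sketched as a pipeline in which the convolution is followed by whichever of the activation, quantization and pooling circuits are enabled, each stage feeding the next directly. The stage operations are stand-ins, and the configurable stage list is an illustrative device rather than the actual instruction mechanism.

```python
def accelerate(x, second_stage):
    """Sketch of steps 1601-1604: a convolution followed by the enabled
    circuits of the second operation circuit, in their wired order."""
    stages = {
        "activation":   lambda v: [max(0, e) for e in v],   # step 1602
        "quantization": lambda v: [round(e) for e in v],    # step 1603
        "pooling":      lambda v: [max(v)],                 # step 1604
    }
    data = [e * 2.1 for e in x]          # step 1601: convolution stand-in
    for name in second_stage:            # configured subset, each stage
        data = stages[name](data)        # feeds the next directly
    return data

result = accelerate([1, -1, 2], ["activation", "quantization", "pooling"])
```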
  • the first operation circuit inputs the output result to the second operation circuit after performing the convolution operation.
  • after the previous operation circuit performs its corresponding operation, the operation result is input to the current operation circuit.
  • the current operation circuit is any one of the circuits in the second operation circuit, and the previous operation circuit is the first operation circuit or another circuit in the second operation circuit.
  • when the input end of the activation operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs the convolution operation and then inputs the output data to the input end of the activation operation circuit; or, when the input end of the activation operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit performs the quantization operation and then inputs the output data to the input end of the activation operation circuit; or, when the input end of the activation operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit performs the pooling operation and then inputs the output data to the input end of the activation operation circuit;
  • when the input end of the quantization operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs the convolution operation and then inputs the output data to the input end of the quantization operation circuit; or, when the input end of the quantization operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit performs the activation operation and then inputs the output data to the input end of the quantization operation circuit; or, when the input end of the quantization operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit performs the pooling operation and then inputs the output data to the input end of the quantization operation circuit;
  • when the input end of the pooling operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs the convolution operation and then inputs the output data to the input end of the pooling operation circuit; or, when the input end of the pooling operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit performs the activation operation and then inputs the output data to the input end of the pooling operation circuit; or, when the input end of the pooling operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit performs the quantization operation and then inputs the output data to the input end of the pooling operation circuit.
  • performing a convolution operation on the input of the first operation circuit includes: using a convolution kernel to traverse the feature map, and performing convolution operations on the elements in the convolution kernel and the elements of the feature map within the traversal area to obtain multiple convolution results.
  • performing a pooling operation on the input of the pooling operation circuit includes: obtaining the multiple convolution results in the order in which the pooling operation circuit performs pooling operations on them.
  • the method further includes: performing an addition operation, subtraction operation, multiplication operation, division operation, maximum value operation or minimum value operation on elements at corresponding positions of the two tensors input to the first operation circuit.
  • the input of the activation operation circuit is specifically obtained from the first operation circuit
  • the input of the quantization operation circuit is specifically obtained from the activation operation circuit
  • the input of the pooling operation circuit is specifically obtained from the quantization operation circuit.
  • the size of the pooling window in the pooling operation is w×h
  • the pooling step size is stride, where w, h, and stride have the same value, and w is a positive integer greater than 1.
  • the activation operation is implemented through any one of the sigmoid function, tanh function, prelu function, leaky relu function and relu function.
  • the pooling operation includes a maximum pooling operation or an average pooling operation.
  • the convolution operation includes depthwise separable convolution, matrix-matrix multiplication convolution GEMM, or matrix-vector multiplication convolution GEMV.
  • This application also provides an electronic device in which a processing unit and a communication interface can be provided.
  • the processing unit obtains program instructions through the communication interface.
  • the program instructions are executed by the processing unit.
  • the processing unit is used to execute the acceleration method described in the above embodiment.
  • the electronic device may specifically include various terminals or wearable devices.
  • the wearable device may include a bracelet, a smart watch, smart glasses, a head mounted display (HMD), an augmented reality (AR) device, a mixed reality (MR) device, etc.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium includes instructions that instruct the execution of the acceleration method described in the above embodiments.
  • An embodiment of the present application also provides a computer program product.
  • when the computer program product is executed by a computer, the computer executes the acceleration method described in the previous embodiments.
  • the computer program product can be a software installation package. If any of the foregoing methods needs to be used, the computer program product can be downloaded and executed on the computer.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device or data center through wired (such as coaxial cable, optical fiber, digital subscriber line) or wireless (such as infrared, radio, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that a computer can store data on, or a data storage device such as a training device or a data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), Optical media, or semiconductor media (such as solid state disk (SSD)), etc.
  • An embodiment of the present application also provides a digital processing chip.
  • the digital processing chip integrates circuits and one or more interfaces for implementing the functions of the foregoing processor.
  • the digital processing chip can complete the method steps of any one or more of the foregoing embodiments.
  • If the digital processing chip does not integrate memory, it can be connected to an external memory through a communication interface.
  • the digital processing chip implements the actions performed by the neural network accelerator in the above embodiment according to the program code stored in the external memory.
  • Embodiments of the present application also provide a chip in which the neural network accelerator provided by the present application can be deployed.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor.
  • the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute computer-executable instructions stored in the storage unit, so that the chip in the server performs the actions performed by the neural network accelerator described in the embodiments shown above.
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
  • the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or the like.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • Figure 17 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 170.
  • the NPU 170 serves as a coprocessor mounted on a host CPU, and the host CPU allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 1703.
  • the arithmetic circuit 1703 is controlled by the controller 1704 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 1703 internally includes multiple processing engines (PEs).
  • In some implementations, the arithmetic circuit 1703 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • In some implementations, the arithmetic circuit 1703 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1702 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches matrix A data from the input memory 1701, performs a matrix operation with matrix B, and stores the partial or final result of the matrix in the accumulator 1708.
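The accumulate-until-complete behavior of the PE array described above can be illustrated with a small NumPy emulation. This is only a sketch of the dataflow (matrix B cached, partial rank-1 products summed into the accumulator), not the actual hardware; the function name is hypothetical.

```python
import numpy as np

def pe_array_matmul(a, b):
    """Emulate the arithmetic circuit: B is cached on the PE array,
    columns of A are streamed in, and partial products accumulate in
    the accumulator until the final matrix result is complete."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    acc = np.zeros((m, n))
    for step in range(k):
        # each step adds one rank-1 partial product into the accumulator
        acc += np.outer(a[:, step], b[step, :])
    return acc

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
result = pe_array_matmul(A, B)
```

The accumulator holds partial results after every step, which is why the text distinguishes "partial result or final result" being stored in accumulator 1708.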
  • the unified memory 1706 is used to store input data and output data.
  • weight data is transferred directly to the weight memory 1702 through the direct memory access controller (DMAC) 1705.
  • Input data is also transferred to unified memory 1706 via DMAC.
  • the bus interface unit (BIU) 1710 is used for interaction between the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1709.
  • the bus interface unit 1710 is used for the instruction fetch buffer 1709 to obtain instructions from the external memory, and for the storage unit access controller 1705 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 1706, weight data to the weight memory 1702, or input data to the input memory 1701.
  • the vector calculation unit 1707 includes multiple arithmetic processing units and, when necessary, further processes the output of the arithmetic circuit, for example, with vector multiplication, vector addition, exponential operations, logarithmic operations, or size comparison. It is mainly used for non-convolution/non-fully-connected layer computation in a neural network, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • vector calculation unit 1707 can store the processed output vectors to unified memory 1706 .
  • the vector calculation unit 1707 can apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 1703, for example, performing linear interpolation on the feature planes extracted by a convolution layer, or accumulating vectors of values, to generate activation values.
  • vector calculation unit 1707 generates normalized values, pixel-wise summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1703, such as for use in a subsequent layer in a neural network.
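As a rough sketch of the kind of post-processing the vector calculation unit performs, the following applies batch normalization and then an activation to the accumulator output, producing a vector that could be fed back as the activation input of a next layer. Function and parameter names are illustrative assumptions, not the NPU's API.

```python
import numpy as np

def vector_unit_postprocess(acc_out, gamma, beta, eps=1e-5):
    """Normalize the accumulator output over the batch axis, apply a
    scale/shift, then a ReLU activation (illustrative sketch only)."""
    mean = acc_out.mean(axis=0)
    var = acc_out.var(axis=0)
    normed = (acc_out - mean) / np.sqrt(var + eps)
    # activation on the processed vector; result can feed the next layer
    return np.maximum(gamma * normed + beta, 0.0)

acc_out = np.array([[1.0, -2.0], [3.0, 2.0]])  # pretend arithmetic-circuit output
activated = vector_unit_postprocess(acc_out, gamma=1.0, beta=0.0)
```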
  • the instruction fetch buffer 1709 connected to the controller 1704 is used to store instructions used by the controller 1704;
  • the unified memory 1706, the input memory 1701, the weight memory 1702, and the instruction fetch buffer 1709 are all on-chip memories; the external memory is private to the NPU hardware architecture.
  • each layer in the recurrent neural network can be performed by the operation circuit 1703 or the vector calculation unit 1707.
  • the processor mentioned in any of the above places may be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for program execution that control the actions performed by the neural network accelerator.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between circuits indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • the present application can be implemented by software plus the necessary general-purpose hardware; it can, of course, also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function can take many forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For this application, however, a software implementation is the better choice in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, and includes several instructions to enable a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (SSD)), etc.


Abstract

Disclosed in embodiments of the present application are a neural network accelerator, and an acceleration method and apparatus. The neural network accelerator comprises a first operation circuit and a second operation circuit. The second operation circuit comprises at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit. The first operation circuit is used to execute a convolution operation on the input of the first operation circuit. The input of the current operation circuit in the second operation circuit is passed to it directly after the previous operation circuit connected to it executes its corresponding operation. According to the solution provided in the embodiments of the present application, the activation operation circuit, the quantization operation circuit, or the pooling operation circuit is configured to complete its respective operation on the fly along the datapath, thereby improving the processing performance of a neural network.

Description

A neural network accelerator, acceleration method and device
This application claims priority to the Chinese patent application submitted to the China Patent Office on May 31, 2022, with application number "202210612341.9" and title "A neural network accelerator, acceleration method and device", the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the field of neural networks, and in particular to a neural network accelerator, an acceleration method, and an apparatus.
Background
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or digital-computer-controlled machines to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and so on.
A neural network belongs to the field of artificial intelligence and is a mathematical model that processes information using a structure similar to the synaptic connections of the brain. Neural networks involve a large amount of computation, mainly convolution operations, activation operations, pooling operations, and quantization operations, which take up most of the processing time of a neural network.
How to achieve high neural network performance within a limited hardware area is a problem worth studying.
Summary
This application provides a neural network accelerator, as well as an acceleration method, an apparatus, a computer-readable storage medium, and a computer program product. The solution provided by this application can reduce the power consumption generated by a neural network during operation and improve the processing performance of the neural network.
To achieve the above objectives, the embodiments of this application provide the following technical solutions:
In a first aspect, embodiments of the present application provide a neural network accelerator that includes a first operation circuit and a second operation circuit, where the second operation circuit includes at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit. In other words, the second operation circuit includes at least one of an activation operation circuit, a quantization operation circuit, or a pooling operation circuit. The first operation circuit is used to perform a convolution operation on the input of the first operation circuit. The first operation circuit has two inputs and one output; the two inputs may include a convolution kernel and an input image, where the convolution kernel is represented by a weight matrix, and the input image may be represented by an input image matrix or by an input image vector. The output end of the first operation circuit is directly connected to the input end of the second operation circuit; that is, after performing the convolution operation, the first operation circuit feeds its output through its output interface directly into the input interface of the second operation circuit, and no memory or buffer for caching the output result needs to be placed at the output end of the first operation circuit.
When the second operation circuit includes an activation operation circuit, the activation operation circuit is used to perform an activation operation on its input, which is obtained from the first operation circuit, the quantization operation circuit, or the pooling operation circuit. Assuming the input of the activation operation circuit is a 1×N vector, the output after activation is still a 1×N vector. When the second operation circuit includes a quantization operation circuit, the quantization operation circuit is used to perform a quantization operation on its input, which is obtained from the first operation circuit, the activation operation circuit, or the pooling operation circuit. Assuming the input of the quantization operation circuit is a 1×N vector, the output after quantization is still a 1×N vector, and the data format of the output vector includes but is not limited to fp16, s16, s8, and s4. When the second operation circuit includes a pooling operation circuit, the pooling operation circuit is used to perform a pooling operation on its input, which is obtained from the first operation circuit, the activation operation circuit, or the quantization operation circuit. That is, the first operation circuit feeds its output result to the second operation circuit after performing the convolution operation; the current operation circuit is any circuit in the second operation circuit, the previous operation circuit is the first operation circuit or an operation circuit in the second operation circuit, and when the input end of the current operation circuit is connected to the output end of the previous operation circuit, the previous operation circuit feeds its result to the current operation circuit after performing its corresponding operation.
As can be seen from the solution of the first aspect, the various operations performed by the neural network are configured as on-the-fly (along-the-datapath) operations: each time one operation completes, its result does not need to be stored in memory, but is used directly in the next operation. Because a result no longer has to be written to memory after each operation and read back when it is needed, the power consumed by these memory writes and reads is saved, which helps improve the performance of the neural network.
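The on-the-fly idea above can be illustrated with a minimal 1-D sketch in which each convolution result flows straight through activation, quantization, and pooling without ever being written to an intermediate buffer. The function name, ReLU activation, int8 quantization scale, and pooling width are illustrative assumptions, not the claimed circuit design.

```python
import numpy as np

def fused_conv_act_quant_pool(x, w, scale, pool=2):
    """Each convolution output is activated, quantized, and pooled
    on the fly; only the small current pooling window is held."""
    n_out = len(x) - len(w) + 1
    out, window = [], []
    for i in range(n_out):
        c = float(np.dot(x[i:i + len(w)], w))          # convolution result
        a = max(c, 0.0)                                 # on-the-fly activation (ReLU)
        q = int(np.clip(round(a * scale), -128, 127))   # on-the-fly int8 quantization
        window.append(q)
        if len(window) == pool:                         # on-the-fly max pooling
            out.append(max(window))
            window.clear()
    return out

pooled = fused_conv_act_quant_pool(
    np.array([1., -1., 2., 3., -2., 1.]), np.array([1., 1.]), scale=10)
```

No intermediate array of convolution, activation, or quantization results is ever materialized, which is the memory-traffic saving the paragraph describes.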
In a possible implementation of the first aspect, when the input end of the activation operation circuit is connected to the output end of the first operation circuit, the first operation circuit feeds its output data to the input end of the activation operation circuit after performing the convolution operation; or, when the input end of the activation operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit feeds its output data to the input end of the activation operation circuit after performing the quantization operation; or, when the input end of the activation operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit feeds its output data to the input end of the activation operation circuit after performing the pooling operation.
When the input end of the quantization operation circuit is connected to the output end of the first operation circuit, the first operation circuit feeds its output data to the input end of the quantization operation circuit after performing the convolution operation; or, when the input end of the quantization operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit feeds its output data to the input end of the quantization operation circuit after performing the activation operation; or, when the input end of the quantization operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit feeds its output data to the input end of the quantization operation circuit after performing the pooling operation.
When the input end of the pooling operation circuit is connected to the output end of the first operation circuit, the first operation circuit feeds its output data to the input end of the pooling operation circuit after performing the convolution operation; or, when the input end of the pooling operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit feeds its output data to the input end of the pooling operation circuit after performing the activation operation; or, when the input end of the pooling operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit feeds its output data to the input end of the pooling operation circuit after performing the quantization operation.
Therefore, in the embodiments of the present application, the circuits in the second operation circuit can be connected in multiple ways, which adapts to most neural networks that require quantization, activation, or pooling, provides on-the-fly acceleration for most neural networks, and improves the operating efficiency of neural networks.
In a possible implementation of the first aspect, the first operation circuit is specifically configured to use a convolution kernel to traverse a feature map, so as to perform convolution between the elements of the convolution kernel and the elements of the feature map within the traversed region, obtaining multiple convolution results. The pooling operation circuit is specifically configured to obtain the multiple convolution results in the order in which it performs pooling on them. To ensure that the pooling operation circuit can operate on the fly without introducing a memory, the order in which the convolution kernel performs the convolution operation is determined according to the order in which the pooling window performs the pooling operation. In other words, to ensure that the pooling operation circuit can operate on the fly, data must be fed into the pooling operation circuit in the order in which the pooling window performs the pooling operation.
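The traversal-order idea can be sketched as follows: the convolution results are produced pooling-window by pooling-window, so each window can be reduced the moment its last element arrives, with no reorder buffer. The function name and non-overlapping 2×2 max pooling are illustrative assumptions.

```python
import numpy as np

def conv_then_pool_streaming(fm, k, pool=2):
    """Emit convolution results in exactly the order the pooling
    windows consume them; each window is pooled immediately."""
    kh, kw = k.shape
    H = fm.shape[0] - kh + 1
    W = fm.shape[1] - kw + 1
    pooled = []
    # visit conv-output positions one pooling window at a time
    for py in range(0, H - H % pool, pool):
        for px in range(0, W - W % pool, pool):
            window = []
            for dy in range(pool):
                for dx in range(pool):
                    y, x = py + dy, px + dx
                    window.append(float((fm[y:y + kh, x:x + kw] * k).sum()))
            pooled.append(max(window))  # consumed on the fly, nothing buffered
    return pooled

pooled = conv_then_pool_streaming(np.arange(16.).reshape(4, 4), np.array([[1.]]))
```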
In a possible implementation of the first aspect, the first operation circuit is further used to perform addition, subtraction, multiplication, division, maximum, or minimum operations on the elements at corresponding positions of two tensors input to it; that is, convolution and element-wise operations are performed by the same circuit. In other words, besides convolution, the first operation circuit can also perform element-wise operations. This is because the essence of convolution is element multiply-accumulate (element multiplication, element addition), while the essence of an element-wise operation is to add, subtract, multiply, divide, take the maximum, or take the minimum over elements. The two operations therefore overlap in essence and can be implemented with one piece of hardware; that is, their hardware resources can be reused, reducing hardware area overhead and design complexity.
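For illustration, the element-wise mode of the first operation circuit amounts to combining two tensors position by position with one of six operators. The dictionary-dispatch form below is only a software sketch of that shared-datapath idea; the names are hypothetical.

```python
import numpy as np

# the six element-wise operations the text lists
ELEMENT_WISE_OPS = {
    "add": np.add, "sub": np.subtract, "mul": np.multiply,
    "div": np.divide, "max": np.maximum, "min": np.minimum,
}

def first_circuit_element_wise(a, b, op):
    """Combine corresponding elements of two tensors; the same
    multiply/add datapath used for convolution can serve here."""
    return ELEMENT_WISE_OPS[op](a, b)

a = np.array([1.0, 4.0, -3.0])
b = np.array([2.0, 2.0, 5.0])
```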
In a possible implementation of the first aspect, the input of the activation operation circuit is specifically obtained from the first operation circuit, the input of the quantization operation circuit is specifically obtained from the activation operation circuit, and the input of the pooling operation circuit is specifically obtained from the quantization operation circuit. This implementation gives a specific accelerator structure that can support the operation flows of most neural networks of different structures applied in the wearable field.
In a possible implementation of the first aspect, the pooling window in the pooling operation has size w×h and the pooling step size is stride, where w, h, and stride have the same value and w is a positive integer greater than 1. With this design, the elements processed by any two pooling operations do not overlap, which saves computation and further improves the performance of the neural network.
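When w = h = stride, every element falls into exactly one pooling window, so the whole operation is a block reduction. A compact NumPy sketch (max pooling assumed for illustration; the function name is hypothetical):

```python
import numpy as np

def non_overlapping_max_pool(x, w):
    """w == h == stride: each element belongs to exactly one window,
    so no value is ever reused across windows."""
    H, W = x.shape
    Ht, Wt = H - H % w, W - W % w
    x = x[:Ht, :Wt]  # drop any ragged border
    return x.reshape(Ht // w, w, Wt // w, w).max(axis=(1, 3))

pooled = non_overlapping_max_pool(np.arange(16.).reshape(4, 4), 2)
```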
In a possible implementation of the first aspect, the accelerator is applied to a convolutional neural network (CNN).
In a possible implementation of the first aspect, the accelerator is applied to a recurrent neural network (RNN).
In a possible implementation of the first aspect, the accelerator is deployed on a wearable device. The applicant found that when the same hardware device is deployed in wearable devices and the various operations performed by the neural networks running on it are configured as on-the-fly operations, the hardware device can support most neural networks of different structures applied in the wearable field. This solution both ensures that one set of hardware can support multiple neural networks of different structures performing multiple operations, and improves the performance of the neural networks.
In a possible implementation of the first aspect, the activation operation is implemented through a sigmoid, tanh, PReLU, leaky ReLU, or ReLU function.
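For reference, the listed activation functions can be written as below (a sketch only; in PReLU the slope `a` is a learned parameter, whereas leaky ReLU uses a fixed small slope):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def prelu(x, a):
    # same shape as leaky ReLU, but `a` is learned during training
    return np.where(x > 0, x, a * x)

# tanh is available directly as np.tanh
```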
In a possible implementation of the first aspect, the pooling operation includes a maximum pooling operation or an average pooling operation.
In a possible implementation of the first aspect, the convolution operation includes a depthwise separable convolution (depthwise) operation, a matrix-vector multiplication convolution (GEMV) operation, or a matrix-matrix multiplication convolution (GEMM) operation.
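The three convolution forms can be sketched as follows. GEMV and GEMM are the lowered matrix forms of convolution; the depthwise routine convolves each channel with its own kernel and never mixes channels ('valid' padding and stride 1 are assumptions of this sketch).

```python
import numpy as np

def conv_gemv(w_mat, v):
    # convolution lowered to a matrix-vector product (GEMV)
    return w_mat @ v

def conv_gemm(a, b):
    # convolution lowered to a matrix-matrix product (GEMM)
    return a @ b

def depthwise_conv(x, k):
    """Depthwise convolution: channel c of x is convolved only with
    kernel k[c]; channels are never combined."""
    c, h, w = x.shape
    kh, kw = k.shape[1], k.shape[2]
    out = np.zeros((c, h - kh + 1, w - kw + 1))
    for ch in range(c):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[ch, i, j] = (x[ch, i:i + kh, j:j + kw] * k[ch]).sum()
    return out

dw = depthwise_conv(np.ones((2, 3, 3)), np.ones((2, 2, 2)))
```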
In a second aspect, embodiments of the present application provide an acceleration method, including: performing a convolution operation on the input of a first operation circuit. The output interface of the first operation circuit is directly connected to the input interface of a second operation circuit; after performing the convolution operation, the first operation circuit feeds its output through its output interface directly into the input interface of the second operation circuit. The second operation circuit includes at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit. When the second operation circuit includes an activation operation circuit, an activation operation is performed on the input of the activation operation circuit, which is obtained from the first operation circuit, the quantization operation circuit, or the pooling operation circuit. When the second operation circuit includes a quantization operation circuit, a quantization operation is performed on the input of the quantization operation circuit, which is obtained from the first operation circuit, the activation operation circuit, or the pooling operation circuit. When the second operation circuit includes a pooling operation circuit, a pooling operation is performed on the input of the pooling operation circuit, which is obtained from the first operation circuit, the activation operation circuit, or the quantization operation circuit. That is, the first operation circuit feeds its output result to the second operation circuit after performing the convolution operation; the current operation circuit is any circuit in the second operation circuit, the previous operation circuit is the first operation circuit or an operation circuit in the second operation circuit, and when the input end of the current operation circuit is connected to the output end of the previous operation circuit, the previous operation circuit feeds its result to the current operation circuit after performing its corresponding operation.
In a possible implementation of the second aspect, performing the convolution operation on the input of the first operation circuit includes: traversing a feature map with a convolution kernel, and performing the convolution operation on the elements of the convolution kernel and the elements of the feature map within the traversed region, so as to obtain a plurality of convolution results. Performing the pooling operation on the input of the pooling operation circuit includes: obtaining, by the pooling operation circuit, the plurality of convolution results in the order in which the pooling operation circuit performs the pooling operation on them.
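As an illustration of this ordering, the sketch below (a toy Python model with assumed sizes and names, not the patent's hardware) produces convolution results in the order that 2×2 pooling windows consume them, so the pooling stage can emit each pooled value as soon as its window is complete, without buffering the full convolution output:

```python
import numpy as np

def conv_results_in_pooling_order(feature, kernel, pool=2):
    """Yield convolution results grouped by 2x2 pooling window.
    Hypothetical sketch: the layout and names are illustrative only."""
    kh, kw = kernel.shape
    H = feature.shape[0] - kh + 1          # conv output height
    W = feature.shape[1] - kw + 1          # conv output width
    for py in range(0, H, pool):           # walk pooling windows
        for px in range(0, W, pool):
            for dy in range(pool):         # conv results inside one window
                for dx in range(pool):
                    y, x = py + dy, px + dx
                    yield np.sum(feature[y:y+kh, x:x+kw] * kernel)

def streaming_max_pool(stream, pool=2):
    """Consume conv results window by window; no full feature map is stored."""
    out, group = [], []
    for v in stream:
        group.append(v)
        if len(group) == pool * pool:      # one pooling window complete
            out.append(max(group))
            group = []
    return out

feature = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((2, 2))
pooled = streaming_max_pool(conv_results_in_pooling_order(feature, kernel))
```

Because the results of one pooling window arrive consecutively, the pooling stage only ever holds `pool*pool` values at a time.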
In a possible implementation of the second aspect, the method further includes: performing an addition, subtraction, multiplication, division, maximum, or minimum operation on elements at corresponding positions of two tensors input to the first operation circuit.
In a possible implementation of the second aspect, the input of the activation operation circuit is obtained from the first operation circuit, the input of the quantization operation circuit is obtained from the activation operation circuit, and the input of the pooling operation circuit is obtained from the quantization operation circuit.
In a possible implementation of the second aspect, the size of the pooling window in the pooling operation is w×h and the pooling stride is stride, where w, h, and stride take the same value, and w is a positive integer greater than 1.
In a possible implementation of the second aspect, the activation operation is implemented by any one of the sigmoid, tanh, PReLU, leaky ReLU, or ReLU functions.
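The listed activation functions can be sketched as follows (hypothetical reference implementations; the `alpha` slope used for PReLU/leaky ReLU is an illustrative choice, not specified by the patent):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):        # fixed small slope for x < 0
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):                  # like leaky ReLU, but alpha is learned
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 3.0])
```

All five are element-wise nonlinearities, which is why an activation circuit can be chained directly after the convolution or quantization circuit.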
In a possible implementation of the second aspect, the pooling operation includes a maximum pooling operation or an average pooling operation.
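A minimal sketch of both pooling modes, assuming the non-overlapping case from the preceding implementation (a w×w window with stride equal to w); function and parameter names are illustrative:

```python
import numpy as np

def pool2d(x, w=2, mode="max"):
    """Max or average pooling with window w x w and stride w (non-overlapping)."""
    H, W = x.shape
    out = np.empty((H // w, W // w))
    for i in range(0, H - w + 1, w):
        for j in range(0, W - w + 1, w):
            win = x[i:i+w, j:j+w]                  # one pooling window
            out[i // w, j // w] = win.max() if mode == "max" else win.mean()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 1., 0.],
              [2., 3., 0., 1.]])
```

With w = stride, each input element contributes to exactly one output element, which simplifies a streaming hardware implementation.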
In a possible implementation of the second aspect, the convolution operation includes depthwise separable convolution (depthwise separable convolution), general matrix-to-matrix multiplication (GEMM) convolution, and general matrix-to-vector multiplication (GEMV) convolution.
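A sketch of depthwise separable convolution with illustrative shapes (not taken from the patent): each channel is first convolved with its own k×k kernel (depthwise), then a 1×1 pointwise convolution mixes channels — and that pointwise step reduces to exactly the kind of GEMM also listed above:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise then pointwise convolution; shapes are assumptions:
    x: (C, H, W), dw_kernels: (C, k, k), pw_weights: (C_out, C)."""
    C, H, W = x.shape
    k = dw_kernels.shape[-1]
    Ho, Wo = H - k + 1, W - k + 1
    dw = np.empty((C, Ho, Wo))
    for c in range(C):                     # depthwise: no cross-channel sum
        for i in range(Ho):
            for j in range(Wo):
                dw[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * dw_kernels[c])
    # pointwise 1x1 convolution expressed as a GEMM: (C_out, C) @ (C, Ho*Wo)
    out = pw_weights @ dw.reshape(C, -1)
    return out.reshape(pw_weights.shape[0], Ho, Wo)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
dw_k = np.ones((2, 2, 2))                  # one 2x2 kernel per channel
pw_w = np.array([[1.0, 1.0]])              # single output channel
y = depthwise_separable_conv(x, dw_k, pw_w)
```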
In a third aspect, the present application provides an acceleration apparatus, including a processor and a memory, where the processor and the memory are interconnected through a line, and the processor calls program code in the memory to perform the processing-related functions in the acceleration method according to any one of the implementations of the second aspect. Optionally, the acceleration apparatus may be a chip.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium including instructions that, when run on a computer cluster, cause the computer cluster to perform the method described in the second aspect or any possible implementation of the second aspect.
In a fifth aspect, embodiments of the present application provide a computer program product containing instructions that, when run on a computer cluster, cause the computer cluster to perform the method described in the second aspect or any possible implementation of the second aspect.
In a sixth aspect, embodiments of the present application provide a chip system, including a processor configured to call, from a memory, and run a computer program stored in the memory, so as to perform the method provided by the second aspect or its corresponding possible implementations.
In a seventh aspect, embodiments of the present application provide a wearable device, on which the neural network accelerator described in the first aspect or any possible implementation of the first aspect is deployed.
In a possible implementation of the seventh aspect, the wearable device may include at least one of glasses, a television, a vehicle-mounted device, a watch, or a wristband.
It should be noted that the beneficial effects of the second to seventh aspects can be understood with reference to the beneficial effects of the first aspect and its possible implementations, and are therefore not repeated here.
On the basis of the implementations provided in the above aspects, further combinations may be made in this application to provide more implementations.
Description of the Drawings
Figure 1 is a schematic structural diagram of a convolutional neural network (CNN);
Figure 2 is a schematic structural diagram of a recurrent neural network (RNN);
Figure 3 is a schematic structural diagram of a neural network accelerator provided by an embodiment of the present application;
Figure 4 is a schematic structural diagram of another neural network accelerator provided by an embodiment of the present application;
Figure 5 is a schematic structural diagram of another neural network accelerator provided by an embodiment of the present application;
Figure 6 is a schematic diagram of the operation flow of the first operation circuit provided by an embodiment of the present application;
Figure 7a is a schematic diagram of the ReLU activation function;
Figure 7b is a schematic diagram of the sigmoid activation function;
Figure 7c is a schematic diagram of the tanh activation function;
Figure 8a is a schematic diagram of an operation flow of the pooling operation circuit provided by an embodiment of the present application;
Figure 8b is a schematic diagram of another operation flow of the pooling operation circuit provided by an embodiment of the present application;
Figure 9 is a schematic diagram of inputting data to the pooling operation circuit provided by an embodiment of the present application;
Figure 10a is a schematic diagram of an operation flow of the convolution operation circuit provided by an embodiment of the present application;
Figure 10b is a schematic diagram of another operation flow of the convolution operation circuit provided by an embodiment of the present application;
Figure 10c is a schematic diagram of another operation flow of the convolution operation circuit provided by an embodiment of the present application;
Figure 10d is a schematic diagram of another operation flow of the convolution operation circuit provided by an embodiment of the present application;
Figure 11 is a schematic diagram of another operation flow of the convolution operation circuit provided by an embodiment of the present application;
Figure 12 is a schematic diagram of the operation flow of the neural network accelerator provided by an embodiment of the present application;
Figure 13 is a flowchart of configuring instructions for the neural network accelerator provided by an embodiment of the present application;
Figure 14 is a schematic structural diagram of another neural network accelerator provided by an embodiment of the present application;
Figure 15 is another flowchart of configuring instructions for the neural network accelerator provided by an embodiment of the present application;
Figure 16 is a schematic flowchart of an acceleration method provided by an embodiment of the present application;
Figure 17 is a schematic structural diagram of a chip provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings in the embodiments of this application. Obviously, the described embodiments are only some, rather than all, of the embodiments of this application. All other embodiments obtained by a person skilled in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
Embodiments of this application provide a neural network accelerator, an acceleration method, and an apparatus. For a better understanding of the solutions provided by the embodiments of this application, the line of research behind these solutions is first introduced below:
A neural network can be regarded as a machine learning model composed of neural units. A neural unit may be an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be:

h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s · x_s + b)
where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. A neural network is a network formed by joining many such single neural units together; that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of that local receptive field, where a local receptive field may be a region composed of several neural units.
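The formula above can be checked with a minimal numeric sketch (the weights, inputs, and the choice of ReLU as f here are arbitrary illustrations, not values from the text):

```python
import numpy as np

def neuron_output(xs, ws, b, f):
    """Output of a single neural unit: f(sum_s ws[s] * xs[s] + b)."""
    return f(np.dot(ws, xs) + b)

xs = np.array([0.5, -1.0, 2.0])          # inputs x_s
ws = np.array([1.0, 0.5, 0.25])          # weights W_s
relu = lambda v: max(v, 0.0)             # one possible activation f
y = neuron_output(xs, ws, 0.1, relu)     # 0.5 - 0.5 + 0.5 + 0.1 = 0.6
```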
There are many types of neural networks, such as convolutional neural networks (CNN), recurrent neural networks (RNN), deep neural networks (DNN), deep convolutional neural networks (DCNN), and so on. Different types of neural networks often have different structures, and even neural networks of the same type can have many different structures. By way of example, neural networks are introduced below with reference to several typical network structures.
Refer to Figure 1, which is a schematic structural diagram of a convolutional neural network (hereinafter referred to as a CNN network). In Figure 1, the input data of the CNN network may involve images, text, or speech, and may also involve Internet-of-Things data, including business data of existing systems as well as sensed data such as force, displacement, liquid level, temperature, and humidity. The following description takes an image as the input data as an example. The CNN network passes the acquired image to convolutional layers, pooling layers, and subsequent neural network layers (not shown in the figure) for processing, and a processing result of the image can be obtained.
A convolutional layer (such as convolutional layer 101, convolutional layer 102, and convolutional layer 103 shown in Figure 1) may include many convolution operators. A convolution operator is also called a kernel or a convolution kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually processed over the input image along the horizontal direction one pixel after another (or two pixels after two pixels, depending on the value of the stride), thereby extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends through the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension; in most cases, however, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied.
The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where that dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features of the image; for example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the convolutional feature maps extracted by them also have the same size; the multiple extracted feature maps of the same size are then combined to form the output of the convolution operation.
As can be seen from the above introduction, the essence of the convolution operation is to perform multiply-accumulate operations on the elements of matrices; specifically, the elements of the weight matrix are multiplied with the elements of the input image matrix, and the products are accumulated.
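Written out as code, the multiply-accumulate nature of the operation is explicit (a naive illustrative sketch, not an optimized or hardware-accurate implementation):

```python
import numpy as np

def conv2d_mac(image, weight):
    """2-D convolution (cross-correlation form) written as explicit
    multiply-accumulate loops over each output position."""
    H, W = image.shape
    kh, kw = weight.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = 0.0
            for u in range(kh):            # multiply-accumulate over the window
                for v in range(kw):
                    acc += weight[u, v] * image[i + u, j + v]
            out[i, j] = acc
    return out

img = np.array([[1., 2., 3.],
                [4., 5., 6.],
                [7., 8., 9.]])
w = np.array([[1., 0.],
              [0., -1.]])
```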
When there are multiple convolutional layers, there is a functional relationship between one convolutional layer and the next, and this function is called an activation function (or excitation function). In other words, the output of the previous convolutional layer needs to undergo activation processing before serving as the input of the next convolutional layer. The reason is that, without an activation function, the input of the next convolutional layer would be a linear function of the output of the previous one; then no matter how many convolutional layers the neural network has, the output of each convolutional layer would be a linear combination of the inputs of the previous layer, and the approximation capability of the neural network would be quite limited. By introducing a nonlinear function as the activation function, the expressive power of the neural network is increased, so that the neural network can approximate any nonlinear function arbitrarily well, and the neural network can therefore be applied to numerous nonlinear models.
In addition, the difference between the output of a convolutional layer and the original input can also be observed; this difference can serve as auxiliary information during training to evaluate the performance of the CNN network, helping to improve the training effect. Specifically, element-wise operations can be introduced after a convolutional layer. An element-wise operation essentially operates on two tensors, acting on the corresponding elements within them; for example, the corresponding elements can be added (add), subtracted (sub), multiplied (mul), divided (div), or subjected to maximum (max), minimum (min), absolute value (abs), and other operations. Two elements are said to correspond if they occupy the same position within their tensors, where the position is determined by the index used to locate each element.
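These element-wise operations can be sketched directly with two small tensors (all values chosen arbitrarily for illustration):

```python
import numpy as np

# Element-wise operations act on elements occupying the same index
# in two tensors of identical shape.
a = np.array([[1., -4.], [3., 8.]])
b = np.array([[2., 2.], [5., 4.]])

add_r = a + b             # add
sub_r = a - b             # sub
mul_r = a * b             # mul
div_r = a / b             # div
max_r = np.maximum(a, b)  # max, element by element (not a reduction)
min_r = np.minimum(a, b)  # min, element by element
abs_r = np.abs(a)         # abs needs only one tensor
```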
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after the convolutional layers: this can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers; in the example of Figure 1, layer 201 follows the last convolutional layer. In image processing, the purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator (which can be understood as performing an average pooling operation or a maximum pooling operation), for sampling the input image to obtain an image of smaller size. The average pooling operator computes the average of the pixel values within a specific range of the image as the result of average pooling. The maximum pooling operator takes the pixel with the largest value within a specific range as the result of maximum pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the size of the image. The size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel of the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the input image.
It should be noted that Figure 1 shows only one possible structure of a CNN network. A CNN network may also have other structures; for example, it may not include a pooling layer, or a pooling layer may follow every convolutional layer, and so on. Moreover, besides the layers shown in Figure 1, a CNN network may include more layers. For example, after the processing of the convolutional layers/pooling layers, the convolutional neural network is not yet able to output the required output information, because, as described above, the convolutional layers/pooling layers only extract features and reduce the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network needs to use neural network layers to generate one output, or a set of outputs whose number equals the number of required classes. Therefore, the neural network layers may include multiple hidden layers, and the parameters contained in these hidden layers may be obtained by pre-training on training data related to a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the multiple hidden layers of the neural network layers, that is, as the last layer of the entire convolutional neural network, there may also be an output layer. This output layer has a loss function similar to categorical cross-entropy and is specifically used to compute the prediction error. Once the forward propagation of the entire convolutional neural network (propagation in the direction from 101 to 201 in Figure 1) is completed, back propagation (propagation in the direction from 201 to 101 in Figure 1) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network, that is, the error between the result output by the convolutional neural network through the output layer and the ideal result.
Refer to Figure 2, which is a schematic structural diagram of a recurrent neural network (hereinafter referred to as an RNN network). As shown in Figure 2, an RNN network may include multiple instances of the structure shown in Figure 2; hereinafter, this structure is referred to as the RNN module. Although it differs from the structure of the CNN network shown in Figure 1, the RNN network also involves element-wise operations, activation processing, and so on, which can be understood with reference to the relevant description of Figure 1 and are not repeated here. The RNN module shown in Figure 2 has three inputs and two outputs: the three inputs are c_{t-1}, x_t, and the output of a convolutional layer (not shown in Figure 2); the two outputs are c_t and the final output h_t of this RNN module, where c_t serves as the input c_{t-1} of the next RNN module, and x_t is a constant. The final outputs of all the RNN modules are used to obtain the output result of the RNN network.
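A module with inputs c_{t-1} and x_t and outputs c_t and h_t resembles an LSTM-style cell; the sketch below is one plausible reading of such a module, with all weight values, gate layout, and sizes chosen arbitrarily rather than taken from Figure 2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_module(c_prev, x_t, h_prev, W, b):
    """LSTM-style module sketch: consumes c_{t-1} and x_t, produces c_t and h_t.
    W packs four gate weight matrices row-wise; everything is illustrative."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    n = len(c_prev)
    f, i, o, g = z[:n], z[n:2*n], z[2*n:3*n], z[3*n:]
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # new cell state
    h_t = sigmoid(o) * np.tanh(c_t)                      # module output
    return c_t, h_t

n, m = 2, 3                                  # state size and input size (assumed)
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * n, m + n))
c_t, h_t = rnn_module(np.zeros(n), rng.standard_normal(m), np.zeros(n), W, np.zeros(4 * n))
```

Chaining such modules, with each c_t fed forward as the next module's c_{t-1}, mirrors the structure described above.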
It can be seen that a neural network involves a large amount of computation, mainly including convolution operations, activation operations, pooling operations, quantization operations, and so on, which take up most of the processing time of the neural network. In many fields, however, high performance and a high energy-efficiency ratio must be obtained within a limited hardware area. How to efficiently process the various operations performed by a neural network while meeting given power-consumption constraints is the key issue for the hardware device.
The applicant found that, by deploying the same hardware device in wearable devices and configuring the various operations performed by the neural network running on that hardware device as path-following (随路) operations, the hardware device can support most of the differently structured neural networks applied in the wearable field. Here, non-path-following means that every time an operation is completed, its result must be stored in memory, and when the result needs to be used, it must be read back from memory. A path-following operation means that each time an operation is completed, the result does not need to be stored in memory but is used directly in other operations. To better understand the concepts of path-following and non-path-following operations, consider an example: suppose a neural network includes a convolution module and a pooling module, where the output of the convolution module serves as the input of the pooling module. The convolution module performs convolution operations and can be understood with reference to the convolutional layer described above; the pooling module performs pooling operations and can be understood with reference to the pooling layer described above.
If the convolution module is configured as non-path-following, then each time it completes a convolution operation, the result of that operation is stored in memory, and the pooling module must fetch the convolution result from memory and perform the pooling operation on it. If the convolution module is configured as path-following, its output is used directly as the input of the pooling module, and there is no need to write the output of the convolution module to memory. When the various operations performed by the neural network are configured as non-path-following, the result of every operation must be written to memory, resulting in huge power consumption and low performance of the hardware device. In addition, since different operation flows need to be supported, data-dependency control must be added to control the order of operations between the modules, or in other words, to control when each module reads operation results from memory, which makes the design rather complex. Therefore, the solution provided by the embodiments of this application configures the various operations performed by the neural network running on the hardware device as path-following operations. On this basis, the embodiments of this application can also reuse hardware resources, making it possible to take performance into account while meeting strict area and power-consumption constraints, effectively improving the computation rate of the neural network.
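The difference between the two modes can be illustrated with a toy simulation that counts intermediate-result memory traffic (the stage functions are simple stand-ins, not the patent's circuits, and the "memory" is just a Python dictionary):

```python
import numpy as np

memory_writes = 0  # counts elements written to simulated memory

def run_non_fused(x, stages):
    """Non-path-following mode: every stage writes its result to 'memory'
    and the next stage reads it back before computing."""
    global memory_writes
    buf = {}
    for k, stage in enumerate(stages):
        x = stage(x)
        buf[k] = x                      # result stored to memory
        memory_writes += x.size
        x = buf[k]                      # next stage reads it back
    return x

def run_fused(x, stages):
    """Path-following mode: each stage feeds the next directly."""
    for stage in stages:
        x = stage(x)
    return x

stages = [lambda v: v * 2,             # stand-in for quantization
          lambda v: np.maximum(v, 0),  # activation (ReLU)
          lambda v: v[::2]]            # stand-in for pooling (downsample)

x = np.array([-1., 2., -3., 4.])
out_fused = run_fused(x, stages)
out_nonfused = run_non_fused(x, stages)
```

Both modes produce identical results, but the non-path-following run incurs a memory write for every intermediate element, which is exactly the overhead the path-following design removes.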
Based on the above line of research, a neural network accelerator provided by an embodiment of this application is described in detail below:
Refer to Figure 3, which is a schematic structural diagram of a neural network accelerator provided by an embodiment of this application. As shown in Figure 3, the neural network accelerator provided by an embodiment of this application may include a first operation circuit and a second operation circuit, where the output end of the first operation circuit is directly connected to the input end of the second operation circuit, and the output of the first operation circuit is the input of the second operation circuit. The first operation circuit is configured to perform a convolution operation and, after performing the convolution operation, to input the output directly to the input end of the second operation circuit through the interface connecting it to the second operation circuit. The second operation circuit is configured to perform, in a path-following manner, at least one of an activation operation, a quantization operation, or a pooling operation.
That is, the second operation circuit may include at least one of the following circuits: a quantization operation circuit, an activation operation circuit, or a pooling operation circuit, where the quantization operation circuit is configured to perform the quantization operation, the activation operation circuit is configured to perform the activation operation, and the pooling operation circuit is configured to perform the pooling operation. The quantization operation circuit, the activation operation circuit, and the pooling operation circuit are all configured to perform path-following operations; that is, interconnected operation circuits are directly connected to each other, and a preceding operation circuit, after performing its corresponding operation, directly inputs its output result to the next operation circuit, without requiring an additional module or storage medium for storing data.
The aforementioned direct connection means that the output interface of the preceding operation circuit is directly electrically connected to the input interface of the next operation circuit; after performing its corresponding operation, the preceding operation circuit inputs its output directly to the input interface of the next operation circuit through its own output interface. Data can thus be transmitted directly between the operation circuits without being stored in a memory or buffer, enabling the path-following configuration of each operation circuit.
This can be understood as follows: the first operation circuit feeds its result into the second operation circuit after performing the convolution operation. Taking any circuit within the second operation circuit as the current operation circuit, its preceding operation circuit is either the first operation circuit or another circuit within the second operation circuit. The input of the current circuit is directly connected to the output of the preceding circuit, and the preceding circuit, after performing its operation, feeds its result directly into the current circuit without any additional data-storage module or medium, achieving more efficient data transfer.
It should be noted that the operation circuits in the neural network accelerator provided by this application may be built from various components. For example, combinations of electronic components such as transistors, resistors, and diodes can implement adders, bitwise-AND gates, OR gates, and the like, and these circuits can in turn be combined to implement the aforementioned convolution, pooling, quantization, or activation operations. The specific arrangement of electronic components depends on the circuit actually required and is not described further in this application.
The circuits included in the second operation circuit can be chained on the fly in several ways. For example, referring to Figure 4, in a preferred embodiment the output of the first operation circuit is the input of the quantization operation circuit, the output of the quantization operation circuit is the input of the activation operation circuit, and the output of the activation operation circuit is the input of the pooling operation circuit. In other words, the second operation circuit is configured to perform, on the fly, the quantization operation first, then the activation operation, and finally the pooling operation.
As another example, referring to Figure 5, the output of the first operation circuit is the input of the activation operation circuit, the output of the activation operation circuit is the input of the pooling operation circuit, and the output of the pooling operation circuit is the input of the quantization operation circuit. In other words, the second operation circuit is configured to perform, on the fly, the activation operation first, then the pooling operation, and finally the quantization operation.
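The direct chaining described above can be illustrated with a small behavioral sketch (software only; the function names and the NumPy-based data types are illustrative assumptions, not part of the accelerator's actual interface). Each stage consumes the previous stage's output directly, with no intermediate buffer between stages:

```python
import numpy as np

def conv_stage(x, w):
    """First operation circuit: multiply-accumulate (a 1xK vector times a KxN matrix)."""
    return x @ w

def quant_stage(v, scale):
    """Quantization circuit: rescale and round (one of many possible quantization schemes)."""
    return np.round(v * scale)

def act_stage(v):
    """Activation circuit: ReLU used as the example activation function."""
    return np.maximum(v, 0.0)

def pool_stage(v, window):
    """Pooling circuit: max pooling over consecutive groups of `window` elements."""
    return v.reshape(-1, window).max(axis=1)

# Figure 4 ordering: conv -> quant -> act -> pool, each output fed straight
# into the next stage without being written to memory in between.
x = np.array([1.0, -2.0, 3.0])                 # 1xK input vector
w = np.arange(12, dtype=float).reshape(3, 4)   # KxN weight matrix
result = pool_stage(act_stage(quant_stage(conv_stage(x, w), scale=0.5)), window=2)
print(result)   # [ 9. 11.]
```

The Figure 5 ordering would simply reorder the nested calls to act, then pool, then quant.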
The convolution operation circuit (for ease of description, Figures 3 to 12 refer to the first operation circuit as the convolution operation circuit) has two inputs and one output. The two inputs may be a convolution kernel and an input image, where the kernel is represented by a weight matrix and the input image may be represented either by an input-image matrix or by an input-image vector; the vector representation is used as the example below. Referring to Figure 6, the input of the convolution operation circuit consists of one vector and one matrix; the circuit mainly performs multiply-accumulate operations, and its output is again a vector. Both K and N shown in Figure 6 are positive integers greater than 0. It should be noted that the embodiments of this application do not limit the type of convolution operation; for example, the types include, but are not limited to, depthwise separable convolution, general matrix-to-matrix multiplication (GEMM), and general matrix-to-vector multiplication (GEMV).
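As a behavioral sketch of the vector-times-matrix form in Figure 6 (dimensions and names are illustrative), each of the N output elements is produced by K multiply-accumulate steps:

```python
import numpy as np

def gemv(x, w):
    """1xK vector times KxN matrix via explicit multiply-accumulate,
    mirroring what the convolution circuit computes per output element."""
    K, N = w.shape
    out = np.zeros(N)
    for n in range(N):
        acc = 0.0
        for k in range(K):           # K multiply-accumulate steps
            acc += x[k] * w[k, n]
        out[n] = acc
    return out

x = np.array([1.0, 2.0, 3.0])
w = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
assert np.allclose(gemv(x, w), x @ w)   # matches the library GEMV
print(gemv(x, w))                        # [4. 5.]
```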
The quantization operation circuit performs quantization and may be configured to perform any quantization scheme; the embodiments of this application do not limit the quantization method. Assuming the input of the quantization operation circuit is a 1×N vector, the output after quantization is still a 1×N vector, and the data format of the output vector includes, but is not limited to, fp16, s16, s8, and s4.
The activation operation circuit performs an activation operation through an activation function and may be configured with any activation function; the embodiments of this application do not limit the type of activation function. Assuming the input of the activation operation circuit is a 1×N vector, the output after activation is still a 1×N vector. Exemplary activation functions include, but are not limited to, ReLU, sigmoid, and tanh. Figures 7a to 7c show schematic diagrams of the ReLU, sigmoid, and tanh activation functions, in which the horizontal axis represents the input of the activation operation circuit and the vertical axis represents its output.
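The three example activation functions follow directly from their standard definitions (a plain NumPy sketch; the accelerator itself would realize these in hardware):

```python
import numpy as np

def relu(x):
    """ReLU: zero for negative inputs, identity for non-negative inputs."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Sigmoid: maps any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: maps any input into (-1, 1)."""
    return np.tanh(x)

v = np.array([-1.0, 0.0, 1.0])   # 1xN input; the output stays 1xN
print(relu(v))                    # [0. 0. 1.]
print(sigmoid(0.0))               # 0.5
```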
The pooling operation circuit performs pooling and may be configured to perform any pooling scheme; the embodiments of this application do not limit the pooling method, which includes, by way of example and without limitation, max pooling and average pooling. Assume the pooling window has size w×h and the pooling stride is stride, where w, h, and stride are all positive integers greater than 1. In a preferred embodiment, to reduce the amount of computation, w, h, and stride may take the same value. To better appreciate the advantage of this preferred embodiment, an example is given with reference to Figures 8a and 8b. As shown in Figure 8a, the w, h, and stride values of the pooling window differ; Figure 8a uses 3×3×2 as an example, where w and h are 3 and stride is 2. It can be seen that the operands of two successive pooling operations contain duplicate elements: the first pooling operation processes the elements 1,1,4,4,3,5,2,2,8,7, the second processes 4,5,3,5,7,8,7,8,9, and the elements 4, 5, and 7 are processed by both. Referring now to Figure 8b, w, h, and stride take the same value; Figure 8b uses 3×3×3 as an example, where w, h, and stride are all 3. It can be seen that the operands of the two pooling operations share no elements: the first pooling operation processes the elements 1,4,4,3,5,2,8,7 and the second processes 5,3,3,7,8,1,8,9,2. By making w, h, and stride of the pooling window equal, the pooling operation on the input of the pooling operation circuit can be completed faster, saving computation and improving performance.
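The effect can be checked numerically (a sketch over one spatial axis with assumed widths, not the exact matrices of Figures 8a/8b): with stride smaller than the window size, adjacent windows revisit positions, whereas stride equal to the window size partitions the input so no element is read twice.

```python
import numpy as np

def count_reads(width, w, stride):
    """How many pooling windows cover each position along one axis."""
    reads = np.zeros(width, dtype=int)
    for x0 in range(0, width - w + 1, stride):   # top-left index of each window
        reads[x0:x0 + w] += 1
    return reads

# 3-wide window, stride 2: the middle position is read by two windows (overlap).
print(count_reads(width=5, w=3, stride=2))   # [1 1 2 1 1]
# 3-wide window, stride 3: every position is read exactly once.
print(count_reads(width=6, w=3, stride=3))   # [1 1 1 1 1 1]
```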
In addition, to ensure that the pooling operation circuit can perform on-the-fly operation without introducing a memory, the order in which the convolution kernel performs the convolution operations is determined by the order in which the pooling window performs the pooling operations. In other words, to ensure that the pooling operation circuit can operate on the fly normally, data must be input to the pooling operation circuit in the order in which the pooling window performs pooling. Referring to Figure 9, this is illustrated with max pooling as an example. As shown in Figure 9, assume a 4×4×N tensor is obtained after the convolution operation, the pooling window is 2×2, and the stride is 2. If the pooling window processes the elements 1,1,5,6 first, then 2,4,7,8, then 3,2,4,9, and finally 8,3,0,5, the elements/data are input to the pooling operation circuit in that first-to-fourth order; see Figure 9 for details. It should be noted that the embodiments of this application do not limit the input order of the elements within each pooling step. For example, if the elements processed first by the pooling window are 1,1,5,6, they may be fed to the pooling operation circuit in the order 1,1,5,6, or 1,5,1,6, or 6,5,1,1, and so on.
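When the data arrive in pooling-window order, the pooling circuit needs only one running value per window rather than a buffer for the whole feature map. A behavioral sketch (the element values follow the Figure 9 example; the function name is illustrative):

```python
def streaming_max_pool(stream, window_elems):
    """Consume a stream of values grouped by pooling window
    (window_elems values per window) and emit one max per window,
    keeping only a single running maximum as state."""
    outputs, running, count = [], None, 0
    for v in stream:
        running = v if running is None else max(running, v)
        count += 1
        if count == window_elems:      # window complete: emit and reset
            outputs.append(running)
            running, count = None, 0
    return outputs

# Elements arriving in pooling-window order (the four 2x2 windows of Figure 9).
stream = [1, 1, 5, 6,   2, 4, 7, 8,   3, 2, 4, 9,   8, 3, 0, 5]
print(streaming_max_pool(stream, window_elems=4))   # [6, 8, 9, 8]
```

If the same elements arrived in ordinary row-major order instead, whole rows of partial results would have to be buffered before any window could complete, which is exactly the memory this ordering avoids.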
To better understand the solution provided by the embodiments of this application, the solution is introduced below with a specific example:
Assume the input size is configured as H=4, W=4, C=16, and the weights are configured as kernel_h=3, kernel_w=3, kernel_stride=1, kernel_output_channel=16, indicating that the input of the convolution operation circuit is a 4×4×16 matrix, the weight matrix is 3×3 with a stride of 1, and the data output after convolution has 16 channels. To keep the output of the convolution operation circuit from becoming too small, the input of the convolution operation circuit needs to be padded. Mature techniques exist for padding, which the embodiments of this application do not explain further. Assume here that the padding is configured as top_padding=bottom_padding=left_padding=right_padding=1, indicating how the input of the convolution operation circuit is padded in the up, down, left, and right directions; see the shaded region of Figure 10a. Further assume the pooling window is configured as 2×2 with a stride of 2. For ease of description, the padded input of the convolution operation circuit is referred to below as the feature matrix.
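With these assumed configuration values, the standard output-size formula, out = (in + pad_total − kernel) / stride + 1, confirms that the convolution preserves the 4×4 spatial size and the 2×2 stride-2 pooling then halves it:

```python
def out_size(in_size, kernel, stride, pad_before=0, pad_after=0):
    """Spatial output size of a strided window operation (exact division here)."""
    return (in_size + pad_before + pad_after - kernel) // stride + 1

H = W = 4
conv_h = out_size(H, kernel=3, stride=1, pad_before=1, pad_after=1)
conv_w = out_size(W, kernel=3, stride=1, pad_before=1, pad_after=1)
pool_h = out_size(conv_h, kernel=2, stride=2)
pool_w = out_size(conv_w, kernel=2, stride=2)
print(conv_h, conv_w, pool_h, pool_w)   # 4 4 2 2
```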
As explained above, to ensure that the pooling operation circuit can perform on-the-fly operation normally, data must be input to the pooling operation circuit in the order in which the pooling window performs pooling. The convolution operation circuit therefore performs the convolution operations in the following order:
As shown in Figure 10a, in the 1st convolution, data are read starting from the top-left corner of the feature matrix according to the size of the convolution kernel. For example, for the coordinate point corresponding to the top-left starting position of the kernel, the data/element at coordinates (X,Y)=(0,5) in the feature matrix is read first. Then, reading 1×K elements at a time, the elements at the other positions of the feature matrix covered by the kernel are read until all of them have been read, so that one convolution result is output.
As shown in Figure 10b, in the 2nd convolution, the kernel moves one stride to the right in the w direction according to its size and stride, and the elements of the feature matrix covered by the current kernel position are read. For example, based on the top-left starting position of the current kernel, the data/element at coordinates (X,Y)=(1,5) in the feature matrix is read first. Then, reading 1×K elements at a time, the elements at the other positions of the feature matrix covered by the current kernel are read until all of them have been read, so that one convolution result is output.
As shown in Figure 10c, in the 3rd convolution, to ensure that the pooling operation circuit can perform on-the-fly operation normally, data must be input to the pooling operation circuit in the order in which the pooling window performs pooling. Therefore, in the 3rd convolution, the data/element at coordinates (X,Y)=(0,4) in the feature matrix is read first. Then, reading 1×K elements at a time, the elements at the other positions of the feature matrix covered by the current kernel are read until all of them have been read, so that one convolution result is output.
As shown in Figure 10d, in the 4th convolution, to ensure that the pooling operation circuit can perform on-the-fly operation normally, data must be input to the pooling operation circuit in the order in which the pooling window performs pooling. Therefore, in the 4th convolution, the data/element at coordinates (X,Y)=(1,4) in the feature matrix is read first. Then, reading 1×K elements at a time, the elements at the other positions of the feature matrix covered by the current kernel are read until all of them have been read, so that one convolution result is output.
To better understand this principle, an example is given here with reference to Figure 9. Because the elements processed first by the pooling window are 1,1,5,6, all of these elements must be provided to the pooling operation circuit to ensure that it can perform its first on-the-fly operation. Assume the convolution result of the 1st convolution corresponds to element 1 in the first row, first column of Figure 9, and the result of the 2nd convolution to element 2 in the first row, second column. If the 3rd convolution first read the data/element at coordinates (X,Y)=(2,5) in the feature matrix and then, reading 1×K elements at a time, read the remaining elements covered by the current kernel, the result of the 3rd convolution would be element 2 in the first row, third column of Figure 9, and the first pooling step of the pooling window could not be guaranteed. To ensure that the pooling operation circuit can perform on-the-fly operation normally, the result of the 3rd convolution should instead be element 5 in the second row, first column. Therefore, in the 3rd convolution, the data/element at coordinates (X,Y)=(0,4) in the feature matrix is read first, and then, reading 1×K elements at a time, the remaining elements of the feature matrix covered by the current kernel are read. Likewise, the result of the 4th convolution should be element 6 in the second row, second column.
Following this approach, as shown in Figure 11, the convolution operation circuit performs the subsequent convolutions in the following order:
In the 5th–8th convolutions, to ensure that the pooling operation circuit can perform on-the-fly operation normally, data must be input to it in the order in which the pooling window performs pooling. Therefore, in the 5th–8th convolutions, the data/elements read first are those at coordinates (X,Y)=(2,5), (3,5), (2,4), and (3,4) of the feature matrix, respectively. Then, reading 1×K elements at a time, the elements at the other positions of the feature matrix covered by the current kernel are read until all of them have been read, so that these convolution results are output.
In the 9th–12th convolutions, to ensure that the pooling operation circuit can perform on-the-fly operation normally, data must be input to it in the order in which the pooling window performs pooling. Therefore, in the 9th–12th convolutions, the data/elements read first are those at coordinates (X,Y)=(0,3), (1,3), (0,2), and (1,2) of the feature matrix, respectively. Then, reading 1×K elements at a time, the elements at the other positions of the feature matrix covered by the current kernel are read until all of them have been read, so that these convolution results are output.
In the 13th–16th convolutions, to ensure that the pooling operation circuit can perform on-the-fly operation normally, data must be input to it in the order in which the pooling window performs pooling. Therefore, in the 13th–16th convolutions, the data/elements read first are those at coordinates (X,Y)=(2,3), (3,3), (2,2), and (3,2) of the feature matrix, respectively. Then, reading 1×K elements at a time, the elements at the other positions of the feature matrix covered by the current kernel are read until all of them have been read, so that these convolution results are output.
As shown in Figure 12, assuming that the convolution operation of Figure 11 followed by the quantization operation yields the convolution results shown in Figure 9, the data are input to the pooling operation circuit in the order in which the pooling window performs pooling; the specific principle has been discussed above and is not repeated here.
It should be noted that the embodiments of this application determine the order in which data are input to the pooling operation circuit on a per-pooling-window basis, and do not limit the order in which the pooling window performs the pooling operations. This is illustrated with another example. Referring again to Figure 9, assume the pooling window performs its pooling operations in an order different from that shown in Figure 9; specifically, the pooling window processes the elements 1,1,5,6 first, then 3,2,4,9, then 2,4,7,8, and finally 8,3,0,5. To ensure that the pooling operation circuit can perform on-the-fly operation normally, data must be input to it in the order in which the pooling window performs pooling, so the convolution operation circuit performs the convolutions in the following order:
For the 1st–4th convolutions, refer to the processes of the 1st through 4th convolutions described in Figures 10a to 10d.
For the 5th–8th convolutions, refer to the process of the 5th–8th convolutions shown in Figure 11; this is not repeated here.
For the 9th–12th convolutions, refer to the process of the 9th–12th convolutions shown in Figure 11; this is not repeated here.
For the 13th–16th convolutions, refer to the process of the 13th–16th convolutions shown in Figure 11; this is not repeated here.
As introduced above, the second operation circuit is configured to perform at least one of the activation operation, quantization operation, and pooling operation on the fly. Specifically, this includes, but is not limited to, the following cases:
Case 1: the second operation circuit is configured to perform the activation, quantization, and pooling operations on the fly.
Case 2: the second operation circuit is configured to perform the activation and quantization operations on the fly.
Case 3: the second operation circuit is configured to perform the quantization operation on the fly.
Case 4: the second operation circuit is configured to perform the activation operation on the fly.
Case 5: the second operation circuit is configured to perform the quantization and pooling operations on the fly.
In a possible implementation, the second operation circuit includes a quantization operation circuit, an activation operation circuit, and a pooling operation circuit. Whether the quantization, activation, and pooling operation circuits are enabled can be configured through an instruction, so as to suit the different cases. One possible instruction format is given below:
Operation.type [Xd],[Xn],[Xm],[Xu]
This instruction indicates that four different parameters can be configured.
Xd[31:0]: Destination ADDR. The parameter [Xd] is 32 bits and indicates the destination buffer address where the accelerator's final result is stored.
Xn[31:0]: Source0 ADDR. The parameter [Xn] is 32 bits and indicates the starting buffer address of one of the inputs of the first operation circuit.
Xm[31:0]: Source1 ADDR. The parameter [Xm] is 32 bits and indicates the starting buffer address of the other input of the first operation circuit.
Xu[31:0]: the parameter [Xu] is 32 bits and indicates configuration information including, but not limited to:
(1) the type of convolution operation, for example one of depthwise, GEMV, and GEMM;
(2) whether the activation operation is performed on the fly, that is, whether the activation operation circuit is enabled;
(3) whether the quantization operation is performed on the fly, that is, whether the quantization operation circuit is enabled;
(4) whether the pooling operation is performed on the fly, that is, whether the pooling operation circuit is enabled.
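A hypothetical bit layout for [Xu] makes the idea concrete (the field positions below are assumptions for illustration only; the patent does not specify them):

```python
# Assumed (illustrative) bit fields within the 32-bit Xu configuration word:
#   bits [1:0]  convolution type: 0 = depthwise, 1 = GEMV, 2 = GEMM
#   bit  2      enable the activation operation circuit
#   bit  3      enable the quantization operation circuit
#   bit  4      enable the pooling operation circuit
CONV_TYPES = {0: "depthwise", 1: "GEMV", 2: "GEMM"}

def decode_xu(xu):
    """Unpack the assumed fields of the Xu configuration word."""
    return {
        "conv_type": CONV_TYPES[xu & 0b11],
        "act_enable": bool(xu >> 2 & 1),
        "quant_enable": bool(xu >> 3 & 1),
        "pool_enable": bool(xu >> 4 & 1),
    }

# GEMM convolution with quantization and pooling on the fly, activation off:
xu = 2 | (1 << 3) | (1 << 4)
cfg = decode_xu(xu)
print(cfg)
```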
It should be noted that the instruction format given above is only one example; other instruction formats may be used to ensure the normal operation of the first and second operation circuits. For example:
In a possible implementation, FM_cfg configuration information and Weight_cfg configuration information may also be introduced, where FM_cfg indicates information such as the size and number of channels of the input image matrix, and Weight_cfg indicates information such as the size and stride of the convolution kernel.
In a possible implementation, Deq_cfg configuration information may also be introduced to indicate information such as the quantization target.
In a possible implementation, Pooling_cfg configuration information may also be introduced to indicate information such as the pooling window size and the pooling stride.
Referring to Figure 13, a flow chart for designing the configuration instructions is given. First, the configuration information related to the convolution operation circuit, such as the FM_cfg and Weight_cfg configuration information introduced above, is loaded. It is then determined whether the quantization operation circuit needs to be enabled: if so, the instruction is configured to enable the quantization operation circuit and the related configuration information, such as the Deq_cfg configuration information introduced above, is loaded; if not, the instruction is configured not to enable the quantization operation circuit. It is then determined whether the activation operation circuit needs to be enabled: if so, the instruction is configured to enable the activation operation circuit and the related configuration information is loaded; if not, the instruction is configured not to enable the activation operation circuit. Next, it is determined whether the pooling operation circuit needs to be enabled: if so, the instruction is configured to enable the pooling operation circuit and the related configuration information, such as the Pooling_cfg configuration information introduced above, is loaded; if not, the instruction is configured not to enable the pooling operation circuit.
需要说明的是，图13给出的设计配置指令的流程图仅为示例性的说明，实际上配置是否启动各个电路可以有多种顺序，比如可以先判断是否需要启动激活运算电路，再判断是否需要启动量化运算电路，再判断是否需要启动池化运算电路，等等。此外，还可以根据实际场景需求，设计更多或者更少的配置指令，比如上文介绍到有一些神经网络还需要执行element wise运算，则还可以配置指令用于确定是否启动逐项运算电路，以确定是否执行element wise运算，等等。It should be noted that the flow chart for designing configuration instructions shown in Figure 13 is only an exemplary illustration. In practice, whether to start each circuit can be configured in multiple orders; for example, it can first be determined whether the activation operation circuit needs to be started, then whether the quantization operation circuit needs to be started, then whether the pooling operation circuit needs to be started, and so on. In addition, more or fewer configuration instructions can be designed according to actual scenario requirements. For example, as mentioned above, some neural networks also need to perform element wise operations, so instructions can also be configured to determine whether to start the item-by-item operation circuit, that is, whether to perform element wise operations, and so on.
在一个优选的实施方式中，可以通过同一个电路执行卷积运算或者element wise运算。换句话说，第一运算电路除了用于执行卷积运算，还可以用于执行element wise运算。这是因为卷积运算的本质是元素乘累加（元素乘，元素加），element wise运算的本质是对元素进行加(add)、减(sub)、乘(mul)、除(div)、取最大(max)、取最小(min)等操作。所以，本质上二者在运算上是有重叠的，可以用同一个硬件来做，即二者的硬件资源可以复用，降低硬件的面积开销，降低设计的复杂性。当通过第一运算电路来执行element wise运算时（这里将其称之为逐项运算电路），逐项运算电路的输入包括一个向量和另一个向量，输出仍为一个向量。In a preferred embodiment, the convolution operation or the element wise operation can be performed by the same circuit. In other words, in addition to performing convolution operations, the first operation circuit can also be used to perform element wise operations. This is because the essence of the convolution operation is element multiply-accumulate (element multiplication, element addition), while the essence of an element wise operation is to apply per-element operations such as addition (add), subtraction (sub), multiplication (mul), division (div), maximum (max), and minimum (min). Therefore, the two operations essentially overlap and can be done with one piece of hardware; that is, their hardware resources can be reused, reducing the area overhead of the hardware and the complexity of the design. When the element wise operation is performed by the first operation circuit (herein referred to as the item-by-item operation circuit), the input of the item-by-item operation circuit includes one vector and another vector, and the output is still a vector.
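上述"卷积与element wise运算在运算上重叠"的观点可以用如下草图说明。The overlap claimed above can be illustrated with a minimal sketch (illustrative only; the function names are not from the patent): the inner loop of a convolution is an element-wise multiply followed by an accumulation, so the same per-element datapath can serve both kinds of operation:

```python
# Per-element binary operations supported by the item-by-item circuit.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "max": max,
    "min": min,
}

def elementwise(op, x, y):
    """Vector in, vector out: apply op to corresponding elements."""
    f = OPS[op]
    return [f(a, b) for a, b in zip(x, y)]

def dot(x, w):
    """Convolution's inner loop: element-wise multiply, then accumulate.
    Note it reuses the same per-element 'mul' path as elementwise()."""
    return sum(elementwise("mul", x, w))
```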
参阅图14，给出了卷积运算电路和逐项运算电路复用硬件资源时的一种加速器的结构示意图，其中第二运算电路可以参照上文关于第二运算电路的描述进行理解，这里不再赘述。在这种实施方式中，可以通过指令配置第一运算电路执行卷积运算或者执行element wise运算。比如还可以通过上文介绍的参数Xt[31:0]来指示执行卷积运算还是执行element wise运算。在一个可能的实施方式中，还可以通过上文介绍的参数Xt[31:0]来指示element wise运算的类型，比如执行对元素进行加(add)、减(sub)、乘(mul)、除(div)、取最大值(max)、取最小值(min)中的其中一种。Referring to Figure 14, a schematic structural diagram of an accelerator is provided in which the convolution operation circuit and the item-by-item operation circuit multiplex hardware resources; the second operation circuit can be understood with reference to the above description and will not be repeated here. In this implementation, the first operation circuit can be configured through instructions to perform either a convolution operation or an element wise operation. For example, the parameter Xt[31:0] introduced above can be used to indicate whether to perform a convolution operation or an element wise operation. In a possible implementation, the parameter Xt[31:0] can also be used to indicate the type of element wise operation, i.e., one of addition (add), subtraction (sub), multiplication (mul), division (div), maximum (max), or minimum (min).
参阅图15，给出了另一种设计配置指令时的流程图。首先判断是否需要执行卷积运算：如果需要执行卷积运算，则配置指令启动第一运算电路执行卷积运算，并配置载入与卷积运算相关的配置信息，比如上文介绍的FM_cfg配置信息和Weight_cfg配置信息；如果不需要执行卷积运算，则配置指令启动第一运算电路执行element wise运算，并配置载入与element wise运算相关的配置信息。然后可以判断是否需要启动量化运算电路：如果需要启动，则配置指令启动量化运算电路，并配置载入与量化运算电路相关的配置信息，比如上文介绍到的Deq_cfg配置信息；如果不需要启动，则配置指令不启动量化运算电路。再然后判断是否需要启动激活运算电路：如果需要启动，则配置指令启动激活运算电路，并载入与激活运算电路相关的配置信息；如果不需要启动，则配置指令不启动激活运算电路。接下来，判断是否需要启动池化运算电路：如果需要启动，则配置指令启动池化运算电路，并载入与池化运算电路相关的配置信息，比如上文介绍到的Pooling_cfg配置信息；如果不需要启动，则配置指令不启动池化运算电路。Referring to Figure 15, another flow chart for designing configuration instructions is given. First, it is determined whether a convolution operation needs to be performed: if so, the configuration instruction starts the first operation circuit to perform the convolution operation and loads the configuration information related to the convolution operation, such as the FM_cfg and Weight_cfg configuration information introduced above; if not, the configuration instruction starts the first operation circuit to perform the element wise operation and loads the configuration information related to the element wise operation. Then it is determined whether the quantization operation circuit needs to be started: if so, the configuration instruction starts the quantization operation circuit and loads the related configuration information, such as the Deq_cfg configuration information introduced above; if not, the configuration instruction does not start the quantization operation circuit. Next it is determined whether the activation operation circuit needs to be started: if so, the configuration instruction starts the activation operation circuit and loads the related configuration information; if not, the configuration instruction does not start the activation operation circuit. Finally, it is determined whether the pooling operation circuit needs to be started: if so, the configuration instruction starts the pooling operation circuit and loads the related configuration information, such as the Pooling_cfg configuration information introduced above; if not, the configuration instruction does not start the pooling operation circuit.
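图15的流程与图13的区别在于第一级的分支。The Figure-15 flow differs from Figure 13 only in its first branch, where the shared first operation circuit is configured for either convolution or an element wise operation. A sketch under the same assumptions as before (`Eltwise_cfg` and `Act_cfg` are hypothetical config names, not from the patent):

```python
def build_config_v2(do_conv, need_quant, need_act, need_pool):
    """Figure-15 variant: the first stage is either a convolution or an
    element wise operation on the shared first operation circuit; the
    remaining optional stages are configured as in Figure 13."""
    cfg = ["FM_cfg", "Weight_cfg"] if do_conv else ["Eltwise_cfg"]
    if need_quant:
        cfg.append("Deq_cfg")
    if need_act:
        cfg.append("Act_cfg")
    if need_pool:
        cfg.append("Pooling_cfg")
    return cfg
```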
在一个可能的实施方式中，为了能够使本申请实施例提供的加速器能够支持更多不同结构的神经网络执行多种运算，还可以引入存储器。比如，在一个可能的实施方式中，利用第一运算电路执行卷积运算，将获取到的卷积结果写入存储器中，该卷积结果可以再次作为第一运算电路的输入，比如第一运算电路可以从存储器中获取该卷积结果，并利用该卷积结果执行element wise运算。再比如，在一个可能的实施方式中，激活运算电路、池化运算电路、量化运算电路也可以将输出的结果写入存储器中。In a possible implementation, in order to enable the accelerator provided by the embodiments of the present application to support more neural networks with different structures in performing multiple operations, a memory may also be introduced. For example, in one possible implementation, the first operation circuit performs a convolution operation and the obtained convolution result is written into the memory; this convolution result can again serve as an input of the first operation circuit, which can retrieve it from the memory and use it to perform an element wise operation. As another example, in a possible implementation, the activation operation circuit, the pooling operation circuit, and the quantization operation circuit can also write their output results into the memory.
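上述"写回存储器、再次作为输入"的数据流可以用一个极简模型示意。The write-back-and-reuse data flow can be shown with a toy model (the class and slot names are illustrative assumptions, not the patent's design):

```python
class ScratchMemory:
    """Toy model of the introduced memory: results written back can be
    fetched again later as inputs, e.g. a convolution result reused as
    one operand of a subsequent element wise addition."""
    def __init__(self):
        self.slots = {}

    def write(self, name, data):
        self.slots[name] = list(data)

    def read(self, name):
        return self.slots[name]

mem = ScratchMemory()
conv_result = [4, 0, 7]                 # stand-in for a convolution output
mem.write("conv_out", conv_result)      # first circuit writes its result back
reloaded = mem.read("conv_out")         # ...and fetches it again as an input
ew_sum = [a + b for a, b in zip(reloaded, [1, 1, 1])]   # element wise add
```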
参阅图16，为本申请的实施例提供的一种加速方法的流程示意图，包括：对第一运算电路的输入执行卷积运算（1601）。其中，第一运算电路的输出接口与第二运算电路的输入接口直接连接，第一运算电路在执行卷积运算后，将输出通过第一运算电路的输出接口直接输入至第二运算电路的输入接口；第二运算电路包括以下电路中的至少一种：激活运算电路、量化运算电路或者池化运算电路。Referring to FIG. 16, a schematic flowchart of an acceleration method provided by an embodiment of the present application includes: performing a convolution operation on the input of the first operation circuit (1601). The output interface of the first operation circuit is directly connected to the input interface of the second operation circuit; after performing the convolution operation, the first operation circuit inputs its output directly to the input interface of the second operation circuit through its own output interface. The second operation circuit includes at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit.
当第二运算电路包括激活运算电路时,对激活运算电路的输入执行激活运算1602,激活运算电路的输入是从第一运算电路或者量化运算电路或者池化运算电路中获取的。当第二运算电路包括量化运算电路时,对量化运算电路的输入执行量化运算1603,量化运算电路的输入是从第一运算电路或者激活运算电路或者池化运算电路中获取的。当第二运算电路包括池化运算电路时,对池化运算电路的输入执行池化运算1604,池化运算电路的输入是从第一运算电路或者激活运算电路或者量化运算电路中获取的。When the second operation circuit includes an activation operation circuit, an activation operation 1602 is performed on the input of the activation operation circuit, and the input of the activation operation circuit is obtained from the first operation circuit, the quantization operation circuit, or the pooling operation circuit. When the second operation circuit includes a quantization operation circuit, the quantization operation 1603 is performed on the input of the quantization operation circuit, and the input of the quantization operation circuit is obtained from the first operation circuit or the activation operation circuit or the pooling operation circuit. When the second operation circuit includes a pooling operation circuit, a pooling operation 1604 is performed on the input of the pooling operation circuit, and the input of the pooling operation circuit is obtained from the first operation circuit or the activation operation circuit or the quantization operation circuit.
其中，第一运算电路在执行卷积运算后将输出结果输入至第二运算电路，当前运算电路的输入端与上一个运算电路的输出端连接时，上一个运算电路在执行对应的运算后将运算结果输入至当前运算电路，当前运算电路是第二运算电路中的任意一个电路，上一个运算电路为第一运算电路或者第二运算电路中的运算电路。Specifically, the first operation circuit inputs its output result to the second operation circuit after performing the convolution operation. When the input end of the current operation circuit is connected to the output end of the previous operation circuit, the previous operation circuit inputs its operation result to the current operation circuit after performing the corresponding operation; the current operation circuit is any one of the circuits in the second operation circuit, and the previous operation circuit is the first operation circuit or an operation circuit in the second operation circuit.
具体地，当激活运算电路的输入端与第一运算电路的输出端连接时，第一运算电路执行卷积操作后将输出数据输入至激活运算电路的输入端；或者，当激活运算电路的输入端与量化运算电路的输出端连接时，量化运算电路执行量化操作后将输出数据输入至激活运算电路的输入端；或者，当激活运算电路的输入端与池化运算电路的输出端连接时，池化运算电路执行池化运算后将输出数据输入至激活运算电路的输入端；Specifically, when the input end of the activation operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs the convolution operation and then inputs the output data to the input end of the activation operation circuit; or, when the input end of the activation operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit performs the quantization operation and then inputs the output data to the input end of the activation operation circuit; or, when the input end of the activation operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit performs the pooling operation and then inputs the output data to the input end of the activation operation circuit;
当量化运算电路的输入端与第一运算电路的输出端连接时，第一运算电路执行卷积操作后将输出数据输入至量化运算电路的输入端；或者，当量化运算电路的输入端与激活运算电路的输出端连接时，激活运算电路执行激活运算后将输出数据输入至量化运算电路的输入端；或者，当量化运算电路的输入端与池化运算电路的输出端连接时，池化运算电路执行池化运算后将输出数据输入至量化运算电路的输入端；When the input end of the quantization operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs the convolution operation and then inputs the output data to the input end of the quantization operation circuit; or, when the input end of the quantization operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit performs the activation operation and then inputs the output data to the input end of the quantization operation circuit; or, when the input end of the quantization operation circuit is connected to the output end of the pooling operation circuit, the pooling operation circuit performs the pooling operation and then inputs the output data to the input end of the quantization operation circuit;
当池化运算电路的输入端与第一运算电路的输出端连接时，第一运算电路执行卷积操作后将输出数据输入至池化运算电路的输入端；或者，当池化运算电路的输入端与激活运算电路的输出端连接时，激活运算电路执行激活运算后将输出数据输入至池化运算电路的输入端；或者，当池化运算电路的输入端与量化运算电路的输出端连接时，量化运算电路执行量化运算后将输出数据输入至池化运算电路的输入端。When the input end of the pooling operation circuit is connected to the output end of the first operation circuit, the first operation circuit performs the convolution operation and then inputs the output data to the input end of the pooling operation circuit; or, when the input end of the pooling operation circuit is connected to the output end of the activation operation circuit, the activation operation circuit performs the activation operation and then inputs the output data to the input end of the pooling operation circuit; or, when the input end of the pooling operation circuit is connected to the output end of the quantization operation circuit, the quantization operation circuit performs the quantization operation and then inputs the output data to the input end of the pooling operation circuit.
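上述级联关系可以用一条可配置的流水线草图表示。The chaining described above (each enabled circuit consuming the previous circuit's output) can be sketched as a configurable pipeline; the stage functions below are deliberately trivial stand-ins, and the stage names are illustrative assumptions:

```python
def run_pipeline(x, stages):
    """Chain optional stages after convolution: each enabled stage takes
    the previous stage's output as its input. The stage order is given by
    the 'stages' list, mirroring the flexible connections in the method."""
    relu = lambda v: [max(0.0, e) for e in v]      # activation stand-in
    quantize = lambda v: [round(e) for e in v]     # quantization stand-in
    pool_max = lambda v: [max(v)]                  # trivial max pooling
    table = {"act": relu, "quant": quantize, "pool": pool_max}
    for s in stages:
        x = table[s](x)                            # output feeds next input
    return x

# convolution output -> activation -> quantization -> pooling
out = run_pipeline([-1.2, 0.4, 2.6], ["act", "quant", "pool"])
```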
在一种可能的实施方式中，对第一运算电路的输入执行卷积运算，包括：利用卷积核遍历特征图feature map，以对卷积核中的元素和遍历区域内特征图中的元素进行卷积运算，得到多个卷积结果。对池化运算电路的输入执行池化运算，包括：池化运算电路，具体用于按照池化运算电路对多个卷积结果执行池化运算的顺序，获取多个卷积结果。In a possible implementation, performing a convolution operation on the input of the first operation circuit includes: using a convolution kernel to traverse the feature map, so as to perform a convolution operation on the elements of the convolution kernel and the elements of the feature map within the traversed area, obtaining multiple convolution results. Performing a pooling operation on the input of the pooling operation circuit includes: the pooling operation circuit obtaining the multiple convolution results in the order in which it performs pooling operations on them.
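卷积核遍历特征图的过程可以用如下草图说明。The kernel-traversal step can be sketched as follows (a stride-1, no-padding sketch for illustration; the patent does not fix these parameters): the results are produced in row-major order, which is also the order in which a downstream pooling circuit could consume them.

```python
def conv2d_valid(fm, k):
    """Slide kernel k over feature map fm (stride 1, no padding) and
    multiply-accumulate at each position. Results come out in row-major
    order, position by position."""
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(len(fm) - kh + 1):
        row = []
        for j in range(len(fm[0]) - kw + 1):
            row.append(sum(fm[i + a][j + b] * k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

fm = [[1, 2, 0],
      [0, 1, 3],
      [2, 1, 1]]
k = [[1, 0],
     [0, 1]]
res = conv2d_valid(fm, k)   # a 2x2 grid of convolution results
```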
在一种可能的实施方式中，该方法还包括：对输入至第一运算电路的两个张量的对应位置的元素进行相加操作、相减操作、相乘操作、相除操作、取最大值操作或者取最小值操作。In a possible implementation, the method further includes: performing an addition, subtraction, multiplication, division, maximum, or minimum operation on the elements at corresponding positions of the two tensors input to the first operation circuit.
在一种可能的实施方式中，激活运算电路的输入具体是从第一运算电路中获取的，量化运算电路的输入具体是从激活运算电路中获取的，池化运算电路的输入具体是从量化运算电路中获取的。In a possible implementation, the input of the activation operation circuit is specifically obtained from the first operation circuit, the input of the quantization operation circuit is specifically obtained from the activation operation circuit, and the input of the pooling operation circuit is specifically obtained from the quantization operation circuit.
在一种可能的实施方式中,池化运算中池化窗口的尺寸为w×h,池化步长为stride,其中,w、h、stride的取值相同,w为大于1的正整数。In a possible implementation, the size of the pooling window in the pooling operation is w×h, and the pooling step size is stride, where w, h, and stride have the same values, and w is a positive integer greater than 1.
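当w、h与stride取值相同时，池化窗口互不重叠。When w, h, and stride are equal, the pooling windows do not overlap; a 1-D sketch (illustrative, not the patent's circuit) shows the resulting output, covering both the max and average variants mentioned below:

```python
def pool1d(v, w, mode="max"):
    """Non-overlapping pooling: window size w equals the stride, matching
    the w == h == stride constraint (1-D for brevity). mode selects max
    pooling or average pooling."""
    assert w > 1                         # w is a positive integer > 1
    out = []
    for i in range(0, len(v) - w + 1, w):
        win = v[i:i + w]
        out.append(max(win) if mode == "max" else sum(win) / w)
    return out
```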
在一种可能的实施方式中，激活运算通过sigmoid函数、tanh函数、prelu函数、leaky relu函数和relu函数中的任一种实现。In a possible implementation, the activation operation is implemented through any one of the sigmoid function, tanh function, PReLU function, leaky ReLU function, and ReLU function.
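这些激活函数可以统一写成如下草图。The listed activation choices can be written in one sketch (illustrative; note that PReLU's slope `alpha` is a learned parameter per channel, whereas for leaky ReLU it is a fixed small constant, and 0.01 here is only a placeholder):

```python
import math

def activate(x, kind, alpha=0.01):
    """Apply one of the activation functions listed above to a scalar x."""
    if kind == "relu":
        return max(0.0, x)
    if kind in ("leaky_relu", "prelu"):          # same form, different alpha
        return x if x > 0 else alpha * x
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(-x))
    if kind == "tanh":
        return math.tanh(x)
    raise ValueError(kind)
```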
在一种可能的实施方式中,池化运算包括最大值池化运算或者平均值池化运算。In a possible implementation, the pooling operation includes a maximum pooling operation or an average pooling operation.
在一种可能的实施方式中,卷积运算包括深度可分卷积(depthwise separable convolution)、矩阵与矩阵相乘的卷积GEMM以及矩阵与向量相乘的卷积GEMV。In a possible implementation, the convolution operation includes depthwise separable convolution, convolution GEMM of matrix-matrix multiplication, and convolution GEMV of matrix-vector multiplication.
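上述三种卷积形式的区别可以用如下草图说明。The three convolution forms can be contrasted in a minimal sketch (illustrative reference implementations, not the accelerator's datapath): GEMM multiplies a matrix by a matrix, GEMV multiplies a matrix by a vector, and depthwise convolution applies one kernel per channel with no cross-channel mixing.

```python
def gemv(M, v):
    """Matrix-vector product: the GEMV form of convolution."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gemm(A, B):
    """Matrix-matrix product: the GEMM form of convolution."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]

def depthwise(channels, kernels):
    """Depthwise convolution: each channel is convolved with its own 1-D
    kernel (stride 1, no padding), with no mixing across channels."""
    out = []
    for ch, k in zip(channels, kernels):
        out.append([sum(ch[i + j] * k[j] for j in range(len(k)))
                    for i in range(len(ch) - len(k) + 1)])
    return out
```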
本申请还提供一种电子设备,该电子设备中可以设置处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行上述实施例描述的加速方法。该电子设备具体可以包括各种终端或者穿戴设备等。This application also provides an electronic device in which a processing unit and a communication interface can be provided. The processing unit obtains program instructions through the communication interface. The program instructions are executed by the processing unit. The processing unit is used to execute the acceleration method described in the above embodiment. The electronic device may specifically include various terminals or wearable devices.
例如，该穿戴设备可以包括手环、智能手表、智能眼镜、头戴显示设备（Head Mount Display，HMD）、增强现实（augmented reality，AR）设备、混合现实（mixed reality，MR）设备等。For example, the wearable device may include a bracelet, a smart watch, smart glasses, a head mounted display (HMD), an augmented reality (AR) device, a mixed reality (MR) device, etc.
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质包括指令,所述指令指示执行上述实施例描述的加速方法。An embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium includes instructions instructing to execute the acceleration method described in the above embodiments.
本申请实施例还提供了一种计算机程序产品，所述计算机程序产品被计算机执行时，所述计算机执行前述实施例所描述的加速方法。该计算机程序产品可以为一个软件安装包，在需要使用前述任一方法的情况下，可以下载该计算机程序产品并在计算机上执行该计算机程序产品。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时，全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一计算机可读存储介质传输，例如，所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线（例如同轴电缆、光纤、数字用户线）或无线（例如红外、无线、微波等）方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质（例如，软盘、硬盘、磁带）、光介质、或者半导体介质（例如固态硬盘(solid state disk,SSD)）等。An embodiment of the present application also provides a computer program product. When the computer program product is executed by a computer, the computer executes the acceleration method described in the foregoing embodiments. The computer program product may be a software installation package; if any of the foregoing methods is required, the computer program product can be downloaded and executed on a computer. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media, or semiconductor media (e.g., solid state disk (SSD)), etc.
本申请实施例还提供一种数字处理芯片。该数字处理芯片中集成了用于实现上述处理器或者处理器的功能的电路和一个或者多个接口。当该数字处理芯片中集成了存储器时,该数字处理芯片可以完成前述实施例中的任一个或多个实施例的方法步骤。当该数字处理芯片中未集成存储器时,可以通过通信接口与外置的存储器连接。该数字处理芯片根据外置的存储器中存储的程序代码来实现上述实施例中神经网络加速器执行的动作。An embodiment of the present application also provides a digital processing chip. The digital processing chip integrates a circuit and one or more interfaces for realizing the above-mentioned processor or functions of the processor. When a memory is integrated into the digital processing chip, the digital processing chip can complete the method steps of any one or more of the foregoing embodiments. When the digital processing chip does not have an integrated memory, it can be connected to an external memory through a communication interface. The digital processing chip implements the actions performed by the neural network accelerator in the above embodiment according to the program code stored in the external memory.
本申请实施例还提供一种芯片，该芯片中可以部署本申请提供的神经网络加速器，芯片包括：处理单元和通信单元，所述处理单元例如可以是处理器，所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令，以使服务器内的芯片执行上述所示实施例描述的神经网络加速器执行的动作。可选地，所述存储单元为所述芯片内的存储单元，如寄存器、缓存等，所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元，如只读存储器（read-only memory，ROM）或可存储静态信息和指令的其他类型的静态存储设备，随机存取存储器（random access memory，RAM）等。An embodiment of the present application also provides a chip in which the neural network accelerator provided by the present application can be deployed. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip in the server performs the actions performed by the neural network accelerator described in the embodiments shown above. Optionally, the storage unit is a storage unit within the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
具体地，前述的处理单元或者处理器可以是中央处理器（central processing unit，CPU）、网络处理器（neural-network processing unit，NPU）、图形处理器（graphics processing unit，GPU）、数字信号处理器（digital signal processor，DSP）、专用集成电路（application specific integrated circuit，ASIC）或现场可编程逻辑门阵列（field programmable gate array，FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者也可以是任何常规的处理器等。Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
示例性地，请参阅图17，图17为本申请实施例提供的芯片的一种结构示意图，所述芯片可以表现为神经网络处理器NPU 170，NPU 170作为协处理器挂载到主CPU（Host CPU）上，由Host CPU分配任务。NPU的核心部分为运算电路1703，通过控制器1704控制运算电路1703提取存储器中的矩阵数据并进行乘法运算。For example, please refer to Figure 17, which is a schematic structural diagram of a chip provided by an embodiment of the present application. The chip may be embodied as a neural network processor NPU 170; the NPU 170 is mounted on the host CPU as a co-processor, and the host CPU allocates tasks to it. The core part of the NPU is the operation circuit 1703; the controller 1704 controls the operation circuit 1703 to extract matrix data from the memory and perform multiplication operations.
在一些实现中,运算电路1703内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路1703是二维脉动阵列。运算电路1703还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1703是通用的矩阵处理器。In some implementations, the computing circuit 1703 internally includes multiple processing engines (PEs). In some implementations, arithmetic circuit 1703 is a two-dimensional systolic array. The arithmetic circuit 1703 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1703 is a general-purpose matrix processor.
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1702中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1701中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1708中。For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1702 and caches it on each PE in the arithmetic circuit. The operation circuit takes matrix A data and matrix B from the input memory 1701 to perform matrix operations, and the partial result or final result of the matrix is stored in an accumulator (accumulator) 1708 .
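上述"部分结果累加"的过程可以用如下草图说明。The accumulate-partial-results flow can be sketched in miniature (an illustrative scalar model of the Figure-17 data flow, not the systolic array itself): B's entries play the role of the weights cached on the PEs, and `acc` plays the role of the accumulator 1708.

```python
def matmul_accumulate(A, B):
    """Compute C = A x B by summing partial products in an accumulator,
    one multiply-accumulate step at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0                        # the accumulator register
            for t in range(k):
                acc += A[i][t] * B[t][j]   # one MAC step
            C[i][j] = acc                  # final result for this element
    return C
```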
统一存储器1706用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器（direct memory access controller，DMAC）1705被搬运到权重存储器1702中。输入数据也通过DMAC被搬运到统一存储器1706中。The unified memory 1706 is used to store input data and output data. The weight data is transferred directly to the weight memory 1702 through the direct memory access controller (DMAC) 1705. The input data is also transferred to the unified memory 1706 through the DMAC.
总线接口单元(bus interface unit,BIU)1710,用于AXI总线与DMAC和取指存储器(instruction fetch buffer,IFB)1709的交互。 Bus interface unit (bus interface unit, BIU) 1710 is used for interaction between the AXI bus and DMAC and instruction fetch buffer (IFB) 1709.
总线接口单元1710（bus interface unit，BIU），用于取指存储器1709从外部存储器获取指令，还用于存储单元访问控制器1705从外部存储器获取输入矩阵A或者权重矩阵B的原数据。The bus interface unit 1710 (BIU) is used by the instruction fetch buffer 1709 to obtain instructions from the external memory, and is also used by the storage unit access controller 1705 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1706，或将权重数据搬运到权重存储器1702中，或将输入数据搬运到输入存储器1701中。The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1706, transfer the weight data to the weight memory 1702, or transfer the input data to the input memory 1701.
向量计算单元1707包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如批归一化(batch normalization),像素级求和,对特征平面进行上采样等。The vector calculation unit 1707 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as batch normalization, pixel-level summation, upsampling of feature planes, etc.
在一些实现中,向量计算单元1707能将经处理的输出的向量存储到统一存储器1706。例如,向量计算单元1707可以将线性函数和/或非线性函数应用到运算电路1703的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1707生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1703的激活输入,例如用于在神经网络中的后续层中的使用。In some implementations, vector calculation unit 1707 can store the processed output vectors to unified memory 1706 . For example, the vector calculation unit 1707 can apply a linear function and/or a nonlinear function to the output of the operation circuit 1703, such as linear interpolation on the feature plane extracted by the convolution layer, or a vector of accumulated values, to generate an activation value. In some implementations, vector calculation unit 1707 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1703, such as for use in a subsequent layer in a neural network.
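向量计算单元承担的典型非卷积运算可以用如下草图示意。Two of the non-convolution operations attributed to the vector unit above can be sketched as follows (illustrative reference formulas; a real batch normalization also applies learned scale and shift parameters, which are omitted here):

```python
def batch_norm(v, eps=1e-5):
    """Normalize a vector: subtract the mean, divide by the standard
    deviation (per-vector, for illustration; eps avoids division by zero)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / (var + eps) ** 0.5 for x in v]

def pixelwise_sum(a, b):
    """Pixel-level summation of two equally shaped feature vectors."""
    return [x + y for x, y in zip(a, b)]
```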
控制器1704连接的取指存储器(instruction fetch buffer)1709,用于存储控制器1704使用的指令;The instruction fetch buffer 1709 connected to the controller 1704 is used to store instructions used by the controller 1704;
统一存储器1706,输入存储器1701,权重存储器1702以及取指存储器1709均为On-Chip存储器。外部存储器私有于该NPU硬件架构。The unified memory 1706, the input memory 1701, the weight memory 1702 and the fetch memory 1709 are all On-Chip memories. External memory is private to the NPU hardware architecture.
其中,循环神经网络中各层的运算可以由运算电路1703或向量计算单元1707执行。Among them, the operations of each layer in the recurrent neural network can be performed by the operation circuit 1703 or the vector calculation unit 1707.
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述神经网络加速器执行的动作的程序执行的集成电路。The processor mentioned in any of the above places may be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for program execution that control the actions performed by the neural network accelerator.
另外需说明的是，以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外，本申请提供的装置实施例附图中，电路之间的连接关系表示它们之间具有通信连接，具体可以实现为一条或多条通信总线或信号线。In addition, it should be noted that the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided in this application, the connection relationship between circuits indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
通过以上的实施方式的描述，所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现，当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下，凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现，而且，用来实现同一功能的具体硬件结构也可以是多种多样的，例如模拟电路、数字电路或专用电路等。但是，对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在可读取的存储介质中，如计算机的软盘、U盘、移动硬盘、只读存储器（read only memory，ROM）、随机存取存储器（random access memory，RAM）、磁碟或者光盘等，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行本申请各个实施例所述的方法。Through the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, etc. In general, any function performed by a computer program can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for this application, a software implementation is the better implementation in most cases. Based on this understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of this application.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transferred from a website, computer, server, or data center Transmission to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or data center integrated with one or more available media. The available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (SSD)), etc.
The terms "first", "second", "third", "fourth", and the like (if present) in the specification, claims, and drawings of this application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Finally, it should be noted that the above are merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application.

Claims (22)

  1. A neural network accelerator, characterized by comprising: a first operation circuit and a second operation circuit, the second operation circuit comprising at least one of the following circuits: an activation operation circuit, a quantization operation circuit, or a pooling operation circuit;
    the first operation circuit is configured to perform a convolution operation on the input of the first operation circuit;
    the output end of the first operation circuit is directly connected to the input end of the second operation circuit, and after performing the convolution operation, the first operation circuit directly inputs its output through the output end of the first operation circuit into the input interface of the second operation circuit;
    when the second operation circuit comprises the activation operation circuit, the activation operation circuit is configured to perform an activation operation on the input of the activation operation circuit;
    when the second operation circuit comprises the quantization operation circuit, the second operation circuit is configured to perform a quantization operation on the input of the second operation circuit;
    when the second operation circuit comprises the pooling operation circuit, the pooling operation circuit is configured to perform a pooling operation on the input of the pooling operation circuit;
    wherein, when the input end of a current operation circuit is connected to the output end of a previous operation circuit, the previous operation circuit inputs its operation result into the current operation circuit after performing the corresponding operation, the current operation circuit being any one of the circuits in the second operation circuit, and the previous operation circuit being the first operation circuit or an operation circuit in the second operation circuit.
  2. The accelerator according to claim 1, characterized in that:
    the first operation circuit is specifically configured to traverse a feature map with a convolution kernel, so as to perform a convolution operation on the elements of the convolution kernel and the elements of the feature map within the traversed area to obtain a plurality of convolution results;
    the pooling operation circuit is specifically configured to obtain the plurality of convolution results in the order in which the pooling operation circuit performs the pooling operation on the plurality of convolution results.
  3. The accelerator according to claim 1 or 2, characterized in that the first operation circuit is further configured to:
    perform an addition operation, a subtraction operation, a multiplication operation, a division operation, a maximum operation, or a minimum operation on the elements at corresponding positions of two tensors input to the first operation circuit.
  4. The accelerator according to any one of claims 1 to 3, characterized in that the input of the activation operation circuit is specifically obtained from the first operation circuit, the input of the quantization operation circuit is specifically obtained from the activation operation circuit, and the input of the pooling operation circuit is specifically obtained from the quantization operation circuit.
  5. The accelerator according to any one of claims 1 to 4, characterized in that in the pooling operation the size of the pooling window is w×h and the pooling step size is stride, where w, h, and stride have the same value and w is a positive integer greater than 1.
  6. The accelerator according to any one of claims 1 to 5, characterized in that the accelerator is applied in a convolutional neural network (CNN).
  7. The accelerator according to any one of claims 1 to 5, characterized in that the accelerator is applied in a recurrent neural network (RNN).
  8. The accelerator according to any one of claims 1 to 7, characterized in that the accelerator is deployed on a wearable device.
  9. The accelerator according to any one of claims 1 to 8, characterized in that the activation operation is implemented by a sigmoid function, a tanh function, a prelu function, a leaky function, or a relu function.
  10. The accelerator according to any one of claims 1 to 9, characterized in that the pooling operation comprises a maximum pooling operation or an average pooling operation.
  11. The accelerator according to any one of claims 1 to 10, characterized in that the convolution operation comprises a depthwise separable convolution (depthwise) operation, a matrix-vector multiplication convolution (GEMV) operation, or a matrix-matrix multiplication convolution (GEMM) operation.
  12. An acceleration method, characterized by comprising:
    performing a convolution operation on the input of a first operation circuit, wherein the output interface of the first operation circuit is directly connected to the input interface of a second operation circuit, and after performing the convolution operation, the first operation circuit directly inputs its output through the output end of the first operation circuit into the input end of the second operation circuit;
    when the second operation circuit comprises an activation operation circuit, performing an activation operation on the input of the activation operation circuit;
    when the second operation circuit comprises a quantization operation circuit, performing a quantization operation on the input of the quantization operation circuit;
    when the second operation circuit comprises a pooling operation circuit, performing a pooling operation on the input of the pooling operation circuit;
    wherein the first operation circuit inputs the output result into the second operation circuit after performing the convolution operation; when the input end of a current operation circuit is connected to the output end of a previous operation circuit, the previous operation circuit inputs its operation result into the current operation circuit after performing the corresponding operation, the current operation circuit being any one of the circuits in the second operation circuit, and the previous operation circuit being the first operation circuit or an operation circuit in the second operation circuit.
  13. The method according to claim 12, characterized in that performing a convolution operation on the input of the first operation circuit comprises:
    traversing a feature map with a convolution kernel, so as to perform a convolution operation on the elements of the convolution kernel and the elements of the feature map within the traversed area to obtain a plurality of convolution results;
    and performing a pooling operation on the input of the pooling operation circuit comprises:
    obtaining the plurality of convolution results and performing the pooling operation in the order in which the pooling operation circuit performs the pooling operation on the plurality of convolution results.
  14. The method according to claim 12 or 13, characterized in that the method further comprises:
    performing an addition operation, a subtraction operation, a multiplication operation, a division operation, a maximum operation, or a minimum operation on the elements at corresponding positions of two tensors input to the first operation circuit.
  15. The method according to any one of claims 12 to 14, characterized in that the input of the activation operation circuit is specifically obtained from the first operation circuit, the input of the quantization operation circuit is specifically obtained from the activation operation circuit, and the input of the pooling operation circuit is specifically obtained from the quantization operation circuit.
  16. The method according to any one of claims 12 to 15, characterized in that in the pooling operation the size of the pooling window is w×h and the pooling step size is stride, where w, h, and stride have the same value and w is a positive integer greater than 1.
  17. The method according to any one of claims 12 to 16, characterized in that the activation operation is implemented by any one of a sigmoid function, a tanh function, a prelu function, a leaky function, and a relu function.
  18. The method according to any one of claims 12 to 17, characterized in that the pooling operation comprises a maximum pooling operation or an average pooling operation.
  19. The method according to any one of claims 12 to 18, characterized in that the convolution operation comprises a depthwise separable convolution (depthwise) operation, a matrix-vector multiplication convolution (GEMV) operation, or a matrix-matrix multiplication convolution (GEMM) operation.
  20. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 12 to 19.
  21. A wearable device, characterized in that the neural network accelerator according to any one of claims 1 to 11 is deployed on the wearable device.
  22. The wearable device according to claim 21, characterized in that the wearable device comprises at least one of glasses, a television, a vehicle-mounted device, a watch, or a bracelet.
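For illustration, the fused convolution → activation → quantization → pooling data path recited in claims 1, 5, and 12 can be sketched as a small software model. This is a sketch only: the claims describe dedicated hardware circuits with directly connected ports, and the choice of ReLU as the activation, the int8 quantization scale, and the kernel values below are illustrative assumptions, not part of the claims.

```python
import numpy as np

def conv2d(fmap, kernel, stride=1):
    """Slide the kernel over the feature map (valid padding), producing one
    convolution result per traversal position (the 'first operation circuit')."""
    kh, kw = kernel.shape
    oh = (fmap.shape[0] - kh) // stride + 1
    ow = (fmap.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            patch = fmap[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    """Activation operation circuit (ReLU chosen for illustration)."""
    return np.maximum(x, 0.0)

def quantize(x, scale=0.5):
    """Quantization operation circuit: float -> int8 with an assumed scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def max_pool(x, w):
    """Pooling with a w x w window and stride equal to w
    (the non-overlapping case of claims 5 and 16)."""
    oh, ow = x.shape[0] // w, x.shape[1] // w
    out = np.empty((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * w:(i + 1) * w, j * w:(j + 1) * w].max()
    return out

def fused_pipeline(fmap, kernel, pool_w=2):
    """Each stage consumes the previous stage's output directly, mirroring the
    direct output-to-input connection between the operation circuits."""
    return max_pool(quantize(relu(conv2d(fmap, kernel))), pool_w)
```

In this model no intermediate result is written back to an external buffer between stages, which is the behavioral analogue of the direct connection between the first operation circuit and the circuits of the second operation circuit.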
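Claims 11 and 19 recite that the convolution operation may be carried out as a matrix-vector (GEMV) or matrix-matrix (GEMM) multiplication. A standard way to recast a convolution as a matrix multiply is im2col lowering; the sketch below shows that equivalence under assumptions of my own (the claims do not specify a lowering layout, and `im2col` and `conv_as_gemv` are hypothetical names):

```python
import numpy as np

def im2col(fmap, kh, kw):
    """Lower each kh x kw patch of the feature map into one row, so that the
    whole convolution becomes a single matrix-vector product."""
    oh = fmap.shape[0] - kh + 1
    ow = fmap.shape[1] - kw + 1
    rows = [fmap[i:i + kh, j:j + kw].ravel()
            for i in range(oh) for j in range(ow)]
    return np.stack(rows), (oh, ow)

def conv_as_gemv(fmap, kernel):
    """Convolution expressed as one GEMV: lowered patches times the
    flattened kernel, then reshaped back to the output grid."""
    cols, out_shape = im2col(fmap, *kernel.shape)
    return (cols @ kernel.ravel()).reshape(out_shape)
```

With multiple kernels, the flattened kernels stack into a matrix and the same lowering yields a GEMM instead of a GEMV.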
PCT/CN2023/085884 2022-05-31 2023-04-03 Neural network accelerator, and acceleration method and apparatus WO2023231559A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210612341.9 2022-05-31
CN202210612341.9A CN117217269A (en) 2022-05-31 2022-05-31 Neural network accelerator, acceleration method and device

Publications (1)

Publication Number Publication Date
WO2023231559A1

Family

ID=89026861

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085884 WO2023231559A1 (en) 2022-05-31 2023-04-03 Neural network accelerator, and acceleration method and apparatus

Country Status (2)

Country Link
CN (1) CN117217269A (en)
WO (1) WO2023231559A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137417A1 (en) * 2016-11-17 2018-05-17 Irida Labs S.A. Parsimonious inference on convolutional neural networks
CN113298237A (en) * 2021-06-23 2021-08-24 东南大学 Convolutional neural network on-chip training accelerator based on FPGA
CN113592066A (en) * 2021-07-08 2021-11-02 深圳市易成自动驾驶技术有限公司 Hardware acceleration method, apparatus, device, computer program product and storage medium
WO2022067508A1 (en) * 2020-09-29 2022-04-07 华为技术有限公司 Neural network accelerator, and acceleration method and device


Also Published As

Publication number Publication date
CN117217269A (en) 2023-12-12

Similar Documents

Publication Publication Date Title
WO2021190127A1 (en) Data processing method and data processing device
WO2020073211A1 (en) Operation accelerator, processing method, and related device
Snider et al. From synapses to circuitry: Using memristive memory to explore the electronic brain
WO2022022274A1 (en) Model training method and apparatus
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
EP3627397A1 (en) Processing method and apparatus
WO2022067508A1 (en) Neural network accelerator, and acceleration method and device
KR20200091623A (en) Method and device for performing convolution operation on neural network based on Winograd transform
WO2022001805A1 (en) Neural network distillation method and device
JP2020513637A (en) System and method for data management
US11423288B2 (en) Neuromorphic synthesizer
WO2022052601A1 (en) Neural network model training method, and image processing method and device
CN111465943B (en) Integrated circuit and method for neural network processing
WO2022001372A1 (en) Neural network training method and apparatus, and image processing method and apparatus
CN107944545B (en) Computing method and computing device applied to neural network
WO2022111617A1 (en) Model training method and apparatus
CN112613581A (en) Image recognition method, system, computer equipment and storage medium
CN111797982A (en) Image processing system based on convolution neural network
CN108171328B (en) Neural network processor and convolution operation method executed by same
WO2023231794A1 (en) Neural network parameter quantification method and apparatus
CN108681773B (en) Data operation acceleration method, device, terminal and readable storage medium
US11763131B1 (en) Systems and methods for reducing power consumption of convolution operations for artificial neural networks
WO2023010244A1 (en) Neural network accelerator, and data processing method for neural network accelerator
WO2022179588A1 (en) Data coding method and related device
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23814755

Country of ref document: EP

Kind code of ref document: A1