WO2020062299A1 - Neural network processor, data processing method and related device - Google Patents

Neural network processor, data processing method and related device

Info

Publication number
WO2020062299A1
Authority
WO
WIPO (PCT)
Prior art keywords
cores
matrix
input
core
batch
Prior art date
Application number
PCT/CN2018/109208
Other languages
English (en)
French (fr)
Inventor
顾雄礼
李艳华
张惠敏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201880098253.3A priority Critical patent/CN112789627B/zh
Priority to PCT/CN2018/109208 priority patent/WO2020062299A1/zh
Publication of WO2020062299A1 publication Critical patent/WO2020062299A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data computing technology in the field of artificial intelligence, and in particular, to a neural network processor, a data processing method, and related equipment.
  • CNN Convolutional neural network
  • BN Batch normalization
  • Batch normalization (BN), also called a gradient normalization processing method, adds a normalization layer after certain convolutional layers, or after every convolutional layer, in the CNN; the normalized result then enters the next layer of the network as that layer's input. That is, the BN algorithm standardizes the mean and variance of the input matrices of these layers, thereby: (1) improving the convergence speed of network model training, since data changes in one layer no longer accumulate and force later layers to adapt to a different data distribution; (2) improving the generalization ability of the network when the training and test data distributions differ; and (3) mitigating the vanishing or exploding gradient problems caused by the weights.
  • However, this BN processing requires normalization (including the calculation of the mean and variance) before the input of some or all layers in the CNN, and the deeper the CNN network model, the more normalization operations are performed. If each normalization takes a long time, the overall training speed of the CNN may suffer.
  • Embodiments of the present invention provide a neural network processor, a data processing method, and related equipment to improve the training speed of a neural network.
  • In a first aspect, an embodiment of the present invention provides a neural network processor, which may include: n computing cores (Cores), an atomic operation accumulation unit, and an on-chip shared cache, where the n Cores and the on-chip shared cache are each coupled to the atomic operation accumulation unit, and n is an integer greater than 1. Each of the n Cores is configured to: calculate the intra-core mean μ of its input matrix and write μ to the atomic operation accumulation unit, where the input matrix includes m training samples and the intra-core mean μ is the average of the m training samples x, m being an integer greater than or equal to 1; and calculate the mean v of the m values x² and write v to the atomic operation accumulation unit. The atomic operation accumulation unit is configured to accumulate the n values of μ written by the n Cores to obtain S1 and write S1 to the on-chip shared cache, and to accumulate the n values of v written by the n Cores to obtain S2 and write S2 to the on-chip shared cache. Each of the n Cores is further configured to obtain S1 and S2 from the on-chip shared cache and calculate the global variance of the n input matrices of the n Cores according to S1 and S2.
  • In the embodiment of the present invention, when computing the global variance of the batch during Batch Normalization, the neural network processor no longer has to first compute the global mean (each computing core computing its intra-core mean, followed by a global summation and averaging) and only then compute the global variance (each computing core computing its intra-core variance, followed by another global summation and averaging). Instead, each computing core computes the intra-core mean μ of its training samples x and the intra-core mean v of x², and sends both to the atomic operation accumulation unit, which accumulates them and stores the results in the on-chip shared cache. The n computing cores then fetch the accumulated results S1 and S2 from the on-chip shared cache in a single pass and compute the global mean and global variance from them.
  • This differs from the prior art, in which each computing core in the neural network processor must first obtain the global mean (the sum of the intra-core means of all computing cores, divided by their number) before it can compute the global variance. In that process, all computing cores synchronize the global mean from the atomic operation accumulation unit (or from one of the n computing cores), which takes a certain amount of time, and the more computing cores there are, the longer the synchronization may take; this delays each core's computation of its intra-core variance and, in turn, the accumulation of the global variance. In the embodiment of the present invention, the n computing cores need only one synchronization during Batch Normalization (the elements needed to compute the global mean and the global variance are obtained in a single pass), removing one synchronization round, greatly reducing synchronization overhead and time, and increasing the training speed of the entire neural network.
  • In a possible implementation, each of the n Cores is further configured to obtain S1 and S2 from the on-chip shared cache and calculate the global variance of the n input matrices of the n Cores according to the formula δ² = S2/n - (S1/n)².
  • In the embodiment of the present invention, after obtaining S1 and S2 from the on-chip shared cache, each of the n Cores calculates the global variance of the n input matrices of the n Cores from S1 and S2 according to the formula for δ², and the global mean S1/n is obtained at the same time.
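  • The relationship used here can be written out explicitly. The following is a short restatement of the definitions above (not a reproduction of the published formula images), assuming each of the n Cores holds m training samples:

```latex
% Per-core statistics written to the atomic operation accumulation unit
\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_{j,i}, \qquad
v_j   = \frac{1}{m}\sum_{i=1}^{m} x_{j,i}^{2}, \qquad j = 1,\dots,n
\\[4pt]
% Accumulated once on chip
S_1 = \sum_{j=1}^{n} \mu_j, \qquad S_2 = \sum_{j=1}^{n} v_j
\\[4pt]
% Global mean and global variance over all n \cdot m samples
\bar{\mu} = \frac{S_1}{n}, \qquad
\delta^{2} = \frac{1}{nm}\sum_{j=1}^{n}\sum_{i=1}^{m} x_{j,i}^{2} - \bar{\mu}^{2}
           = \frac{S_2}{n} - \left(\frac{S_1}{n}\right)^{2}
```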
  • In a possible implementation, each of the n Cores is further configured to normalize each of the m training samples x in its input matrix according to the formula x̂_i = (x_i - μ̄) / sqrt(δ² + ε), where μ̄ is the global mean of the n input matrices of the n Cores, δ² is the global variance of the n input matrices of the n Cores, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • each of the n Cores may perform normalization processing of Batch Normalization according to the global mean and global variance calculated by itself.
  • In a possible implementation, each of the n Cores is further configured to perform scaling and translation on the m normalized training samples according to the formula y_i = α·x̂_i + β, where x̂_i is the normalized result of the i-th training sample x_i among the m training samples, y_i is the output obtained after batch normalization (BN) of x_i, α is the scaling parameter, and β is the translation parameter.
  • each of the n Cores can perform batch normalization translation and scaling processing according to the global mean and global variance calculated by itself.
  • each of the n Cores is used to obtain a feature map matrix and a weight matrix, and calculate the input matrix according to the feature map matrix and the weight matrix.
  • In the embodiment of the present invention, the input matrix of the Batch Normalization layer is obtained by operating on the feature map matrix and the weight matrix produced by the layer preceding the Batch Normalization layer, and after the Batch Normalization processing the result serves as the input of the next layer.
  • In a second aspect, an embodiment of the present invention provides a data processing method, which may include performing the following processing for each of n input matrices: calculating the intra-core mean μ of the input matrix and writing μ to an atomic operation accumulation unit, where the input matrix includes m training samples and the intra-core mean μ is the average of the m training samples x, m being an integer greater than or equal to 1; and calculating the mean v of the m values x² from the input matrix and writing v to the atomic operation accumulation unit. The n values of μ and the n values of v calculated from the n input matrices are then processed as follows: the n values of μ are accumulated to obtain S1, and the n values of v are accumulated to obtain S2. Finally, the global variance of the n input matrices is calculated according to S1 and S2.
  • In a possible implementation, calculating the global variance of the n input matrices according to S1 and S2 includes calculating it according to the formula δ² = S2/n - (S1/n)².
  • In a possible implementation, the method further includes normalizing each of the m training samples x in the input matrix according to the formula x̂_i = (x_i - μ̄) / sqrt(δ² + ε), where μ̄ is the global mean of the n input matrices, δ² is the global variance of the n input matrices, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • In a possible implementation, the method further includes performing scaling and translation on the m normalized training samples according to the formula y_i = α·x̂_i + β, where x̂_i is the normalized result of the i-th training sample x_i among the m training samples, y_i is the output obtained after batch normalization (BN) of x_i, α is the scaling parameter, and β is the translation parameter.
  • In a possible implementation, the method further includes obtaining a feature map matrix and a weight matrix, and calculating any one of the n input matrices from the feature map matrix and the weight matrix.
  • In a third aspect, the present application provides a computing accelerator, which is a computing core (Core) in the neural network processor according to any implementation of the first aspect, and which is configured to perform the functions performed by any one of the n computing cores in that neural network processor.
  • In a fourth aspect, the present application provides a computer storage medium that stores a computer program which, when executed by a processor, implements the data processing method flow according to any implementation of the second aspect.
  • In a fifth aspect, an embodiment of the present invention provides a computer program comprising instructions which, when the computer program is executed by a computer, enable the computer to execute the data processing method flow according to any implementation of the second aspect.
  • In a sixth aspect, the present application provides a chip system including a processor configured to implement the functions involved in the data processing method flow according to any implementation of the second aspect. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the data processing. The chip system may consist of a chip, or may include a chip and other discrete devices.
  • FIG. 1 is a schematic diagram of a convolutional neural network according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of another convolutional neural network according to an embodiment of the present invention.
  • FIG. 3 is a batch normalization forward flowchart provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a hardware structure of a neural network processor according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a refined hardware structure of an NPU according to an embodiment of the present invention.
  • FIG. 6 is another batch normalization forward flowchart provided by an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of a data processing method according to an embodiment of the present invention.
  • an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and / or a computer.
  • an application running on a computing device and a computing device can be components.
  • One or more components can reside within a process and / or thread of execution, and a component can be localized on one computer and / or distributed between 2 or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • A component may, for example, communicate via local and / or remote processes based on a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, a distributed system, and / or across a network such as the Internet that interacts with other systems through signals).
  • AI Artificial Intelligence
  • AI is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic theories of AI.
  • Convolutional neural network is a multi-layer neural network. Each layer is composed of multiple two-dimensional planes, and each plane is composed of multiple independent neurons. The neurons share weights, and the number of parameters in the neural network can be reduced by weight sharing.
  • a processor performing a convolution operation usually converts a convolution of an input signal feature and a weight into a matrix multiplication operation between a signal matrix and a weight matrix.
  • The signal matrix and the weight matrix are divided into blocks to obtain multiple fractal signal matrices and fractal weight matrices, and matrix multiplication and accumulation are then performed on the multiple fractal signal matrices and fractal weight matrices.
  • the convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the convolutional neural network can use the backpropagation (BP) algorithm to modify the parameters of the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model is getting smaller and smaller.
  • an error loss is caused by transmitting the input signal forward until the output, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, thereby converging the error loss.
  • the back-propagation algorithm is the back-propagation motion dominated by the error loss, and aims to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
  • Convolution is the operation of the convolution kernel and the image matrix (the input matrix of the convolution layer).
  • the input matrix is a matrix extracted from the image matrix according to the stride of the convolution kernel during the convolution.
  • the convolution kernel is a small window that records the weights.
  • The convolution kernel slides over the image matrix in steps. At each position the convolution kernel corresponds to a sub-matrix of the image matrix; the weights in the convolution kernel are multiplied by the values contained in that sub-matrix and then summed, giving the element of the output feature map (output matrix) to which the convolution kernel currently corresponds.
  • The distance the convolution kernel moves each time along the height of the image matrix is the step of the kernel's height-wise sliding, and the distance it moves each time along the width of the image matrix is the step of its width-wise sliding. The sliding step of the convolution kernel is represented by the parameter stride.
  • the input matrix is extracted from the image matrix (that is, the input data) according to the stride of the convolution kernel during the convolution.
  • stride [s1, s2]
  • s1 represents the step of the convolution kernel height sliding
  • s2 represents the step of the convolution kernel width sliding.
  • Convolution operation is the most important operator in convolutional neural networks.
  • X represents the input feature map (the input matrix of the convolutional layer);
  • X' represents the matrix obtained by applying the im2col operation to X;
  • W represents the weight matrix;
  • b represents the offset;
  • Y0 represents the result of the matrix product of X' and W;
  • Y represents the output feature map (the output matrix of the convolutional layer): the activation value of each element of Y0 plus the offset b is computed to obtain the final result Y.
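  • As an illustration of the X, X', W, b, Y0 and Y described above, the following is a minimal NumPy sketch of a single-channel convolution expressed as im2col followed by a matrix product; the function names and the choice of ReLU as the activation are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold the input feature map x (H, W) into rows of kh*kw patches (X')."""
    H, W = x.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for r in range(out_h):
        for c in range(out_w):
            patch = x[r*stride:r*stride+kh, c*stride:c*stride+kw]
            cols[r * out_w + c] = patch.ravel()
    return cols, out_h, out_w

def conv2d(x, w, b, stride=1):
    """Convolution as a matrix multiplication: Y = activation(X' @ W + b)."""
    kh, kw = w.shape
    x_cols, out_h, out_w = im2col(x, kh, kw, stride)   # X'
    y0 = x_cols @ w.ravel() + b                        # Y0 = X'W, plus offset b
    y = np.maximum(y0, 0.0)                            # ReLU used as an example activation
    return y.reshape(out_h, out_w)                     # output feature map Y

x = np.arange(16, dtype=float).reshape(4, 4)   # input feature map X
w = np.ones((3, 3)) / 9.0                      # weight matrix W (averaging kernel)
print(conv2d(x, w, b=0.0))                     # Y has shape (2, 2) for stride 1
```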
  • CNN is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to using machine learning algorithms to perform learning at multiple levels of abstraction. As a deep learning architecture, CNN is a feed-forward artificial neural network in which each neuron responds to overlapping regions of the image fed into it.
  • FIG. 1 is a schematic diagram of a convolutional neural network provided by an embodiment of the present invention.
  • A convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer / pooling layer 120, and a neural network layer 130, where the pooling layer is optional.
  • The convolutional layer / pooling layer 120 may include, for example, layers 121-126. In one example, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer. In another example, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 121 may include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is typically moved across the input image one pixel at a time along the horizontal direction (or two pixels at a time, and so on, depending on the value of the stride), so as to extract a specific feature from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix and the depth dimension of the input image are the same.
  • The weight matrix extends over the entire depth of the input image. Convolution with a single weight matrix therefore produces a convolution output of a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image.
  • The multiple weight matrices have the same dimensions, so the feature maps they extract also have the same dimensions; the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a large amount of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • The initial convolutional layer (such as 121) often extracts more general features, which can also be referred to as low-level features; as the depth of the convolutional neural network 100 increases, the features extracted by later convolutional layers (such as 126) become more and more complex, for example features with high-level semantics.
  • In the layers 121-126 of the convolutional layer / pooling layer 120, a single convolutional layer may be followed by a single pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and / or a maximum pooling operator for sampling the input image to obtain a smaller-sized image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to produce an average value.
  • the maximum pooling operator can take the pixel with the largest value in the range in a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image processed by the pooling layer may be smaller than the size of the image of the input pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding subregion of the image of the input pooling layer.
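  • A minimal sketch of the average and maximum pooling operators described above; the window size and stride chosen here are illustrative, not values from the patent:

```python
import numpy as np

def pool2d(x, k=2, stride=2, mode="max"):
    """Reduce spatial size: each output pixel is the max or mean of a k x k subregion."""
    H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    y = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = x[r*stride:r*stride+k, c*stride:c*stride+k]
            y[r, c] = window.max() if mode == "max" else window.mean()
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))    # 2x2 output: maximum of each 2x2 subregion
print(pool2d(x, mode="mean"))   # 2x2 output: average of each 2x2 subregion
```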
  • After processing by the convolutional layer / pooling layer 120, the convolutional neural network 100 is still not able to output the required output information, because, as described above, the convolutional layer / pooling layer 120 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 100 needs the neural network layer 130 to generate the output of one required class or a set of required classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 1) and an output layer 140. The parameters contained in the multiple hidden layers may be obtained by pre-training on relevant training data for a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130 comes the last layer of the entire convolutional neural network 100, the output layer 140, which has a loss function similar to the categorical cross entropy and is specifically used to calculate the prediction error.
  • Once the forward propagation of the entire convolutional neural network 100 (from 110 to 140 in FIG. 1) is completed, back propagation (from 140 to 110 in FIG. 1) begins to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • FIG. 1 is only used as an example of a convolutional neural network.
  • In a specific application, the convolutional neural network may also exist in the form of other network models, for example as shown in FIG. 2, which is a schematic diagram of another convolutional neural network according to an embodiment of the present invention.
  • In FIG. 2, a plurality of convolutional layers / pooling layers are in parallel, and the features they extract are all input to the neural network layer 130 for processing.
  • The normalization layer in this application can, in principle, be placed after any layer of the CNN described above, or before any layer, taking the output feature matrix of the previous layer as its input, and its output can likewise serve as the input of any functional layer in the CNN.
  • the normalization layer is generally performed after the convolution layer, and the feature matrix output by the previous convolution layer is used as the input matrix.
  • A batch is a part (or all) of the entire training set of the neural network. The batch is further split into multiple mini-batches, each mini-batch being a subset of the training set corresponding to the batch. Batch Normalization calculates the mean and variance of the subset corresponding to each mini-batch, and normalizes each subset based on the global mean and global variance over the subsets of all mini-batches, so that the normalized subset corresponding to each mini-batch is obtained as the output.
  • the training set corresponding to the batch has Z training samples.
  • the batch is divided into n mini-batches. Each mini-batch is a subset of the training set corresponding to the batch. Each mini-batch has m training samples.
  • Use B to represent any one of the mini-batches.
  • the following shows the forward calculation method of Batch Normalization:
  • Input: mini-batch B = {x_1, ..., x_m}, where B contains the m training samples x_1, ..., x_m. It should be noted that this input is the input of the normalization layer, which can also be understood as the output of the layer preceding the normalization layer.
  • Mini-batch mean: μ_B = (1/m) Σ_{i=1..m} x_i. Here μ_B represents the mean of mini-batch B input to the normalization layer: the m training samples x_1, ..., x_m in mini-batch B are summed and divided by m to obtain μ_B; x_i is the i-th training sample among the m samples of the mini-batch.
  • Mini-batch variance: σ_B² = (1/m) Σ_{i=1..m} (x_i - μ_B)².
  • Normalization (formula (3)): x̂_i = (x_i - μ_B) / sqrt(σ_B² + ε), where ε is a value greater than 0.
  • Scaling and translation: y_i = α·x̂_i + β, where y_i is the output obtained after the i-th training sample x_i in mini-batch B is processed by BN.
  • The scaling parameter α and the translation parameter β are introduced to solve the problem that normalization reduces the network's expressive power (because the normalized values are basically restricted to a standard normal distribution).
  • α and β are learned by the network itself during training, which facilitates the adaptive learning of the CNN.
  • It should be noted that the μ_B used for normalization in formula (3) above is the global mean over the n mini-batches (that is, the intra-mini-batch means of the n mini-batches are accumulated and then averaged), and the σ_B² used is the global variance over the n mini-batches (that is, the intra-mini-batch variances of the n mini-batches are accumulated and then averaged).
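  • For reference, the following is a NumPy sketch of the Batch Normalization forward pass just described, using a single global mean and global variance over the n mini-batches; the parameter names alpha, beta and eps mirror the scaling parameter α, the translation parameter β and the small constant ε, and the sample data are illustrative:

```python
import numpy as np

def bn_forward(mini_batches, alpha=1.0, beta=0.0, eps=1e-5):
    """Batch Normalization over n mini-batches with one global mean / variance."""
    all_x = np.concatenate(mini_batches)          # all n*m training samples
    mu_global = all_x.mean()                      # global mean over the n mini-batches
    var_global = all_x.var()                      # global variance over the n mini-batches
    outputs = []
    for x in mini_batches:                        # per mini-batch B = {x_1, ..., x_m}
        x_hat = (x - mu_global) / np.sqrt(var_global + eps)   # normalization
        outputs.append(alpha * x_hat + beta)      # scaling and translation: y = alpha*x_hat + beta
    return outputs

batches = [np.random.randn(8) * 3 + 1 for _ in range(4)]   # n = 4 mini-batches, m = 8 samples each
print(bn_forward(batches)[0][:3])
```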
  • FIG. 3 is a forward flowchart of Batch Normalization provided by an embodiment of the present invention. The specific calculation process is as follows:
  • X_j is a matrix, and x_1, ..., x_m can be regarded as the m column vectors of that matrix; that is, the i-th column vector of the matrix X_j is the training sample x_i in the corresponding mini-batch.
  • Core j calculates the core average on the mini-batch corresponding to Core j
  • the value of j is 1, ... n;
  • the global average of the batch is calculated by a target core among the n cores, that is, other cores (cores other than the target core) in the n cores need to send the calculated average ⁇ B to the target core.
  • The target Core finally calculates the global mean μ_B′, and then each Core either reads the global mean μ_B′ from the target Core or the target Core broadcasts it to each Core. It can be understood that the target Core can also store the global mean μ_B′ in the relevant calculation unit, to facilitate the subsequent calculation of the global output result.
  • Core j then calculates the intra-core variance of its mini-batch, σ_Bj² = (1/m) Σ_{i=1..m} (x_i - μ_B′)², where the value of j is 1, 2, ..., n.
  • The above implementation requires two synchronizations during the batch normalization process: in the first synchronization, the global mean μ_B′ is synchronized across the n mini-batches; in the second synchronization, the global variance of the n mini-batches is synchronized.
  • Synchronizing twice is expensive: if there is a Batch Normalization layer after each network layer, the synchronization overhead will greatly affect the training speed of the entire network.
  • The problem that this application mainly solves is how to reduce the time spent on these two synchronizations in the Batch Normalization process, so as to reduce the synchronization overhead of Batch Normalization and improve the training speed of the entire network.
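  • The observation that removes one synchronization can be checked numerically: the global variance obtained from per-core variances around a previously synchronized global mean equals the variance obtained from per-core means of x and of x², which need only one round of accumulation. A sketch assuming equal-sized mini-batches (the sample data are illustrative):

```python
import numpy as np

n, m = 4, 8
cores = [np.random.randn(m) * 2 + 3 for _ in range(n)]   # one mini-batch per computing core
all_x = np.concatenate(cores)

# Baseline (two synchronizations): first sync the global mean, then sync per-core variances
mu_global = np.mean([x.mean() for x in cores])                            # sync 1: global mean
var_two_sync = np.mean([np.mean((x - mu_global) ** 2) for x in cores])    # sync 2: global variance

# Proposed (one synchronization): each core writes mu_j and v_j = mean(x^2) once
S1 = sum(x.mean() for x in cores)          # accumulated intra-core means
S2 = sum((x ** 2).mean() for x in cores)   # accumulated intra-core means of x^2
var_one_sync = S2 / n - (S1 / n) ** 2      # delta^2 = S2/n - (S1/n)^2

assert np.allclose(var_two_sync, var_one_sync)
assert np.allclose(var_one_sync, all_x.var())
```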
  • the foregoing application scenarios are merely exemplary implementations in the embodiments of the present invention, and the application scenarios in the embodiments of the present invention include but are not limited to the above application scenarios.
  • FIG. 4 is a schematic diagram of a hardware structure of a neural network processor according to an embodiment of the present invention.
  • The neural network processor NPU 20 is mounted as a coprocessor on a CPU 30 (such as a Host CPU), which assigns tasks to it.
  • The NPU 20 may include N computing cores 201 (NPU Cores, Core for short), and the N Cores 201 can be connected to and communicate with each other through an on-chip interconnect network 202.
  • the N Cores 201 are coupled to the atomic operation accumulation unit 203, the on-chip shared cache 204, and the external memory 40 through an on-chip interconnect network 202.
  • the atomic operation accumulation unit 203 and the on-chip shared cache 204 may be integrated together;
  • The external memory 40 may be a double data rate synchronous dynamic random access memory (DDR), a high bandwidth memory (HBM), or the like.
  • FIG. 5 is a schematic diagram of an NPU detailed hardware structure provided by an embodiment of the present invention.
  • The core part of each Core 201 is the arithmetic circuit 2013. The direct memory access controller (DMAC) 2017 controls the arithmetic circuit 2013 to fetch matrix data from the memories (including the input memory 2011 and the weight memory 2012) for multiplication; furthermore, the DMAC 2017 controls the operation result of the arithmetic circuit 2013, or the matrix data in the unified memory 2016, to enter the accumulator 2015 and / or the vector calculation unit 2014 for further operations. Among them:
  • the arithmetic circuit 2013 can also be referred to as a matrix operation unit (Cube unit), which is used to complete the operation of matrix * matrix.
  • the computing circuit 2013 may include a plurality of processing units (Process Engines, PEs).
  • In a possible implementation, the arithmetic circuit 2013 is a two-dimensional systolic array. The arithmetic circuit 2013 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2013 is a general-purpose matrix processor.
  • Vector calculation unit 2014 is used to further process the output of the arithmetic circuit 2013, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on. It is mainly used for non-convolution / FC layer network calculations in neural networks, such as pooling, batch normalization, local response normalization, and so on.
  • the vector calculation unit 2014 can store the processed output vector to the unified memory 2016.
  • the vector calculation unit 2014 may apply a non-linear function to the output of the arithmetic circuit 2013, such as a vector of accumulated values, to generate an activation value.
  • In some implementations, the vector calculation unit 2014 generates a normalized value, a merged value, or both.
  • a vector of the processed output can be used as an activation input for the arithmetic circuit 2013, for example for use in subsequent layers in a neural network.
  • the unified memory 2016 can also be called an on-chip memory (On-chip buffer), which is used to store input data and output data.
  • the weight data (weight matrix) is transferred to the weight memory 2012 through the DMAC 2017.
  • the input matrix (input matrix) is also transferred to the unified memory 2016 or the input memory 2011 through the DMAC.
  • the input memory 2011 may also be referred to as a feature map memory (feature map memory) and is used to store a feature map matrix.
  • Weight memory 2012 is used to store the weight matrix.
  • the format of the weight matrix includes four dimensions: convolution kernel height, convolution kernel width, input channel number (convolution kernel depth), and output channel number (convolution kernel number).
  • the weight matrix is the convolution kernel.
  • the weight matrix may be a matrix composed of each convolution kernel used by the convolution layer for convolution.
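  • As a small illustration of the four-dimensional weight format described above (the concrete sizes below are arbitrary examples, not values from the patent):

```python
import numpy as np

# [kernel height, kernel width, input channels (kernel depth), output channels (number of kernels)]
kh, kw, c_in, c_out = 3, 3, 64, 128
weights = np.zeros((kh, kw, c_in, c_out), dtype=np.float16)
print(weights.shape)   # (3, 3, 64, 128): 128 convolution kernels of size 3 x 3 x 64
```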
  • The direct memory access controller (DMAC) 2017 is used to move input data or input matrices from the external memory 40, such as DDR / HBM, to the various memory buffers, or to move output data from the unified memory 2016 to the DDR / HBM.
  • a complete DMA transfer process needs to go through four steps: DMA request, DMA response, DMA transfer, and DMA end.
  • The control unit (flow control) 2018 is the chip's control unit, used to control the processing flow and the data reading mode.
  • The on-chip interconnect network 202 is a communication scheme for a system on chip (SoC) and a main component of multi-core technology; it is used for interactions among the multiple computing cores Core 201, and for interactions between the multiple computing cores Core 201 and the external memory 40 and the internal memories.
  • Atomic operation accumulator unit 203 is used to store and accumulate output results obtained by multiple computing cores Core 201.
  • The on-chip shared buffer 204 is used to buffer the data written by the atomic operation accumulation unit, such as the global mean and global variance, and, under the control of the DMAC 2017, to synchronize the global mean and global variance to the n Cores 201.
  • Optionally, the on-chip shared cache 204 may be integrated with the atomic operation accumulation unit 203.
  • the unified memory 2016, the input memory 2011, and the weight memory 2012 are all On-Chip memories.
  • The external memory, that is, the DDR / HBM, can be private to the hardware architecture of the NPU 20, or it can serve other processors while serving the NPU 20.
  • The main CPU 30 is further configured to run general operating system software and, under the control of that general operating system software, to control the neural network processor 20 to perform neural network training.
  • the neural network processor 20 described above may also be integrated in the main CPU 30 as part of the main CPU 30; it may also be another functional chip coupled to the main CPU 30 and capable of implementing related functions.
  • the functions performed by the main CPU 30 may also be distributed and executed on multiple different function chips, which are not specifically limited in this embodiment of the present invention.
  • In some implementations, the arithmetic circuit 2013 fetches the data corresponding to matrix B from the weight memory 2012 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit 2013 then fetches the matrix A data from the input memory 2011, performs the matrix operation between matrix A and matrix B, and performs addition in the accumulator 2015; partial or final results of the resulting matrix are stored in the unified memory 2016.
  • It should be noted that the above hardware structure is only one exemplary hardware structure of the neural network processor provided by the embodiment of the present invention; the hardware structure of the neural network processor in the embodiment of the present invention includes, but is not limited to, the above structure and connection relationships.
  • The neural network processor 20 includes n computing cores Core 201, an atomic operation accumulation unit 203, and an on-chip shared cache 204, where the n Cores 201 and the on-chip shared cache 204 are each coupled to the atomic operation accumulation unit 203, and n is an integer greater than 1.
  • Each of the n Cores 201 further includes multiple functional modules, such as a vector calculation unit 2014, a DMAC 2017, and the like; refer to the relevant descriptions of FIG. 4 and FIG. 5, which are not repeated here. Among them:
  • Each of the n Cores 201 uses its vector calculation unit 2014 to calculate the intra-core mean μ of its input matrix and, under the control of the DMAC 2017, writes μ to the atomic operation accumulation unit 203; the input matrix includes m training samples, and the intra-core mean μ is the average of the m training samples x, where m is an integer greater than or equal to 1. Each Core 201 also uses the vector calculation unit 2014 to calculate the mean v of the m values x² from the input matrix, and writes v to the atomic operation accumulation unit 203 under the control of the DMAC 2017.
  • The atomic operation accumulation unit 203 accumulates the n values of μ written by the n Cores to obtain S1 and writes S1 to the on-chip shared cache 204, and accumulates the n values of v written by the n Cores to obtain S2 and writes S2 to the on-chip shared cache 204.
  • Each of the n Cores 201 further obtains S1 and S2 from the on-chip shared cache 204, and calculates the global variance of the n input matrices of the n Cores 201 according to the S1 and S2.
  • In the embodiment of the present invention, when computing the global variance of the batch, the neural network processor no longer has to first compute the global mean (each computing core computing its intra-core mean, followed by a global summation and averaging) and only then compute the global variance (each computing core computing its intra-core variance, followed by another global summation and averaging). Instead, each computing core computes the intra-core mean μ of its training samples x and the intra-core mean v of x², sends both to the atomic operation accumulation unit for accumulation, and the accumulated results are stored in the on-chip shared cache; the n computing cores then fetch the accumulated results of the n values of μ and the n values of v from the on-chip shared cache in a single pass and compute the global mean and global variance from them.
  • Further, because an atomic operation accumulation unit is added on the on-chip network (NoC) of the neural network processor, the different computing cores complete the accumulation in the course of writing data to the same address (that is, to the atomic operation accumulation unit). On one hand, this does not occupy the computing resources of the cores, because the accumulation is completed as part of each write after a core finishes its calculation, and spreading the accumulation over these writes saves total time; on the other hand, it avoids sending all the data to one computing core to perform the accumulation, which would leave the other computing cores idle while that core accumulates, and would therefore be inefficient.
  • It should be noted that the order in which the vector calculation unit 2014 in each Core 201 calculates μ and v is not specifically limited; that is, either μ or v may be calculated first, and after each computing core 201 has calculated its own μ or v, it can immediately send it to the atomic operation accumulation unit 203 for accumulation, so that accumulation proceeds concurrently and time is saved.
  • Optionally, μ and v may instead be sent to one designated Core 201 among the n Cores 201 for the accumulation; that is, the processor 20 may omit the above atomic operation accumulation unit 203 and on-chip shared cache 204, and one Core 201 may take over their functions.
  • Each of the n Cores 201 further obtains S1 and S2 from the on-chip shared cache 204 under the control of the DMAC, and uses its vector calculation unit 2014 to calculate the global variance of the n input matrices of the n Cores 201 according to S1 and S2, specifically according to the formula δ² = S2/n - (S1/n)². It can be understood that, by the time each Core reads the cache, the atomic operation accumulation unit 203 has completed the calculation of S1 and S2, so the DMAC 2017 can control the Core in which it is located to obtain S1 and S2 from the on-chip shared cache.
  • Further, each of the n Cores 201 uses its vector calculation unit 2014 to normalize each of the m training samples x in its input matrix according to the formula x̂_i = (x_i - μ̄) / sqrt(δ² + ε), where μ̄ is the global mean of the n input matrices of the n Cores, δ² is the global variance of the n input matrices of the n Cores, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • Further, each of the n Cores 201 uses its vector calculation unit 2014 to perform scaling and translation on the normalized training samples according to the formula y_i = α·x̂_i + β, where α is the scaling parameter and β is the translation parameter.
  • Optionally, each of the n Cores 201 obtains a feature map matrix and a weight matrix from the DDR / HBM 40, reads them into the corresponding input memory 2011 and weight memory 2012 respectively, and uses the arithmetic circuit 2013 and accumulator 2015 to perform the multiply-accumulate operations on the feature map matrix and the weight matrix to obtain the input matrix.
  • the embodiment of the present invention provides the following Batch Normalization processing process in combination with actual application scenarios.
  • the specific calculation process of the neural network processor may be as follows:
  • the training set corresponding to Batch has Z training samples.
  • the batch is divided into n mini-batches.
  • Each mini-batch is a subset of the training set corresponding to Batch.
  • Each mini-batch has m training samples.
  • Input: mini-batch B = {x_1, ..., x_m}, where B contains the m training samples x_1, ..., x_m. This input is the input of the normalization layer, which can also be understood as the output of the layer preceding the normalization layer.
  • It should be noted that the Batch Normalization process described above is the BN processing of one mini-batch in the batch. Viewed globally over the batch, all n mini-batches in the entire batch perform the above BN processing, and the output of the whole batch is obtained by computing the batch's mean and standard deviation from the n mini-batches and normalizing accordingly. Assume that each mini-batch corresponds to one computing core (Core), so that the n mini-batches correspond to n Cores, and the Core corresponding to the j-th mini-batch is Core j, where j takes the values 1, 2, ..., n. Refer to FIG. 6, which is another Batch Normalization forward flowchart provided by an embodiment of the present invention. The specific calculation process is as follows:
  • The DMAC 2017, through the on-chip interconnect network 202, controls the reading of the feature map matrices and the weight coefficient matrix corresponding to the n mini-batches from the HBM / DDR 40 into the input memory 2011 (Feature map) and weight memory 2012 (Weight) corresponding to each Core. For example, the feature map matrix and weight corresponding to the first mini-batch are read into the Feature map and Weight buffers corresponding to Core 1, those corresponding to the second mini-batch are read into Core 2, and so on, until the feature map matrix and weight corresponding to the n-th mini-batch are read into the corresponding Feature map and Weight buffers in Core n. It can be understood that the weight coefficient matrices corresponding to the different mini-batches are the same.
  • The j-th Core, that is, the arithmetic circuit 2013 and the accumulator 2015 on Core j, calculates the product of the feature map matrix and the weight matrix as X_j and temporarily stores it in the unified memory 2016, where the value of j is 1, ..., n.
  • Each of the n Cores 201 calculates the mean within its own computing core. Specifically, the vector calculation unit 2014 on Core j calculates the mean of its corresponding j-th mini-batch B as μ_B = (1/m) Σ_{i=1..m} x_i (that is, the intra-core mean of the input matrix of each Core 201), where x_i is the i-th training sample of the m samples, i takes the values 1, 2, ..., m, and j takes the values 1, 2, ..., n.
  • The DMAC 2017 controls each of the n Cores 201 through the on-chip interconnect network 202 so that each writes the intra-core mean μ_B it has calculated into the atomic operation accumulation unit 203 through the on-chip interconnect network 202. For each write, the atomic operation accumulation unit 203 reads the value at the target address and adds the currently written value to it, finally obtaining S1 = Σ_{j=1..n} u_j, where u_j denotes the intra-core mean μ_B of the mini-batch corresponding to the j-th of the n Cores.
  • The DMAC 2017 controls each of the n Cores 201 through the on-chip interconnect network 202 to calculate the mean v of its m values x². Specifically, the vector calculation unit 2014 of the j-th Core j squares its m training samples and averages them to obtain v_j = (1/m) Σ_{i=1..m} x_i², where i takes the values 1, 2, ..., m and j takes the values 1, ..., n.
  • The DMAC 2017 controls all the Cores 201 through the on-chip interconnect network 202 to write the calculated v to the atomic operation accumulation unit 203. For each write, the atomic operation accumulation unit 203 reads the value at the target address and adds the currently written value to it, finally obtaining, over all mini-batches, S2 = Σ_{j=1..n} v_j, where v_j denotes the mean of the m values x² in the mini-batch corresponding to the j-th of the n Cores.
  • the atomic operation accumulation unit 203 writes the accumulated S1 into the on-chip shared cache 204, and writes the accumulated S2 into the on-chip shared cache 204.
  • DMAC 2017 controls each of the n Cores 201 through the on-chip interconnect network 202 to go to the on-chip shared cache 204 to obtain S1 and S2.
  • The j-th Core j obtains S1 and S2 from the on-chip shared cache 204 through the on-chip interconnect network, and calculates from them the global mean μ_B′ = S1/n and the global variance δ² = S2/n - (S1/n)², where j takes the values 1, ..., n.
  • At this point, each Core 201 has the global mean and global variance, so each Core 201 can apply the normalization formula x̂_i = (x_i - μ_B′) / sqrt(δ² + ε) together with the scaling and translation operation y_i = α·x̂_i + β to compute its respective outputs, pass them through the ReLU activation, and store the results in the HBM / DDR through the on-chip interconnect network 202 for subsequent backward processing.
  • Optionally, when computing the intra-core mean μ and the intra-core mean of x², each computing core may instead write only the sums, that is, the sum of the m training samples x and the sum of the m values x², into the atomic operation accumulation unit; the accumulated values are then m·S1 and m·S2. In that case, when the computing cores later calculate the global mean and global variance, each needs to perform one additional division by m in order to obtain the global mean and global variance.
  • the implementation of other parts is similar to the principle of the foregoing embodiment of the invention, and will not be repeated here.
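  • The following is a software sketch of the flow just described, with the atomic operation accumulation unit modeled as two accumulators that every core writes to; names such as AtomicAccumulator are illustrative, not taken from the hardware, and the sum-writing variant mentioned above is noted in the comments:

```python
import numpy as np

class AtomicAccumulator:
    """Models the atomic operation accumulation unit: read old value, add, write back."""
    def __init__(self):
        self.S1 = 0.0   # accumulated intra-core means mu_j
        self.S2 = 0.0   # accumulated intra-core means of x^2, v_j

    def add(self, mu_j, v_j):
        self.S1 += mu_j
        self.S2 += v_j

n, m = 4, 8
mini_batches = [np.random.randn(m) + 0.5 for _ in range(n)]   # X_j held on core j
acc = AtomicAccumulator()

# Each core computes mu_j and v_j and writes them once; no global mean is needed first
for x_j in mini_batches:
    mu_j = x_j.mean()            # intra-core mean of the m samples
    v_j = (x_j ** 2).mean()      # intra-core mean of x^2
    acc.add(mu_j, v_j)           # accumulation happens as part of the write
    # Variant: write x_j.sum() and (x_j**2).sum() instead, giving m*S1 and m*S2;
    # each core then divides by n*m rather than n when reading the results back.

# Single synchronization: every core reads S1 and S2 from the on-chip shared cache
mu_global = acc.S1 / n
var_global = acc.S2 / n - (acc.S1 / n) ** 2
assert np.allclose(var_global, np.concatenate(mini_batches).var())
```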
  • FIG. 7 is a schematic flowchart of a data processing method according to an embodiment of the present invention, which can be applied to the neural network processor corresponding to FIG. 4 or FIG. 5 described above.
  • the method may include the following steps S701 to S703.
  • the method may further include steps S700, S704, and S705.
  • Step S700 Obtain a feature map matrix and a weight matrix, and calculate any one of the n input matrices according to the feature map matrix and the weight matrix.
  • Step S701: For each of the n input matrices, perform the following processing: calculate the intra-core mean μ of the input matrix and write μ into the atomic operation accumulation unit, where the input matrix includes m training samples and the intra-core mean μ is the average of the m training samples x, m being an integer greater than or equal to 1; calculate the mean v of the m values x² from the input matrix, and write v into the atomic operation accumulation unit.
  • Step S702: The n values of μ and the n values of v calculated from the n input matrices are processed as follows: the n values of μ are accumulated to obtain S1, and the n values of v are accumulated to obtain S2.
  • Step S703: Process S1 and S2 as follows: calculate the global variance of the n input matrices according to S1 and S2.
  • In a possible implementation, calculating the global variance of the n input matrices according to S1 and S2 includes calculating it according to the formula δ² = S2/n - (S1/n)².
  • Step S704: Normalize each of the m training samples x in the input matrix according to the formula x̂_i = (x_i - μ̄) / sqrt(δ² + ε), where μ̄ is the global mean of the n input matrices, δ² is the global variance of the n input matrices, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • Step S705: Perform scaling and translation on the m normalized training samples according to the formula y_i = α·x̂_i + β, where x̂_i is the normalized result of the i-th training sample x_i among the m training samples, y_i is the output obtained after batch normalization (BN) of x_i, α is the scaling parameter, and β is the translation parameter.
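  • Continuing from the statistics of steps S701 to S703, a minimal sketch of how steps S704 and S705 would be applied on one core to its own m training samples; the values of alpha, beta, eps and the example S1, S2 below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def bn_apply(x, mu_global, var_global, alpha=1.0, beta=0.0, eps=1e-5):
    """Steps S704 and S705: normalize, then scale and translate the m samples on one core."""
    x_hat = (x - mu_global) / np.sqrt(var_global + eps)   # S704: normalization
    return alpha * x_hat + beta                           # S705: y_i = alpha * x_hat_i + beta

x = np.random.randn(8) * 2 + 1                 # one core's m = 8 training samples
S1, S2, n = 3.6, 20.4, 4                       # example accumulated statistics from S702
y = bn_apply(x, mu_global=S1 / n, var_global=S2 / n - (S1 / n) ** 2)
print(y)
```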
  • An embodiment of the present invention further provides a computer storage medium, where the computer storage medium may store a program, and when the program is executed, it includes part or all of the steps described in any of the foregoing method embodiments.
  • An embodiment of the present invention also provides a computer program.
  • the computer program includes instructions.
  • the computer program can execute some or all steps of any data processing method.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the above units is only a logical function division.
  • multiple units or components may be combined or integrated.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.
  • Based on this understanding, the part of the technical solution of the present application that in essence contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium, which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, and in particular a processor in a computer device) to perform all or part of the steps of the methods in the embodiments of the present application.
  • The foregoing storage medium may include: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present invention discloses a neural network processor, a method, and related devices. The neural network processor includes: n computing cores (Cores), an atomic operation accumulation unit, and an on-chip shared cache. Each Core is configured to calculate the intra-core mean μ of its input matrix and write μ to the atomic operation accumulation unit, and to calculate the mean v of the m values x² from the input matrix and write v to the atomic operation accumulation unit. The atomic operation accumulation unit is configured to accumulate the n values of μ written by the n Cores to obtain S1 and write S1 to the on-chip shared cache, and to accumulate the n values of v written by the n Cores to obtain S2 and write S2 to the on-chip shared cache. Each Core is further configured to obtain S1 and S2 from the on-chip shared cache and calculate the global variance of the n input matrices of the n Cores according to S1 and S2. The present application can increase the training speed of a neural network.

Description

Neural network processor, data processing method and related device

Technical field

The present application relates to the field of data computing technology within the field of artificial intelligence, and in particular to a neural network processor, a data processing method, and related devices.

Background
Convolutional neural networks (CNN) have grown from the few layers of AlexNet, to the dozen or so layers of VGG and GoogleNet, and even to the hundreds of layers of ResNet: as the number of layers in network models keeps increasing, the results achieved keep improving; however, the deeper the network, the harder it usually is to train.
During the training of a CNN, changes in the parameters of an earlier layer affect the later layers, and this effect is amplified as the depth of the network increases. A traditional neural network only standardizes the training data fed into the input layer (subtracting the mean and dividing by the standard deviation) in order to reduce the variability between training samples. Batch normalization (BN), also called a gradient normalization processing method, instead adds a normalization layer after certain convolutional layers, or after every convolutional layer, in the CNN; after normalization the result enters the next layer of the network as that layer's input. That is, the BN algorithm is introduced to standardize the mean and variance of the input matrices of these layers, thereby solving the following problems:
1. Improving the convergence speed of network model training. When training a deep network, data changes occurring in an earlier layer may accumulate, so that the next layer has to adapt to a different data distribution, which affects the convergence speed. Using BN can reduce the differences between distributions and improve the convergence speed.
2. Improving the generalization ability of the network. When the distributions of the training data and the test data differ greatly, the network generalizes poorly. 3. Eliminating the vanishing gradient or exploding gradient problems caused by the weights.
然而,由于上述BN处理方法,需要在CNN中的某些层或者每一层的输入之前都进行归一化处理(包括均值和方差的计算),且CNN的网络模型层数越深,进行归一化处理次数越多,若每一次归一化处理的时间较长,则可能会影响CNN的整体训练速度。因此,如何在利用BN处理方法保证CNN网络性能的情况下,进一步提升BN处理的速度是亟待解决的问题。
发明内容
本发明实施例提供一种神经网络处理器、数据处理方法及相关设备,以提升神经网络的训练速度。
第一方面,本发明实施例提供了一种神经网络处理器,可包括:n个计算核Core、原子操作累加单元和片上共享缓存,所述n个Core和所述片上共享缓存分别耦合于所述原子操作累加单元,其中,n为大于1的整数;所述n个Core中的每一个Core,用于:根据输入矩阵计算所述输入矩阵的核内均值μ,并将u写入到所述原子操作累加单元,所述输入矩阵包括m个训练样本,所述核内均值μ为所述m个训练样本x的平均值,其中,m为大于或者等于1的整数;根据所述输入矩阵计算m个x 2的均值v,并将v写入到所述原子操作累加单元;所述原子操作累加单元,用于对所述n个Core写入的n个μ进行累加得到 S1,并将所述S1写入所述片上共享缓存;对所述n个Core写入的n个v的进行累加得到S2,并将所述S2写入所述片上共享缓存;所述n个Core中的每一个Core,还用于:从所述片上共享缓存获取S1和S2,并根据所述S1和S2计算所述n个Core的n个输入矩阵的全局方差。
本发明实施例,通过在Batch Normalization过程中,神经网络处理器计算Batch的全局方差时,不再依赖于先计算出全局均值(每个计算核先计算核内均值再全局求和再平均)再计算全局方差(每个计算核先计算核内方差再全局求和再平均),而是每个计算核计算出核内训练样本x的核内均值μ以及x 2的核内均值v,再分别送入原子操作累加单元进行累加并将累加结果存入片上网络缓存中,最终,n个计算核从片上网络缓存中一次性获取n个μ以及n个v的累加结果,并基于该累加结果计算,计算出全局均值和全局方差。即不同于现有技术中神经网络处理器中的每个计算核需要先获取到全局均值(即对所有计算核的核内均值进行求和再求均值)之后,才可以计算全局方差,而在此过程中,所有计算核从原子操作累加单元(也可以是n个计算核中的某个计算核)中同步全局均值,需要一定的时长,且计算核越多同步的时间可能越长,从而影响之后每个计算核计算核内方差,进而影响原子操作累加单元累加全局方差的时间。因此,本发明实施例在计算核Batch Normalization过程中,n个计算核只需要做一次同步(即一次性获取计算全局均值和全局方差的计算元素),减少了一次同步过程,便大大的减少了同步的开销和时间,提升整个神经网路的训练速度。
在一种可能的实现方式中，所述n个Core中的每一个Core，还用于从所述片上共享缓存获取S1和S2，并根据所述S1和S2计算所述n个Core的n个输入矩阵的全局方差，包括：从所述片上共享缓存获取S1和S2，并根据计算公式：
δ² = S2/n − (S1/n)²
计算所述n个Core的n个输入矩阵的全局方差。
本发明实施例，所述n个Core中的每一个Core从片上共享缓存获取S1和S2后，依据S1和S2并依据δ²的计算公式计算出n个Core的n个输入矩阵的全局方差，且可以同步得出全局均值。
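A minimal Python/NumPy sketch of the formula above (the function and variable names, such as global_variance, are illustrative assumptions rather than part of the patent text). It recovers the global variance from the two accumulated values S1 and S2 and checks the result against a direct computation over the whole batch:

import numpy as np

def global_variance(S1, S2, n):
    # S1: accumulated sum of the n per-core means of x
    # S2: accumulated sum of the n per-core means of x squared
    # global variance = E[x^2] - (E[x])^2 = S2/n - (S1/n)^2
    return S2 / n - (S1 / n) ** 2

n, m = 4, 8                              # n cores, m training samples per input matrix
rng = np.random.default_rng(0)
batch = rng.normal(size=(n, m))          # toy data: row j plays the role of core j's input matrix

mu = batch.mean(axis=1)                  # per-core mean of x (the value written to the accumulator)
v = (batch ** 2).mean(axis=1)            # per-core mean of x^2
S1, S2 = mu.sum(), v.sum()               # what the atomic operation accumulation unit would hold

print(np.isclose(global_variance(S1, S2, n), batch.var()))   # True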
在一种可能的实现方式中，所述n个Core中的每一个Core，还用于：根据公式
x̂_i = (x_i − μ′)/√(δ² + ∈)
分别对所述输入矩阵中的m个训练样本x进行归一化处理，其中，μ′为所述n个Core的n个输入矩阵的全局均值，δ²为所述n个Core的n个输入矩阵的全局方差，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，∈为大于0的值。
本发明实施例,n个Core中的每一个Core可以根据自身计算出的全局均值和全局方差进行Batch Normalization的归一化处理。
在一种可能的实现方式中，所述n个Core中的每一个Core，还用于：根据公式
y_i = α·x̂_i + β
对进行归一化处理后的m个训练样本x进行缩放、平移处理；其中，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，y_i为x_i经过批归一化BN处理后得到的输出结果，α为缩放参数，β为平移参数。
本发明实施例,n个Core中的每一个Core可以根据自身计算出的全局均值和全局方差 进行Batch Normalization的平移和缩放处理。
在一种可能的实现方式中,所述n个Core中的每一个Core,用于获取特征图矩阵和权重矩阵,并根据所述特征图矩阵和所述权重矩阵计算得到所述输入矩阵。
本发明实施例,通过在Batch Normalization层的前一层获得的特征图矩阵以及权重矩阵进行运算后,得到Batch Normalization层的输入矩阵,并最终在经过Batch Normalization处理之后,作为下一层的输入。
第二方面,本发明实施例提供了一种数据处理方法,可包括:针对n个输入矩阵中的每个输入矩阵作如下处理:根据输入矩阵计算所述输入矩阵的核内均值μ,并将u写入到所述原子操作累加单元,所述输入矩阵包括m个训练样本,所述核内均值μ为所述m个训练样本x的平均值,其中,m为大于或者等于1的整数;根据所述输入矩阵计算m个x 2的均值v,并将v写入到所述原子操作累加单元;针对根据n个所述输入矩阵计算出的n个μ,以及n个v作如下处理:对n个μ进行累加得到S1;对n个v进行累加得到S2;针对S1和S2作如下处理:根据所述S1和S2计算所述n个输入矩阵的全局方差。
在一种可能的实现方式中，所述根据所述S1和S2计算所述n个输入矩阵的全局方差，包括：根据计算公式：
δ² = S2/n − (S1/n)²
计算所述n个输入矩阵的全局方差。
在一种可能的实现方式中，所述方法，还包括：根据公式
x̂_i = (x_i − μ′)/√(δ² + ∈)
分别对所述输入矩阵中的m个训练样本x进行归一化处理，其中，μ′为所述n个输入矩阵的全局均值，δ²为所述n个输入矩阵的全局方差，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，∈为大于0的值。
在一种可能的实现方式中，所述方法，还包括：根据公式
y_i = α·x̂_i + β
对进行归一化处理后的m个训练样本x进行缩放、平移处理；其中，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，y_i为x_i经过批归一化BN处理后得到的输出结果，α为缩放参数，β为平移参数。
在一种可能的实现方式中,所述方法,还包括:获取特征图矩阵和权重矩阵,并根据所述特征图矩阵和所述权重矩阵计算得到所述n个输入矩阵中的任意一个输入矩阵。
第三方面,本申请提供一种运算加速器,所述运算加速器为上述第一方面中任意一项所述的神经网络处理器中的计算核Core,所述运算加速器用于执行上述第一方面中任意一项所述的神经网络处理器中所述的n个计算核Core中任意一个Core所执行的功能。
第四方面,本申请提供一种计算机存储介质,所述计算机存储介质存储有计算机程序,该计算机程序被处理器执行时实现上述第二方面中任意一项所述的数据处理方法流程。
第五方面,本发明实施例提供了一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行上述第二方面中任意一项所述的数据处理方法流程。
第六方面,本申请提供了一种芯片***,该芯片***包括处理器,用于实现上述第二方面中任意一项所述的数据处理方法流程所涉及的功能。在一种可能的设计中,所述芯片***还包括存储器,所述存储器,用于保存数据处理必要的程序指令和数据。该芯片***,可以由芯片构成,也可以包含芯片和其它分立器件。
附图说明
图1是本发明实施例提供的一种卷积神经网络示意图;
图2是本发明实施例提供的另一种卷积神经网络示意图;
图3是本发明实施例提供的一种Batch Normalization前向流程图;
图4是本发明实施例提供的一种神经网络处理器的硬件结构示意图;
图5是本发明实施例提供的一种NPU细化的硬件结构示意图;
图6是本发明实施例提供的另一种Batch Normalization前向流程图;
图7是本发明实施例提供的一种数据处理方法流程示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例进行描述。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、***、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在本说明书中使用的术语“部件”、“模块”、“***”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地***、分布式***和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它***交互的互联网)的信号通过本地和/或远程进程来通信。
首先,对本申请中的部分用语进行解释说明,以便于本领域技术人员理解。
(1)人工智能(Artificial Intelligence,AI),是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用***。换句话说,人工智能是计算机科学的一个分支,它企图了解智能 的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
(2)卷积神经网络(Convolutional Neural Network,CNN)是一种多层的神经网络,每层有多个二维平面组成,而每个平面由多个独立神经元组成,每个平面的多个神经元共享权重,通过权重共享可以降低神经网络中的参数数目。目前,在卷积神经网络中,处理器进行卷积操作通常是将输入信号特征与权重的卷积,转换为信号矩阵与权重矩阵之间的矩阵乘运算。在具体矩阵乘运算时,对信号矩阵和权重矩阵进行分块处理,得到多个分形(Fractional)信号矩阵和分形权重矩阵,然后对多个分形信号矩阵和分形权重矩阵进行矩阵乘和累加运算。
(3)卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
(5)卷积就是卷积核跟图像矩阵(卷积层的输入矩阵)的运算。通常输入矩阵(input matrix)是卷积时根据卷积核的步长(stride)从图像矩阵提取出来的矩阵。卷积核是一个小窗口,记录的是权重。卷积核在图像矩阵上按步长滑动,每次滑动卷积核对应图像矩阵的一个子矩阵,将卷积核中的权值和该子矩阵包含的值相乘再相加,赋给卷积核当前在输出特征图(输出矩阵)对应的一个元素。
(6)卷积核沿图像矩阵的高度方向移动一次的步长为卷积核高度滑动的步长,卷积核沿图像矩阵的宽度方向移动一次的步长为卷积核宽度滑动的步长。卷积核的滑动步长用参数stride表示。通常输入矩阵是卷积时根据卷积核的步长(stride)从图像矩阵(即输入数据)提取出来的。举例来说,stride=[s1,s2],s1表示卷积核高度滑动的步长,s2表示卷积核宽度滑动的步长。
(7)卷积运算是卷积神经网络中最重要的一个算子。例如，X表示输入特征图(卷积层的输入矩阵)，X'表示采用im2col操作处理X后得到的矩阵，W表示权重矩阵，b表示偏置，Y0表示X'和W矩阵乘积的结果，Y表示输出特征图(卷积层的输出矩阵)。可选的，经过Activation操作，计算输出Y中每个元素的激活值，得到最后的结果。
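To make the im2col-plus-matrix-multiplication view concrete, here is a small single-channel NumPy sketch under simplifying assumptions (one 3x3 kernel, no padding, ReLU chosen as the example activation; the helper name im2col and its data layout are illustrative, not taken from the patent):

import numpy as np

def im2col(x, kh, kw, stride=1):
    # x: (H, W) single-channel feature map; returns (out_h*out_w, kh*kw) plus the output size
    H, W = x.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    cols = np.empty((out_h * out_w, kh * kw))
    idx = 0
    for i in range(0, out_h * stride, stride):
        for j in range(0, out_w * stride, stride):
            cols[idx] = x[i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, out_h, out_w

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 5))                    # input feature map X
W = rng.normal(size=(3, 3))                    # a single 3x3 convolution kernel (weight matrix)
b = 0.1                                        # bias

Xp, oh, ow = im2col(X, 3, 3, stride=1)         # X' = im2col(X)
Y0 = Xp @ W.ravel() + b                        # Y0 = X'W + b (matrix product plus bias)
Y = np.maximum(Y0, 0.0).reshape(oh, ow)        # example activation (ReLU), reshaped to the output map
print(Y.shape)                                 # (3, 3)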
需要说明的是,由于本申请中所涉及的批量归一化(Batch Normalization,BN)是通过对卷积神经网络CNN中某几个卷积层或者每一个卷积层之后,加入归一化层,进行归一化处理后,再进入下一层网络。因此,以下对卷积神经网络CNN的结构进行示例性介绍:
卷积神经网络CNN是一种带有卷积结构的深度神经网络,是一种深度学习(deep  learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元对输入其中的图像中的重叠区域作出响应。
如图1所示,图1为本发明实施例提供的一种卷积神经网络示意图,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,以及神经网络层130,其中池化层为可选的。
卷积层/池化层120:
如图1所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
卷积层:
以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……,这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以被称为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图1中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行 采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的输出。因此,在神经网络层130中可以包括多层隐含层(如图1所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(图1中由110至140的传播为前向传播)完成,反向传播(图1中由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图1所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图2所示,图2为本发明实施例提供的另一种卷积神经网络示意图,多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
本申请中的归一化层,作为CNN的功能层,原则上可以在上述CNN中的任何一层之后,或者任何一层之前进行,并以上一层输出的特征矩阵作为输入,其输出也可以作为CNN中任何一层功能层的输入。但在实际CNN应用中,归一化层一般在卷积层之后进行,并以前面卷积层输出的特征矩阵作为输入矩阵。
为了便于理解本发明实施例,以下具体分析本发明实施例所需要解决的技术问题以及对应的应用场景。
在利用Batch Normalization进行CNN训练的实际运算过程中,绝大多数都采用mini-batch进行训练,例如,Batch为神经网络的整个训练集中的其中一部分(也可以是全部),进一步地,将Batch拆分成多个mini-batch,每个mini-batch则为Batch对应的训练集的一个子集。Batch Normalization计算每个mini-batch对应子集的均值和方差,并基于所有mini-batch对应子集的全局均值和全局方差对每个子集进行归一化操作,从而得到每个mini-batch对应的归一化后的子集即输出结果。避免了当训练集较大时,每次迭代都要计算整个训练集而导致一次迭代计算时间过长的问题;并且由于多个mini-batch可以同步运算, 因此,大大的提高了计算效率。另外,通过增加了缩放和平移(scale and shift)步骤,以保留原有训练集中的数据分布信息,避免使得网络的表达能力下降,同时便于自适应学习。
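A toy NumPy sketch of this mini-batch splitting, under the assumption that the training set divides evenly into n mini-batches of size m (all names and sizes are illustrative):

import numpy as np

Z, m = 32, 8                          # Z training samples in the Batch, batch size m per mini-batch
n = Z // m                            # number of mini-batches (here Z divides evenly, so n*m = Z)
rng = np.random.default_rng(2)
batch = rng.normal(size=(Z, 16))      # toy training set: Z samples with 16 features each

perm = rng.permutation(Z)                        # random selection, one of the splits the text allows
mini_batches = np.split(batch[perm], n)          # n mini-batches, each an (m, 16) subset of the Batch
print(len(mini_batches), mini_batches[0].shape)  # 4 (8, 16)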
假设,Batch对应的训练集有Z个训练样本,将Batch分为的n个mini-batch,每个mini-batch是Batch对应的训练集的一个子集,每个mini-batch有m个训练样本,则训练的Batch Size为m,m<=Z,且n*m>=Z;其中mini-batch是按随机或者其他某种分布从Batch对应的训练集的Z个样本中挑选m个样本构成的一个子集,本申请对此不作具体限定。用B表示其中的任意一个mini-batch,以下示出Batch Normalization的前向计算方式:
输入:Mini-batch B={x 1,…x m};其中,B有m个训练样本分别为x 1,…x m;需要说明的是,该输入是归一化层的输入,也可以理解为归一化上一层的输出。
输出：{y_i=BN_{α,β}(x_i)}；其中，y_i则为Mini-batch B中的第i个训练样本x_i在经过归一化处理后并通过缩放和平移(涉及参数α和参数β)得到的输出结果。即Mini-batch B的集合为{x_1,…x_m}，于是其对应的BN Layer的输出集合则为{y_i=BN_{α,β}(x_i)}。需要说明的是，该输出是归一化层的输出，也可以理解为归一化下一层的输入。
(1)求Mini-batch B的均值：
μ_B = (1/m)·Σ_{i=1}^{m} x_i        (1)
其中，μ_B表示输入到归一化层的Mini-batch B的均值，对Mini-batch B中的m个训练样本x_1,…x_m求和之后除以m，便得到Mini-batch的均值μ_B；x_i为Mini-batch的m个样本中的第i个训练样本。
(2)求Mini-batch B的方差：
σ_B² = (1/m)·Σ_{i=1}^{m} (x_i − μ_B)²        (2)
其中，σ_B²表示输入到归一化层的Mini-batch B的方差，关于方差公式不再赘述。
(3)归一化：
x̂_i = (x_i − μ_B)/√(σ_B² + ∈)        (3)
其中，x̂_i为第i个训练样本x_i归一化后的结果，x̂_i应当服从均值μ=0、方差σ²=1的正态分布也即是标准正态分布；∈是为了避免分母为0而加进去的接近于0的很小值。
(4)缩放和平移，对经过上面归一化处理得到的结果进行重构：
y_i = α·x̂_i + β        (4)
其中，y_i为mini-batch B中的第i个训练样本x_i经过BN处理后得到的输出结果。而缩放参数α和平移参数β，则是为解决由于归一化后使得网络的表达能力下降问题(因为x̂_i基本会被限制在正态分布下)所引入的两个参数。且α和β是在训练时网络自己学习得到的，便于CNN的自适应学习。
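Putting formulas (1) to (4) together, a reference NumPy sketch of the Batch Normalization forward pass for a single mini-batch could look as follows (a simplified one-dimensional illustration under assumed names, not the patent's implementation; eps plays the role of ∈):

import numpy as np

def bn_forward(x, alpha, beta, eps=1e-5):
    # x: one mini-batch B = {x_1, ..., x_m}, given as a 1-D array
    mu_b = x.mean()                              # (1) mini-batch mean
    var_b = ((x - mu_b) ** 2).mean()             # (2) mini-batch variance
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)    # (3) normalization
    return alpha * x_hat + beta                  # (4) scale and shift

rng = np.random.default_rng(3)
y = bn_forward(rng.normal(loc=5.0, scale=2.0, size=8), alpha=1.0, beta=0.0)
print(np.round(y.mean(), 6), np.round(y.std(), 3))   # approximately 0 and 1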
若从Batch对应的训练集的全局角度来看，当整个训练集(Batch)包含n个mini-batch，且n大于1时，则所有n个mini-batch均需要进行上述BN处理。并且，最终，上述公式(3)中归一化所使用的μ_B则为n个mini-batch对应的全局均值(即n个mini-batch的内部均值累加之后再求均值)，而所使用的σ_B²则为n个mini-batch对应的全局方差(即n个mini-batch的内部方差累加之后再求均值)。
假设,每个mini-batch都对应一个计算核Core,那么n个mini-batch对应n个Core,记n个mini-batch中的第j个mini-batch对应的Core为Core j,j的取值为1、……n。请参见图3,图3为本发明实施例提供的一种Batch Normalization前向流程图。具体计算过程如下:
(1)第j个计算核Core j计算Feature Map矩阵和Weight矩阵相乘结果为X j,而该X j即为n个mini-batch中的第j个mini-batch的输入B={x 1,…x m}。需要说明的是,X j为矩阵,而x 1,…x m则可以为改矩阵中的m个列向量,即矩阵X j中的第i个列向量为对应的mini-batch中的训练样本x i
(2)Core j计算Core j对应的mini-batch上的核内均值μ_B = (1/m)·Σ_{i=1}^{m} x_i，j的取值为1、……n；
(3)由n个Core中的某个目标Core计算Batch的全局均值,即需要n个Core中的其他Core(除目标Core以外的Core)将计算得到的均值μ B都发送到该目标Core上来,该目标Core最终计算得到全局均值μ B′,然后各个Core读取该目标Core上的全局均值μ B′或者由该目标Core广播给各个Core。可以理解的是,目标Core也可以将全局均值μ B′存储到相关的计算单元中,以便于后续进行全局输出结果的计算。
(4)Core j计算各mini-batch的核内方差σ_B² = (1/m)·Σ_{i=1}^{m} (x_i − μ_B′)²，j的取值为1、2……n；
(5)由某个核(例如仍为上述目标Core)计算全局方差，即需要n个Core中的其他Core(除目标Core以外的Core)将计算得到的方差σ_B²均发送到该目标Core上来，该目标Core最终计算得到全局方差σ_B′²，然后各个Core读取该目标Core上的全局方差σ_B′²，或者由该目标Core广播给各个Core。各个Core根据该全局方差进行归一化和缩放、平移操作，ReLU以后，将各自计算出的输出结果y_i存放到相关计算单元中，以便于结合上述全局均值μ_B′和全局方差σ_B′²，并根据归一化公式，求得Batch全局的输出结果Y，用于后续的反向过程。
上述实现过程，按照图3所示，需要在Batch Normalization过程中做两次同步：第一次同步：n个mini-batch之间同步全局均值μ_B′；第二次同步，n个mini-batch之间同步全局方差向量σ_B′²。然而同步两次开销比较大。假如每一个网络层之后都有一层Batch Normalization，则同步的开销过大会极大的影响整个网路的训练速度。
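For illustration only, the two-synchronization flow of Figure 3 can be mimicked in NumPy as below; the two commented synchronization points mark where all cores would have to wait (the function name and the (n, m) data layout are assumptions made for the sketch):

import numpy as np

def bn_two_sync(batches, eps=1e-5):
    # batches: (n, m) array; row j stands in for the mini-batch held by Core j
    n = len(batches)
    core_means = np.array([b.mean() for b in batches])      # step (2): per-core means
    global_mean = core_means.mean()                         # step (3): synchronization #1
    core_vars = np.array([((b - global_mean) ** 2).mean()   # step (4): per-core variances; they cannot
                          for b in batches])                #           start before sync #1 has finished
    global_var = core_vars.mean()                           # step (5): synchronization #2
    return (batches - global_mean) / np.sqrt(global_var + eps)

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=1.5, size=(4, 8))
out = bn_two_sync(data)
print(np.round(out.mean(), 6), np.round(out.std(), 3))      # approximately 0 and 1

The one-synchronization flow of the present application, sketched after the description of Figure 6 below, removes the first of these two waiting points.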
因此,针对上述技术问题,本申请主要解决的问题为如何减少Batch Normalization过程中的两次同步时间,以减小Batch Normalization过程中的同步开销,提升整个网路的训练速度。可以理解的是,上述应用场景的只是本发明实施例中示例性的实施方式,本发明实施例中的应用场景包括但不仅限于以上应用场景。
基于上述,下面结合本发明实施例提供的神经网络处理器以及相关设备进行描述。请参见图4,图4是本发明实施例提供的一种神经网络处理器的硬件结构示意图,该神经网络处理器NPU 20作为协处理器挂载到CPU 30(如Host CPU)上,由Host CPU 30分配任务。该NPU 20中可包括N个计算核201(NPU Core)简称Core,该N个Core 201之间可以通过片上网络202(on-chip interconnection network)互相连接并通信。进一步可选的,N个Core 201通过片上互连网络202耦合于原子操作累加单元203、片上共享缓存204和外部存储器40。其中,原子操作累加单元203和片上共享缓存204可以集成在一起;外部存储器40可以是双倍速率同步动态随机存储器(Double Data Rate,DDR)、高带宽存储器(High Bandwidth Memory,HBM)等。
下面以n个计算核201中的任意一个计算核的一种可能的硬件结构进行描述，请参见图5，图5是本发明实施例提供的一种NPU细化的硬件结构示意图，Core201的核心部分为运算电路2013，通过直接内存访问控制器DMAC 2017控制运算电路2013提取存储器(包括输入存储器2011和权重存储器2012等)中的矩阵数据进行乘法运算；进一步地，DMAC 2017还控制运算电路2013中的运算结果或者统一存储器2016中的矩阵数据进入到累加器2015和/或向量计算单元2014中进行进一步的运算。其中，
运算电路2013，也可以称之为矩阵运算单元(Cube Unit)，用于完成矩阵*矩阵的操作。运算电路2013可以包括多个处理单元(Process Engine，PE)。在一种可能的实现方式中，运算电路2013是二维脉动阵列。可选的，运算电路2013还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中，运算电路2013是通用的矩阵处理器。
向量计算单元(Vector Unit)2014,用于对运算电路2013的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如池化(Pooling)、批归一化(Batch Normalization)、局部响应归一化(Local Response Normalization)等。在一些实现中,向量计算单元2014能将经处理的输出的向量存储到统一存储器2016。例如,向量计算单元2014可以将非线性函数应用到运算电路2013的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2014生成归一化的值或合并值,或者,所述向量计算单元2014生成归一化的值和合并值。在一些实现中,处理过的输出的向量能够用作运算电路2013的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器2016,也可以称之为片上存储器(On-chip buffer),用于存放输入数据以及输出数据。权重数据(权重矩阵)通过DMAC2017被搬运到权重存储器2012中。输入矩阵(输入矩阵)也通过DMAC被搬运到统一存储器2016或输入存储器2011中。
输入存储器2011,也可以称之为特征图存储器(Feature map buffer),用于存储Feature map矩阵。
权重存储器(Weight Buffer)2012,用于存放权重矩阵。权重矩阵的格式包括四个维度:卷积核高度、卷积核宽度、输入通道数(卷积核深度)、输出通道数(卷积核个数)。在卷积层仅使用一个卷积核进行卷积时,权重矩阵就是卷积核。在卷积层使用两个或两个以上 卷积核进行卷积时,权重矩阵可以是卷积层进行卷积所使用的各卷积核组成的矩阵。
直接内存访问控制器(Direct Memory Access Controller,DMAC)2017,用于将外部存储器40如DDR/HBM中的输入数据或输入矩阵搬运到各类存储器Buffer中,或者从统一存储器2016将输出数据搬运到DDR/HBM中。可选的,一个完整的DMA传输过程需要经过DMA请求、DMA响应、DMA传输、DMA结束4个步骤。
控制单元(Flow control)2018,芯片控制单元/流水,控制数据读取方式,用于控制处理流程以及控制数据读取方式。
片上互连网络(on-chip interconnection network),是片上***(system-on-chip,SoC)的一种通信方式,它是多核技术的主要组成部分,用于多个计算核Core 201之间的交互,以及与多个计算核Core 201与外部存储器40和内部存储器208之间的交互。
原子操作累加单元(atomic accumulator unit)203,用于存储并累加多个计算核Core 201所得到的输出结果。
片上共享缓存(shared on chip buffer)204,用于缓存原子操作累加单元写入的数据,如全局均值和全局方差,也用于在DMAC2017的控制下,将全局均值和全局方差同步给n个Core201。可选的片上共享缓存204可以和原子操作累加单元203集成在一起。
需要说明的是,统一存储器2016,输入存储器2011,权重存储器2012均为On-Chip存储器。外部存储器即DDR/HBM可以私有于该CNN 20的硬件架构,也可以在服务该CNN20的情况下还服务于其它处理器。可以理解的是,主CPU30还用于运行通用操作***软件,并在通用操作***软件的作用下控制神经网络处理器20进行神经网络训练。
可选的,上述神经网络处理器20也可以作为主CPU 30中的一部分集成在主CPU 30中;也可以为耦合于上述主CPU30,且能实现相关功能的其它功能芯片。同理,主CPU 30所执行的功能也可以分布在多个不同的功能芯片上执行,本发明实施例对此不作具体限定。
基于上述结构,举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路2013从权重存储器2012中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路2013从输入存储器2011中取矩阵A数据与矩阵B进行矩阵运算,然后在累加器2015中进行加法操作,得到的矩阵的部分结果或最终结果,保存在统一存储器2016中。
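As a software analogy for how partial matrix products are accumulated, the following sketch splits the inner dimension into tiles and adds the partial results, loosely mirroring the cooperation of the operation circuit 2013 and the accumulator 2015 (tile size and names are illustrative assumptions):

import numpy as np

def tiled_matmul(A, B, tile_k=4):
    # Accumulate partial products over slices of the inner dimension K,
    # loosely mirroring "partial or final results of the matrix are kept in the accumulator".
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for k0 in range(0, K, tile_k):
        C += A[:, k0:k0 + tile_k] @ B[k0:k0 + tile_k, :]   # one pass over a tile of A and B
    return C

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 10))       # input matrix A (for example an im2col'ed feature map)
B = rng.normal(size=(10, 3))       # weight matrix B
print(np.allclose(tiled_matmul(A, B), A @ B))   # True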
可以理解的是,上述CNN的硬件结构只是本发明实施例所提供的其中一种示例性硬件结构,本发明实施例中的CNN的硬件结构,包括但不仅限于以上结构和连接关系。
下面,基于上述CNN的硬件结构,以下对本申请中的计算过程进行分析:
已知，B={x_1,…x_m}，可知均值μ_B与方差σ_B²的计算公式如下：
μ_B = (1/m)·Σ_{i=1}^{m} x_i
σ_B² = (1/m)·Σ_{i=1}^{m} (x_i − μ_B)²
所以
σ_B² = (1/m)·Σ_{i=1}^{m} x_i² − μ_B² = (1/m)·Σ_{i=1}^{m} x_i² − ((1/m)·Σ_{i=1}^{m} x_i)²
由上述公式分析可以获知，方差可以直接由x的均值与x²的均值计算得到，而(1/m)·Σx_i²的计算不依赖于均值μ_B，所以核内均值与x²的核内均值可以同步计算，全局方差也无需先求得全局均值。
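A quick numeric check of this identity with a tiny mini-batch (purely illustrative values):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])         # a tiny mini-batch
mean_x = x.mean()                          # (1/m) * sum(x_i)
mean_x2 = (x ** 2).mean()                  # (1/m) * sum(x_i ** 2)

var_direct = ((x - mean_x) ** 2).mean()    # variance by definition
var_from_moments = mean_x2 - mean_x ** 2   # E[x^2] - (E[x])^2

print(var_direct, var_from_moments)        # both print 5.25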
结合本发明实施例提供的上述图4或图5中的神经网络处理器的硬件架构,以及上述分析推导流程,对本发明实施例提供的一种神经网络处理器的相关功能进行描述。如图4所示,该神经网络处理器20包括n个计算核Core 201、原子操作累加单元203和片上共享缓存204,所述n个Core 201和所述片上共享缓存204分别耦合于所述原子操作累加单元203,其中,n为大于1的整数。进一步可选的,n个Core 201中的每一个Core 201还包括多个功能模块,如向量计算单元2014、DMAC 2017等,请参见图4和图5中的相关描述,在此不再赘述。其中
所述n个Core中的每一个Core 201，根据输入矩阵利用向量计算单元2014计算所述输入矩阵的核内均值μ，并在DMAC 2017的控制下将μ写入到原子操作累加单元203，所述输入矩阵包括m个训练样本，所述核内均值μ为所述m个训练样本x的平均值，其中，m为大于或者等于1的整数；根据所述输入矩阵利用向量计算单元2014计算m个x²的均值v，并在DMAC 2017的控制下将v写入到原子操作累加单元203；
原子操作累加单元203，对所述n个Core写入的n个μ进行累加得到S1，并将所述S1写入片上共享缓存204；对所述n个Core写入的n个v进行累加得到S2，并将所述S2写入所述片上共享缓存204；
所述n个Core中的每一个Core 201,还进一步的从片上共享缓存204获取S1和S2,并根据所述S1和S2计算所述n个Core 201的n个输入矩阵的全局方差。
本发明实施例,通过在Batch Normalization过程中,神经网络处理器计算Batch的全局方差时,不再依赖于先计算出全局均值(每个计算核先计算核内均值再全局求和再平均)再计算全局方差(每个计算核先计算核内方差再全局求和再平均),而是每个计算核计算出核内训练样本x的核内均值μ以及x 2的核内均值v,再分别送入原子操作累加单元进行累加并将累加结果存入片上网络缓存中,最终,n个计算核从片上网络缓存中一次性获取n个μ以及n个v的累加结果,并基于该累加结果计算,计算出全局均值和全局方差。减少了一次同步过程,便大大的减少了同步的开销和时间,提升整个神经网路的训练速度。
并且,由于在神经网络处理器的片上网络NoC上增加原子操作累加单元,不同的计算核在往同一个地址(即原子操作累加单元)写数据的过程中即完成累加操作,一方面不占用计算核的计算资源,而是每个计算核算完写数据的过程即完成累加,将累加过程分散开来,节约总的时间;避免了把所有数据发给某一个计算核完成累加操作,导致做累加的时候其他的计算核在空等,效率低。
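The accumulate-on-write behaviour can be imitated in software with a lock-protected accumulator, as in the following sketch (a rough software analogy of the atomic operation accumulation unit, not a description of the hardware; class and variable names are assumptions):

import threading

class AtomicAccumulator:
    # Software stand-in for the atomic operation accumulation unit: every write to the
    # same "address" is added to the value already stored there, as part of the write itself.
    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0.0

    def write(self, v):
        with self._lock:            # the hardware unit serializes concurrent writes
            self.value += v

s1 = AtomicAccumulator()
threads = [threading.Thread(target=s1.write, args=(mu,)) for mu in (0.2, 0.5, 0.9, 1.1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(round(s1.value, 6))           # 2.7: the accumulation happened while the "cores" were writing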
可选的,本发明实施例中,对上述每一个Core 201中的向量计算单元2014在计算μ和v时的先后顺序不作具体限定,即可以先计算μ也可以先计算v,并且,每个计算核201在计算出自身的μ或v之后,则可以立即发送至原子操作累加单元203进行累加,以同步进行累加,节省时间。
可以理解的是,本发明实施例中,上述每一个Core 201在通过向量计算单元2014计算出μ或v之后,也可以统一发送给n个Core 201中的某一个指定的Core 201进行累加计算,即处理器20中也可以不具有上述原子操作累加单元203和片上共享缓存204,而是其中一个Core 201来代替上述原子操作累加单元203和片上共享缓存204的功能。
在一种可能的实现方式中，所述n个Core中的每一个Core 201，还在DMAC的控制下从片上共享缓存204获取S1和S2，并利用向量计算单元2014根据所述S1和S2计算所述n个Core 201的n个输入矩阵的全局方差，具体包括：从所述片上共享缓存获取S1和S2，并根据计算公式：
δ² = S2/n − (S1/n)²
计算所述n个Core的n个输入矩阵的全局方差。可选的，当所有Core 201将自身的μ或v写入至原子操作累加单元中时，都可以给DMAC2017一个反馈，当每个DMAC 2017均接收到两次反馈时，则可以获知原子操作累加单元203已经完成S1和S2的计算，因此DMAC 2017可以控制所在的Core从片上共享缓存中去获取S1和S2。
在一种可能的实现方式中，所述n个Core中的每一个Core 201，还通过向量计算单元2014，根据公式
x̂_i = (x_i − μ′)/√(δ² + ∈)
分别对所述输入矩阵中的m个训练样本x进行归一化处理，其中，μ′为所述n个Core的n个输入矩阵的全局均值，δ²为所述n个Core的n个输入矩阵的全局方差，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，∈为大于0的值。
在一种可能的实现方式中，所述n个Core中的每一个Core 201，通过向量计算单元2014根据公式
y_i = α·x̂_i + β
对进行归一化处理后的m个训练样本x进行缩放、平移处理；其中，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，y_i为x_i经过批归一化BN处理后得到的输出结果，α为缩放参数，β为平移参数。
在一种可能的实现方式中,所述n个Core中的每一个Core 201,用于从DDR/HBM 40中获取特征图矩阵和权重矩阵,并将获取的特征图矩阵和权重矩阵分别读入到对应的输入存储器2011和权重存储器2012,并利用运算电路2013和累加器2015对所述特征图矩阵和所述权重矩阵进行乘法运算和加法运算,得到所述输入矩阵。
下面,基于上述图4、图5提供的神经网络处理器20的硬件结构,本发明实施例结合实际应用场景,提供以下Batch Normalization的处理过程,神经网络处理器的具体计算过程可以如下:
Batch对应的训练集有Z个训练样本,将Batch分为的n个mini-batch,每个mini-batch是Batch对应的训练集的一个子集,每个mini-batch有m个训练样本,则训练的Batch Size为m,m<=Z,且n*m>=Z;其中mini-batch是按随机或者其他某种分布从Batch对应的训练集的Z个样本中挑选m个样本构成的一个子集,本申请对此不作具体限定。用B表示其中的任意一个mini-batch,那么对于一个mini-batch,假设对应的x的集合为{x 1,…x m},于是其对应的BN Layer的输出集合{y i=BN αβ(x i)}可以通过以下方式计算。
输入:Mini-batch B={x 1,…x m};其中,B有m个训练样本分别为x 1,…x m;需要说明的是,该输入是归一化层的输入,也可以理解为归一化上一层的输出。
输出:{y i=BN αβ(x i)};其中,y i则为Mini-batch B中的第i个训练样本x i在经过归一化处理后并通过缩放和平移(涉及参数α和参数β)得到的输出结果。需要说明的是,该输出是归一化层的输出,也可以理解为归一化下一层的输入。
上述Batch Normalization处理过程,为针对Batch中的某一个mini-batch进行的BN处理过程,若从Batch的全局角度来看,整个Batch中的所有n个mini-batch均需要进行上述BN处理,并且最终针对n个mini-batch的输出结果,进行Batch Normalization,才能计算得到Batch的均值和标准差,从而得到整个Batch的输出结果。假设每个mini-batch都对应一个计算核Core,那么n个mini-batch对应n个Core,记n个mini-batch中的第j个mini-batch对应的Core为Core j,j的取值为1、2……n。请参见图6,图6为本发明实施例提供的另一种Batch Normalization前向流程图。具体计算过程如下:
1、DMAC 2017通过片上互连网络202控制从HBM/DDR 40中读取n个Mini-batch对应的Feature Map矩阵和weight系数矩阵分别到各个Core对应的输入存储器2011(如Feature map buf)和权重存储器2012(如Weight buf)中。例如,将第1个mini-batch对应的Feature Map矩阵和weight读取到Core1中对应的Feature map buf和Weight buf中,将第2个mini-batch对应的Feature Map矩阵和weight读取到Core 2中对应的Feature map buf和 Weight buf中,以此类推,将第n个mini-batch对应的Feature Map矩阵和weight读取到Coren中对应的Feature map buf和Weight buf中,后续不再赘述。可以理解的是,不同mini-batch对应的weight系数矩阵是相同的。
2、第j个Core即Corej上的运算电路2013和累加器2015计算Feature Map矩阵和Weight矩阵相乘结果记为Xj,暂时存放到统一存储器2016中,j的取值为1、……n。其中输入矩阵Xj为第j个mini-batch所对应的B,B={x 1,…x m}。可理解的是,不同的mini-batch所对应的B中的{x 1,…x m}不同,本申请对此不作具体限定。
3、n个Core 201中的每个Core201分别计算自身计算核内的均值，例如，Core j上的向量计算单元2014计算出其对应的第j个mini-batch B的均值为
μ_B = (1/m)·Σ_{i=1}^{m} x_i
其中，μ_B表示输入至归一化层的Mini-batch B的核内均值，对Mini-batch B中的m个训练样本x_1,…x_m求和之后除以m，便得到μ_B(即每个Core201的输入矩阵的核内均值)，x_i为m个样本中的第i个训练样本；i的取值为1、2、……m；j的取值为1、2、……n。
4、DMAC 2017通过片上互连网络202控制n个Core 201中的每个Core 201，分别将自身计算的核内均值μ_B通过片上互连网络202写入到原子操作累加单元203中，原子操作累加单元203读取原有地址的值与当前写入值进行累加得到
S1 = Σ_{j=1}^{n} u_j
其中u_j表示n个Core中的第j个Core计算的对应的mini-batch的核内均值u也即是μ_B。
5、DMAC 2017通过片上互连网络202控制n个Core 201中的每个Core 201，分别计算m个x²的均值v，例如，第j个Core j的向量计算单元2014计算m个x_i²的和再求均值，得到
v_j = (1/m)·Σ_{i=1}^{m} x_i²
i的取值为1、……m；j的取值为1、……n。
6、DMAC 2017通过片上互连网络202控制所有Core 201分别将计算出的v写到原子操作累加单元203中，原子操作累加单元203读取原有地址的值与当前写入值进行累加，得到所有Mini-batch对应的
S2 = Σ_{j=1}^{n} v_j
其中v_j表示n个Core中的第j个Core j对应的mini-batch中计算出的m个x²的均值v。
7、原子操作累加单元203将累加得到的S1写入到片上共享缓存204中，以及将累加得到的S2写入到片上共享缓存204中。
8、DMAC 2017通过片上互连网络202控制n个Core 201中的每个Core201分别去片上共享缓存204获取S1和S2，例如，第j个Core j通过片上互连网络获取存储在片上共享缓存204中的S1和S2，并依据S1和S2各自计算全局均值μ_B′ = S1/n和全局方差δ² = S2/n − (S1/n)²，j的取值为1、……n。至此，每个Core 201都计算得到了全局均值和全局方差，因此每个Core201可以结合归一化计算公式
x̂_i = (x_i − μ_B′)/√(δ² + ∈)
以及缩放、平移操作计算各自的y_i，激活ReLU以后通过片上互连网络202存放到HBM/DDR中，用于后续的反向过程。
其中，x̂_i为每个计算Core对应的mini-batch中的第i个训练样本x_i归一化后的结果，x̂_i应当服从均值μ=0、方差σ²=1的正态分布也即是标准正态分布；∈是为了避免分母为0而加进去的接近于0的很小值。y_i为mini-batch B中的第i个训练样本x_i经过BN处理后得到的输出结果。而缩放参数α和平移参数β，则是为解决由于归一化后使得网络的表达能力下降问题(因为x̂_i基本会被限制在正态分布下)所引入的两个参数。且α和β是在训练时网络自己学习得到的，便于CNN的自适应学习。
可以理解的是，在本申请中，上述每个计算核在计算核内均值μ以及上述x²的均值时，也可以是先在核内计算总和，即只计算m个训练样本x的总和，以及m个训练样本x²的总和，写入到原子操作累加单元之后，得到的值则为m*S1以及m*S2，因此在后续计算核分别计算全局均值和全局方差时，则需要多进行一次m的除运算，从而最终得到全局均值和全局方差，其它部分的实现方式与上述发明实施例原理类似，在此不再赘述。
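A small NumPy sketch of this sum-based variant (illustrative names; it assumes equal-size mini-batches), showing the extra division by m and that the same global statistics are obtained:

import numpy as np

rng = np.random.default_rng(6)
n, m = 4, 8
batch = rng.normal(size=(n, m))

sum_x = batch.sum(axis=1)               # each core writes the raw sum of its m samples
sum_x2 = (batch ** 2).sum(axis=1)       # and the raw sum of the m squared samples
mS1, mS2 = sum_x.sum(), sum_x2.sum()    # the accumulated values are m*S1 and m*S2

global_mean = mS1 / (n * m)             # one extra division by m compared with the mean-based flow
global_var = mS2 / (n * m) - global_mean ** 2
print(np.isclose(global_mean, batch.mean()), np.isclose(global_var, batch.var()))   # True True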
上述实现过程，按照图6所示，由于在计算全局方差时，不再依赖于全局均值的结果，因此，在Batch Normalization过程中，n个mini-batch之间只需要做一次同步，即将计算出的全局均值向量μ_B′和全局方差向量δ²一次性同步到n个mini-batch对应的Core中，大大的减少了同步的时间。假如神经网络中的每一个网络层之后都有一层Batch Normalization，那么同步的开销将会极大的减小，极大的提升整个网路的训练速度。
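The complete one-synchronization flow of Figure 6 can be summarized in a NumPy sketch as follows (an illustrative simulation with an (n, m) array standing in for the n cores' input matrices; it is not the hardware implementation). The output matches a Batch Normalization computed directly over the whole Batch:

import numpy as np

def bn_one_sync(batches, alpha=1.0, beta=0.0, eps=1e-5):
    # batches: (n, m) array; row j stands in for the input matrix of Core j
    n = len(batches)
    mu = batches.mean(axis=1)              # step 3: per-core mean of x
    v = (batches ** 2).mean(axis=1)        # step 5: per-core mean of x^2
    S1, S2 = mu.sum(), v.sum()             # steps 4 and 6: atomic accumulation of the written values
    global_mean = S1 / n                   # steps 7-8: every core reads S1 and S2 once (the only sync)
    global_var = S2 / n - global_mean ** 2
    x_hat = (batches - global_mean) / np.sqrt(global_var + eps)
    return alpha * x_hat + beta            # per-core normalization, scaling and shifting

rng = np.random.default_rng(7)
data = rng.normal(loc=-2.0, scale=3.0, size=(4, 8))
out = bn_one_sync(data)
ref = (data - data.mean()) / np.sqrt(data.var() + 1e-5)     # BN computed directly over the whole Batch
print(np.allclose(out, ref))   # True

Compared with the two-synchronization sketch given after the description of Figure 3, the per-core variance step and its preceding wait for the global mean disappear.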
请参见图7,图7是本发明实施例提供的一种数据处理方法流程示意图,可以应用于上述图4或图5对应的神经网络处理器。该方法可以包括以下步骤S701-步骤S703,可选的,还可以包括步骤S700、步骤S704和步骤S705。
步骤S700:获取特征图矩阵和权重矩阵,并根据所述特征图矩阵和所述权重矩阵计算得到所述n个输入矩阵中的任意一个输入矩阵。
步骤S701：针对n个输入矩阵中的每个输入矩阵作如下处理：根据输入矩阵计算所述输入矩阵的核内均值μ，并将μ写入到所述原子操作累加单元，所述输入矩阵包括m个训练样本，所述核内均值μ为所述m个训练样本x的平均值，其中，m为大于或者等于1的整数；根据所述输入矩阵计算m个x²的均值v，并将v写入到所述原子操作累加单元。
步骤S702:针对根据n个所述输入矩阵计算出的n个μ,以及n个v作如下处理:对n个μ进行累加得到S1;对n个v进行累加得到S2。
步骤S703:针对S1和S2作如下处理:根据所述S1和S2计算所述n个输入矩阵的全局方差。
在一种可能的实现方式中，所述根据所述S1和S2计算所述n个输入矩阵的全局方差，包括：根据计算公式：
δ² = S2/n − (S1/n)²
计算所述n个输入矩阵的全局方差。
步骤S704：根据公式
x̂_i = (x_i − μ′)/√(δ² + ∈)
分别对所述输入矩阵中的m个训练样本x进行归一化处理，其中，μ′为所述n个输入矩阵的全局均值，δ²为所述n个输入矩阵的全局方差，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，∈为大于0的值。
步骤S705：根据公式
y_i = α·x̂_i + β
对进行归一化处理后的m个训练样本x进行缩放、平移处理；其中，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，y_i为x_i经过批归一化BN处理后得到的输出结果，α为缩放参数，β为平移参数。
需要说明的是,本发明实施例中所描述的数据处理方法中的具体流程,可参见上述图1-图6中所述的发明实施例中的相关描述,此处不再赘述。
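One observation about the global-mean formula used above (an observation, not a statement from the patent): averaging the n per-core means, S1/n, reproduces the full-batch mean exactly because every input matrix holds the same number m of training samples; with unequal mini-batch sizes the per-core values would have to be weighted by their sample counts. A tiny NumPy check:

import numpy as np

# Equal-size mini-batches (the setting assumed throughout, each input matrix holding m samples):
# the average of the per-core means equals the full-batch mean exactly.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
print(np.isclose((a.mean() + b.mean()) / 2, np.concatenate([a, b]).mean()))   # True

# With unequal sizes, the per-core means would have to be weighted by their sample counts instead.
c = np.array([1.0, 2.0])
d = np.array([10.0, 20.0, 30.0, 40.0])
weighted = (len(c) * c.mean() + len(d) * d.mean()) / (len(c) + len(d))
print(np.isclose(weighted, np.concatenate([c, d]).mean()))                    # True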
本发明实施例还提供一种计算机存储介质,其中,该计算机存储介质可存储有程序,该程序执行时包括上述方法实施例中记载的任意一种的部分或全部步骤。
本发明实施例还提供一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行任意一种数据处理方法的部分或全部步骤。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可能可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以为个人计算机、服务器或者网络设备等,具体可以是计算机设备中的处理器)执行本申请各个实施例上述方法的全部或部分步骤。其中,而前述的存储介质可包括:U盘、移动硬盘、磁碟、光盘、只读存储器(Read-Only Memory,缩写:ROM)或者随机存取存储器(Random Access Memory,缩写:RAM)等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述 实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (13)

  1. 一种神经网络处理器,其特征在于,包括:n个计算核Core、原子操作累加单元和片上共享缓存,所述n个Core和所述片上共享缓存分别耦合于所述原子操作累加单元,其中,n为大于1的整数;
    所述n个Core中的每一个Core,用于:
    根据输入矩阵计算所述输入矩阵的核内均值μ，并将μ写入到所述原子操作累加单元，所述输入矩阵包括m个训练样本，所述核内均值μ为所述m个训练样本x的平均值，其中，m为大于或者等于1的整数；
    根据所述输入矩阵计算m个x²的均值v，并将v写入到所述原子操作累加单元；
    所述原子操作累加单元，用于对所述n个Core写入的n个μ进行累加得到S1，并将所述S1写入所述片上共享缓存；对所述n个Core写入的n个v进行累加得到S2，并将所述S2写入所述片上共享缓存；
    所述n个Core中的每一个Core,还用于:
    从所述片上共享缓存获取S1和S2,并根据所述S1和S2计算所述n个Core的n个输入矩阵的全局方差。
  2. 根据权利要求1所述的处理器,其特征在于,所述n个Core中的每一个Core,还用于从所述片上共享缓存获取S1和S2,并根据所述S1和S2计算所述n个Core的n个输入矩阵的全局方差,包括:
    从所述片上共享缓存获取S1和S2，并根据计算公式：
    δ² = S2/n − (S1/n)²
    计算所述n个Core的n个输入矩阵的全局方差。
  3. 根据权利要求2所述的处理器,其特征在于,所述n个Core中的每一个Core,还用于:
    根据公式
    x̂_i = (x_i − μ′)/√(δ² + ∈)
    分别对所述输入矩阵中的m个训练样本x进行归一化处理，其中，μ′为所述n个Core的n个输入矩阵的全局均值，δ²为所述n个Core的n个输入矩阵的全局方差，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，∈为大于0的值。
  4. 根据权利要求3所述的处理器,其特征在于,所述n个Core中的每一个Core,还用于:根据公式
    y_i = α·x̂_i + β
    对进行归一化处理后的m个训练样本x进行缩放、平移处理；其中，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，y_i为x_i经过批归一化BN处理后得到的输出结果，α为缩放参数，β为平移参数。
  5. 根据权利要求1-4任意一项所述的处理器,其特征在于,所述n个Core中的每一个Core,用于获取特征图矩阵和权重矩阵,并根据所述特征图矩阵和所述权重矩阵计算得到所述输入矩阵。
  6. 一种数据处理方法,其特征在于,包括:
    针对n个输入矩阵中的每个输入矩阵作如下处理:
    根据输入矩阵计算所述输入矩阵的核内均值μ，并将μ写入到所述原子操作累加单元，所述输入矩阵包括m个训练样本，所述核内均值μ为所述m个训练样本x的平均值，其中，m为大于或者等于1的整数；
    根据所述输入矩阵计算m个x²的均值v，并将v写入到所述原子操作累加单元；
    针对根据n个所述输入矩阵计算出的n个μ,以及n个v作如下处理:
    对n个μ进行累加得到S1;对n个v进行累加得到S2;
    针对S1和S2作如下处理:
    根据所述S1和S2计算所述n个输入矩阵的全局方差。
  7. 根据权利要求6所述的方法,其特征在于,所述根据所述S1和S2计算所述n个输入矩阵的全局方差,包括:
    根据计算公式：
    δ² = S2/n − (S1/n)²
    计算所述n个输入矩阵的全局方差。
  8. 根据权利要求7所述的方法,其特征在于,所述方法,还包括:
    根据公式
    x̂_i = (x_i − μ′)/√(δ² + ∈)
    分别对所述输入矩阵中的m个训练样本x进行归一化处理，其中，μ′为所述n个输入矩阵的全局均值，δ²为所述n个输入矩阵的全局方差，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，∈为大于0的值。
  9. 根据权利要求8所述的方法,其特征在于,所述方法,还包括:
    根据公式
    y_i = α·x̂_i + β
    对进行归一化处理后的m个训练样本x进行缩放、平移处理；其中，x̂_i为所述m个训练样本中的第i个训练样本x_i归一化处理后的结果，y_i为x_i经过批归一化BN处理后得到的输出结果，α为缩放参数，β为平移参数。
  10. 根据权利要求6-9任意一项所述的方法,其特征在于,所述方法,还包括:
    获取特征图矩阵和权重矩阵,并根据所述特征图矩阵和所述权重矩阵计算得到所述n个输入矩阵中的任意一个输入矩阵。
  11. 一种运算加速器,其特征在于,所述运算加速器为如权利要求1-5任意一项所述的神经网络处理器中的计算核Core,所述运算加速器用于执行如权利要求1-5任意一项所述的n个计算核Core中任意一个Core所执行的功能。
  12. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有计算机程序,该计算机程序被处理器执行时实现上述权利要求6-10任意一项所述的方法。
  13. 一种计算机程序,其特征在于,所述计算机程序包括指令,当所述计算机程序被计算机执行时,使得所述计算机执行如权利要求6-10中任意一项所述的方法。
PCT/CN2018/109208 2018-09-30 2018-09-30 一种神经网络处理器、数据处理方法及相关设备 WO2020062299A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880098253.3A CN112789627B (zh) 2018-09-30 2018-09-30 一种神经网络处理器、数据处理方法及相关设备
PCT/CN2018/109208 WO2020062299A1 (zh) 2018-09-30 2018-09-30 一种神经网络处理器、数据处理方法及相关设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/109208 WO2020062299A1 (zh) 2018-09-30 2018-09-30 一种神经网络处理器、数据处理方法及相关设备

Publications (1)

Publication Number Publication Date
WO2020062299A1 true WO2020062299A1 (zh) 2020-04-02

Family

ID=69949555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/109208 WO2020062299A1 (zh) 2018-09-30 2018-09-30 一种神经网络处理器、数据处理方法及相关设备

Country Status (2)

Country Link
CN (1) CN112789627B (zh)
WO (1) WO2020062299A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881880A (zh) * 2020-08-10 2020-11-03 晶璞(上海)人工智能科技有限公司 一种基于新型网络的票据文本识别方法
CN112308762A (zh) * 2020-10-23 2021-02-02 北京三快在线科技有限公司 一种数据处理方法及装置
CN115278360A (zh) * 2022-07-18 2022-11-01 天翼云科技有限公司 一种视频数据处理方法及电子设备

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117852573B (zh) * 2024-03-07 2024-06-07 山东云海国创云计算装备产业创新中心有限公司 算力执行***、算子计算流管理方法、装置、设备和介质
CN118095351B (zh) * 2024-04-12 2024-07-02 清华大学 层归一化计算的协同处理装置及方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022468A (zh) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 人工神经网络处理器集成电路及该集成电路的设计方法
CN106056211A (zh) * 2016-05-25 2016-10-26 清华大学 神经元计算单元、神经元计算模块及人工神经网络计算核
CN106844294A (zh) * 2016-12-29 2017-06-13 华为机器有限公司 卷积运算芯片和通信设备
CN108256638A (zh) * 2018-01-05 2018-07-06 上海兆芯集成电路有限公司 微处理器电路以及执行神经网络运算的方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10083395B2 (en) * 2015-05-21 2018-09-25 Google Llc Batch processing in a neural network processor
CN116468070A (zh) * 2015-11-12 2023-07-21 渊慧科技有限公司 使用规范化的目标输出训练神经网络
CN105488565A (zh) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 加速深度神经网络算法的加速芯片的运算装置及方法
US10831444B2 (en) * 2016-04-04 2020-11-10 Technion Research & Development Foundation Limited Quantized neural network training and inference
WO2017185335A1 (zh) * 2016-04-29 2017-11-02 北京中科寒武纪科技有限公司 一种用于执行batch normalization运算的装置和方法
KR102592721B1 (ko) * 2017-01-11 2023-10-25 한국전자통신연구원 이진 파라미터를 갖는 컨볼루션 신경망 시스템 및 그것의 동작 방법
CN108090565A (zh) * 2018-01-16 2018-05-29 电子科技大学 一种卷积神经网络并行化训练加速方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022468A (zh) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 人工神经网络处理器集成电路及该集成电路的设计方法
CN106056211A (zh) * 2016-05-25 2016-10-26 清华大学 神经元计算单元、神经元计算模块及人工神经网络计算核
CN106844294A (zh) * 2016-12-29 2017-06-13 华为机器有限公司 卷积运算芯片和通信设备
CN108256638A (zh) * 2018-01-05 2018-07-06 上海兆芯集成电路有限公司 微处理器电路以及执行神经网络运算的方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881880A (zh) * 2020-08-10 2020-11-03 晶璞(上海)人工智能科技有限公司 一种基于新型网络的票据文本识别方法
CN112308762A (zh) * 2020-10-23 2021-02-02 北京三快在线科技有限公司 一种数据处理方法及装置
CN115278360A (zh) * 2022-07-18 2022-11-01 天翼云科技有限公司 一种视频数据处理方法及电子设备
CN115278360B (zh) * 2022-07-18 2023-11-07 天翼云科技有限公司 一种视频数据处理方法及电子设备

Also Published As

Publication number Publication date
CN112789627B (zh) 2023-08-22
CN112789627A (zh) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
WO2020238293A1 (zh) 图像分类方法、神经网络的训练方法及装置
WO2020062299A1 (zh) 一种神经网络处理器、数据处理方法及相关设备
WO2021155792A1 (zh) 一种处理装置、方法及存储介质
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
WO2021057056A1 (zh) 神经网络架构搜索方法、图像处理方法、装置和存储介质
WO2021147325A1 (zh) 一种物体检测方法、装置以及存储介质
WO2021008206A1 (zh) 神经网络结构的搜索方法、图像处理方法和装置
WO2022179492A1 (zh) 一种卷积神经网络的剪枝处理方法、数据处理方法及设备
WO2021051987A1 (zh) 神经网络模型训练的方法和装置
US20220157041A1 (en) Image classification method and apparatus
CN111898703B (zh) 多标签视频分类方法、模型训练方法、装置及介质
WO2021129668A1 (zh) 训练神经网络的方法和装置
WO2021227787A1 (zh) 训练神经网络预测器的方法、图像处理方法及装置
WO2021136058A1 (zh) 一种处理视频的方法及装置
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
WO2023020613A1 (zh) 一种模型蒸馏方法及相关设备
WO2022111387A1 (zh) 一种数据处理方法及相关装置
WO2022267036A1 (zh) 神经网络模型训练方法和装置、数据处理方法和装置
WO2022179606A1 (zh) 一种图像处理方法及相关装置
WO2023165361A1 (zh) 一种数据处理方法及相关设备
US20240135174A1 (en) Data processing method, and neural network model training method and apparatus
WO2020192523A1 (zh) 译文质量检测方法、装置、机器翻译***和存储介质
WO2022227024A1 (zh) 神经网络模型的运算方法、训练方法及装置
Yuan et al. Low-res MobileNet: An efficient lightweight network for low-resolution image classification in resource-constrained scenarios

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18934963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18934963

Country of ref document: EP

Kind code of ref document: A1