CN111814983B - Data processing method, device, chip and computer readable storage medium


Info

Publication number
CN111814983B
CN111814983B (application CN202010142406.9A)
Authority
CN
China
Prior art keywords
matrix
data
memory
register set
row
Prior art date
Legal status
Active
Application number
CN202010142406.9A
Other languages
Chinese (zh)
Other versions
CN111814983A (en)
Inventor
闯小明
杨龚轶凡
郑瀚寻
高雷
钟居哲
Current Assignee
Zhonghao Xinying Hangzhou Technology Co ltd
Original Assignee
Zhonghao Xinying Hangzhou Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhonghao Xinying Hangzhou Technology Co ltd
Priority to CN202010142406.9A
Publication of CN111814983A
Application granted
Publication of CN111814983B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the invention disclose a data processing method, a data processing device, a chip and a computer readable storage medium for accelerating the operation of a batch normalization layer in deep learning model training. Multi-dimensional tensor data are stored in a first memory according to a preset rule and then fetched and operated on in two-dimensional form. A fourth matrix is constructed through the cooperation of the register sets and a second memory, and a single matrix multiplication of the first matrix by the fourth matrix yields, for every row of the first matrix, both its element sum and its element product sum. The element sums and element product sums are thus computed in parallel, which accelerates the calculation in the batch normalization layer and solves the problem of long computation time caused by the very large amount of data the batch normalization layer must process. As a result, the speed of the batch normalization operation is increased and the time required to train the whole deep learning model is greatly shortened.

Description

Data processing method, device, chip and computer readable storage medium
Technical Field
The present invention relates to the field of deep learning model training, and in particular, to a data processing method, device, chip and computer readable storage medium.
Background
Deep learning is a relatively new field of machine learning research that builds or simulates neural networks which imitate the analysis and learning mechanisms of the human brain in order to interpret data such as images, sound and text. A deep learning model can be put to practical use only after being trained with a large amount of data; common deep learning models include the convolutional neural network (Convolutional Neural Network, CNN).
During the training of a deep learning model, the layers of the model are processed with batch normalization (Batch Normalization, BN) to reduce the variation of the sample distribution as data passes from layer to layer. Batch normalization is a technique proposed in 2015 to improve the speed, performance and stability of deep neural networks (Deep Neural Network). The change in the distribution of the outputs of intermediate layers of a deep neural network during training is called internal covariate shift (Internal Covariate Shift), and eliminating this phenomenon accelerates training. A batch normalization layer (Batch Normalization Layer) fixes the mean and variance of a layer's inputs by normalizing them, thereby reducing internal covariate shift, so batch normalization allows the network to be trained with a larger learning rate and ultimately speeds up training. At the same time, batch normalization makes the network less dependent on weight initialization.
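For reference, the batch normalization transform described above can be written in a few lines of NumPy. This sketch is illustrative only and does not appear in the patent; the variable names (x, gamma, beta, eps) are chosen for readability.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a batch of activations per channel, then scale and shift.

    x is laid out as (B, H, W, C); the statistics are taken over the
    m = B*H*W elements of each channel, as in the notation used later.
    """
    mean = x.mean(axis=(0, 1, 2), keepdims=True)   # per-channel mean
    var = x.var(axis=(0, 1, 2), keepdims=True)     # per-channel variance
    x_hat = (x - mean) / np.sqrt(var + eps)        # normalized data
    return gamma * x_hat + beta                    # learnable scale and shift
```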
During neural network training, the batch normalization layer takes part in both forward propagation (Forward Propagation) and back propagation (Backward Propagation).
It is noted that in the back propagation process, an element sum operation and an element product sum operation are performed on the input data. In deep learning model training, the data to be processed by the batch normalization layer are usually multi-dimensional tensor data, and the huge data volume means that the batch normalization layer spends a great deal of time on the element sum and element product sum operations, which slows down the training of the deep learning model.
Disclosure of Invention
In view of the above, the present invention provides a data processing method, apparatus, chip and computer readable storage medium to solve the problems of time-consuming batch normalization layer calculation and slow deep learning model training.
In a first aspect, an embodiment of the present invention provides a data processing method. The method is used for accelerating the operation of a batch normalization layer in a deep learning model. The method provides multi-dimensional tensor data as input to the batch normalization layer, where the dimensions of the multi-dimensional tensor data include a channel dimension, and the multi-dimensional tensor data include first tensor data and second tensor data; a first register set, a second register set, a first memory and a second memory are also provided. The first register set and the second register set each comprise M rows and N columns of registers, and the second memory can store at least N rows and K columns of data, where K is not less than 2M. The method comprises the following steps:
storing the multi-dimensional tensor data into the first memory according to a preset rule;
acquiring a first matrix: fetching at least part of the first tensor data from the first memory by channel and placing it into the first register set, where the Q data belonging to the same channel of the first tensor data may only be placed into one row of the first register set, Q is not greater than N, and when Q is less than N the row holding those Q data is padded with N-Q zeros; when the data in the first register set include the at least part of the first tensor data, all the data in the first register set constitute the first matrix;
acquiring a second matrix: placing initial data into any row of the second register set, the initial data having a size of 1 row and Q columns with every element equal to 1; when the data in the second register set include the initial data, all the data in the second register set constitute the second matrix, and every element of the second matrix other than the initial data is 0; after the second matrix is obtained, it is transposed and placed into the second memory by columns;
acquiring a third matrix comprising at least part of the second tensor data, where the elements of the third matrix correspond one-to-one in position to the elements of the first matrix, and the Q data belonging to the same channel of the at least part of the second tensor data exist only in one row of the third matrix; after the third matrix is obtained, it is transposed and placed into the second memory by columns;
all data in the second memory constitute a fourth matrix;
performing matrix multiplication of the first matrix and the fourth matrix to obtain a multiplication result, the multiplication result containing the element sum and the element product sum of each row of the first matrix.
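The effect of these steps can be illustrated with a short NumPy sketch that is not part of the patent: stacking the transposed second matrix and the transposed third matrix into the fourth matrix lets a single matrix multiplication return both reductions at once. The sizes M, N and Q and the random contents are assumptions made only for the illustration.

```python
import numpy as np

M, N, Q = 4, 32, 20   # register set of M rows by N columns; Q valid elements per channel (illustrative)

rng = np.random.default_rng(0)
first = np.zeros((M, N)); first[:, :Q] = rng.random((M, Q))   # first matrix: one channel per row, zero-padded
third = np.zeros((M, N)); third[:, :Q] = rng.random((M, Q))   # third matrix: positions match the first matrix

second = np.zeros((M, N)); second[0, :Q] = 1.0                # second matrix: initial data of 1 row and Q ones
fourth = np.hstack([second.T, third.T])                       # fourth matrix, shape (N, 2*M)

result = first @ fourth                                       # one matrix multiplication, shape (M, 2*M)
elem_sums = result[:, 0]                                      # element sum of every row of the first matrix
prod_sums = np.diag(result[:, M:])                            # element product sum of corresponding rows

assert np.allclose(elem_sums, first.sum(axis=1))
assert np.allclose(prod_sums, (first * third).sum(axis=1))
```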
Further, the method also provides a third register set comprising M rows and N columns of registers. Acquiring the third matrix then includes: fetching at least part of the second tensor data from the first memory by channel and placing it into the third register set, where all the data in the third register set constitute the third matrix. Receiving and storing the third matrix in the third register set allows the first, second and third matrices to be processed in parallel during the operation, which accelerates the batch normalization operation and further shortens the time required to train the whole deep learning model.
Further, the third matrix may be obtained through the second register set. The step of obtaining the third matrix occurs either before or after the step of obtaining the second matrix and comprises: fetching at least part of the second tensor data from the first memory by channel and placing it into the second register set; when the data in the second register set include the part of the second tensor data, all the data in the second register set constitute the third matrix.
Further, the third matrix may be obtained through the first register set. The step of obtaining the third matrix precedes the step of obtaining the first matrix and comprises: fetching at least part of the second tensor data from the first memory by channel and placing it into the first register set; when part of the second tensor data is included in the first register set, all the data in the first register set constitute the third matrix.
Further, the aforementioned multi-dimensional tensor data include four-dimensional tensor data comprising a batch size B, a height H, a width W and a channel number C.
Still further, the preset rule includes storing in the cross order of channel elements.
Still further, the preset rule includes storing in the continuous order of channel elements.
With the above method, the multi-dimensional tensor data are stored in the first memory according to the preset rule and then fetched and operated on as two-dimensional data. The fourth matrix is constructed through the cooperation of the register sets and the second memory, and a single matrix multiplication of the first matrix by the fourth matrix yields the element sum and the element product sum of every row of the first matrix, so that element sums and element product sums are computed in parallel. This solves the problem of long computation time caused by the very large amount of data in the batch normalization layer, increases the speed of the batch normalization operation, and greatly shortens the time required to train the whole deep learning model.
In a second aspect, an embodiment of the present invention provides a data processing apparatus for accelerating the operation of a batch normalization layer in deep learning model training. The apparatus comprises a first memory, a first register set, a second memory, a second register set and an arithmetic unit; the first register set is connected to the first memory and the arithmetic unit, and the second memory is connected to the second register set and the arithmetic unit; the first register set and the second register set each comprise M rows and N columns of registers; multi-dimensional tensor data are provided as input to the batch normalization layer, wherein:
the first memory is used to store the multi-dimensional tensor data, where the dimensions of the multi-dimensional tensor data include a channel dimension, and the multi-dimensional tensor data include first tensor data and second tensor data;
the first register set is used to store a first matrix comprising at least part of the first tensor data;
the second register set is used to store a second matrix, any row of which contains initial data of 1 row and Q columns with every element equal to 1, where Q is not greater than N;
the second memory is used to store a fourth matrix comprising the transpose of the second matrix and the transpose of a third matrix; the third matrix comprises at least part of the second tensor data, and its elements correspond one-to-one in position to the elements of the first matrix;
the arithmetic unit is used to multiply the first matrix by the fourth matrix, and the resulting multiplication result contains the element sum and the element product sum of each row of the first matrix.
With this apparatus, the multi-dimensional tensor data are stored in the first memory according to the preset rule and then fetched and operated on as two-dimensional data. The fourth matrix is constructed through the cooperation of the register sets and the second memory, and a single matrix multiplication of the first matrix by the fourth matrix yields the element sum and the element product sum of every row of the first matrix, so that element sums and element product sums are computed in parallel and the calculation in the batch normalization layer is accelerated. This solves the problem of long computation time caused by the very large amount of data in the batch normalization layer, increases the speed of the batch normalization operation, and greatly shortens the time required to train the whole deep learning model.
Further, the data processing apparatus also includes a third register set comprising M rows and N columns of registers; the third register set is connected to the first memory and the second memory and is used to store the third matrix. Receiving and storing the third matrix in the third register set allows the first, second and third matrices to be processed in parallel during the operation, which accelerates the batch normalization operation and further shortens the time required to train the whole deep learning model.
Further, the second register set is also configured to store the third matrix. Receiving and storing the second matrix and the third matrix in turn through the second register set allows the first matrix to be processed in parallel with the second matrix or with the third matrix during the operation, and the data processing device can perform the derivations with respect to gamma (γ) and beta (β) in parallel. This accelerates the batch normalization operation and shortens the time required to train the deep learning model, while also saving hardware and reducing the cost of the data processing device.
Further, the first register set is also used to store the third matrix, and the first register set is also connected to the second memory. Receiving and storing the first matrix and the third matrix in turn through the first register set allows the first matrix and the second matrix, or the third matrix and the second matrix, to be processed in parallel during the operation, which accelerates the batch normalization operation, shortens the time required to train the whole deep learning model, saves hardware and reduces the cost of the data processing device.
In a third aspect, an embodiment of the present invention provides a chip. The chip includes at least the aforementioned data processing apparatus.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the data processing method described above.
The implementations provided in the above aspects may be further combined to provide additional implementations of the present invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of four-dimensional tensor data according to an embodiment of the present invention;
FIG. 3A is a schematic diagram of first tensor data according to an embodiment of the present invention;
FIG. 3B is a diagram of second tensor data according to an embodiment of the present invention;
FIG. 4A is a schematic diagram of four-dimensional data stored in an SRAM memory according to a cross channel element order according to an embodiment of the present invention;
FIG. 4B is a schematic diagram of four-dimensional data stored in an SRAM according to a continuous sequence of channel elements according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data processing apparatus 500 according to an embodiment of the present invention;
FIG. 6A is a schematic diagram of another data processing apparatus 600 according to an embodiment of the present invention;
FIG. 6B is a schematic diagram of a register set according to an embodiment of the present invention;
FIG. 6C is a schematic diagram of a first matrix placed into the first register set 602 according to an embodiment of the invention;
FIG. 6D is a schematic diagram of a second matrix placed into the second register set 603 according to an embodiment of the present invention;
FIG. 6E is a schematic diagram of a third matrix placed into the third register set 606 according to an embodiment of the present invention;
FIG. 6F is a schematic diagram of a fourth matrix placed in the second memory 604 according to an embodiment of the present invention;
FIG. 6G is a schematic diagram of a multiplication result of a first matrix and a fourth matrix according to an embodiment of the present invention;
FIG. 7A is a schematic diagram of another data processing apparatus 700 according to an embodiment of the present invention;
FIG. 7B is a schematic diagram of a first matrix placed into the first register set 702 according to an embodiment of the present invention;
FIG. 7C is a schematic diagram of a second matrix placed into the second register set 703 according to an embodiment of the present invention;
FIG. 7D is a schematic diagram of a third matrix placed into the second register set 703 according to an embodiment of the present invention;
FIG. 7E is a schematic diagram of a fourth matrix placed in the second memory 704 according to an embodiment of the present invention;
FIG. 7F is a schematic diagram of a multiplication result of another first matrix and a fourth matrix according to an embodiment of the present invention;
FIG. 8A is a schematic diagram of another data processing apparatus 800 according to an embodiment of the present invention;
FIG. 8B is a schematic diagram of a third matrix placed into the first register set 802 according to an embodiment of the present invention;
FIG. 8C is a schematic diagram of a second matrix placed into the second register set 803 according to an embodiment of the present invention;
FIG. 8D is a schematic diagram of a first matrix placed into the first register set 802 according to an embodiment of the present invention;
FIG. 8E is a schematic diagram of a fourth matrix placed in the second memory 804 according to an embodiment of the invention;
FIG. 8F is a schematic diagram of a multiplication result of another first matrix and a fourth matrix according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a chip according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It will be understood that when an element is referred to as being "connected to" another element, it can be directly connected to the other element or indirectly connected to it.
Furthermore, the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality" means two or more, unless explicitly defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The following describes embodiments of the present invention in detail.
An embodiment of the present invention provides a data processing method that can be used to accelerate the operation of a batch normalization layer in deep learning model training. The data processed by the method include multi-dimensional tensor data whose dimensions include at least a channel dimension, the multi-dimensional tensor data comprising first tensor data and second tensor data. The method provides a first register set, a second register set, a first memory and a second memory, where the first register set and the second register set each comprise M rows and N columns of registers, and the second memory can store at least N rows and K columns of data with K not less than 2M. FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the invention. As shown in FIG. 1, the method comprises the following steps:
storing the multi-dimensional tensor data into the first memory according to a preset rule;
acquiring a first matrix: fetching at least part of the first tensor data from the first memory by channel and placing it into the first register set, where the Q data belonging to the same channel exist only in one row of the first register set and Q is not greater than N; when the data in the first register set include the part of the first tensor data, all the data in the first register set constitute the first matrix;
acquiring a second matrix: placing initial data of 1 row and Q columns, every element of which is 1, into any row of the second register set; when the data in the second register set include the initial data, all the data in the second register set constitute the second matrix; the second matrix is transposed and then placed into the second memory;
acquiring a third matrix comprising at least part of the second tensor data, where the elements of the third matrix correspond one-to-one in position to the elements of the first matrix and the Q data belonging to the same channel of the part of the second tensor data exist only in one row of the third matrix; the third matrix is transposed and then placed into the second memory; in some other preferred embodiments, the third matrix is a weight matrix of the first matrix;
all data in the second memory constitute a fourth matrix; matrix multiplication of the first matrix and the fourth matrix is performed to obtain a multiplication result containing the element sum and the element product sum of each row of the first matrix.
In a specific implementation, the step of acquiring the first matrix and the step of acquiring the third matrix may be executed sequentially or in parallel as the situation requires; likewise the step of acquiring the second matrix and the step of acquiring the third matrix may be executed sequentially or in parallel; and the step of acquiring the second matrix may be executed in parallel with the step of acquiring the first matrix.
In some preferred embodiments of the invention, the multi-dimensional tensor data are stored in an SRAM memory, i.e. the first memory is typically an SRAM memory in which each piece of data has a corresponding address through which it can be accessed. In a preferred embodiment, the multi-dimensional tensor data are four-dimensional tensor data composed of a batch of two-dimensional data. FIG. 2 is a schematic diagram of four-dimensional tensor data according to an embodiment of the present invention. As shown in FIG. 2, the first dimension of the four-dimensional tensor data is the batch size B (batch size), the second dimension is the height H (height), the third dimension is the width W (width), and the fourth dimension is the number of channels C (channel). FIG. 2 also contains first tensor data and second tensor data; FIG. 3A and FIG. 3B are schematic diagrams of the first tensor data and the second tensor data according to an embodiment of the present invention. As shown in FIG. 3A and FIG. 3B, the four dimensions of the first tensor data and the second tensor data are equal, with B=1, H=5, W=4 and C=64; they differ only in the specific values of their data elements.
The first tensor data may be stored in the SRAM memory in the cross order of channel elements, i.e. the element at one position of every channel is stored before the element at the next position of any channel. FIG. 4A is a schematic diagram of four-dimensional data stored in an SRAM memory in the cross order of channel elements according to an embodiment of the present invention. As shown in FIG. 4A, when the first tensor data of FIG. 3A are stored in the SRAM memory in the cross order of channel elements, the first elements (0, 20, …, 1260) of channels C0 (i.e. C=0) to C63 (i.e. C=63) in FIG. 3A are stored first, that is, in the order (0,0,0,0), (0,0,0,1), …, (0,0,0,63), where the position of an element in the multi-dimensional tensor data is denoted by (B, H, W, C); then the second elements (1, 21, …, 1261) of channels C0 to C63 are stored, that is, in the order (0,0,1,0), (0,0,1,1), …, (0,0,1,63); and so on, until the last elements (19, 39, …, 1279) of channels C0 to C63 are stored, that is, in the order (0,3,4,0), (0,3,4,1), …, (0,3,4,63).
The first tensor data may also be stored in the SRAM memory in the continuous order of channel elements, i.e. all elements of one channel are stored before all elements of the next channel, and so on. FIG. 4B is a schematic diagram of four-dimensional data stored in the SRAM memory in the continuous order of channel elements according to an embodiment of the present invention. As shown in FIG. 4B, when the first tensor data of FIG. 3A are stored in the SRAM memory in the continuous order of channel elements, all the data (0, 1, …, 19) of channel 0 in FIG. 3A are placed first, that is, in the order (0,0,0,0), (0,0,1,0), …, (0,3,4,0); then all the data (20, 21, …, 39) of channel 1, that is, in the order (0,0,0,1), (0,0,1,1), …, (0,3,4,1); and so on, until all the data (1260, 1261, …, 1279) of the last channel are placed, that is, in the order (0,0,0,63), (0,0,1,63), …, (0,3,4,63). The second tensor data of FIG. 3B may also be stored using either of these two preset rules; the process is the same as above and is not repeated here.
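The two preset rules can be illustrated with a small NumPy sketch that is not part of the patent; the tensor values are placeholders standing in for the data of FIG. 3A.

```python
import numpy as np

B, H, W, C = 1, 5, 4, 64                            # sizes of the example tensor of FIG. 3A
t = np.arange(B * H * W * C).reshape(B, H, W, C)    # stand-in values indexed by position (b, h, w, c)

# Cross order of channel elements: the element at one (B, H, W) position of every
# channel is stored before moving to the next position, i.e. the plain flattening
# of a (B, H, W, C) layout.
cross_order = t.reshape(-1)

# Continuous order of channel elements: all elements of channel 0, then all of
# channel 1, and so on, i.e. the flattening of the (C, B, H, W) permutation.
continuous_order = t.transpose(3, 0, 1, 2).reshape(-1)
```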
With this method, the multi-dimensional tensor data are stored in the first memory according to the preset rule and then fetched and operated on as two-dimensional data. The fourth matrix is constructed through the cooperation of the register sets and the second memory, and a single matrix multiplication of the first matrix by the fourth matrix yields the element sum and the element product sum of every row of the first matrix, so that element sums and element product sums are computed in parallel and the calculation in the batch normalization layer is accelerated. This solves the problem of long computation time caused by the very large amount of data in the batch normalization layer operation, increases the speed of the batch normalization operation, and greatly shortens the time required to train the whole deep learning model.
An embodiment of the present invention provides a data processing apparatus for accelerating the operation of a batch normalization layer in a deep learning model. FIG. 5 is a schematic structural diagram of a data processing apparatus 500 according to an embodiment of the present invention. As shown in FIG. 5, the data processing apparatus 500 includes a first memory 501, a first register set 502, a second register set 503, a second memory 504 and an operator 505. The first register set 502 is connected to the first memory 501 and the operator 505, and the second memory 504 is connected to the second register set 503 and the operator 505. The first register set 502 and the second register set 503 have the same structure and each comprise M rows and N columns of registers. The second memory 504 can store at least N rows and K columns of data, where K is not less than 2M. Multi-dimensional tensor data are provided as input to the batch normalization layer, wherein:
the first memory 501 is configured to store the multi-dimensional tensor data, one dimension of which is the channel, the multi-dimensional tensor data including first tensor data and second tensor data;
the first register set 502 is used to store a first matrix comprising data fetched from the first memory 501 by channel;
the second register set 503 is used to store a second matrix; any row of the second matrix contains initial data of 1 row and Q columns whose elements are all 1, where Q is not greater than N;
the second memory 504 is configured to store a fourth matrix; the second memory 504 may be implemented with common storage hardware used for matrix operations; the fourth matrix comprises the transpose of the second matrix and the transpose of a third matrix, and the elements of the third matrix correspond one-to-one in position to the elements of the first matrix;
the operator 505 is configured to multiply the first matrix by the fourth matrix, and the obtained result contains the element sum and the element product sum of each row of the first matrix.
For a better understanding of the present invention, the solution disclosed herein is described below with reference to specific application scenarios.
In the calculation of the batch normalization layer during deep learning model training, back propagation is required, and in back propagation the derivatives with respect to gamma (γ) and beta (β) must be computed according to the following formulas:

$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}\,\hat{x}_i$$

$$\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}$$

where $\partial L/\partial y_i$ is a known quantity representing the error back-propagated to the batch normalization layer, and $\hat{x}_i$ is the normalized data, also a known quantity. In the present invention, m = B×H×W (i.e. the product of the batch size, the height and the width, which is the number of elements in each channel). The data processing apparatus and method provided by the invention are used to derive γ and β. Deriving β amounts to computing the element sum $\sum_i \partial L/\partial y_i$; deriving γ amounts to computing the element product sum of $\partial L/\partial y_i$ and $\hat{x}_i$. Both $\partial L/\partial y_i$ and $\hat{x}_i$ are four-dimensional tensor data; in this example it is assumed that the specific data of $\partial L/\partial y_i$ are the first tensor data shown in FIG. 3A and the specific data of $\hat{x}_i$ are the second tensor data shown in FIG. 3B.
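Written directly in NumPy for reference, the two reductions look as follows; the arrays dy and x_hat are placeholders for $\partial L/\partial y_i$ and $\hat{x}_i$, and the sketch illustrates only the quantities being computed, not the hardware data path.

```python
import numpy as np

B, H, W, C = 1, 5, 4, 64
rng = np.random.default_rng(0)
dy = rng.random((B, H, W, C))       # placeholder for the error back-propagated to the BN layer
x_hat = rng.random((B, H, W, C))    # placeholder for the normalized data

# Derivative with respect to beta: the per-channel element sum over m = B*H*W positions.
dbeta = dy.sum(axis=(0, 1, 2))               # shape (C,)

# Derivative with respect to gamma: the per-channel element product sum.
dgamma = (dy * x_hat).sum(axis=(0, 1, 2))    # shape (C,)
```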
In a preferred embodiment of the invention the third matrix is stored in a third register set. FIG. 6A is a schematic diagram of another data processing apparatus 600 according to an embodiment of the invention. As shown in FIG. 6A, the data processing apparatus 600 includes a first memory 601, a first register set 602, a second register set 603, a second memory 604, an operator 605 and a third register set 606. The first register set 602 is connected to the first memory 601 and the operator 605, the second memory 604 is connected to the second register set 603 and the operator 605, and the third register set 606 is connected to the first memory 601 and the second memory 604. The first register set 602, the second register set 603 and the third register set 606 have the same structure and each comprise 4 rows and 32 columns of registers, as shown in FIG. 6B. The second memory 604 can store 32 rows and 8 columns of data. The first memory 601 is an SRAM memory; this embodiment prefers an SRAM memory as the first memory 601, but it is understood that in other embodiments of the invention the first memory 601 may be another type of memory. The specific process of deriving γ and β using the data processing apparatus 600 is as follows:
First, $\partial L/\partial y_i$ and $\hat{x}_i$ are placed into the first memory 601 in the cross order of channel elements described above;
then, the first tensor data of FIG. 3A are fetched from the first memory 601 by channel according to the memory addresses and placed into the first register set 602: the data of channel C0 (0-19) are placed into the first row of the first register set 602, the data of channel C1 (20-39) into the second row, the data of channel C2 into the third row, and the data of channel C3 into the fourth row. A row of the first register set 602 can hold up to 32 data while each channel has only 20, so the remaining 12 registers of each row hold no data and are padded with 0. At this point all the data in the first register set 602 constitute the first matrix, whose size, as shown in FIG. 6C, is 4×32.
Initial data of 1 row and 20 columns, all of whose elements are 1, are placed into the first row of the second register set 603, and the remaining 12 registers of the first row as well as the other three rows of registers are padded with 0. At this point all the data in the second register set 603 constitute the second matrix, whose size, as shown in FIG. 6D, is 4×32. After the second matrix is obtained, it is transposed and placed into the second memory 604.
The second tensor data of FIG. 3B are fetched by channel and placed into the third register set 606: the data of channel C0 (0-19) are placed into the first row of the third register set 606, the data of channel C1 into the second row, the data of channel C2 into the third row, and the data of channel C3 into the fourth row. A row of the third register set 606 can hold up to 32 data while each channel has only 20, so the 12 unused registers of each row are padded with 0. At this point all the data in the third register set 606 constitute the third matrix, whose size, as shown in FIG. 6E, is 4×32. The elements of the third matrix correspond one-to-one in position to the elements of the first matrix, where position means the position in the four-dimensional tensor data, denoted (B, H, W, C): in FIG. 3A the first element 0 of channel C0 is at (0,0,0,0) and its last element 19 is at (0,4,3,0); the second element 21 of channel C1 is at (0,0,1,1) and its eleventh element 30 is at (0,1,2,1); the sixth element 1265 of channel C63 is at (0,1,1,63) and its nineteenth element 1278 is at (0,4,2,63). For example, the first element 0 at (0,0,0,0), the second element 1 at (0,0,1,0) and the third element 2 at (0,0,2,0) of the first row of the third matrix correspond respectively to the first element 0, the second element 1 and the third element 2 of the first row of the first matrix at the same positions; the first element 0 at (0,0,0,1), the second element 1 at (0,0,1,1) and the third element 2 at (0,0,2,1) of the second row of the third matrix correspond respectively to the first element 20, the second element 21 and the third element 22 of the second row of the first matrix at the same positions; and so on, so that every element of the third matrix corresponds in position to an element of the first matrix. After the third matrix is obtained, it is transposed and placed into the second memory 604.
The processes of constructing the first matrix, the second matrix and the third matrix described above may be carried out simultaneously. Both the second matrix and the third matrix are transposed and placed into the second memory 604 by columns, with the transpose of the third matrix placed after the transpose of the second matrix, as shown in FIG. 6F. All the data in the second memory constitute a fourth matrix of size 32×8.
Finally, the first matrix in the first register set 602 and the fourth matrix in the second memory 604 are sent to the operator 605, which multiplies the first matrix by the fourth matrix to obtain the multiplication result shown in FIG. 6G. As shown in FIG. 6G, the multiplication result is a 4×8 matrix in which the four data elements of the first column, from top to bottom, are the element sums of the four rows of the first matrix, i.e. the sums of all elements of channels C0, C1, C2 and C3 respectively. The first element (from the top) of the fifth column is the element product sum of the first row of the first matrix and the first row of the third matrix, i.e. the sum of the products of corresponding elements of those two rows, namely 0×0+1×1+…+19×19+0×0+…+0×0; the second element of the sixth column is the element product sum of the second rows of the first and third matrices, namely 20×0+21×1+…+39×19+0×0+…+0×0; the third element of the seventh column is the element product sum of the third rows of the first and third matrices, namely 40×0+41×1+…+59×19+0×0+…+0×0; and the fourth element of the eighth column is the element product sum of the fourth rows of the first and third matrices, namely 60×0+61×1+…+79×19+0×0+…+0×0. The four element sums are the derivatives of β (beta) for channels C0 to C3, and the four element product sums are the derivatives of γ (gamma) for channels C0 to C3. The subsequent derivation of γ and β for the remaining 60 channels proceeds in the same way and is not repeated here.
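The layout of FIG. 6C to FIG. 6G for channels C0 to C3 can be checked with a short NumPy sketch. It is an illustration only, and the values of the second tensor data are inferred from the products listed above rather than taken from FIG. 3B itself.

```python
import numpy as np

# First matrix (FIG. 6C): channels C0..C3 of the first tensor data, 20 elements
# per channel in one row of the 4x32 register set, the rest padded with zeros.
first = np.zeros((4, 32))
for c in range(4):
    first[c, :20] = np.arange(20) + 20 * c          # C0: 0..19, C1: 20..39, C2: 40..59, C3: 60..79

# Third matrix (FIG. 6E): channels C0..C3 of the second tensor data; the products
# listed above (0*0 + 1*1 + ... + 19*19, 20*0 + 21*1 + ..., ...) imply that each
# of these rows holds the values 0..19.
third = np.zeros((4, 32))
third[:, :20] = np.arange(20)

# Second matrix (FIG. 6D): 20 ones in the first row, zeros elsewhere.
second = np.zeros((4, 32))
second[0, :20] = 1.0

# Fourth matrix (FIG. 6F): transposed second matrix followed by transposed third matrix.
fourth = np.hstack([second.T, third.T])             # shape 32x8

result = first @ fourth                             # shape 4x8, as in FIG. 6G
elem_sums = result[:, 0]                            # first column: element sums of C0..C3 (derivatives of beta)
prod_sums = np.diag(result[:, 4:])                  # diagonal of columns 5..8: element product sums (derivatives of gamma)
```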
By providing the third register set to receive and store the third matrix, this embodiment of the invention allows the first matrix, the second matrix and the third matrix to be processed in parallel during the operation. In the back propagation of the batch normalization operation, γ (gamma) and β (beta) of every channel of the four-dimensional tensor data must be derived, which requires summing $\partial L/\partial y_i$ over each channel and summing the products of $\partial L/\partial y_i$ and $\hat{x}_i$ over each channel (i.e. multiplying corresponding elements and then adding). The data processing apparatus 600 of the present application can perform the derivations of γ (gamma) and β (beta) in parallel, which increases the speed of the back propagation calculation of the batch normalization operation, thereby accelerating the batch normalization operation and further shortening the time required to train the whole deep learning model.
In another preferred embodiment of the invention the third matrix is stored in the second register set. FIG. 7A is a schematic diagram of another data processing apparatus 700 according to an embodiment of the invention. As shown in FIG. 7A, the data processing apparatus 700 includes a first memory 701, a first register set 702, a second register set 703, a second memory 704 and an operator 705. The first register set 702 is connected to the first memory 701 and the operator 705, and the second memory 704 is connected to the second register set 703 and the operator 705. The first register set 702 and the second register set 703 have the same structure and each comprise 4 rows and 8 columns of registers. The second memory 704 can store 32 rows and 16 columns of data. Because the first register set 702 consists of 4×8 registers, at most 8 data can be placed in one of its rows, so deriving γ and β of each channel with the data processing apparatus 700 takes four rounds: the first round processes the first eight data of each channel and obtains the element sum and element product sum of those 8 elements; the second round processes the next eight data of each channel and obtains the element sum and element product sum of those 8 elements; the third round processes the last 4 data of each channel and obtains their element sum and element product sum; and the fourth round adds up the element sums and the element product sums of the first three rounds respectively, giving the derivation results of γ and β for each channel.
The specific process of deriving γ and β using the data processing apparatus 700 is as follows:
First, $\partial L/\partial y_i$ and $\hat{x}_i$ are placed into the first memory 701 in the continuous order of channel elements described above;
then the first round of computation is performed: the first tensor data of FIG. 3A are fetched from the first memory 701 according to the memory addresses and placed into the first register set 702, i.e. the first eight data of channels C0 to C3 are placed into the first register set 702 respectively, giving the first matrix shown in FIG. 7B.
Initial data of 1 row and 8 columns, all of whose elements are 1, are placed into any row of the second register set 703 (the third row is chosen in this embodiment), and the other registers of the second register set 703 are padded with 0. At this point all the data in the second register set 703 constitute the second matrix shown in FIG. 7C, whose size is 4×8. After the second matrix is obtained, it is transposed and placed into the second memory 704.
After the second matrix has been processed, the second tensor data of FIG. 3B are fetched by channel and placed into the second register set 703, i.e. the first eight data of channels C0 to C3 are placed into the second register set 703 respectively, giving the third matrix shown in FIG. 7D, whose size is 4×8. After the third matrix is obtained, it is transposed and placed into the second memory 704.
Because the second matrix and the third matrix are both stored through the second register set 703, the second register set 703 may be used to store the third matrix and, after the third matrix has been transposed into the second memory 704, be used to store the second matrix. Both the second matrix and the third matrix are transposed and placed into the second memory 704 by columns, with the transpose of the second matrix placed after the transpose of the third matrix, as shown in FIG. 7E. All the data in the second memory constitute a fourth matrix of size 32×16. In the fourth matrix, the data other than the transposes of the second and third matrices may be 0 or dirty data (i.e. unneeded, unknown data).
Finally, the first matrix in the first register set 702 and the fourth matrix in the second memory 704 are sent to the operator 705, which multiplies the first matrix by the fourth matrix to obtain the multiplication result shown in FIG. 7F. As shown in FIG. 7F, the multiplication result is a 4×16 matrix in which the four data elements of the seventh column, from top to bottom, are the element sums of the four rows of the first matrix, i.e. the sums of the first eight elements of channels C0, C1, C2 and C3 respectively; the first element of the first column is the element product sum of the first rows of the first and third matrices; the second element of the second column is the element product sum of their second rows; the third element of the third column is the element product sum of their third rows; and the fourth element of the fourth column is the element product sum of their fourth rows. After this process the second round of computation is carried out.
In the second round of computation the data in the first register set 702 are updated to the next eight data of channels C0 to C3 of FIG. 3A, the data in the second register set 703 are updated to the next eight data of channels C0 to C3 of FIG. 3B, the fourth matrix is updated accordingly, and the updated first matrix is multiplied by the updated fourth matrix. With that multiplication result the third round of computation is carried out: the last four data of channels C0 to C3 of FIG. 3A are placed into the first register set 702 with 4 zeros padded in each row; likewise the last four data of channels C0 to C3 of FIG. 3B are placed into the second register set 703 with 4 zeros padded in each row; the fourth matrix is updated and the updated first matrix is multiplied by the updated fourth matrix. With that multiplication result the fourth round of computation is carried out, in which the element sums and the element product sums obtained in the three previous rounds are added up respectively, giving the derivation results of γ (gamma) and β (beta) for channels C0 to C3. The subsequent derivation of γ and β for the remaining 60 channels proceeds in the same way and is not repeated here.
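The four-round procedure amounts to tiling each 20-element channel into chunks of 8, 8 and 4 elements and accumulating the partial results; the following NumPy sketch of that accumulation is illustrative only and uses placeholder values.

```python
import numpy as np

dy_rows = np.arange(4 * 20, dtype=float).reshape(4, 20)   # channels C0..C3 of the first tensor data (FIG. 3A)
xhat_rows = np.tile(np.arange(20, dtype=float), (4, 1))   # channels C0..C3 of the second tensor data

elem_sum = np.zeros(4)
prod_sum = np.zeros(4)
for start in (0, 8, 16):                                  # rounds 1-3: tiles of 8, 8 and 4 elements
    width = min(8, 20 - start)
    tile = np.zeros((4, 8)); tile[:, :width] = dy_rows[:, start:start + width]     # zero-padded like the register set
    wtile = np.zeros((4, 8)); wtile[:, :width] = xhat_rows[:, start:start + width]
    elem_sum += tile.sum(axis=1)                          # partial element sums of this round
    prod_sum += (tile * wtile).sum(axis=1)                # partial element product sums of this round

# Round 4: the accumulated totals are the derivatives of beta and gamma for C0..C3.
assert np.allclose(elem_sum, dy_rows.sum(axis=1))
assert np.allclose(prod_sum, (dy_rows * xhat_rows).sum(axis=1))
```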
In this embodiment of the invention the second matrix and the third matrix are received and stored in turn through the second register set 703, so that the first matrix can be processed in parallel with the second matrix or with the third matrix during the operation, and the data processing apparatus 700 can perform the derivations of γ (gamma) and β (beta) in parallel. This increases the speed of the back propagation calculation of the batch normalization operation, thereby accelerating the batch normalization operation and shortening the time required to train the whole deep learning model, while also saving hardware and reducing the cost of the data processing apparatus.
In another preferred embodiment of the invention the third matrix is stored in the first register set. FIG. 8A is a schematic diagram of another data processing apparatus 800 according to an embodiment of the invention. As shown in FIG. 8A, the data processing apparatus 800 includes a first memory 801, a first register set 802, a second register set 803, a second memory 804 and an operator 805. The first register set 802 is connected to the first memory 801, the second memory 804 and the operator 805, and the second memory 804 is connected to the second register set 803 and the operator 805. The first register set 802 and the second register set 803 have the same structure and each comprise 4 rows and 32 columns of registers. The second memory 804 can store at least 32 rows and 16 columns of data. The specific process of deriving γ and β using the data processing apparatus 800 is as follows:
First, the four-dimensional tensor data $\partial L/\partial y_i$ and $\hat{x}_i$ are placed into the first memory 801 in the continuous order of channel elements;
then the second tensor data of FIG. 3B are fetched by channel and placed into the first register set 802, i.e. the data of channels C0 to C3 are placed into rows one to four of the first register set 802 and the last 12 positions of each row are padded with 0. The first register set 802 now contains part of the second tensor data, and all the data in the first register set 802 constitute the third matrix, whose size, as shown in FIG. 8B, is 4×32. After the third matrix is obtained, it is transposed and placed into the second memory 804.
Initial data of 1 row and 20 columns, all of whose elements are 1, are placed into the second row of the second register set 803, and the remaining 12 registers of the second row as well as the other three rows of registers are padded with 0. The second register set 803 now contains the initial data, and all the data in the second register set 803 constitute the second matrix, whose size, as shown in FIG. 8C, is 4×32. After the second matrix is obtained, it is transposed and placed into the second memory 804.
The first tensor data of FIG. 3A are fetched by channel and placed into the first register set 802, i.e. the data of channels C0 to C3 are placed into rows one to four of the first register set 802 and the last 12 positions of each row are padded with 0. The first register set 802 now contains part of the first tensor data, and all the data in the first register set 802 constitute the first matrix, whose size, as shown in FIG. 8D, is 4×32.
In this process, because the first matrix and the third matrix are both stored through the first register set 802, the first register set 802 is first used to receive and store the third matrix and, after the third matrix has been transposed into the second memory 804, is used to receive and store the first matrix. The elements of the third matrix correspond one-to-one in position to the elements of the first matrix. Both the second matrix and the third matrix are transposed and placed into the second memory 804 by columns, with the transpose of the third matrix placed after the transpose of the second matrix, as shown in FIG. 8E. All the data in the second memory constitute a fourth matrix of size 32×8.
Finally, the first matrix in the first register set 802 and the fourth matrix in the second memory 804 are sent to the operator 805, and the operator 805 multiplies the first matrix by the fourth matrix to obtain the multiplication result shown in fig. 8F. As shown in fig. 8F, the multiplication result is a 4X16 matrix. The four elements from top to bottom in column 2 are respectively the sums of the elements in rows one to four of the first matrix, i.e., the sums of all elements in channels C0, C1, C2 and C3. The first element from the top of the fifth column is the sum of the products of the first-row elements of the first matrix and the first-row elements of the third matrix; the second element from the top of the sixth column is the sum of the products of the second-row elements of the first matrix and the second-row elements of the third matrix; the third element from the top of the seventh column is the sum of the products of the third-row elements of the first matrix and the third-row elements of the third matrix; and the fourth element from the top of the eighth column is the sum of the products of the fourth-row elements of the first matrix and the fourth-row elements of the third matrix. The four element sums are respectively the derivatives of beta (β) for channels C0 to C3, and the four element-product sums are respectively the derivatives of gamma (γ) for channels C0 to C3. The subsequent derivation of gamma (γ) and beta (β) for the remaining 60 channels follows the same procedure and is not repeated here.
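The whole procedure can be checked numerically with the following illustrative Python sketch. The data values are made up, and the split of the second memory into a used 32X8 region plus unused zero columns is an assumption adopted here to reproduce the 4X16 result of fig. 8F:

import numpy as np

rng = np.random.default_rng(0)
Q, M, N, K = 20, 4, 32, 16          # elements per channel, register-set rows/columns, second-memory columns

dy   = rng.standard_normal((M, Q))  # first tensor data, channels C0..C3 (e.g. dL/dy)
xhat = rng.standard_normal((M, Q))  # second tensor data, channels C0..C3 (e.g. normalized input)

pad = np.zeros((M, N - Q))
first_matrix  = np.hstack([dy, pad])          # 4x32, each row padded with 12 zeros
third_matrix  = np.hstack([xhat, pad])        # 4x32
second_matrix = np.zeros((M, N))
second_matrix[1, :Q] = 1.0                    # 1x20 all-ones initial data in row 2

second_memory = np.zeros((N, K))              # 32x16 second memory
second_memory[:, 0:4] = second_matrix.T       # transposed second matrix, stored by columns
second_memory[:, 4:8] = third_matrix.T        # transposed third matrix placed after it

result = first_matrix @ second_memory         # 4x16 multiplication result

# Column 2 (index 1): per-row sums of the first matrix -> derivatives of beta for C0..C3.
assert np.allclose(result[:, 1], dy.sum(axis=1))
# Diagonal of columns 5..8 (indices 4..7): per-row dot products of the first and
# third matrices -> derivatives of gamma for C0..C3.
assert np.allclose(np.diag(result[:, 4:8]), (dy * xhat).sum(axis=1))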
According to the embodiment of the invention, the first register set 802 first receives and stores the third matrix and then receives and stores the first matrix, so that the second matrix and the first matrix, or the second matrix and the third matrix, can be processed in parallel during operation, and the data processing apparatus 800 of the present application can execute the operations for deriving gamma (γ) and beta (β) in parallel. This improves the calculation speed of both the forward-propagation and back-propagation phases of the batch normalization operation, accelerates the batch normalization operation as a whole, shortens the time required to train the deep learning model, saves hardware, and reduces the cost of the data processing apparatus. It should be noted that the specific process of the data processing method and the data processing apparatus provided in the present application, when used for forward propagation in the batch normalization operation, is similar to the foregoing embodiment; the difference is that during back propagation the elements of the first matrix and the elements of the third matrix come from different four-dimensional tensor data, whereas during forward propagation the elements of the third matrix and the elements of the first matrix come from the same four-dimensional tensor data.
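The forward-propagation case mentioned above can be sketched in the same illustrative way, assuming only that the forward pass needs the per-channel sums and sums of squares from which the mean and variance are computed; building the first and third matrices from the same tensor makes the same multiplication deliver exactly those statistics:

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 20))                      # channels C0..C3 of the input tensor

pad = np.zeros((4, 12))
first_matrix = third_matrix = np.hstack([x, pad])     # the same tensor feeds both matrices
second_matrix = np.zeros((4, 32))
second_matrix[1, :20] = 1.0                           # all-ones initial data in row 2

fourth_matrix = np.hstack([second_matrix.T, third_matrix.T])   # 32x8
result = first_matrix @ fourth_matrix                 # 4x8

sums    = result[:, 1]                 # per-channel sum of x   -> feeds the mean
sq_sums = np.diag(result[:, 4:8])      # per-channel sum of x^2 -> feeds the variance
assert np.allclose(sums, x.sum(axis=1))
assert np.allclose(sq_sums, (x * x).sum(axis=1))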
Fig. 9 is a schematic structural diagram of a chip according to an embodiment of the present invention. As shown in fig. 9, the chip 900 includes one or more processors 901, a communication interface 902 and a computer-readable storage medium 903, which may be connected by a bus or may communicate by other means such as wireless transmission. The embodiment of the present invention is illustrated with these components connected by a bus 904. The computer-readable storage medium 903 is configured to store instructions, and the processor 901, which includes the data processing apparatus disclosed in the above embodiments, is configured to execute the instructions stored in the computer-readable storage medium 903. The computer-readable storage medium 903 stores program code, and the processor 901 may invoke the program code stored in the computer-readable storage medium 903 to implement the relevant functions of the foregoing data processing apparatus; for details, refer to the relevant description in the foregoing embodiments, which is not repeated here.
It should be appreciated that, in embodiments of the present invention, the processor 901 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The communication interface 902 may be a wired interface (e.g., an Ethernet interface) or a wireless interface (e.g., a cellular network interface or a wireless local area network interface) for communicating with other modules or devices. For example, the communication interface 902 in the embodiment of the present application may specifically be configured to receive input data entered by a user, or to receive data from an external device, and so on.
The computer-readable storage medium 903 may include volatile memory, such as random access memory (Random Access Memory, RAM); it may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid State Drive, SSD); the computer-readable storage medium may also include a combination of the above types of memory. The computer-readable storage medium may be used to store a set of program code, so that the processor can invoke the program code stored in the computer-readable storage medium to perform the functions of the data processing apparatus described above.
It should be noted that fig. 9 is only one possible implementation of the embodiment of the present invention, and the chip may further include more or fewer components in practical applications, which is not limited herein. For details not shown or described in the embodiments of the present invention, reference may be made to the related descriptions in the foregoing method embodiments, which are not repeated here.
The embodiment of the invention also provides a computer-readable storage medium storing instructions which, when run on a processor, implement the flow of the data processing method described above. The storage medium includes ROM/RAM, magnetic disks, optical disks, and the like.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided herein, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a logical function division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and alternative arrangements included within the spirit and scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (13)

1. A data processing method for accelerating the operation of a batch normalization layer in deep learning model training, characterized in that multi-dimensional tensor data is provided as an input of the batch normalization layer, wherein dimensions in the multi-dimensional tensor data comprise channels, and the multi-dimensional tensor data comprise first tensor data and second tensor data; providing a first register set, a second register set, a first memory and a second memory, wherein the first register set and the second register set comprise M rows and N columns of registers, the second memory can store at least N rows and K columns of data, and the K is not less than 2M, and the method comprises the following steps:
storing the multidimensional tensor data into the first memory according to a preset rule;
acquiring a first matrix, taking out at least part of the first tensor data in the first memory according to the channel and placing the first tensor data into the first register set, wherein Q data in the same channel in the first tensor data can only be placed into one row in the first register set, and Q is not more than N; when Q is smaller than N, supplementing N-Q 0s to the row where the Q data are located in the first register set; when the data in the first register set includes a portion of the first tensor data, all of the data in the first register set constitutes the first matrix;
acquiring a second matrix, and placing initial data into any row of the second register set, wherein the size of the initial data is 1 row and Q columns and all of its elements are 1; when the data in the second register set includes the initial data, all the data in the second register set constitute the second matrix, and the data other than the initial data in the second matrix are all 0; and transposing the second matrix and then placing it into the second memory by columns;
obtaining a third matrix, the third matrix comprising at least part of the second tensor data, wherein the elements in the third matrix and the elements in the first matrix have a one-to-one correspondence in position, and Q data in the same channel in the at least part of the second tensor data exist only in one row of the third matrix; and transposing the third matrix and then placing the transposed third matrix into the second memory by columns;
wherein all the data in the second memory form a fourth matrix;
and carrying out matrix multiplication on the first matrix and the fourth matrix to obtain a multiplication result, wherein the multiplication result comprises respective element sums and element product sums of each row in the first matrix.
2. The data processing method of claim 1, further providing a third register set, the third register set comprising M rows and N columns of registers; the obtaining the third matrix includes: taking out at least part of the second tensor data in the first memory according to the channel and placing it into the third register set, wherein Q data in the same channel in the second tensor data can only be placed into one row in the third register set, and Q is not more than N; when Q is smaller than N, supplementing N-Q 0s in the row where the Q data are located in the third register set; all data in the third register set constitutes the third matrix.
3. The data processing method according to claim 1, wherein the third matrix is acquired through the second register set; the step of obtaining the third matrix is before or after the step of obtaining the second matrix; the obtaining the third matrix includes: fetching at least part of the second tensor data in the first memory according to the channel and placing it into the second register set; Q data in the same channel in the second tensor data can be placed into only one row in the second register set, wherein Q is not more than N; when Q is smaller than N, supplementing N-Q 0s to the row where the Q data are located in the second register set; when the data in the second register set includes a portion of the second tensor data, all of the data in the second register set constitutes the third matrix.
4. The data processing method according to claim 1, wherein the third matrix is acquired through the first register set; the step of obtaining the third matrix precedes the step of obtaining the first matrix; the obtaining the third matrix includes: fetching at least part of the second tensor data in the first memory according to the channel and placing it into the first register set; Q data in the same channel in the second tensor data can only be placed into one row in the first register set, Q is not more than N, and when Q is less than N, N-Q 0s are added to the row where the Q data are located in the first register set; when the data in the first register set includes a portion of the second tensor data, all of the data in the first register set constitutes the third matrix.
5. The data processing method of any of claims 1-4, wherein the multi-dimensional tensor data comprises four-dimensional tensor data, the four dimensions comprising a batch size B, a height H, a width W, and a channel C.
6. The data processing method of claim 5, wherein the predetermined rule comprises storing in a cross-channel element order.
7. The data processing method of claim 5, wherein the predetermined rule comprises storing in a sequential order of channel elements.
8. A data processing device for accelerating the operation of a batch normalization layer in deep learning model training, characterized by comprising a first memory, a first register set, a second memory, an operator and a second register set; the first register set is connected with the first memory and the operator, and the second memory is connected with the second register set and the operator; the first register set and the second register set each comprise M rows and N columns of registers; multi-dimensional tensor data is provided as input to the batch normalization layer, wherein,
the first memory is used for storing the multi-dimensional tensor data, wherein the dimensions in the multi-dimensional tensor data comprise channels, and the multi-dimensional tensor data comprise first tensor data and second tensor data;
the first register set is used for storing a first matrix, and the first matrix comprises at least part of the first tensor data;
the second register set is used for storing a second matrix; any one row of the second matrix comprises initial data, the initial data having a size of 1 row and Q columns with all of its elements being 1, and Q is not more than N;
The second memory is used for storing a fourth matrix, and the fourth matrix comprises a transpose matrix of the second matrix and a transpose matrix of a third matrix; the third matrix comprises at least part of the second tensor data, and elements in the third matrix are in one-to-one correspondence with elements in the first matrix in position;
the operator is used for multiplying the first matrix by the fourth matrix, and the obtained multiplication result comprises the respective element sums and element-product sums of each row in the first matrix.
9. The data processing apparatus of claim 8, further comprising a third register set comprising M rows and N columns of registers, the third register set connecting the first memory and the second memory; the third register set is used for storing the third matrix.
10. The data processing apparatus of claim 8, wherein the second register set is further configured to store a third matrix.
11. The data processing apparatus of claim 8, wherein the first register set is further coupled to the second memory, the first register set further configured to store the third matrix.
12. A chip, characterized by comprising at least the data processing device according to any one of claims 8 to 11.
13. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the data processing method of any of claims 1 to 7.
CN202010142406.9A 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium Active CN111814983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010142406.9A CN111814983B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010142406.9A CN111814983B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111814983A CN111814983A (en) 2020-10-23
CN111814983B true CN111814983B (en) 2023-05-30

Family

ID=72847637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010142406.9A Active CN111814983B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111814983B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967211A (en) * 2021-01-31 2021-06-15 成都商汤科技有限公司 Image processing method and device, computer equipment and storage medium
CN114579929B (en) * 2022-03-14 2023-08-08 海飞科(南京)信息技术有限公司 Accelerator execution method and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6901422B1 (en) * 2001-03-21 2005-05-31 Apple Computer, Inc. Matrix multiplication in a vector processing system
CN110555510A (en) * 2018-05-31 2019-12-10 耐能智慧股份有限公司 Method for compressing pre-trained deep neural network model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304922B (en) * 2017-01-13 2020-12-15 华为技术有限公司 Computing device and computing method for neural network computing
US10521225B2 (en) * 2017-06-29 2019-12-31 Oracle International Corporation Matrix multiplication at memory bandwidth
NZ759804A (en) * 2017-10-16 2022-04-29 Illumina Inc Deep learning-based techniques for training deep convolutional neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6901422B1 (en) * 2001-03-21 2005-05-31 Apple Computer, Inc. Matrix multiplication in a vector processing system
CN110555510A (en) * 2018-05-31 2019-12-10 耐能智慧股份有限公司 Method for compressing pre-trained deep neural network model

Also Published As

Publication number Publication date
CN111814983A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
EP3373210B1 (en) Transposing neural network matrices in hardware
CN111428879B (en) Data processing method, device, chip and computer readable storage medium
CN106844294B (en) Convolution algorithm chip and communication equipment
JP7433356B2 (en) Accessing data in multidimensional tensors using adders
EP3539059B1 (en) Performing kernel striding in hardware
US11645529B2 (en) Sparsifying neural network models
CN111247527B (en) Method and device for determining characteristic images in convolutional neural network model
EP3179415A1 (en) Systems and methods for a multi-core optimized recurrent neural network
CN108416436A (en) The method and its system of neural network division are carried out using multi-core processing module
TWI740274B (en) System, computer-implemented method, and apparatus for accessing data in multi-dimensional tensors using adders
CN111814983B (en) Data processing method, device, chip and computer readable storage medium
CN109992743A (en) Matrix multiplier
CN107729989A (en) A kind of device and method for being used to perform artificial neural network forward operation
CN103955446B (en) DSP-chip-based FFT computing method with variable length
CN112836813A (en) Reconfigurable pulsation array system for mixed precision neural network calculation
CN110580522A (en) Convolution calculation method and related equipment
CN109902821B (en) Data processing method and device and related components
CN112966729B (en) Data processing method and device, computer equipment and storage medium
JP7401513B2 (en) Sparse matrix multiplication in hardware
CN110163793B (en) Convolution calculation acceleration method and device
CN113704689A (en) Matrix multiplier processing method and device based on soar AI processor
CN116781484B (en) Data processing method, device, computer equipment and storage medium
TWI788257B (en) Method and non-transitory computer readable medium for compute-in-memory macro arrangement, and electronic device applying the same
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
US20200104669A1 (en) Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210209

Address after: 311201 No. 602-11, complex building, 1099 Qingxi 2nd Road, Hezhuang street, Qiantang New District, Hangzhou City, Zhejiang Province

Applicant after: Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.

Address before: 518057 5-15, block B, building 10, science and technology ecological park, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Xinying Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant