CN113050988A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN113050988A
Authority
CN
China
Prior art keywords
input data
data block
register
memory
pointer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911377059.1A
Other languages
Chinese (zh)
Inventor
陈凯亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN201911377059.1A
Publication of CN113050988A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/30007 - Arrangements for executing specific machine instructions to perform operations on data operands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 - Register arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The embodiments of this specification provide a data processing method and apparatus, in which a first input data block of first input data and a second input data block of second input data are loaded directly from a memory into registers, and matrix multiplication is then performed on the two blocks in the registers; the width of the first input data is much smaller than the width of the second input data.

Description

Data processing method and device
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method and apparatus.
Background
Matrix multiplication plays an extremely important role in many high-performance computing scenarios. For example, in the Convolutional Neural Networks (CNNs) that are in wide use today, a large share of the total runtime is spent on matrix multiplication. Optimizing the performance of matrix multiplication is therefore of great importance for reducing the runtime of high-performance computations.
Disclosure of Invention
The present disclosure provides a data processing scheme.
According to a first aspect of embodiments of the present disclosure, there is provided a data processing method, the method including: loading a first input data block in first input data from a memory to a first register; loading a second input data block in second input data from the memory to a second register; performing matrix multiplication on the first input data block in the first register and the second input data block in the second register to obtain an output data block of output data; wherein the width of the first input data is much smaller than the width of the second input data.
In some embodiments, said loading a second input data block in second input data from said memory to a second register comprises: loading a first element stored at a first target address to the second register based on a first pointer pointing to the first target address in the memory, wherein the first element is the last element of a jth row in the second input data block; jumping a read pointer from the first pointer to a second pointer pointing to a second target address in the memory, wherein the distance between the second pointer and the first pointer is the difference between the number of columns of the second input data and the number of columns of the second input data block; and loading a second element stored at the second target address to the second register based on the second pointer.
In some embodiments, the method further comprises: storing the output data block to a third register; and loading the output data block in the third register to the memory.
In some embodiments, the loading the output data block in the third register to the memory includes: storing the last element of an mth row of the output data block from the third register to a third target address in the memory based on a third pointer pointing to the third target address; jumping a write pointer from the third pointer to a fourth pointer pointing to a fourth target address in the memory, wherein the distance between the third target address and the fourth target address is the difference between the number of columns of the output data and the number of columns of the output data block; and storing the first element of the (m+1)th row of the output data block from the third register to the fourth target address based on the fourth pointer.
In some embodiments, the loading a first input data block in the first input data from the memory into the first register includes: loading a fifth element stored at a fifth target address of the memory to the first register based on a fifth pointer pointing to the fifth target address, wherein the fifth element is the last element of an ith column in the first input data block; jumping a read pointer from the fifth pointer to a sixth pointer pointing to a sixth target address in the memory, wherein the distance between the sixth target address and the fifth target address is the difference between the number of rows of the first input data and the number of rows of the first input data block; and loading a sixth element stored at the sixth target address to the first register based on the sixth pointer.
In some embodiments, the height of the first input data is much smaller than the width of the second input data.
In some embodiments, the first input data is a convolution kernel parameter of a neural network, and the second input data is a feature map of an image.
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus, the apparatus comprising: the first loading module is used for loading a first input data block in first input data from the memory to the first register; the second loading module is used for loading a second input data block in second input data from the memory to a second register; the processing module is used for performing matrix multiplication processing on the first input data block in the first register and the second input data block in the second register to obtain an output data block of output data; wherein the width of the first input data is much smaller than the width of the second input data.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method of any of the embodiments.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of the embodiments when executing the program.
According to the embodiments of the present disclosure, the first input data block and the second input data block to be multiplied are loaded directly from the memory into registers, and the matrix multiplication is then performed on them in the registers. Compared with loading the blocks from the memory into a cache and then from the cache into registers, this avoids an unnecessary data processing step, reducing the time consumed by matrix multiplication and improving data processing efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1A is a schematic diagram of a conventional matrix multiplication process.
Fig. 1B is a schematic diagram of a matrix multiplication process of an embodiment of the disclosure.
Fig. 2 is a flow chart of a data processing method of an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of data chunking according to an embodiment of the disclosure.
FIG. 4A is a schematic diagram of data rearrangement in some embodiments.
Fig. 4B is a schematic diagram of the read-pointer movement in the cache after data rearrangement.
Fig. 5A is a schematic diagram of loading data from a memory according to an embodiment of the disclosure.
Fig. 5B is a schematic diagram of a pointer movement process of an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of data chunking according to another embodiment of the present disclosure.
Fig. 7 is a block diagram of a data processing apparatus of an embodiment of the present disclosure.
FIG. 8 is a schematic diagram of a computer device of an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosed embodiments, as detailed in the appended claims.
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present disclosure. As used in the disclosed embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information in the embodiments of the present disclosure, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. The word "if" as used herein may be interpreted, depending on the context, as "when", "upon", or "in response to determining".
In order to make the technical solutions in the embodiments of the present disclosure better understood and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Matrix multiplication plays an extremely important role in many high-performance computing scenarios. For example, in the Convolutional Neural Networks (CNNs) that are in wide use today, one implementation converts the convolution process into a set of matrix multiplications and obtains the convolution result from the results of those multiplications. A significant portion of a convolutional neural network's runtime therefore comes from matrix multiplication, and optimizing the performance of matrix multiplication is of great importance for reducing the runtime of convolutional neural networks.
As shown in fig. 1A, a conventional implementation of matrix multiplication preloads data blocks into a cache and reuses data by reading the same cached block as many times as possible. Since the access overhead of the cache is much lower than that of the memory, this saves overall data reading time and reduces data processing latency.
In general, this method achieves good performance. However, loading data from the memory into the cache itself incurs a time overhead; moreover, to make subsequent reads from the cache efficient, the loaded data blocks are usually rearranged, which takes extra time. In some cases, the latency introduced by this loading and rearrangement even exceeds the time saved by data reuse.
Consider the matrix multiplication A × B = C, where A is an M × K matrix, B is a K × N matrix, and C is an M × N matrix: M is the height of A and C, N is the width of B and C, and K is the width of A and the height of B. When M and K are small and N is large, B and C are large, so cache-level blocking and rearrangement take a long time; at the same time, because M and K are small, the reusability of data in the cache is low. Together, these two factors mean that the conventional matrix multiplication approach can actually reduce overall performance. In the convolutional neural network MobileNetV2, for example, there are many such matrix multiplications, such as the two convolutions near the front of the network: the first corresponds to a matrix multiplication with M = 32, N = 12544, K = 32, and the second to one with M = 96, N = 12544, K = 16. In both cases, N is much larger than M and K.
The embodiments of the present disclosure provide a data processing method that targets the above scenarios. As shown in fig. 1B, data blocks are loaded from the memory directly into registers, rather than first into a cache and then into registers, which eliminates unnecessary data processing steps. Fig. 2 is a flowchart of a data processing method according to an embodiment of the disclosure. The method may comprise:
step 201: loading a first input data block in first input data from a memory to a first register;
step 202: loading a second input data block in second input data from the memory to a second register;
step 203: performing matrix multiplication on the first input data block in the first register and the second input data block in the second register to obtain an output data block of output data; wherein the width of the first input data is much smaller than the width of the second input data.
In the disclosed embodiments, the first input data block (data block A) is the multiplicand (i.e., the block before the multiplication sign) and the second input data block (data block B) is the multiplier (i.e., the block after the multiplication sign). In some embodiments, the first input data block may be one block obtained by partitioning the first input data, and the second input data block one block obtained by partitioning the second input data. In a CNN scenario, the first input data may be the convolution kernel parameters of a neural network, and the second input data may be a feature map of an image. In one common scenario, the first input data may comprise tens of columns while the second input data comprises thousands to tens of thousands of columns. Unlike the cache-based blocking of earlier data processing approaches, the blocking in the embodiments of the present disclosure is register-based. Note that blocking is a logical concept; no additional memory is needed to store the blocks themselves.
The first input data and the second input data may be stored in the memory in advance; when matrix multiplication is required, data block A of the first input data and data block B of the second input data are loaded from the memory into the first and second registers, respectively. Because the number of available registers varies between processor models, the numbers of first and second registers may be predetermined according to the processor model, and the blocking scheme may be determined from the number of registers, the register width, and the length of the input data. The first and second registers may all be scalar registers or all be vector registers. Taking the first registers as an example: if they are scalar registers, each first register stores one element of the first input data block; if they are vector registers, each first register stores several elements of it. The second registers behave likewise. With vector registers, each register holds several elements, so each multiply step processes several elements at once, further improving data processing efficiency.
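As a purely illustrative sketch (not the claimed implementation), the following C fragment shows the scalar/vector register distinction on an Arm NEON target: one vector register holds four float elements, so a single fused multiply-add updates four output elements at once. The function name and the width of four are assumptions for illustration.

    #include <arm_neon.h>

    /* Illustrative sketch: one NEON vector register holds four float
     * elements of the second input data block, so a single fused
     * multiply-add updates four output elements at once. */
    static void fma_four(const float *b_row, /* 4 consecutive B elements */
                         float a_elem,       /* one A element            */
                         float *c_row)       /* 4 output elements        */
    {
        float32x4_t b = vld1q_f32(b_row);    /* 4 elements into one vector register */
        float32x4_t c = vld1q_f32(c_row);
        float32x4_t a = vdupq_n_f32(a_elem); /* broadcast the A element to all 4 lanes */
        c = vfmaq_f32(c, a, b);              /* c += a * b across all 4 lanes */
        vst1q_f32(c_row, c);                 /* write the 4 results back */
    }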
Fig. 3 shows a schematic diagram of data blocking in some embodiments: the first input data is divided into three first input data blocks, one per row, and the second input data into six second input data blocks, one per column. Those skilled in the art will understand that the actual blocking is not limited to this; the first input data may instead be blocked by columns, or by both rows and columns, and the second input data may likewise be blocked by rows, or by both rows and columns.
According to the block division manner shown in fig. 3, a first input data block corresponding to a first row in the first input data and a second input data block corresponding to a first column in the second input data are subjected to matrix multiplication to obtain an output data block of the first row and the first column in the output data; performing matrix multiplication on a first input data block corresponding to a first row in the first input data and a second input data block corresponding to a second column in the second input data to obtain an output data block of the first row and the second column in the output data; and so on. Wherein the number of columns of each first input data block is equal to the number of rows of one second input data block.
In other data processing modes, the first input data and the second input data are partitioned based on the cache to obtain a first input data block and a second input data block, the first input data block and the second input data block are loaded into the cache from the memory, and then the first input data block and the second input data block are loaded into the register from the cache. In order to multiplex data and reduce the number of times of memory access, the calculation of a plurality of output elements can be realized by the data loaded from the memory to the cache each time. On the other hand, in order to enable the register to continuously and efficiently read data from the cache, the data is rearranged when the data is loaded from the memory into the cache.
Fig. 4A is a schematic diagram of data rearrangement in some embodiments. Taking the second input data as an example, suppose the second input data block is a 4 × 500 block arranged in memory as shown in fig. 4A. Columns 1 to 500 of row 1 of the block are stored at the first 500 consecutive addresses (memory address range 1); columns 1 to 500 of row 2 at 500 consecutive addresses further on (memory address range 2); columns 1 to 500 of row 3 at 500 consecutive addresses beyond range 2 (memory address range 3); and columns 1 to 500 of row 4 at 500 consecutive addresses beyond range 3 (memory address range 4). Address ranges 1 to 4 are not contiguous with one another.
After rearrangement, the second input data block is laid out contiguously in the cache: columns 1 to 500 of row 1 at the first 500 consecutive cache addresses (cache address range 1), columns 1 to 500 of row 2 at the next 500 (cache address range 2), columns 1 to 500 of row 3 at the following 500 (cache address range 3), and columns 1 to 500 of row 4 at the 500 after that (cache address range 4). Cache address ranges 1 to 4 are contiguous with one another.
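For contrast, a minimal C sketch of this conventional rearrangement (packing) step, copying the four non-contiguous rows of the 4 × 500 block into one contiguous buffer; the function name and signature are assumptions.

    #include <stddef.h>
    #include <string.h>

    /* Conventional cache-level packing (the step the present disclosure
     * avoids): copy an h x w block out of a row-major matrix whose row
     * stride is ld into one contiguous buffer, e.g. h = 4, w = 500. */
    static void pack_block(const float *src, size_t ld,
                           float *dst, size_t h, size_t w)
    {
        for (size_t r = 0; r < h; ++r)
            memcpy(dst + r * w, src + r * ld, w * sizeof *src);
    }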
After data rearrangement, each time the second register reads one datum from a cache address, the read pointer in the cache simply moves on to the next cache address. Fig. 4B illustrates this: the pointer moves from the 1st cache address to the 2nd, from the 2nd to the 3rd, and so on, from the (k-1)th to the kth and from the kth to the (k+1)th. The read pointer never jumps.
When the input data is wide, data rearrangement is time-consuming; and when the two input data blocks being multiplied differ greatly in size, the data reuse rate is low. As a result, when the two blocks differ greatly in scale, rearrangement costs more time than reuse saves, so loading the input data blocks from the memory into the cache does not improve the efficiency of the matrix multiplication but reduces it.
According to the embodiments of the present disclosure, the first and second input data blocks are loaded directly from the memory into registers, and the matrix multiplication is then performed in the registers. Compared with loading the blocks from the memory into the cache and then from the cache into registers, this avoids an unnecessary data processing step. When the width of the first input data is far smaller than that of the second input data, the time consumed by the matrix multiplication is effectively reduced and its efficiency improved; moreover, because each block multiplication takes less time, more input data blocks can be processed per unit time, improving throughput. The efficiency gain is especially pronounced when both the height and the width of the first input data are much smaller than the width of the second input data.
To avoid an unnecessary data rearrangement step, the embodiments of the present disclosure read the memory in skip mode: during a skip read, the read pointer in the memory jumps between non-contiguous addresses. In some embodiments, the loading a first input data block in the first input data from the memory into the first register includes: loading a fifth element stored at a fifth target address of the memory to the first register based on a fifth pointer pointing to the fifth target address, wherein the fifth element is the last element of an ith column in the first input data block; jumping a read pointer from the fifth pointer to a sixth pointer pointing to a sixth target address in the memory, wherein the distance between the sixth target address and the fifth target address is the difference between the number of rows of the first input data and the number of rows of the first input data block; and loading a sixth element stored at the sixth target address to the first register based on the sixth pointer.
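A minimal C sketch of this column-wise skip read, under the assumption (consistent with the sequential-then-jump pattern described above) that the first input data is stored column-major with a column stride equal to its total number of rows; all names are illustrative, and the destination buffer stands in for the first registers.

    #include <stddef.h>

    /* Skip-read an h_blk x w_blk block of a column-major matrix A with
     * h_total rows.  Reads are sequential within a column; at a column
     * boundary the read pointer jumps by (h_total - h_blk), the
     * difference between the row counts of A and of the block. */
    static void load_a_block(const float *a,   /* first element of the block */
                             size_t h_total, size_t h_blk, size_t w_blk,
                             float *regs)      /* h_blk * w_blk destination  */
    {
        const float *p = a;                    /* read pointer */
        for (size_t i = 0; i < w_blk; ++i) {
            for (size_t r = 0; r < h_blk; ++r)
                *regs++ = *p++;                /* sequential within column i */
            p += h_total - h_blk;              /* jump to the next column    */
        }
    }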
In other embodiments, the loading a second input data block in the second input data from the memory to the second register includes: loading a first element stored at a first target address to the second register based on a first pointer pointing to the first target address in the memory, wherein the first element is the last element of a jth row in the second input data block; jumping a read pointer from the first pointer to a second pointer pointing to a second target address in the memory, wherein the distance between the second pointer and the first pointer is the difference between the number of columns of the second input data and the number of columns of the second input data block; and loading a second element stored at the second target address to the second register based on the second pointer.
Fig. 5A is a schematic diagram of loading data from the memory, again taking the second input data block as an example. Assuming the block is 4 × 500, it is loaded from the memory into the second register with skip reads, i.e., in the order: row 1, columns 1 to 500; row 2, columns 1 to 500; row 3, columns 1 to 500; row 4, columns 1 to 500. This may be implemented by calling a program module. The first input data block is read in a similar way, except that it is read by columns; this is not repeated here.
As shown in fig. 5B, while the second input data block is loaded into the second register, the pointer normally advances to the next memory address after each read. When it reaches the address of the last element of a row of the block (for example, the 500th memory address in the figure), however, the next move is not to the immediately following address but to the address of the first element of the block's next row (for example, the (K+1)th memory address in the figure, where K is the width of the second input data). The length of this jump equals the difference between the width of the second input data and the width of the second input data block.
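A corresponding C sketch for the second input data block, assuming the row-major layout of figs. 5A and 5B (full row width K, block width 500 in the example); the function and parameter names are illustrative.

    #include <stddef.h>

    /* Skip-read an h_blk x w_blk block of a row-major matrix B whose
     * full width is w_total.  Reads are sequential within a row; at a
     * row boundary the pointer jumps by (w_total - w_blk), e.g. from
     * the 500th address to the (K+1)th. */
    static void load_b_block(const float *b,   /* first element of the block */
                             size_t w_total, size_t h_blk, size_t w_blk,
                             float *regs)
    {
        const float *p = b;                    /* read pointer */
        for (size_t r = 0; r < h_blk; ++r) {
            for (size_t j = 0; j < w_blk; ++j)
                *regs++ = *p++;                /* sequential within row r */
            p += w_total - w_blk;              /* jump past the rest of the row */
        }
    }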
In some embodiments, the method further comprises: storing the output data block to a third register, and loading the output data block from the third register to the memory. Before the output data block is stored into the third register, the third register may be initialized, either to zero or with initial values. In some embodiments, as shown in fig. 6, each column of the first input data may be regarded as a first input data block and each row of the second input data as a second input data block, stored in the corresponding first or second register. After an output data block is computed, it is accumulated into the third register. By the definition of matrix multiplication, the operation multiplies each element of a column of the first input data block with each element of a row of the second input data block and writes each resulting output element into the corresponding third register. This repeats: the next column of the first input data block and the next row of the second input data block are read directly each time, an output data block is computed, and the result is accumulated onto the value already in the corresponding third register. Once all input data blocks have been processed, the result is written directly back to the memory addresses that store the output data block.
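A compact C sketch of the accumulation just described: a small accumulator tile models the third registers, each step multiplies one column of the A block with one row of the B block (a rank-1 update), and the tile is written back only once at the end. The tile sizes MR and NR and all names are assumptions.

    #include <stddef.h>

    #define MR 4
    #define NR 4

    /* Hypothetical micro-kernel: C_tile += A_block * B_block, where the
     * A block is stored column by column and the B block row by row.
     * acc[][] models the third registers; a compiler normally keeps a
     * small fixed-size array like this in registers. */
    static void micro_kernel(const float *a,  /* MR x kc block, by columns   */
                             const float *b,  /* kc x NR block, by rows      */
                             float *c,        /* output tile, row stride ldc */
                             size_t kc, size_t ldc)
    {
        float acc[MR][NR] = {{0.0f}};         /* initialize the "third registers" */

        for (size_t p = 0; p < kc; ++p)       /* one column of A x one row of B */
            for (size_t i = 0; i < MR; ++i)
                for (size_t j = 0; j < NR; ++j)
                    acc[i][j] += a[p * MR + i] * b[p * NR + j];

        for (size_t i = 0; i < MR; ++i)       /* single write-back at the end */
            for (size_t j = 0; j < NR; ++j)
                c[i * ldc + j] += acc[i][j];
    }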
In some embodiments, the output data block is one block of the output data, and the loading the output data block in the third register to the memory includes: storing the last element of an mth row of the output data block from the third register to a third target address in the memory based on a third pointer pointing to the third target address; jumping a write pointer from the third pointer to a fourth pointer pointing to a fourth target address in the memory, wherein the distance between the third target address and the fourth target address is the difference between the number of columns of the output data and the number of columns of the output data block; and storing the first element of the (m+1)th row of the output data block from the third register to the fourth target address based on the fourth pointer.
In conventional data processing, after the output data block is produced in the registers it must be loaded from the registers into the cache and then from the cache into the memory; and to let the memory read from the cache continuously and efficiently, the data is rearranged when the output data block is loaded from the registers into the cache, in a manner similar to the rearrangement performed when the second input data block is loaded from the memory into the cache, which is not repeated here. In the embodiments of the present disclosure, by contrast, the data is loaded directly from the registers into the memory in skip mode, without rearrangement: the write pointer in the memory jumps between non-contiguous memory addresses, moving in the same way as the read pointer during skip reads. The scheme of the embodiments of the present disclosure thus removes unnecessary data loading and rearrangement steps.
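A C sketch of this direct, skip-mode write-back, mirroring the read-side sketches; the names are illustrative and the source buffer stands in for the third registers.

    #include <stddef.h>

    /* Skip-write an h_blk x w_blk output block straight back into the
     * row-major output matrix (full width w_total).  Writes are
     * sequential within a row; at a row boundary the write pointer
     * jumps by (w_total - w_blk). */
    static void store_c_block(float *c, size_t w_total,
                              size_t h_blk, size_t w_blk,
                              const float *regs)
    {
        float *p = c;                          /* write pointer */
        for (size_t r = 0; r < h_blk; ++r) {
            for (size_t j = 0; j < w_blk; ++j)
                *p++ = *regs++;                /* sequential within a row */
            p += w_total - w_blk;              /* jump to the block's next row */
        }
    }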
In addition, hardware resources on the processor allow data to be loaded from the memory into registers while computation on data already in the registers is in progress, so data loading and computation overlap, further reducing the latency of the matrix multiplication.
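The disclosure does not name a specific mechanism for this overlap; one common, purely illustrative way to express it in C is software prefetching with the GCC/Clang builtin __builtin_prefetch, which hints the hardware to begin loading the next data while the current update is computed. The kernel below is a hypothetical sketch.

    #include <stddef.h>

    #define MR 4
    #define NR 4

    /* Illustrative only: prefetch the next column of the A block and the
     * next row of the B block while the current rank-1 update runs, so
     * loading and computation overlap. */
    static void micro_kernel_prefetch(const float *a, const float *b,
                                      float acc[MR][NR], size_t kc)
    {
        for (size_t p = 0; p < kc; ++p) {
            __builtin_prefetch(a + (p + 1) * MR);  /* hint: next A column */
            __builtin_prefetch(b + (p + 1) * NR);  /* hint: next B row    */
            for (size_t i = 0; i < MR; ++i)
                for (size_t j = 0; j < NR; ++j)
                    acc[i][j] += a[p * MR + i] * b[p * NR + j];
        }
    }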
It should be noted that when the height of the second input data is much smaller than the height of the first input data, the conventional matrix multiplication approach can still perform reasonably well despite the large size difference between the two inputs, because the hardware's data prefetching works well in that case. When the width of the first input data is far smaller than the width of the second input data, however, hardware prefetching performs far worse, so the conventional approach yields poor results.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written implies neither a strict execution order nor any limitation on implementation; the specific execution order of the steps should be determined by their function and possible inherent logic.
As shown in fig. 7, the present disclosure also provides an apparatus comprising:
a first loading module 701, configured to load a first input data block in first input data from a memory to a first register;
a second loading module 702, configured to load a second input data block in second input data from the memory to a second register;
a processing module 703, configured to perform matrix multiplication on the first input data block in the first register and the second input data block in the second register to obtain an output data block of output data; wherein the width of the first input data is much smaller than the width of the second input data.
In some embodiments, the second loading module comprises: a first loading unit, configured to load, to the second register, a first element stored at a first target address based on a first pointer pointing to the first target address in the memory, wherein the first element is the last element of a jth row in the second input data block; a first jumping unit, configured to jump a read pointer from the first pointer to a second pointer pointing to a second target address in the memory, wherein the distance between the second pointer and the first pointer is the difference between the number of columns of the second input data and the number of columns of the second input data block; and a second loading unit, configured to load a second element stored at the second target address to the second register based on the second pointer.
In some embodiments, the apparatus further comprises: the storage module is used for storing the output data block to a third register; and the third loading module is used for loading the output data block in the third register to the memory.
In some embodiments, the third loading module comprises: a first storage unit, configured to store the last element of an mth row of the output data block from the third register to a third target address in the memory based on a third pointer pointing to the third target address; a second jumping unit, configured to jump a write pointer from the third pointer to a fourth pointer pointing to a fourth target address in the memory, wherein the distance between the third target address and the fourth target address is the difference between the number of columns of the output data and the number of columns of the output data block; and a second storage unit, configured to store the first element of the (m+1)th row of the output data block from the third register to the fourth target address based on the fourth pointer.
In some embodiments, the first loading module comprises: a third loading unit, configured to load a fifth element stored at a fifth target address of the memory into the first register based on a fifth pointer pointing to the fifth target address, wherein the fifth element is the last element of an ith column in the first input data block; a third jumping unit, configured to jump a read pointer from the fifth pointer to a sixth pointer pointing to a sixth target address in the memory, wherein the distance between the sixth target address and the fifth target address is the difference between the number of rows of the first input data and the number of rows of the first input data block; and a fourth loading unit, configured to load a sixth element stored at the sixth target address to the first register based on the sixth pointer.
In some embodiments, the height of the first input data is much smaller than the width of the second input data.
In some embodiments, the first input data is a convolution kernel parameter of a neural network, and the second input data is a feature map of an image.
In some embodiments, the functions of, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the method embodiments above; for specific implementations, reference may be made to the descriptions of those embodiments, which are not repeated here for brevity.
The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts shown as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this specification. A person of ordinary skill in the art can understand and implement this without inventive effort.
In some embodiments, the disclosed embodiments also provide a computer storage medium having a computer program stored thereon, which when executed by a processor, implements the method of any of the embodiments.
The embodiments of the present disclosure also provide a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the method according to any embodiment is implemented.
The apparatus embodiments of this specification can be applied to a computer device, such as a server or a terminal device. They may be implemented by software, by hardware, or by a combination of the two. Taking software implementation as an example, the apparatus in a logical sense is formed when the processor of the computer device in which it resides reads the corresponding computer program instructions from non-volatile storage into memory and runs them. In terms of hardware, as shown in fig. 8, the computer device in which the apparatus of this specification resides comprises the processor 801, memory 802, network interface 803 and non-volatile memory 804 shown in fig. 8, and may further include other hardware according to the actual function of the computer device, which is not described again.
The present disclosure may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.

Claims (14)

1. A method of data processing, the method comprising:
loading a first input data block in first input data from a memory to a first register;
loading a second input data block in second input data from the memory to a second register;
performing matrix multiplication on the first input data block in the first register and the second input data block in the second register to obtain an output data block of output data; wherein the width of the first input data is much smaller than the width of the second input data.
2. The method of claim 1, wherein loading a second block of input data in second input data from the memory to a second register comprises:
loading a first element stored at a first target address to the second register based on a first pointer pointing to the first target address in the memory, wherein the first element is the last element of a jth row in the second input data block;
jumping a read pointer from the first pointer to a second pointer pointing to a second target address in the memory, wherein the distance between the second pointer and the first pointer is the difference between the number of columns of the second input data and the number of columns of the second input data block;
loading a second element stored at the second target address to the second register based on the second pointer.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
storing the output data block to a third register;
and loading the output data block in the third register to the memory.
4. The method of claim 3, wherein loading the output data block in the third register to the memory comprises:
storing the last element of an mth row of the output data block from the third register to a third target address in the memory based on a third pointer pointing to the third target address;
jumping a write pointer from the third pointer to a fourth pointer pointing to a fourth target address in the memory, wherein the distance between the third target address and the fourth target address is the difference between the number of columns of the output data and the number of columns of the output data block;
storing the first element of the (m+1)th row of the output data block from the third register to the fourth target address based on the fourth pointer.
5. The method of any of claims 1 to 4, wherein the height of the first input data is much smaller than the width of the second input data.
6. The method of any one of claims 1 to 5, wherein the first input data is a convolution kernel parameter of a neural network, and the second input data is a feature map of an image.
7. A data processing apparatus, characterized in that the apparatus comprises:
the first loading module is used for loading a first input data block in first input data from the memory to the first register;
the second loading module is used for loading a second input data block in second input data from the memory to a second register;
the processing module is used for performing matrix multiplication processing on the first input data block in the first register and the second input data block in the second register to obtain an output data block of output data; wherein the width of the first input data is much smaller than the width of the second input data.
8. The apparatus of claim 7, wherein the second load module comprises:
a first loading unit, configured to load, to the second register, a first element stored at a first target address based on a first pointer pointing to the first target address in the memory, wherein the first element is the last element of a jth row in the second input data block;
a first jumping unit, configured to jump a read pointer from the first pointer to a second pointer pointing to a second target address in the memory, wherein the distance between the second pointer and the first pointer is the difference between the number of columns of the second input data and the number of columns of the second input data block;
a second loading unit, configured to load a second element stored at the second target address to the second register based on the second pointer.
9. The apparatus of claim 7 or 8, further comprising:
the storage module is used for storing the output data block to a third register;
and the third loading module is used for loading the output data block in the third register to the memory.
10. The apparatus of claim 9, wherein the third load module comprises:
a first storage unit, configured to store the last element of an mth row of the output data block from the third register to a third target address in the memory based on a third pointer pointing to the third target address;
a second jumping unit, configured to jump a write pointer from the third pointer to a fourth pointer pointing to a fourth target address in the memory, wherein the distance between the third target address and the fourth target address is the difference between the number of columns of the output data and the number of columns of the output data block;
a second storage unit, configured to store the first element of the (m+1)th row of the output data block from the third register to the fourth target address based on the fourth pointer.
11. The apparatus of any of claims 7 to 10, wherein the height of the first input data is much smaller than the width of the second input data.
12. The apparatus of any one of claims 7 to 11, wherein the first input data is a convolution kernel parameter of a neural network, and the second input data is a feature map of an image.
13. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 6.
14. A computer device comprising a memory for storing a computer program operable on a processor, wherein the processor implements the method of any one of claims 1 to 6 when executing the program.
CN201911377059.1A 2019-12-27 2019-12-27 Data processing method and device Pending CN113050988A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911377059.1A CN113050988A (en) 2019-12-27 2019-12-27 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911377059.1A CN113050988A (en) 2019-12-27 2019-12-27 Data processing method and device

Publications (1)

Publication Number Publication Date
CN113050988A 2021-06-29

Family

ID=76506353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911377059.1A Pending CN113050988A (en) 2019-12-27 2019-12-27 Data processing method and device

Country Status (1)

Country Link
CN (1) CN113050988A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7873812B1 (en) * 2004-04-05 2011-01-18 Tibet MIMAR Method and system for efficient matrix multiplication in a SIMD processor architecture
US20110040821A1 (en) * 2009-08-17 2011-02-17 International Business Machines Corporation Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture
CN108491359A (en) * 2016-04-22 2018-09-04 北京中科寒武纪科技有限公司 Submatrix arithmetic unit and method
CN106681694A (en) * 2016-12-30 2017-05-17 中国科学院计算技术研究所 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction
CN109325589A (en) * 2017-07-31 2019-02-12 华为技术有限公司 Convolutional calculation method and device
US20190079903A1 (en) * 2017-09-14 2019-03-14 Qualcomm Incorporated Providing matrix multiplication using vector registers in processor-based devices
CN113052291A (en) * 2019-12-27 2021-06-29 上海商汤智能科技有限公司 Data processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Xi: "Acceleration of Object Detection Algorithms Based on OpenCL", China Masters' Theses Full-text Database - Information Science and Technology, vol. 2019, 15 May 2019 (2019-05-15) *

Similar Documents

Publication Publication Date Title
US11586629B2 (en) Method and device of storing data object
CN113032007B (en) Data processing method and device
WO2022206556A1 (en) Matrix operation method and apparatus for image data, device, and storage medium
CN107391544B (en) Processing method, device and equipment of column type storage data and computer storage medium
JP6432450B2 (en) Parallel computing device, compiling device, parallel processing method, compiling method, parallel processing program, and compiling program
JP7008983B2 (en) Methods and equipment for accessing tensor data
CN114757112A (en) Motor parameter design method and system based on Hui wolf algorithm
CN116010299A (en) Data processing method, device, equipment and readable storage medium
US20220222318A1 (en) Performing tensor operations using a programmable control engine
CN107451070A (en) The processing method and server of a kind of data
CN111047037B (en) Data processing method, device, equipment and storage medium
CN113050988A (en) Data processing method and device
CN104598391A (en) Partitioning linear storage and reading method and system for two-dimensional matrix to be transposed
CN111027688A (en) Neural network calculator generation method and device based on FPGA
CN112988064B (en) Concurrent multitask-oriented disk graph processing method
US8495275B2 (en) List structure control circuit
CN113052291B (en) Data processing method and device
US9349155B2 (en) Computing apparatus and method using X-Y stack memory
CN116997911A (en) Accelerating convolutional neural networks to perform convolutional operations
US20160140034A1 (en) Devices and methods for linked list array hardware implementation
CN112308762A (en) Data processing method and device
JP2022074442A (en) Arithmetic device and arithmetic method
CN112256206A (en) IO processing method and device
CN110728367B (en) Data storage method and device for neural network
CN114091085B (en) Data access control system for binary operation and method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination