CN117892050A - Matrix operation method, device, equipment and medium based on multi-core hardware - Google Patents


Info

Publication number
CN117892050A
Authority
CN
China
Prior art keywords
matrix
block
data
original
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410070308.7A
Other languages
Chinese (zh)
Inventor
张思磊
刘君佩
张昊
张媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Silang Technology Co ltd
Original Assignee
Shanghai Silang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Silang Technology Co ltd filed Critical Shanghai Silang Technology Co ltd
Priority to CN202410070308.7A
Publication of CN117892050A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)

Abstract

The embodiment of the invention discloses a matrix operation method, device, equipment and medium based on multi-core hardware. The intra-core space of the multi-core hardware comprises a data storage area and a data operation area which execute in parallel. The method comprises the following steps: acquiring an original left matrix and an original right matrix to be operated; determining the target block number of the matrix blocking operation according to the matrix dimensions, the number of cores of the multi-core hardware and the operation execution degree of each core; performing matrix blocking to obtain a plurality of left matrix blocks and right matrix blocks; acquiring each matrix block through the data storage area, and setting matrix storage positions according to the correspondence between the left matrix blocks and the right matrix blocks; and, through the data operation area, reading the matrix block data corresponding to the matrix blocking operation from the data storage area, performing the operation, and updating and storing the operation result to the corresponding target matrix storage position. In this way, the hardware execution capacity can be fully utilized and the hardware execution performance improved; data transmission time is saved and operation performance is improved.

Description

Matrix operation method, device, equipment and medium based on multi-core hardware
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a matrix operation method, apparatus, device, and medium based on multi-core hardware.
Background
With the development of big data technology, more and more fields require operations on high-order matrices. Matrix multiplication is one of the core modules commonly used in high performance computing (High Performance Computing, HPC). It is a typical compute-intensive and memory-intensive application that places very high demands on the processor's multiply-accumulate (Multiply Accumulate, MAC) capability and memory bandwidth, and its computational time complexity is high, approximately O(N^3), where N is the matrix size.
In the prior art, matrix multiplication is implemented with a traditional triple loop. The triple-loop implementation has a low ratio of computation to memory access, frequent cache misses, and a large overhead for moving matrix data, so the operation efficiency of the processor is low.
In particular, for high-order matrix operations, taking a double-precision matrix of order 16384 as an example, one matrix occupies 2 GB of space; during the operation, the two input matrices need 4 GB, and with the storage of the result, 6 GB of space is needed in total. The in-core space is far from sufficient to hold such a large amount of data, so the input data must be stored in the off-core space. However, placing the data outside the core greatly increases the time spent transferring data into the core before each calculation, resulting in low hardware execution efficiency.
Disclosure of Invention
The invention provides a matrix operation method, device, equipment and medium based on multi-core hardware, so as to fully utilize the hardware execution capacity and improve the hardware execution performance, save data transmission time, and improve operation performance.
According to an aspect of the present invention, there is provided a matrix operation method based on multi-core hardware, wherein the intra-core space of the multi-core hardware includes a data storage area and a data operation area which execute in parallel; the method comprises the following steps:
acquiring an original left matrix and an original right matrix to be operated;
determining the number of target blocks of matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware and the operation execution degree of each core;
the original left matrix and the original right matrix are subjected to block division according to the target block number, so that a plurality of left matrix blocks and a plurality of right matrix blocks are obtained;
acquiring each left matrix block and each right matrix block through the data storage area, and setting a matrix storage position according to the corresponding relation of the left matrix block and the right matrix block;
and reading matrix partitioning data corresponding to matrix partitioning operation in the data storage area through the data operation area, performing operation, and updating and storing an operation result to a corresponding target matrix storage position.
According to another aspect of the present invention, there is provided a matrix operation device based on multi-core hardware, wherein the intra-core space of the multi-core hardware includes a data storage area and a data operation area which execute in parallel; the device comprises:
the original matrix acquisition module is used for acquiring an original left matrix and an original right matrix to be operated;
the target block number determining module is used for determining the target block number of matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the core number of the multi-core hardware and the operation execution degree of each core;
the block processing module is used for carrying out block processing on the original left matrix and the original right matrix according to the number of the target blocks to obtain a plurality of left matrix blocks and a plurality of right matrix blocks;
the matrix storage position setting module is used for acquiring each left matrix block and each right matrix block through the data storage area and setting matrix storage positions according to the corresponding relation of the left matrix block and the right matrix block;
and the matrix partitioning operation module is used for reading matrix partitioning data corresponding to matrix partitioning operation in the data storage area through the data operation area, performing operation, and updating and storing an operation result to a corresponding target matrix storage position.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the matrix operation method based on multi-core hardware according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the matrix operation method based on multi-core hardware according to any one of the embodiments of the present invention when executed.
According to the technical scheme of the embodiment of the invention, an original left matrix and an original right matrix to be operated are obtained; the target block number of the matrix blocking operation is determined according to the matrix dimensions of the original left and right matrices, the number of cores of the multi-core hardware and the operation execution degree of each core; the original left and right matrices are partitioned according to the target block number to obtain a plurality of left matrix blocks and right matrix blocks; each left and right matrix block is acquired through the data storage area, and matrix storage positions are set according to the correspondence between the left and right matrix blocks; and the data operation area reads the matrix block data corresponding to the matrix blocking operation from the data storage area, performs the operation, and updates and stores the operation results to the corresponding target matrix storage positions. This solves the problem of fully utilizing hardware resources for high-order matrix operations: blocking the matrix operation makes full use of the execution capacity of the hardware and improves operation performance, while parallel data storage and operation save data transmission time and improve hardware execution performance.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a matrix operation method based on multi-core hardware according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of matrix partitioning of an original left matrix and an original right matrix according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a matrix partitioned storage according to a first embodiment of the present invention;
FIG. 4 is a flowchart of a matrix operation method based on multi-core hardware according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram of an input matrix partition provided according to a second embodiment of the present invention;
FIG. 6 is a diagram of a core distribution according to a second embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative matrix provided in accordance with a second embodiment of the present invention;
FIG. 8 is a schematic diagram of MPU and SPU execution provided in accordance with a second embodiment of the present invention;
fig. 9 is a schematic structural diagram of a matrix computing device based on multi-core hardware according to a third embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device implementing a matrix operation method based on multi-core hardware according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a matrix operation method based on multi-core hardware according to a first embodiment of the present invention. The method may be performed by a matrix operation device based on multi-core hardware, which may be implemented in the form of hardware and/or software and configured in an electronic device, such as a computer. The intra-core space of the multi-core hardware includes a data storage area and a data operation area which execute in parallel. As shown in fig. 1, the method includes:
step 110, an original left matrix and an original right matrix to be operated are obtained.
For the matrix operation A×B, A is the original left matrix and B is the original right matrix. The original left matrix and the original right matrix may be matrices of arbitrary order, and may be square or rectangular; the embodiment of the present invention does not limit this.
And 120, determining the number of target blocks of matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware and the operation execution degree of each core.
For an n-order original left matrix or an original right matrix, the matrix dimension is n. The operation execution degree refers to the maximum matrix dimension supported by one operation. The target block number is the basis for the block processing of the original left matrix and the original right matrix.
In the embodiment of the present invention, the target block number may be determined in various ways. For example, the ratio of the matrix dimension to the operation execution degree may be rounded up and the result squared to give the target block number, so as to make full use of the execution degree of each core in the hardware. Alternatively, the number of cores may be used as the target block number, so as to make full use of all cores in the hardware.
Further, to combine with the hardware architecture for more efficient operation and improve the hardware execution degree, in an optional implementation manner of the embodiment of the present invention, determining the target block number of the matrix blocking operation according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware, and the operation execution degree of each core includes: if the matrix dimension is not greater than the product of the core number and the operation execution degree, determining the target block number according to the matrix dimension and the operation execution degree. Determining the target block number according to the matrix dimension and the operation execution degree may mean rounding up the ratio of the matrix dimension to the operation execution degree and squaring the result to obtain the target block number.
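The rule above can be sketched as follows; the function name, signature, and the treatment of the over-sized case are illustrative assumptions, not taken from the patent:

```python
import math

def target_block_count(matrix_dim: int, num_cores: int, exec_degree: int) -> int:
    """Illustrative sketch: determine the target block number for the matrix
    blocking operation, assuming square input matrices of order `matrix_dim`,
    `num_cores` cores, and a per-core operation execution degree `exec_degree`
    (the maximum matrix dimension one core supports per operation)."""
    if matrix_dim <= num_cores * exec_degree:
        # Round up the ratio of matrix dimension to execution degree, then square it.
        n_block = math.ceil(matrix_dim / exec_degree)
        return n_block * n_block
    # When the dimension exceeds the product, the patent suggests either
    # pre-splitting or taking the core count as the target block number;
    # the latter option is used here.
    return num_cores
```

For a 16384-order matrix on 16 cores with an execution degree of 4096, this yields ceil(16384/4096)^2 = 16 blocks, one per core.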
When the dimension of the matrix is larger than the product of the number of cores and the operation execution degree, matrix splitting can be performed in advance according to the product of the number of cores and the operation execution degree to obtain a matrix smaller than or equal to the product of the number of cores and the operation execution degree, the splitting result can be a matrix which needs to be performed by each core during cyclic execution, and the corresponding target block number can be determined for the matrix during cyclic execution.
Alternatively, when the matrix dimension is larger than the product of the core number and the operation execution degree, the core number can be used as the target block number for matrix blocking. In this case, the resulting matrix blocks are larger than the operation execution degree of each core, which can be handled in various ways: for example, each core may continue with blocked matrix operations over multiple passes, or the operation execution degree of each core may be expanded.
And 130, performing block processing on the original left matrix and the original right matrix according to the number of the target blocks to obtain a plurality of left matrix blocks and a plurality of right matrix blocks.
In the block processing, the original left matrix and the original right matrix are divided into the same number of blocks along the rows and columns. For example, the original left matrix and the original right matrix may each be divided into an n_block × n_block grid of blocks. The number of rows and columns of each block, n_line, is the ratio of the matrix dimension to n_block, rounded up. In the blocking process, if the dimension of a matrix block in the last row or column is insufficient, the block can be expanded to n_line × n_line by filling with 0 elements, so that all matrix blocks have a consistent size, which is convenient for calculation.
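A minimal sketch of this zero-padded blocking, using NumPy for illustration (the helper name is an assumption, and square matrices are assumed for brevity):

```python
import numpy as np

def partition(mat: np.ndarray, n_block: int) -> list:
    """Split `mat` into an n_block x n_block grid of equal-size blocks,
    zero-padding the last row/column of blocks when the dimension is not
    an exact multiple (illustrative helper, not from the patent)."""
    n = mat.shape[0]
    n_line = -(-n // n_block)          # ceil(n / n_block): rows/cols per block
    padded = np.zeros((n_line * n_block, n_line * n_block), dtype=mat.dtype)
    padded[:n, :n] = mat               # fill element 0 for the missing edge
    return [[padded[i*n_line:(i+1)*n_line, j*n_line:(j+1)*n_line]
             for j in range(n_block)] for i in range(n_block)]
```

For a 5×5 matrix split 2×2, each block is 3×3 and the trailing row and column of the last blocks are zero, so all blocks have a consistent size.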
Fig. 2 is a schematic diagram illustrating matrix blocking of an original left matrix and an original right matrix according to a first embodiment of the present invention. In FIG. 2, A_ij represents the matrix block in row i, column j obtained by blocking the original left matrix, and B_ij represents the matrix block in row i, column j obtained by blocking the original right matrix.
And 140, acquiring each left matrix block and each right matrix block through the data storage area, and setting matrix storage positions according to the corresponding relation of the left matrix block and the right matrix block.
The corresponding relationship of the left matrix block and the right matrix block can be the row-column corresponding relationship of the left matrix block and the right matrix block. The matrix blocks may be stored in a corresponding relationship on storage resources corresponding to the cores performing the operations. In order to facilitate determination of the final result of matrix operation, the operation result may be stored in correspondence with the left and right matrix blocks.
Specifically, in an alternative embodiment of the present invention, three memory subareas are provided in the data storage area; acquiring each left matrix block and each right matrix block through the data storage area, and setting matrix storage positions according to the corresponding relation of the left matrix block and the right matrix block, wherein the method comprises the following steps: setting the target left matrix block in the first memory subarea for storage according to the corresponding relation of the left matrix block and the right matrix block, and setting the corresponding target right matrix block in the second memory subarea for storage; and taking the third memory subarea as a memory position of a target operation result obtained by performing matrix partitioning operation on the corresponding target left matrix partition and the target right matrix partition.
Fig. 3 is a schematic diagram illustrating matrix block storage according to a first embodiment of the present invention. As shown in FIG. 3, A_ij and B_ij may be stored respectively in the first memory sub-region and the second memory sub-region of the data storage area corresponding to the relevant core. The third memory sub-region of the data storage area corresponding to that core can be used as the storage position of the result of the operation on A_ij and B_ij.
And 150, reading matrix partitioning data corresponding to the matrix partitioning operation in the data storage area through the data operation area, performing operation, and updating and storing an operation result to a corresponding target matrix storage position.
For the operation on the matrix block data, multiple operations can be set according to the rule required by the final operation result, and the matrix block data required for each operation may differ. For example, for the result C_11 of the operation on A_11 and B_11, the operation rule can be C_11 = A_11×B_11 + A_12×B_21 + A_13×B_31 + A_14×B_41 + … + A_1,n_block × B_n_block,1.
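Sketched in NumPy, the accumulation rule for one result block reads as follows (the grid representation and helper name are illustrative, not the patent's implementation):

```python
import numpy as np

def result_block(A_blocks, B_blocks, i, j):
    """Compute one block C_ij of the result by superimposing the partial
    products A_ik x B_kj over k, mirroring the rule
    C_11 = A_11*B_11 + A_12*B_21 + ... (grids of NumPy blocks assumed)."""
    n_block = len(A_blocks)
    C_ij = np.zeros_like(A_blocks[i][0] @ B_blocks[0][j])
    for k in range(n_block):
        C_ij += A_blocks[i][k] @ B_blocks[k][j]   # superimpose each partial product
    return C_ij
```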
The data operation area and the data storage area are executed in parallel. Specifically, in the time slot of the data operation area for performing the operation, the data storage area acquires matrix block data required by the next operation from the storage area outside the core, that is, performs data transmission and operation in a manner similar to ping-pong transmission. By parallel execution of the data operation area and the data storage area, the time of data transmission can be hidden in operation, the execution force of a hardware architecture can be fully exerted, and the operation efficiency is improved.
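A software analogue of this ping-pong overlap can be sketched with a helper thread standing in for the data storage area while the main thread plays the data operation area (real hardware would use DMA into in-core memory; all names here are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def pingpong_multiply(pairs):
    """Overlap the 'transfer' of the next block pair with the current multiply,
    mimicking the parallel data storage / data operation areas. `pairs` is a
    list of (A_block, B_block); the off-core fetch is simulated by np.copy."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as fetcher:
        fetch = lambda p: (np.copy(p[0]), np.copy(p[1]))
        future = fetcher.submit(fetch, pairs[0])
        for idx in range(len(pairs)):
            a, b = future.result()                 # wait for the current "transfer"
            if idx + 1 < len(pairs):
                future = fetcher.submit(fetch, pairs[idx + 1])  # next transfer starts
            results.append(a @ b)                  # compute while the next pair loads
    return results
```

Because the fetch of pair idx+1 is submitted before the multiply of pair idx runs, the transfer time is hidden behind the computation, as in the ping-pong scheme described above.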
Specifically, the data operation area may be a vector processor (MPU), and the data storage area may be a Scalar Processor (SPU).
According to the technical scheme of this embodiment, an original left matrix and an original right matrix to be operated are obtained; the target block number of the matrix blocking operation is determined according to the matrix dimensions of the original left and right matrices, the number of cores of the multi-core hardware and the operation execution degree of each core; the original left and right matrices are partitioned according to the target block number to obtain a plurality of left matrix blocks and right matrix blocks; each left and right matrix block is acquired through the data storage area, and matrix storage positions are set according to the correspondence between the left and right matrix blocks; and the data operation area reads the matrix block data corresponding to the matrix blocking operation from the data storage area, performs the operation, and updates and stores the operation results to the corresponding target matrix storage positions. This solves the problem of fully utilizing hardware resources for high-order matrix operations: blocking the matrix operation makes full use of the execution capacity of the hardware and improves operation performance, while parallel data storage and operation save data transmission time and improve hardware execution performance.
Example two
Fig. 4 is a flowchart of a matrix operation method based on multi-core hardware according to a second embodiment of the present invention, where the technical solution in this embodiment is further refined, and the technical solution in this embodiment may be combined with each of the alternatives in one or more embodiments. As shown in fig. 4, the method includes:
step 210, obtaining an original left matrix and an original right matrix to be operated.
And 220, determining the number of target blocks of matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware and the operation execution degree of each core.
In an optional implementation manner of the embodiment of the present invention, determining the number of target blocks of the matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware, and the operation execution degree of each core includes: and if the matrix dimension is not greater than the product of the core number and the operation execution degree, determining the target block number according to the matrix dimension and the operation execution degree.
And 230, performing block processing on the original left matrix and the original right matrix according to the number of the target blocks to obtain a plurality of left matrix blocks and a plurality of right matrix blocks.
And 240, acquiring each left matrix block and each right matrix block through the data storage area, and setting matrix storage positions according to the corresponding relation of the left matrix block and the right matrix block.
In an alternative implementation of the embodiment of the present invention, three memory subareas are provided in the data storage area; acquiring each left matrix block and each right matrix block through the data storage area, and setting matrix storage positions according to the corresponding relation of the left matrix block and the right matrix block, wherein the method comprises the following steps: setting the target left matrix block in the first memory subarea for storage according to the corresponding relation of the left matrix block and the right matrix block, and setting the corresponding target right matrix block in the second memory subarea for storage; and taking the third memory subarea as a memory position of a target operation result obtained by performing matrix partitioning operation on the corresponding target left matrix partition and the target right matrix partition.
Step 250, determining the total operation times of matrix block operation according to the target block number.
The total operation count of the matrix block operation, squared, equals the target block number. For example, if the target block number is n_block × n_block, the total operation count of the matrix block operation is n_block.
And 260, reading matrix blocking data corresponding to the matrix blocking operation in the data storage area according to the current operation times of the matrix blocking operation through the data operation area, performing operation, and superposing the operation result of the current times to a corresponding target matrix storage position.
In order to reduce the data to be stored in the matrix blocking operation, different data reading rules can be set according to the operation count, and the data to be operated on is acquired to obtain the operation result. For example, for C_11 = A_11×B_11 + A_12×B_21 + A_13×B_31 + A_14×B_41 + … + A_1,n_block × B_n_block,1, core 1 may calculate A_11×B_11 in the first operation, A_12×B_21 in the second operation, and so on, up to A_1,n_block × B_n_block,1 in the n_block-th operation. According to the current operation count, the required matrix data is read correspondingly and operated on, and the results of each operation are superimposed. The calculation order can be adjusted according to actual conditions.
And 270, when the current times reach the total operation times, taking the result stored in the storage position of the target matrix as the matrix block operation result of the corresponding left matrix block and right matrix block.
After each calculation, the calculation results of each calculation may be superimposed as the final calculation result. The matrix operation results of the original left matrix and the original right matrix can be obtained by directly splicing the operation results of each matrix block.
On the basis of the above embodiment, optionally, according to the current operation times of the matrix blocking operation, the data storage area reads matrix blocking data corresponding to the matrix blocking operation to perform the operation, and superimposes the operation result of the current times on a corresponding target matrix storage position, where the method includes: according to the execution operation core and the current times of matrix block operation, respectively determining an original address and an offset address corresponding to matrix block data; and reading matrix block data in the data storage area for operation according to the original address and the offset address through the data operation area, and superposing the operation result of the current times to the corresponding target matrix storage position.
In this embodiment, the executing core is used to determine the original address, and the current operation count is used to determine the offset address. For example, for C_11 = A_11×B_11 + A_12×B_21 + A_13×B_31 + A_14×B_41 + … + A_1,n_block × B_n_block,1, when the executing core is core 1, the original address of core 1 can be determined as the storage location corresponding to (1, 1), where (1, 1) denotes the matrix block with row number 1 and column number 1. In the first operation, the offset address of core 1 corresponds to calculating A_11×B_11: taking the matrix block with row number 1 and column number 1 as the base, the storage locations of A_11 and B_11 are obtained. In the second operation, the offset address of core 1 corresponds to calculating A_12×B_21: based on A_11 and B_11, the storage locations of A_12 and B_21 are obtained. In the third operation, the offset address of core 1 corresponds to calculating A_13×B_31: based on A_11 and B_11, the storage locations of A_13 and B_31 are obtained. By analogy, the original address and the offset address can be determined for each operation, and the operation is performed on the matrix data read according to the original address and the offset address. The results of each operation are superimposed, and the calculation order can be adjusted according to actual conditions.
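The addressing here can be condensed into an index rule; the following hypothetical helper (1-based indices as in the patent, assuming the skewed layout of the second embodiment) returns which left and right blocks a core reads at a given operation count:

```python
def block_indices(core_row, core_col, step, n_block):
    """Illustrative sketch of the original-address/offset-address rule:
    the core position (core_row, core_col) fixes the original address, and
    the current operation count `step` fixes the offset. Returns the 1-based
    (row, col) indices of the left and right blocks read at this step."""
    # Which diagonal of partial products this core reaches at this step.
    k = ((core_row - 1) + (core_col - 1) + (step - 1)) % n_block + 1
    return (core_row, k), (k, core_col)
```

For core 1 at position (1, 1) this yields A_11 and B_11 at the first operation, A_12 and B_21 at the second, and so on, matching the sequence described above.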
Specifically, in an optional implementation of the embodiment of the present invention, determining the original address and the offset address corresponding to the matrix block data respectively, according to the executing operation core and the current times of the matrix block operation, includes: when the current times is 1, determining that the offset addresses corresponding to the left matrix blocks in the i-th row of the original left matrix are cyclically moved left by (i minus 1) positions, obtaining an alternative left matrix; and determining that the offset addresses corresponding to the right matrix blocks in the j-th column of the original right matrix are cyclically moved up by (j minus 1) positions, obtaining an alternative right matrix; where i and j are natural numbers greater than or equal to 1.
In an optional implementation of the embodiment of the present invention, determining the original address and the offset address corresponding to the matrix block data respectively, according to the executing operation core and the current times of the matrix block operation, includes: when the current times is greater than 1, determining, according to the current times of the matrix block operation, that the offset address of the original left matrix is the left matrix blocks of each row in the alternative left matrix cyclically moved right by (current times minus 1) positions; and when the current times is greater than 1, determining, according to the current times of the matrix block operation, that the offset address of the original right matrix is the right matrix blocks of each column in the alternative right matrix cyclically moved up by (current times minus 1) positions.
Take n_block = 4 as an example. In this case the number of target blocks is 16, and 16 cores can be used to perform the matrix block operation. Fig. 5 is a schematic diagram of input matrix partitioning according to the second embodiment of the present invention. Fig. 6 is a schematic diagram of core distribution according to the second embodiment of the present invention. As shown in fig. 6, the cores correspond to the matrix blocks in order from left to right and top to bottom. In the first calculation of each core, the offset addresses corresponding to the left matrix blocks in the i-th row of the original left matrix are cyclically moved left by (i minus 1) positions, obtaining the alternative left matrix; that is, row 1 moves 0 positions, row 2 moves 1 position to the left, and so on. Likewise, the offset addresses corresponding to the right matrix blocks in the j-th column of the original right matrix shown in fig. 5 are cyclically moved up by (j minus 1) positions, obtaining the alternative right matrix; that is, column 1 moves 0 positions, column 2 moves 1 position upwards, and so on. Fig. 7 is a schematic diagram of the alternative matrices according to the second embodiment of the present invention; the alternative left matrix and the alternative right matrix are shown in fig. 7.
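The initial alignment that produces the alternative matrices can be illustrated with a small Python sketch (hypothetical helper names; the block contents are stand-in labels rather than real matrix data):

```python
def skew_left(blocks):
    """Cyclically shift row i of a block grid left by i positions (0-indexed)."""
    n = len(blocks)
    return [[blocks[i][(i + c) % n] for c in range(n)] for i in range(n)]

def skew_right(blocks):
    """Cyclically shift column j of a block grid up by j positions (0-indexed)."""
    n = len(blocks)
    return [[blocks[(r + j) % n][j] for j in range(n)] for r in range(n)]

n = 4
A = [[f"A{i+1}{j+1}" for j in range(n)] for i in range(n)]
B = [[f"B{i+1}{j+1}" for j in range(n)] for i in range(n)]
altA = skew_left(A)   # alternative left matrix: row 1 unmoved, row 2 left by 1, ...
altB = skew_right(B)  # alternative right matrix: column 1 unmoved, column 2 up by 1, ...
```

After this skew, the block at position (i, j) of the alternative left and right matrices is exactly the pair the core at (i, j) needs for its first partial product.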
In each calculation after the first calculation of each core, on the basis of the alternative matrices shown in fig. 7, the left matrix blocks of each row in the alternative left matrix are cyclically moved right by (current times minus 1) positions, and the right matrix blocks of each column in the alternative right matrix are cyclically moved up by (current times minus 1) positions.
For example, in the second calculation, each row of the alternative left matrix shown in fig. 7 is cyclically moved right by one block position, and each column of the alternative right matrix is cyclically moved up by one block position.
Combining the results obtained after the four calculations of each core yields the multiplication result of the original high-order matrix. Taking core 1 as an example, the result of its four calculations is C11 = A11×B11 + A12×B21 + A13×B31 + A14×B41, i.e., the result at the upper-left block position of the high-order matrix.
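The four-pass accumulation can be checked end to end with a toy model in which each block is a single scalar (illustrative only; the function name is an assumption, and the cyclic offset is written as index arithmetic matching the example sequence above):

```python
def blocked_multiply(A, B, n_block):
    """Multiply two n_block x n_block block grids by letting each
    'core' (i, j) accumulate n_block partial products, advancing its
    read offset cyclically on every pass."""
    C = [[0] * n_block for _ in range(n_block)]
    for count in range(1, n_block + 1):          # total operation times = n_block
        for i in range(n_block):
            for j in range(n_block):             # one "core" per target block
                k = (i + j + count - 1) % n_block
                C[i][j] += A[i][k] * B[k][j]     # superpose onto C_ij
    return C

n = 4
A = [[i * n + j + 1 for j in range(n)] for i in range(n)]
B = [[(i + 1) * (j + 2) for j in range(n)] for i in range(n)]
ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert blocked_multiply(A, B, n) == ref  # matches the direct product
```

Because the offset k visits every residue exactly once over the n_block passes, each target position accumulates the full sum of its partial products.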
Fig. 8 is a schematic diagram of MPU and SPU execution according to the second embodiment of the present invention. As shown in fig. 8, for the four calculations described above, data transmission can be performed in the gaps between calculations, hiding the data transmission time. When there are more blocks, and hence more transmissions, the time-performance optimization becomes more significant.
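The benefit of transmitting data in the computation gaps can be illustrated with a toy timing model (purely hypothetical numbers and function names; this is not a claim about the actual MPU/SPU timing): a serial schedule pays load plus compute on every pass, while a double-buffered schedule exposes only the first load and thereafter runs at the pace of the slower stage.

```python
def serial_time(n_pass, t_load, t_compute):
    """Each pass waits for its data transfer, then computes."""
    return n_pass * (t_load + t_compute)

def pipelined_time(n_pass, t_load, t_compute):
    """Load the data for pass k+1 while computing pass k.
    Only the first load is exposed; middle passes advance at the pace
    of the slower stage; the last pass has nothing left to load."""
    return t_load + (n_pass - 1) * max(t_load, t_compute) + t_compute
```

With four passes, t_load = 3 and t_compute = 5, the serial schedule takes 32 time units while the pipelined one takes 23, and the gap widens as the number of blocks, and hence of transmissions, grows.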
According to the technical scheme of this embodiment, the original left matrix and the original right matrix to be operated on are acquired; the number of target blocks for the matrix block operation is determined according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware, and the operation execution degree of each core; the original left matrix and the original right matrix are divided into blocks according to the number of target blocks, obtaining a plurality of left matrix blocks and a plurality of right matrix blocks; each left matrix block and each right matrix block is acquired through the data storage area, and matrix storage positions are set according to the correspondence between the left matrix blocks and the right matrix blocks; the total operation times of the matrix block operation is determined according to the number of target blocks; according to the current operation times, the matrix block data corresponding to the matrix block operation is read from the data storage area through the data operation area and operated on, and the operation result of the current times is superposed onto the corresponding target matrix storage position; when the current times reaches the total operation times, the result stored at the target matrix storage position is taken as the matrix block operation result of the corresponding left and right matrix blocks. This solves the problem of fully utilizing hardware resources for high-order matrix operations: dividing the matrix operation into blocks makes full use of the execution capability of the hardware and improves operation performance, while performing data storage and operation in parallel saves data transmission time and improves hardware execution performance.
Example III
Fig. 9 is a schematic structural diagram of a matrix operation device based on multi-core hardware according to a third embodiment of the present invention. The intra-core space of the multi-core hardware comprises: a data storage area and a data operation area which are executed in parallel. As shown in fig. 9, the apparatus includes: an original matrix acquisition module 910, a target block number determination module 920, a block processing module 930, a matrix storage location setting module 940, and a matrix block operation module 950. Wherein:
an original matrix obtaining module 910, configured to obtain an original left matrix and an original right matrix to be operated;
the target block number determining module 920 is configured to determine the target block number of the matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware, and the operation execution degree of each core;
the block processing module 930 is configured to perform a block processing on the original left matrix and the original right matrix according to the number of target blocks, to obtain a plurality of left matrix blocks and a plurality of right matrix blocks;
the matrix storage position setting module 940 is configured to obtain each left matrix block and each right matrix block through the data storage area, and set a matrix storage position according to a corresponding relationship between the left matrix block and the right matrix block;
The matrix blocking operation module 950 is configured to read matrix blocking data corresponding to the matrix blocking operation from the data storage area through the data operation area, perform the operation, and update and store the operation result to the corresponding target matrix storage location.
Optionally, three memory subareas are arranged in the data storage area;
a matrix storage location setting module 940, comprising:
the first matrix storage position setting unit is used for setting the target left matrix block in the first storage subarea for storage according to the corresponding relation of the left matrix block and the right matrix block, and setting the corresponding target right matrix block in the second storage subarea for storage;
and the second matrix storage position setting unit is used for taking the third storage subarea as a storage position of a target operation result obtained by performing matrix partitioning operation on the corresponding target left matrix partition and the target right matrix partition.
Optionally, the matrix partitioning operation module 950 includes:
the total operation times determining unit is used for determining the total operation times of the matrix block operation according to the number of target blocks;
the matrix partitioning operation unit is used for reading matrix partitioning data corresponding to the matrix partitioning operation in the data storage area for operation according to the current operation times of the matrix partitioning operation through the data operation area, and superposing operation results of the current times to corresponding target matrix storage positions;
And the matrix blocking operation result determining unit is used for taking the result stored in the target matrix storage position as the matrix blocking operation result of the corresponding left matrix blocking and right matrix blocking when the current times reach the total operation times.
Optionally, the matrix partitioning operation unit includes:
the address determining subunit is used for respectively determining an original address and an offset address corresponding to the matrix block data according to an execution operation core and the current times of matrix block operation;
and the matrix block operation subunit is used for reading matrix block data from the data storage area through the data operation area according to the original address and the offset address, performing operation, and superposing the operation result of the current times to the corresponding target matrix storage position.
Optionally, the address determining subunit is specifically configured to:
when the current times is 1, determining that the offset addresses corresponding to the left matrix blocks in the i-th row of the original left matrix are cyclically moved left by (i minus 1) positions, obtaining an alternative left matrix; and determining that the offset addresses corresponding to the right matrix blocks in the j-th column of the original right matrix are cyclically moved up by (j minus 1) positions, obtaining an alternative right matrix;
wherein i and j are natural numbers greater than or equal to 1.
Optionally, the address determining subunit is further specifically configured to:
when the current times is greater than 1, determine, according to the current times of the matrix block operation, that the offset address of the original left matrix is the left matrix blocks of each row in the alternative left matrix cyclically moved right by (current times minus 1) positions; and
when the current times is greater than 1, determine, according to the current times of the matrix block operation, that the offset address of the original right matrix is the right matrix blocks of each column in the alternative right matrix cyclically moved up by (current times minus 1) positions.
Optionally, the target block number determining module 920 includes:
and the target block number determining unit is used for determining the target block number according to the matrix dimension and the operation execution degree if the matrix dimension is not larger than the product of the core number and the operation execution degree.
The matrix operation device based on the multi-core hardware provided by the embodiment of the invention can execute the matrix operation method based on the multi-core hardware provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 10 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 10, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as a matrix operation method based on multi-core hardware.
In some embodiments, the matrix operation method based on multi-core hardware may be implemented as a computer program, which is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the above-described multi-core hardware-based matrix operation method may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform a matrix operation method based on multi-core hardware in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A matrix operation method based on multi-core hardware, characterized in that an intra-core space of the multi-core hardware comprises: a data storage area and a data operation area which are executed in parallel; the method comprises:
acquiring an original left matrix and an original right matrix to be operated;
determining the number of target blocks of matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware and the operation execution degree of each core;
performing block division processing on the original left matrix and the original right matrix according to the number of target blocks to obtain a plurality of left matrix blocks and a plurality of right matrix blocks;
acquiring each left matrix block and each right matrix block through the data storage area, and setting a matrix storage position according to the corresponding relation of the left matrix block and the right matrix block;
and reading matrix partitioning data corresponding to matrix partitioning operation in the data storage area through the data operation area, performing operation, and updating and storing an operation result to a corresponding target matrix storage position.
2. The method of claim 1, wherein three memory sub-regions are provided in the data storage region;
acquiring each left matrix block and each right matrix block through the data storage area, and setting a matrix storage position according to the corresponding relation of the left matrix block and the right matrix block, wherein the method comprises the following steps:
setting the target left matrix block in the first memory subarea for storage according to the corresponding relation of the left matrix block and the right matrix block, and setting the corresponding target right matrix block in the second memory subarea for storage;
and taking the third memory subarea as a memory position of a target operation result obtained by performing matrix partitioning operation on the corresponding target left matrix partition and the target right matrix partition.
3. The method according to claim 1, wherein reading matrix-partitioning data corresponding to matrix-partitioning operation in the data storage area through the data operation area to perform operation, and updating and storing operation results to corresponding target matrix storage locations, comprises:
determining the total operation times of matrix block operation according to the target block number;
reading matrix partitioning data corresponding to matrix partitioning operation in the data storage area according to the current operation times of matrix partitioning operation through the data operation area, performing operation, and superposing operation results of the current times to corresponding target matrix storage positions;
and when the current times reach the total operation times, taking the result stored in the storage position of the target matrix as a matrix block operation result of the corresponding left matrix block and right matrix block.
4. The method according to claim 3, wherein reading matrix block data corresponding to the matrix block operation in the data storage area according to the current operation times of the matrix block operation through the data operation area, and superposing operation results of the current times to corresponding target matrix storage positions, comprises:
According to the execution operation core and the current times of matrix block operation, respectively determining an original address and an offset address corresponding to matrix block data;
and reading the matrix block data in the data storage area according to the original address and the offset address through the data operation area, performing operation, and superposing the operation result of the current times to a corresponding target matrix storage position.
5. The method of claim 4, wherein determining the original address and the offset address corresponding to the matrix-block data, respectively, based on the execution core and the current number of matrix-block operations, comprises:
when the current times is 1, determining that the offset addresses corresponding to the left matrix blocks in the i-th row of the original left matrix are cyclically moved left by (i minus 1) positions, obtaining an alternative left matrix; and determining that the offset addresses corresponding to the right matrix blocks in the j-th column of the original right matrix are cyclically moved up by (j minus 1) positions, obtaining an alternative right matrix;
wherein i and j are natural numbers greater than or equal to 1.
6. The method of claim 5, wherein determining the original address and the offset address corresponding to the matrix-block data, respectively, based on the execution core and the current number of matrix-block operations, comprises:
when the current times is greater than 1, determining, according to the current times of the matrix block operation, that the offset address of the original left matrix is the left matrix blocks of each row in the alternative left matrix cyclically moved right by (current times minus 1) positions; and
when the current times is greater than 1, determining, according to the current times of the matrix block operation, that the offset address of the original right matrix is the right matrix blocks of each column in the alternative right matrix cyclically moved up by (current times minus 1) positions.
7. The method of claim 1, wherein determining the number of target blocks for a matrix partitioning operation based on the matrix dimensions of the original left matrix and the original right matrix, the number of cores of the multi-core hardware, and the degree of operation execution for each core, comprises:
and if the matrix dimension is not greater than the product of the number of cores and the operation execution degree, determining the number of target blocks according to the matrix dimension and the operation execution degree.
8. A matrix operation device based on multi-core hardware, wherein an intra-core space of the multi-core hardware comprises: a data storage area and a data operation area which are executed in parallel; the device comprises:
the original matrix acquisition module is used for acquiring an original left matrix and an original right matrix to be operated;
The target block number determining module is used for determining the target block number of matrix partitioning operation according to the matrix dimensions of the original left matrix and the original right matrix, the core number of the multi-core hardware and the operation execution degree of each core;
the block processing module is used for carrying out block processing on the original left matrix and the original right matrix according to the number of the target blocks to obtain a plurality of left matrix blocks and a plurality of right matrix blocks;
the matrix storage position setting module is used for acquiring each left matrix block and each right matrix block through the data storage area and setting matrix storage positions according to the corresponding relation of the left matrix block and the right matrix block;
and the matrix partitioning operation module is used for reading matrix partitioning data corresponding to matrix partitioning operation in the data storage area through the data operation area, performing operation, and updating and storing an operation result to a corresponding target matrix storage position.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the multi-core hardware-based matrix operation method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to implement the multi-core hardware-based matrix operation method of any one of claims 1-7 when executed.
CN202410070308.7A 2024-01-17 2024-01-17 Matrix operation method, device, equipment and medium based on multi-core hardware Pending CN117892050A (en)

Publication: CN117892050A, published 2024-04-16.


