CN109471612B - Arithmetic device and method - Google Patents


Info

Publication number
CN109471612B
CN109471612B (application CN201811085786.6A)
Authority
CN
China
Prior art keywords
matrix
instruction
unit
information
transpose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811085786.6A
Other languages
Chinese (zh)
Other versions
CN109471612A (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201811085786.6A priority Critical patent/CN109471612B/en
Publication of CN109471612A publication Critical patent/CN109471612A/en
Application granted granted Critical
Publication of CN109471612B publication Critical patent/CN109471612B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/78Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The present disclosure belongs to the field of computers, and more particularly relates to an arithmetic device and method, wherein the arithmetic device comprises: an operation control module, configured to receive or determine blocking information; and an operation module, configured to block an operation matrix according to the blocking information to obtain block matrices, and to transpose the block matrices according to an operation instruction to obtain the transpose matrices of the block matrices. With the device and method, the transpose of a matrix of any size can be completed with a single instruction in constant time complexity. Compared with traditional implementations, the time complexity of the operation is reduced, and the operation is simpler and more efficient to use.

Description

Arithmetic device and method
Technical Field
The present disclosure belongs to the field of computer technology, and more particularly, to an arithmetic device and method.
Background
The matrix transpose operation is a basic mathematical operation used with high frequency in various fields. In addition to the ordinary matrix transpose (i.e., interchanging the rows and columns of a matrix), there are special matrix transpose operations, such as row transposition, column transposition, reverse transposition, 90° left rotation, 90° right rotation, etc., which are also commonly used in matrix operations. At the present stage, the common way to perform a matrix transpose (including unconventional transposes, hereinafter collectively referred to as transposes) on a computer is to write a two-level loop on a general-purpose processor that exchanges data at different addresses, with time complexity O(n^2). A matrix transpose of this time complexity becomes one of the bottlenecks to improving performance in complex systems, especially when the number of matrix elements is large.
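The conventional approach described above can be sketched as follows; this is a minimal illustration of the O(n^2) two-level loop on a general-purpose processor, not code from the patent, and the function name is an assumption:

```python
def naive_transpose(a):
    """Transpose a matrix (list of rows) with two nested loops.

    Every element is visited exactly once, so the running time grows
    as O(n^2) in the number of elements -- the bottleneck the patent
    sets out to address.
    """
    rows, cols = len(a), len(a[0])
    return [[a[i][j] for i in range(rows)] for j in range(cols)]

m = [[1, 2, 3],
     [4, 5, 6]]
print(naive_transpose(m))  # [[1, 4], [2, 5], [3, 6]]
```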
Disclosure of Invention
In view of the above problems, the present disclosure is directed to an arithmetic device and method for solving at least one of the above problems.
In order to achieve the above object, as one aspect of the present disclosure, there is provided an arithmetic device comprising:
the operation control module is used for receiving or determining the block information;
and the operation module is used for blocking the operation matrix according to the blocking information to obtain block matrices, and for transposing the block matrices according to the operation instruction to obtain the transpose matrices of the block matrices.
In some embodiments, the operation module is further configured to perform a merging operation after the blocking and transposing operations, combining the transpose matrices of the block matrices to obtain the transpose matrix of the operation matrix.
In some embodiments, the operation instruction comprises a conventional transpose instruction, a row transpose instruction, a column transpose instruction, a reverse transpose instruction, a 90° left-rotation transpose instruction, and a 90° right-rotation transpose instruction.
In some embodiments, the computing device further comprises:
the address storage module is used for storing the address information of the operation matrix; and
the data storage module is used for storing the operation matrix and storing the calculated transposition matrix;
the operation control module is used for receiving address information and block information of the operation matrix, or extracting the address information of the operation matrix from the address storage module, and analyzing the address information of the operation matrix to obtain the block information;
the operation module is used for acquiring address information and blocking information of an operation matrix from the operation control module, extracting the operation matrix from the data storage module according to the address information of the operation matrix, performing blocking, transposition and merging operation on the operation matrix to obtain a transposition matrix of the operation matrix, and feeding back the transposition matrix of the operation matrix to the data storage module.
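The block/transpose/merge pipeline described above can be modeled in software as follows. This is a hedged sketch of the data flow, not the patent's hardware implementation; the function name, the use of NumPy, and the assumption that the matrix dimensions are multiples of the block size are all illustrative choices:

```python
import numpy as np

def blocked_transpose(x, bs):
    """Transpose x by tiling it into bs-by-bs blocks.

    Each block is transposed independently (the matrix operation unit),
    and the transposed block is written to the mirrored block position
    (the matrix merging unit). Assumes both dimensions divide by bs.
    """
    r, c = x.shape
    out = np.empty((c, r), dtype=x.dtype)
    for i in range(0, r, bs):
        for j in range(0, c, bs):
            block = x[i:i + bs, j:j + bs]
            out[j:j + bs, i:i + bs] = block.T  # mirrored slot in the result
    return out

x = np.arange(24).reshape(4, 6)
assert np.array_equal(blocked_transpose(x, 2), x.T)
```

Because each block transpose is independent, the blocks can in principle be processed in parallel, which is what lets a hardware unit complete the whole operation under a single instruction.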
In some embodiments, the operation module comprises a matrix blocking unit, a matrix operation unit, and a matrix merging unit, wherein:
the matrix blocking unit is used for acquiring address information and blocking information of an operation matrix from the operation control module, extracting the operation matrix from the data storage module according to the address information of the operation matrix, and blocking the operation matrix according to the blocking information to obtain n blocking matrixes;
the matrix operation unit is used for acquiring the n block matrixes and performing transposition operation on the n block matrixes according to the operation instruction to obtain a transposition matrix of the n block matrixes;
the matrix merging unit is used for acquiring and merging the transpose matrixes of the n block matrixes to obtain the transpose matrix of the operation matrix and feeding back the transpose matrix of the operation matrix to the data storage module, wherein n is a natural number; the matrix merging unit merges the transpose matrixes of the n block matrixes according to a merging mode corresponding to the transpose mode obtained by the operation instruction.
In some embodiments, the operation module further includes a buffer unit, configured to buffer the n block matrices for the matrix operation unit to obtain.
In some embodiments, the operation control module includes an instruction processing unit, an instruction cache unit, and a matrix determination unit, wherein:
the instruction cache unit is used for storing the operation instruction to be executed;
the instruction processing unit is used for acquiring the operation instruction from the instruction cache unit, decoding the operation instruction and acquiring address information of an operation matrix from the address storage module according to the decoded operation instruction;
and the matrix judgment unit is used for analyzing the address information of the operation matrix to obtain the block information.
In some embodiments, the operation control module further includes a dependency processing unit, configured to determine whether address information of the decoded operation instruction and operation matrix conflicts with a previous operation, and if so, temporarily store the address information of the decoded operation instruction and operation matrix; and if no conflict exists, transmitting the decoded operation instruction and the address information of the operation matrix to the matrix judgment unit.
In some embodiments, the operation control module further includes an instruction queue memory configured to buffer address information of the decoded operation instruction and operation matrix with a conflict, and transmit the buffered address information of the decoded operation instruction and operation matrix to the matrix judgment unit after the conflict is eliminated.
In some embodiments, the instruction processing unit comprises an instruction fetch unit and a decode unit, wherein:
an instruction fetch unit for fetching the operation instruction from the instruction cache unit and transmitting the operation instruction to the decode unit;
and the decoding unit is used for decoding the operation instruction, extracting address information of an operation matrix from the address storage module according to the decoded operation instruction, and transmitting the decoded operation instruction and the extracted address information of the operation matrix to the dependency relationship processing unit.
In some embodiments, the computing device further comprises:
the data storage module is used for storing the operation matrix and storing the calculated transposition matrix;
and the input and output module is used for inputting the operation matrix data to the data storage module, acquiring the operated transposition matrix from the data storage module and outputting the operated transposition matrix.
In some embodiments, the computing device further comprises: an address storage module, configured to store the address information of the operation matrix; the address storage module comprises a scalar register file or a general memory unit; the data storage module comprises a scratchpad memory (a high-speed temporary storage) or a general memory unit; and the address information of the operation matrix is the start address and the matrix size information of the matrix.
According to another aspect of the present disclosure, there is provided an arithmetic method including the steps of:
the operation control module receives or determines the block information;
the operation module blocks the operation matrix according to the blocking information to obtain block matrices, and transposes the block matrices according to the operation instruction to obtain the transpose matrices of the block matrices.
In some embodiments, after the transposing step, the method further comprises a merging step: the operation module merges the transpose matrices of the block matrices to obtain the transpose matrix of the operation matrix.
In some embodiments, the operation instruction comprises a conventional transpose instruction, a row transpose instruction, a column transpose instruction, a reverse transpose instruction, a 90° left-rotation transpose instruction, and a 90° right-rotation transpose instruction.
In some embodiments, the step of the arithmetic control module determining blocking information comprises:
the operation control module extracts address information of the operation matrix from the address storage module; and
and the operation control module determines the block information according to the address information of the operation matrix.
In some embodiments, the step of the operation control module extracting the address information of the operation matrix from the address storage module includes:
the instruction fetching unit fetches the operation instruction and sends the operation instruction to the decoding unit;
the decoding unit decodes the operation instruction, acquires the address information of the operation matrix from the address storage module according to the decoded operation instruction, and sends the decoded operation instruction and the address information of the operation matrix to the dependency relationship processing unit;
the dependency relationship processing unit analyzes whether the decoded operation instruction and the previous instruction which is not executed and ended have a dependency relationship on data or not; if there is a dependency relationship, the decoded operation instruction and the address information of the corresponding operation matrix need to wait in the instruction queue memory until there is no more dependency relationship on data between the decoded operation instruction and the previous instruction that has not been executed.
In some embodiments, the operation module performs blocking, transposing, and combining operations on an operation matrix according to the blocking information to obtain a transposed matrix of the operation matrix, including:
a matrix partitioning unit of the operation module extracts an operation matrix from the data storage module according to the address information of the operation matrix; dividing the operation matrix into n block matrixes according to the block information;
a matrix operation unit of an operation module performs transposition operation on the n block matrixes respectively according to the operation instruction to obtain transposition matrixes of the n block matrixes; and
a matrix merging unit of the operation module merges the transpose matrixes of the n block matrixes to obtain the transpose matrixes of the operation matrix and feeds the transpose matrixes back to the data storage module;
wherein n is a natural number.
In some embodiments, the step of merging the transpose matrices of the n block matrices by the operation module to obtain the transpose matrix of the operation matrix, and feeding the transpose matrix back to the data storage module includes:
the matrix merging unit receives the transpose matrix of each block matrix, and after the number of the transpose matrixes of the received block matrixes reaches the total number of the blocks, matrix merging operation is carried out on all the blocks to obtain the transpose matrix of the operation matrix; feeding back the transposed matrix to a designated address of a data storage module;
the input and output module directly accesses the data storage module and reads the transposition matrix of the operation matrix obtained by operation from the data storage module.
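The counting behavior of the matrix merging unit in the steps above — buffer each transposed block, and start merging only once the number of received blocks reaches the total block count — can be sketched as a small state holder. This is an assumed software model, not the patent's circuit; the class and method names are hypothetical:

```python
class MergeUnit:
    """Toy model of the matrix merging unit's block-counting behavior."""

    def __init__(self, total):
        self.total = total   # total number of blocks expected
        self.blocks = {}     # buffered transposed blocks, keyed by grid position

    def receive(self, pos, block):
        """Buffer one transposed block; return True once merging can start."""
        self.blocks[pos] = block
        return len(self.blocks) == self.total

mu = MergeUnit(total=4)
assert not mu.receive((0, 0), "B0")   # still waiting for 3 more blocks
mu.receive((0, 1), "B1")
mu.receive((1, 0), "B2")
assert mu.receive((1, 1), "B3")       # all blocks arrived; merge may proceed
```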
With the operation device and method described above, the operation matrix can be blocked, the transpose matrices of the multiple block matrices obtained by transposing each block matrix separately, and finally the transpose matrices of the multiple block matrices merged to obtain the transpose matrix of the operation matrix, so that the transpose of a matrix of any size can be completed with a single instruction in constant time complexity. Compared with traditional implementations of the matrix transpose operation, the matrix transpose is simpler and more efficient to use while the time complexity of the operation is reduced.
Drawings
Fig. 1 is a schematic structural diagram of an arithmetic device according to the present disclosure.
Fig. 2 is a schematic information flow diagram of the computing device proposed in the present disclosure.
Fig. 3 is a schematic structural diagram of an operation module in the operation device according to the present disclosure.
Fig. 4 is a schematic diagram of the operation module provided in the present disclosure performing matrix operation.
Fig. 5 is another schematic diagram of the operation module of the present disclosure performing matrix operation.
Fig. 6 is a schematic structural diagram of an arithmetic control module in the arithmetic device according to the present disclosure.
Fig. 7 is a detailed structural diagram of an arithmetic device according to an embodiment of the disclosure.
Fig. 8 is a flowchart of an operation method according to another embodiment of the disclosure.
Fig. 9 is another schematic structural diagram of the computing device according to the disclosure.
Fig. 10 is a schematic diagram of another information flow of the computing device proposed in the present disclosure.
Fig. 11 is a schematic structural diagram of an iterative blocking unit proposed by the present disclosure.
Fig. 12 is another schematic structural diagram of a matrix operation unit proposed in the present disclosure.
Fig. 13 is another structural schematic diagram of the iterative merge operation unit proposed in the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
The present disclosure provides an arithmetic device including:
the operation control module is used for receiving or determining the block information;
and the operation module is used for blocking the operation matrix according to the blocking information to obtain block matrices, and for transposing the block matrices according to the operation instruction to obtain the transpose matrices of the block matrices.
The operation module is further configured to perform a merging operation after the blocking and transposing operations, combining the transpose matrices of the block matrices to obtain the transpose matrix of the operation matrix.
The operation module transposes the matrix according to the transpose mode given by the operation instruction. The operation instruction includes a conventional transpose instruction (a conventional transpose mirrors all elements of a matrix A across the 45° ray that starts at the element in row 1, column 1 and points toward the lower right, i.e., across the main diagonal, yielding the transpose of A), a row transpose instruction, a column transpose instruction, a reverse transpose instruction, a 90° left-rotation transpose instruction, a 90° right-rotation transpose instruction, and the like. The specific format of the operation instruction, which includes at least the instruction type, the data type, the matrix head address, the matrix row number, the matrix column number, and the target storage address, is shown in Table 1.
TABLE 1 Operation instruction format

Instruction type | Data type | Matrix head address | Matrix row number | Matrix column number | Target storage address
The following illustrates an operation instruction related to the transpose operation of the present disclosure and a transpose method corresponding to the operation instruction.
Taking as an example an operation instruction that computes the conventional transpose MTRAN of a matrix, a conventional transpose operation is performed on a given matrix. In a specific implementation, given a square matrix A = (a_ij), its transpose is computed according to the following formula (taking a 4 × 4 matrix as an example):

$$\left(A^{T}\right)_{ij} = a_{ji}, \qquad 1 \le i, j \le 4$$
Taking as an example an operation instruction that computes the row transpose MRTRAN of a matrix, a row transpose operation is performed on a given matrix. In a specific implementation, given a square matrix A = (a_ij), its row transpose A' is computed according to the following formula (taking a 4 × 4 matrix as an example):

$$A'_{ij} = a_{(5-i)\,j}, \qquad 1 \le i, j \le 4$$

that is, the order of the rows of A is reversed.
Taking as an example an operation instruction that computes the column transpose MCTRAN of a matrix, a column transpose operation is performed on a given matrix. In a specific implementation, given a square matrix A = (a_ij), its column transpose A' is computed according to the following formula (taking a 4 × 4 matrix as an example):

$$A'_{ij} = a_{i\,(5-j)}, \qquad 1 \le i, j \le 4$$

that is, the order of the columns of A is reversed.
Taking as an example an operation instruction that computes the reverse transpose MOTRAN of a matrix, a reverse transpose operation is performed on a given matrix. In a specific implementation, given a square matrix A = (a_ij), its reverse transpose A' is computed according to the following formula (taking a 4 × 4 matrix as an example):

$$A'_{ij} = a_{(5-i)\,(5-j)}, \qquad 1 \le i, j \le 4$$

that is, both the row order and the column order of A are reversed (a 180° rotation).
Taking as an example an operation instruction that computes the 90° left-rotation transpose MLTTRAN of a matrix, a 90° left-rotation transpose operation is performed on a given matrix. In a specific implementation, given a square matrix A = (a_ij), its 90° left rotation A' is computed according to the following formula (taking a 4 × 4 matrix as an example):

$$A'_{ij} = a_{j\,(5-i)}, \qquad 1 \le i, j \le 4$$

that is, A is rotated 90° counterclockwise.
Taking as an example an operation instruction that computes the 90° right-rotation transpose MRTTRAN of a matrix, a 90° right-rotation transpose operation is performed on a given matrix. In a specific implementation, given a square matrix A = (a_ij), its 90° right rotation A' is computed according to the following formula (taking a 4 × 4 matrix as an example):

$$A'_{ij} = a_{(5-j)\,i}, \qquad 1 \le i, j \le 4$$

that is, A is rotated 90° clockwise.
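The six transpose modes above can be expressed compactly in array notation. In this sketch, the mnemonics MTRAN, MRTRAN, MCTRAN, MOTRAN, MLTTRAN, and MRTTRAN come from the text, but the exact element-level semantics of the non-conventional modes are inferred from their names (row/column reversal and 90° rotations), so treat the definitions as assumptions:

```python
import numpy as np

# Inferred element-level semantics of the six transpose instructions.
TRANSPOSE_MODES = {
    "MTRAN":   lambda a: a.T,             # conventional transpose (mirror across main diagonal)
    "MRTRAN":  lambda a: a[::-1, :],      # row transpose: reverse the row order
    "MCTRAN":  lambda a: a[:, ::-1],      # column transpose: reverse the column order
    "MOTRAN":  lambda a: a[::-1, ::-1],   # reverse transpose: reverse both (180 deg rotation)
    "MLTTRAN": lambda a: np.rot90(a, 1),  # rotate 90 deg counterclockwise (left)
    "MRTTRAN": lambda a: np.rot90(a, -1), # rotate 90 deg clockwise (right)
}

a = np.arange(16).reshape(4, 4)
# The reverse transpose equals two successive 90-degree rotations:
assert np.array_equal(TRANSPOSE_MODES["MOTRAN"](a), np.rot90(a, 2))
```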
Specifically, the blocking information may include at least one of block size information, blocking mode information, and block merging information. The block size information indicates the size of each block matrix obtained after the operation matrix is blocked. The blocking mode information indicates the manner in which the operation matrix is partitioned. The block merging information indicates the manner in which the transposed block matrices are re-merged into the transpose matrix of the operation matrix after each block matrix has been transposed.
The operation device can block the operation matrix, obtain the transpose matrices of the multiple block matrices by transposing each block matrix separately, and finally merge the transpose matrices of the multiple block matrices to obtain the transpose matrix of the operation matrix, so that the transpose of a matrix of any size can be completed with a single instruction in constant time complexity. Compared with traditional implementations of the matrix transpose operation, the matrix transpose is simpler and more efficient to use while the time complexity of the operation is reduced.
As shown in fig. 1-2, in some embodiments of the present disclosure, the computing device further includes: the device comprises an operation control module 2, an operation module 3, an address storage module 1 and a data storage module 4.
Specifically, the operation control module is configured to receive or determine blocking information;
the operation module is used for carrying out blocking and transposition operation on the operation matrix so as to obtain a transposition matrix of the operation matrix; further, the operation module divides the operation matrix into blocks according to the block information to obtain a block matrix, and transposes the block matrix according to the operation instruction to obtain a transpose matrix of the block matrix. Furthermore, the operation module is further configured to perform a combining operation after the blocking operation and the transposing operation, and combine the transposing matrices of the blocking matrices to obtain the transposing matrices of the operation matrices.
The address storage module is used for storing the address information of the operation matrix; and
the data storage module is used for storing original matrix data, wherein the original matrix data comprises the operation matrix and stores the calculated transposed matrix;
the operation control module is used for receiving address information and block information of the operation matrix, or is used for extracting the address information of the operation matrix from the address storage module and analyzing the address information of the operation matrix to obtain the block information; the operation module is used for acquiring address information and blocking information of an operation matrix from the operation control module, extracting the operation matrix from the data storage module according to the address information of the operation matrix, blocking the operation matrix according to the blocking information, transposing the blocking matrix according to an operation instruction, merging the obtained matrixes after the transposition of the blocking matrixes to obtain a transposing matrix of the operation matrix, and feeding back the transposing matrix of the operation matrix to the data storage module.
As shown in fig. 3, in some embodiments of the present disclosure, the operation module includes a matrix blocking unit, a matrix operation unit, and a matrix merging unit, where:
the matrix blocking unit 31 is configured to acquire address information and blocking information of an operation matrix from the operation control module, extract the operation matrix from the data storage module according to the address information of the operation matrix, and perform blocking operation on the operation matrix according to the blocking information to obtain n blocking matrices;
the matrix operation unit 32 is configured to obtain n block matrices, and perform transposition operation on the n block matrices according to a transposition manner obtained by the operation instruction, respectively, to obtain transposition matrices of the n block matrices;
the matrix combining unit 33 is configured to obtain a corresponding combining manner according to a transposition manner obtained by the operation instruction (the combining manner is the same as the corresponding transposition manner, that is, each partition is transposed in the corresponding transposition manner as an element), obtain and combine transpose matrices of n partition matrices, and obtain a transpose matrix of the operation matrix, where n is a natural number.
For example, as shown in fig. 4, for an operation matrix X stored in the data storage module, the matrix blocking unit of the operation module extracts the operation matrix X from the data storage module and blocks it according to the blocking information to obtain 4 block matrices X1, X2, X3 and X4, which are output to the matrix operation unit. The matrix operation unit acquires the 4 block matrices from the matrix blocking unit and transposes each of them to obtain the transpose matrices X1^T, X2^T, X3^T and X4^T, which are output to the matrix merging unit. The matrix merging unit acquires the transpose matrices of the 4 block matrices from the matrix operation unit and merges them to obtain the transpose matrix X^T of the operation matrix, which may further be output to the data storage module.
As shown in fig. 5, this is the process of implementing a 90° left-rotation transpose of a matrix in some embodiments of the present disclosure (taking a 4 × 4 matrix as an example). The process is: each block is first rotated 90° to the left internally, and then the arrangement of the blocks themselves is rotated 90° to the left. Other transpose operations proceed similarly and are not repeated here. In addition, the blocking scheme is not unique; it is described here only by way of example.
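The two-step scheme just described — rotate within each block, then rotate the block grid — can be checked in software. This is an illustrative model under assumed semantics (left rotation = counterclockwise, dimensions divisible by the block size), not the patent's hardware:

```python
import numpy as np

def blocked_rot90_left(x, bs):
    """Rotate x left (counterclockwise) by 90 degrees via blocks.

    Step 1: rotate each bs-by-bs block internally by 90 degrees left.
    Step 2: place it where the block grid, rotated 90 degrees left,
    would put it: grid position (i, j) moves to (gc-1-j, i).
    """
    r, c = x.shape
    gr, gc = r // bs, c // bs            # block-grid shape
    out = np.empty((c, r), dtype=x.dtype)
    for i in range(gr):
        for j in range(gc):
            block = np.rot90(x[i*bs:(i+1)*bs, j*bs:(j+1)*bs])
            oi, oj = gc - 1 - j, i       # rotated grid position
            out[oi*bs:(oi+1)*bs, oj*bs:(oj+1)*bs] = block
    return out

x = np.arange(16).reshape(4, 4)
assert np.array_equal(blocked_rot90_left(x, 2), np.rot90(x))
```

The check against `np.rot90` confirms that rotating the blocks and rotating the block arrangement composes to a rotation of the whole matrix, which is the property fig. 5 illustrates.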
In some embodiments of the present disclosure, please continue to refer to fig. 3, the operation module further includes a buffer unit 34 for buffering the n block matrices for the matrix operation unit to obtain.
In some embodiments of the present disclosure, the matrix combining unit may further include a memory configured to temporarily store the obtained transposed matrices of the block matrices, and after the matrix operation unit completes operations on all the block matrices, the matrix combining unit may obtain the transposed matrices of all the block matrices, and then select a corresponding combining mode according to different transposing modes to perform a combining operation on the transposed matrices of the n block matrices, so as to obtain transposed matrices, and write back an output result to the data storage module.
It should be understood by those skilled in the art that the matrix blocking unit, the matrix operation unit and the matrix combination unit may be implemented in the form of hardware, or may be implemented in the form of software program modules. The matrix partitioning unit and the matrix merging unit may include one or more control elements, and the matrix operation unit may include one or more control elements and calculation elements.
As shown in fig. 6, in some embodiments of the present disclosure, the operation control module includes an instruction processing unit 22, an instruction cache unit 21, and a matrix determination unit 23, where:
the instruction cache unit is used for storing a matrix operation instruction to be executed;
the instruction processing unit is used for acquiring the matrix operation instruction from the instruction cache unit, decoding the matrix operation instruction and extracting the address information of the operation matrix from the address storage module according to the decoded matrix operation instruction;
and the matrix judging unit is used for judging whether the block division is needed or not according to the address information of the operation matrix and obtaining the block division information according to the judgment result.
In some embodiments of the present disclosure, please refer to fig. 6 again, the operation control module further includes a dependency processing unit 24, configured to determine whether the decoded matrix operation instruction and the address information of the operation matrix conflict with a previous operation, and if so, temporarily store the decoded matrix operation instruction and the address information of the operation matrix; and if no conflict exists, transmitting the decoded matrix operation instruction and the address information of the operation matrix to a matrix judgment unit.
In some embodiments of the present disclosure, please refer to fig. 6, the operation control module further includes an instruction queue memory 25, configured to buffer the decoded matrix operation instruction and the address information of the operation matrix that have a conflict, and transmit the buffered decoded matrix operation instruction and the address information of the operation matrix to the matrix judgment unit after the conflict is eliminated.
Specifically, when the matrix operation instruction accesses the data storage module, the previous instruction and the next instruction may access the same block of storage space, and in order to ensure the correctness of the instruction execution result, if the current instruction is detected to have a dependency relationship with the data of the previous instruction, the instruction must wait in the instruction queue memory until the dependency relationship is eliminated.
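A minimal model of the conflict detection described above: two instructions conflict when their address ranges in the data storage module overlap, in which case the later instruction waits in the instruction queue memory. This is a toy sketch of the idea, not the patent's dependency-processing circuit, and the function name is hypothetical:

```python
def ranges_overlap(start_a, len_a, start_b, len_b):
    """True if two half-open address ranges [start, start+len) overlap."""
    return start_a < start_b + len_b and start_b < start_a + len_a

# Instruction 1 writes addresses [100, 116); instruction 2 reads [110, 120):
# the ranges overlap, so instruction 2 must wait until instruction 1 retires.
assert ranges_overlap(100, 16, 110, 10)
# Disjoint ranges carry no data dependency and may proceed concurrently.
assert not ranges_overlap(0, 16, 100, 16)
```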
In some embodiments of the present disclosure, please refer to fig. 6, wherein the instruction processing unit includes an instruction fetching unit 221 and a decoding unit 222, wherein:
the instruction fetching unit is used for obtaining a matrix operation instruction from the instruction cache unit and transmitting the matrix operation instruction to the decoding unit;
and the decoding unit is used for decoding the matrix operation instruction, extracting the address information of the operation matrix from the address storage module according to the decoded matrix operation instruction, and transmitting the decoded matrix operation instruction and the extracted address information of the operation matrix to the dependency relationship processing unit.
In some embodiments of the present disclosure, the above operation device further includes an input/output module configured to input the operation matrix to the data storage module, and further configured to obtain the computed transpose matrix from the data storage module and output it.
In some embodiments of the present disclosure, the address information of the operation matrix is start address information and matrix size information of the matrix.
In some embodiments of the present disclosure, the address information of the operation matrix is a storage address of the matrix in the data storage module.
In some embodiments of the present disclosure, the address storage module is a scalar register file or a general purpose memory unit; the data storage module is a high-speed temporary storage memory or a general memory unit.
In some embodiments of the present disclosure, the address storage module may be a scalar register file, which provides the scalar registers required during operation. The scalar registers can store not only matrix addresses but also scalar data; for example, after a large-scale matrix has been partitioned for transposition, a scalar register can record the number of matrix blocks.
In some embodiments of the present disclosure, the data storage module may be a scratch pad memory capable of supporting matrix data of different sizes.
In some embodiments of the present disclosure, the matrix determining unit is configured to determine a size of the matrix, and if the size exceeds a predetermined size threshold M, the matrix needs to be partitioned, and the matrix determining unit obtains partitioning information according to a result of the determination.
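The size-threshold judgment described above can be sketched as follows. This is a minimal illustration, not the disclosure's actual logic: the threshold `M`, the square tiling, and the `block_info` dictionary layout are all assumptions made for the example.

```python
# Hypothetical sketch of the matrix judgment step: decide whether an
# operation matrix must be partitioned and derive block information.
# The threshold M and the block-info layout are illustrative assumptions.

def judge_matrix(rows: int, cols: int, M: int = 64):
    """Return (needs_blocking, block_info) for a rows x cols matrix."""
    if rows <= M and cols <= M:
        return False, None            # small enough: transpose directly
    # Split each dimension into ceil(dim / M) tiles of at most M each.
    row_tiles = -(-rows // M)         # ceiling division
    col_tiles = -(-cols // M)
    block_info = {"tile": M, "row_tiles": row_tiles, "col_tiles": col_tiles}
    return True, block_info

needs, info = judge_matrix(200, 100, M=64)
# a 200 x 100 matrix with M = 64 is partitioned into 4 x 2 tiles
```

The blocking information produced here would then be handed, together with the matrix address information, to the blocking unit described below.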
In some embodiments of the present disclosure, the instruction cache unit is configured to store matrix operation instructions to be executed. During execution, if an instruction is the earliest among the uncommitted instructions in the instruction cache unit, it is committed; once committed, the changes that its operation makes to the device state cannot be undone. The instruction cache unit may be a reorder buffer.
In some embodiments of the present disclosure, the matrix operation instruction is a matrix transposition operation instruction (abbreviated as an operation instruction) and includes an operation code and an operation domain. The operation code indicates the function of the matrix transposition operation instruction; by identifying the operation code, the matrix operation control module determines that a matrix transposition operation is to be performed. The operation domain indicates the data information of the matrix transposition operation instruction, which may be an immediate or a register number. For example, to obtain a matrix, the matrix start address and matrix scale may be read from the registers indicated by the register numbers, and the matrix stored at the corresponding address is then obtained from the data storage module according to that start address and scale.
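The opcode/operand-domain structure just described can be sketched as below. The field names, the opcode value, and the three-register layout are illustrative assumptions; the disclosure does not fix a concrete binary encoding.

```python
# Illustrative encoding of a matrix transpose instruction: an opcode plus
# an operand domain whose fields are register numbers resolved against a
# scalar register file. All field names and values are assumptions.

from dataclasses import dataclass

TRANSPOSE_OPCODE = 0x1  # assumed opcode value

@dataclass
class MatrixInstruction:
    opcode: int
    src_addr_reg: int   # scalar register holding the matrix start address
    size_reg: int       # scalar register holding the matrix scale (rows, cols)
    dst_addr_reg: int   # scalar register holding the target storage address

def decode(instr: MatrixInstruction, scalar_regs: list):
    """Resolve register numbers in the operand domain to actual values."""
    if instr.opcode != TRANSPOSE_OPCODE:
        raise ValueError("not a transpose instruction")
    start = scalar_regs[instr.src_addr_reg]
    rows, cols = scalar_regs[instr.size_reg]
    dst = scalar_regs[instr.dst_addr_reg]
    return start, (rows, cols), dst

regs = [0] * 8
regs[1] = 0x1000          # matrix start address
regs[2] = (128, 256)      # matrix scale
regs[3] = 0x4000          # target address
instr = MatrixInstruction(TRANSPOSE_OPCODE, 1, 2, 3)
# decode(instr, regs) -> (0x1000, (128, 256), 0x4000)
```

With the start address and scale resolved, the operation module can fetch the matrix from the data storage module as the text describes.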
In some embodiments, as shown in fig. 7, the arithmetic device of the present embodiment includes an address storage module, an operation control module, an operation module, a data storage module, and an input/output module 5, wherein:
Optionally, the operation control module includes an instruction cache unit, an instruction processing unit, a dependency relationship processing unit, an instruction queue memory, and a matrix judgment unit, where the instruction processing unit includes an instruction fetch unit and a decoding unit;
optionally, the operation module includes a matrix blocking unit, a matrix caching unit, a matrix operation unit, and a matrix merging unit;
optionally, the address storage module is a scalar register file;
optionally, the data storage module is a scratch pad memory; the input/output module is an IO memory access module.
With this new operation structure, the method and the device implement matrix transposition simply and efficiently while reducing the time complexity of the operation.
The present disclosure also provides an operation method, including the following steps:
the operation control module receives or determines the block information;
the operation module divides the operation matrix into blocks according to the block information to obtain a block matrix, and transposes the block matrix according to the operation instruction to obtain a transpose matrix of the block matrix.
After the transposition step, the method may further include a merging operation, in which the operation module merges the transpose matrices of the block matrices to obtain the transpose matrix of the operation matrix.
In some embodiments, as shown in fig. 8, the operation method of the present disclosure includes the following steps:
step 1, an operation control module extracts address information of an operation matrix from an address storage module;
step 2, the operation control module obtains block information according to the address information of the operation matrix and transmits the address information and the block information of the operation matrix to the operation module;
step 3, the operation module extracts the operation matrix from the data storage module according to the address information of the operation matrix; dividing the operation matrix into n block matrixes according to the block information;
step 4, the operation module respectively performs transposition operation on the n block matrixes according to the operation instruction to obtain transposition matrixes of the n block matrixes;
step 5, the operation module merges the transpose matrixes of the n block matrixes to obtain the transpose matrix of the operation matrix and feeds the transpose matrix back to the data storage module;
wherein n is a natural number.
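Steps 3 through 5 above can be sketched in pure Python as one pass over tile-sized blocks: the matrix is blocked, each block is transposed, and each block transpose is merged into the mirrored block position of the result. The tile size plays the role of the blocking information; this is a behavioral sketch, not the hardware data path.

```python
# Minimal sketch of steps 3-5: split the operation matrix into block
# matrices (step 3), transpose each block (step 4), and merge the block
# transposes into the transpose of the whole matrix (step 5).

def transpose_by_blocks(mat, tile):
    rows, cols = len(mat), len(mat[0])
    out = [[None] * rows for _ in range(cols)]        # result is cols x rows
    for r0 in range(0, rows, tile):                   # step 3: blocking
        for c0 in range(0, cols, tile):
            r1 = min(r0 + tile, rows)
            c1 = min(c0 + tile, cols)
            for r in range(r0, r1):                   # step 4: transpose
                for c in range(c0, c1):               # one block matrix
                    # step 5: the block transpose lands at the mirrored
                    # block position, merging the result in place
                    out[c][r] = mat[r][c]
    return out

m = [[1, 2, 3], [4, 5, 6]]
transpose_by_blocks(m, tile=2)  # → [[1, 4], [2, 5], [3, 6]]
```

In hardware the blocks would move through the matrix cache unit one at a time; here the merge happens implicitly because each block writes directly into its mirrored position in `out`.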
In some embodiments, the present embodiment provides an operation method for performing a transpose operation of a large-scale matrix, which specifically includes the following steps:
step 1, the operation control module extracts address information of an operation matrix from an address storage module, and the method specifically comprises the following steps:
step 1-1, an instruction fetching unit extracts an operation instruction and sends the operation instruction to a decoding unit;
step 1-2, the decoding unit decodes the operation instruction, acquires the address information of the operation matrix from the address storage module according to the decoded operation instruction, and sends the decoded operation instruction and the address information of the operation matrix to the dependency relationship processing unit;
step 1-3, the dependency relationship processing unit analyzes whether the decoded operation instruction has a data dependency on a previous instruction that has not finished executing. Specifically, it may check whether a register to be read by the operation instruction still has a pending write; if it does, a dependency exists, and the operation instruction can be executed only after that data is written back.
If the dependency exists, the decoded operation instruction and the address information of the corresponding operation matrix need to wait in the instruction queue memory until the dependency does not exist on the data with the previous unexecuted instruction;
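The read-after-write check in steps 1-3 can be sketched with a simple scoreboard of pending register writes. The class name, the set-based scoreboard, and the string instruction tokens are all assumptions made for illustration.

```python
# Hedged sketch of the dependency check: an instruction that reads a
# register with a pending write is buffered in the instruction queue
# until the write completes; otherwise it is forwarded immediately.

class DependencyUnit:
    def __init__(self):
        self.pending_writes = set()   # register numbers awaiting write-back
        self.queue = []               # stands in for the instruction queue memory

    def issue(self, instr, read_regs):
        """Forward instr if free of read-after-write hazards, else buffer it."""
        if self.pending_writes & set(read_regs):
            self.queue.append((instr, read_regs))
            return None               # must wait in the queue
        return instr                  # forwarded to the matrix judgment unit

    def write_back(self, reg):
        """Complete a write and release any instructions it was blocking."""
        self.pending_writes.discard(reg)
        ready = [i for i, (_, regs) in enumerate(self.queue)
                 if not self.pending_writes & set(regs)]
        released = [self.queue[i][0] for i in ready]
        for i in reversed(ready):
            del self.queue[i]
        return released

dep = DependencyUnit()
dep.pending_writes.add(2)
dep.issue("T1", [2, 3])   # conflicts on register 2 -> queued, returns None
dep.write_back(2)         # releases ["T1"] once register 2 is written back
```

This mirrors the text: the queued instruction leaves the instruction queue memory only when no dependency remains with any unfinished prior instruction.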
step 2, the operation control module obtains block information according to the address information of the operation matrix;
specifically, when no dependency exists, the instruction queue memory transmits the decoded operation instruction and the address information of the corresponding operation matrix to the matrix judgment unit, which judges whether the matrix needs to be partitioned, obtains the partitioning information according to the judgment result, and transmits the partitioning information together with the address information of the operation matrix to the matrix partitioning unit;
step 3, the operation module extracts the operation matrix from the data storage module according to the address information of the operation matrix; dividing the operation matrix into n block matrixes according to the block information;
specifically, the matrix blocking unit takes out a required operation matrix from the data storage module according to the address information of the transmitted operation matrix, then divides the operation matrix into n blocking matrixes according to the transmitted blocking information, and transmits each blocking matrix to the matrix cache unit in sequence after blocking is completed;
step 4, the operation module respectively performs transposition operation on the n block matrixes according to a transposition mode obtained by decoding to obtain transposition matrixes of the n block matrixes;
specifically, the matrix operation unit sequentially extracts the block matrixes from the matrix cache unit, transposes each extracted block matrix, and then transmits the obtained transpose matrix of each block matrix to the matrix merging unit.
Step 5, the operation module merges the transpose matrixes of the n block matrixes to obtain the transpose matrix of the operation matrix, and feeds the transpose matrix back to the data storage module, and the method specifically comprises the following steps:
step 5-1, the matrix merging unit receives the transpose matrix of each block matrix; when the number of received transpose matrices reaches the total number of blocks, a matrix merging operation is performed on all the blocks according to the transposition mode obtained by decoding, yielding the transpose matrix of the operation matrix, which is fed back to a designated address of the data storage module;
step 5-2, the input/output module directly accesses the data storage module and reads from it the transpose matrix of the operation matrix obtained by the operation.
The present disclosure also provides another arithmetic device, including:
the operation control module is used for receiving or determining the block information;
and the iterative operation module is used for performing iterative partitioning, transposition and iterative merging operation on the operation matrix according to the partitioning information to obtain a transposition matrix of the operation matrix.
In some embodiments of the present disclosure, as shown in fig. 9 to 10, the computing device of the present disclosure includes an address storage module, a data storage module, the computing control module, and a computing module.
The address storage module is used for storing address information of the operation matrix;
the data storage module is used for storing original matrix data and storing the calculated transposition matrix;
the operation control module is used for extracting the address information of the operation matrix from the address storage module and analyzing the address information of the operation matrix to obtain the block information;
the operation module is the iterative operation module 3' and is used for acquiring the address information and the blocking information of the operation matrix from the operation control module, extracting the operation matrix from the data storage module according to the address information of the operation matrix, performing iterative blocking, transposition and iterative combination operation on the operation matrix according to the blocking information to obtain a transposition matrix of the operation matrix, and feeding back the transposition matrix of the operation matrix to the data storage module.
According to the method and the device, iterative partitioning can be performed on the operation matrix to obtain the to-be-transposed partitioning matrix which accords with the expected scale, the transposing matrixes of the multiple partitioning matrixes are obtained by performing transposing operation on the multiple partitioning matrixes respectively, and finally the transposing matrixes of the multiple partitioning matrixes are subjected to iterative merging to obtain the transposing matrix of the operation matrix, so that the transposing operation of any size of matrix can be completed within constant time complexity by using a single instruction. Compared with the traditional matrix transposition operation implementation method, the matrix transposition operation is simpler and more efficient to use while the operation time complexity is reduced.
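The iterative scheme described above can be sketched recursively: a matrix larger than the size threshold M is split along its larger dimension, each half is handled the same way (iterative blocking), and the half-transposes are merged by mirroring their positions. This is a behavioral sketch under the assumption of a pure-Python list-of-lists representation, not the disclosure's hardware flow.

```python
# Recursive sketch of iterative blocking, transposition, and iterative
# merging: [A; B]^T = [A^T  B^T] and [A  B]^T = [A^T; B^T].

def iterative_transpose(mat, M=2):
    rows, cols = len(mat), len(mat[0])
    if rows <= M and cols <= M:
        # block meets the expected scale: transpose it directly
        return [[mat[r][c] for r in range(rows)] for c in range(cols)]
    if rows > cols:
        # iterative blocking along rows; merge halves side by side
        h = rows // 2
        top = iterative_transpose(mat[:h], M)
        bot = iterative_transpose(mat[h:], M)
        return [t + b for t, b in zip(top, bot)]
    # iterative blocking along columns; merge halves one above the other
    h = cols // 2
    left = iterative_transpose([row[:h] for row in mat], M)
    right = iterative_transpose([row[h:] for row in mat], M)
    return left + right

iterative_transpose([[1, 2, 3], [4, 5, 6]], M=2)
# → [[1, 4], [2, 5], [3, 6]]
```

Because the recursion depth is logarithmic in the matrix dimensions and each block is transposed by one instruction's worth of work, this matches the text's claim that a single instruction can drive the transpose of a matrix of any size.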
As shown in fig. 9, in some embodiments of the present disclosure, the iterative operation module includes an iterative blocking unit 31 ', a matrix operation unit 32 ', and an iterative combining unit 33 '.
As shown in fig. 11, the iterative block partitioning unit 31' is configured to obtain the address information and blocking information of an operation matrix from the operation control module, extract the operation matrix from the data storage module according to the address information, and perform an iterative blocking operation on the operation matrix according to the blocking information to obtain n block matrices. It includes a matrix judgment unit 311', a matrix blocking unit 312', and a cache unit 313'. The matrix blocking unit partitions the operation matrix according to the blocking information and sends the resulting block matrices to the cache unit; the matrix judgment unit receives the matrix information sent by the cache unit and judges the scale of each matrix; the cache unit may feed a block matrix back into the matrix judgment unit, and any block matrix still exceeding the predetermined size threshold M is returned to the matrix blocking unit for further blocking. This iterates until the scale of every block matrix is less than or equal to the predetermined size threshold M.
As shown in fig. 12, the matrix operation unit 32' is configured to obtain the n block matrices and transpose each of them to obtain the transpose matrices of the n block matrices. It includes an address mapping generation unit 321', an address counter 322', and an element exchange unit 323'. The address mapping generation unit generates an address mapping table according to the input matrix scale information, address information, and transposition mode information (obtained by decoding the matrix transposition instruction); the element exchange unit exchanges the elements at the corresponding positions of the matrix according to the address mapping table; the address counter determines whether the entire matrix has been processed.
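For a row-major matrix, the address mapping table that the address mapping generation unit produces can be sketched as below: element (r, c) at linear address r·cols + c moves to linear address c·rows + r in the transpose. The dictionary representation and function names are assumptions; the hardware table layout is not specified in the text.

```python
# Sketch of the address mapping table for a transpose of a row-major
# rows x cols matrix, and of the element exchange that applies it.

def build_address_map(rows, cols):
    """Map each source linear address to its destination linear address."""
    return {r * cols + c: c * rows + r
            for r in range(rows) for c in range(cols)}

def exchange_elements(flat, rows, cols):
    """Apply the map to a flat buffer, mimicking the element exchange unit."""
    amap = build_address_map(rows, cols)
    out = [None] * len(flat)
    for src, dst in amap.items():
        out[dst] = flat[src]
    return out

# a 2 x 3 matrix [[1,2,3],[4,5,6]] stored row-major:
exchange_elements([1, 2, 3, 4, 5, 6], 2, 3)  # → [1, 4, 2, 5, 3, 6]
```

The address counter's role corresponds to knowing when every entry of the mapping table has been consumed, i.e. when the whole block has been exchanged.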
As shown in fig. 13, the iterative combining unit 33' is configured to obtain and iteratively merge the transpose matrices of the n block matrices to obtain the transpose matrix of the operation matrix, where n is a natural number. Specifically, it receives and temporarily caches the transferred block matrices and, after all block matrices have been transposed, performs an iterative merging operation on the transpose matrices of the n block matrices to obtain the transpose matrix of the operation matrix. It includes a cache unit 334', an address mapping generation unit 331', an address counter 332', a matrix merging unit 333', and a matrix judgment unit 335'. After the matrix operation unit finishes transposing all block matrices, the cache unit of the iterative merging unit receives and temporarily caches the transposed block matrices; the matrix merging unit receives and merges them, with the address mapping generation unit and address counter treating each block matrix as a single element, so that the element exchange operation of the matrix operation unit is performed in batch on whole block matrices; the matrix judgment unit then ensures that the result is merged back to the original, un-blocked scale.
In this embodiment, the iterative merging unit may further include an element exchanging unit, that is, element-by-element exchanging of the corresponding address, or may further include a memory exchanging unit, that is, batch exchanging of matrix blocks (transposed small matrix blocks) under the corresponding address.
In this embodiment, the operation module may not include a matrix judgment unit, and when the matrix blocking unit and the matrix combining unit perform iterative blocking and iterative combining, matrix scale judgment is directly performed through the matrix judgment unit included in the operation control module.
The structure of the iteration blocking unit, the iteration merging unit and the matrix operation unit in this embodiment is not limited to the structure shown in fig. 11 to 13. In other words, the matrix determination unit in the computing apparatus may be used in a non-shared manner or in a shared manner, the operation module may include an address mapping generation unit, an address counter, an element exchange unit, and a cache unit, and these units may be used in a non-shared manner or in a shared manner, and are not limited to the iteration blocking unit and the iteration merging unit, which respectively include the address mapping generation unit, the address counter, the element exchange unit, and the cache unit.
It should be noted that, because matrix transposition and matrix merging are separated in time (all blocks are transposed before merging begins), the corresponding hardware of the operation unit is idle during merging, so matrix merging can reuse that unit. However, if the method is applied to a pipeline, for example a three-stage pipeline, then while one matrix that has finished transposition is about to enter the matrix merging unit, the next block of the matrix being transposed enters the operation unit and the matrix after that enters the matrix blocking unit; the blocking, operation, and merging units are thus all occupied simultaneously, and none of the sub-modules of the blocking unit, transposition unit, and merging unit can be reused.
Correspondingly, the present disclosure also provides an operation method, including the following steps:
the operation control module receives or determines the block information;
and the iterative operation module performs iterative partitioning, transposition and iterative combination operation on the operation matrix according to the partitioning information to obtain a transposition matrix of the operation matrix.
In some embodiments, the method of operation comprises:
the operation control module extracts address information of the operation matrix from the address storage module;
the operation control module obtains the block information according to the address information of the operation matrix and transmits the address information and the block information of the operation matrix to the operation module;
the operation module extracts the operation matrix from the data storage module according to the address information of the operation matrix; the operation matrix is iteratively divided, according to the block information, into n block matrices whose scale meets the transposition requirement;
the operation module respectively performs transposition operation on the n block matrixes to obtain transposition matrixes of the n block matrixes; the operation module iteratively merges the transpose matrixes of the n block matrixes to obtain the transpose matrix of the operation matrix and feeds the transpose matrix back to the data storage module; wherein n is a natural number.
In the step of performing iterative block operation on the operation matrix by the iterative block unit according to the block information:
the matrix blocking unit blocks the operation matrix according to the blocking information and sends the blocking matrix to the cache unit;
the cache unit inputs the block matrices to the matrix judgment unit; any block matrix that the matrix judgment unit finds to exceed the predetermined size threshold M is sent back to the matrix blocking unit for further blocking, iterating until the scale of every block matrix is less than or equal to the predetermined size threshold M.
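The iteration above can be sketched as a worklist: any block still exceeding the threshold M is split again until every block fits. Representing blocks as (row0, col0, rows, cols) regions is an illustrative choice, not the disclosure's data format.

```python
# Worklist sketch of iterative blocking: keep splitting any block whose
# size exceeds the threshold M until every block is at most M x M.

def iterative_blocking(rows, cols, M):
    work = [(0, 0, rows, cols)]       # (row0, col0, height, width) regions
    done = []
    while work:
        r0, c0, h, w = work.pop()
        if h <= M and w <= M:
            done.append((r0, c0, h, w))   # block meets the size requirement
        elif h >= w:
            # split the taller dimension in half and re-queue both halves
            work += [(r0, c0, h // 2, w), (r0 + h // 2, c0, h - h // 2, w)]
        else:
            # split the wider dimension in half and re-queue both halves
            work += [(r0, c0, h, w // 2), (r0, c0 + w // 2, h, w - w // 2)]
    return done

blocks = iterative_blocking(4, 4, 2)
# every resulting block fits within 2 x 2 and the block areas sum to 16
```

The worklist plays the role of the cache unit feeding blocks back to the matrix judgment unit; `done` holds the blocks that pass the scale check and proceed to transposition.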
In the step of obtaining and iteratively combining the transpose matrices of the n block matrices by the iterative combining unit of the iterative operation module:
the cache unit receives the transferred block matrix;
the matrix merging unit receives and merges the transposed block matrices sent by the cache unit;
the address mapping generation unit and the element exchange unit take each block matrix in the merged matrix as an element to carry out element exchange operation;
the matrix judgment unit determines the scale of the matrix obtained by combination.
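The block-level merge above, in which each block matrix is treated as a single element, can be sketched as placing the transpose of original block (i, j) at block position (j, i) of the result. The pure-Python dictionary-of-blocks representation is an assumption for illustration.

```python
# Sketch of the batch element exchange: blocks[(i, j)] holds the transpose
# of original block (i, j); merging places it at block position (j, i).

def merge_transposed_blocks(blocks, row_tiles, col_tiles):
    """Merge transposed blocks into the transpose of the whole matrix."""
    merged_rows = []
    for j in range(col_tiles):        # block-row j of the result collects
        row_of_blocks = [blocks[(i, j)] for i in range(row_tiles)]
        for r in range(len(row_of_blocks[0])):
            merged_rows.append([x for blk in row_of_blocks for x in blk[r]])
    return merged_rows

# original 2 x 4 matrix [[1,2,3,4],[5,6,7,8]] split into two 2 x 2 blocks:
blocks = {(0, 0): [[1, 5], [2, 6]],   # transpose of [[1, 2], [5, 6]]
          (0, 1): [[3, 7], [4, 8]]}   # transpose of [[3, 4], [7, 8]]
merge_transposed_blocks(blocks, row_tiles=1, col_tiles=2)
# → [[1, 5], [2, 6], [3, 7], [4, 8]]
```

A final scale check, as the matrix judgment unit performs, would confirm the merged result has the original matrix's transposed dimensions.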
In the foregoing embodiment, the block information may be obtained by analyzing address information, or may be obtained directly from input data, that is, the input data of the operation control module includes the block information.
The following further describes in detail the components of the computing device of the present disclosure:
the instruction fetching unit is responsible for fetching a next operation instruction to be executed from the instruction cache unit and transmitting the operation instruction to the decoding unit;
the decoding unit is responsible for decoding the operation instruction, sending the decoded operation instruction to the scalar register file to obtain address information of an operation matrix fed back by the scalar register file, and transmitting the decoded operation instruction and the obtained address information of the operation matrix to the dependency relationship processing unit;
and the dependency relationship processing unit is used for processing the storage dependency relationship which can exist between the operation instruction and the previous instruction. The matrix operation instruction accesses the scratch pad memory, and the previous and subsequent instructions may access the same block of memory space. In order to ensure the correctness of the instruction execution result, if the current operation instruction is detected to have a dependency relationship with the data of the previous operation instruction, the operation instruction must be cached in an instruction queue memory until the dependency relationship is eliminated; if the current operation instruction does not have a dependency relationship with the previous operation instruction, the dependency relationship processing unit directly transmits the address information of the operation matrix and the decoded operation instruction to the matrix judgment unit.
The instruction queue memory is used for caching decoded operation instructions that have conflicts, together with the address information of their operation matrices, since dependencies may exist on the scalar registers corresponding to or designated by different operation instructions; after the dependency is resolved, the decoded operation instruction and the address information of the corresponding operation matrix are transmitted to the matrix judgment unit;
and the matrix judgment unit is used for judging the size of the matrix according to the address information of the operation matrix, if the size exceeds a preset size threshold value M, the matrix needs to be subjected to blocking operation, the matrix judgment unit analyzes the blocking information according to the judgment result, and transmits the address information of the operation matrix and the obtained blocking information to the matrix blocking unit.
And the matrix blocking unit is used for extracting the operation matrix needing transposition operation from the high-speed temporary storage according to the address information of the operation matrix and blocking the operation matrix according to the blocking information to obtain n blocking matrixes. The matrix caching unit is used for caching n partitioned matrixes after partitioning and sequentially transmitting the n partitioned matrixes to the matrix operation unit for transposition operation;
the matrix operation unit is responsible for extracting the block matrixes from the matrix cache unit in sequence, performing transposition operation according to a matrix transposition mode obtained by the decoding unit, and transmitting the transposed block matrixes to the matrix merging unit;
and the matrix merging unit is responsible for receiving and temporarily caching the transferred block matrixes, and after all the block matrixes are subjected to transposition operation, the matrix merging unit performs merging operation on the transposition matrixes of the n block matrixes according to the matrix transposition mode obtained by the decoding unit to obtain the transposition matrixes of the operation matrixes.
The scalar register file is used for providing scalar registers required by the device in the operation process and providing address information of an operation matrix for operation;
the scratch pad memory is a storage device dedicated to matrix data and can support matrix data of different sizes.
And the IO memory access module is used for directly accessing the scratch pad memory and is responsible for reading data from the scratch pad memory or writing data into the scratch pad memory.
It should be noted that, in the operation device and the operation method of the present disclosure, the operation control module may directly receive the address information and the block information of the operation matrix, or is configured to extract the address information of the operation matrix from the address storage module, and analyze the address information of the operation matrix to obtain the block information.
In addition, in some embodiments, the present disclosure also provides a chip including the above-mentioned arithmetic device.
In some embodiments, the present disclosure also provides a chip packaging structure, which includes the above chip.
In some embodiments, the present disclosure also provides a board card including the above chip package structure.
In some embodiments, the present disclosure also provides an electronic device, which includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the essence of the technical solution of the present application, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, such as a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments further describe the objects, technical solutions, and advantages of the present disclosure in detail. It should be understood that they are merely illustrative of the present disclosure and are not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (17)

1. An arithmetic device comprising:
an operation control module for receiving or determining blocking information; and
an operation module, which is an iterative operation module, for performing iterative blocking, transposition, and iterative merging operations on an operation matrix according to the blocking information to obtain a transpose matrix of the operation matrix;
wherein the iterative operation module comprises:
an iterative blocking unit for blocking the operation matrix according to the blocking information, judging whether the scale of each blocked matrix exceeds a preset scale threshold, blocking again any matrix whose scale exceeds the preset scale threshold, and iterating the blocking until the scale of every blocked matrix is not larger than the preset scale threshold, thereby obtaining n block matrices;
a matrix operation unit for acquiring the n block matrices and performing a transpose operation on each of them to obtain transpose matrices of the n block matrices; and
an iterative merging unit for acquiring and iteratively merging the transpose matrices of the n block matrices to obtain the transpose matrix of the operation matrix, wherein n is a natural number;
wherein the operation module transposes the operation matrix according to a transposing mode obtained from an operation instruction, the operation instruction comprising an instruction type, a data type, a matrix head address, a matrix row number, a matrix column number, and a target storage address; and
wherein the iterative merging unit merges the transpose matrices of the n block matrices in a manner corresponding to the manner in which the matrix operation unit transposed the n block matrices, the merging manner being determined according to the number of rows or the number of columns of the matrices.
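The block-transpose-merge flow recited in claim 1 can be modeled in software. The sketch below is illustrative only (the claim describes hardware units); the threshold value of 4 and the list-of-rows matrix representation are assumptions made for clarity, and the split-along-the-larger-dimension strategy is one possible blocking mode.

```python
BLOCK_THRESHOLD = 4  # preset scale threshold: max rows/cols transposed directly

def transpose(m):
    """Plain transpose of a small block (matrix as a list of rows)."""
    return [list(col) for col in zip(*m)]

def block_transpose(m):
    rows, cols = len(m), len(m[0])
    # Base case: the block's scale does not exceed the preset threshold.
    if rows <= BLOCK_THRESHOLD and cols <= BLOCK_THRESHOLD:
        return transpose(m)
    # Iterative blocking: split along the larger dimension and recurse.
    if rows >= cols:
        top, bottom = m[:rows // 2], m[rows // 2:]
        t, b = block_transpose(top), block_transpose(bottom)
        # Iterative merging: a row split transposes into a column concatenation,
        # so the merging manner follows from the row count of the split.
        return [t_row + b_row for t_row, b_row in zip(t, b)]
    left = [row[:cols // 2] for row in m]
    right = [row[cols // 2:] for row in m]
    # A column split transposes into a row concatenation.
    return block_transpose(left) + block_transpose(right)

a = [[r * 8 + c for c in range(8)] for r in range(6)]
assert block_transpose(a) == transpose(a)
```

The merge step illustrates why the merging manner is derived from the matrix's row or column count: whichever dimension was split determines whether the transposed blocks are concatenated horizontally or vertically.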
2. The arithmetic device according to claim 1, wherein the blocking information comprises at least one of block size information, blocking mode information, and block merging information; the block size information indicates the size of each block matrix obtained after the operation matrix is blocked; the blocking mode information indicates the manner in which the operation matrix is blocked; and the block merging information indicates the manner in which the transposed block matrices are re-merged, after the transpose operation has been performed on each block matrix, to obtain the transpose matrix of the operation matrix.
3. The arithmetic device of claim 1, wherein the operation instruction comprises a normal transpose instruction, a row transpose instruction, a column transpose instruction, a reverse transpose instruction, a 90° left-flip transpose instruction, and a 90° right-flip transpose instruction.
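Claim 3 names six transpose-instruction variants but does not define their semantics. The functions below give one plausible reading of each variant as a software sketch; all six mappings are assumptions, not the patent's definitions, using a list-of-rows matrix representation.

```python
def normal_transpose(m):
    """Standard transpose: rows become columns."""
    return [list(col) for col in zip(*m)]

def row_transpose(m):
    """Assumed reading: reverse the row order (vertical flip)."""
    return [list(row) for row in reversed(m)]

def column_transpose(m):
    """Assumed reading: reverse each row (horizontal flip)."""
    return [list(reversed(row)) for row in m]

def reverse_transpose(m):
    """Assumed reading: 180-degree rotation (both flips combined)."""
    return [list(reversed(row)) for row in reversed(m)]

def rotate_left_90(m):
    """Assumed reading of the 90° left-flip: counterclockwise rotation."""
    return [list(col) for col in reversed(list(zip(*m)))]

def rotate_right_90(m):
    """Assumed reading of the 90° right-flip: clockwise rotation."""
    return [list(reversed(col)) for col in zip(*m)]

m = [[1, 2], [3, 4], [5, 6]]
assert normal_transpose(m) == [[1, 3, 5], [2, 4, 6]]
assert rotate_right_90(m) == [[5, 3, 1], [6, 4, 2]]
```

Note that all six variants are data rearrangements with no arithmetic, which is why a single blocked data path can serve every instruction type in the claim.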
4. The arithmetic device of claim 1 or 3, further comprising:
an address storage module for storing address information of the operation matrix; and
a data storage module for storing the operation matrix and the computed transpose matrix;
wherein the operation control module is configured to receive the address information and the blocking information of the operation matrix, or to extract the address information of the operation matrix from the address storage module and analyze it to obtain the blocking information; and
the operation module is configured to acquire the address information and the blocking information of the operation matrix from the operation control module, extract the operation matrix from the data storage module according to the address information, perform the iterative blocking, transposition, and iterative merging operations on the operation matrix to obtain its transpose matrix, and feed the transpose matrix back to the data storage module.
5. The arithmetic device of claim 4, wherein the operation module further comprises a cache unit for caching the n block matrices for retrieval by the matrix operation unit.
6. The arithmetic device of claim 4, wherein the operation control module comprises an instruction processing unit, an instruction cache unit, and a matrix judgment unit, wherein:
the instruction cache unit is used for storing the operation instruction to be executed;
the instruction processing unit is used for acquiring the operation instruction from the instruction cache unit, decoding it, and acquiring the address information of the operation matrix from the address storage module according to the decoded operation instruction; and
the matrix judgment unit is used for analyzing the address information of the operation matrix to obtain the blocking information.
7. The arithmetic device of claim 6, wherein the operation control module further comprises a dependency processing unit for determining whether the decoded operation instruction and the address information of the operation matrix conflict with a previous operation; if a conflict exists, the decoded operation instruction and the address information of the operation matrix are temporarily stored; and if no conflict exists, the decoded operation instruction and the address information of the operation matrix are transmitted to the matrix judgment unit.
8. The arithmetic device of claim 7, wherein the operation control module further comprises an instruction queue memory for buffering the decoded operation instruction and the address information of the operation matrix for which the conflict exists, and for transmitting the buffered decoded operation instruction and address information to the matrix judgment unit once the conflict is resolved.
9. The arithmetic device of claim 7, wherein the instruction processing unit comprises an instruction fetch unit and a decoding unit, wherein:
the instruction fetch unit is used for fetching the operation instruction from the instruction cache unit and transmitting it to the decoding unit; and
the decoding unit is used for decoding the operation instruction, extracting the address information of the operation matrix from the address storage module according to the decoded operation instruction, and transmitting the decoded operation instruction and the extracted address information to the dependency processing unit.
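The control flow of claims 6 through 9 (fetch, decode, dependency check, then dispatch or park in the instruction queue) can be modeled in a few lines. This is a minimal software sketch only: the field names, the address-overlap conflict rule, and the simplification that queued instructions are not later re-issued are all assumptions, not details taken from the patent.

```python
from collections import deque

class OpInstruction:
    """A decoded operation instruction with source and destination addresses."""
    def __init__(self, name, src_addr, dst_addr):
        self.name, self.src_addr, self.dst_addr = name, src_addr, dst_addr

def has_conflict(instr, in_flight):
    # Assumed conflict rule: a data dependency exists if this instruction
    # reads or writes an address that an unfinished instruction writes.
    return any(prev.dst_addr in (instr.src_addr, instr.dst_addr)
               for prev in in_flight)

def dispatch(instructions):
    in_flight = []   # previously issued instructions, not yet finished
    queue = deque()  # instruction queue memory for conflicting instructions
    issued = []
    for instr in instructions:
        if has_conflict(instr, in_flight):
            # Park in the queue until the dependency clears
            # (this sketch does not model the later re-issue).
            queue.append(instr)
        else:
            in_flight.append(instr)
            issued.append(instr.name)
    return issued, [i.name for i in queue]

ops = [OpInstruction("TRANS_A", 0x100, 0x200),
       OpInstruction("TRANS_B", 0x200, 0x300)]  # reads what TRANS_A writes
issued, queued = dispatch(ops)
assert issued == ["TRANS_A"] and queued == ["TRANS_B"]
```

The point of the queue in claim 8 is that a dependent instruction is buffered rather than dropped, so program order is preserved once the earlier write completes.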
10. The arithmetic device of claim 1, further comprising:
a data storage module for storing the operation matrix and the computed transpose matrix; and
an input/output module for inputting the operation matrix data to the data storage module, and for acquiring the computed transpose matrix from the data storage module and outputting it.
11. The arithmetic device of claim 4, wherein the address storage module comprises a scalar register file or a general-purpose memory unit; the data storage module comprises a scratchpad memory (high-speed temporary storage memory) or a general-purpose memory unit; and the address information of the operation matrix comprises start address information and matrix size information of the matrix.
12. An arithmetic method comprising the steps of:
receiving or determining, by an operation control module, blocking information; and
performing, by an iterative operation module, iterative blocking, transposition, and iterative merging operations on an operation matrix according to the blocking information to obtain a transpose matrix of the operation matrix;
wherein the step of performing the iterative blocking, transposition, and iterative merging operations on the operation matrix according to the blocking information to obtain the transpose matrix of the operation matrix comprises:
blocking, by an iterative blocking unit of the iterative operation module, the operation matrix according to the blocking information, judging whether the scale of each blocked matrix exceeds a preset scale threshold, blocking again any matrix whose scale exceeds the preset scale threshold, and iterating the blocking until the scale of every blocked matrix is not larger than the preset scale threshold, thereby obtaining n block matrices;
acquiring, by a matrix operation unit of the iterative operation module, the n block matrices and performing a transpose operation on each of them to obtain transpose matrices of the n block matrices; and
acquiring and iteratively merging, by an iterative merging unit of the iterative operation module, the transpose matrices of the n block matrices to obtain the transpose matrix of the operation matrix, wherein n is a natural number;
wherein the operation module transposes the operation matrix according to a transposing mode obtained from an operation instruction, the operation instruction comprising an instruction type, a data type, a matrix head address, a matrix row number, a matrix column number, and a target storage address; and
wherein the iterative merging unit merges the transpose matrices of the n block matrices in a manner corresponding to the manner in which the matrix operation unit transposed the n block matrices, the merging manner being determined according to the number of rows or the number of columns of the matrices.
13. The arithmetic method according to claim 12, wherein the blocking information comprises at least one of block size information, blocking mode information, and block merging information; the block size information indicates the size of each block matrix obtained after the operation matrix is blocked; the blocking mode information indicates the manner in which the operation matrix is blocked; and the block merging information indicates the manner in which the transposed block matrices are re-merged, after the transpose operation has been performed on each block matrix, to obtain the transpose matrix of the operation matrix.
14. The arithmetic method of claim 12, wherein the operation instruction comprises a normal transpose instruction, a row transpose instruction, a column transpose instruction, a reverse transpose instruction, a 90° left-flip transpose instruction, and a 90° right-flip transpose instruction.
15. The arithmetic method of any one of claims 12 to 14, wherein the operation control module determining the blocking information comprises:
extracting, by the operation control module, address information of the operation matrix from an address storage module; and
determining, by the operation control module, the blocking information according to the address information of the operation matrix.
16. The arithmetic method of claim 15, wherein the operation control module extracting the address information of the operation matrix from the address storage module comprises:
fetching, by an instruction fetch unit, the operation instruction and sending it to a decoding unit;
decoding, by the decoding unit, the operation instruction, acquiring the address information of the operation matrix from the address storage module according to the decoded operation instruction, and sending the decoded operation instruction and the address information to a dependency processing unit; and
analyzing, by the dependency processing unit, whether the decoded operation instruction has a data dependency on a previous instruction that has not finished executing; if a dependency exists, the decoded operation instruction and the corresponding address information of the operation matrix wait in an instruction queue memory until the data dependency on the previous unfinished instruction no longer exists.
17. The arithmetic method of claim 16, wherein the step of merging, by the iterative operation module, the transpose matrices of the n block matrices to obtain the transpose matrix of the operation matrix and feeding it back to a data storage module comprises:
receiving, by the iterative merging unit, the transpose matrix of each block matrix and, once the number of received transpose matrices reaches the total number of blocks, performing the iterative merging operation on all blocks to obtain the transpose matrix of the operation matrix, and feeding the transpose matrix back to a designated address of the data storage module; and
directly accessing, by an input/output module, the data storage module and reading from it the computed transpose matrix of the operation matrix.
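The count-triggered merge in claim 17 (buffer incoming block transposes and merge only when all have arrived) can be sketched as follows. This is a software illustration only; the 2×2 row-major block layout, the class name, and the dictionary buffer are assumptions, not details from the patent.

```python
class MergeUnit:
    """Buffers transposed blocks and merges once all blocks have arrived."""
    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.received = {}  # block index -> transposed block

    def receive(self, index, block_t):
        """Accept one transposed block; return the merged result only when
        the count of received blocks reaches the total block count."""
        self.received[index] = block_t
        if len(self.received) < self.total_blocks:
            return None  # still waiting for more blocks
        return self._merge()

    def _merge(self):
        # Assumed layout: blocks 0..3 tile a square matrix in row-major
        # order; after transposition, the off-diagonal blocks swap places.
        tl, tr, bl, br = (self.received[i] for i in range(4))
        top = [a + b for a, b in zip(tl, bl)]     # bl moves to the top-right
        bottom = [a + b for a, b in zip(tr, br)]  # tr moves to the bottom-left
        return top + bottom

# [[1, 2], [3, 4]] split into four 1x1 blocks; each 1x1 transpose is itself.
unit = MergeUnit(total_blocks=4)
result = None
for i, blk in {0: [[1]], 1: [[2]], 2: [[3]], 3: [[4]]}.items():
    result = unit.receive(i, blk)  # None until the final block arrives
assert result == [[1, 3], [2, 4]]
```

Waiting for the full block count before merging lets the block transposes arrive in any order, which matters when the matrix operation unit processes blocks concurrently.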
CN201811085786.6A 2018-09-18 2018-09-18 Arithmetic device and method Active CN109471612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811085786.6A CN109471612B (en) 2018-09-18 2018-09-18 Arithmetic device and method


Publications (2)

Publication Number Publication Date
CN109471612A CN109471612A (en) 2019-03-15
CN109471612B true CN109471612B (en) 2020-08-21

Family

ID=65664409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811085786.6A Active CN109471612B (en) 2018-09-18 2018-09-18 Arithmetic device and method

Country Status (1)

Country Link
CN (1) CN109471612B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5644517A (en) * 1992-10-22 1997-07-01 International Business Machines Corporation Method for performing matrix transposition on a mesh multiprocessor architecture having multiple processor with concurrent execution of the multiple processors
CN101093474A (en) * 2007-08-13 2007-12-26 北京天碁科技有限公司 Method for implementing matrix transpose by using vector processor, and processing system
CN103048644A (en) * 2012-12-19 2013-04-17 电子科技大学 Matrix transposing method of SAR (synthetic aperture radar) imaging system and transposing device
CN107315574A (en) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing matrix multiplication

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6839728B2 (en) * 1998-10-09 2005-01-04 Pts Corporation Efficient complex multiplication and fast fourier transform (FFT) implementation on the manarray architecture
US6898691B2 (en) * 2001-06-06 2005-05-24 Intrinsity, Inc. Rearranging data between vector and matrix forms in a SIMD matrix processor
US7031994B2 (en) * 2001-08-13 2006-04-18 Sun Microsystems, Inc. Matrix transposition in a computer system
CN101937425B (en) * 2009-07-02 2012-05-30 北京理工大学 Matrix parallel transposition method based on GPU multi-core platform
CN102043605B (en) * 2010-12-23 2012-10-24 龙芯中科技术有限公司 Multimedia transformation multiplier and processing method thereof
CN102508803A (en) * 2011-12-02 2012-06-20 南京大学 Matrix transposition memory controller
US20140149480A1 (en) * 2012-11-28 2014-05-29 Nvidia Corporation System, method, and computer program product for transposing a matrix
CN103412284B (en) * 2013-08-29 2015-05-20 西安电子科技大学 Matrix transposition method in SAR imaging system based on DSP chip
CN103761215B (en) * 2014-01-15 2016-08-24 北京新松佳和电子***股份有限公司 Matrix transpose optimization method based on graphic process unit
CN104270643B (en) * 2014-09-25 2017-05-10 复旦大学 Address mapping algorithm for transposed matrix based on single-port SRAM
CN107957975B (en) * 2017-12-15 2021-01-05 安徽寒武纪信息科技有限公司 Calculation method and related product



Similar Documents

Publication Publication Date Title
US10671913B2 (en) Computation device and method
CN110688157B (en) Computing device and computing method
EP3633526A1 (en) Computation device and method
CN109240746B (en) Apparatus and method for performing matrix multiplication operation
CN107315715B (en) Apparatus and method for performing matrix addition/subtraction operation
US8984043B2 (en) Multiplying and adding matrices
CN108009126B (en) Calculation method and related product
US9910802B2 (en) High bandwidth low latency data exchange between processing elements
CN107315718B (en) Device and method for executing vector inner product operation
CN107315717B (en) Device and method for executing vector four-rule operation
CN108121688B (en) Calculation method and related product
CN107315566B (en) Apparatus and method for performing vector circular shift operation
CN111258935B (en) Data transmission device and method
CN107315716B (en) Device and method for executing vector outer product operation
CN108108190B (en) Calculation method and related product
WO2017185385A1 (en) Apparatus and method for executing vector merging operation
CN107943756B (en) Calculation method and related product
CN112416433A (en) Data processing device, data processing method and related product
CN108090028B (en) Calculation method and related product
CN108108189B (en) Calculation method and related product
CN109471612B (en) Arithmetic device and method
CN110147222B (en) Arithmetic device and method
CN111258769B (en) Data transmission device and method
CN108733625B (en) Arithmetic device and method
CN111258653A (en) Atomic access and storage method, storage medium, computer equipment, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant
GR01 Patent grant