CN114996649A - Method for realizing matrix decomposition and lower triangular matrix inversion - Google Patents


Publication number
CN114996649A
Authority
CN
China
Prior art keywords
matrix
data
block
processing
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210499747.0A
Other languages
Chinese (zh)
Inventor
李皇
孙长江
王文青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
STMicroelectronics Shenzhen R&D Co Ltd
Original Assignee
STMicroelectronics Shenzhen R&D Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by STMicroelectronics Shenzhen R&D Co Ltd filed Critical STMicroelectronics Shenzhen R&D Co Ltd
Priority to CN202210499747.0A priority Critical patent/CN114996649A/en
Publication of CN114996649A publication Critical patent/CN114996649A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The application belongs to the technical field of data processing and provides a method for implementing matrix decomposition and lower triangular matrix inversion. In an embodiment of the application, the matrix dimension of a matrix to be solved is obtained. When the matrix dimension is larger than the number of registers in a preset register group, the matrix to be solved is partitioned according to the number of registers to obtain a plurality of block matrices. The matrix data of each block matrix is processed in the register group according to the decomposition algorithm corresponding to that block matrix, where the decomposition algorithm is determined according to the relative position of the block matrix. Matrix calculation is then performed in turn on the processed matrix data of each block matrix through a preset two-dimensional systolic array, and a target matrix corresponding to each block matrix is determined. Finally, result separation is performed on the target matrices to determine the upper triangular matrix and the lower triangular matrix corresponding to the matrix to be solved, which improves the throughput of the matrix decomposition calculation.

Description

Method for realizing matrix decomposition and lower triangular matrix inversion
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a method for realizing matrix decomposition and lower triangular matrix inversion.
Background
Matrix operations are an important component of modern information processing. Matrix decomposition, for example, is widely used in real-time control, Kalman filtering, radar signal processing and similar fields. In the prior art, matrix decomposition is performed with single floating-point operations, so the operation throughput is low.
Disclosure of Invention
The embodiment of the application provides a matrix decomposition implementation method, a matrix decomposition implementation device, terminal equipment and a storage medium, and can solve the problem of low throughput when matrix decomposition calculation is performed.
In a first aspect, an embodiment of the present application provides a method for implementing matrix decomposition, including:
acquiring the matrix dimension of a matrix to be solved;
when the matrix dimension is larger than the number of registers in a preset register group, partitioning the matrix to be solved according to the number of registers to obtain a plurality of block matrices;
performing data processing on the matrix data of each block matrix in the register group according to the decomposition algorithm corresponding to that block matrix, sequentially performing matrix calculation on the processed matrix data of each block matrix through a preset two-dimensional systolic array, and determining a target matrix corresponding to each block matrix, wherein the decomposition algorithm corresponding to a block matrix is determined according to the relative position of the block matrix;
and performing result separation on the target matrices, and determining an upper triangular matrix and a lower triangular matrix corresponding to the matrix to be solved.
In a second aspect, an embodiment of the present application provides a method for implementing inversion of a lower triangular matrix, including:
acquiring the matrix dimension of the lower triangular matrix to be solved;
when the dimension of the lower triangular matrix is larger than the number of registers in a preset register group, partitioning the lower triangular matrix along its main diagonal according to the number of registers to obtain a plurality of lower triangular block matrices;
performing data processing on the matrix data of each lower triangular block matrix in the register group according to the inversion algorithm corresponding to that lower triangular block matrix, performing matrix calculation on the processed matrix data of each lower triangular block matrix through a preset two-dimensional systolic array, and determining a lower triangular target matrix corresponding to each lower triangular block matrix, wherein the inversion algorithm corresponding to a lower triangular block matrix is determined according to the relative position of the lower triangular block matrix;
and determining an inversion result of the lower triangular matrix according to the lower triangular target matrices.
In a third aspect, an embodiment of the present application provides an apparatus for implementing matrix factorization, including:
the dimension acquisition module is used for acquiring the matrix dimension of the matrix to be solved;
the matrix blocking module is used for partitioning the matrix to be solved according to the number of registers to obtain a plurality of block matrices when the matrix dimension is larger than the number of registers in a preset register group;
the matrix calculation module is used for performing data processing on the matrix data of each block matrix in the register group according to the decomposition algorithm corresponding to that block matrix, sequentially performing matrix calculation on the processed matrix data of each block matrix through a preset two-dimensional systolic array, and determining a target matrix corresponding to each block matrix, wherein the decomposition algorithm corresponding to a block matrix is determined according to the relative position of the block matrix;
and the result separation module is used for performing result separation on the plurality of target matrixes and determining an upper triangular matrix and a lower triangular matrix corresponding to the matrix to be solved.
In a fourth aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements, when executing the computer program, any one of the implementation methods of matrix decomposition or any one of the implementation methods of lower triangular matrix inversion.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the implementation method of any one of the above matrix decompositions or the implementation method of any one of the above lower triangular matrix inversions.
In a sixth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to perform an implementation method of matrix decomposition in any one of the above first aspects or an implementation method of lower triangular matrix inversion in any one of the above second aspects.
In the embodiment of the application, the matrix dimension of the matrix to be solved is obtained. When the matrix dimension is greater than the number of registers in a preset register group, the matrix to be solved is partitioned according to the number of registers to obtain a plurality of block matrices, so that the block matrices can be processed separately and the processing speed for the matrix to be solved is improved. Specifically, the matrix data of each block matrix is processed in the register group according to the decomposition algorithm corresponding to that block matrix, where the decomposition algorithm is determined according to the relative position of the block matrix. Matrix calculation is then performed in turn on the processed matrix data of each block matrix through a preset two-dimensional systolic array, and a target matrix corresponding to each block matrix is determined. Finally, result separation is performed on the target matrices, and the upper triangular matrix and the lower triangular matrix corresponding to the matrix to be solved are determined from the separation result. Because the decomposition calculation is carried out by the two-dimensional systolic array, the throughput of the matrix decomposition calculation is improved.
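The result separation step can be pictured with a small sketch. It assumes, which the text does not state explicitly, that a target matrix holds the upper triangular factor on and above the main diagonal and the multipliers of the lower triangular factor strictly below it, with an implicit unit diagonal for the lower factor. The function name `separate_lu` is hypothetical.

```python
def separate_lu(packed):
    """Split a packed LU result into (lower, upper).

    Assumption (not stated in the source): `packed` stores U on and above
    the main diagonal and the multipliers of L strictly below it, and L
    has an implicit unit diagonal.
    """
    n = len(packed)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if r > c:
                lower[r][c] = packed[r][c]   # strictly below the diagonal: L
            else:
                upper[r][c] = packed[r][c]   # on or above the diagonal: U
        lower[r][r] = 1.0                    # implicit unit diagonal of L
    return lower, upper
```

Under this assumed layout, multiplying the returned `lower` by `upper` reproduces the matrix that was decomposed.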
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for implementing matrix decomposition provided in an embodiment of the present application;
FIG. 2 is a block diagram of a data processing system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an in-situ replication process provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a vertical diffusion mapping process provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a horizontal diffusion mapping process provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a cross-bank one-way in-situ mapping process provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a diagonal normalization process provided in an embodiment of the present application;
FIG. 8 is a schematic flowchart of a method for implementing inversion of a lower triangular matrix according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an intra-group horizontal diffusion process provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a lower triangular inversion process provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of a lower triangular inversion process without a main diagonal according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an apparatus for implementing matrix factorization provided in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Example one
FIG. 1 is a schematic flowchart of a method for implementing matrix decomposition in an embodiment of the present application. The method may be executed by a terminal device. As shown in FIG. 1, the method may include the following steps:
Step S101, acquiring the matrix dimension of the matrix to be solved.
In one embodiment, FIG. 2 is a schematic structural diagram of a data processing system that can implement both the matrix decomposition method and the lower triangular matrix inversion method. The system may include a controller 21, a register bank 22, a two-dimensional systolic array 23, a high-speed bus 24, and a data storage module 25. The controller 21 is responsible for starting and controlling each component of the data processing system to implement the two methods; that is, the controller corresponds to the terminal device in this embodiment. A register group in the register bank 22 may be composed of N × N registers, that is, it may be divided into N groups of N registers each, where N is a power of 2. The register bit width may be set according to user requirements, and a register group needs to be cleared to zero when it is used in a calculation. The two-dimensional systolic array 23 has a rectangular structure and may be composed of N × N multiply-add units of the same type; it can be used to compute aligned (element-wise) matrix addition, subtraction and multiplication, as well as matrix multiply-subtract. The high-speed bus 24 moves data into or out of the data storage module 25 by means of high-speed interconnection. The data storage module 25 is used for storing data.
It is understood that the register bank 22 may include at least one register set. In the present embodiment, for example, three register sets with identical structures and functions may be used to implement the matrix decomposition method and the lower triangular matrix inversion method. Specifically, the three register sets are a first register set 221, a second register set 222, and a third register set 223, where the first register set 221 includes a register group A and a register group BA, the second register set 222 includes a register group B and a register group BB, and the third register set 223 includes a register group C and a register group BC. Register groups BA, BB and BC may be used for caching data but are not connected to the computing array, that is, they serve as cache register groups. Register groups A, B and C may be used for caching and processing data, that is, they serve as processing register groups.
Step S102, judging whether the matrix dimension is larger than the register number of a preset register group.
If yes, go to step S103 to step S105; if not, step S106 and the following steps are executed.
In this embodiment, the number of registers in the register set is N × N, and denoted by k.
Step S103, partitioning the matrix to be solved according to the number of registers to obtain a plurality of block matrices.
In this embodiment, the terminal device groups the data of the matrix to be solved with k as the granularity: the data length of the matrix to be solved is divided by k to obtain a quotient q and a remainder r, which yields q + 1 block matrices. For example, if the matrix to be solved is of size n × n and is divided into 4 block matrices, the 4 block matrices are, in order from left to right and from top to bottom: the block matrix E of size [formula image BDA0003635015510000041], the block matrix F of size [formula image BDA0003635015510000042], the block matrix G of size [formula image BDA0003635015510000043], and the block matrix H of size [formula image BDA0003635015510000044].
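The grouping arithmetic described above can be sketched as follows. The function name `group_by_register_count` and the returned list of group sizes are illustrative assumptions; the text only states that dividing the data length by k gives a quotient q and a remainder r, and that q + 1 groups result.

```python
def group_by_register_count(data_length, k):
    """Group `data_length` items with granularity k (the register count).

    Returns (q, r, sizes), where `sizes` lists the q full groups of k
    items followed by one trailing group of r items, i.e. q + 1 groups
    as stated in the text. How the case r == 0 is handled is not stated
    in the source; this sketch simply reports a trailing group of size 0.
    """
    q, r = divmod(data_length, k)
    sizes = [k] * q + [r]
    return q, r, sizes
```

For example, 20 data items with k = 8 give q = 2, r = 4 and three groups of sizes 8, 8 and 4.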
Step S104, performing data processing on the matrix data of each block matrix in the register group according to the decomposition algorithm corresponding to that block matrix, sequentially performing matrix calculation on the processed matrix data of each block matrix through a preset two-dimensional systolic array, and determining a target matrix corresponding to each block matrix.
In this embodiment, block matrices at different relative positions use different decomposition algorithms, so the terminal device processes the matrix data of each block matrix according to the decomposition algorithm corresponding to that block matrix. The decomposition algorithm may be based on an LU decomposition control algorithm for non-singular matrices and is determined according to the relative position of the block matrix.
In one embodiment, the register groups include a first processing register group, namely register group A in the first register set 221; a second processing register group, namely register group B in the second register set 222; a result register group, namely register group C in the third register set 223; and a cache register group connected to the result register group, namely register group BC in the third register set 223.
Step S104 may include: inputting the matrix data of a block matrix into the cache register group; determining, according to the decomposition algorithm, at least one processing mode from among in-situ copy processing, vertical diffusion mapping processing, horizontal diffusion mapping processing and cross-group unidirectional in-situ mapping processing; and performing data processing on the matrix data of the block matrix in the cache register group according to the processing mode, and inputting the processed matrix data into the register group corresponding to the processing mode, where the register group corresponding to the processing mode includes at least one of the first processing register group, the second processing register group and the result register group.
Specifically, the in-situ copy processing may be used to copy the data in a cache register group to the processing register group in the same register set, or the reverse, that is, to copy the data in a processing register group to the cache register group in the same register set. As shown in FIG. 3, when N is 8, the data of register group BA in the first register set 221 is copied to register group A in the first register set 221.
Specifically, the vertical diffusion mapping processing may be used to copy data from a cache register group to the processing register group in the same register set: the data in any one selected register on the main diagonal of the cache register group is copied vertically to all registers in the column where that register is located and is then mapped to the processing register group, or the reverse. As shown in FIG. 4, when N is 8 and the data of register group BA in the first register set 221 is copied to register group A in the first register set 221, the data in one selected register on the main diagonal of register group BA, such as x10 in FIG. 4, is copied vertically to all registers in its column and then mapped to register group A.
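A minimal sketch of vertical diffusion mapping on a square register group modelled as a list of rows. The function name `vertical_diffusion` is hypothetical, and leaving every register outside the selected column at zero is an assumption; the text does not say what those registers hold after the mapping.

```python
def vertical_diffusion(cache, i):
    """Copy the diagonal element cache[i][i] into every register of column i.

    `cache` models a square cache register group, `i` is the 0-based index
    of the selected main-diagonal register. Registers outside column i are
    left at zero in the returned processing register group (an assumption).
    """
    n = len(cache)
    processing = [[0] * n for _ in range(n)]
    for r in range(n):
        processing[r][i] = cache[i][i]
    return processing
```

With a 3 × 3 group holding 1..9 row by row, selecting index 1 fills the middle column of the result with the value 5.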
Specifically, the horizontal diffusion mapping processing may be used to copy data from a cache register group to the processing register group in the same register set: the data in each register on the main diagonal of the cache register group is copied to all registers in the row where that register is located and is then mapped to the processing register group, or the reverse. As shown in FIG. 5, when N is 8 and the data of register group BA in the first register set 221 is copied to register group A in the first register set 221, the data in each register on the main diagonal of register group BA is copied to all registers in its row and then mapped to register group A.
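Horizontal diffusion mapping can be sketched in the same list-of-rows model. The function name `horizontal_diffusion` is hypothetical; the behaviour follows the description above, in which every main-diagonal element is spread across its own row.

```python
def horizontal_diffusion(cache):
    """Spread each main-diagonal element of `cache` across its whole row.

    Row r of the returned processing register group holds cache[r][r]
    in every column. The input cache register group is not modified.
    """
    n = len(cache)
    return [[cache[r][r]] * n for r in range(n)]
```

Unlike the vertical variant, no diagonal index is selected here, because all diagonal elements are diffused at once.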
Specifically, when a plurality of register sets exist, the cross-group unidirectional in-situ mapping processing may be used to copy a register group of one register set, unidirectionally and in situ, to a register group of another register set. As shown in FIG. 6, register group BA can map data across sets, unidirectionally and in situ, to register group B or register group C; register group BB can map data to register group B or register group A; register group BC can map data to register group A or register group B; and register group C can map data to register group A or register group B.
In an embodiment, performing matrix calculation on the processed matrix data of each block matrix through the preset two-dimensional systolic array in step S104 may include: determining the position of the target register on the main diagonal that corresponds to the vertical diffusion mapping processing; determining the enable signals of the two-dimensional systolic array according to the position of the target register; and controlling the two-dimensional systolic array, according to the enable signals, to perform matrix calculation on the first processing register group and the second processing register group and to input the calculation result to the result register group. Correspondingly, determining the target matrix corresponding to each block matrix in step S104 includes: determining the target matrix corresponding to the block matrix according to the matrix data in the result register group.
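One way to picture the enable signals is as a Boolean mask over the computing units. The sketch below is an illustration under assumptions: the helper names are hypothetical, indices are 0-based, and the row and column ranges are taken from the worked example for block matrix E later in this section (rows i+1 to k of column i for the division, rows and columns i+1 to k for the multiply-subtract).

```python
def enable_mask(size, rows, cols):
    """Return a size x size Boolean mask that is True where a unit is enabled."""
    row_set, col_set = set(rows), set(cols)
    return [[(r in row_set and c in col_set) for c in range(size)]
            for r in range(size)]


def division_mask(size, i):
    """Units below the diagonal in column i (0-based), used for C = B./A."""
    return enable_mask(size, range(i + 1, size), [i])


def multiply_subtract_mask(size, i):
    """Units in rows and columns after i (0-based), used for C = C - A*B."""
    return enable_mask(size, range(i + 1, size), range(i + 1, size))
```

As the diagonal position i advances, both masks shrink, so fewer computing units are active in later iterations.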
In an embodiment, the matrix calculation of the matrix data of each block matrix after the data processing by the preset two-dimensional systolic array in step S104 may include: if the adjacent block matrix exists on the left side and/or the upper side of the block matrix, determining an adjacent target matrix corresponding to the adjacent block matrix on the left side and/or the upper side of the block matrix; inputting the matrix data of the adjacent target matrix into a target register group, and performing matrix calculation on the matrix data of the block matrix after data processing and the target register group through a two-dimensional systolic array.
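The dependency on left and upper neighbours can be sketched with block coordinates. The function name `adjacent_dependencies` and the (block_row, block_col) indexing from the top-left corner are assumptions made for illustration.

```python
def adjacent_dependencies(block_row, block_col):
    """List the neighbouring blocks whose target matrices are needed first.

    Blocks are indexed from (0, 0) at the top-left. A block depends on its
    left neighbour and its upper neighbour whenever they exist.
    """
    deps = []
    if block_col > 0:
        deps.append((block_row, block_col - 1))  # adjacent block on the left
    if block_row > 0:
        deps.append((block_row - 1, block_col))  # adjacent block above
    return deps
```

In a 2 × 2 partition, the top-left block has no dependencies, the top-right and bottom-left blocks depend only on the top-left block, and the bottom-right block depends on both of its neighbours.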
Illustratively, for the block matrix E of size [formula image BDA0003635015510000061], step S104 includes the following steps:
First, the matrix data of block matrix E is moved into register group BC according to the storage position of the matrix to be solved, and the initial value of the loop variable i is set to 1.
Second, an in-situ copy processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group B accordingly. A cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group C accordingly.
Third, a vertical diffusion mapping processing mode is determined according to the decomposition algorithm, and the i-th data item on the diagonal of register group BC is mapped into register group A accordingly.
Fourth, the (i+1)-th to k-th enable signals of the i-th column of computing units in the two-dimensional systolic array are activated, so that the array computes the aligned (element-wise) division of register group B by register group A and stores the result in register group C. In Matlab notation this is C = B./A.
Fifth, a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group C is mapped into register group A accordingly.
Sixth, the (i+1)-th to k-th enable signals of the (i+1)-th to k-th columns of computing units in the two-dimensional systolic array are activated, so that the array computes the matrix multiply-subtract of register group A and register group B and stores the result in register group C. In Matlab notation this is C = C - A*B.
Seventh, an in-situ copy processing mode is determined according to the decomposition algorithm, the data in register group C is mapped into register group BC accordingly, and the loop variable i is updated to i + 1 before returning to the third step. After the third to seventh steps have been executed k times in a loop, n data items are moved from register group C to the data storage module, with the storage position of the U matrix, which holds the calculation result, as the first address. These n data items are the data of the target matrix corresponding to block matrix E.
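The loop over i above follows the shape of an in-place LU elimination without pivoting: a masked division of the sub-diagonal part of column i by the diagonal element, followed by a masked multiply-subtract on the trailing rows and columns. The sketch below is a plain software analogue under that reading, not the register-level procedure itself. The function name `lu_in_place` is hypothetical, indices are 0-based, and the packed result layout (multipliers below the diagonal, upper factor on and above it) is an assumption.

```python
def lu_in_place(block):
    """Return a packed LU result for a square, non-singular `block`.

    Software analogue of the per-i loop: divide column i below the
    diagonal by the pivot (C = B./A on the enabled units), then update
    the trailing rows and columns (multiply-subtract on the enabled
    units). No pivoting is performed, so a zero pivot raises an error.
    """
    n = len(block)
    c = [row[:] for row in block]            # working copy, like group C
    for i in range(n):
        pivot = c[i][i]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for r in range(i + 1, n):            # rows i+1.. of column i
            c[r][i] = c[r][i] / pivot
        for r in range(i + 1, n):            # rows i+1.. of columns i+1..
            for col in range(i + 1, n):
                c[r][col] -= c[r][i] * c[i][col]
    return c
```

For the 2 × 2 input [[4, 3], [6, 3]], the multiplier 6/4 = 1.5 lands below the diagonal and the trailing entry becomes 3 - 1.5 × 3 = -1.5.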
Illustratively, for the block matrix F of size [formula image BDA0003635015510000062], step S104 includes the following steps:
First, block matrix F is grouped with k as the granularity, that is, the data length of block matrix F is divided by k to obtain a quotient q1 and a remainder r1, and the initial value of the loop variable i is set to 1.
Second, the matrix data of the target matrix corresponding to block matrix E is input from the data storage module into register group A.
Third, k consecutive matrix data items of block matrix F are moved into register group BC according to the storage position of block matrix F. An in-situ copy processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group B accordingly. A cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group C accordingly.
Fourth, the (i+1)-th to k-th enable signals of the (i+1)-th to k-th columns of computing units in the two-dimensional systolic array are activated, so that the array computes the matrix multiply-subtract of register group A and register group B and stores the result in register group C. In Matlab notation this is C = C - A*B.
Seventh, a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, the data in register group C is mapped into register group B accordingly, and the loop variable i is updated to i + 1 before returning to the fourth step. After the fourth step has been executed k times in a loop, k calculation results are obtained and moved to the data storage module. The second to seventh steps are executed q1 + 1 times in a loop to obtain [formula image BDA0003635015510000071] calculation results, and these [formula image BDA0003635015510000072] data items are the data of the target matrix corresponding to block matrix F.
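Since block matrix F sits to the right of block matrix E, its processing resembles forward substitution with the lower factor obtained from E: each iteration subtracts a multiple of row i from the rows below it. The sketch below is a software analogue under that reading. The function name `forward_substitute_unit_lower` is hypothetical, and it assumes that E's target matrix is stored in packed form with the multipliers of a unit lower triangular factor below the diagonal.

```python
def forward_substitute_unit_lower(packed_lu, f):
    """Solve L * X = F for X, where L is the unit lower factor in `packed_lu`.

    `packed_lu` is a square packed LU result (assumed layout: multipliers
    strictly below the diagonal), `f` is a matrix with the same number of
    rows. Only the strictly lower part of `packed_lu` is read.
    """
    n = len(packed_lu)
    c = [row[:] for row in f]                 # working copy, like group C
    for i in range(n):
        for r in range(i + 1, n):             # rows below row i
            multiplier = packed_lu[r][i]
            for col in range(len(c[r])):
                c[r][col] -= multiplier * c[i][col]
    return c
```

With the packed result [[4, 3], [1.5, -1.5]] and F = [[2, 4], [5, 8]], the second row becomes [5 - 1.5 × 2, 8 - 1.5 × 4] = [2, 2].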
Illustratively, for the block matrix G of size [formula image BDA0003635015510000073], step S104 includes the following steps:
First, block matrix G is grouped with k as the granularity, that is, the data length of block matrix G is divided by k to obtain a quotient q2 and a remainder r2, and the initial value of the loop variable i is set to 1.
Second, the matrix data of the target matrix corresponding to block matrix E is input from the data storage module into register group BA.
Third, a horizontal diffusion mapping processing mode is determined according to the decomposition algorithm, and the data on the main diagonal of register group BA is diffused horizontally and then mapped into register group A accordingly.
Fourth, k consecutive matrix data items of block matrix G are moved into register group BC according to the storage position of block matrix G. An in-situ copy processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group B accordingly. A cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group C accordingly.
Fifth, the enable signals of the i-th column of computing units in the two-dimensional systolic array are activated, so that the array computes the aligned (element-wise) division of register group B by register group A and stores the result in register group C. In Matlab notation this is C = B./A.
Sixth, a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group B accordingly.
Seventh, the enable signals of the (i+1)-th to k-th columns of computing units in the two-dimensional systolic array are activated, so that the array computes the matrix multiply-subtract of register group A and register group B and stores the result in register group C. In Matlab notation this is C = C - A*B.
Eighth, a cross-group unidirectional in-situ mapping processing mode and an in-situ copy processing mode are determined according to the decomposition algorithm; the data in register group C is mapped into register group B according to the cross-group unidirectional in-situ mapping processing mode, and the data in register group BA is mapped into register group A according to the in-situ copy processing mode. The value of the loop variable i is updated to i+1, and execution returns to the fifth step. After the fifth through eighth steps have been executed cyclically k times, k calculation results are obtained and carried to the data storage module. The second through eighth steps are thus executed cyclically q2+1 times to obtain [formula image BDA0003635015510000074] calculation results; these [formula image BDA0003635015510000075] data are the data of the target matrix corresponding to the block matrix G.
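The column-by-column pattern of the fifth and seventh steps (an element-wise division on column i followed by a multiply-subtract on columns i+1 through k) resembles a triangular solve of the form X·U = G. The following is a minimal sketch of that reading in plain Python. It assumes that U is the upper triangular factor held in the target matrix of block E; the function name solve_row_panel and the list-of-lists layout are hypothetical, and the sketch does not model the register groups or enable signals.

```python
def solve_row_panel(G, U):
    """Return X such that X @ U == G, with U upper triangular (k x k).

    Assumption: U is the upper-triangular factor of block E and G is an
    m x k block; both are plain lists of lists of floats.
    """
    k = len(U)
    X = [row[:] for row in G]  # working copy, playing the role of register group C
    for i in range(k):
        for row in X:
            # Element-wise division on column i (compare C = B./A).
            row[i] /= U[i][i]
            # Multiply-subtract on columns i+1..k (compare C = C - A*B).
            for j in range(i + 1, k):
                row[j] -= row[i] * U[i][j]
    return X
```

Each pass over i finalizes one column of X and removes its contribution from the columns to its right, which is why only columns i+1 through k need to be enabled in the multiply-subtract step.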
Illustratively, when the block matrix being processed is the block matrix H, whose size is [formula image BDA0003635015510000076], step S104 includes the following steps:
First, the block matrix H is grouped with k as the granularity, that is, the equivalent data length of the block matrix H is divided by k to obtain a quotient q3 and a remainder r3.
Second, the matrix data of the target matrix corresponding to the block matrix G is input from the data storage module into register group A.
Third, the matrix data of the target matrix corresponding to the block matrix F is input from the data storage module into register group B.
Fourth, k consecutive matrix data of the block matrix H are moved into register group BC according to the storage position of the block matrix H.
Fifth, the matrix multiply-subtract of register group A and register group B is computed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = C - A*B.
Sixth, the first through fifth steps are executed cyclically (q3+1)^2 times to obtain [formula image BDA0003635015510000081] calculation results.
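The multiply-subtract C = C - A*B used for the block matrix H can be written out directly. The sketch below is a plain Python illustration of that single formula; the function name multiply_subtract is hypothetical, and the tiling of H into k-sized groups is not modeled.

```python
def multiply_subtract(C, A, B):
    """Return C - A @ B for matrices stored as lists of lists."""
    inner = len(B)  # shared dimension of A (columns) and B (rows)
    return [
        [C[r][c] - sum(A[r][p] * B[p][c] for p in range(inner))
         for c in range(len(C[0]))]
        for r in range(len(C))
    ]
```

Here C would hold a tile of H, A the corresponding rows of the target matrix of G, and B the corresponding columns of the target matrix of F.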
It will be appreciated that the q and r values may then be checked to determine whether unprocessed matrix data still remains in the matrix to be solved: the q value is updated to the q value minus 1, and the updated q value is then checked.
When the q value is not zero, the storage position of the L matrix calculation result is updated to the sum of its storage position in the previous round and the storage word length of k data items; the storage position of the U matrix calculation result is updated to the sum of its storage position in the previous round and the storage word length of k data items; the value of n is updated to n-k; and the remaining data in the matrix to be solved continues to be processed according to the above steps. When the q value is 0, the r value is checked.
When the r value is not zero, the storage position of the L matrix calculation result is updated to the sum of its storage position in the previous round and the product of n and the storage word length of k data items; the storage position of the U matrix calculation result is updated to the sum of its storage position in the previous round and the product of n and the storage word length of k data items; the value of n is updated to the r value; and the data corresponding to the r value is processed according to the operation steps of the block matrix E. Matrix data of at least one target matrix corresponding to the matrix to be solved is thereby finally obtained, the at least one target matrix comprising a target matrix corresponding to q and a target matrix corresponding to r. When the r value is zero, the calculation ends, and the current at least one target matrix corresponds to the whole matrix to be solved.
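The q/r bookkeeping above amounts to processing q full rounds of size k followed by one remainder round of size r when r is not zero. A minimal sketch of that schedule follows; the function name block_schedule is hypothetical, and the storage-position updates are deliberately left out because their exact address arithmetic depends on the memory layout.

```python
def block_schedule(n, k):
    """Return the list of round sizes for an n-dimensional matrix and k-sized groups."""
    q, r = divmod(n, k)   # quotient q full rounds, remainder r
    sizes = [k] * q       # each full round consumes k rows/columns (n becomes n - k)
    if r:
        sizes.append(r)   # final partial round, handled like block matrix E
    return sizes
```

For example, a dimension of 10 with k = 4 gives two full rounds and one remainder round of size 2.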
Step S105, performing result separation on the target matrices to determine the upper triangular matrix and the lower triangular matrix corresponding to the matrix to be solved.
In one embodiment, step S105 may include: inputting matrix data of a target matrix into a register group; performing in-situ copy processing, diagonal normalization processing and cross-group unidirectional in-situ mapping processing on the matrix data of the target matrix in the register group to obtain processed matrix data of the target matrix; and performing matrix calculation on the processed matrix data of the target matrix through the two-dimensional systolic array, and determining the upper triangular matrix and the lower triangular matrix corresponding to the target matrix according to the calculation result.
Specifically, the diagonal normalization processing may be used to set the values of all registers on the main diagonal of a given register group to 1. As shown in fig. 7, the values of all registers on the main diagonal of register group A are set to 1.
Exemplarily, step S105 includes the steps of:
First, the enable signal of the two-dimensional systolic array is configured in lower triangular mode, and k data of the target matrix corresponding to the matrix to be solved are moved in sequence from the data storage module into register group BA according to the storage position of the U matrix calculation result.
Second, a diagonal normalization processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group A according to that mode; a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group C according to that mode.
Third, the matrix product of register group A and register group B is computed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = A*B. Finally, the matrix product is carried from register group C to the data storage module according to the storage position of the L matrix calculation result, thereby obtaining the decomposed L matrix.
Fourth, register group B is reset and the enable signal is inverted. A diagonal normalization processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group A according to that mode; a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group C according to that mode.
Fifth, the matrix product of register group A and register group B is computed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = A*B. Finally, the matrix product is carried from register group C to the data storage module according to the storage position of the U matrix calculation result, thereby obtaining the decomposed U matrix.
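The net effect of the result separation can be sketched without the register-level detail. The example below assumes a packed layout in which the target matrix stores the multipliers of a unit lower triangular L strictly below the main diagonal and U on and above it; that layout, and the function name split_packed_lu, are assumptions made for illustration and are not stated in this form in the text.

```python
def split_packed_lu(P):
    """Split a packed LU result into (L, U).

    Assumption: L has an implicit unit diagonal (cf. diagonal normalization)
    and occupies the strictly lower triangle of P; U occupies the rest.
    """
    n = len(P)
    L = [[P[r][c] if c < r else (1.0 if c == r else 0.0) for c in range(n)]
         for r in range(n)]
    U = [[P[r][c] if c >= r else 0.0 for c in range(n)]
         for r in range(n)]
    return L, U
```

Setting the diagonal of L to 1 corresponds to the diagonal normalization processing, and the two triangular masks correspond to the lower triangular enable mode and its inverse.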
Step S106, performing data processing on the matrix to be solved in the register group by using the decomposition algorithm corresponding to the matrix to be solved, performing matrix calculation on the processed matrix to be solved through a preset two-dimensional systolic array, and determining a target matrix.
Exemplarily, the step S106 includes the steps of:
First, the matrix data of the matrix to be solved is carried into register group BC according to the storage position of the matrix to be solved, and the initial value of the loop variable i is set to 1.
Second, an in-situ copy processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group B according to that mode. A cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BC is mapped into register group C according to that mode.
Third, a longitudinal diffusion mapping processing mode is determined according to the decomposition algorithm, and the i-th data on the diagonal of register group BC is mapped into register group A according to that mode.
Fourth, the enable signals of the (i+1)-th through k-th computing units in the i-th column of the two-dimensional systolic array are asserted, so that the array computes the element-wise division of register group B by register group A and stores the result in register group C. In Matlab notation this is equivalent to C = B./A.
Fifth, a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group C is mapped into register group A according to that mode.
Sixth, the enable signals of the (i+1)-th through k-th computing units in columns i+1 through k of the two-dimensional systolic array are asserted, so that the array computes the matrix multiply-subtract of register group A and register group B and stores the result in register group C. In Matlab notation this is equivalent to C = C - A*B.
Seventh, an in-situ copy processing mode is determined according to the decomposition algorithm, and the data in register group C is mapped into register group BC according to that mode. The value of the loop variable i is updated to i+1, and execution returns to the third step. After the third through seventh steps have been executed cyclically k times, n data are carried from register group C to the data storage module, with the storage position of the U matrix calculation result as the start address; these n data are the data of the target matrix corresponding to the matrix to be solved.
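The loop in the third through seventh steps (divide column i below the pivot, then multiply-subtract on the trailing rows and columns) has the shape of a textbook in-place LU decomposition without pivoting. The sketch below shows that textbook form in plain Python; it is an interpretation offered for orientation, the function name lu_in_place is hypothetical, and zero pivots are not handled.

```python
def lu_in_place(M):
    """Return a packed LU result for square M (Doolittle form, no pivoting).

    The strictly lower triangle of the result holds the multipliers of a
    unit lower triangular L; the upper triangle holds U.
    """
    n = len(M)
    P = [row[:] for row in M]  # working copy, playing the role of register group C
    for i in range(n):
        for r in range(i + 1, n):
            # Division of column i by the i-th diagonal entry (compare C = B./A).
            P[r][i] /= P[i][i]
            # Multiply-subtract on the trailing block (compare C = C - A*B).
            for c in range(i + 1, n):
                P[r][c] -= P[r][i] * P[i][c]
    return P
```

On the matrix [[4, 3], [6, 3]] this yields the multiplier 1.5 below the diagonal and the updated entry -1.5 at the lower right.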
Step S107, performing result separation on the target matrix to determine the upper triangular matrix and the lower triangular matrix corresponding to the matrix to be solved.
Exemplarily, the step S107 includes the steps of:
First, the enable signal of the two-dimensional systolic array is configured in lower triangular mode, and n × n data of the target matrix corresponding to the matrix to be solved are moved from the data storage module into register group BA according to the storage position of the U matrix calculation result.
Second, a diagonal normalization processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group A according to that mode; a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group C according to that mode.
Third, the matrix product of register group A and register group B is computed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = A*B. Finally, the matrix product is carried from register group C to the data storage module according to the storage position of the L matrix calculation result, thereby obtaining the decomposed L matrix.
Fourth, register group B is reset and the enable signal is inverted. A diagonal normalization processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group A according to that mode; a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group C according to that mode.
Fifth, the matrix product of register group A and register group B is computed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = A*B. Finally, the matrix product is carried from register group C to the data storage module according to the storage position of the U matrix calculation result, thereby obtaining the decomposed U matrix.
In the embodiment of the application, the matrix dimension of the matrix to be solved is obtained. When the matrix dimension is greater than the number of registers of a preset register group, the matrix to be solved is partitioned according to the number of registers to obtain a plurality of block matrices, so that the block matrices can be processed separately and the processing speed of the matrix to be solved is improved. Specifically, the matrix data of each block matrix is processed in the register group according to the decomposition algorithm corresponding to that block matrix, and matrix calculation is performed in sequence on the processed matrix data of each block matrix through a preset two-dimensional systolic array to determine the target matrix corresponding to each block matrix; the decomposition algorithm corresponding to a block matrix is determined according to the relative position of the block matrix. Finally, result separation is performed on the plurality of target matrices, and the upper triangular matrix and the lower triangular matrix corresponding to the matrix to be solved are determined according to the separation results. When the matrix to be solved is decomposed, processing through the two-dimensional systolic array therefore improves the throughput of the decomposition calculation.
Example two
Fig. 8 is a schematic flowchart of a method for implementing lower triangular matrix inversion in the embodiment of the present application. The method may be executed by a terminal device and, as shown in fig. 8, may include the following steps:
Step S801, acquiring the dimension of the lower triangular matrix to be solved.
In this embodiment, the lower triangular matrix may be the lower triangular matrix obtained by the matrix decomposition method of the first embodiment.
Step S802, judging whether the dimensionality of the lower triangular matrix is larger than the number of registers of a preset register group.
If yes, go to step S803 to step S805; if not, go to step S806.
In this embodiment, the number of registers in the register group is N × N and is denoted by k.
Step S803, partitioning the lower triangular matrix along its main diagonal direction according to the number of registers to obtain a plurality of lower triangular block matrices.
In this embodiment, the terminal device groups the lower triangular matrix with k as the granularity, that is, the equivalent data length of the lower triangular matrix is divided by k to obtain a quotient q4 and a remainder r4. In other words, the lower triangular matrix is partitioned along its main diagonal direction to obtain q4 lower triangular block matrices of size k and one lower triangular block matrix of size r4.
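The partition along the main diagonal can be illustrated by extracting the diagonal blocks of a small matrix. This sketch treats k as the side length of each diagonal block, which is an assumption about how "size k" should be read; the function name diagonal_blocks is hypothetical.

```python
def diagonal_blocks(L, k):
    """Return the blocks on the main diagonal of square L, each at most k x k."""
    n = len(L)
    blocks = []
    for start in range(0, n, k):
        stop = min(start + k, n)  # the last block has side r4 when n is not a multiple of k
        blocks.append([row[start:stop] for row in L[start:stop]])
    return blocks
```

Each returned block is itself lower triangular, so the same inversion routine can be applied to every block.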
Step S804, performing data processing on the matrix data of each lower triangular block matrix in the register group according to the inversion algorithm corresponding to each lower triangular block matrix, performing matrix calculation on the processed matrix data of each lower triangular block matrix through a preset two-dimensional systolic array, and determining the lower triangular target matrix corresponding to each lower triangular block matrix; the inversion algorithm corresponding to a lower triangular block matrix is determined according to the relative position of that lower triangular block matrix. The inversion algorithm may be a lower triangular inversion control algorithm.
Step S805, determining the inversion result of the lower triangular matrix according to the lower triangular target matrices.
In one embodiment, performing data processing on the matrix data of each lower triangular block matrix in the register group according to the inversion algorithm corresponding to each lower triangular block matrix, in step S804, includes: determining at least one processing mode from among in-situ copy processing, diagonal normalization processing, intra-group horizontal diffusion processing, lower triangle negation processing without the main diagonal, and cross-group unidirectional in-situ mapping processing according to the inversion algorithm; and performing data processing on the matrix data of the lower triangular block matrix in the register group according to the processing mode.
Specifically, the above-mentioned intra-group horizontal diffusion process may be used to copy data in a register on a main diagonal in a register group to all registers in the same row. As shown in fig. 9, the data in the register on the main diagonal line of the register group BA can be copied to all the registers in the same row.
Specifically, the lower triangle negation processing may be used to negate all of the lower triangle data values within a register group. As shown in fig. 10, the lower triangle data values of register group A are all negated.
Specifically, the lower triangle negation processing without the main diagonal may be used to negate all lower triangle data values within a register group except the data on the main diagonal. As shown in fig. 11, all lower triangle data values of register group A other than the data on the main diagonal are negated.
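The three register mappings just described can be expressed as small pure functions on a square list of lists. This is a sketch of the data movement only; the function names horizontal_diffusion and negate_lower and the include_diagonal flag are hypothetical and do not come from the text.

```python
def horizontal_diffusion(R):
    """Copy each main-diagonal value to every position in its own row."""
    n = len(R)
    return [[R[r][r]] * n for r in range(n)]


def negate_lower(R, include_diagonal=True):
    """Negate the lower triangle of R, optionally leaving the main diagonal untouched."""
    n = len(R)

    def unchanged(r, c):
        # Entries above the diagonal are never touched; the diagonal is
        # skipped only in the variant without the main diagonal.
        return c > r or (c == r and not include_diagonal)

    return [[R[r][c] if unchanged(r, c) else -R[r][c] for c in range(n)]
            for r in range(n)]
```

Calling negate_lower(R) corresponds to the processing of fig. 10, and negate_lower(R, include_diagonal=False) to that of fig. 11.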
Exemplarily, the step S804 may include the following steps:
First, k calculated data are input from the data storage module into register group BA according to the storage position of the lower triangular matrix. An in-situ copy processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group A according to that mode; a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group BA is mapped into register group B according to that mode.
Second, a diagonal normalization processing mode is determined according to the decomposition algorithm, and the values on the diagonal of register group B are all set to 1 according to that mode.
Third, an element-wise division is executed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = B./A.
Fourth, an intra-group horizontal diffusion processing mode and a cross-group unidirectional in-situ mapping processing mode are determined according to the decomposition algorithm; the data in register group C is mapped into register group BC according to the intra-group horizontal diffusion processing mode, and the data in register group C is mapped into register group BC according to the cross-group unidirectional in-situ mapping processing mode.
Fifth, the enable signals of the computing units below the main diagonal of the two-dimensional systolic array are asserted, so that the array executes an element-wise multiplication and stores the result in register group C. In Matlab notation this is equivalent to C = A.*B.
Sixth, a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, and the data in register group C is mapped into register group B and register group A according to that mode.
Seventh, a lower triangle negation processing mode without the main diagonal is determined according to the decomposition algorithm, and the values of register group A are remapped into register group A according to that mode.
Eighth, an in-situ copy processing mode is determined according to the decomposition algorithm, and the values of register group BC are mapped into register group C according to that mode.
Ninth, the enable signals of the 1st through i-th computing units in the (i+1)-th row of the two-dimensional systolic array are asserted, so that the array performs a matrix multiplication. In Matlab notation this is equivalent to C = A*B. k data are carried from register group C to the data storage module according to the storage position of the calculation result, the value of q4 is updated to the value of q4 minus 1, and the value of q4 is then checked.
Tenth, when the value of q4 is not 0, the storage position of the lower triangular matrix is updated to the sum of its storage position in the previous round and the product of n and the storage word length of k data items; the storage position of the calculation result is updated to the sum of its storage position in the previous round and the product of n and the storage word length of k data items; and execution returns to the first step. When the value of q4 is 0, the value of r4 is checked.
Eleventh, when the value of r4 is not 0, the storage position of the lower triangular matrix is updated to the sum of its storage position in the previous round and the product of n and the storage word length of k data items; the storage position of the calculation result is updated to the sum of its storage position in the previous round and the product of n and the storage word length of k data items; execution returns to the first step once; the value of q4 is updated to its initial value; and execution continues with the twelfth step. When the value of r4 is 0, the value of q4 is updated to the value of q4 minus 1, the value of r4 is updated to k, and execution continues with the twelfth step.
Twelfth, the k calculation result data obtained in the ninth step are input from the data storage module into register group A according to the storage position of the calculation result.
Thirteenth, the storage position of the lower triangular matrix is updated to the difference between its storage position in the previous round and the storage word length of k data items, and k matrix data are input from the data storage module into register group B according to the updated storage position of the lower triangular matrix.
Fourteenth, a lower triangle negation processing mode is determined according to the decomposition algorithm, and the data of register group A is remapped into register group A according to that mode.
Fifteenth, a matrix multiplication is performed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = A*B.
Sixteenth, a cross-group unidirectional in-situ mapping processing mode is determined according to the decomposition algorithm, the values of register group C are mapped into register group A according to that mode, and register group C is cleared.
Seventeenth, the storage position of the calculation result is updated to the difference between its storage position in the previous round and the product of n and the storage word length of k data items, and k calculation results are input from the data storage module into register group B according to the updated storage position of the calculation result.
Eighteenth, a matrix multiplication is performed through the two-dimensional systolic array, and the result is stored in register group C. In Matlab notation this is equivalent to C = A*B.
Nineteenth, a variable position is set to the sum of the storage position of the calculation result and the product of n and the storage word length of k data items, the result data is carried from register group C to the data storage module according to the variable position, the value of q4 is updated to the value of q4 minus 1, and the value of q4 is then checked.
Twentieth, when the value of q4 is not 0, the storage position of the calculation result is updated to the product of a difference value and the storage word length of k data items, where the difference value is the storage position of the calculation result in the previous round minus the storage word length of k data items minus n; the value of r4 is updated to the sum of the initial value of r4 and k; and execution returns to the twelfth step. When the value of q4 is 0, the calculation ends.
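The twelfth through eighteenth steps (negate an already inverted diagonal block, multiply it by an off-diagonal block of the original matrix, then multiply by another inverted diagonal block) appear consistent with the standard identity for the inverse of a block lower triangular matrix. The identity is given below as an interpretation; the block names L_{11}, L_{21} and L_{22} are introduced here for illustration and are not used in the text.

```latex
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}^{-1}
=
\begin{pmatrix}
  L_{11}^{-1} & 0 \\
  -\,L_{22}^{-1} L_{21} L_{11}^{-1} & L_{22}^{-1}
\end{pmatrix}
```

Multiplying the right-hand side by the original block matrix gives the identity matrix, since the lower-left block becomes L_{21} L_{11}^{-1} - L_{22} L_{22}^{-1} L_{21} L_{11}^{-1} = 0.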
Step S806, performing data processing on the lower triangular matrix to be solved in the register group by using the decomposition algorithm corresponding to the lower triangular matrix to be solved, performing matrix calculation on the processed lower triangular matrix through a preset two-dimensional systolic array, and determining the inversion result of the lower triangular matrix.
Exemplarily, the step S806 may include the following steps:
inputting n x n calculated data from a data storage module into a register group BA according to the storage position of the lower triangular matrix, determining an in-situ copy processing mode according to a decomposition algorithm, and mapping the data in the register group BA into a register group A according to the in-situ copy processing mode; and determining a cross-group unidirectional in-situ mapping processing mode according to a decomposition algorithm, and mapping data in the register group BA to the register group B according to the cross-group unidirectional in-situ mapping processing mode.
And secondly, determining a diagonal line 1-returning processing mode according to the decomposition algorithm, and setting the values of the diagonal lines of the register group B to be 1 according to the diagonal line 1-returning processing mode.
And thirdly, executing the para-position division operation through the two-dimensional systolic array to obtain a para-position division operation result, and storing the para-position division operation result in a register group C, namely, the result is equivalent to a formula C ═ B./A, and the formula can be described through Matlab language.
And fourthly, determining an intra-group horizontal diffusion processing mode and a cross-group unidirectional in-situ mapping processing mode according to a decomposition algorithm, mapping the data in the register group C into the register group BC according to the intra-group horizontal diffusion processing mode, and mapping the data in the register group C into the register group BC according to the cross-group unidirectional in-situ mapping processing mode.
And fifthly, starting an enabling signal of a computing unit below a main diagonal line of the two-dimensional systolic array, prompting the two-dimensional systolic array to execute matrix alignment multiplication operation to obtain an alignment multiplication operation result, and storing the result in a register bank C, namely, the result is equivalent to a formula C ═ A. times.B, and the formula can be described by Matlab language.
And sixthly, determining a cross-group one-way in-situ mapping processing mode according to a decomposition algorithm, and mapping the data in the register group C to the register group B and the register group A according to the cross-group one-way in-situ mapping processing mode.
And seventhly, determining a lower triangle negation processing mode without the main diagonal line according to the decomposition algorithm, and remapping the value of the register group A into the register group A according to the lower triangle negation processing mode without the main diagonal line.
And eighthly, determining an in-situ copy processing mode according to the decomposition algorithm, and mapping the value of the register group BC into the register group C according to the in-situ copy processing mode.
And a ninth step, starting enable signals of 1 st to i th calculating units in the i +1 th row of the two-dimensional systolic array, and prompting the two-dimensional systolic array to perform matrix multiplication calculation, namely, a formula C is equivalent to A and B, and the formula can be described by Matlab language.
Tenth, determine a cross-group unidirectional in-situ mapping processing mode according to the decomposition algorithm, map the data in register group C into register group B according to that mode, update the loop variable i to i + 1, and jump back to the ninth step. After the ninth and tenth steps have been executed k times, the n × n data are moved from register group C into the data storage module according to the storage position of the calculation result, and the calculation is finished.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the second embodiment described above may refer to the corresponding process in the first embodiment, and details are not described herein again.
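The row-by-row pattern of the ninth and tenth steps, where only the 1st to i-th units of row i+1 are enabled, resembles ordinary forward substitution for a lower triangular inverse. The following is a minimal software sketch of that textbook method, not the register-level procedure of this application; the function name `invert_lower_triangular` is hypothetical, and exact fractions are used only to keep the example free of rounding.

```python
from fractions import Fraction


def invert_lower_triangular(L):
    """Invert a lower triangular matrix row by row (forward substitution)."""
    n = len(L)
    X = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        if L[i][i] == 0:
            raise ZeroDivisionError("singular matrix: zero on the main diagonal")
        # Diagonal entry of the inverse is the reciprocal of the diagonal entry.
        X[i][i] = Fraction(1) / L[i][i]
        # Only columns 0..i-1 of row i are computed; they depend on earlier rows.
        for j in range(i):
            acc = sum(L[i][k] * X[k][j] for k in range(j, i))
            X[i][j] = -acc / L[i][i]
    return X


L = [[2, 0, 0], [4, 1, 0], [6, 3, 3]]
L_inv = invert_lower_triangular(L)
```

Each row of the result depends only on rows already computed, which is why a loop over i with a growing set of active positions is sufficient.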
EXAMPLE III
Corresponding to the matrix decomposition method described in the first embodiment, Fig. 12 is a schematic structural diagram of an apparatus for implementing matrix decomposition in an embodiment of the present application. As shown in Fig. 12, the apparatus may include:
A dimension obtaining module 121, configured to obtain the matrix dimension of the matrix to be solved.
A matrix blocking module 122, configured to, when the matrix dimension is greater than the number of registers of a preset register group, block the matrix to be solved according to the number of registers to obtain a plurality of block matrices.
A matrix calculation module 123, configured to perform data processing on the matrix data of each block matrix in the register group according to the decomposition algorithm corresponding to that block matrix, and to perform matrix calculation in sequence on the processed matrix data of each block matrix through a preset two-dimensional systolic array, so as to determine a target matrix corresponding to each block matrix; the decomposition algorithm corresponding to a block matrix is determined according to the relative position of that block matrix.
A result separation module 124, configured to perform result separation on the plurality of target matrices and determine an upper triangular matrix and a lower triangular matrix corresponding to the matrix to be solved.
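The blocking performed by module 122 and the position-dependent choice made by module 123 can be pictured with a small sketch. It assumes that the register group holds a k × k tile, so that k plays the role of the "number of registers"; the function names and the three position labels are hypothetical and are not taken from the application.

```python
def partition_into_blocks(matrix, k):
    """Split an n x n matrix into tiles of at most k x k entries.

    Returns a dict mapping (block_row, block_col) to the tile's rows.
    """
    n = len(matrix)
    blocks = {}
    for bi, r0 in enumerate(range(0, n, k)):
        for bj, c0 in enumerate(range(0, n, k)):
            blocks[(bi, bj)] = [row[c0:c0 + k] for row in matrix[r0:r0 + k]]
    return blocks


def relative_position(bi, bj):
    """Classify a tile by where it sits relative to the block diagonal."""
    if bi == bj:
        return "diagonal"
    return "below-diagonal" if bi > bj else "above-diagonal"


a = [[4 * r + c + 1 for c in range(4)] for r in range(4)]
blocks = partition_into_blocks(a, 2)
roles = {pos: relative_position(*pos) for pos in blocks}
```

A matrix whose dimension does not exceed k yields a single tile at position (0, 0), which corresponds to the case where no blocking is needed.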
In one embodiment, the register group includes a first processing register group, a second processing register group, a result register group and a cache register group, the result register group and the cache register group being connected, and the matrix calculation module 123 may include:
A first data input unit, configured to input the matrix data of the block matrix into the cache register group.
A first mode determining unit, configured to determine, according to the decomposition algorithm, at least one processing mode from in-situ copy processing, longitudinal diffusion mapping processing, horizontal diffusion mapping processing and cross-group unidirectional in-situ mapping processing.
A first data processing unit, configured to process the matrix data of the block matrix in the cache register group according to the processing mode and to input the processed matrix data of the block matrix into the register group corresponding to the processing mode, where the register group corresponding to the processing mode includes at least one of the first processing register group, the second processing register group and the result register group.
In an embodiment, the matrix calculation module 123 may further include:
A position determining unit, configured to determine the target register position on the main diagonal corresponding to the longitudinal diffusion mapping processing.
A signal determining unit, configured to determine an enable signal of the two-dimensional systolic array according to the target register position.
A first matrix calculation unit, configured to control the two-dimensional systolic array, according to the enable signal, to perform matrix calculation on the first processing register group and the second processing register group, and to input the calculation result into the result register group.
A first matrix determining unit, configured to determine the target matrix corresponding to the block matrix according to the matrix data in the result register group.
In one embodiment, the result separation module 124 may include:
A second data input unit, configured to input the matrix data of the target matrix into the register group.
A second data processing unit, configured to perform in-situ copy processing, diagonal normalization processing and cross-group unidirectional in-situ mapping processing on the matrix data of the target matrix in the register group to obtain the processed matrix data of the target matrix.
A second matrix calculation unit, configured to perform matrix calculation on the processed matrix data of the target matrix through the two-dimensional systolic array, and to determine the upper triangular matrix and the lower triangular matrix corresponding to the target matrix according to the calculation result.
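As a rough software analogue of result separation, the sketch below assumes that a target matrix stores both factors in one array in the common packed form: the entries below the main diagonal belong to a unit lower triangular factor, and the remaining entries belong to the upper triangular factor. That storage layout is an assumption made for illustration, and `separate_lu` and `matmul` are hypothetical helper names.

```python
def separate_lu(packed):
    """Split a packed LU result into a unit lower and an upper triangular matrix."""
    n = len(packed)
    lower = [[packed[i][j] if j < i else int(i == j) for j in range(n)]
             for i in range(n)]
    upper = [[packed[i][j] if j >= i else 0 for j in range(n)]
             for i in range(n)]
    return lower, upper


def matmul(a, b):
    """Plain matrix product, used here only to check the separation."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


packed = [[2, 1], [2, 3]]
lower, upper = separate_lu(packed)
product = matmul(lower, upper)  # reproduces the original matrix [[2, 1], [4, 5]]
```

Setting the diagonal of the lower factor to 1 plays the same role here as a diagonal normalization step, although the hardware sequence in the application may differ.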
In an embodiment, the matrix calculation module 123 may further include:
A second matrix determining unit, configured to, if an adjacent block matrix exists on the left of and/or above the block matrix, determine the adjacent target matrix corresponding to that adjacent block matrix.
A third matrix calculation unit, configured to input the matrix data of the adjacent target matrix into a target register group, and to perform matrix calculation on the processed matrix data of the block matrix and the target register group through the two-dimensional systolic array.
In the embodiment of the application, the matrix dimension of the matrix to be solved is obtained. When the matrix dimension is greater than the number of registers of a preset register group, the matrix to be solved is blocked according to the number of registers to obtain a plurality of block matrices, so that the block matrices can be processed separately and the matrix to be solved is processed faster. Specifically, the matrix data of each block matrix is processed in the register group according to the decomposition algorithm corresponding to that block matrix, where the decomposition algorithm is determined according to the relative position of the block matrix. Matrix calculation is then performed in sequence on the processed matrix data of each block matrix through a preset two-dimensional systolic array to determine the target matrix corresponding to each block matrix. Finally, result separation is performed on the plurality of target matrices, and the upper triangular matrix and the lower triangular matrix corresponding to the matrix to be solved are determined from the separation results. Processing through the two-dimensional systolic array thus improves the throughput of the decomposition calculation.
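The idea that each block is handled by an algorithm chosen from its position, using results already obtained for the blocks to its left and above, is also the structure of the textbook blocked LU decomposition. The sketch below shows that standard algorithm without pivoting as an assumed stand-in; it does not model the register groups or the systolic array, and `blocked_lu` is a hypothetical name.

```python
from fractions import Fraction


def blocked_lu(a, k):
    """Blocked LU without pivoting; returns L (unit diagonal) and U packed together."""
    n = len(a)
    m = [[Fraction(x) for x in row] for row in a]
    for s in range(0, n, k):
        e = min(s + k, n)
        # Diagonal block: ordinary elimination restricted to the block.
        for p in range(s, e):
            if m[p][p] == 0:
                raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
            for i in range(p + 1, e):
                m[i][p] /= m[p][p]
                for j in range(p + 1, e):
                    m[i][j] -= m[i][p] * m[p][j]
        # Blocks to the right of the diagonal block: U12 = inv(L11) * A12.
        for j in range(e, n):
            for i in range(s, e):
                for p in range(s, i):
                    m[i][j] -= m[i][p] * m[p][j]
        # Blocks below the diagonal block: L21 = A21 * inv(U11).
        for i in range(e, n):
            for j in range(s, e):
                for p in range(s, j):
                    m[i][j] -= m[i][p] * m[p][j]
                m[i][j] /= m[j][j]
        # Remaining blocks: subtract the contribution of L21 * U12.
        for i in range(e, n):
            for j in range(e, n):
                for p in range(s, e):
                    m[i][j] -= m[i][p] * m[p][j]
    return m


a = [[2, 1, 1, 0], [4, 3, 4, 1], [2, 3, 8, 3], [6, 4, 11, 5]]
packed = blocked_lu(a, 2)
```

Because the factorization without pivoting is unique when it exists, the packed result does not depend on the block size k; only the order of the arithmetic changes.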
EXAMPLE IV
Corresponding to the implementation method described in the second embodiment, the implementation apparatus may include:
A matrix dimension acquisition module, configured to acquire the dimension of the lower triangular matrix to be solved.
A blocking module, configured to, when the dimension of the lower triangular matrix is greater than the number of registers of a preset register group, block the lower triangular matrix along its main diagonal direction according to the number of registers to obtain a plurality of lower triangular block matrices.
A data processing module, configured to perform data processing on the matrix data of each lower triangular block matrix in the register group according to the inversion algorithm corresponding to that lower triangular block matrix, and to perform matrix calculation on the processed matrix data of each lower triangular block matrix through a preset two-dimensional systolic array, so as to determine a lower triangular target matrix corresponding to each lower triangular block matrix; the inversion algorithm corresponding to a lower triangular block matrix is determined according to the relative position of that lower triangular block matrix.
A result determining module, configured to determine the inversion result of the lower triangular matrix according to the plurality of lower triangular target matrices.
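For orientation, the standard identity for inverting a lower triangular matrix partitioned along its main diagonal is given below. It is a well-known result rather than a formula quoted from this application, and the block symbols $L_{11}$, $L_{21}$, $L_{22}$ are introduced here only for illustration.

```latex
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}^{-1}
=
\begin{pmatrix}
L_{11}^{-1} & 0 \\
-\,L_{22}^{-1} L_{21} L_{11}^{-1} & L_{22}^{-1}
\end{pmatrix}
```

The diagonal blocks are inverted on their own, while the off-diagonal block needs the inverses of the diagonal blocks above and to the right of it in the product, which is consistent with choosing the algorithm for each block from its relative position.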
In an embodiment, the implementation apparatus may further include:
A second mode determining unit, configured to determine at least one processing mode from in-situ copy processing, diagonal normalization-to-1 processing, intra-group horizontal diffusion processing, lower triangular negation processing that excludes the main diagonal, and cross-group unidirectional in-situ mapping processing.
A third data processing unit, configured to perform data processing on the matrix data of the lower triangular block matrix in the register group according to the processing mode.
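Two of the processing modes named above can be illustrated in software under one plausible reading: diagonal normalization to 1 is taken as dividing each row by its diagonal entry, and the lower triangular processing that leaves out the main diagonal is taken as negation of the strictly lower part, the term used in the claims. Both readings and both function names are assumptions made for this sketch.

```python
from fractions import Fraction


def normalize_diagonal(L):
    """Divide each row by its diagonal entry so the main diagonal becomes 1."""
    return [[Fraction(x) / row[i] for x in row] for i, row in enumerate(L)]


def negate_strict_lower(L):
    """Flip the sign of every entry below the main diagonal; keep the rest."""
    return [[-x if j < i else x for j, x in enumerate(row)]
            for i, row in enumerate(L)]


L = [[2, 0], [4, 1]]
unit = normalize_diagonal(L)
negated = negate_strict_lower(unit)
```

For a unit lower triangular matrix whose only nonzero off-diagonal entries lie in a single column, as in this 2 × 2 case, negating those entries already yields its inverse; larger matrices need the additional multiplication steps.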
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the foregoing embodiments.
EXAMPLE V
Fig. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only portions related to the embodiments of the present application are shown.
As shown in fig. 13, the terminal device 13 of this embodiment includes: at least one processor 1300 (only one is shown in fig. 13), a memory 1301 connected to the processor 1300, and a computer program 1302, such as a matrix factorization implementation program, stored in the memory 1301 and executable on the at least one processor 1300. The processor 1300 implements the steps in the embodiment of the implementation method for matrix decomposition, such as steps S101 to S107 shown in fig. 1 or steps S801 to S806 shown in fig. 8, when executing the computer program 1302. Alternatively, the processor 1300 implements the functions of the modules in the device embodiments, for example, the modules 121 to 124 shown in fig. 12, when executing the computer program 1302.
Illustratively, the computer program 1302 may be divided into one or more modules, and the one or more modules are stored in the memory 1301 and executed by the processor 1300 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used for describing the execution process of the computer program 1302 in the terminal device 13. For example, the computer program 1302 may be divided into the dimension obtaining module 121, the matrix partitioning module 122, the matrix calculating module 123, and the result separating module 124, and the specific functions of the modules are as follows:
A dimension obtaining module 121, configured to obtain the matrix dimension of the matrix to be solved.
A matrix blocking module 122, configured to, when the matrix dimension is greater than the number of registers of the preset register group, block the matrix to be solved according to the number of registers to obtain a plurality of block matrices.
A matrix calculation module 123, configured to perform data processing on the matrix data of each block matrix in the register group according to the decomposition algorithm corresponding to that block matrix, and to perform matrix calculation in sequence on the processed matrix data of each block matrix through a preset two-dimensional systolic array, so as to determine a target matrix corresponding to each block matrix; the decomposition algorithm corresponding to a block matrix is determined according to the relative position of that block matrix.
A result separation module 124, configured to perform result separation on the plurality of target matrices and determine an upper triangular matrix and a lower triangular matrix corresponding to the matrix to be solved.
The terminal device 13 may include, but is not limited to, a processor 1300 and a memory 1301. Those skilled in the art will appreciate that Fig. 13 is merely an example of the terminal device 13 and does not limit it; the terminal device 13 may include more or fewer components than those shown, may combine some of the components, or may use different components, and may for example further include an input-output device, a network access device, a bus, etc.
The processor 1300 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
In some embodiments, the memory 1301 may be an internal storage unit of the terminal device 13, for example a hard disk or internal memory of the terminal device 13. In other embodiments, the memory 1301 may be an external storage device of the terminal device 13, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 13. Further, the memory 1301 may include both an internal storage unit of the terminal device 13 and an external storage device. The memory 1301 is used for storing an operating system, application programs, a boot loader (Boot Loader), data, and other programs, such as the program code of the computer program. The memory 1301 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are merely illustrative: the division into modules or units is only a logical function division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The integrated unit may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which implements the steps of the method embodiments described above when executed by a processor. The computer program includes computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal device, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for implementing matrix factorization, comprising:
acquiring the matrix dimension of a matrix to be solved;
when the matrix dimension is greater than the number of registers of a preset register group, blocking the matrix to be solved according to the number of registers to obtain a plurality of block matrices;
performing data processing on the matrix data of each block matrix in the register group according to a decomposition algorithm corresponding to each block matrix, sequentially performing matrix calculation on the matrix data of each block matrix after data processing through a preset two-dimensional systolic array, and determining a target matrix corresponding to each block matrix; wherein the decomposition algorithm corresponding to the block matrix is determined according to the relative position of the block matrix;
and performing result separation on the plurality of target matrices, and determining an upper triangular matrix and a lower triangular matrix corresponding to the matrix to be solved.
2. The method for implementing matrix factorization of claim 1, wherein the register set comprises a first processing register set, a second processing register set, a result register set and a cache register set, the result register set and the cache register set are connected, and the performing data processing on the matrix data of each of the block matrices in the register set according to the factorization algorithm corresponding to each of the block matrices comprises:
inputting matrix data of the block matrix into the cache register group;
determining at least one processing mode from in-situ copy processing, longitudinal diffusion mapping processing, horizontal diffusion mapping processing and cross-group unidirectional in-situ mapping processing according to the decomposition algorithm;
and performing data processing on the matrix data of the block matrix in the cache register set according to the processing mode, and inputting the matrix data of the block matrix after the data processing into a register set corresponding to the processing mode, wherein the register set corresponding to the processing mode comprises at least one of the first processing register set, the second processing register set and the result register set.
3. The method for implementing matrix decomposition according to claim 2, wherein the performing matrix calculation on the matrix data of each block matrix after data processing sequentially through a preset two-dimensional systolic array comprises:
determining a target register position on a main diagonal corresponding to the longitudinal diffusion mapping processing;
determining an enabling signal of the two-dimensional systolic array according to the position of the target register;
controlling the two-dimensional systolic array to perform matrix calculation on the first processing register group and the second processing register group according to the enable signal, and inputting a calculation result to the result register group;
correspondingly, the determining the target matrix corresponding to each of the block matrices includes:
and determining a target matrix corresponding to the block matrix according to the matrix data in the result register set.
4. The method of claim 1, wherein the separating the result of the target matrices and determining the upper triangular matrix and the lower triangular matrix corresponding to the matrix to be solved comprises:
inputting matrix data of the target matrix into the register group;
performing in-situ copy processing, diagonal normalization processing and cross-group unidirectional in-situ mapping processing on the matrix data of the target matrix in the register group to obtain the matrix data of the processed target matrix;
and performing matrix calculation on the matrix data of the processed target matrix through the two-dimensional systolic array, and determining an upper triangular matrix and a lower triangular matrix corresponding to the target matrix according to a calculation result.
5. The method for implementing matrix decomposition according to claim 1, wherein the performing matrix calculation on the matrix data of each block matrix after data processing sequentially through a preset two-dimensional systolic array comprises:
if an adjacent block matrix exists on the left of and/or above the block matrix, determining the adjacent target matrix corresponding to that adjacent block matrix;
inputting the matrix data of the adjacent target matrix into a target register group, and performing matrix calculation on the matrix data of the block matrix after data processing and the target register group through the two-dimensional systolic array.
6. A method for implementing the inversion of a lower triangular matrix is characterized by comprising the following steps:
acquiring the dimension of the lower triangular matrix to be solved;
when the dimension of the lower triangular matrix is greater than the number of registers of a preset register group, blocking the lower triangular matrix along the main diagonal direction of the lower triangular matrix according to the number of registers to obtain a plurality of lower triangular block matrices;
performing data processing on the matrix data of each lower triangular block matrix in the register group according to an inversion algorithm corresponding to each lower triangular block matrix, performing matrix calculation on the matrix data of each lower triangular block matrix after data processing through a preset two-dimensional systolic array, and determining a lower triangular target matrix corresponding to each lower triangular block matrix; wherein the inversion algorithm corresponding to the lower triangular block matrix is determined according to the relative position of the lower triangular block matrix;
and determining an inversion result of the lower triangular matrix according to the plurality of lower triangular target matrices.
7. The method as claimed in claim 6, wherein the data processing of the matrix data of each lower triangular block matrix in the register set according to the inversion algorithm corresponding to each lower triangular block matrix comprises:
determining, according to the inversion algorithm, at least one processing mode from in-situ copy processing, diagonal normalization-to-1 processing, intra-group horizontal diffusion processing, lower triangular negation processing that excludes the main diagonal, and cross-group unidirectional in-situ mapping processing;
and performing data processing on the matrix data of the lower triangular block matrix in the register group according to the processing mode.
8. An apparatus for implementing matrix factorization, comprising:
a dimension acquisition module, configured to acquire the matrix dimension of a matrix to be solved;
a matrix blocking module, configured to, when the matrix dimension is greater than the number of registers of a preset register group, block the matrix to be solved according to the number of registers to obtain a plurality of block matrices;
a matrix calculation module, configured to perform data processing on the matrix data of each block matrix in the register group according to a decomposition algorithm corresponding to each block matrix, sequentially perform matrix calculation on the matrix data of each block matrix after data processing through a preset two-dimensional systolic array, and determine a target matrix corresponding to each block matrix; wherein the decomposition algorithm corresponding to the block matrix is determined according to the relative position of the block matrix;
and a result separation module, configured to perform result separation on the plurality of target matrices and determine an upper triangular matrix and a lower triangular matrix corresponding to the matrix to be solved.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of a method for implementing matrix decomposition according to any one of claims 1 to 5 or a method for implementing inversion of a lower triangular matrix according to any one of claims 6 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of a method for carrying out a matrix decomposition according to one of claims 1 to 7 or a method for carrying out an inversion of a lower triangular matrix according to one of claims 6 to 7.
CN202210499747.0A 2022-05-09 2022-05-09 Method for realizing matrix decomposition and lower triangular matrix inversion Pending CN114996649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210499747.0A CN114996649A (en) 2022-05-09 2022-05-09 Method for realizing matrix decomposition and lower triangular matrix inversion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210499747.0A CN114996649A (en) 2022-05-09 2022-05-09 Method for realizing matrix decomposition and lower triangular matrix inversion

Publications (1)

Publication Number Publication Date
CN114996649A true CN114996649A (en) 2022-09-02

Family

ID=83025122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210499747.0A Pending CN114996649A (en) 2022-05-09 2022-05-09 Method for realizing matrix decomposition and lower triangular matrix inversion

Country Status (1)

Country Link
CN (1) CN114996649A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116560733A (en) * 2023-07-07 2023-08-08 中国兵器科学研究院 Space target feature on-orbit real-time parallel LU decomposition computing system and method
CN116560733B (en) * 2023-07-07 2023-10-24 中国兵器科学研究院 Space target feature on-orbit real-time parallel LU decomposition computing system and method
CN117634711A (en) * 2024-01-25 2024-03-01 北京壁仞科技开发有限公司 Tensor dimension segmentation method, system, device and medium
CN117634711B (en) * 2024-01-25 2024-05-14 北京壁仞科技开发有限公司 Tensor dimension segmentation method, system, device and medium

Similar Documents

Publication Publication Date Title
EP3373210B1 (en) Transposing neural network matrices in hardware
CN112214726B (en) Operation accelerator
US10534607B2 (en) Accessing data in multi-dimensional tensors using adders
CN114996649A (en) Method for realizing matrix decomposition and lower triangular matrix inversion
US10108538B1 (en) Accessing prologue and epilogue data
CN108205519B (en) Matrix multiply-add operation device and method, processing device, chip and electronic device
CN110415157B (en) Matrix multiplication calculation method and device
US9946539B1 (en) Accessing data in multi-dimensional tensors using adders
CN107992329A (en) A kind of computational methods and Related product
CN108845828B (en) Coprocessor, matrix operation acceleration method and system
CN116720551B (en) Convolution acceleration method and convolution accelerator of impulse neural network
CN114503126A (en) Matrix operation circuit, device and method
CN107943756A (en) A kind of computational methods and Related product
CN108108189A (en) A kind of computational methods and Related product
EP4206993A1 (en) Configurable pooling processing unit for neural network accelerator
WO2023122896A1 (en) Data processing method and apparatus
GB2614705A (en) Neural network accelerator with configurable pooling processing unit
US10891991B2 (en) Massively parallel, associative multiplier accumulator
GB2567038B (en) Accessing prologue and epilogue data
WO2020059156A1 (en) Data processing system, method, and program
CN112712461A (en) Image deconvolution processing method and device and terminal equipment
CN111178505B (en) Acceleration method of convolutional neural network and computer-readable storage medium
CN105915233A (en) Encoding method and apparatus, and decoding method and apparatus
CN115600062A (en) Convolution processing method, circuit, electronic device and computer readable storage medium
CN117215757A (en) Method and device for acquiring multidimensional tensor data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination