KR20170052432A - Calcuating method and apparatus to skip operation with respect to operator having value of zero as operand - Google Patents
- Publication number
- KR20170052432A (Application KR1020160017819A)
- Authority
- KR
- South Korea
- Prior art keywords
- matrix
- buffer
- row
- elements
- zero
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/3243—Power saving in microcontroller unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3005—Arrangements for executing specific machine instructions to perform operations for flow control
- G06F9/30069—Instruction skipping instructions, e.g. SKIP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/30101—Special purpose registers
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Complex Calculations (AREA)
Abstract
An operation method and apparatus are disclosed that skip an operation whose operand has a zero value. Recently, special MCUs (Micro Controller Units), such as sensor hub SoCs (Systems on Chip), have been employed in mobile and wearable devices to process data transmitted by various sensors. Embodiments of the present invention are directed to a hardware accelerator that detects motion direction based on a six-axis sensor. The architecture of the hardware accelerator may be designed based on profiling of the sensor fusion algorithm. In the performance evaluation, the hardware accelerator according to the embodiments of the present invention improves execution time by 100% or more.
Description
The following description relates to a computation method and apparatus for skipping an operation whose operand has a zero value.
With the development of IoT (Internet of Things) services, various sensors are increasingly used in smart devices. FIG. 1 is a view showing an example of a sensor hub in the prior art. Sensors such as a gyro sensor, a motion sensor, an ambient light sensor, an accelerometer, a temperature/humidity sensor, and a pressure sensor are connected to the sensor hub, and the number of such sensors continues to increase.
There are prior art techniques for designing an efficient hardware accelerator for an overall Kalman filter algorithm on an FPGA (Field Programmable Gate Array). While dedicated hardware accelerators may be efficient for certain target systems or sensors, their lack of programmability limits their use in other systems, such as systems that adopt new sensors and/or algorithms. Therefore, a sensor hub MCU architecture (including embedded processors and hardware accelerators) is required that can achieve performance improvements while retaining flexibility and programmability.
Reference literature: S. Cruz, D. M. Munoz, M. Conde, C. H. Llanos, and G. A. Borges, "FPGA implementation of a sequential extended Kalman filter algorithm applied to mobile robotics localization problem," Circuits and Systems, pp. 1-4, Feb. 2013.
Embodiments of the present invention relate to a hardware accelerator for a microcontroller unit (MCU) of a sensor hub. To improve the accuracy of direction estimation and to reduce the energy consumption required for it, the hardware accelerator processes the complex Kalman filter of a sensor fusion algorithm.
Further, the present invention provides an operation method and apparatus that offer greater programmability and can improve Kalman filter processing time by more than 100%.
When a matrix is loaded from memory into registers, a zero bit register stores whether each element of the corresponding row is zero. If an element to be operated on has a value of 0, the operation for that element is skipped, so the present invention provides an operation method and apparatus capable of reducing both the operation execution time and the power consumption required for the operation.
A method of operating a computing device is provided, the method comprising: identifying, in the computing device, a first operand having a zero value among a plurality of first operands, and indicating the first operand having the zero value through a zero bit check buffer; sequentially broadcasting, by the computing device, the plurality of first operands to a plurality of operators included in the computing device, while skipping the broadcasting of any first operand determined through the zero bit check buffer to have a zero value; and processing, in each of the plurality of operators, an operation between the broadcasted first operand and the second operand transmitted to that operator among a plurality of second operands.
According to one aspect, the plurality of first operands are elements of an n-th row of a first matrix composed of a rows and b columns, the plurality of second operands are elements of an m-th row of a second matrix composed of c rows and d columns, a, b, c, and d are natural numbers, n is a natural number equal to or smaller than a, and m is a natural number equal to or smaller than c.
According to another aspect, the broadcasting step sequentially broadcasts the elements of the n-th row of the first matrix, and the step of processing the operation comprises performing, in each of the plurality of operators, a multiplication between one of the elements of the n-th row of the first matrix and the corresponding element among all the elements of the m-th row of the second matrix.
According to another aspect of the present invention, the zero bit check buffer stores a bit string indicating which elements of the n-th row of the first matrix have a zero value.
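As a concrete illustration of this aspect, the bit string can be built once when the row is loaded. The sketch below is a minimal Python model (the function name and list representation are illustrative, not from the patent): bit i is set when element i of the row is zero, so the broadcast stage can consult the bit string instead of re-reading the row buffer.

```python
def zero_bit_string(row):
    """Build a bit string marking which elements of a row are zero.

    Bit i is 1 when row[i] == 0, so a later broadcast stage can skip
    that element without re-reading the row buffer.
    """
    return [1 if e == 0 else 0 for e in row]

row = [3.0, 0.0, -1.5, 0.0]
bits = zero_bit_string(row)
print(bits)  # [0, 1, 0, 1]
```

In hardware this would be a small register written during the DMA load, one bit per element of the row.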
According to another aspect, the method includes loading a third matrix, composed of e rows and f columns, into a first matrix buffer; and loading the second matrix, as a transposed matrix of the third matrix, into a second matrix buffer such that the m-th row of the third matrix is substituted into the m-th column of the second matrix, where e and f are natural numbers.
According to another aspect, the method further comprises loading the first matrix into the first matrix buffer after the second matrix has been loaded into the second matrix buffer as a transposed matrix of the third matrix.
According to another aspect, the third matrix is the same matrix as the first matrix, the values of a and e are the same, and the values of b and f are the same.
According to another aspect of the present invention, the calculating method further includes accumulating the operation results of each of the plurality of operators in a result buffer, wherein the result buffer includes a plurality of storages corresponding to each of the plurality of operators, each storing the operation result of the corresponding operator.
A method is also provided for a matrix operation between a first matrix and a second matrix of a computing device, the first matrix being composed of a rows and b columns and the second matrix being composed of c rows and d columns, the method comprising: loading the n-th row of the first matrix into a first matrix buffer and loading the second matrix into a second matrix buffer; calculating, in the computing device, a multiplication between the i-th element of the n-th row of the first matrix and each of all the elements of the m-th row of the second matrix; and accumulating the multiplication result between the i-th element of the n-th row of the first matrix and the j-th element of the m-th row of the second matrix in the j-th storage of a buffer storing a result matrix. When the value of the i-th element of the n-th row of the first matrix is 0, the multiplications with all the elements of the m-th row of the second matrix are skipped. Here, a, b, c, and d are natural numbers, n is a natural number less than or equal to a, i is a natural number less than or equal to b, m is a natural number less than or equal to c, and j is a natural number less than or equal to d.
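The claimed scheme can be modeled in software as follows. This is a hedged Python sketch (the function name and list-of-lists representation are illustrative, not from the patent): each element of a row of the first matrix is broadcast to all "operators", which multiply it with the corresponding row of the second matrix and accumulate into the result row; a zero element skips the broadcast entirely.

```python
def matmul_skip_zero(A, B):
    """Multiply A (a x b) by B (b x d), skipping broadcasts of zero elements.

    Models the described scheme: for each row n of A, element a[n][i] is
    broadcast and multiplied with every element of row i of B, and the
    products are accumulated into row n of the result. A zero a[n][i]
    skips the broadcast entirely.
    """
    a_rows, b_cols = len(A), len(B[0])
    C = [[0.0] * b_cols for _ in range(a_rows)]
    skipped = 0
    for n, row in enumerate(A):
        zero_bits = [e == 0 for e in row]     # zero bit check buffer
        for i, elem in enumerate(row):
            if zero_bits[i]:                  # skip broadcast of a zero element
                skipped += 1
                continue
            for j in range(b_cols):           # one MAC per operator
                C[n][j] += elem * B[i][j]
    return C, skipped

A = [[1, 0], [0, 2]]
B = [[3, 4], [5, 6]]
C, skipped = matmul_skip_zero(A, B)
print(C)        # [[3.0, 4.0], [10.0, 12.0]]
print(skipped)  # 2
```

The result equals the ordinary matrix product; only the zero broadcasts (two of them here) are elided, which is where the execution-time and power savings come from.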
A computing apparatus is also provided, comprising: a zero bit check unit identifying a first operand having a zero value among a plurality of first operands and indicating the first operand having the zero value through a zero bit check buffer; a broadcasting unit sequentially broadcasting the plurality of first operands to a plurality of operators included in the apparatus, while skipping the broadcasting of any first operand determined through the zero bit check buffer to have a zero value; and the plurality of operators, each processing an operation between the broadcasted first operand and the second operand transmitted to that operator among a plurality of second operands.
A computing apparatus is also provided for a matrix operation between a first matrix composed of a rows and b columns and a second matrix composed of c rows and d columns, the apparatus comprising: a first matrix buffer loading the n-th row of the first matrix; a second matrix buffer loading the second matrix; a multiplier calculating a multiplication between the i-th element of the n-th row of the first matrix and each of all the elements of the m-th row of the second matrix; and an accumulator accumulating the multiplication result between the i-th element of the n-th row of the first matrix and the j-th element of the m-th row of the second matrix in a j-th storage. When the value of the i-th element of the n-th row of the first matrix is 0, the multiplications with all the elements of the m-th row of the second matrix are skipped. Here, a, b, c, and d are natural numbers, n is a natural number less than or equal to a, i is a natural number less than or equal to b, m is a natural number equal to or smaller than c, and j is a natural number equal to or smaller than d.
A sensor hub MCU (Micro Controller Unit) including the computing apparatus is also provided.
A hardware accelerator for a microcontroller unit (MCU) is provided that can process the complex Kalman filter of a sensor fusion algorithm to improve the accuracy of direction estimation and reduce the energy consumption required for direction estimation.
In addition, with greater programmability, Kalman filter processing time can be improved by more than 100%.
When a matrix is loaded from memory into registers, a zero bit register stores whether each element of the corresponding row is zero. If an element to be operated on has a value of 0, the operation for that element is skipped, making it possible to reduce the operation execution time and the power consumption required for the operation.
FIG. 1 is a view showing an example of a sensor hub in the prior art.
FIG. 2 is a diagram showing an example of the overall structure of a sensor hub MCU having a hardware accelerator in an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a process of transmitting matrix information in an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a process of loading data of matrices in an embodiment of the present invention.
FIG. 5 is a diagram illustrating an example of a processing procedure of a MAC operation in an embodiment of the present invention.
FIG. 6 is a diagram illustrating an example of a process of storing calculation results in an embodiment of the present invention.
FIG. 7 is a diagram showing an example of a processing element architecture (PE Architecture) in an embodiment of the present invention.
FIGS. 8 to 13 are diagrams illustrating an example of a process of matrix multiplication in an embodiment of the present invention.
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.
1. Structure
A. Sensor fusion algorithm analysis
Most sensor fusion mechanisms use 6- or 9-axis sensors (accelerometer, gyroscope, and magnetometer) with a Kalman filter. The Kalman filter is used to accurately predict direction based on data obtained from the various sensors. The key operations of sensor fusion based on the Kalman filter can be divided into two parts: the first part predicts the current state based on a predefined state equation, and the second part calibrates the predicted direction using the Kalman gain value. An example of such a sensor fusion algorithm is the open-source Freescale sensor fusion software (see Freescale Sensor Fusion, http://www.freescale.com/), which can be profiled with the DS-5 tool (see ARM Development Tools, http://ds.arm.com/). The profiling results based on DS-5 identify two major performance-bottleneck functions, Kalman gain calculation and error covariance matrix calculation, as shown in Table 1 below. The main operations of these functions are matrix multiplication and matrix transpose, and these two operations account for about 80% of the total execution time.
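The two parts can be illustrated with a minimal scalar Kalman filter step in Python. This is purely illustrative (the actual sensor fusion algorithm operates on matrices, and the noise values here are arbitrary assumptions): predict from the state equation, then correct the prediction using the Kalman gain.

```python
def kalman_step(x, p, z, q=1e-3, r=1e-1):
    """One scalar Kalman filter step: predict, then correct.

    x, p: prior state estimate and its variance
    z:    new measurement; q, r: process and measurement noise
    """
    # Part 1: predict the current state from the state equation (identity here)
    x_pred, p_pred = x, p + q
    # Part 2: calibrate the prediction using the Kalman gain
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new

x, p = 0.0, 1.0
for z in [0.9, 1.1, 1.0]:
    x, p = kalman_step(x, p, z)
print(round(x, 2))  # 0.97 -- the estimate converges toward the measurements
```

In the matrix form used by sensor fusion, the prediction and the gain computation are exactly the matrix multiplications and transposes that dominate the profile in Table 1.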
Table 1 shows an example of execution time for each function.
B. Hardware Accelerator and Sensor Hub Architecture
As we have seen, the main operations of sensor fusion are matrix manipulation operations such as matrix multiplication and matrix transposition. Although other sensor-related operations are not usually based on matrix manipulation, these operations are the main kernels of the target sensor fusion algorithm, so the structure of the hardware accelerator may be chosen with a focus on matrix manipulation. Although matrix manipulation operations are well known and widely studied, the structure of a hardware accelerator for a sensor hub microcontroller unit (MCU) must be chosen prudently, because sensor fusion processing can use different numbers of sensors. Matrices of various sizes must be handled even within the same sensor fusion algorithm. For example, when the sensor fusion algorithm is implemented with 9-axis sensors, multiplications such as a 12 × 12 matrix by a 12 × 6 matrix and a 6 × 12 matrix by a 12 × 12 matrix must be performed. The architecture of the proposed hardware accelerator can be designed to take advantage of these features of sensor fusion processing. To handle matrices of various sizes, the proposed hardware accelerator may employ a broadcasting scheme for the elements of matrix A together with an adaptive control mechanism.
FIG. 2 is a diagram showing an example of the overall structure of a sensor hub MCU having a hardware accelerator in an embodiment of the present invention.
When the multiplication is started, the DMA (Direct Memory Access) 201 can load one row of the matrix B and one row of the matrix A, and sequentially load the remaining rows of the matrix A as the multiplication proceeds. The PEs (Processing Elements) 202 may perform multiplier-accumulator (MAC) operations. The overall procedure is as follows:
1) The embedded processor transmits the matrix information to the hardware accelerator (see FIG. 3).
2) The two row data of the matrix A and all the data of the matrix B can be loaded into the matrix buffers (see FIG. 4).
3) The MAC operations are performed (see FIG. 5).
4) When the MAC operation is completed, the calculation results are stored (see FIG. 6).
FIG. 7 is a diagram showing an example of a processing element architecture (PE Architecture) in an embodiment of the present invention. In the entire sensor hub architecture including the proposed hardware accelerator, one element of matrix A may be broadcast to serve as an operand for all multipliers (the plurality of "MULs" shown in FIG. 7). The other operand of each multiplier is the corresponding element of the loaded row of the matrix B.
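A single broadcast step of this PE array can be sketched as follows (illustrative Python; the names are not from the patent): one element of matrix A is fanned out as a shared operand to every multiplier, whose other operand is the corresponding element of the loaded row of matrix B, and each product is added to that PE's accumulator.

```python
def broadcast_step(a_elem, b_row, accumulators):
    """One PE-array step: the broadcast element of matrix A is multiplied
    with every element of the loaded row of matrix B, and each product is
    accumulated in the corresponding PE's accumulator."""
    for pe, b_elem in enumerate(b_row):
        accumulators[pe] += a_elem * b_elem
    return accumulators

acc = [0.0] * 4                           # one accumulator per PE
broadcast_step(2.0, [1.0, 2.0, 3.0, 4.0], acc)
broadcast_step(3.0, [10.0, 0.0, 0.0, 0.0], acc)
print(acc)  # [32.0, 4.0, 6.0, 8.0]
```

After all elements of a row of matrix A have been broadcast, the accumulators hold one complete row of the result.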
The method of operating the computing device thus includes: identifying a first operand having a zero value among a plurality of first operands and indicating it through a zero bit check buffer; sequentially broadcasting the plurality of first operands to a plurality of operators included in the computing device, while skipping the broadcasting of any first operand determined through the zero bit check buffer to have a zero value; and processing, in each of the plurality of operators, an operation between the broadcasted first operand and the second operand transmitted to that operator. The computing device for this purpose includes a zero bit check unit that identifies a first operand having a zero value among the plurality of first operands and indicates it through the zero bit check buffer, a broadcasting unit that sequentially broadcasts the plurality of first operands to the plurality of operators and skips the broadcasting of a first operand determined to have a zero value, and the plurality of operators that process an operation between the second operands transmitted to them and the broadcasted first operand. For example, the zero bit check unit may correspond to the zero check logic described above with reference to FIG. 2.
To describe a more specific example, the first matrix may be composed of a rows and b columns, and the second matrix may be composed of c rows and d columns. Here, a, b, c, and d may all be natural numbers. For example, the first matrix may correspond to the matrix A described above, and the second matrix to the matrix B described above.
The computing device may load the n-th row of the first matrix into the first matrix buffer and load the second matrix into the second matrix buffer. Here, the first matrix buffer may correspond to the row buffer for matrix A described above, and the second matrix buffer to the buffer for matrix B.
At this time, the computing device can calculate the multiplication between the i-th element of the n-th row of the first matrix and all the elements of the m-th row of the second matrix. For example, a(1, 1), the first element of the first row of the first matrix, can be multiplied with each of the elements of the first row of the second matrix. Likewise, a(1, 2) can be multiplied with each of the elements of the second row of the second matrix. Here, n is a natural number less than or equal to a, i is a natural number less than or equal to b, and m is a natural number less than or equal to c. When the value of the i-th element of the n-th row of the first matrix is 0 (zero), the computing apparatus does not broadcast that element, thereby omitting the multiplication and accumulation operations with all the elements of the m-th row of the second matrix. The computing device indicates which elements of the n-th row of the first matrix have a value of 0 through the zero bit check buffer.
The arithmetic unit may accumulate the multiplication result between the i-th element of the n-th row of the first matrix and the j-th element of the m-th row of the second matrix in the j-th storage of the buffer storing the result matrix. Here, j may be a natural number less than or equal to d. For example, the multiplication result between a(1, 1) and b(1, 1), the first element of the first row of the second matrix, may be stored in the first storage c(1, 1) of the result matrix. Likewise, the multiplication result between a(1, 2) and b(2, 1) can be accumulated into the first storage c(1, 1) of the result matrix.
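The indexing described above can be checked with a short sketch (illustrative Python): each product a(1, i) · b(i, j) is added into the j-th storage, so after the first row of A has been broadcast, storage j holds c(1, j).

```python
# Accumulate the first row of a result matrix C = A x B, as in the text:
# the product a(1,i) * b(i,j) is added into the j-th storage.
A_row1 = [1.0, 2.0, 3.0]               # a(1,1), a(1,2), a(1,3)
B = [[4.0, 5.0],
     [6.0, 7.0],
     [8.0, 9.0]]

storages = [0.0, 0.0]                  # the j-th storage holds c(1,j)
for i, a in enumerate(A_row1):
    for j, b in enumerate(B[i]):
        storages[j] += a * b

# c(1,1) = 1*4 + 2*6 + 3*8 = 40, c(1,2) = 1*5 + 2*7 + 3*9 = 46
print(storages)  # [40.0, 46.0]
```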
For computation involving a transpose matrix, the computing device may first load a third matrix of e rows and f columns into the first matrix buffer. The computing device may then load the second matrix, as the transposed matrix of the third matrix, into the second matrix buffer by substituting the m-th row of the third matrix into the m-th column of the second matrix. In this case, the values of e and d are the same, and the values of f and c are the same. This can be used to handle the multiplication between the first matrix and the second matrix, which is the transpose of the third matrix.
Also, the third matrix may be the same matrix as the first matrix. In this case, the values of a and e are the same, and the values of b and f are the same. This can be used to process a multiplication between a first matrix and a second matrix that is the transpose of the first matrix. For example, if the third matrix is the same as the first matrix, the multiplication between the first matrix and the second matrix, which is the transpose of the third matrix, yields the product of the first matrix and its own transpose.
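This use case can be sketched as follows (illustrative Python, not the buffer hardware): loading each row m of the third matrix into column m of the second matrix buffer leaves the buffer holding the transpose, so when the third matrix equals the first, multiplying by the loaded buffer computes A · Aᵀ.

```python
def load_as_transpose(M):
    """Load matrix M into a buffer so that row m of M becomes column m:
    the buffer then holds the transpose of M."""
    rows, cols = len(M), len(M[0])
    buf = [[0.0] * rows for _ in range(cols)]
    for m in range(rows):
        for k in range(cols):
            buf[k][m] = M[m][k]
    return buf

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = load_as_transpose(A)          # second matrix buffer = transpose of A
print(matmul(A, B))               # A x A^T = [[5.0, 11.0], [11.0, 25.0]]
```

No separate transpose pass is needed; the transposition happens for free during the load.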
The computing device may be included in a sensor hub MCU (Micro Controller Unit) as a hardware accelerator.
8 to 13 are diagrams illustrating an example of a process of matrix multiplication in an embodiment of the present invention.
FIG. 8 shows an example in which rows of a 12 × 12 matrix A and a 12 × 12 matrix B are loaded into the matrix buffers.
FIG. 9 shows the first element a(1, 1) of the first row of the matrix A being broadcast and multiplied with each element of the first row of the matrix B.
FIG. 10 shows the case where the multiplication results for the first element a(1, 1) of the first row of the matrix A are accumulated in the result buffer.
FIG. 11 shows the second element a(1, 2) of the first row of the matrix A being broadcast and multiplied with each element of the second row of the matrix B.
FIG. 12 shows the case where the multiplication results for the second element a(1, 2) of the first row of the matrix A are accumulated in the result buffer.
FIG. 13 shows the first element c(1, 1) of the first row of the result matrix C completed once all elements of the first row of the matrix A have been processed.
The processes of FIGS. 8 to 13 are sequentially repeated for the remaining rows of the matrix A to obtain the entire result matrix.
The proposed hardware accelerator can also perform the matrix transpose operation with a slight modification of the control. Each row of the target matrix to be transposed may be loaded into the row buffer for matrix A (matrix A-row 207), and each element of the row is then passed to the corresponding column of the buffer for matrix B, so that the buffer holds the transposed matrix.
C. Optimization for zero
A hardware accelerator may have special support for processing elements of matrix A that have a value of zero. Because the hardware accelerator broadcasts one element of matrix A and multiplies it with all elements of a row of matrix B, if the value of the element to be broadcast is zero, no multiplication and accumulation operations are required. Thus, referring again to FIG. 2, when a row of matrix A is loaded from the SRAM via DMA, the control unit checks the values of the row of matrix A via the zero check buffer, and the broadcasting of zero-valued elements is skipped.
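The saving from this optimization is easy to estimate: every zero element of matrix A removes one full broadcast, i.e. one multiplication and one accumulation in each PE. A small sketch (illustrative Python, not the hardware control logic; the PE width is an assumed parameter):

```python
def mac_counts(A, pe_width):
    """Count MAC operations with and without the zero-skip optimization.

    Each nonzero element of A costs one broadcast of `pe_width` MACs;
    zero elements are skipped entirely.
    """
    total = sum(len(row) for row in A) * pe_width
    skipped = sum(1 for row in A for e in row if e == 0) * pe_width
    return total, total - skipped

A = [[1, 0, 2, 0],
     [0, 0, 3, 4]]
total, performed = mac_counts(A, pe_width=12)
print(total, performed)  # 96 48
```

For the sparse matrices that arise in Kalman filtering (e.g. block-structured covariance updates), the fraction of skipped MACs, and hence the time and power saving, can be substantial.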
2. Evaluation
The MacSim simulator (see MacSim, http://code.***.com/p/macsim/) and the Pin trace tool (see Sion Berkowits, Tevi Devor, "Pin: Intel's Dynamic Binary Instrumentation Engine," CGO, Feb. 2013) can be used for performance evaluation. The clock frequency of the MCU, which has no cache memory, is assumed to be 100 MHz. As shown in Table 2 below, the MCU with the proposed hardware accelerator achieves more than twice the performance of the reference structure without the hardware accelerator. The transpose support and the zero-value optimization achieve additional performance improvements.
3. Conclusion
Thus, the sensor hub structure including the specialized hardware accelerator according to the embodiments of the present invention can effectively process the sensor fusion algorithm. The performance results show that the proposed hardware accelerator achieves a large speed improvement. Such hardware accelerators can be extended to other sensor data processing applications, such as motion detection and context awareness.
As described above, according to the embodiments of the present invention, a hardware accelerator for a sensor hub MCU (Micro Controller Unit) processes the complex Kalman filter of a sensor fusion algorithm in order to improve the accuracy of direction estimation and reduce the energy consumption required for it. In addition, with greater programmability, Kalman filter processing time can be improved by more than 100%. Furthermore, when a matrix is loaded from memory into registers, a zero bit register stores whether each element of the corresponding row is zero; if an element to be operated on has a value of 0, the operation for that element is skipped, reducing the operation execution time and the power consumption required for the operation.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, suitable results may be achieved even if the described techniques are performed in a different order than described, and/or if components of the described systems, structures, devices, or circuits are combined in a different form than described or replaced by other components or equivalents.
Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.
Claims (17)
Identifying a first operand having a zero value among a plurality of first operands in the computing device and indicating a first operand having a zero value through a zero bit verification buffer;
The computing device sequentially broadcasting the plurality of first operands to a plurality of operators included in the computing device, and skipping the broadcasting of a first operand determined through the zero bit verification buffer to have a zero value; And
Processing an operation between a plurality of second operands transmitted in correspondence with each of the plurality of operators in each of the plurality of operators and the broadcasted first operand
A computing method comprising the above steps.
Wherein the plurality of first operands are elements of an nth row of a first matrix consisting of a rows and b columns,
The plurality of second operands being elements of an m-th row of a second matrix consisting of c rows and d columns,
Wherein a, b, c, and d are natural numbers,
Wherein n is a natural number equal to or smaller than the a,
And m is a natural number equal to or smaller than c.
Wherein the broadcasting comprises:
Sequentially broadcasting elements of the nth row of the first matrix,
The method of claim 1,
Wherein each of the plurality of operators performs a multiplication operation between an element of the n-th row of the first matrix and the corresponding element among all the elements of the m-th row of the second matrix.
Wherein the zero bit confirmation buffer stores a bit string indicating which elements of the n-th row of the first matrix have a zero value.
Loading a third matrix consisting of e rows and f columns into a first matrix buffer; And
Loading the second matrix, as a transposed matrix of the third matrix, into a second matrix buffer such that the m-th row of the third matrix is substituted into the m-th column of the second matrix
Further comprising:
And e and f are natural numbers.
Loading the first matrix into the first matrix buffer after the second matrix is loaded into the second matrix buffer as a transposed matrix of the third matrix
The third matrix is the same matrix as the first matrix,
Wherein the values of a and e are the same, and the values of b and f are the same.
Accumulating and storing the operation results of each of the plurality of operators in a result buffer
Further comprising:
Wherein the result buffer includes a plurality of storages corresponding to each of the plurality of operators, each storing the operation result of the corresponding operator.
A zero bit verifier for identifying a first operand having a zero value among a plurality of first operands and indicating a first operand having a zero value through a zero bit verifying buffer;
A broadcasting unit for broadcasting the plurality of first operands sequentially to a plurality of operators included in the arithmetic unit, and for skipping broadcasting of a first operand determined to have a zero value through the zero bit confirmation buffer; And
A plurality of second operands transmitted in correspondence with each of the plurality of operators, and the plurality of operators that process an operation between the broadcasted first operands
A computing apparatus comprising the above units.
Wherein the plurality of first operands are elements of an nth row of a first matrix consisting of a rows and b columns,
The plurality of second operands being elements of an m-th row of a second matrix consisting of c rows and d columns,
Wherein a, b, c, and d are natural numbers,
Wherein n is a natural number equal to or smaller than the a,
And m is a natural number equal to or smaller than c.
The broadcasting unit includes:
Sequentially broadcasting elements of the nth row of the first matrix,
Wherein each of the plurality of operators includes:
And performs a multiplication operation between one of the elements of the n-th row of the first matrix and the corresponding element of all the elements of the m-th row of the second matrix.
Wherein the zero bit check unit generates a bit string indicating which elements of the n-th row of the first matrix are zero, and stores the bit string in the zero bit check buffer.
Further comprising loading a third matrix consisting of e rows and f columns and storing the m-th row of the third matrix as the m-th column of the second matrix, such that the second matrix is loaded into a second matrix buffer as a transposed matrix of the third matrix,
Wherein e and f are natural numbers.
Wherein the first matrix is loaded into the first matrix buffer after the second matrix is loaded into the second matrix buffer as a transposed matrix of the third matrix.
Wherein the third matrix is identical to the first matrix,
And the values of a and e are the same, and the values of b and f are the same.
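The transposed loading described above can be sketched directly: the m-th column of the third matrix becomes the m-th row of the second matrix, so the second matrix buffer can then be streamed row by row to the operators. The function name is ours, for illustration.

```python
# Sketch of loading the second matrix buffer as the transpose of a third
# matrix of e rows and f columns: second[m][k] = third[k][m].

def load_transposed(third_matrix):
    e = len(third_matrix)
    f = len(third_matrix[0])
    return [[third_matrix[k][m] for k in range(e)] for m in range(f)]

T = [[1, 2],
     [3, 4],
     [5, 6]]        # e = 3 rows, f = 2 columns
print(load_transposed(T))  # [[1, 3, 5], [2, 4, 6]]
```

Storing the transpose means each row of the second matrix buffer lines up element-for-element with a row of the first matrix, which is what the per-operator pairing in the earlier claims requires.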
Further comprising a result buffer for cumulatively storing operation results of each of the plurality of operators,
Wherein the result buffer comprises a plurality of arrays, each array corresponding to one of the plurality of operators and storing the operation results of the corresponding operator.
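The per-operator result arrays can be sketched as below. The class and method names are illustrative, not from the patent; the point is that each operator accumulates into its own array, so repeated contributions to the same slot are summed.

```python
# Sketch of the result buffer: one accumulator array per operator
# (names are ours, not the patent's).

class ResultBuffer:
    def __init__(self, num_operators, depth):
        self.arrays = [[0] * depth for _ in range(num_operators)]

    def accumulate(self, operator_idx, slot, value):
        self.arrays[operator_idx][slot] += value

buf = ResultBuffer(num_operators=2, depth=3)
buf.accumulate(0, 1, 5)
buf.accumulate(0, 1, 2)   # sums into the same slot of operator 0's array
buf.accumulate(1, 0, 4)
print(buf.arrays)  # [[0, 7, 0], [4, 0, 0]]
```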
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20150152022 | 2015-10-30 | ||
KR1020150152022 | 2015-10-30 |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20170052432A true KR20170052432A (en) | 2017-05-12 |
KR101843243B1 KR101843243B1 (en) | 2018-03-29 |
Family
ID=58740009
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020160017819A KR101843243B1 (en) | 2015-10-30 | 2016-02-16 | Calcuating method and apparatus to skip operation with respect to operator having value of zero as operand |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101843243B1 (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4490925B2 (en) | 2006-01-16 | 2010-06-30 | 株式会社日立製作所 | Calculation device, calculation method, and calculation program |
GB2436377B (en) | 2006-03-23 | 2011-02-23 | Cambridge Display Tech Ltd | Data processing hardware |
2016-02-16: KR application KR1020160017819A, patent KR101843243B1 (en), active IP Right Grant
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018228703A1 (en) * | 2017-06-16 | 2018-12-20 | Huawei Technologies Co., Ltd. | Multiply accumulator array and processor device |
WO2019074185A1 (en) | 2017-10-12 | 2019-04-18 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
EP3659073A4 (en) * | 2017-10-12 | 2020-09-30 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
US11113361B2 (en) | 2018-03-07 | 2021-09-07 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
WO2019172685A1 (en) * | 2018-03-07 | 2019-09-12 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
CN111819581A (en) * | 2018-03-07 | 2020-10-23 | 三星电子株式会社 | Electronic device and control method thereof |
US11676239B2 (en) | 2019-03-15 | 2023-06-13 | Intel Corporation | Sparse optimizations for a matrix accelerator architecture |
US11361496B2 (en) | 2019-03-15 | 2022-06-14 | Intel Corporation | Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format |
WO2020190807A1 (en) * | 2019-03-15 | 2020-09-24 | Intel Corporation | Systolic disaggregation within a matrix accelerator architecture |
US11709793B2 (en) | 2019-03-15 | 2023-07-25 | Intel Corporation | Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format |
US11842423B2 (en) | 2019-03-15 | 2023-12-12 | Intel Corporation | Dot product operations on sparse matrix elements |
US11899614B2 (en) | 2019-03-15 | 2024-02-13 | Intel Corporation | Instruction based control of memory attributes |
US11934342B2 (en) | 2019-03-15 | 2024-03-19 | Intel Corporation | Assistance for hardware prefetch in cache access |
US11954063B2 (en) | 2019-03-15 | 2024-04-09 | Intel Corporation | Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format |
US11954062B2 (en) | 2019-03-15 | 2024-04-09 | Intel Corporation | Dynamic memory reconfiguration |
US11995029B2 (en) | 2019-03-15 | 2024-05-28 | Intel Corporation | Multi-tile memory management for detecting cross tile access providing multi-tile inference scaling and providing page migration |
US12007935B2 (en) | 2019-03-15 | 2024-06-11 | Intel Corporation | Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format |
US12013808B2 (en) | 2019-03-15 | 2024-06-18 | Intel Corporation | Multi-tile architecture for graphics operations |
Also Published As
Publication number | Publication date |
---|---|
KR101843243B1 (en) | 2018-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101843243B1 (en) | Calcuating method and apparatus to skip operation with respect to operator having value of zero as operand | |
US10817260B1 (en) | Reducing dynamic power consumption in arrays | |
US11816446B2 (en) | Systolic array component combining multiple integer and floating-point data types | |
EP3602278B1 (en) | Systems, methods, and apparatuses for tile matrix multiplication and accumulation | |
US11449745B2 (en) | Operation apparatus and method for convolutional neural network | |
US11467806B2 (en) | Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range | |
US20180329867A1 (en) | Processing device for performing convolution operations | |
EP3451162B1 (en) | Device and method for use in executing matrix multiplication operations | |
EP3394723B1 (en) | Instructions and logic for lane-based strided scatter operations | |
EP3910503A1 (en) | Device and method for executing matrix addition/subtraction operation | |
EP3391203B1 (en) | Instructions and logic for load-indices-and-prefetch-scatters operations | |
WO2017112246A1 (en) | Instructions and logic for load-indices-and-gather operations | |
EP3394742A1 (en) | Instructions and logic for load-indices-and-scatter operations | |
WO2017105717A1 (en) | Instructions and logic for get-multiple-vector-elements operations | |
US20160179514A1 (en) | Instruction and logic for shift-sum multiplier | |
EP3391234A1 (en) | Instructions and logic for set-multiple-vector-elements operations | |
US20160092400A1 (en) | Instruction and Logic for a Vector Format for Processing Computations | |
US20170177351A1 (en) | Instructions and Logic for Even and Odd Vector Get Operations | |
US20140244987A1 (en) | Precision Exception Signaling for Multiple Data Architecture | |
US20130013283A1 (en) | Distributed multi-pass microarchitecture simulation | |
US20140297996A1 (en) | Multiple hash table indexing | |
US20160019062A1 (en) | Instruction and logic for adaptive event-based sampling | |
US9910669B2 (en) | Instruction and logic for characterization of data access | |
CN116611476A (en) | Performance data prediction method, performance data prediction device, electronic device, and medium | |
Douma et al. | Fast and precise cache performance estimation for out-of-order execution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |