CN113378115A - Near-memory sparse vector multiplier based on magnetic random access memory - Google Patents

Near-memory sparse vector multiplier based on magnetic random access memory

Info

Publication number
CN113378115A
Authority
CN
China
Prior art keywords
vector
sparse
memory
data
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110689836.7A
Other languages
Chinese (zh)
Other versions
CN113378115B (en)
Inventor
蔡浩
陈骏通
张优优
郭亚楠
周永亮
刘波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110689836.7A
Publication of CN113378115A
Application granted
Publication of CN113378115B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/325 Power saving in peripheral device
    • G06F1/3275 Power saving in memory, e.g. RAM, cache
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/02 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using magnetic elements
    • G11C11/16 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements in which the storage effect is based on magnetic spin effect
    • G11C11/165 Auxiliary circuits

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a near-memory sparse vector multiplier based on Magnetic Random Access Memory (MRAM), belonging to the field of integrated circuit design. It comprises a sparse flag generator, an input unit, a controller, a near-memory multiply accumulator, near-memory processing units, a core memory array, cache memory arrays, sense amplifiers and a shift adder tree. The multiplier computes the product of two signed integer vectors and automatically skips zero vectors. MRAM is non-volatile and has extremely low standby power consumption; in addition, a sparse flag bit is introduced and the computation is performed at the output stage of the memory, which reduces data-transfer power consumption and switching power consumption respectively. Compared with a traditional von Neumann neural network accelerator, the design effectively improves the energy efficiency of vector multiplication.

Description

Near-memory sparse vector multiplier based on magnetic random access memory
Technical Field
The invention relates to the field of integrated circuits, in particular to a magnetic random access memory-based near-memory sparse vector multiplier.
Background
In recent years, neural networks have been widely applied in fields such as computer vision and natural language processing, driving a new wave of artificial intelligence. A neural network is composed of layers with different functions; mainstream designs include convolution layers, fully-connected layers, activation function layers, normalization layers, attention layers, etc. In application, the core computation can be abstracted as vector multiplication, as shown in formula (1):
$y = \vec{a} \cdot \vec{w} = \sum_{n} a_n w_n \qquad (1)$

where $\vec{a}$ is the activation vector (the input, or the result computed by each layer), which changes continuously throughout the network computation, and $\vec{w}$ is the weight vector, which is fixed and does not change.
At present, to effectively reduce the consumption of hardware resources, especially in embedded mobile devices, one idea is quantization: converting activation values and weights from 32-bit floating-point numbers to 8-bit integers, which greatly reduces storage requirements and computation without losing application performance, thereby improving energy efficiency. Another idea is to exploit the sparsity of activation values or weights, as in the following example:
$\vec{a} = (0,\ 0,\ 0,\ 0,\ a_4,\ a_5,\ a_6,\ a_7)$

For such a vector $\vec{a}$, the product of its first four elements with the corresponding elements of any vector is 0, so skipping zero-vector multiplications can effectively reduce power consumption. Current approaches to sparse vector multiplication mostly read the data first and then test it; this reduces computation power, but the memory accesses still take place, and since each element occupies a bit width of 8 bits, memory-access power remains the dominant factor, so there is still room for optimization.
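To make the saving concrete, here is a minimal software sketch (ours, not from the patent) of zero-block skipping in a dot product; the flag test stands in for the sparse flag bit that lets the hardware avoid reading a block at all:

```python
# A minimal sketch of zero-block skipping in a dot product: the flag for
# each 8-element block is checked before any block data is touched,
# mirroring how a sparse flag bit avoids the memory access entirely.

def sparse_dot(activations, weights, block=8):
    """Dot product that skips blocks whose activation slice is all zero."""
    assert len(activations) == len(weights)
    total = 0
    for start in range(0, len(activations), block):
        a_blk = activations[start:start + block]
        if not any(a_blk):          # sparse flag: whole block is zero
            continue                # skip both the read and the multiply
        w_blk = weights[start:start + block]
        total += sum(a * w for a, w in zip(a_blk, w_blk))
    return total

print(sparse_dot([0, 0, 0, 0, 1, 2, 3, 4], [5, 6, 7, 8, 1, 1, 1, 1]))  # 10
```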
In the conventional von Neumann architecture, the memory and the computing unit are separate: when a computation is required, data must first be moved into the computing unit's cache, usually built from Static Random Access Memory (SRAM) or flip-flops, and the result then moved back to memory, which consumes a great deal of energy on data movement and cache updates. Near-Memory Computing (NMC) breaks with this architecture by integrating the computing circuits with the memory, greatly reducing data-movement and memory-access power. Since NMC usually pairs a memory array with a digital processing unit, computation accuracy is guaranteed, but further reducing the power consumption of these two circuits remains a key challenge for NMC architectures. Most NMC technologies are based on Dynamic Random Access Memory (DRAM), which requires frequent refresh operations to retain data, or on Flash, whose low speed is a bottleneck for neural network applications with heavy data computation. The emerging non-volatile memory MRAM retains data when powered off, greatly reducing data-retention and leakage power, and its access speed is high enough to meet the computation requirements of neural networks; an MRAM-based near-memory sparse vector multiplier therefore has clear advantages over other NMC technologies.
Disclosure of Invention
The technical problem is as follows: addressing the shortcomings of the prior art, the invention discloses a near-memory sparse vector multiplier based on Magnetic Random Access Memory (MRAM). A sparse flag bit is written alongside each data write, and the near-memory processing unit uses the sparse flag information to skip the memory-access and computation stages, realizing near-memory sparse vector multiplication. The multiplier is power-optimized at both the circuit level and the network-structure level, addressing the low speed and high energy consumption of existing NMC technologies.
The technical scheme is as follows: the invention relates to a near-memory sparse vector multiplier based on a magnetic random access memory, comprising a sparse flag generator, an input unit, a near-memory multiply accumulator and a controller;
the sparse flag generator is connected with the input unit; it judges through a logic circuit whether the input data is 0, generates a sparse flag bit, and passes the data and the sparse flag bit to the input unit; the input data comprise a weight vector and an activation vector;
the input unit is connected with the near-memory multiply accumulator; the near-memory multiply accumulator receives data from the input unit and performs near-memory multiply-accumulate computation, skipping the memory access and computation of zero vectors during that computation;
the controller is connected to the sparse flag generator, the input unit and the near-memory multiply accumulator respectively; it controls the functions of the sparse flag generator, the input unit and the near-memory multiply accumulator, and generates the address signals for reading and storing data.
Further, the sparse flag generator comprises six two-input OR gates and one two-input NOR gate, and judges whether all 8 data bits are 0 to generate the sparse flag bit of the datum. The six OR gates are denoted the first to sixth two-input OR gates: the inputs of the first to fourth OR gates form the input of the sparse flag generator; the outputs of the first and second OR gates feed the fifth OR gate; the outputs of the third and fourth OR gates feed the sixth OR gate; the outputs of the fifth and sixth OR gates feed the two-input NOR gate, whose output is the output of the sparse flag generator.
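The gate tree described above can be sketched behaviorally as follows; this is our model of that logic, with invented helper names, showing that the flag is 1 exactly when all eight data bits are 0:

```python
# Gate-level sketch of the sparse flag generator: four first-stage OR
# gates, two second-stage OR gates, and one final NOR.

def or2(a, b):
    return a | b

def sparse_flag(bits):  # bits: eight 0/1 values, i.e. one 8-bit datum
    assert len(bits) == 8
    s1 = [or2(bits[0], bits[1]), or2(bits[2], bits[3]),
          or2(bits[4], bits[5]), or2(bits[6], bits[7])]   # 4 OR gates
    s2 = [or2(s1[0], s1[1]), or2(s1[2], s1[3])]           # 2 OR gates
    return 1 - or2(s2[0], s2[1])                          # final NOR

print(sparse_flag([0] * 8))                   # 1: datum is zero
print(sparse_flag([0, 0, 0, 1, 0, 0, 0, 0]))  # 0: datum is non-zero
```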
Further, the input unit receives the input data and the sparse flag bits from the sparse flag generator: in each cycle it receives 8 bits of write data together with that datum's sparse flag bit, and updates the current sparse flag after each reception; after eight cycles it has received 64 bits of write data and outputs them together with a single 1-bit sparse flag.
As shown in formula (4), the sparse flag bit F characterizes whether the length-8, bit-width-8 vector is zero, with $F_i$ indicating whether the datum written in the i-th cycle is zero:

$F = F_0 \wedge F_1 \wedge \cdots \wedge F_7 \qquad (4)$
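A behavioral sketch of this input-unit operation, under our reading of formula (4) as the AND of the per-cycle flags, might look like this (names are illustrative, not from the patent):

```python
# Behavioral sketch of the input unit: one 8-bit datum and its flag
# arrive per cycle; after eight cycles the unit holds 64 data bits plus
# one flag that is 1 only if every datum was zero.

def input_unit(data_bytes):          # eight values, each 0..255
    assert len(data_bytes) == 8
    word, flag = 0, 1
    for i, d in enumerate(data_bytes):
        f_i = 1 if d == 0 else 0     # per-cycle flag from the generator
        flag &= f_i                  # F = F0 & F1 & ... & F7
        word |= d << (8 * i)         # append the byte to the 64-bit word
    return word, flag

print(input_unit([0] * 8))                      # (0, 1): zero vector, flag set
print(input_unit([0, 7, 0, 0, 0, 0, 0, 0])[1])  # 0: not a zero vector
```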
Furthermore, the near-memory multiply accumulator comprises near-memory processing units (PE) and a partial-sum accumulator; each near-memory processing unit PE computes in parallel, and the final result is accumulated by the partial-sum accumulator.
The near-memory processing unit comprises an address decoder, a core array MRAM1, a cache array MRAM2, a cache array MRAM3, a first sense amplifier, a second sense amplifier, a shift adder tree and a logical AND module.
The address decoder is connected to the core array MRAM1, the cache array MRAM2 and the cache array MRAM3 respectively; it decodes the address signals output by the controller and stores data to the corresponding addresses, or reads out the data participating in the computation.
The core array MRAM1 stores the weight vectors, the cache array MRAM2 stores the activation vectors, and the cache array MRAM3 stores the output vectors.
The first sense amplifier is connected to the core array MRAM1 and reads the sparse flag bit F0 of the weight vector and its data bits; the second sense amplifier is connected to the cache array MRAM2 and reads the sparse flag bit F1 of the activation vector and its data bits; both sense amplifiers are sensitive to the sparse flag signals.
The first and second sense amplifiers first read the sparse flag bits of the weight vector and the activation vector. F0 and F1 interact and feed back to the two sense amplifiers: if F0 | F1 is true, at least one of the weight vector and the activation vector is zero, so both sense amplifiers are turned off entirely and the memory access for the zero vector is skipped; if F0 | F1 is false, the weight vector and the activation vector are multiplied by the logical AND module and sent to the shift adder tree.
The shift adder tree is sensitive to the sparse flag signals: it receives the sparse flag bits from the first and second sense amplifiers, and if they indicate that a zero vector is present among the vectors to be multiplied, it skips the computation, holds all its data unchanged, and forces its output to 0 through combinational logic, reducing switching power. Otherwise, the outputs of the first and second sense amplifiers are multiplied by the logical AND and sent into the shift adder tree for shift-and-add.
The logical AND module computes the product of the activation vector and the weight vector: each cycle it computes a (1 bit × 8 bits) product and sends the result to the shift adder tree, so an (8 bits × 8 bits) computation completes in 8 cycles.
The near-memory multiply accumulator operates as a three-stage pipeline: PE computation, partial-sum accumulation, and write-back. Vector multiplication is performed inside each PE; the accumulated results of the 48 PEs are then sent to the partial-sum accumulator, where the accumulation is performed and the result is shifted to restore the data to 8 bits; finally the 8-bit data is written back to the cache array MRAM3. Throughout this process, read operations occur in the core array MRAM1 and the cache array MRAM2, and write operations occur in the cache array MRAM3.
Further, the core array MRAM1 stores the weight vectors; the weight matrix M is mapped into the near-memory processing unit PE core array MRAM1 as shown in formula (2), where row i of the array is

$\left[\, m_{i,0}[7{:}0] \;\; m_{i,1}[7{:}0] \;\cdots\; m_{i,7}[7{:}0] \;\; f_{wi} \,\right] \qquad (2)$

The mapping expands each element of the weight matrix M into an 8-bit binary number, and each row additionally carries a sparse flag bit indicating whether that row's vector is zero.
Further, the cache array MRAM2 stores the activation vector $\vec{a}$. Its mapping in the cache array MRAM2 disclosed by the invention is shown in formula (3), where row x of the array is

$\left[\, a_0[x] \;\; a_1[x] \;\cdots\; a_7[x] \;\; f_{ax} \,\right] \qquad (3)$

The mapping expands each element of the activation vector $\vec{a}$ into an 8-bit binary number; each row holds the same address bit of the eight operands together with one sparse flag bit, which indicates whether that row and all previous rows are zero. For example, fa7 indicates whether its own row is zero and whether fa0 through fa6 are also zero, so fa7 indicates whether the whole activation vector $\vec{a}$ is a zero vector.
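The bit-plane layout and the cumulative flags can be modeled as below; this is our interpretation of formula (3), not circuit-accurate code:

```python
# Sketch of the MRAM2 bit-plane layout: row b stores bit b of all eight
# activations; its flag is cumulative, so the flag of the last row
# marks whether the whole vector is zero.

def map_activations(acts):           # eight activation bytes, each 0..255
    rows = []
    flag = 1
    for b in range(8):               # one row per bit position
        bits = [(a >> b) & 1 for a in acts]
        row_zero = 1 if not any(bits) else 0
        flag &= row_zero             # this row zero AND all earlier rows zero
        rows.append((bits, flag))
    return rows                      # rows[7][1] == 1 iff the vector is zero

print(map_activations([0] * 8)[7][1])                    # 1: zero vector
print(map_activations([0, 0, 3, 0, 0, 0, 0, 0])[7][1])   # 0: non-zero element present
```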
Beneficial effects: by adopting the above technical scheme, the invention has the following advantages:
(1) The invention builds the near-memory sparse vector multiplier on MRAM: data stored in the MRAM array is not lost when power is cut, matching the storage requirement of neural network applications in which the large weight sets are rarely updated, and effectively reducing data-retention power; at the same time, near-memory computation greatly reduces data-movement power, improving overall energy efficiency.
(2) The invention uses the sparse flag generator to judge the sparsity of input data, recording sparsity with only 1.6% storage overhead, and overcomes the drawback that all data must still be accessed when operating on sparse vectors.
(3) The invention uses the near-memory sparse vector multiplier to realize fully-connected 8-bit quantized neural network computation, skipping the memory-access and computation stages based on the sparse flag bits and reducing both memory-access power and computation power.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below; obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a block diagram of a structure for implementing MNIST handwritten digit recognition by using a magnetic random access memory-based near-memory sparse vector multiplier according to an embodiment of the present invention;
FIG. 2 is a block diagram of a magnetic random access memory-based near-memory sparse vector multiplier according to an embodiment of the present invention;
FIG. 3 is a circuit diagram of a sparse flag generator provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a near memory multiply accumulator according to an embodiment of the present invention;
FIG. 5 is a block diagram of a near memory processing unit according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a shift adder according to an embodiment of the present invention;
FIG. 7 is a timing diagram illustrating operation of a near memory processing unit according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a working pipeline of a magnetic random access memory-based near-memory sparse vector multiplier according to an embodiment of the present invention;
FIG. 9 is a comparison diagram of power consumption of a near-memory sparse vector multiplication provided by an embodiment of the present invention;
fig. 10 is a statistical result of sparsity of a neural network in an MNIST handwriting database application according to an embodiment of the present invention;
fig. 11 is a block diagram of the multiplier of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings of the embodiments; obviously, the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a structural block diagram of MNIST handwritten digit recognition implemented with the magnetic-random-access-memory-based near-memory sparse vector multiplier of an embodiment of the present invention. The picture to be recognized is converted into an input vector; the circles in the box represent weight vectors; a set of probability values is obtained through repeated vector multiplications, and the digit corresponding to the maximum value in this probability vector is the recognition result. The vector multiplications are realized with the near-memory sparse vector multiplier.
As shown in fig. 2 and fig. 11, the near-memory sparse vector multiplier based on the magnetic random access memory of the present invention includes a sparse flag generator, an input unit, a controller and a near-memory multiply accumulator.
The sparse flag generator is connected with the input unit; it judges through a logic circuit whether the input data is 0, generates a sparse flag bit, and passes the data and the sparse flag bit to the input unit. The input data comprise a weight vector and an activation vector.
The input unit is connected with the near-memory multiply accumulator; the near-memory multiply accumulator receives data from the input unit and performs near-memory multiply-accumulate computation, skipping the memory access and computation of zero vectors during that computation.
The controller is connected to the sparse flag generator, the input unit and the near-memory multiply accumulator respectively, and controls the functions of the sparse flag generator, the input unit and the near-memory multiply accumulator.
As shown in fig. 3, the sparse flag generator is a combinational logic circuit composed of 6 two-input OR gates and 1 two-input NOR gate; it judges whether all 8 data bits are 0, implementing the logical operation of formula (5) to generate the datum's sparse flag bit:

$F = \overline{d_7 + d_6 + d_5 + d_4 + d_3 + d_2 + d_1 + d_0} \qquad (5)$

where $d_7 \ldots d_0$ are the 8 bits of the input datum and + denotes logical OR.
In this embodiment, a 64 × 384 fully-connected layer is used as the design object: the weight data is a 64 × 384 matrix, the input activation vector is a 1 × 384 ordered array, and the output activation vector is a 1 × 64 ordered array. The system completes the computation of formula (6), where -128 ≤ i, w ≤ 127:

$o_k = \sum_{n=0}^{383} i_n \, w_{k,n}, \qquad k = 0, 1, \ldots, 63 \qquad (6)$
As shown in fig. 4, the near-memory multiply accumulator provided by the embodiment of the present invention comprises 48 near-memory processing units PE and a partial-sum accumulator; each near-memory processing unit PE computes in parallel, and the computation results are accumulated in the partial-sum accumulator. The 64 × 384 weight array is therefore divided into 48 groups of 64 × 8 data, one group per PE, and the input activation vector is likewise divided into 48 groups of 1 × 8 data, one group per PE. Equation (6) can thus be rewritten as formula (7), where j denotes the j-th PE unit:

$o_k = \sum_{j=0}^{47} \sum_{n=0}^{7} i_{j,n} \, w_{k,j,n}, \qquad k = 0, 1, \ldots, 63 \qquad (7)$
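The equivalence between formula (6) and its 48-PE decomposition in formula (7) can be checked with a short script (our notation; indices zero-based):

```python
# Check that splitting the 384-element inner product across 48 PEs of 8
# elements each, as in formula (7), matches the direct 64x384
# matrix-vector product of formula (6).

import random

random.seed(0)
W = [[random.randint(-128, 127) for _ in range(384)] for _ in range(64)]
i_vec = [random.randint(-128, 127) for _ in range(384)]

direct = [sum(W[k][n] * i_vec[n] for n in range(384)) for k in range(64)]

split = []
for k in range(64):
    acc = 0
    for j in range(48):                      # one 64x8 slice per PE
        lo = 8 * j
        acc += sum(W[k][lo + n] * i_vec[lo + n] for n in range(8))
    split.append(acc)

print(direct == split)                       # True
```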
As shown in fig. 5, the near-memory processing unit includes an address decoder, a core array MRAM1, a cache array MRAM2, a cache array MRAM3, a first sense amplifier, a second sense amplifier, a shift adder tree, and a logical AND module.
The address decoder is connected to the core array MRAM1, the cache array MRAM2 and the cache array MRAM3 respectively; it decodes the address signals output by the controller and stores data to the corresponding addresses, or reads out the data participating in the computation.
The core array MRAM1 stores the weight vectors, the cache array MRAM2 stores the activation vectors, and the cache array MRAM3 stores the output vectors.
The first sense amplifier is connected to the core array MRAM1 and reads the sparse flag bit F0 of the weight vector and its data bits; the second sense amplifier is connected to the cache array MRAM2 and reads the sparse flag bit F1 of the activation vector and its data bits. F0 and F1 interact and feed back to the two sense amplifiers: if F0 | F1 is true, at least one of the weight vector and the activation vector is zero, so both sense amplifiers are turned off entirely and the computation cycle is skipped. If F0 | F1 is false, the weight vector and the activation vector are multiplied by the logical AND module and sent to the shift adder tree.
The logical AND module computes the product of the activation vector and the weight vector: each cycle it computes a (1 bit × 8 bits) product and sends the result to the shift adder tree, completing an (8 bits × 8 bits) computation in 8 cycles.
The mapping of the weight matrix $W_j$ of equation (7) into the core array MRAM1 of each PE is shown in formula (8): each element of the weight matrix W is expanded into an 8-bit binary number, and each row additionally carries a sparse flag bit $f_{wjx}$ (j being the j-th PE, x the x-th operand of that PE). The controller provided in this embodiment of the present invention generates a weight-vector write signal to write the weight vectors into the core array MRAM1; since the MRAM used in this embodiment retains data when power is off, the weights need to be uploaded only once for the MNIST handwriting recognition application of this embodiment. Row x of the array is

$\left[\, w_{j,x,0}[7{:}0] \;\; w_{j,x,1}[7{:}0] \;\cdots\; w_{j,x,7}[7{:}0] \;\; f_{wjx} \,\right] \qquad (8)$
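A sketch of the 65-bit row packing we infer from formula (8) follows, with illustrative field names; the flag convention (1 = all-zero row) matches the sparse flag generator above:

```python
# Sketch of the per-row weight layout: eight 8-bit weights in two's
# complement plus one sparse flag bit, 65 bits in all.

def map_weight_row(weights):         # eight signed weights, -128..127
    assert len(weights) == 8
    row_bits = 0
    for n, w in enumerate(weights):
        byte = w & 0xFF              # two's-complement encoding
        row_bits |= byte << (8 * n)
    flag = 1 if all(w == 0 for w in weights) else 0
    return row_bits, flag            # 64 data bits + 1-bit sparse flag

bits, f = map_weight_row([0, 0, 0, 0, 0, 0, 0, 0])
print(f)                             # 1: the whole row can be skipped
bits, f = map_weight_row([1, -2, 3, -4, 5, -6, 7, -8])
print(f)                             # 0: row participates in computation
```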
The activation vector mapping is shown in formula (9), where $\vec{i}_j$ denotes the input vector corresponding to the j-th PE: each element is expanded into an 8-bit binary number, and each row holds the same address bit of the eight operands together with one sparse flag bit $f_{ijx}$ (j being the j-th PE, x the x-th operand of that PE). The controller provided by the embodiment of the present invention generates an activation-vector write signal to write the activation vector into the cache array MRAM2. Row x of the array is

$\left[\, i_{j,0}[x] \;\; i_{j,1}[x] \;\cdots\; i_{j,7}[x] \;\; f_{ijx} \,\right] \qquad (9)$
The intra-PE computation is thus as shown in formula (10):

$S_{k,j} = \sum_{n=0}^{7} i_{j,n} \, w_{k,j,n} \qquad (10)$
Fig. 6 is a schematic diagram of the shift adder provided by the embodiment of the present invention: the 8 data of bit width 8 are added pairwise, and the final result is computed and stored in $S_{reg}$. The shift adder is sensitive to the sparse flag bits: if the input vector is a zero vector, $S_{reg}$ remains unchanged and the output is set to 0; otherwise it performs the shift-and-add.
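The bit-serial multiply of formula (10) together with the pairwise adder tree of fig. 6 can be modeled as follows; the two's-complement handling of the sign bit-plane is our assumption, since the patent states only that signed 8-bit vectors are supported:

```python
# Bit-serial sketch of the intra-PE computation: each cycle ANDs one
# activation bit-plane (MSB first) with the full weight row, reduces it
# through a pairwise adder tree, and shift-accumulates the partial sum.

def adder_tree(values):              # pairwise reduction of 8 partial products
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def pe_dot(acts, weights):           # eight signed activations and weights
    s_reg = 0
    for b in range(7, -1, -1):       # MSB (bit 7) first, as in the timing diagram
        plane = [(a >> b) & 1 for a in [x & 0xFF for x in acts]]
        partial = adder_tree([w if bit else 0 for w, bit in zip(weights, plane)])
        if b == 7:
            partial = -partial       # sign bit-plane carries weight -2^7
        s_reg = (s_reg << 1) + partial if b < 7 else partial
    return s_reg

acts, weights = [3, -1, 0, 7, -8, 2, 0, 5], [-2, 4, 9, -1, 3, 0, 6, -5]
print(pe_dot(acts, weights), sum(a * w for a, w in zip(acts, weights)))  # -66 -66
```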
As shown in fig. 7, the timing of the near-memory processing unit is as follows: the controller generates the read-enable signal SAE; on the falling edge of SAE, the core array MRAM1 reads out the weight sparse flag F0 and the cache array MRAM2 reads out the input sparse flag F1.
On the rising edge of SAE, if both sparse flag bits are 0, neither the activation vector nor the weight vector is 0, so the unit prepares to enter the computation, which comprises the following simultaneous steps:
a) according to the stored data mapping, all data bits of the weight vector (8 × 8 bits) are read out on the rising edge together with the most significant bit-plane of the activation vector (8 × 1 bit); the two are ANDed, producing the product (8 × 8 bits) of the weight vector with the activation vector's most significant bits, which is sent into the shift adder tree;
b) the output of the shift adder tree is reset on the rising edge, and it outputs S0 in the next cycle;
c) the read enable of the first sense amplifier is turned off (its data is held in a register inside the sense amplifier, so no switching or read power is consumed) while the read enable of the second sense amplifier is kept on; in the next cycle the next-most-significant bit-plane of the activation vector is read out, ANDed with the weight vector and sent into the shift adder tree, and S0 is shifted left by one bit and accumulated with the current result;
d) operation c) is repeated until the least significant bit-plane of the activation vector has been read out, and the shift adder tree outputs the final accumulated result S7.
If either sparse flag bit is 1, there is a zero vector among the activation vector and weight vector at that address; neither the weight and activation vectors nor the value of the register in the shift adder (PSUM) is updated during the next eight cycles, and the output of the shift adder is set to 0 by the combinational logic.
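The per-vector control decision can be summarized in a few lines (function names are ours, not the patent's):

```python
# Control-flow sketch of one vector slot in the PE: read the two flag
# bits first; if either is set, hold the registers, force the output to
# 0, and skip the eight compute cycles.

def pe_slot(f_w, f_a, compute_fn):
    """f_w, f_a: sparse flags of the weight and activation vectors."""
    if f_w | f_a:                    # F0 | F1: at least one vector is zero
        return 0, 'skipped'          # sense amplifiers off, PSUM held, output 0
    return compute_fn(), 'computed'  # eight bit-serial cycles as above

print(pe_slot(1, 0, lambda: 42))     # (0, 'skipped')
print(pe_slot(0, 0, lambda: 42))     # (42, 'computed')
```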
Fig. 8 is a schematic diagram of the working pipeline of the magnetic-random-access-memory-based near-memory sparse vector multiplier of the embodiment. After the weight matrix has been uploaded, the near-memory multiply accumulator enters the inference stage: vector multiply-accumulation is performed inside each PE, the accumulated results of the 48 PEs are sent to the partial-sum accumulator, the accumulation is performed and the result is shifted to restore the data to 8 bits, and finally the 8-bit data is written back to the cache array MRAM3. Throughout the process, read operations occur in the core array MRAM1 and the cache array MRAM2, and write operations occur in the cache array MRAM3.
Fig. 9 compares the power consumption of the near-memory sparse vector multiplication of the embodiment. From the computation process above, when a single PE computes a non-zero vector it must read 130 bits of data (8 × 8 bits of weight vector plus a 1-bit sparse flag, and 8 × 8 bits of activation vector plus a 1-bit sparse flag); since the activation vector is read only 8 bits at a time, 93 bits of registers are needed (8 × 8 bits of weights plus a 1-bit flag, 8 bits of activation plus a 1-bit flag, and 19 bits for the accumulated sum), plus some combinational logic. Thanks to the sparse flag bits, when either the activation vector or the weight vector processed by the PE is a zero vector, only the 2 flag bits need to be read in the first cycle; all sense amplifiers are then turned off, the registers hold their previous values, and the output is set to 0 by combinational logic. In this embodiment, register toggling and the read stage of the sense amplifiers consume more than 80% of the energy, so this scheme effectively reduces power consumption and improves energy efficiency.
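These bit counts, and the roughly 1.6% flag overhead cited earlier, can be checked with simple arithmetic (ours):

```python
# Quick arithmetic check of the access and register counts quoted above,
# and of the storage overhead of the sparse flag bits.

bits_nonzero = (8 * 8 + 1) + (8 * 8 + 1)  # both vectors, each with its flag
bits_skipped = 2                          # only the two flag bits are read
print(bits_nonzero, bits_skipped)         # 130 2

regs = (8 * 8 + 1) + (8 + 1) + 19         # weights+flag, one activation byte+flag, 19-bit sum
print(regs)                               # 93

print(100 * 1 / 64)                       # flag bits per 64 data bits: 1.5625 (~1.6%)
```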
Fig. 10 shows the sparsity statistics of the neural network in the MNIST handwriting data set application of the embodiment, obtained by analyzing the weights and input data uploaded to the PE units. The statistics map onto the structure of the near-memory vector multiplier: each box represents a PE unit, with darker color meaning lower sparsity and lighter color higher sparsity. The overall average sparsity is 61.2%, i.e. on average more than six of every ten computations are skipped. The near-memory sparse vector multiplier of this embodiment therefore saves power and improves energy efficiency by recognizing sparsity and skipping the computation process.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention falls within the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the claims.

Claims (6)

1. A near-memory sparse vector multiplier based on a magnetic random access memory, characterized by comprising a sparse flag generator, an input unit, a near-memory multiply accumulator and a controller;
the sparse flag generator is connected with the input unit; it judges through a logic circuit whether the input data is 0, generates a sparse flag bit, and passes the data and the sparse flag bit to the input unit; the input data comprise a weight vector and an activation vector;
the input unit is connected with the near-memory multiply accumulator; the near-memory multiply accumulator receives data from the input unit and performs near-memory multiply-accumulate computation, skipping the memory access and computation of zero vectors during that computation;
the controller is connected to the sparse flag generator, the input unit and the near-memory multiply accumulator respectively; it controls the functions of the sparse flag generator, the input unit and the near-memory multiply accumulator, and generates the address signals for reading and storing data.
2. The magnetic-random-access-memory-based near-memory sparse vector multiplier of claim 1, wherein the sparse flag generator comprises six two-input OR gates and one two-input NOR gate, judges whether all 8 data bits are 0, and generates the sparse flag bit of the datum.
3. The magnetic-random-access-memory-based near-memory sparse vector multiplier of claim 1, wherein the input unit receives the input data and the sparse flag bits from the sparse flag generator: in each cycle it receives 8 bits of write data together with that datum's sparse flag bit and updates the current sparse flag after each reception; after eight cycles it has received 64 bits of write data and outputs them together with a single 1-bit sparse flag;

$F = F_0 \wedge F_1 \wedge \cdots \wedge F_7 \qquad (4)$

as shown in formula (4), the sparse flag bit F characterizes whether the length-8, bit-width-8 vector is zero, with $F_i$ indicating whether the datum written in the i-th cycle is zero.
4. The magnetic-random-access-memory-based near-memory sparse vector multiplier of claim 1, wherein the near-memory multiply accumulator comprises near-memory processing units (PE) and a partial-sum accumulator; each near-memory processing unit PE in the near-memory multiply accumulator computes in parallel, and the final result is accumulated by the partial-sum accumulator;
the near-memory processing unit comprises an address decoder, a core array MRAM1, a cache array MRAM2, a cache array MRAM3, a first sense amplifier, a second sense amplifier, a shift adder tree and a logical AND module;
the address decoder is connected to the core array MRAM1, the cache array MRAM2 and the cache array MRAM3 respectively; it decodes the address signals output by the controller and stores data to the corresponding addresses, or reads out the data participating in the computation;
the core array MRAM1 stores the weight vectors, the cache array MRAM2 stores the activation vectors, and the cache array MRAM3 stores the output vectors;
the first sense amplifier is connected to the core array MRAM1 and reads the sparse flag bit F0 of the weight vector and its data bits; the second sense amplifier is connected to the cache array MRAM2 and reads the sparse flag bit F1 of the activation vector and its data bits; both sense amplifiers are sensitive to the sparse flag signals;
the first and second sense amplifiers first read the sparse flag bits of the weight vector and the activation vector; F0 and F1 interact and feed back to the two sense amplifiers: if F0 | F1 is true, at least one of the weight vector and the activation vector is zero, so both sense amplifiers are turned off entirely and the memory access for the zero vector is skipped; if F0 | F1 is false, the weight vector and the activation vector are multiplied by the logical AND module and sent to the shift adder tree;
the shift adder tree is sensitive to the sparse flag signals: it receives the sparse flag bits from the first and second sense amplifiers, and if they indicate that a zero vector is present among the vectors to be multiplied, it skips the computation, holds all its data unchanged, and forces its output to 0 through combinational logic, reducing switching power; otherwise the outputs of the first and second sense amplifiers are multiplied by the logical AND and sent into the shift adder tree for shift-and-add;
after the multiplication of the weight vector and the activation vector is completed in the near-memory processing unit PE, the accumulated result of each PE is sent to the partial-sum accumulator; after the accumulation a shift restores the data to 8 bits, and finally the 8-bit output vector is written back to the cache array MRAM3.
5. The magnetic-random-access-memory-based near-memory sparse vector multiplier of claim 4, wherein the core array MRAM1 stores the weight vectors, the weight matrix M being mapped into the near-memory processing unit PE core array MRAM1 as shown in formula (2), where row i of the array is

$\left[\, m_{i,0}[7{:}0] \;\; m_{i,1}[7{:}0] \;\cdots\; m_{i,7}[7{:}0] \;\; f_{wi} \,\right] \qquad (2)$

the mapping expanding each element of the weight matrix M into an 8-bit binary number, each row additionally carrying a sparse flag bit that indicates whether that row's vector is zero.
6. The magnetic-random-access-memory-based near-memory sparse vector multiplier of claim 4, wherein the cache array MRAM2 stores the activation vector $\vec{a}$, its mapping in the cache array MRAM2 being shown in formula (3), where row x of the array is

$\left[\, a_0[x] \;\; a_1[x] \;\cdots\; a_7[x] \;\; f_{ax} \,\right] \qquad (3)$

each element being expanded into an 8-bit binary number, each row holding the same address bit of the eight operands together with one sparse flag bit that indicates whether that row and all previous rows are zero; for example, fa7 indicates whether its own row is zero and whether fa0 through fa6 are also zero, so fa7 indicates whether the whole activation vector $\vec{a}$ is a zero vector.
CN202110689836.7A 2021-06-22 2021-06-22 Near-memory sparse vector multiplier based on magnetic random access memory Active CN113378115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110689836.7A CN113378115B (en) 2021-06-22 2021-06-22 Near-memory sparse vector multiplier based on magnetic random access memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110689836.7A CN113378115B (en) 2021-06-22 2021-06-22 Near-memory sparse vector multiplier based on magnetic random access memory

Publications (2)

Publication Number Publication Date
CN113378115A true CN113378115A (en) 2021-09-10
CN113378115B CN113378115B (en) 2024-04-09

Family

ID=77578375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110689836.7A Active CN113378115B (en) 2021-06-22 2021-06-22 Near-memory sparse vector multiplier based on magnetic random access memory

Country Status (1)

Country Link
CN (1) CN113378115B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115981751A (en) * 2023-03-10 2023-04-18 之江实验室 Near memory computing system, near memory computing method, device, medium and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046916A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
CN110325988A (en) * 2017-01-22 2019-10-11 Gsi 科技公司 Sparse matrix multiplication in associated memory devices
CN110889259A (en) * 2019-11-06 2020-03-17 北京中科胜芯科技有限公司 Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046916A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
CN110325988A (en) * 2017-01-22 2019-10-11 Gsi 科技公司 Sparse matrix multiplication in associated memory devices
CN110889259A (en) * 2019-11-06 2020-03-17 北京中科胜芯科技有限公司 Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
付世航 (Fu Shihang): "Deep convolution algorithm optimization and hardware acceleration" (深度卷积算法优化与硬件加速), China Master's Theses Full-text Database, Information Science and Technology Series, 15 December 2019 (2019-12-15) *
刘世培 (Liu Shipei) et al.: "An efficient FPGA-based sparse matrix multiplier" (一种基于FPGA的稀疏矩阵高效乘法器), Microelectronics (微电子学), no. 02, 20 April 2013 (2013-04-20) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115981751A (en) * 2023-03-10 2023-04-18 之江实验室 Near memory computing system, near memory computing method, device, medium and equipment

Also Published As

Publication number Publication date
CN113378115B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US11625584B2 (en) Reconfigurable memory compression techniques for deep neural networks
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
Imani et al. Acam: Approximate computing based on adaptive associative memory with online learning
WO2021046567A1 (en) Methods for performing processing-in-memory operations on serially allocated data, and related memory devices and systems
US11934826B2 (en) Vector reductions using shared scratchpad memory
CN114072876B (en) Memory processing unit and method for calculating dot product
US11934824B2 (en) Methods for performing processing-in-memory operations, and related memory devices and systems
CN111459552B (en) Method and device for parallelization calculation in memory
CN117574970A (en) Inference acceleration method, system, terminal and medium for large-scale language model
Garzón et al. AIDA: Associative in-memory deep learning accelerator
CN113378115B (en) Near-memory sparse vector multiplier based on magnetic random access memory
CN115394336A (en) Storage and computation FPGA (field programmable Gate array) framework
CN111124999A (en) Dual-mode computer framework supporting in-memory computation
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
CN109978143B (en) Stack type self-encoder based on SIMD architecture and encoding method
CN110085270B (en) Storage operation circuit module and processor
CN115879530A (en) Method for optimizing array structure of RRAM (resistive random access memory) memory computing system
KR20240036594A (en) Subsum management and reconfigurable systolic flow architectures for in-memory computation
CN115586885A (en) Memory computing unit and acceleration method
CN114267391A (en) Machine learning hardware accelerator
CN116129973A (en) In-memory computing method and circuit, semiconductor memory and memory structure
Chen et al. An efficient ReRAM-based inference accelerator for convolutional neural networks via activation reuse
KR102154834B1 (en) In DRAM Bitwise Convolution Circuit for Low Power and Fast Computation
CN115658011B (en) SRAM in-memory computing device of vector multiply adder and electronic equipment
CN115658012B (en) SRAM analog memory computing device of vector multiply adder and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant