WO2023207441A1 - SRAM compute-in-memory chip based on capacitive coupling - Google Patents

SRAM compute-in-memory chip based on capacitive coupling

Info

Publication number
WO2023207441A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
input
sram
multiplication
data
Prior art date
Application number
PCT/CN2023/083070
Other languages
English (en)
French (fr)
Inventor
王源
Original Assignee
北京大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学 filed Critical 北京大学
Publication of WO2023207441A1 publication Critical patent/WO2023207441A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the technical field of integrated circuit design, and in particular to an SRAM storage and calculation integrated chip based on capacitive coupling.
  • In-memory computing (Compute-In-Memory, CIM) technology refers to transforming the traditional computation-centric architecture into a data-centric architecture, which uses the memory directly for data processing, thereby integrating data storage and computing in the same chip to form an integrated storage and computing chip. This can completely eliminate the bottleneck of the von Neumann computing architecture and reduce the additional power consumption and performance loss caused by data transfer.
  • Static Random Access Memory (SRAM) can be widely used to construct storage and computing integrated chips due to its high speed, low power consumption and high robustness.
  • the storage and calculation integrated chip can be used as a hardware implementation of the multiplication and accumulation operation of the neural network model.
  • however, to realize multi-bit data accumulation, the existing storage and calculation integrated chip usually uses a charge-based CIM structure and implements a charge sharing circuit with a switch array and additional control signals in the analog domain; the sharing control is complex and the delay is large, which greatly degrades the computing performance of the integrated storage and computing chip.
  • the present disclosure provides an SRAM storage and computing integrated chip based on capacitive coupling to solve the defects existing in the prior art.
  • the present disclosure provides an SRAM storage and calculation integrated chip based on capacitive coupling, including: an input module, a bitwise multiplication module, a capacitance attenuation module and an output module.
  • the input module, the bitwise multiplication module, the capacitance attenuation module and the output module are connected in sequence;
  • the input module is used to receive input data
  • the bitwise multiplication module includes a plurality of bit multiplication units, and each bit multiplication unit is used to multiply, based on the capacitive coupling principle, the input data with one bit of the bitwise-stored data to obtain the multiplication result corresponding to that bit of the stored data;
  • the capacitive attenuation module includes a two-layer capacitive attenuator array. Each first-type capacitive attenuator in the first-layer capacitive attenuator array is connected between two adjacent bit multiplication units, and each second-type capacitive attenuator in the second-layer capacitive attenuator array is connected between two adjacent first-type capacitive attenuators; the capacitive attenuation module is used to accumulate, layer by layer, the multiplication results corresponding to the respective bits of the stored data to obtain a multi-bit analog accumulation result;
  • the output module is used to determine and output the digital accumulation result corresponding to the multi-bit data analog accumulation result.
  • the input module includes an input sparse sensing module and an input sparse comparison module, and the input sparse sensing module is connected to the bitwise multiplication module;
  • the output module includes a Flash analog-to-digital conversion module, and the input sparse sensing module, the input sparse comparison module, and the Flash analog-to-digital conversion module are connected in sequence;
  • the input sparse sensing module is used to convert the input data into an analog voltage
  • the input sparse comparison module is used to compare the analog voltage with the first reference voltage to obtain a first comparison result
  • the Flash analog-to-digital conversion module is used to compare, based on the first comparison result, the multi-bit analog accumulation result with the second reference voltage to obtain a second comparison result, and to use the second comparison result as the digital accumulation result.
  • the working modes of the SRAM storage and computing integrated chip include a storage operation mode and a computing operation mode;
  • in the storage operation mode, the input module and the output module do not work;
  • in the computing operation mode, the SRAM storage and computing integrated chip performs the multiply-accumulate operation on the input data and the stored data.
  • the input sparse comparison module includes a plurality of first comparators
  • the Flash analog-to-digital conversion module includes a plurality of Flash analog-to-digital conversion units, each of which includes a plurality of second comparators;
  • the first comparators and the second comparators are connected in one-to-one correspondence, and the first reference voltage of each first comparator is the same as the second reference voltage of the correspondingly connected second comparator.
  • the number of the Flash analog-to-digital conversion units is the same as the number of the second type of capacitive attenuators.
  • the bit multiplication unit includes a column of 9T1C unit array, and the 9T1C unit array includes a plurality of 9T1C units;
  • the SRAM storage and calculation integrated chip also includes an SRAM read and write external structure, and the SRAM read and write external structure is connected to the 9T1C unit.
  • the 9T1C unit includes six first-type transistors and three second-type transistors, and the first-type transistors and the second-type transistors are both connected to the SRAM read and write external structure;
  • the six first-type transistors are used to store one bit of data of the stored data
  • the three second-type transistors are used to perform a multiplication operation between one bit of the stored data stored by the six first-type transistors and a corresponding bit of the input data.
  • the SRAM read and write external structure includes an SRAM controller, an SRAM peripheral circuit and an address decoding driver;
  • the SRAM controller is respectively connected to the SRAM peripheral circuit and the address decoding driver, and the SRAM peripheral circuit and the address decoding driver are both connected to the 9T1C unit.
  • the SRAM storage and computing integrated chip also includes an in-memory computing controller, and the in-memory computing controller is connected to the input module and the output module respectively. connect.
  • the storage data includes multiple 4-bit weight data in the neural network.
  • the SRAM storage and calculation integrated chip based on capacitive coupling includes: an input module, a bitwise multiplication module, a capacitance attenuation module and an output module.
  • the input data are received through the input module; the bitwise multiplication module multiplies the input data with the stored data to obtain the multiplication results; and the capacitance attenuation module accumulates the multiplication results layer by layer with a hierarchical capacitive attenuator structure. Not only is the structure simpler, but the calculation time is also shorter, so the digital accumulation result can be obtained quickly, improving the energy efficiency and computational throughput of the multiply-accumulate operation.
  • Figure 1 is one of the structural schematic diagrams of an SRAM storage and computing integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 2 is a schematic structural diagram of the 4b-DAC in the SRAM storage and calculation integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 3 is a schematic diagram of the connection between the DAC array, the input sparse sensing module and the bitwise multiplication module in the SRAM storage and calculation integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 4 is a schematic structural diagram of each bit multiplication unit in the SRAM storage and calculation integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 5 is a multiplication operation timing diagram of the 9T1C unit in the SRAM storage and calculation integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 6 is a schematic diagram of the connection between the bitwise multiplication module and the capacitance attenuation module when each bitwise multiplication unit in the bitwise multiplication module of the SRAM storage and calculation integrated chip based on capacitive coupling provided by the present disclosure includes a column of 9T1C unit arrays.
  • Figure 7 is the layout of the 9T1C unit and HCA column in the SRAM storage and computing integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 8 is the second structural schematic diagram of the SRAM storage and computing integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 9 is a working timing diagram of the MAC operation with input sparse sensing of the SRAM storage and computing integrated chip based on capacitive coupling provided by the present disclosure
  • Figure 10 is a Monte Carlo simulation schematic diagram of the simulated calculation transfer function, linear fitting results and process fluctuations of the capacitive coupling-based SRAM storage and computing integrated chip provided by the present disclosure at different temperatures and process angles;
  • Figure 11 is a distribution diagram of Monte Carlo simulation results and MAC operation results at point A in Figure 10;
  • Figure 12 is a distribution diagram of Monte Carlo simulation results and MAC operation results at point B in Figure 10;
  • Figure 13 is a distribution diagram of Monte Carlo simulation results and MAC operation results at point C in Figure 10.
  • Neural networks have been widely used and achieved excellent performance in pattern recognition, automatic control, financial analysis, biomedicine and other fields.
  • among artificial neural networks, convolutional neural networks are the most widely applied and perform particularly well in image processing.
  • as task complexity keeps increasing, the scale of neural networks keeps growing, and the numbers of parameters and computations in the network also increase, which means that the resources and power consumed when the neural network is mapped onto hardware grow day by day.
  • the core and largest operation in a convolutional neural network is the multiplication and accumulation (Multiply Accumulate, MAC) operation. Therefore, the key to realizing a low-power convolutional neural network lies in the design of a low-power MAC operation unit.
  • Compute-In-Memory (CIM) technology aims to transform the traditional computing-centered architecture into a data-centered architecture, which directly uses memory for data processing, thereby integrating data storage and computing at the same time.
  • the bottleneck of the von Neumann computing architecture can be completely eliminated, which is especially suitable for large-scale parallel application scenarios such as deep learning neural networks (Deep Convolution Neural Network, DCNN) with large amounts of data.
  • because the storage units and computing units are integrated together, this system architecture not only retains the storage and read/write functions of the memory circuit itself, but also supports different logic or multiply-add operations, thereby greatly reducing the frequent bus interactions between the central processing unit and the memory circuit and further reducing the amount of data movement. It can perform a large number of parallel calculations with ultra-low power consumption, which greatly improves the energy efficiency of the system, and it is a highly promising research direction for enabling energy-efficient computing in artificial intelligence applications.
  • CIM based on charge-domain calculation has smaller capacitor mismatch and process variation, and thus better linearity and accuracy.
  • CIM based on charge domain calculations still faces some challenges, including:
  • first, the design of the storage-and-computing cell that realizes the dot product requires a trade-off among transistor count, size and computational dynamic range. For example, an 8T1C cell uses few transistors but suffers a threshold loss in its dynamic range, while a 10T1C cell can achieve a rail-to-rail dynamic range but has a larger storage-and-computing cell.
  • second, to realize multi-bit weight accumulation, charge-domain CIM implements a charge sharing circuit in the analog domain with a switch array and additional control signals, or uses shifter banks and adder banks in the digital domain; the former has complex sharing control and large delay, while the latter has high power consumption and large area.
  • third, the multi-bit Analog-to-Digital Converter (ADC) that converts the analog MAC operation results into digital codes consumes a large amount of energy and seriously affects the overall energy efficiency.
  • Figure 1 is a schematic structural diagram of an SRAM storage and computing integrated chip based on capacitive coupling provided in an embodiment of the present disclosure.
  • as shown in Figure 1, the chip includes an input module 1, a bitwise multiplication module 2, a capacitive attenuation module 3 and an output module 4;
  • the input module 1, the bitwise multiplication module 2, the capacitance attenuation module 3 and the output module 4 are connected in sequence;
  • the input module 1 is used to receive input data
  • the bitwise multiplication module 2 includes a plurality of bit multiplication units 21, each of which is used to multiply, based on the capacitive coupling principle, the input data with one bit of the bitwise-stored data to obtain the multiplication result corresponding to that bit of the stored data;
  • the capacitive attenuation module 3 includes a two-layer capacitive attenuator array. Each first-type capacitive attenuator 311 in the first-layer capacitive attenuator array 31 is connected between two adjacent bit multiplication units 21, and each second-type capacitive attenuator 321 in the second-layer capacitive attenuator array 32 is connected between two adjacent first-type capacitive attenuators 311;
  • the capacitance attenuation module 3 is used to accumulate, layer by layer, the multiplication results corresponding to the respective bits of the stored data to obtain a multi-bit analog accumulation result;
  • the output module 4 is used to determine and output the digital accumulation result corresponding to the multi-bit data analog accumulation result.
  • the input module 1 may include a Digital-to-Analog Converter (DAC) array.
  • the DAC array can include multiple DACs, and the number of bits of each DAC can be determined based on the number of bits of a single bitwise-stored data item in each bit multiplication unit and the number of bits of a single input data item; the three can be the same, for example all 4 bits (i.e. 4b).
  • the DAC array can include multiple 4b-DACs, and each 4b-DAC can be used to receive a 4b of input data.
  • the number of 4b-DACs included in the DAC array can be set as needed, for example, it can be set to 128.
  • the structure of each 4b-DAC can be as shown in Figure 2; each 4b-DAC is supplied with driving voltages through an off-chip external bias.
  • the off-chip bias provides 16 driving voltages for each 4b-DAC, in steps of VDD/16 from GND to VDD, so that what is mainly implemented on chip is a 4-to-16 decoder function.
  • a 4b input data item (4b Input) is fed into a 4b-DAC to obtain the decoded result (DAC-OUT).
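To make the decoder behavior concrete, the following is a minimal behavioral sketch (not the patent's circuit): it assumes the off-chip bias supplies levels of k·VDD/16 for k = 0..15 and that the on-chip 4-to-16 decoder simply selects one of them; the VDD value used here is an assumed placeholder.

```python
# Hypothetical behavioral model of one 4b-DAC (assumption: ideal level selection).
VDD = 0.9  # assumed supply voltage in volts; not specified at this point in the text

def dac_4b(code: int, vdd: float = VDD) -> float:
    """Return the driving voltage DAC-OUT selected for a 4b input code (0..15)."""
    assert 0 <= code <= 15
    levels = [vdd * k / 16 for k in range(16)]  # GND upward in VDD/16 steps
    return levels[code]                          # 4-to-16 decoding = level selection

print(dac_4b(0b1010))  # e.g. input code 10 selects the 10/16*VDD level
```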
  • the bitwise multiplication module 2 can include multiple bitwise multiplication units 21.
  • the number of bitwise multiplication units 21 in the bitwise multiplication module 2 can be set as needed. For example, it can be set to 64, that is, the bitwise multiplication module 2 can implement a total of 64 4b Bitwise multiplication of stored data and input data.
  • Each bit multiplication unit 21 may include the same number of computing units as the DACs in the DAC array, so that the DACs are connected to the computing units in a one-to-one correspondence. Storage data of corresponding bits (that is, storage data of 1b) can be stored in each calculation unit.
  • Each calculation unit includes a capacitor, so that, through the capacitive coupling principle, the decoded result output by the connected DAC can be multiplied with the stored bit. Each bit multiplication unit can thus multiply all the input data with one bit of a stored data item and obtain the multiplication result corresponding to that bit. Therefore, four adjacent bit multiplication units can jointly realize the multiplication of all the input data with one bitwise-stored 4b data item.
  • the capacitance attenuator module 3 includes a two-layer capacitance attenuator (CA) array.
  • Each first-type capacitance attenuator 311 in the first-layer capacitance attenuator array 31 is connected between two adjacent bit multiplication units 21.
  • each second type capacitive attenuator 321 in the second layer capacitive attenuator array 32 is respectively connected between two adjacent first type capacitive attenuators 311 .
  • the attenuation coefficients of the first-type capacitive attenuator 311 and the second-type capacitive attenuator 321 can be determined according to the weight ratio of the respective bits of the stored data; for example, the attenuation coefficient of the first-type capacitive attenuator 311 can be AC = 0.5 (a 1/2 CA) and that of the second-type capacitive attenuator 321 can be AC = 0.25 (a 1/4 CA).
  • through this structure, the multiplication results corresponding to the respective bits of each stored data item can be accumulated layer by layer to obtain the multi-bit analog accumulation result of each stored data item.
  • the two first-type capacitive attenuators 311 and one second-type capacitive attenuator 321 corresponding to four adjacent bit multiplication units may form a hierarchical capacitor attenuator (HCA) structure. Therefore, the capacitance attenuation module 3 connected to the bitwise multiplication module 2 can include a total of 64 HCA structures to realize the accumulation of the bitwise multiplication results of 64 items of 4b stored data with the input data.
  • compared with the charge-sharing accumulation method based on weighted capacitor arrays and compensation capacitor arrays, the HCA structure does not use a switch array for temporary storage and charge sharing of analog data; the HCA structure is therefore simpler.
  • in addition, the capacitance attenuation module 3 can be driven by a strong external voltage to achieve a stable voltage output. Compared with the charge-sharing weight accumulation mode, which reaches a stable output voltage through weak internal voltage rebalancing, the calculation time for obtaining the multi-bit analog accumulation result is shorter.
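As a rough illustration of what the hierarchical attenuation accomplishes arithmetically, the sketch below assumes (this is an assumption, not the patent's exact transfer function) that the two attenuator layers combine the four per-bit column voltages with binary 1:2:4:8 weights.

```python
# Idealized behavioral sketch of one HCA structure (assumed binary weighting).
def hca_combine(v_cl: list[float]) -> float:
    """v_cl = [V_CL0, V_CL1, V_CL2, V_CL3]: column voltages for weight bits 0..3 (LSB..MSB)."""
    weights = [1, 2, 4, 8]                       # binary significance of the four weight bits
    return sum(w * v for w, v in zip(weights, v_cl)) / sum(weights)  # normalized combination
```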
  • the output module 4 in the chip may include multiple analog-to-digital converters (ADCs), and the number of bits of each ADC can be determined based on the number of bits of a single data item stored bitwise in each bit multiplication unit, for example 4 bits.
  • the output module 4 can include multiple 4b-ADCs; each 4b-ADC is connected to one HCA structure, converts the multi-bit analog accumulation result corresponding to each stored data item into a digital accumulation result, and outputs that digital accumulation result.
  • the SRAM storage and computing integrated chip based on capacitive coupling includes: an input module, a bitwise multiplication module, a capacitance attenuation module and an output module.
  • the input data are received through the input module; the bitwise multiplication module multiplies the input data with the stored data to obtain the multiplication results; and the capacitance attenuation module accumulates the multiplication results layer by layer with a hierarchical capacitive attenuator structure. Not only is the structure simpler, but the calculation time is also shorter, so the digital accumulation result can be obtained quickly, improving the energy efficiency and computational throughput of the multiply-accumulate operation.
  • the input module includes an input sparsity sensing module and an input sparsity comparison module, and the input sparsity sensing module is connected to the bitwise multiplication module;
  • the output module includes a Flash analog-to-digital conversion module, and the input sparse sensing module, the input sparse comparison module, and the Flash analog-to-digital conversion module are connected in sequence;
  • the input sparse sensing module is used to convert the input data into an analog voltage
  • the input sparse comparison module is used to compare the analog voltage with the first reference voltage to obtain a first comparison result
  • the Flash analog-to-digital conversion module is used to compare, based on the first comparison result, the multi-bit analog accumulation result with the second reference voltage to obtain a second comparison result, and to use the second comparison result as the digital accumulation result.
  • the input module may also include an input sparse sensing module and an input sparse comparison module.
  • the DAC array, the input sparse sensing module and the bitwise multiplication module are connected.
  • the decoding result obtained by each DAC in the DAC array can be expressed as IA[i], 0 ≤ i ≤ N-1, where N is the number of DACs in the DAC array, which can be 128. It can be understood that, regardless of whether the input module includes an input sparse sensing module and an input sparse comparison module, IA[i] can be input to the bit multiplication units for multiplication operations.
  • the input sparsity sensing module can be an IS-DAC (Input Sparsity Sensing DAC), which can include an NMOS transistor 11 and multiple sensing branches.
  • the sensing branches are connected to the DACs in the DAC array in one-to-one correspondence.
  • the IS-DAC can include a total of one NMOS transistor and 128 sensing branches; the NMOS transistor is responsible for discharging, its source can be grounded, and its gate can receive an external reset signal (RST_IS).
  • Each sensing branch includes a switch 12 and a capacitor 13.
  • the DAC, switch 12 and capacitor 13 are connected in sequence.
  • the IS-DAC can thus include a switch array composed of 128 switches and a capacitor array composed of 128 capacitors. The other plates of all the capacitors are connected to the drain of the NMOS transistor and to the input sparsity comparison module, respectively.
  • the switch array can receive an external gate connection control signal (IS-Eval).
  • combining the control signal of the switch array, the IS-DAC converts all IA[i], via capacitive coupling through the capacitor array, into an analog voltage V_IS that represents the input sparsity.
  • the input sparsity comparison module may include an input sparsity comparator array (Input Sparsity Comparator, IS-CA), including a plurality of first comparators.
  • the number of first comparators may be set as needed, for example, 15 may be included.
  • IS-CA is controlled through the external enable signal (IS_SA_EN).
  • the inverting terminal of each first comparator is connected to the first reference voltage Vref[j], 0 ≤ j ≤ M-1, where M is the number of comparators in the IS-CA, which can be 15.
  • Each first comparator in the IS-CA can compare the analog voltage V_IS with the first reference voltage Vref[j] to obtain the first comparison result, i.e. a 1b thermometer code DR[j]; the IS-CA can thus output a 15b thermometer code DR<0:14>.
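A minimal sketch of the sparsity sensing path, under the assumption of ideal components: the IS-DAC is modeled as a capacitive-coupling average of the 128 DAC outputs (equal coupling capacitors assumed), and the IS-CA as 15 threshold comparisons producing the thermometer code.

```python
# Hypothetical model of IS-DAC + IS-CA (idealized; capacitor mismatch and offsets ignored).
def sparsity_thermometer(ia_voltages: list[float], vref: list[float]) -> list[int]:
    """ia_voltages: IA[0..127]; vref: first reference voltages Vref[0..14], ascending."""
    assert len(vref) == 15
    v_is = sum(ia_voltages) / len(ia_voltages)   # V_IS: capacitive-coupling average (assumption)
    return [1 if v_is > v else 0 for v in vref]  # DR[0..14], one bit per first comparator
```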
  • the output module can include a Flash analog-to-digital conversion module.
  • the Flash analog-to-digital conversion module can include multiple Flash analog-to-digital conversion units (Flash-ADC). Each Flash-ADC can be a 4b-ADC, so the Flash analog-to-digital conversion module can be regarded as a 4b-Flash-ADC array.
  • the number of Flash-ADCs in the Flash analog-to-digital conversion module can be the same as the number of stored data items; that is, the Flash analog-to-digital conversion module can include a total of 64 Flash-ADCs, denoted Flash-ADC<k>, 0 ≤ k ≤ K-1, where K is the number of Flash-ADCs in the Flash analog-to-digital conversion module, which can be 64.
  • Each Flash-ADC can include multiple second comparators.
  • the second comparators in the Flash-ADC are connected to the first comparators in the IS-CA in a one-to-one correspondence. Therefore, there can be 15 comparators in each Flash-ADC.
  • each second comparator in the Flash-ADC has a second reference voltage.
  • based on the first comparison result of the corresponding first comparator, the second comparator in the Flash-ADC can compare the multi-bit analog accumulation result with the second reference voltage to obtain the second comparison result, and output the second comparison result as the digital accumulation result corresponding to the multi-bit analog accumulation result.
  • the first comparator in IS-CA and the second comparator in Flash-ADC are both strong-arm comparators.
  • the first comparators in the IS-CA are arranged in order of their distance from the IS-DAC, from nearest to farthest, with their first reference voltages running from low to high.
  • likewise, the second comparators in the Flash-ADC are arranged according to the distance of their connected IS-CA first comparators from the IS-DAC, from nearest to farthest, with their second reference voltages running from low to high.
  • in each Flash-ADC, the first second comparator can be denoted L-Comp<0>, and its corresponding second reference voltage lies in the range 0-400 mV.
  • the last second comparator can be denoted H-Comp<14>, and its corresponding second reference voltage lies in the range 400-900 mV.
  • the input sparse sensing module, the input sparse comparison module and the Flash analog-to-digital conversion module are combined to achieve high throughput of the chip.
  • the Flash analog-to-digital conversion module has a rail-to-rail decoding range, but in MAC operations the full dynamic range is rarely reached, especially when the input data are sparse.
  • therefore, an input sparsity sensing strategy, in which the input sparsity sensing module senses the input sparsity characteristic in real time, is used in the decoding of the Flash analog-to-digital conversion module to reduce the number of comparisons and thereby reduce energy. Without considering the stored data, this strategy estimates the sum of the 128 items of 4b input data and quantizes it; according to the quantization result, redundant comparator work can be skipped.
  • embodiments of the present disclosure provide an SRAM storage and computing integrated chip based on capacitive coupling.
  • the working mode of the SRAM storage and computing integrated chip includes a storage operation mode and a computing operation mode;
  • the input module and the output module do not work
  • the SRAM storage and computing integrated chip performs multiplication and accumulation operations on the input data and the storage data.
  • the working mode of the SRAM storage and computing integrated chip may include two types, namely the storage operation (SRAM) mode and the computing operation (CIM) mode.
  • the storage operation mode refers to the operation mode in which the storage data is stored in the SRAM storage and calculation integrated chip bit by bit.
  • the storage location can be in the bit multiplication unit.
  • the calculation operation mode refers to the operation mode in which input data and stored data are calculated.
  • the input sparse comparison module includes a plurality of first comparators
  • the Flash analog-to-digital conversion module includes a plurality of Flash An analog-to-digital conversion unit, each of the Flash analog-to-digital conversion units includes a plurality of second comparators;
  • the first comparator and the second comparator are connected in a one-to-one correspondence, and the first reference voltage of each first comparator is the same as the second reference voltage of the correspondingly connected second comparator.
  • the first reference voltage of each first comparator and the second reference voltage of the correspondingly connected second comparator can be the same, so as to ensure that the working state of the second comparator is accurately determined from the input sparsity, thereby reducing the number of working second comparators while maintaining the accuracy of the output results.
  • the number of the Flash analog-to-digital conversion units is the same as the number of the second type capacitive attenuators.
  • the number of Flash-ADCs is the same as the number of second-type capacitive attenuators, and they can be connected in one-to-one correspondence; this ensures that each Flash analog-to-digital conversion unit determines and outputs the digital accumulation result of its corresponding 4b stored data item.
  • the input sparsity sensing module can convert the 128 items of 4b input data, through capacitive coupling, into a single analog voltage V_IS representing the input sparsity.
  • the IS-CA compares V_IS with the first reference voltage of each first comparator.
  • the first comparison result is the 15b thermometer code DR[0:14], which represents the quantized input sparsity.
  • the thermometer code DR[0:14] determines the working status and second comparison result of the 15 second comparators in each Flash-ADC during the readout stage.
  • when the thermometer code DR[i] is 0, the corresponding second comparator Comp<i> is skipped and its comparison result is set to 0; when the thermometer code DR[i] is 1, the corresponding second comparator Comp<i> works normally and generates an output.
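The skip rule above can be summarized by the following sketch (a hypothetical helper, not the patent's circuit): each second comparator fires only when its DR bit is 1, otherwise its result is forced to 0.

```python
# Hypothetical readout model of one Flash-ADC with input-sparsity-aware skipping.
def flash_adc_readout(v_hca: float, vref2: list[float], dr: list[int]) -> list[int]:
    """vref2: second reference voltages of the 15 second comparators; dr: DR[0..14]."""
    out = []
    for vref, enabled in zip(vref2, dr):
        if enabled:                              # DR[i] == 1: comparator works normally
            out.append(1 if v_hca > vref else 0)
        else:                                    # DR[i] == 0: comparator skipped, output 0
            out.append(0)
    return out                                   # 15b thermometer code for the MAC result
```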
  • the bit multiplication unit includes a column of 9T1C unit array, and the 9T1C unit array includes multiple 9T1C units;
  • the SRAM storage and calculation integrated chip also includes an SRAM read and write external structure, and the SRAM read and write external structure is connected to the 9T1C unit.
  • each bit multiplication unit may include a column of 9T1C unit arrays, and each 9T1C unit array may include multiple 9T1C units.
  • Each 9T1C unit is a computing unit containing 9 transistors (T) and 1 capacitor (C_bitcell); storage and multiplication are realized through the 9 transistors, and the multiplication results are accumulated onto the upper plate of the capacitor through the capacitive coupling principle.
  • the SRAM storage and computing integrated chip can also include an SRAM read and write external structure.
  • the SRAM read and write external structure can be connected to each 9T1C unit through the word line WL and the bit line BL/BLB to realize the driving and control of each 9T1C.
  • a 9T1C cell array is used as a bit multiplication unit to realize the multiplication operation of one bit of stored data and the input data.
  • this balances the number of transistors against the dynamic range and improves chip performance.
  • the 9T1C unit includes six first-type transistors and three second-type transistors.
  • both the first-type transistors and the second-type transistors are connected to the SRAM read/write external structure;
  • the six first-type transistors are used to store one bit of data of the stored data
  • the three second-type transistors are used to perform a multiplication operation between one bit of the stored data stored by the six first-type transistors and a corresponding bit of the input data.
  • each bit multiplication unit may include a column of 9T1C unit arrays, and the 9T1C unit array may include multiple 9T1C units, and each 9T1C unit may be used to store one bit of data.
  • the number of 9T1C cells in each 9T1C cell array can be the same as the number of input data, for example, both can be 128.
  • the input of each 9T1C unit is IA[i].
  • the input line IA[i] divides the 9 transistors T into 6T on the upper side and 3T on the lower side.
  • the upper 6T can be the first type of transistor, which is mainly used to store one bit of data.
  • the lower 3T can be the second type of transistor, which is used to multiply the input data and the one bit of data stored in the upper 6T.
  • the 6T portion includes node Q and node QB, and the stored bit is held at node Q in the form of a voltage.
  • in the 3T portion, the first transistor is in parallel with the second transistor, and the pair is connected in series with the third transistor.
  • QB[i] and Q[i] are the voltages on the lines connected to the gates of the first and the second transistor, respectively.
  • the capacitor C in each 9T1C unit is in parallel with the third transistor.
  • the upper-plate voltage of the capacitor C can be expressed as Mult[i], which characterizes the result of multiplying IA[i] by the one-bit data stored at node Q.
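Behaviorally, the multiplication realized by one 9T1C cell can be sketched as below (an idealized assumption that ignores device-level effects such as charge injection): when the bit at node Q is 1 the capacitor plate follows IA[i], and when it is 0 the plate stays at 0.

```python
# Idealized behavioral model of one 9T1C cell's multiply (assumption, not a circuit simulation).
def bitcell_multiply(ia_voltage: float, q_bit: int) -> float:
    """Return Mult[i] ~ IA[i] * Q[i] for one 9T1C cell (q_bit is 0 or 1)."""
    return ia_voltage if q_bit == 1 else 0.0
```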
  • each 9T1C unit is connected to the word line WL[i], the bit line BL, and the bit line BLB, and each 9T1C unit includes a calculation line CL.
  • a switch for reset is connected to the calculation line CL. After the switch is turned on, the corresponding calculation line can receive the reset signal (RST_MAC).
  • the 9T1C unit has a rail-to-rail dynamic range, which is larger than that of the 8T1C design.
  • the capacitor used is an approximately 1.33 fF MOM capacitor, which can be placed above the 9 transistors during chip fabrication, with a small area overhead.
  • the 1b data in the 4b storage data is stored in each 9T1C unit.
  • the 4b input data is applied to the input line IA[0:127] as an analog voltage generated by the 4b-DAC, which is used to drive all bits on the corresponding row.
  • the multiplication operation of the 9T1C unit can include two stages: reset and output evaluation, as shown in Figure 5.
  • Multiple storage data are stored in multiple 9T1C cells in the 9T1C cell array, multiple 9T1C cells are parallelized to perform multiplication operations, and multiple input data are applied to the IA as driving voltages during the evaluation phase.
  • the generated voltage V_CL is proportional to the bitwise MAC operation result of the 1b data of the stored data and the 4b input data.
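Under the assumption of 128 identical coupling capacitors per column, the column behavior during evaluation can be sketched as an average of the per-cell products; this is an idealization of the proportionality stated above, not the exact circuit response.

```python
# Idealized sketch of one bit column's calculation-line voltage V_CL during evaluation.
def column_voltage(ia_voltages: list[float], weight_bits: list[int]) -> float:
    """ia_voltages: IA[0..127]; weight_bits: the 1b weight stored in each of the 128 cells."""
    products = [v if b else 0.0 for v, b in zip(ia_voltages, weight_bits)]  # Mult[i] per cell
    return sum(products) / len(products)         # capacitive-coupling average (assumption)
```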
  • Figure 6 shows the connection between the bitwise multiplication module and the capacitance attenuation module when each bit multiplication unit in the bitwise multiplication module includes a column of 9T1C unit arrays.
  • Figure 6 only shows the four adjacent bit multiplication units corresponding to one 4b stored data item, together with the two first-type capacitive attenuators of the first-layer array and the one second-type capacitive attenuator of the second-layer array included in one HCA structure.
  • the 4-bit data in the 4b storage data are represented as W[0], W[1], W[2], and W[3] respectively.
  • Each bit of data corresponds to a calculation line, namely CL[0], CL[1], CL[2] and CL[3], respectively.
  • the two first-type capacitive attenuators are Cw01 and Cw23, and the second-type capacitive attenuator is Cw01/23; Cw01 is connected to CL[0] and CL[1], Cw23 is connected to CL[2] and CL[3], and Cw01/23 is connected to CL[1] and CL[3].
  • the multiplication results of the 128 4b inputs IA with the four 1b bits of the stored data are accumulated on CL[3], CL[2], CL[1] and CL[0], respectively.
  • in the above calculation process, the attenuation coefficient AC of each capacitive attenuator is determined according to the weight ratio of each bit, and the process is hierarchical.
  • the multiplication operation of the 9T1C unit can include a reset phase and an evaluation phase.
  • in the reset phase, the upper and lower plates of the first-type and second-type capacitive attenuators are discharged to GND.
  • in the evaluation stage, all the 4b input data are fed into the 9T1C unit array through the 4b-DACs, and the upper plate of the capacitor in each 9T1C unit is clamped to the fixed voltage generated by the 4b-DAC.
  • the output voltage V_HCA of the HCA structure, representing the calculation result, is generated.
  • V_HCA is then quantized by the 4b Flash analog-to-digital conversion module, which outputs the 4b calculation result.
  • V_HCA can be calculated from the following quantities:
  • IA_i is the input of the i-th row;
  • w_{i,j} is the one-bit data of the stored data in the i-th row and j-th column, which is 0 or 1;
  • IA_max,i is the maximum input value; for 4b input data, IA_max,i is 15.
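Since the formula itself is not reproduced in this text, the following is only a hedged, idealized stand-in built from the quantities defined above: it encodes that V_HCA grows with the binary-weighted MAC of w_{i,j} and IA_i and is normalized by the full-scale value set by IA_max,i.

```python
# Hypothetical idealized expression for V_HCA (assumption; the patent's exact formula is not shown here).
def v_hca_ideal(ia: list[float], w: list[list[int]], ia_max: float = 15.0) -> float:
    """ia: IA[0..127]; w[i][j]: bit j (0..3) of the 4b weight stored in row i."""
    num = sum(2 ** j * sum(ia[i] * w[i][j] for i in range(len(ia))) for j in range(4))
    den = 15 * len(ia) * ia_max                  # full-scale normalization (15 = 2^0+2^1+2^2+2^3)
    return num / den                             # dimensionless fraction of full scale
```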
  • the transistor area of the 9T1C unit is 0.7um ⁇ 1.42um
  • the MOM capacitor area of 1.33fF is 0.55um ⁇ 1.42um.
  • Figure 7 shows the layout of the 9T1C units and the HCA column, which contains 4 columns of 9T1C units and 1 HCA structure, i.e. the multiply-accumulate layout for one 4b weight data item. Taking the symmetry and matching of the layout into account, three improvements are made. First, the Cw01/Cw23 and Cw01/23 capacitors in the HCA structure are split into 128 and 64 unit capacitances C_bitcell at the 9T1C unit level, respectively. These small unit capacitors are distributed throughout the layout of the bitwise multiplication module while maintaining the preset ratio. Therefore, for each row, seven small 9T1C-unit-level capacitors with different functions are distributed in the transistor-level layout of four 9T1C units.
  • in Figure 7, A is the unit capacitance C_bitcell
  • B is the first type of capacitive attenuator
  • C is a virtual capacitor
  • D is the second type of capacitive attenuator.
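A small arithmetic note on the splitting described above, assuming the attenuation ratios are set purely by counts of identical unit capacitors C_bitcell:

```python
# Quick check of the unit-capacitor split ratio (assumption: ratio = capacitor count ratio).
C_BITCELL = 1.33e-15                             # ~1.33 fF MOM unit capacitor
c_first_layer = 128 * C_BITCELL                  # Cw01 / Cw23, each split into 128 units
c_second_layer = 64 * C_BITCELL                  # Cw01/23, split into 64 units
print(c_second_layer / c_first_layer)            # 0.5: halving of capacitance from layer to layer
```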
  • the SRAM read/write external structure may include an SRAM controller (SRAM Controller), SRAM peripheral circuits (SRAM Peripheral Circuits) and an address decoding driver (Address Decoder & Driver).
  • the SRAM controller can be connected to the SRAM peripheral circuit and address decoding driver respectively to achieve global control of the chip's storage function.
  • the SRAM peripheral circuits and the address decoding driver are both connected to the 9T1C units, ensuring that the stored data can be written bit by bit into each 9T1C unit.
  • the automatic implementation of the chip storage function can be realized through the SRAM controller.
  • the SRAM storage and computing integrated chip based on capacitive coupling is provided in the embodiment of the present disclosure.
  • the SRAM storage and computing integrated chip may also include an in-memory computing controller (CIM Controller).
  • the in-memory computing controller can be connected to the input module and the output module, respectively; through the in-memory computing controller, global control of the chip's computing functions can be achieved.
  • the storage data includes multiple 4-bit weight data in the neural network.
  • a neural network usually contains a large amount of 4-bit weight data, which can be stored bit by bit, as the stored data, into the bit multiplication units of the chip; combined with the multiply-accumulate operation on the input data of the neural network, this realizes the function of a convolution kernel in the neural network.
  • Figure 8 is a schematic diagram of the complete structure of an SRAM storage and computing integrated chip based on capacitive coupling provided in an embodiment of the present disclosure.
  • this chip can meet the requirement of 64 convolution kernels in a neural network each performing 128 operations in parallel.
  • the chip includes a 128 × 256 9T1C cell array, an SRAM controller, SRAM peripheral circuits, an address decoding driver, a CIM controller, 128 4b-DACs, an IS-DAC, an IS-CA, a capacitance attenuation module containing 1 × 64 HCAs, and a Flash analog-to-digital conversion module containing 1 × 64 4b-ADCs.
  • the size of the 9T1C cell array is 128 ⁇ 256, and the memory capacity of the chip is 32kb.
  • the 256 columns of the 9T1C cell array are divided into 64 groups. Each group contains 4 columns and is used to store a 4b weight data.
  • in SRAM mode, the 4b-DACs, IS-DAC, IS-CA and 4b-ADCs do not work; the chip then behaves as a 6T-SRAM memory and performs normal read and write operations, and the weight data of the neural network are written into the SRAM in this mode. In CIM mode, the chip executes 4b MAC operations fully in parallel, and in a single cycle all rows receive 4b input data.
  • the chip can support a total of 128 items of 4b input data, which can be expressed as IN[0][0:3], IN[1][0:3], ..., IN[127][0:3].
  • the corresponding decoding results IA[0], IA[1], ..., IA[127] can be obtained by inputting the data through the corresponding 4b-DAC.
  • the vector-matrix multiplication of the 128 input data with 64 weight vectors, each consisting of 128 4b weights, is calculated in the analog domain through capacitive coupling.
  • the Flash analog-to-digital conversion module converts the analog voltage representing the MAC operation result into a 4b digital code output.
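Putting the pieces together, a single CIM-mode cycle can be modeled digitally as below (an idealized sketch that ignores all analog non-idealities; the array organization of 64 groups of four bit columns follows the description above).

```python
# Idealized digital reference model of one CIM-mode cycle: 128 4b inputs x 64 4b weight columns.
def cim_cycle(inputs_4b: list[int], weight_bits) -> list[int]:
    """inputs_4b: 128 codes in 0..15; weight_bits[i][k][j]: bit j of the 4b weight k in row i."""
    results = []
    for k in range(64):                          # one result per 4-column group / HCA / 4b-ADC
        acc = 0
        for i, x in enumerate(inputs_4b):
            w = sum(weight_bits[i][k][j] << j for j in range(4))   # reassemble the 4b weight
            acc += x * w                          # 4b x 4b multiply-accumulate
        results.append(acc)                       # on chip this is quantized to 4b by the Flash-ADC
    return results
```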
  • the SRAM peripheral circuit provides the bit lines BL/BLB for each 9T1C cell in the 9T1C cell array, and the address decoding driver provides the word line WL for each 9T1C cell.
  • a switch controlled by the CIM controller is also connected between the 9T1C unit array and the capacitance attenuation module.
  • the switch can be grounded to realize the reset (RST) of the corresponding calculation line CL[i].
  • the input module includes an input sparse sensing module and an input sparse comparison module
  • the output module includes a Flash analog-to-digital conversion module.
  • the working sequence of the MAC operation with input sparsity sensing is divided into two independent processes, input sparsity sensing (IS) and the MAC operation, which are interconnected through the thermometer code DR[0:14].
  • the IS process is divided into two processes: reset (Reset_IS) and output evaluation (Evaluation_IS).
  • the MAC operation process is also divided into two processes: reset (Reset_MAC) and output evaluation (Evaluation_MAC).
  • CLK is the working clock of the chip.
  • through the timing control module, the working clock generates RST_IS, EVAL_IS (i.e. IS-Eval in Figure 2), SA_EN_IS (i.e. IS_SA_EN in Figure 3), RST_MAC, EVAL_MAC (the gate connection control signal for the switches in the sensing branches during the MAC operation) and SA_EN_MAC (the comparator enable signal during the MAC operation).
  • RST_IS and EVAL_IS, and likewise RST_MAC and EVAL_MAC, are pairs of mutually inverted signals.
  • SA_EN_IS and SA_EN_MAC are both used for readout in the Evaluation stage.
  • the RST_IS signal is advanced ahead of the RST_MAC signal, so that DR[0:14] can be generated before the MAC operation stage, thereby controlling the working status and outputs of the second comparators in the Flash analog-to-digital conversion module.
  • the IS-DAC is reset through the RST_IS signal.
  • the IS-DAC then immediately evaluates the input sparsity in the analog domain, generates V_IS, and prepares to start the quantization of V_IS through the IS-CA.
  • by the time the Reset_MAC process of the MAC operation is in progress, the IS-CA has already generated DR[0:14]. Such working timing ensures that adding the input sparsity sensing strategy does not reduce the computational throughput of the chip.
  • the SRAM storage and computing integrated chip based on capacitive coupling adopts a 9T1C unit, which performs multiplication in the capacitive domain through capacitive coupling, and it accumulates the 4b stored data through a hierarchical capacitive attenuator structure.
  • this structure does not have the additional switches, complex control and long sharing time of the traditional charge sharing structure, which greatly improves the computing throughput of multi-bit weight computing; and it uses a Flash analog-to-digital conversion module based on the input sparsity sensing strategy, which reduces the number of AD comparisons and improves the energy efficiency of the system.
  • the chip can support 8192 4b ⁇ 4b MAC operations.
  • the distribution of Monte Carlo simulation results and MAC operation results at point A in Figure 10 is shown in Figure 11.
  • the MAC operation results corresponding to the vertical lines from left to right in Figure 11 are 166.107m, 166.307m, 166.506m, 166.706m, 166.905m, 167.105m and 167.304m, respectively.
  • the distribution of Monte Carlo simulation results and MAC operation results at point B in Figure 10 is shown in Figure 12.
  • the MAC operation results corresponding to the vertical lines from left to right in Figure 12 are 446.451m, 446.721m, 446.992m, 447.262m, 447.532m, 447.802m and 448.073m, respectively.
  • the distribution of Monte Carlo simulation results and MAC operation results at point C in Figure 10 is shown in Figure 13.
  • the MAC operation results corresponding to the vertical lines from left to right in Figure 13 are 726.978m, 727.275m, 727.573m, 727.871m, 728.168m, 728.466m and 728.763m, respectively.
  • the structure of the chip was simulated to determine the voltage settling time.
  • in the simulation, all 1's are first written into the storage locations of the 9T1C units, and then input patterns are applied from small to large in equal steps, so that the MAC operation result gradually increases.
  • the average settling time of the analog voltage of the MAC operation of 128 input data of 4b and weight data of 4b is 0.2ns.
  • compared with the charge sharing scheme, the analog voltage settling time of this chip is reduced by 90%, which gives the chip a 50% higher computational throughput.
  • the reduction in analog voltage settling time and the improvement in throughput are mainly because the analog voltage of this chip is established by a strong, well-defined externally applied voltage, whereas the analog voltage in the charge sharing structure is established by the re-equilibration of weak, floating internal potentials.
  • the energy efficiency of the chip under different input sparsity, with and without the input sparsity sensing strategy, is also compared. In both cases, the chip's energy efficiency increases with input sparsity, with increasingly larger increments.
  • the increase that exists even without the input sparsity sensing strategy indicates that the 9T1C unit saves the cost of driving the capacitors when the dot product results are sparse.
  • at low input sparsity, the energy efficiency with the input sparsity sensing strategy is slightly lower than without it, due to the overhead of the IS-DAC and IS-CA.
  • at higher input sparsity, the energy efficiency with the input sparsity sensing strategy is significantly greater than without it, thanks to the large number of second comparators in the Flash analog-to-digital conversion module that are skipped during computation.
  • with the input sparsity sensing strategy, the chip achieves a high energy efficiency of 460-2264.4 TOPS/W at input sparsities of 5% to 95%. In terms of average energy efficiency, the results with the input sparsity sensing strategy show an improvement of 12.8%, reaching a high energy efficiency of 666 TOPS/W.
  • Table 3 shows a performance comparison table between the chip structure provided in the embodiment of the present disclosure and the existing chip structure.
  • the chip structure provided in the embodiment of the present disclosure achieves higher energy efficiency and throughput than the existing chip structures, by factors of 10 and 1.84 respectively, and behavioral simulation results show that its classification accuracy on the CIFAR-10 dataset is comparable to other works.
  • the existing chip structures in Table 3 are each identified by the source article that presents the structure.
  • the notes used in Table 3 are as follows:
  • 1: the average area, considering the local computing cell;
  • 2: current-mode computation for the 1b weight MAC operation and charge sharing for multi-bit weight accumulation;
  • 3: estimated from the description;
  • 4: estimated from the proposed structure with an NMOS used as the transmission-gate switch;
  • 5: estimated from the graph;
  • 6: normalized to 4b/4b input/weight operation;
  • 7: one MAC operation is counted as two operations (a multiplication and an addition);
  • 8: behavioral simulation result considering the comparator offset voltage.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Semiconductor Integrated Circuits (AREA)
  • Semiconductor Memories (AREA)

Abstract

The present disclosure provides an SRAM compute-in-memory chip based on capacitive coupling, comprising an input module, a bitwise multiplication module, a capacitive attenuation module and an output module. Input data are received through the input module; the bitwise multiplication module multiplies the input data with the stored data to obtain multiplication results; and the capacitive attenuation module accumulates the multiplication results layer by layer with a hierarchical capacitive attenuator structure. Not only is the structure simpler, but the computation time is also shorter, so the digital accumulation result can be obtained quickly, improving the energy efficiency and computational throughput of the multiply-accumulate operation.

Description

SRAM compute-in-memory chip based on capacitive coupling
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202210457425X, filed on April 27, 2022 and entitled "SRAM compute-in-memory chip based on capacitive coupling", which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to the technical field of integrated circuit design, and in particular to an SRAM compute-in-memory chip based on capacitive coupling.
Background
Compute-in-memory (CIM) technology transforms the traditional computation-centric architecture into a data-centric architecture: the memory itself is used directly for data processing, so that data storage and computation are merged into the same chip, forming a compute-in-memory chip. This can completely eliminate the bottleneck of the von Neumann computing architecture and reduce the extra power consumption and performance loss caused by data transfer. Static Random Access Memory (SRAM), with its high speed, low power consumption and high robustness, can be widely used to build compute-in-memory chips.
At present, a compute-in-memory chip can serve as a hardware implementation of the multiply-accumulate operation of a neural network model. However, to realize multi-bit data accumulation, existing compute-in-memory chips usually adopt a charge-based CIM structure and implement a charge sharing circuit in the analog domain with a switch array and additional control signals; the sharing control is complex and the delay is large, which severely affects the computing performance of the compute-in-memory chip.
Summary
The present disclosure provides an SRAM compute-in-memory chip based on capacitive coupling, to overcome the defects existing in the prior art.
The present disclosure provides an SRAM compute-in-memory chip based on capacitive coupling, including an input module, a bitwise multiplication module, a capacitive attenuation module and an output module, wherein the input module, the bitwise multiplication module, the capacitive attenuation module and the output module are connected in sequence;
the input module is configured to receive input data;
the bitwise multiplication module includes a plurality of bit multiplication units, each of which is configured to multiply, based on the capacitive coupling principle, the input data with one bit of the bitwise-stored data to obtain the multiplication result corresponding to that bit of the stored data;
the capacitive attenuation module includes a two-layer capacitive attenuator array, in which each first-type capacitive attenuator of the first-layer capacitive attenuator array is connected between two adjacent bit multiplication units, and each second-type capacitive attenuator of the second-layer capacitive attenuator array is connected between two adjacent first-type capacitive attenuators; the capacitive attenuation module is configured to accumulate, layer by layer, the multiplication results corresponding to the respective bits of the stored data to obtain a multi-bit analog accumulation result;
the output module is configured to determine and output the digital accumulation result corresponding to the multi-bit analog accumulation result.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the input module includes an input sparsity sensing module and an input sparsity comparison module, and the input sparsity sensing module is connected to the bitwise multiplication module;
the output module includes a Flash analog-to-digital conversion module, and the input sparsity sensing module, the input sparsity comparison module and the Flash analog-to-digital conversion module are connected in sequence;
the input sparsity sensing module is configured to convert the input data into an analog voltage;
the input sparsity comparison module is configured to compare the analog voltage with a first reference voltage to obtain a first comparison result;
the Flash analog-to-digital conversion module is configured to compare, based on the first comparison result, the multi-bit analog accumulation result with a second reference voltage to obtain a second comparison result, and to use the second comparison result as the digital accumulation result.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the working modes of the SRAM compute-in-memory chip include a storage operation mode and a computing operation mode;
in the storage operation mode, the input module and the output module do not work;
in the computing operation mode, the SRAM compute-in-memory chip performs the multiply-accumulate operation on the input data and the stored data.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the input sparsity comparison module includes a plurality of first comparators, the Flash analog-to-digital conversion module includes a plurality of Flash analog-to-digital conversion units, and each Flash analog-to-digital conversion unit includes a plurality of second comparators;
the first comparators and the second comparators are connected in one-to-one correspondence, and the first reference voltage of each first comparator is the same as the second reference voltage of the correspondingly connected second comparator.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the number of Flash analog-to-digital conversion units is the same as the number of second-type capacitive attenuators.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the bit multiplication unit includes a column of 9T1C cell array, and the 9T1C cell array includes a plurality of 9T1C cells;
the SRAM compute-in-memory chip further includes an SRAM read/write external structure, and the SRAM read/write external structure is connected to the 9T1C cells.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the 9T1C cell includes six first-type transistors and three second-type transistors, and both the first-type transistors and the second-type transistors are connected to the SRAM read/write external structure;
the six first-type transistors are used to store one bit of the stored data;
the three second-type transistors are used to multiply the bit of the stored data held by the six first-type transistors with the corresponding bit of the input data.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the SRAM read/write external structure includes an SRAM controller, SRAM peripheral circuits and an address decoding driver;
the SRAM controller is connected to the SRAM peripheral circuits and the address decoding driver respectively, and both the SRAM peripheral circuits and the address decoding driver are connected to the 9T1C cells.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the SRAM compute-in-memory chip further includes an in-memory computing controller, which is connected to the input module and the output module respectively.
According to an SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure, the stored data include a plurality of 4-bit weight data of a neural network.
The SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure includes an input module, a bitwise multiplication module, a capacitive attenuation module and an output module. Input data are received through the input module; the bitwise multiplication module multiplies the input data with the stored data to obtain multiplication results; and the capacitive attenuation module accumulates the multiplication results layer by layer with a hierarchical capacitive attenuator structure. Not only is the structure simpler, but the computation time is also shorter, so the digital accumulation result can be obtained quickly, improving the energy efficiency and computational throughput of the multiply-accumulate operation.
Brief Description of the Drawings
To explain the technical solutions of the present disclosure or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the drawings of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is one of the structural schematic diagrams of the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 2 is a schematic structural diagram of the 4b-DAC in the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 3 is a schematic diagram of the connection between the DAC array, the input sparsity sensing module and the bitwise multiplication module in the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 4 is a schematic structural diagram of each bit multiplication unit in the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 5 is a timing diagram of the multiplication operation of the 9T1C cell in the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 6 is a schematic diagram of the connection between the bitwise multiplication module and the capacitive attenuation module when each bit multiplication unit of the bitwise multiplication module of the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure includes a column of 9T1C cell array;
Figure 7 is the layout of the 9T1C cells and the HCA column in the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 8 is the second structural schematic diagram of the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 9 is a working timing diagram of the MAC operation with input sparsity sensing of the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure;
Figure 10 is a schematic diagram of the simulated computation transfer function, the linear fitting results and the Monte Carlo simulation of process fluctuations of the SRAM compute-in-memory chip based on capacitive coupling provided by the present disclosure at different temperatures and process corners;
Figure 11 is a distribution diagram of the Monte Carlo simulation results versus the MAC operation results at point A in Figure 10;
Figure 12 is a distribution diagram of the Monte Carlo simulation results versus the MAC operation results at point B in Figure 10;
Figure 13 is a distribution diagram of the Monte Carlo simulation results versus the MAC operation results at point C in Figure 10.
Detailed Description
To make the objectives, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure are described clearly and completely below with reference to the drawings of the present disclosure. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Neural networks have been widely applied and have achieved excellent performance in pattern recognition, automatic control, financial analysis, biomedicine and other fields. Convolutional neural networks, the most widely applied type of artificial neural network, perform particularly well in image processing. However, as task complexity keeps increasing, the scale of neural networks keeps growing and the numbers of parameters and computations in the network keep increasing, which means that the resources and power consumed when mapping a neural network onto hardware grow day by day. The core operation in a convolutional neural network, and also the one with the largest share, is the multiply-accumulate (MAC) operation, so the key to realizing a low-power convolutional neural network lies in the design of a low-power MAC operation unit.
Compute-in-memory (CIM) technology aims to transform the traditional computation-centric architecture into a data-centric architecture. It uses the memory directly for data processing, thereby merging data storage and computation in the same chip, which can completely eliminate the von Neumann bottleneck and is especially suitable for massively parallel, data-intensive application scenarios such as deep convolutional neural networks (DCNN). Because the storage units and computing units are integrated together, this system architecture not only retains the storage and read/write functions of the memory circuit itself but can also support different logic or multiply-add operations, thus greatly reducing the frequent bus interactions between the central processing unit and the memory circuits and further reducing the large amount of data movement. It can perform a large number of parallel computations with ultra-low power consumption and greatly improves the energy efficiency of the system; it is a highly promising research direction for enabling energy-efficient computing in artificial intelligence applications.
Compared with the traditional von Neumann architecture, previous CIM structures have significant advantages in energy efficiency and throughput. Existing SRAM-CIM realizes MAC operations through transistor currents, but current-based computation has poor linearity and large fluctuation, causing a significant drop in DCNN accuracy. CIM based on charge-domain computation has smaller capacitor mismatch and process variation and thus better linearity and accuracy. However, CIM based on charge-domain computation still faces several challenges, including:
First, the design of the storage-and-computing cell that realizes the dot product requires a trade-off among transistor count, size and computational dynamic range. For example, an 8T1C cell uses few transistors but suffers a threshold loss in its dynamic range, while a 10T1C cell can achieve a rail-to-rail dynamic range but has a larger storage-and-computing cell.
Second, to realize multi-bit weight accumulation, charge-domain CIM implements a charge sharing circuit in the analog domain with a switch array and additional control signals, or uses shifter banks and adder banks in the digital domain; the former has complex sharing control and large delay, while the latter has high power consumption and large area.
Third, the multi-bit analog-to-digital converter (ADC) that converts the analog MAC results into digital codes consumes a large amount of energy and seriously affects the overall energy efficiency.
In other words, when existing compute-in-memory chips realize multi-bit data accumulation, using a switch array and additional control signals in the analog domain to implement a charge sharing circuit leads to complex sharing control, large delay, and reduced computing performance and energy efficiency of the compute-in-memory chip. An SRAM compute-in-memory chip based on capacitive coupling is therefore urgently needed to solve the problems arising in multi-bit data accumulation.
图1为本公开实施例中提供的一种基于电容耦合的SRAM存算一体芯片的结构示意图,如图1所示,该芯片包括:输入模块1、按位乘法模块2、电容衰减模块3以及输出模块4;
所述输入模块1、所述按位乘法模块2、所述电容衰减模块3以及所述输出模块4依次连接;
所述输入模块1用于接收输入数据;
所述按位乘法模块2包括多个位乘法单元21,每个所述位乘法单元均用于基于电容耦合原理,将所述输入数据与按位存储的存储数据的一位数据进行乘法运算,得到所述存储数据的一位数据对应的乘法运算结果;
所述电容衰减模块3包括两层电容衰减器阵列,第一层电容衰减器阵列31中的每个第一类电容衰减器311分别连接于相邻两个位乘法单元21之间,第二层电容衰减器阵列32中的每个第二类电容衰减器321分别连接于相邻两个第一类电容衰减器311之间;
所述电容衰减模块3用于将所述存储数据的各位数据对应的乘法运算结果进行按层累加,得到多比特数据模拟累加结果;
所述输出模块4用于确定并输出所述多比特数据模拟累加结果对应的数字累加结果。
Specifically, in the capacitively coupled SRAM compute-in-memory chip provided in this embodiment, the input module 1 may include a digital-to-analog converter (DAC) array to receive the input data. The DAC array contains multiple DACs, and the resolution of each DAC can be determined from the bit width of a single bitwise-stored data word in each bit multiplication unit and the bit width of a single input data word; the three can be the same, for example all 4 bits (4b). In that case the DAC array contains multiple 4b-DACs, each receiving one 4b input data word. The number of 4b-DACs in the array can be set as required, for example 128.

Each 4b-DAC can be structured as shown in FIG. 2 and is driven by an off-chip external bias. Considering the difficulty and accuracy of on-chip voltage-reference design, for design convenience the off-chip bias in this embodiment provides 16 driving voltages to each 4b-DAC, stepping from GND toward VDD in increments of 1/16 VDD, so the on-chip part mainly implements a 4-to-16 decoder. Applying a 4b input data word (4b Input) to a 4b-DAC yields the decoded result (DAC-OUT).
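As a behavioral illustration of the 4-to-16 decoding just described, the minimal Python sketch below maps a 4b code onto one of 16 evenly spaced bias levels. The supply value and the exact level spacing (k·VDD/16 for k = 0..15) are assumptions for illustration, not values taken from the patent.

```python
# Behavioral sketch of the 4b-DAC: a 4-to-16 decoder selecting one of 16
# off-chip bias levels spaced by VDD/16 (idealized, no loading effects).
VDD = 0.9  # assumed supply voltage in volts

def dac_4b(code: int, vdd: float = VDD) -> float:
    """Map a 4-bit input code (0..15) onto one of 16 evenly spaced levels."""
    if not 0 <= code <= 15:
        raise ValueError("4b-DAC code must be in 0..15")
    levels = [k * vdd / 16 for k in range(16)]  # GND, VDD/16, ..., 15/16*VDD
    return levels[code]

# Example: decode an input activation code of 9
print(dac_4b(9))  # -> 0.50625 V with VDD = 0.9 V
```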
The bitwise multiplication module 2 may contain multiple bit multiplication units 21; their number can be set as required, for example 64, so that the module can perform the bitwise multiplications of 64 4b stored data words with the input data. Each bit multiplication unit 21 may contain as many computing cells as there are DACs in the DAC array, so that DACs and computing cells are connected one to one, and each computing cell stores the corresponding bit of a stored data word (i.e. 1b of the stored data).

Each computing cell contains a capacitor, so through the capacitive-coupling principle it can multiply the decoded result output by the connected DAC by the stored bit it holds. Each bit multiplication unit thus multiplies all input data by one bit of one stored data word and obtains the multiplication result corresponding to that bit. Accordingly, four adjacent bit multiplication units together realize the multiplication of all input data with one bitwise-stored 4b data word.
In this embodiment, the capacitor attenuation module 3 comprises two layers of capacitor attenuator (CA) arrays: each first-type capacitor attenuator 311 of the first-layer array 31 is connected between two adjacent bit multiplication units 21, and each second-type capacitor attenuator 321 of the second-layer array 32 is connected between two adjacent first-type capacitor attenuators 311.

The attenuation coefficients of the first-type capacitor attenuator 311 and the second-type capacitor attenuator 321 can be determined from the relative weight of each bit of a stored data word. For example, the attenuation coefficient of the first-type attenuator 311 can be AC = 0.5, making it a 1/2 CA, and that of the second-type attenuator 321 can be AC = 0.25, making it a 1/4 CA.

With this structure of the capacitor attenuation module 3, the multiplication results corresponding to the individual bits of each stored data word are accumulated layer by layer, giving the multi-bit analog accumulation result of each stored data word.

The two first-type capacitor attenuators 311 and the one second-type capacitor attenuator 321 associated with four adjacent bit multiplication units form a hierarchical capacitor attenuator (HCA) structure. The capacitor attenuation module 3 connected to the bitwise multiplication module 2 therefore contains 64 HCA structures in total, accumulating the bitwise multiplication results of 64 4b stored data words with the input data.
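The sketch below mirrors this layered accumulation in the digital domain: with ideal attenuation coefficients of 0.5 and 0.25, the four bit-lines end up contributing in a 1:2:4:8 ratio. It is a minimal behavioral model, not a circuit simulation, and the toy input values are placeholders.

```python
# Behavioral sketch of one 4b weight group: bitwise multiplication followed by
# the layered accumulation the HCA performs, assuming ideal attenuators.
from typing import List

def bitwise_multiply(inputs: List[int], weight_bit: int) -> float:
    """One bit multiplication unit: average of input*bit over all rows
    (capacitive coupling onto a shared computing line, idealized)."""
    return sum(ia * weight_bit for ia in inputs) / len(inputs)

def hca_accumulate(cl: List[float]) -> float:
    """Layer-by-layer accumulation of CL[0..3] with AC=0.5 then AC=0.25,
    equivalent to a 1:2:4:8 weighting of the four bit results."""
    low  = 0.5 * cl[0] + cl[1]   # first layer: CL[0] attenuated against CL[1]
    high = 0.5 * cl[2] + cl[3]   # first layer: CL[2] attenuated against CL[3]
    return 0.25 * low + high     # second layer: low pair attenuated against high pair

inputs = [3, 15, 0, 7]                  # toy 4b activations (normally 128 rows)
weight_bits = [1, 0, 1, 1]              # one 4b weight, LSB first -> value 13
cl = [bitwise_multiply(inputs, b) for b in weight_bits]
print(hca_accumulate(cl))               # equals sum(inputs) * 13 / (8 * len(inputs))
```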
Compared with charge-sharing accumulation based on a weight capacitor array and a compensation capacitor array, the HCA structure uses no switch array for temporarily storing analog data and sharing charge; the HCA structure is therefore simpler.

In addition, in this embodiment the capacitor attenuation module 3 can be driven by a strong externally applied voltage to reach a stable output voltage. Compared with the charge-sharing weight-accumulation scheme, in which the steady-state output is reached through the rebalancing of weak internal voltages, the computation time needed to obtain the multi-bit analog accumulation result is shorter.

The output module 4 of the chip may contain multiple analog-to-digital converters (ADCs); the resolution of each ADC can be determined from the bit width of a single bitwise-stored data word in each bit multiplication unit, for example 4 bits. The output module 4 then contains multiple 4b-ADCs, each connected to one HCA structure, converting the multi-bit analog accumulation result of each stored data word into a digital accumulation result and outputting it.

The capacitively coupled SRAM compute-in-memory chip provided in this embodiment comprises an input module, a bitwise multiplication module, a capacitor attenuation module and an output module. The input module receives the input data; the bitwise multiplication module multiplies the input data by the stored data to obtain the multiplication results; and the capacitor attenuation module accumulates those results layer by layer with the hierarchical capacitor attenuator structure. The structure is simpler and the computation time shorter, so the digital accumulation result is obtained quickly, improving the energy efficiency and computing throughput of the multiply-accumulate operation.
On the basis of the above embodiment, in the capacitively coupled SRAM compute-in-memory chip provided in this embodiment, the input module includes an input sparsity sensing module and an input sparsity comparison module, the input sparsity sensing module being connected to the bitwise multiplication module;

the output module includes a Flash analog-to-digital conversion module, and the input sparsity sensing module, the input sparsity comparison module and the Flash analog-to-digital conversion module are connected in sequence;

the input sparsity sensing module converts the input data into an analog voltage;

the input sparsity comparison module compares the analog voltage with first reference voltages to obtain first comparison results;

based on the first comparison results, the Flash analog-to-digital conversion module compares the multi-bit analog accumulation result with second reference voltages to obtain second comparison results, and takes the second comparison results as the digital accumulation result.
Specifically, in this embodiment the input module can further include, besides the DAC array, the input sparsity sensing module and the input sparsity comparison module; the DAC array and the input sparsity sensing module are connected to the bitwise multiplication module. As shown in FIG. 3, the decoded result of each DAC in the array can be denoted IA[i], 0 ≤ i ≤ N-1, where N is the number of DACs and can be 128. It should be understood that, whether or not the input module contains the input sparsity sensing and comparison modules, IA[i] is fed to the bit multiplication units for the multiplication.

The input sparsity sensing module can be an IS-DAC (Input Sparsity Sensing DAC) comprising an NMOS transistor 11 and multiple sensing branches, the sensing branches being connected one to one to the DACs of the array; the IS-DAC thus includes one NMOS and 128 sensing branches. The NMOS handles discharging: its source can be grounded and its gate receives an external reset signal (RST_IS).

Each sensing branch includes a switch 12 and a capacitor 13, the DAC, the switch 12 and the capacitor 13 being connected in sequence, so the IS-DAC includes a switch array of 128 switches and a capacitor array of 128 capacitors. The other plate of every capacitor is connected to the drain of the NMOS and to the input sparsity comparison module. The switch array receives an external gate control signal (IS-Eval); combining this control signal with the capacitor array, the IS-DAC converts all IA[i] by capacitive coupling into a single analog voltage VIS that represents the input sparsity.

The input sparsity comparison module can include an input sparsity comparator array (IS-CA) containing multiple first comparators; their number can be set as required, for example 15. The IS-CA is controlled by an external enable signal (IS_SA_EN). The inverting input of each first comparator is tied to a first reference voltage Vref[j], 0 ≤ j ≤ M-1, where M is the number of comparators in the IS-CA and can be 15.

Each first comparator of the IS-CA compares the analog voltage VIS with its first reference voltage Vref[j] to obtain a first comparison result, a 1b thermometer code DR[j]; the IS-CA thus outputs the 15b thermometer code DR<0:14>.
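A minimal sketch of this sensing-and-quantization path follows, with the capacitive coupling idealized as an arithmetic mean and the supply and reference levels chosen purely for illustration (they are assumptions, not values from the patent).

```python
# Behavioral sketch of input sparsity sensing: the IS-DAC averages the 128 DAC
# output voltages (idealized capacitive coupling), and the IS-CA quantizes V_IS
# into the 15-bit thermometer code DR<0:14>.
from typing import List

def is_dac(ia: List[float]) -> float:
    """Idealized capacitive averaging of the IA[i] voltages into V_IS."""
    return sum(ia) / len(ia)

def is_ca(v_is: float, vrefs: List[float]) -> List[int]:
    """15 strong-arm comparators: DR[j] = 1 iff V_IS exceeds Vref[j]."""
    return [1 if v_is > vref else 0 for vref in vrefs]

VDD = 0.9
vrefs = [k * VDD / 16 for k in range(1, 16)]   # 15 ascending references (assumed)
ia = [0.0, 0.0, 0.3375, 0.84375] * 32          # 128 DAC outputs for a fairly sparse input
dr = is_ca(is_dac(ia), vrefs)
print(dr)   # the number of leading 1s tells the Flash-ADCs how many comparators to enable
```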
The output module can include a Flash analog-to-digital conversion module containing multiple Flash analog-to-digital conversion units (Flash-ADCs); each Flash-ADC can be a 4b-ADC, so the module can be regarded as a 4b-Flash-ADC array. The number of Flash-ADCs can equal the number of stored data words, i.e. 64 in total, denoted Flash-ADC<k>, 0 ≤ k ≤ K-1, where K is the number of Flash-ADCs and can be 64.

Each Flash-ADC can contain multiple second comparators, connected one to one to the first comparators of the IS-CA, so each Flash-ADC can have 15 comparators. Each second comparator has a second reference voltage; based on the first comparison result of the corresponding first comparator, a second comparator compares the multi-bit analog accumulation result with its second reference voltage to obtain a second comparison result, and outputs that result as the digital accumulation result corresponding to the multi-bit analog accumulation result.

Both the first comparators of the IS-CA and the second comparators of the Flash-ADC are strong-arm comparators. In the IS-CA, the first reference voltages rise from low to high as the distance of the first comparator from the IS-DAC increases. Likewise, in each Flash-ADC the second reference voltages rise from low to high following the same near-to-far ordering of the connected first comparators.

Therefore, in each Flash-ADC the first of the second comparators can be denoted L-Comp<0>, with a second reference voltage in the range 0-400 mV, and the last one H-Comp<14>, with a second reference voltage in the range 400-900 mV.

In this embodiment, combining the input sparsity sensing module, the input sparsity comparison module and the Flash analog-to-digital conversion module yields a high chip throughput. The Flash module has a rail-to-rail decoding range, but in MAC operations the full dynamic range is rarely reached, especially when the inputs are sparse. Therefore an input sparsity sensing strategy, which senses the input sparsity characteristics in real time, is used to steer the decoding of the Flash module, reducing the number of comparisons and hence the energy. Without considering the stored data, the strategy estimates and quantizes the sum of the 128 4b input data and, based on the quantized result, allows redundant comparators to be skipped.
On the basis of the above embodiments, the operating modes of the SRAM compute-in-memory chip include a storage operation mode and a computing operation mode;

in the storage operation mode, the input module and the output module do not work;

in the computing operation mode, the SRAM compute-in-memory chip performs the multiply-accumulate operation on the input data and the stored data.

Specifically, the chip has two operating modes: the storage operation (SRAM) mode and the computing operation (CIM) mode. The storage operation mode is the mode in which the stored data are written bitwise into the chip, the storage location being the bit multiplication units. The computing operation mode is the mode in which the input data and the stored data are computed on.

In the storage operation mode, neither the input module nor the output module works.

In the computing operation mode, all modules of the chip work and perform the multiply-accumulate operation on the input data and the stored data.
On the basis of the above embodiments, the input sparsity comparison module includes multiple first comparators, the Flash analog-to-digital conversion module includes multiple Flash analog-to-digital conversion units, and each Flash analog-to-digital conversion unit includes multiple second comparators;

the first comparators and the second comparators are connected one to one, and the first reference voltage of each first comparator is the same as the second reference voltage of the second comparator connected to it.

Specifically, making the first reference voltage of a first comparator equal to the second reference voltage of its connected second comparator ensures that the input sparsity correctly predicts the working state of each second comparator, so the accuracy of the output result is maintained while the number of working second comparators is reduced.
On the basis of the above embodiments, the number of Flash analog-to-digital conversion units is the same as the number of second-type capacitor attenuators.

Specifically, with equal numbers and a one-to-one connection between Flash-ADCs and second-type capacitor attenuators, each Flash analog-to-digital conversion unit determines and outputs the digital accumulation result corresponding to one 4b stored data word.
On the basis of the above embodiments, the input sparsity sensing module converts the 128 4b input data by capacitive coupling into one analog voltage VIS representing the input sparsity. The IS-CA then compares VIS with the first reference voltages of the first comparators. The first comparison result is the 15b thermometer code DR[0:14], representing the quantized input sparsity. During the readout phase, the thermometer code DR[0:14] determines the working state and the comparison result of the 15 second comparators in every Flash-ADC.

The control logic of the second comparators is as follows: when the thermometer code DR[i] = 0, the corresponding second comparator Comp<i> is skipped and its comparison result is forced to 0; when DR[i] = 1, Comp<i> works normally and produces an output.
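The sketch below applies exactly this gating rule to an idealized Flash-ADC readout; the reference levels and input voltage are illustrative, and comparator offsets and timing are ignored.

```python
# Sketch of the sparsity-gated Flash-ADC readout: comparator Comp<i> fires only
# when DR[i] = 1; otherwise its result is forced to 0. The digital code is the
# count of 1s in the resulting thermometer code.
from typing import List, Tuple

def flash_adc_readout(v_hca: float, vrefs: List[float], dr: List[int]) -> Tuple[int, int]:
    """Return (4b digital code, number of comparisons actually performed)."""
    bits, comparisons = [], 0
    for vref, gate in zip(vrefs, dr):
        if gate == 0:
            bits.append(0)          # skipped comparator, result forced low
        else:
            comparisons += 1
            bits.append(1 if v_hca > vref else 0)
    return sum(bits), comparisons   # thermometer-to-binary by counting 1s

vrefs = [k * 0.9 / 16 for k in range(1, 16)]   # illustrative reference ladder
dr    = [1] * 6 + [0] * 9                      # sparse inputs: only 6 comparators enabled
code, used = flash_adc_readout(0.21, vrefs, dr)
print(code, used)                              # code 3 obtained while firing only 6 of 15 comparators
```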
On the basis of the above embodiments, the bit multiplication unit comprises one column of a 9T1C cell array, and the 9T1C cell array comprises multiple 9T1C cells; the SRAM compute-in-memory chip further comprises an SRAM read/write peripheral structure connected to the 9T1C cells.

Specifically, in this embodiment each bit multiplication unit can include one column of a 9T1C cell array, each array containing multiple 9T1C cells. Each 9T1C cell is one computing cell comprising 9 transistors (T) and 1 capacitor (Cbitcell): the 9T part implements the storage and the multiplication, and through the capacitive-coupling principle the multiplication result is accumulated on the top plate of the capacitor.

The SRAM compute-in-memory chip can further include an SRAM read/write peripheral structure, which can be connected to every 9T1C cell through the word line WL and the bit lines BL/BLB, so as to drive and control each 9T1C cell.

In this embodiment, using a 9T1C cell array as the bit multiplication unit to multiply one bit of the stored data by the input data balances the transistor count against the dynamic range, compared with the 8T1C and 10T1C cells of the prior art, and improves chip performance.
On the basis of the above embodiments, each 9T1C cell includes six first-type transistors and three second-type transistors, the first-type and second-type transistors all being connected to the SRAM read/write peripheral structure;

the six first-type transistors store one bit of the stored data;

the three second-type transistors multiply the bit of the stored data held by the six first-type transistors by the corresponding bit of the input data.

Specifically, as shown in FIG. 4, each bit multiplication unit can include one column of a 9T1C cell array, the array comprising multiple 9T1C cells, each of which stores one bit of the stored data. The number of 9T1C cells in each column can equal the number of input data, for example 128.

The input of every 9T1C cell is IA[i]; within each cell, the input line IA[i] divides the 9 transistors into an upper 6T part and a lower 3T part. The upper 6T are the first-type transistors, used mainly to store one bit of the stored data; the lower 3T are the second-type transistors, used for the multiplication of the input data with the bit stored in the upper 6T.

The 6T part contains the nodes Q and QB, and the stored bit is held as a voltage on node Q. In the 3T part, the first and second transistors are connected in parallel and then in series with the third; QB[i] and Q[i] are the voltages on the lines connected to the gates of the first and second transistors, respectively. The capacitor C of each 9T1C cell is in parallel with the third transistor; the voltage of its top plate, denoted Mult[i], represents the result of multiplying IA[i] by the bit stored at node Q.

In FIG. 4, every 9T1C cell is connected to a word line WL[i] and to the bit lines BL and BLB, and every cell contains a computing line CL. The computing line carries a reset switch; when this switch closes, the corresponding computing line receives the reset signal (RST_MAC).
The operation truth table of the 9T1C cell is shown in Table 1, and the correspondence between the 4b Input and IA in Table 2.

Table 1 Operation truth table of the 9T1C cell

Table 2 Correspondence between the 4b Input and IA

As can be seen from Tables 1 and 2, the 9T1C cell has a rail-to-rail dynamic range, larger than that of the 8T1C design. The capacitor used is a ~1.33 fF MOM capacitor that can be placed above the 9 transistors during chip fabrication, with a small area overhead. One bit of a 4b stored data word is stored in each 9T1C cell, and the 4b input data, as the analog voltage generated by the 4b-DAC, is applied to the input lines IA[0:127] to drive the capacitor top plates Mult[i] of all cells in the corresponding row whose stored bit is 1.

The multiplication operation of a 9T1C cell comprises two phases, reset and evaluation, as shown in FIG. 5. It starts by resetting the top plate of the capacitor C to GND, and there are two reset paths. If Q = 0, the transistor connected to QB turns on and pulls the computing line (CL) down to GND. If Q = 1, the transmission gate turns on and pulls CL down to IA, which is at GND during the reset phase. After reset, the analog voltage is applied to IA; it is transferred to the node Mult if and only if Q = 1. Multiple stored data words are kept in multiple 9T1C cells of the array, the 9T1C cells perform the multiplication in parallel, and during the evaluation phase the multiple input data are applied to IA as driving voltages. Based on the capacitive-coupling principle, the resulting voltage V_CL is proportional to the bitwise MAC result of the 1b stored data and the 4b input data.
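A minimal behavioral model of one such column follows, assuming unit capacitors, ideal switches and no parasitics, so the shared computing line simply averages the per-row products; the toy voltages are placeholders.

```python
# Behavioral sketch of one 9T1C column over reset/evaluation: each cell passes its
# input voltage IA[i] onto its capacitor top plate only when the stored bit is 1,
# and the shared computing line settles to a value proportional to the bitwise MAC.
from typing import List

def nine_t_one_c_column(ia: List[float], stored_bits: List[int]) -> float:
    """Return V_CL for one column as the capacitive average of Mult[i] over all rows."""
    assert len(ia) == len(stored_bits)
    # Reset phase: every Mult[i] (and CL) is discharged to GND.
    mult = [0.0] * len(ia)
    # Evaluation phase: only cells storing a 1 couple IA[i] onto their capacitor.
    for i, (v, bit) in enumerate(zip(ia, stored_bits)):
        mult[i] = v if bit == 1 else 0.0
    # All unit capacitors share the computing line, so V_CL is their average.
    return sum(mult) / len(mult)

ia = [0.0, 0.3, 0.9, 0.6]        # DAC output voltages for 4 rows (toy example)
bits = [1, 1, 0, 1]              # one weight bit per row in this column
print(nine_t_one_c_column(ia, bits))   # 0.225 V, proportional to the 1b x 4b MAC
```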
On the basis of the above embodiments, FIG. 6 shows the connection between the bitwise multiplication module and the capacitor attenuation module when each bit multiplication unit of the bitwise multiplication module comprises one column of a 9T1C cell array.

FIG. 6 shows only the four adjacent bit multiplication units corresponding to one 4b stored data word, together with the two first-type capacitor attenuators of the first-layer array and the one second-type capacitor attenuator of the second-layer array contained in one HCA structure. The four bits of the 4b stored data word are denoted W[0], W[1], W[2] and W[3], and each bit corresponds to one computing line, CL[0], CL[1], CL[2] and CL[3] respectively.

The two first-type capacitor attenuators are Cw01 and Cw23, and the second-type capacitor attenuator is Cw01/23. Cw01 is connected to CL[0] and CL[1], Cw23 is connected to CL[2] and CL[3], and Cw01/23 is connected to CL[1] and CL[3].

As shown in FIG. 6, the results of multiplying the 128 4b IA values by the four 1b bits of one stored data word are placed on CL[3], CL[2], CL[1] and CL[0]. The attenuation coefficient AC = 0.5 of the two first-type capacitor attenuators in the HCA structure is determined by computing the sum of the W[0] (W[2]) and W[1] (W[3]) bit pairs, and the attenuation coefficient AC = 0.25 of the second-type capacitor attenuator by computing the sum of the W[0:1] and W[2:3] pairs. In other words, the attenuation coefficient AC of each capacitor attenuator is set according to the relative weight of each bit, and this process is hierarchical.
With the capacitive contribution of each branch CL[0], CL[1], CL[2] and CL[3], seen from the output node, satisfying the ratio 1:2:4:8, the corresponding capacitances can be determined as:
Cw01=Cw23=128Cbitcell
Cw01/23=64Cbitcell
Since the multiplication of the 9T1C cells comprises a reset phase and an evaluation phase, during the reset phase the top and bottom plates of the first-type and second-type capacitor attenuators are all discharged to GND. During the evaluation phase, all 4b input data are applied through the 4b-DACs to the 9T1C cell array, clamping the top plates of the cell capacitors to the fixed voltages generated by the 4b-DACs. Then, when the coupled capacitor array formed by the capacitors of all the 9T1C cells in the four bit multiplication units and the capacitor attenuators of the HCA structure settles to a new steady state, the HCA output voltage VHCA, which represents the computation result, is produced. VHCA is then quantized by the 4b Flash analog-to-digital conversion module, which outputs the 4b computation result. Ideally, ignoring parasitic capacitance, VHCA can be computed as:

VHCA = (VDD / 128) × Σi=0..127 (IAi / IAmax,i) × Σj=0..3 (2^j / 15) × wi,j

where IAi is the input of row i, wi,j is the bit of the stored data in row i and column j, taking the value 0 or 1, and IAmax,i is the maximum input value, which is 15 for 4b input data.
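The short sketch below evaluates this ideal relation numerically. The supply value and the exact normalization follow the expression above, which ignores parasitics, so the numbers are illustrative rather than measured chip results.

```python
# Numeric sketch of the ideal V_HCA relation: the analog output tracks the
# normalized 4b x 4b MAC of the 128 inputs with one 4b weight (no parasitics).
VDD = 0.9  # assumed supply voltage

def ideal_v_hca(inputs_4b, weight_4b):
    """inputs_4b: 128 integer codes in 0..15; weight_4b: integer in 0..15."""
    w_bits = [(weight_4b >> j) & 1 for j in range(4)]        # w[i][j], LSB first
    acc = 0.0
    for ia in inputs_4b:
        acc += (ia / 15) * sum((2 ** j / 15) * b for j, b in enumerate(w_bits))
    return VDD * acc / len(inputs_4b)

inputs = [7] * 128
print(ideal_v_hca(inputs, 13))   # ~0.364 V for IA = 7 on every row and weight 13
```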
On the basis of the above embodiments, the transistor area of a 9T1C cell is 0.7 um × 1.42 um, and the 1.33 fF MOM capacitor occupies 0.55 um × 1.42 um.

FIG. 7 is the layout of the 9T1C cells and an HCA column, comprising 4 columns of 9T1C cells and 1 HCA structure, and showing the layout that implements the multiply-accumulate of one 4b weight data word. For layout symmetry and matching, the following three improvements are made. First, the Cw01/Cw23 and Cw01/23 capacitors of the HCA structure are split into 128 and 64 unit capacitors Cbitcell at the 9T1C-cell level, respectively; these small unit capacitors are distributed over the whole layout of the bitwise multiplication module while keeping the preset ratios, so that in every row 7 small cell-level capacitors with different functions are distributed over the transistor-level layout of 4 9T1C cells. Second, the W[2] and W[3] columns are swapped, so that the column order from left to right becomes W[0], W[1], W[3], W[2]; this yields a centrally symmetric layout that minimizes the analog computing mismatch between W[0:1] and W[2:3]. Third, 64 dummy capacitors are introduced and interleaved with the 64 small capacitors of Cw01/23, maximizing the symmetry of the capacitor-array layout and minimizing the influence of random mismatch.

In FIG. 7, A denotes the unit capacitors Cbitcell, B the first-type capacitor attenuators, C the dummy capacitors, and D the second-type capacitor attenuators.
On the basis of the above embodiments, the SRAM read/write peripheral structure can include an SRAM Controller, SRAM Peripheral circuits and an Address Decoder & Driver.

The SRAM controller can be connected to the SRAM peripheral circuits and to the address decoder and driver respectively, providing global control of the chip's storage function. The SRAM peripheral circuits and the address decoder and driver are both connected to the 9T1C cells, so that the stored data can be written bitwise into each 9T1C cell.

In this embodiment, the SRAM controller automates the storage function of the chip.

On the basis of the above embodiments, the SRAM compute-in-memory chip can further include an in-memory computing controller (CIM Controller) connected to the input module and the output module respectively. Through the in-memory computing controller, global control of the chip's computing function is achieved.

On the basis of the above embodiments, the stored data includes multiple 4-bit weight data of a neural network.

Specifically, a neural network usually contains a large number of 4-bit weight data, all of which can be stored bitwise, as stored data, in the bit multiplication units of the chip and multiplied-and-accumulated with the input data of the neural network, thereby realizing the function of the convolution kernels of the neural network.
FIG. 8 is the complete structural diagram of the capacitively coupled SRAM compute-in-memory chip provided in this embodiment; the chip can satisfy the requirement of 64 convolution kernels in a neural network each performing 128 operations at a time. The chip comprises a 128×256 9T1C cell array, the SRAM controller, the SRAM peripheral circuits, the address decoder and driver, the CIM controller, 128 4b-DACs, the IS-DAC, the IS-CA, a capacitor attenuation module containing 1×64 HCAs, and a Flash analog-to-digital conversion module containing 1×64 4b-ADCs. With a 128×256 9T1C cell array, the storage capacity of the chip is 32 kb. The 256 columns of the array are divided into 64 groups of 4 columns each, each group holding one 4b weight data word.

In SRAM mode, the 4b-DACs, the IS-DAC, the IS-CA and the 4b-ADCs do not work; the chip behaves as a 6T-SRAM memory performing normal read and write operations, and in this mode the weight data of the neural network are written into the SRAM. In CIM mode, the chip performs the 4b MAC operation fully in parallel: within a single cycle all rows receive their 4b input data, the chip supporting 128 4b input data in total, denoted IN[0][0:3], IN[1][0:3], ..., IN[127][0:3]. The corresponding 4b-DACs decode the input data into IA[0], IA[1], ..., IA[127].
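For reference, the sketch below computes, in the digital domain, the same 128-input, 64-kernel 4b×4b vector-matrix multiply that the chip evaluates in the analog domain in CIM mode; the random data are placeholders and the shapes follow the text.

```python
# Digital reference for the CIM-mode operation: 128 4b inputs multiplied with
# 64 columns of 128x4b weights. The analog path approximates these 64 sums.
import random

ROWS, GROUPS = 128, 64                      # 128 input rows, 64 4b weight groups
inputs  = [random.randint(0, 15) for _ in range(ROWS)]
weights = [[random.randint(0, 15) for _ in range(GROUPS)] for _ in range(ROWS)]

mac = [sum(inputs[i] * weights[i][k] for i in range(ROWS)) for k in range(GROUPS)]
print(mac[:4])   # the 4b Flash-ADCs output a quantized version of these 64 sums
```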
The vector-matrix multiplication of the 128 input data with the 64 columns of 128×4b weights is computed in the analog domain through capacitive coupling, and the Flash analog-to-digital conversion module converts the analog voltages representing the MAC results into 4b digital codes. In addition, computation at higher bit widths can be realized through serial input computation combined with a shift accumulator.
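As one possible illustration of this serial-input, shift-and-accumulate extension, the sketch below splits an 8b activation into two 4b nibbles, feeds each nibble through a stand-in for the 4b macro, and recombines the partial sums digitally. The nibble-splitting scheme and helper names are assumptions for illustration, not the patent's defined implementation.

```python
# Sketch of extending the 4b MAC to 8b inputs by serial input plus shift-and-
# accumulate. The in-memory macro is stubbed by an exact digital model here.
from typing import List

def cim_macro_4b(inputs_4b: List[int], weights_4b: List[int]) -> int:
    """Stand-in for one 4b x 4b MAC column of the chip (exact, not quantized)."""
    return sum(a * w for a, w in zip(inputs_4b, weights_4b))

def mac_8b_inputs(inputs_8b: List[int], weights_4b: List[int]) -> int:
    low  = [a & 0xF for a in inputs_8b]          # first serial pass: low nibbles
    high = [(a >> 4) & 0xF for a in inputs_8b]   # second serial pass: high nibbles
    return cim_macro_4b(low, weights_4b) + (cim_macro_4b(high, weights_4b) << 4)

acts = [200, 17, 63]
wts  = [3, 15, 1]
assert mac_8b_inputs(acts, wts) == sum(a * w for a, w in zip(acts, wts))
print(mac_8b_inputs(acts, wts))   # 918
```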
The SRAM peripheral circuits provide the bit lines BL/BLB for every 9T1C cell of the array, and the address decoder and driver provides the word line WL for every 9T1C cell.

Between the 9T1C cell array and the capacitor attenuation module there is also a switch controlled by the CIM controller; this switch can be tied to ground to reset (RST) the corresponding computing line CL[i].

FIG. 9 is the working timing diagram of the MAC operation with input sparsity sensing, for the case where the input module includes the input sparsity sensing module and the input sparsity comparison module and the output module includes the Flash analog-to-digital conversion module. As can be seen from FIG. 9, the timing is divided into two processes, input sparsity sensing (IS) and the MAC operation, which are mutually independent but linked through the thermometer code DR[0:14]. The IS process consists of a reset phase (Reset_IS) and an evaluation phase (Evaluation_IS); the MAC operation likewise consists of a reset phase (Reset_MAC) and an evaluation phase (Evaluation_MAC). CLK is the working clock of the chip; through the timing control module it generates RST_IS, EVAL_IS (IS-Eval in FIG. 2), SA_EN_IS (IS_SA_EN in FIG. 3), RST_MAC, EVAL_MAC (the gate control signal of the switches in the sensing branches during the MAC operation) and SA_EN_MAC (the enable signal of the IS-CA during the MAC operation). RST_IS and EVAL_IS, and RST_MAC and EVAL_MAC, are complementary signal pairs; SA_EN_IS and SA_EN_MAC are both used for readout in the evaluation phases. Throughout the process, the RST_IS signal is advanced ahead of the RST_MAC signal, so that DR[0:14] can be generated before the MAC evaluation phase and can then control the working state and the output of the second comparators in the Flash analog-to-digital conversion module. Before the input data enter the chip, the IS-DAC is reset by RST_IS; immediately after the input data enter the chip, the IS-DAC evaluates the input sparsity in the analog domain, produces VIS, and prepares for its quantization by the IS-CA. Meanwhile, the Reset_MAC phase of the MAC operation is in progress, and when the Evaluation_MAC phase begins, the IS-CA has already produced DR[0:14]. With this timing, adding the input sparsity sensing strategy does not reduce the computing throughput of the chip.
In summary, the capacitively coupled SRAM compute-in-memory chip provided in this embodiment uses 9T1C cells that perform the multiplication in the capacitive domain through capacitive coupling; it accumulates the 4b stored data through the hierarchical attenuation capacitor structure, which has none of the extra switches, complex control and long sharing time of the traditional charge-sharing structure and therefore greatly increases the computing throughput of the multi-bit weight computing system; and it uses a Flash analog-to-digital conversion module based on the input sparsity sensing strategy to reduce the number of AD comparisons and improve system energy efficiency. The chip can support 8192 4b×4b MAC operations.
Based on the above, the analog computing transfer function of the capacitively coupled SRAM compute-in-memory chip was simulated for the 9 combinations of 3 temperatures (-40/27/85 °C) and 3 process corners (TT/SS/FF). In the simulation, a 1 is written into the storage node of every 9T1C cell of the bitwise multiplication module, the corresponding input patterns are then applied in equal steps from small to large, and the HCA output voltage VHCA is recorded, giving the curve of VHCA versus the MAC result shown in FIG. 10. The horizontal axis of FIG. 10 is the MAC result and the vertical axis is the voltage VHCA. The curve can be expressed (with VHCA in mV) as:

y = 0.4676x - 1.6388, R² = 1

Under the different combinations of temperature and process corner, the analog computing transfer functions show no significant difference. A linear fit of the TT-corner result at 27 °C gives a goodness of fit R² = 1, showing that the chip achieves a MAC operation with good linearity and that temperature- and process-related non-idealities have little effect on its stability. In addition, three points A, B and C on the curve, located at MAC = 360, 960 and 1560, were selected and the process fluctuation evaluated from 500 Monte Carlo simulations; the maximum standard deviation of the fluctuation at these three points is 0.297 mV. Therefore, with the good linear fit of the transfer function and the small temperature and process variation, the chip can provide effective computation for convolutional neural network applications.
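The sketch below illustrates the fitting procedure used for this linearity check: it regenerates a transfer curve from the fitted slope and offset quoted above plus a little synthetic noise, fits a line, and computes R². The data are synthetic placeholders, not measured chip results.

```python
# Sketch of the linearity check: fit V_HCA (in mV) against the MAC result and
# compute the goodness of fit R^2. Synthetic data only, for illustration.
import numpy as np

rng = np.random.default_rng(0)
mac = np.arange(0, 1920, 60)                                        # swept MAC results
v_hca_mv = 0.4676 * mac - 1.6388 + rng.normal(0, 0.05, mac.size)    # assumed transfer + noise

slope, offset = np.polyfit(mac, v_hca_mv, 1)
pred = slope * mac + offset
r2 = 1 - np.sum((v_hca_mv - pred) ** 2) / np.sum((v_hca_mv - np.mean(v_hca_mv)) ** 2)
print(round(slope, 4), round(offset, 4), round(r2, 6))
```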
FIG. 11 shows the distribution of the Monte Carlo simulation results versus the MAC result at point A of FIG. 10; the vertical lines from left to right correspond to 166.107 mV, 166.307 mV, 166.506 mV, 166.706 mV, 166.905 mV, 167.105 mV and 167.304 mV. The centre of the distribution, i.e. its mean, is 166.706 mV, and the standard deviation is σ = 199.538 µV.

FIG. 12 shows the distribution of the Monte Carlo simulation results versus the MAC result at point B of FIG. 10; the vertical lines from left to right correspond to 446.451 mV, 446.721 mV, 446.992 mV, 447.262 mV, 447.532 mV, 447.802 mV and 448.073 mV. The mean of the distribution is 447.262 mV and the standard deviation is σ = 270.271 µV.

FIG. 13 shows the distribution of the Monte Carlo simulation results versus the MAC result at point C of FIG. 10; the vertical lines from left to right correspond to 726.978 mV, 727.275 mV, 727.573 mV, 727.871 mV, 728.168 mV, 728.466 mV and 728.763 mV. The mean of the distribution is 727.871 mV and the standard deviation is σ = 297.633 µV.
In addition, the settling time of the analog computing voltage was simulated for the chip structure. In the simulation, the storage nodes of the 9T1C cells are first all written with 1, and the corresponding patterns are then input in equal steps from small to large so that the MAC result increases gradually. The average settling time of the analog voltage for the MAC of 128 4b inputs with the 4b weight data is 0.2 ns. Compared with the traditional multi-bit weight accumulation scheme based on charge sharing over a weight capacitor array, the analog settling time of this chip is reduced by 90%, giving the chip a 50% improvement in computing throughput over the charge-sharing scheme. The shorter settling time and higher throughput mainly come from the fact that the analog voltage of this chip is established under a strong and well-defined externally applied voltage, whereas in the charge-sharing structure the analog voltage is established by potential rebalancing under weak and floating internal voltages.

This embodiment also compares the energy efficiency of the chip at different input sparsities with and without the input sparsity sensing strategy. In both cases the energy efficiency of the chip increases with the input sparsity, and the increment keeps growing. The increment that exists even without the strategy shows that the 9T1C cell saves the driver cost of capacitors whose dot-product results are sparse. At low sparsity (<30%), because of the cost of the IS-DAC and the IS-CA, the result with the input sparsity sensing strategy is lower than without it. At high sparsity (>30%), the result with the strategy is clearly larger than without it, thanks to the large number of second comparators skipped in the Flash analog-to-digital conversion module during computation. With the introduced input sparsity sensing strategy, the chip achieves a high energy efficiency of 460-2264.4 TOPS/W at input sparsities from 5% to 95%; in terms of average energy efficiency, the strategy brings a 12.8% improvement, reaching a high efficiency of 666 TOPS/W.
Table 3 compares the performance of the chip structure provided in this embodiment with existing chip structures. The proposed chip structure achieves higher energy efficiency and throughput, improved by factors of 10 and 1.84 respectively over the existing structures, and behavioral simulation shows that the classification accuracy on the CIFAR-10 dataset is comparable to other works. The existing chip structures in Table 3 are identified by the publications in which they were presented.

Table 3 Performance comparison between the chip structure provided in the embodiments of the present disclosure and existing chip structures

The footnotes of Table 3 have the following meanings: 1 denotes the average area considering the local computing cell; 2 denotes current computation for 1b weight MAC and charge sharing for multi-bit weight accumulation; 3 denotes estimated from the description; 4 denotes estimated from the proposed structure with NMOS as transmission-gate switch; 5 denotes estimated from the graph; 6 denotes normalized to 4b/4b input/weight operation; 7 denotes that one MAC is counted as two operations (multiplication and addition); 8 denotes the behavioral simulation result considering comparator offset voltage.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims (10)

  1. A capacitively coupled SRAM compute-in-memory chip, characterized by comprising: an input module, a bitwise multiplication module, a capacitor attenuation module and an output module, the input module, the bitwise multiplication module, the capacitor attenuation module and the output module being connected in sequence;
    the input module being used to receive input data;
    the bitwise multiplication module comprising a plurality of bit multiplication units, each bit multiplication unit being used to multiply, based on the capacitive-coupling principle, the input data by one bit of bitwise-stored data to obtain a multiplication result corresponding to that bit of the stored data;
    the capacitor attenuation module comprising two layers of capacitor attenuator arrays, each first-type capacitor attenuator in the first-layer array being connected between two adjacent bit multiplication units, and each second-type capacitor attenuator in the second-layer array being connected between two adjacent first-type capacitor attenuators; the capacitor attenuation module being used to accumulate, layer by layer, the multiplication results corresponding to the individual bits of the stored data to obtain a multi-bit analog accumulation result;
    the output module being used to determine and output the digital accumulation result corresponding to the multi-bit analog accumulation result.
  2. The capacitively coupled SRAM compute-in-memory chip according to claim 1, characterized in that the input module comprises an input sparsity sensing module and an input sparsity comparison module, the input sparsity sensing module being connected to the bitwise multiplication module;
    the output module comprises a Flash analog-to-digital conversion module, and the input sparsity sensing module, the input sparsity comparison module and the Flash analog-to-digital conversion module are connected in sequence;
    the input sparsity sensing module is used to convert the input data into an analog voltage;
    the input sparsity comparison module is used to compare the analog voltage with a first reference voltage to obtain a first comparison result;
    the Flash analog-to-digital conversion module is used to compare, based on the first comparison result, the multi-bit analog accumulation result with a second reference voltage to obtain a second comparison result, and to take the second comparison result as the digital accumulation result.
  3. The capacitively coupled SRAM compute-in-memory chip according to claim 2, characterized in that the operating modes of the SRAM compute-in-memory chip comprise a storage operation mode and a computing operation mode;
    in the storage operation mode, the input module and the output module do not work;
    in the computing operation mode, the SRAM compute-in-memory chip performs the multiply-accumulate operation on the input data and the stored data.
  4. The capacitively coupled SRAM compute-in-memory chip according to claim 2, characterized in that the input sparsity comparison module comprises a plurality of first comparators, the Flash analog-to-digital conversion module comprises a plurality of Flash analog-to-digital conversion units, and each Flash analog-to-digital conversion unit comprises a plurality of second comparators;
    the first comparators and the second comparators are connected one to one, and the first reference voltage of each first comparator is the same as the second reference voltage of the second comparator connected to it.
  5. The capacitively coupled SRAM compute-in-memory chip according to claim 4, characterized in that the number of Flash analog-to-digital conversion units is the same as the number of second-type capacitor attenuators.
  6. The capacitively coupled SRAM compute-in-memory chip according to claim 1, characterized in that the bit multiplication unit comprises one column of a 9T1C cell array, the 9T1C cell array comprising a plurality of 9T1C cells;
    the SRAM compute-in-memory chip further comprises an SRAM read/write peripheral structure, the SRAM read/write peripheral structure being connected to the 9T1C cells.
  7. The capacitively coupled SRAM compute-in-memory chip according to claim 6, characterized in that each 9T1C cell comprises six first-type transistors and three second-type transistors, the first-type transistors and the second-type transistors all being connected to the SRAM read/write peripheral structure;
    the six first-type transistors are used to store one bit of the stored data;
    the three second-type transistors are used to multiply the bit of the stored data held by the six first-type transistors by the corresponding bit of the input data.
  8. The capacitively coupled SRAM compute-in-memory chip according to claim 6, characterized in that the SRAM read/write peripheral structure comprises an SRAM controller, SRAM peripheral circuits and an address decoder and driver;
    the SRAM controller is connected to the SRAM peripheral circuits and to the address decoder and driver respectively, and the SRAM peripheral circuits and the address decoder and driver are both connected to the 9T1C cells.
  9. The capacitively coupled SRAM compute-in-memory chip according to any one of claims 1-8, characterized in that the SRAM compute-in-memory chip further comprises an in-memory computing controller, the in-memory computing controller being connected to the input module and the output module respectively.
  10. The capacitively coupled SRAM compute-in-memory chip according to any one of claims 1-8, characterized in that the stored data comprises a plurality of 4-bit weight data of a neural network.
PCT/CN2023/083070 2022-04-27 2023-03-22 基于电容耦合的sram存算一体芯片 WO2023207441A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210457425.XA CN115048075A (zh) 2022-04-27 2022-04-27 基于电容耦合的sram存算一体芯片
CN202210457425.X 2022-04-27

Publications (1)

Publication Number Publication Date
WO2023207441A1 true WO2023207441A1 (zh) 2023-11-02

Family

ID=83157158

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083070 WO2023207441A1 (zh) 2022-04-27 2023-03-22 基于电容耦合的sram存算一体芯片

Country Status (2)

Country Link
CN (1) CN115048075A (zh)
WO (1) WO2023207441A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117316237A (zh) * 2023-12-01 2023-12-29 安徽大学 时域8t1c-sram存算单元及时序跟踪量化的存算电路
CN118098310A (zh) * 2024-04-25 2024-05-28 南京大学 基于超前补偿型跨阻放大器的光电存算阵列读出电路

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048075A (zh) * 2022-04-27 2022-09-13 北京大学 基于电容耦合的sram存算一体芯片
CN115658012B (zh) * 2022-09-30 2023-11-28 杭州智芯科微电子科技有限公司 向量乘加器的sram模拟存内计算装置和电子设备
CN115664422B (zh) * 2022-11-02 2024-02-27 北京大学 一种分布式逐次逼近型模数转换器及其运算方法
CN115794728B (zh) * 2022-11-28 2024-04-12 北京大学 一种存内计算位线钳位与求和***电路及其应用
CN116029351B (zh) * 2023-03-30 2023-06-13 南京大学 基于光电存算单元的模拟域累加读出电路

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144558A (zh) * 2020-04-03 2020-05-12 深圳市九天睿芯科技有限公司 基于时间可变的电流积分和电荷共享的多位卷积运算模组
CN111611529A (zh) * 2020-04-03 2020-09-01 深圳市九天睿芯科技有限公司 电容容量可变的电流积分和电荷共享的多位卷积运算模组
CN113658628A (zh) * 2021-07-26 2021-11-16 安徽大学 一种用于dram非易失存内计算的电路
US11176991B1 (en) * 2020-10-30 2021-11-16 Qualcomm Incorporated Compute-in-memory (CIM) employing low-power CIM circuits employing static random access memory (SRAM) bit cells, particularly for multiply-and-accumluate (MAC) operations
CN115048075A (zh) * 2022-04-27 2022-09-13 北京大学 基于电容耦合的sram存算一体芯片


Also Published As

Publication number Publication date
CN115048075A (zh) 2022-09-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23794876

Country of ref document: EP

Kind code of ref document: A1