CN118210755A - In-memory computing circuit structure, chip and electronic equipment - Google Patents

In-memory computing circuit structure, chip and electronic equipment

Info

Publication number
CN118210755A
Authority
CN
China
Prior art keywords
data
memory computing
computing
memory
subunit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410473404.6A
Other languages
Chinese (zh)
Inventor
岳金山
戴卓玉
闫胜哲
丛照日
郭泽钰
李泠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Microelectronics of CAS
Original Assignee
Institute of Microelectronics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Microelectronics of CAS filed Critical Institute of Microelectronics of CAS
Priority to CN202410473404.6A
Publication of CN118210755A
Legal status: Pending

Landscapes

  • Memory System (AREA)

Abstract

The disclosure provides an in-memory computing circuit structure, a chip, and electronic equipment, applicable to the technical field of circuit design. The in-memory computing circuit structure comprises: an in-memory computing array comprising an in-memory computing unit, wherein the in-memory computing unit is used for caching input data and for processing the input data and weight data to obtain output data; and an access array communicatively connected to the in-memory computing array and comprising an access unit, wherein the access unit is used for sending the input data to the in-memory computing unit and receiving the output data.

Description

In-memory computing circuit structure, chip and electronic equipment
Technical Field
The present disclosure relates to the field of circuit design technology, and more particularly, to an in-memory computing circuit structure, a chip, and an electronic device.
Background
In-memory computing integrates storage and computation and offers high parallelism and high energy efficiency, making it a strong alternative for algorithms that require large numbers of parallel matrix-vector multiplication operations.
In implementing the inventive concepts of the present disclosure, the inventors found that, because of the data movement required by the multiplication operations of the in-memory computing array, the energy efficiency of the in-memory computing circuitry is reduced.
Disclosure of Invention
In view of the above, the present disclosure provides an in-memory computing circuit structure, a chip, and an electronic device.
According to a first aspect of the present disclosure, there is provided an in-memory computing circuit structure comprising: an in-memory computing array comprising an in-memory computing unit, wherein the in-memory computing unit is used for caching input data and for processing the input data and weight data to obtain output data; and an access array communicatively connected to the in-memory computing array and comprising an access unit, wherein the access unit is used for sending the input data to the in-memory computing unit and receiving the output data.
According to an embodiment of the disclosure, the in-memory computing array includes N in-memory computing units and the access array includes N+1 access units, with the (N+1)-th access unit configured to receive result data, where N is a positive integer and the result data is the sum of the output data of the N in-memory computing units.
According to an embodiment of the present disclosure, the in-memory computing unit includes a computing subunit and an accumulation subunit. Processing the input data and the weight data to obtain the output data includes: performing multiply-accumulate processing on the input data and the weight data in the computing subunit to obtain target data; and accumulating the target data in the accumulation subunit to obtain the output data.
According to an embodiment of the present disclosure, the number of computing subunits in the in-memory computing unit is M, and the accumulation subunit includes a local data register for storing local target data, the local target data being the target data of m of the computing subunits, where m satisfies m < M and m and M are both positive integers.
According to an embodiment of the present disclosure, the accumulation subunit includes a systolic data register for storing the output data.
According to an embodiment of the present disclosure, the computing subunit includes a plurality of operation regions, each operation region including two storage components and one adder component.
According to an embodiment of the present disclosure, the operation regions are used for synchronously updating the weight data.
According to an embodiment of the present disclosure, the above-mentioned computing subunit further includes: the system comprises an input data driver, an in-memory computing input driving component and an in-memory computing output driving component; the input data driver is used for receiving the input data, the in-memory computing input driving component is used for receiving the weight data, and the in-memory computing output driving component is used for outputting the target data.
A second aspect of the present disclosure provides a chip comprising:
the in-memory computing circuit structure of any of the embodiments described above.
A third aspect of the present disclosure provides an electronic device, comprising:
The chip.
According to the embodiment of the disclosure, the input data in the access unit is input into the in-memory computing unit, the input data and the weight data are processed in the in-memory computing unit to obtain the output data, and the output data is sent to the access unit.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a schematic diagram of an in-memory computing circuit architecture according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a schematic diagram of an in-memory computing unit according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a schematic diagram of a computing subunit according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a schematic diagram of an accumulation subunit according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a block diagram of a chip according to an embodiment of the disclosure;
Fig. 6 schematically illustrates a block diagram of an electronic device suitable for implementing an in-memory computing circuit structure, according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, such a convention should generally be interpreted in the sense commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, it should likewise be interpreted in the sense commonly understood by those skilled in the art (e.g., "a system having at least one of A, B or C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical solutions of the present disclosure, the user information involved (including, but not limited to, personal information, image information, and device information such as location data) and the data involved (including, but not limited to, data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure, and application of such data comply with the relevant laws, regulations, and standards of the relevant countries and regions; necessary security measures are taken; public order and good customs are not violated; and corresponding operation entries are provided for users to grant or refuse authorization.
In-memory computation is an emerging circuit architecture, and unlike traditional von neumann architecture, in which memory and computation are separated, memory computation integrates memory and computation, and computation is completed inside a memory cell.
At present, even though the computing subunits in the in-memory computing array are highly energy efficient, large neural networks must store large amounts of input data, which leads to a large number of access-unit operations. Accesses to the access units and the peripheral circuits reduce the energy efficiency of the in-memory computing system, relative to that of the computing subunits alone, by about 60%.
In a neural network algorithm, a group of input data in the access units must be multiplied with the in-memory computing array, i.e., a matrix-vector multiplication operation. A matrix-vector multiplication consists of a series of vector multiplications, which means the input data must undergo a vector multiplication with each set of weight data in the in-memory computing array; that is, each input datum must be fed into the in-memory computing array for computation. In a large-scale neural network, this greatly increases the number of input-data accesses to the access units.
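Restated in equation form (an editor's illustration; the symbols W, x, y and the dimensions are assumptions, not notation from the original filing):

```latex
y = W x, \qquad y_i = \sum_{j=1}^{N} W_{ij}\, x_j \quad (i = 1, \dots, M)
```

Every output row reuses the same input vector x, so without on-array caching each element x_j is fetched from the access units once per row of W.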
In the process of implementing the disclosed concept, the inventor finds that at least the following problems exist in the related art: because the access amount of the access unit is large, power consumption is consumed, so that the energy efficiency of the in-memory computing system is reduced, and the performance of the in-memory computing whole chip is affected.
To at least partially solve the technical problems in the related art, embodiments of the present disclosure provide an in-memory computing circuit structure, including: an in-memory computing array comprising an in-memory computing unit, wherein the in-memory computing unit is used for caching input data and for processing the input data and weight data to obtain output data; and an access array communicatively connected to the in-memory computing array and comprising an access unit, wherein the access unit is used for sending input data to the in-memory computing unit and receiving output data.
The in-memory computing circuit structure of the disclosed embodiment will be described in detail below with reference to fig. 1 to 4 based on the scenario described in fig. 1.
Fig. 1 schematically shows a schematic diagram of an in-memory computing circuit structure according to an embodiment of the present disclosure.
As shown in fig. 1, the embodiment 100 includes an access array 110 and an in-memory computing array 120, wherein the access array 110 includes an access unit 111, and the in-memory computing array 120 includes an in-memory computing unit 121.
According to an embodiment of the disclosure, the in-memory computing unit may be configured to cache the input data, so that when the same input data is used again for in-memory computation, the cached input data and the weight data in the in-memory computing unit can be processed directly without re-reading the data from the access unit.
According to an embodiment of the present disclosure, the weight data denotes the data stored in the in-memory computing unit that participates in in-memory computing operations with the input data.
According to an embodiment of the present disclosure, output data obtained by processing input data and weight data is written back to the access unit.
According to an embodiment of the present disclosure, the access array is communicatively connected to the in-memory computing array; the in-memory computing array acquires input data from the access array and writes the output data obtained after processing back to the access array, and the data between the in-memory computing units and the access units is transmitted in a zigzag pattern.
According to the embodiment of the disclosure, the input data in the access unit is input into the in-memory computing unit, the input data and the weight data are processed in the in-memory computing unit to obtain the output data, and the output data is sent to the access unit.
According to an embodiment of the disclosure, the in-memory computing array includes N in-memory computing units and the access array includes N+1 access units; the (N+1)-th access unit is configured to receive result data, where the result data is the sum of the output data and N is a positive integer.
For example, the access array may include 10 access units and the in-memory computing array may include 9 in-memory computing units, ensuring that the accumulated sum of the output data of the 9 in-memory computing units can be written back to the 10th access unit.
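For illustration only, the following minimal Python sketch models this organization at the behavioral level; the function name simulate_array and the data layout are the editor's assumptions and do not describe the actual circuit.

```python
# Behavioral sketch (assumed model, not the patented circuit): an in-memory
# computing array of N units fed by N+1 access units; the accumulated result
# of all N units lands in the extra (N+1)-th access unit.

def simulate_array(inputs, weights):
    """inputs: N vectors held by access units 1..N;
    weights: N vectors, one per in-memory computing unit."""
    n = len(weights)
    assert len(inputs) == n, "one input-holding access unit per compute unit"
    outputs = [
        sum(x * w for x, w in zip(inputs[i], weights[i]))  # per-unit MAC
        for i in range(n)
    ]
    result = sum(outputs)              # sum of the output data
    access_units = inputs + [result]   # (N+1)-th access unit receives it
    return access_units

# 9 compute units and 10 access units, matching the example above.
acc = simulate_array([[1, 2]] * 9, [[1, 1]] * 9)
print(acc[9])  # accumulated result in the 10th access unit -> 27
```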
Fig. 2 schematically illustrates a schematic diagram of an in-memory computing unit according to an embodiment of the disclosure.
As shown in fig. 2, the in-memory computing unit 121 includes a computing subunit 121_1 and an accumulating subunit 121_2.
According to an embodiment of the present disclosure, an in-memory computing unit includes a computing subunit and an accumulating subunit; processing the input data and the weight data to obtain output data, including: performing multiply-accumulate processing on the input data and the weight data in the computing subunit to obtain target data; and accumulating the target data in an accumulation subunit to obtain output data.
According to an embodiment of the present disclosure, multiplying the input data by the weight data means performing a multiply-accumulate operation in which each single-bit multiplication is a logical AND; the target data is the result of this multiply-accumulate operation.
According to an embodiment of the disclosure, the accumulation subunit is configured to accumulate the target data of the calculation subunit to obtain the output data.
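A minimal sketch of this two-stage processing, assuming binary operands so that each one-bit multiplication reduces to a logical AND (function names are illustrative, not from the filing):

```python
# Assumed behavioral model of one in-memory computing unit.

def compute_subunit_mac(input_bits, weight_bits):
    """Computing subunit: multiply-accumulate; each 1-bit product is an AND."""
    return sum(x & w for x, w in zip(input_bits, weight_bits))  # target data

def accumulation_subunit(target_values):
    """Accumulation subunit: sums target data from several computing subunits."""
    return sum(target_values)  # output data

targets = [compute_subunit_mac([1, 0, 1, 1], [1, 1, 0, 1]),   # -> 2
           compute_subunit_mac([0, 1, 1, 0], [1, 1, 1, 0])]   # -> 2
print(accumulation_subunit(targets))  # output data -> 4
```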
According to an embodiment of the disclosure, the in-memory computing units form a systolic architecture that can flexibly implement convolution or matrix-vector multiplication operations, so that different algorithms can be executed in the in-memory computing units.
According to an embodiment of the disclosure, the in-memory computing unit includes a plurality of computing subunits, and the accumulating subunit includes a local data register for storing local target data, where the local target data is target data of a part of the computing subunits.
According to an embodiment of the present disclosure, the accumulation subunit includes a systolic data register for storing output data.
Fig. 4 schematically illustrates a schematic diagram of an accumulation subunit according to an embodiment of the disclosure.
As shown in fig. 4, the accumulation subunit 121_2 includes a local data register 121_2(1) and a systolic data register 121_2(2). The target data of some of the computing subunits are accumulated and stored in the local data register 121_2(1). Bit-serial input means that different bits of the input data are sent to the in-memory computing unit in different clock cycles, and the partial results of higher-order input bits must be left-shifted before being summed.
According to embodiments of the present disclosure, the systolic data register may store results accumulated across different in-memory computing units according to the systolic timing. The data in the systolic data register is written back to the corresponding access unit through a shift operation and transmitted to the next in-memory computing unit.
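The bit-serial timing can be sketched as follows (a simplified model under the editor's assumptions; the bit width, bit ordering, and names are illustrative):

```python
# Assumed model of bit-serial accumulation: in clock cycle k, the k-th bit of
# each input arrives; the partial MAC for that cycle is left-shifted by k
# before being summed into the local register, so higher-order bits carry
# their correct binary weight.

def bit_serial_mac(input_words, weight_bits, n_bits=4):
    local = 0
    for k in range(n_bits):                          # one clock cycle per bit
        cycle_bits = [(x >> k) & 1 for x in input_words]
        partial = sum(b & w for b, w in zip(cycle_bits, weight_bits))
        local += partial << k                        # left shift for higher bits
    return local

print(bit_serial_mac([5, 3], [1, 1]))  # 5*1 + 3*1 = 8
```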
According to an embodiment of the present disclosure, the computing subunit includes a plurality of operation regions, each operation region including two storage components and one adder component. The operation regions are used for synchronously updating the weight data. The computing subunit further includes an input data driver, an in-memory computing input driving component, and an in-memory computing output driving component; the input data driver is used for receiving input data, the in-memory computing input driving component is used for receiving weight data, and the in-memory computing output driving component is used for outputting target data.
Fig. 3 schematically illustrates a schematic diagram of a computing subunit according to an embodiment of the disclosure.
As shown in fig. 3, the computing subunit 121_1 includes an input data register & word line decoder 121_1(3), a storage component 121_1(5), an adder component 121_1(4), an in-memory computing input driving component 121_1(6), and an in-memory computing output driving component 121_1(7).
According to embodiments of the present disclosure, the input data driver may be used to send input data from the access unit to the in-memory computing unit. The storage components are used for storing weight data; two storage components and one adder component form an operation region, and each operation region can update its weight values synchronously. The adder component performs accumulation by shift-and-add.
According to embodiments of the present disclosure, an in-memory computing input drive component may be used to send weight data from the access unit to the in-memory computing unit.
According to embodiments of the present disclosure, the word line decoder may be used to convert the address of the input data into a row selection and select the row of the in-memory computing unit to be written.
According to an embodiment of the present disclosure, the in-memory computing output driver component is configured to output the output data to the next computing subunit.
According to an embodiment of the disclosure, the write flow of input data for in-memory computing is the same as the write scheme of a conventional access array: data is written row by row, one row of word lines is opened at a time while the data to be written is driven onto the bit lines and held at a suitable voltage, and the opened row is thereby written. The output flow of the output data likewise matches the read scheme of a conventional access array: data is read row by row, one row of word lines is opened at a time while the bit lines are sensed simultaneously, yielding the data of the opened row.
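A plain-Python stand-in for this row-wise flow (the class and method names are the editor's assumptions; a 2-D list stands in for the cell array):

```python
# Assumed sketch: one row of word lines is opened per operation; on a write
# the bit lines are driven with the data, on a read the bit lines are sensed.

class RowAccessArray:
    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]

    def write_row(self, row, data):
        # open word line `row`; hold write voltages on the bit lines
        self.cells[row] = list(data)

    def read_row(self, row):
        # open word line `row`; sense the bit lines
        return list(self.cells[row])

arr = RowAccessArray(4, 8)
arr.write_row(2, [1, 0, 1, 1, 0, 0, 1, 0])
print(arr.read_row(2))  # -> [1, 0, 1, 1, 0, 0, 1, 0]
```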
According to an embodiment of the disclosure, the storage component may be a memory cell circuit formed from standard 6T or 8T static random access memory (SRAM) cells, or another existing memory cell circuit such as a resistive random access memory (RRAM).
According to an embodiment of the present disclosure, the computing subunit multiplies the weight data and the input data in the operation regions and outputs the output data through the in-memory computing output driving component. An operation region can support in-memory computing and weight-update operations simultaneously, reducing the impact of weight updates on in-memory computing performance.
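One way to realize simultaneous computation and weight update with two storage components is ping-pong (double) buffering: one component supplies weights to the adder while the other is rewritten. The filing does not state this explicitly, so the sketch below is the editor's interpretation only:

```python
# Assumed ping-pong model of an operation region: 2 storage components
# (banks) and 1 adder component; updates go to the idle bank so the active
# bank keeps computing without a stall.

class OperationRegion:
    def __init__(self, weights):
        self.banks = [list(weights), list(weights)]  # 2 storage components
        self.active = 0

    def update_weights(self, new_weights):
        self.banks[1 - self.active] = list(new_weights)  # write idle bank
        self.active = 1 - self.active                    # swap when done

    def mac(self, inputs):
        w = self.banks[self.active]
        return sum(x * wi for x, wi in zip(inputs, w))   # 1 adder component

region = OperationRegion([1, 0, 1])
print(region.mac([3, 5, 7]))       # 3 + 7 = 10
region.update_weights([0, 1, 1])   # update overlaps with computation
print(region.mac([3, 5, 7]))       # 5 + 7 = 12
```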
In order to better understand the in-memory computing process in the in-memory computing circuit structure of the embodiments of the present disclosure, the computing process and the data transfers between units are described in more detail below, taking a 3×3 convolution operation as an example.
According to an embodiment of the present disclosure, for a 3×3 convolution operation, taking 9 in-memory computing units as an example, the 9 weight data of the convolution kernel are stored in the 9 in-memory computing units, respectively. The 1st to 3rd access units write input data into the 3rd in-memory computing unit, the 4th to 6th access units write input data into the 6th in-memory computing unit, and the 7th to 9th access units write input data into the 9th in-memory computing unit. After the input data and the weights have been multiplied for one cycle in the computing subunits of the in-memory computing units, the output data of the 9 in-memory computing units are accumulated by the accumulation subunits.
According to embodiments of the present disclosure, input data cached in the in-memory computing units can be transmitted between the in-memory computing units when the multiplication of the next cycle is performed. After all cycles of the in-memory computing operation are completed, the result data is written back to the 10th access unit.
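Putting the pieces together, a behavioral sketch of one cycle of the 3×3 example follows (the weight placement and the shift pattern are simplified assumptions, not the exact dataflow of the filing):

```python
# Assumed end-to-end sketch: nine units each hold one kernel weight; per-unit
# products are accumulated and the result is written to the 10th access unit.

kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

weights = [w for row in kernel for w in row]   # 9 in-memory computing units
inputs  = [x for row in patch for x in row]    # written from access units 1-9

outputs = [x * w for x, w in zip(inputs, weights)]  # one product per unit
result = sum(outputs)                               # accumulation subunits
print(result)  # 1 + 3 + 5 + 7 + 9 = 25 -> written back to the 10th access unit

# Next cycle: cached inputs shift between units instead of being re-read
# from the access array (rotation here is a placeholder for the real pattern).
inputs = inputs[1:] + inputs[:1]
```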
Based on the above-mentioned in-memory computing circuit structure, the present disclosure also provides a chip. The chip will be described in detail with reference to fig. 5.
Fig. 5 schematically shows a block diagram of a chip according to an embodiment of the disclosure.
As shown in fig. 5, this embodiment 500 includes an in-memory computing circuit structure 510.
According to embodiments of the present disclosure, an integrated circuit chip incorporating examples of the present disclosure was obtained through front-end design, back-end design, and wafer fabrication of the digital circuits. The chip was fabricated in a TSMC 28 nm process, and power consumption and performance were tested after packaging. In actual chip testing, access-unit accesses for ResNet, ConvNeXt-T, a recommender model (Criteo), and a recommender model (MovieLens) were reduced by 1.50×–3.75× compared with the prior art. For a 3×3 convolution operation, embodiments of the present disclosure increase the ratio of the in-memory computing system's energy efficiency to that of the in-memory computing unit by 15.3%.
Fig. 6 schematically illustrates a block diagram of an electronic device suitable for implementing an in-memory computing circuit structure, according to an embodiment of the present disclosure.
As shown in fig. 6, this embodiment 600 includes a chip 610.
According to embodiments of the present disclosure, program code for the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, "C", and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on a remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined in various ways, even if such combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. An in-memory computing circuit structure, comprising:
an in-memory computing array comprising an in-memory computing unit, wherein the in-memory computing unit is used for caching input data and for processing the input data and weight data to obtain output data; and
an access array communicatively connected to the in-memory computing array and comprising an access unit, wherein the access unit is used for sending the input data to the in-memory computing unit and receiving the output data.
2. The structure of claim 1, wherein the in-memory computing array comprises N of the in-memory computing units, the access array comprises N+1 of the access units, the (N+1)-th of the access units is configured to receive result data, the result data is the sum of the output data, and N is a positive integer.
3. The structure of claim 1, wherein the in-memory computing unit comprises a computing subunit and an accumulation subunit;
The processing the input data and the weight data to obtain the output data includes:
performing multiply-accumulate processing on the input data and the weight data in the computing subunit to obtain target data;
and accumulating the target data in the accumulation subunit to obtain the output data.
4. The structure of claim 3, wherein the in-memory computing unit comprises a plurality of the computing subunits, and the accumulation subunit comprises a local data register for storing local target data, the local target data being the target data of a portion of the computing subunits.
5. The structure of claim 3, wherein the accumulation subunit comprises a systolic data register for storing the output data.
6. The structure of claim 3, wherein the computing subunit comprises a plurality of operation regions, each operation region comprising two storage components and one adder component.
7. The structure of claim 6, wherein the operation regions are used for synchronously updating the weight data.
8. The structure of claim 3, wherein the computing subunit further comprises: an input data driver, an in-memory computing input driving component, and an in-memory computing output driving component; the input data driver is used for receiving the input data, the in-memory computing input driving component is used for receiving the weight data, and the in-memory computing output driving component is used for outputting the target data.
9. A chip, comprising:
the in-memory computing circuit structure according to any one of claims 1 to 8.
10. An electronic device, comprising:
the chip of claim 9.
CN202410473404.6A 2024-04-19 2024-04-19 In-memory computing circuit structure, chip and electronic equipment Pending CN118210755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410473404.6A CN118210755A (en) 2024-04-19 2024-04-19 In-memory computing circuit structure, chip and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410473404.6A CN118210755A (en) 2024-04-19 2024-04-19 In-memory computing circuit structure, chip and electronic equipment

Publications (1)

Publication Number Publication Date
CN118210755A 2024-06-18

Family

ID=91455349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410473404.6A Pending CN118210755A (en) 2024-04-19 2024-04-19 In-memory computing circuit structure, chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN118210755A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination