CN114093394A

CN114093394A - Transposable in-memory computing circuit and implementation method thereof

Info

Publication number: CN114093394A
Application number: CN202111273336.1A
Authority: CN
Inventors: 王润声; 宋嘉豪; 王源; 唐希源; 黄如
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-02-25
Anticipated expiration: 2041-10-29
Also published as: CN114093394B

Abstract

The invention discloses a transposable in-memory computing circuit and an implementation method thereof. The circuit comprises a transposable in-memory computing array and a peripheral circuit. The array comprises 16 local arrays; each local array comprises 128 storage-compute columns that are connected together by a row compute line, and the storage-compute columns in the same column position are connected together by a global bit line and a global bit line bar. Each storage-compute column comprises 8 six-transistor (6T) storage cells and 1 charge computing unit, connected in parallel through a local bit line and a local bit line bar. The peripheral circuit comprises a word line drive circuit, a read-write peripheral circuit, a forward input drive circuit, 16 row analog-to-digital converters, 16 8-to-1 multiplexers, 16 column analog-to-digital converters and a master timing control circuit. The transposed computing function of the invention allows an edge-side intelligent chip to retrain at the edge with lower power consumption; meanwhile, charge-domain computing improves the stability and precision of the computation.

Description

Transposable in-memory computing circuit and implementation method thereof

Technical Field

The invention relates to the field of integrated circuit design, and in particular to a transposable in-memory computing circuit and an implementation method thereof.

Background

In recent years, deep learning algorithms have achieved very good results in many fields. Meanwhile, the parameter scale of deep neural networks keeps growing. When deep learning tasks are processed on a conventional architecture in which memory and computation are separate, a large amount of power is consumed in moving neural network parameters, a problem known as the memory wall. This power consumption makes it difficult to deploy deep learning algorithms on edge devices, which have strict power requirements. To solve the memory wall problem, designers have in recent years proposed a new computing architecture, in-memory computing.

In-memory computing circuits are highly energy efficient because they compute in the analog domain and process data in parallel. In recent years, various in-memory computing chips have been proposed; depending on the type of analog computation, they fall into two categories, current-domain in-memory computing and charge-domain in-memory computing.

In a current-domain in-memory computing chip, the input is applied as a voltage, and the product of the input and a weight is represented by the magnitude of a current. The currents produced by multiplying several inputs by their weights are superposed on the same compute node and discharge the capacitance of that node, completing the multiply-accumulate operation entirely in the analog domain. However, random variation of the transistor threshold voltage can make the compute current deviate from the ideal result, which degrades the computation accuracy. This effect is particularly severe at advanced technology nodes.

In a charge-domain in-memory computing chip, the result of the multiplication determines whether a compute capacitor is charged, and accumulation is performed by connecting the compute capacitors together for charge sharing. The compute capacitor is usually implemented as a metal-oxide-metal capacitor, whose precision is very high at advanced process nodes, so the computation result is very accurate. The charge-domain structure has the advantage of high precision, but additional transistors are required to control the charging and discharging of the capacitor, so the compute cell area is larger than that of a conventional six-transistor (6T) cell and the storage and computation density is lower.
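The charge-sharing accumulation described above can be illustrated with a small numerical sketch. It assumes ideal capacitors with no leakage or switch resistance; the function name `charge_share` and the example values are illustrative and are not taken from the patent.

```python
def charge_share(voltages, capacitances=None):
    """Final voltage after ideal charge sharing between capacitors.

    voltages: initial voltage on each capacitor, in volts.
    capacitances: matching capacitance values; equal capacitors if omitted.
    """
    if capacitances is None:
        capacitances = [1.0] * len(voltages)
    # Total charge is conserved when the capacitors are shorted together.
    total_charge = sum(c * v for c, v in zip(capacitances, voltages))
    return total_charge / sum(capacitances)


# Four equal capacitors precharged to 0.9 V, one of which was discharged.
v_shared = charge_share([0.9, 0.9, 0.0, 0.9])
print(round(v_shared, 3))
```

With equal capacitors the shared voltage is simply the mean of the initial voltages, so the number of discharged capacitors can be read off from how far the node has dropped below the precharge voltage.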

In addition, for privacy reasons some data cannot be uploaded to the cloud to train a neural network. A more effective solution is to train a generic neural network model on a public data set, download it to the local device, and fine-tune part of the model with the user's own data, so that the neural network achieves the best results for each user. This solution requires training at the edge. However, unlike inference, neural network training requires the transpose of the weight matrix, which current in-memory computing chips rarely support.
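As a minimal sketch of why training needs the transpose: for a layer with weight matrix W, inference computes W·x, while back-propagating an error vector through the same layer computes Wᵀ·e. The helper names and the small matrix below are illustrative assumptions, not part of the patent.

```python
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def transpose(w):
    """Return the transpose of matrix w."""
    return [list(col) for col in zip(*w)]


w = [[1, 0, 1],
     [0, 1, 1]]

x = [1, 1, 0]                          # forward input (inference)
y = matvec(w, x)                       # uses W

err = [1, 0]                           # error arriving from the next layer
back = matvec(transpose(w), err)       # uses the transpose of W

print(y, back)
```

A memory array that can only accumulate along one direction has to be re-read or re-written to evaluate the second product, which is the cost a transposable array avoids.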

Therefore, a transposable and area-efficient charge-domain in-memory computing circuit is very important for deploying neural networks at the edge.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a transposable in-memory computing circuit and an implementation method thereof, which are based on charge sharing and on charge-domain computing with six-transistor (6T) storage cells, and which support transposed computation.

One objective of the present invention is to provide a transposable memory computing circuit.

The transposable in-memory computing circuit of the invention comprises a transposable in-memory computing array and peripheral circuits;

the transposable in-memory computing array comprises 16 × N local arrays, where N is a natural number;

each local array comprises 128 × M storage-compute columns, where M is a natural number; all 128 × M storage-compute columns in the same local array are connected together by a row compute line; all storage-compute columns located in the same column of the 16 × N local arrays are connected together by a global bit line and a global bit line bar; there are 16 × N row compute lines, 128 × M global bit lines and 128 × M global bit line bars;

each storage-compute column comprises 8 × K six-transistor (6T) storage cells and 1 charge computing unit, where K is a natural number, connected in parallel through a local bit line and a local bit line bar; the local bit line has a parasitic capacitance; the 6T storage cells store weight values; the corresponding 6T storage cells of all 128 × M storage-compute columns in each local array lie in the same row, giving 8 × K rows of 6T storage cells per local array; the 16 × N local arrays have 16 × N × 8 × K = 128 × N × K rows of 6T storage cells in total;

all 128 × M storage-compute columns in the same local array are connected in parallel to the word lines and word line bars, of which there are 8 × K each; across the 16 × N local arrays, the numbers of word lines and of word line bars are each 16 × N × 8 × K = 128 × N × K;

each charge computing unit comprises two precharge transistors, a row enable switch, a row enable bar switch, an input switch and an input bar switch; a precharge line is connected to the gates of the two precharge transistors; the row enable line is connected to the control terminal of the row enable switch, and the row enable line bar is connected to the control terminal of the row enable bar switch; one terminal of each of the row enable switch and the row enable bar switch is connected to the row compute line; the other terminal of the row enable switch is connected to the local bit line, and the other terminal of the row enable bar switch is connected to the local bit line bar; the input line is connected to the control terminal of the input switch, and the input line bar is connected to the control terminal of the input bar switch; one terminal of the input switch is connected to the local bit line, and one terminal of the input bar switch is connected to the local bit line bar; the other terminal of the input switch is connected to the global bit line, and the other terminal of the input bar switch is connected to the global bit line bar; the input switches and input bar switches located in the same column of the 16 × N local arrays have their control terminals connected to the same input line and the same input line bar respectively, so there are 128 × M input lines and 128 × M input line bars;

the peripheral circuit comprises a word line drive circuit, a read-write peripheral circuit, a forward input drive circuit, 16 × N row analog-to-digital converters, 16 × M 8-to-1 multiplexers, 16 × M column analog-to-digital converters and a master timing control circuit;

the forward input drive circuit registers the forward input values; its channels are connected to the respective input lines and input line bars and to the row enable lines and row enable line bars, so that it is connected through the input lines and input line bars to the control terminals of the input switch and input bar switch of each charge computing unit, and through the row enable lines and row enable line bars to the control terminals of the row enable switch and row enable bar switch of each charge computing unit; each channel of the read-write peripheral circuit is connected to a respective global bit line and global bit line bar; each global bit line is also connected to a corresponding input port of an 8-to-1 multiplexer, whose output port is connected to a column analog-to-digital converter; each row analog-to-digital converter corresponds to one local array, and each local array is connected through its row compute line to the input port of the corresponding row analog-to-digital converter; the word line driver registers the back-propagation input values and also serves as the back-propagation input driver, each of its channels being connected to a corresponding word line; the master timing control circuit is connected to the forward input drive circuit, the read-write peripheral circuit and the word line driver; the master timing control circuit is also connected to the precharge line, and through it to the precharge transistors of each charge computing unit; during computation the word line driver has two configuration modes, namely a forward-propagation word line drive mode and a back-propagation word line drive mode;

in the initial state, the row enable switches, row enable bar switches, input switches and input bar switches are all open; the word lines and word line bars, row enable lines and row enable line bars, input lines and input line bars, and precharge lines are all at a low level; and the bit lines and bit line bars are at the precharge voltage;

in the forward-propagation mode, the control of the input lines and input line bars depends on the forward input values; in the back-propagation mode, the control of the input lines and input line bars is independent of the back-propagation input values;

Forward-propagation mode: a) precharge stage: the precharge line is at a low level, so that the local bit line and the local bit line bar are precharged to the precharge voltage through the precharge transistors in the charge computing unit; b) the master timing control circuit applies a high level to the precharge line, ending the precharge stage; the word line driver then simultaneously applies a high level to the word line and word line bar of the row, in each of the 16 × N local arrays, whose weight values are to be read, thereby selecting those word lines and word line bars; according to the weight value stored in the 6T storage cell on the selected word line and word line bar, the local bit line and the local bit line bar are each either discharged to ground or kept at the precharge voltage; the word line and word line bar are then returned to a low level, completing the weight read operation; c) according to the registered forward input value, the forward input drive circuit applies a high level to the input line or input line bar of each channel to close the input switch or input bar switch, and the closed input switch or input bar switch discharges the corresponding local bit line or local bit line bar to ground; the input lines and input line bars that were driven high are then returned to a low level, completing the multiplication of the forward input value and the weight value; d) the forward input drive circuit applies a high level to all row enable lines and row enable line bars, closing the row enable switches and row enable bar switches; the results of multiplying the 128 × M forward input values by the corresponding weight values are accumulated on the row compute lines and transmitted to the row analog-to-digital converters, which quantize them and output 16 × N row outputs.
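The forward mode can be summarized by an idealized behavioral sketch: in each local array one row of binary weights is selected, each weight is multiplied by the forward input bit of its column, and the products are accumulated along that local array's row compute line. The sketch below is a purely digital abstraction under stated assumptions: it ignores the differential bit line pair, the analog voltage on the row compute line and the quantization by the row analog-to-digital converters, and the data layout `weights[a][r][c]` and all names are hypothetical.

```python
def forward_mode(weights, selected_rows, inputs):
    """Idealized forward-mode result, one value per local array.

    weights[a][r][c]: bit stored in local array a, row r, column c.
    selected_rows[a]: row selected by the word line in local array a.
    inputs[c]: forward input bit applied to column c.
    """
    outputs = []
    for a, local_array in enumerate(weights):
        row = local_array[selected_rows[a]]
        # 1-bit multiply (AND) per column, accumulated along the row.
        outputs.append(sum(x & w for x, w in zip(inputs, row)))
    return outputs


# Toy array: 2 local arrays, 2 rows each, 4 columns.
weights = [
    [[1, 0, 1, 1], [0, 1, 0, 0]],
    [[1, 1, 0, 1], [0, 1, 1, 1]],
]
row_outputs = forward_mode(weights, selected_rows=[0, 0], inputs=[1, 1, 0, 1])
print(row_outputs)
```

Each entry of the result corresponds to one row compute line, so the full-size array would produce 16 × N such values per operation.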

Back-propagation mode: a) precharge stage: the precharge line is at a low level, so that the local bit line and the local bit line bar are precharged to the precharge voltage through the precharge transistors in the charge computing unit; b) the master timing control circuit applies a high level to the precharge line, ending the precharge stage; according to the registered back-propagation input value, the word line driver either applies a high level to the word line of the row in the local array whose weight value is to be read or keeps that word line at a low level; the word lines that were driven high are then returned to a low level, completing the multiplication of the back-propagation input value and the weight value; c) the master timing control circuit applies a high level to all input lines, closing the input switches, so that all local bit lines located in the same column of the different local arrays are connected together and the accumulation is completed; the result is transmitted to the column analog-to-digital converters, which quantize it and output 16 × M column outputs, thereby realizing the transposed computation.
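The back-propagation (reverse transmission) mode accumulates along columns rather than rows. A hedged sketch of the resulting voltage on each column follows. It assumes that every local bit line has the same parasitic capacitance, that the capacitance of the global bit line itself is negligible, and that a local bit line is fully discharged when both the back-propagation input bit and the weight bit are 1. These are simplifying assumptions, and all names and the `weights[a][r][c]` layout are hypothetical.

```python
def backward_mode(weights, selected_rows, back_inputs, v_pre=0.9):
    """Idealized shared voltage per column after back-propagation step c).

    weights[a][r][c]: bit stored in local array a, row r, column c.
    selected_rows[a]: row whose word line may be driven in local array a.
    back_inputs[a]: back-propagation input bit for local array a.
    v_pre: precharge voltage in volts.
    """
    num_arrays = len(weights)
    num_cols = len(weights[0][0])
    voltages = []
    for c in range(num_cols):
        local_bit_lines = []
        for a in range(num_arrays):
            w = weights[a][selected_rows[a]][c]
            discharged = back_inputs[a] == 1 and w == 1
            local_bit_lines.append(0.0 if discharged else v_pre)
        # Equal capacitors shorted together settle at the mean voltage.
        voltages.append(sum(local_bit_lines) / num_arrays)
    return voltages


weights = [
    [[1, 0, 1, 1], [0, 1, 0, 0]],
    [[1, 1, 0, 1], [0, 1, 1, 1]],
]
column_voltages = backward_mode(weights, selected_rows=[0, 0], back_inputs=[1, 0])
print(column_voltages)
```

Under these assumptions the drop below the precharge voltage on each column is proportional to the number of (input, weight) pairs that are both 1, which is the column-wise dot product that the column analog-to-digital converters would then quantize.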

The pre-charging voltage is 0.7-0.9V; the low level is 0.0V, and the high level is 0.7-0.9V.

N satisfies 1 ≤ N ≤ 4, M satisfies 1 ≤ M ≤ 9, and K satisfies 1 ≤ K ≤ 4.
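The counts stated above follow directly from N, M and K. The short sketch below collects them in one place; the function name and dictionary keys are illustrative, while the formulas and parameter ranges are those given in the description.

```python
def array_dimensions(n=1, m=1, k=1):
    """Derived sizes of the array for given N, M and K."""
    if not (1 <= n <= 4 and 1 <= m <= 9 and 1 <= k <= 4):
        raise ValueError("expected 1 <= N <= 4, 1 <= M <= 9, 1 <= K <= 4")
    return {
        "local_arrays": 16 * n,
        "columns": 128 * m,                # global bit lines and input lines
        "rows_per_local_array": 8 * k,
        "total_rows": 16 * n * 8 * k,      # equals 128 * N * K word lines
        "row_adcs": 16 * n,
        "multiplexers": 16 * m,
        "column_adcs": 16 * m,
    }


dims = array_dimensions(n=1, m=1, k=1)
print(dims["total_rows"], dims["columns"])
```

For N = M = K = 1 this gives a 128 × 128 array with 16 row converters and 16 column converters.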

Another objective of the present invention is to provide a method for implementing a transposable in-memory computing circuit.

The method for implementing the transposable in-memory computing circuit of the invention comprises the following steps:

1) initial state:

the row enable switches, row enable bar switches, input switches and input bar switches are all open; the word lines and word line bars, row enable lines and row enable line bars, input lines and input line bars, and precharge lines are all at a low level; and the bit lines and bit line bars are at the precharge voltage;

2) forward-propagation mode:

a) precharge stage: the precharge line is at a low level, so that the local bit line and the local bit line bar are precharged to the precharge voltage through the precharge transistors in the charge computing unit;

b) the master timing control circuit applies a high level to the precharge line, ending the precharge stage; the word line driver then simultaneously applies a high level to the word line and word line bar of the row, in each of the 16 × N local arrays, whose weight values are to be read, thereby selecting those word lines and word line bars; according to the weight value stored in the 6T storage cell on the selected word line and word line bar, the local bit line and the local bit line bar are each either discharged to ground or kept at the precharge voltage; the word line and word line bar are then returned to a low level, completing the weight read operation;

c) according to the registered forward input value, the forward input drive circuit applies a high level to the input line or input line bar of each channel to close the input switch or input bar switch, and the closed input switch or input bar switch discharges the corresponding local bit line or local bit line bar to ground; the input lines and input line bars that were driven high are then returned to a low level, completing the multiplication of the forward input value and the weight value;

d) the forward input drive circuit applies a high level to all row enable lines and row enable line bars, closing the row enable switches and row enable bar switches; the results of multiplying the 128 × M forward input values by the corresponding weight values are accumulated on the row compute lines and transmitted to the row analog-to-digital converters, which quantize them and output 16 × N row outputs;

3) back-propagation mode:

a) precharge stage: the precharge line is at a low level, so that the local bit line and the local bit line bar are precharged to the precharge voltage through the precharge transistors in the charge computing unit;

b) the master timing control circuit applies a high level to the precharge line, ending the precharge stage; according to the registered back-propagation input value, the word line driver either applies a high level to the word line of the row in the local array whose weight value is to be read or keeps that word line at a low level; the word lines that were driven high are then returned to a low level, completing the multiplication of the back-propagation input value and the weight value;

c) the master timing control circuit applies a high level to all input lines, closing the input switches, so that all local bit lines located in the same column of the different local arrays are connected together and the accumulation is completed; the result is transmitted to the column analog-to-digital converters, which quantize it and output 16 × M column outputs, thereby realizing the transposed computation.

In step 2) c), when the forward input value is 1, the input switch and the input bar switch are closed; when the forward input value is 0, the input switch and the input bar switch are open.

In step 3) b), when the back-propagation input value is 1, the word line is driven to a high level, and when the back-propagation input value is 0, the word line is kept at a low level; when the back-propagation input value and the weight value are both 1, the local bit line is discharged to ground; otherwise the local bit line keeps the precharge voltage.

The invention has the advantages that:

compared with conventional non-transposable in-memory computing circuits, the transposed computing function of the invention enables an edge-side intelligent chip to retrain at the edge with lower power consumption; meanwhile, charge-domain computing improves the stability and precision of the computation.

Drawings

FIG. 1 is a block diagram of one embodiment of the transposable in-memory computing circuit of the present invention;

FIG. 2 is a schematic diagram of a local array of one embodiment of the transposable in-memory computing circuit of the present invention, wherein (a) is a block diagram of a local array and (b) is a block diagram of a storage-compute column;

FIG. 3 is a flow chart of the forward-propagation word line drive mode of one embodiment of the method for implementing the transposable in-memory computing circuit of the present invention, wherein (a) to (d) are flow charts of the four steps respectively, with a schematic diagram on the left and a timing diagram on the right;

FIG. 4 is a flow chart of the back-propagation word line drive mode of one embodiment of the method for implementing the transposable in-memory computing circuit of the present invention, wherein (a) to (c) are flow charts of the three steps respectively, with a schematic diagram on the left and a timing diagram on the right.

Detailed Description

The invention will be further elucidated by means of specific embodiments in the following with reference to the drawing.

As shown in FIG. 1, in this embodiment N = M = K = 1, and the transposable in-memory computing circuit of this embodiment comprises a 128 × 128 transposable in-memory computing array and peripheral circuits;

the transposable in-memory computing array comprises first to sixteenth local arrays;

each local array comprises first to 128th storage-compute columns; all 128 storage-compute columns in the same local array are connected together by a row compute line; all storage-compute columns located in the same column of the 16 local arrays are connected together by a global bit line and a global bit line bar; there are 16 row compute lines, 128 global bit lines and 128 global bit line bars, namely first to 128th global bit lines and first to 128th global bit line bars;

each storage-compute column comprises first to eighth six-transistor (6T) storage cells and 1 charge computing unit, connected in parallel through a local bit line and a local bit line bar; the local bit line has a parasitic capacitance; the 6T storage cells store weight values; the corresponding 6T storage cells of all 128 storage-compute columns in each local array lie in the same row, giving 8 rows of 6T storage cells per local array; the 16 local arrays have 16 × 8 = 128 rows of 6T storage cells in total;

each 6T storage cell comprises two cross-coupled inverters and two access transistors controlled by the word line and the word line bar; the 6T storage cells in the same row of all 128 storage-compute columns in the same local array are connected in parallel to the same word line and word line bar, and each local array has 8 word lines and 8 word line bars; across the 16 local arrays, the numbers of word lines and of word line bars are each 16 × 8 = 128;

each charge computing unit comprises two precharge transistors, a row enable switch, a row enable bar switch, an input switch and an input bar switch; a precharge line is connected to the gates of the two precharge transistors; the row enable line is connected to the control terminal of the row enable switch, and the row enable line bar is connected to the control terminal of the row enable bar switch; one terminal of each of the row enable switch and the row enable bar switch is connected to the row compute line; the other terminal of the row enable switch is connected to the local bit line, and the other terminal of the row enable bar switch is connected to the local bit line bar; the input line is connected to the control terminal of the input switch, and the input line bar is connected to the control terminal of the input bar switch; one terminal of the input switch is connected to the local bit line, and one terminal of the input bar switch is connected to the local bit line bar; the other terminals of the input switch and the input bar switch are connected to the global bit line and the global bit line bar respectively; the input switches and input bar switches located in the same column of the 16 local arrays have their control terminals connected to the same input line and the same input line bar respectively, so there are 128 input lines and 128 input line bars;

the peripheral circuit comprises a word line drive circuit, a read-write peripheral circuit, a forward input drive circuit, 16 row analog-to-digital converters, 16 8-to-1 multiplexers, 16 column analog-to-digital converters and a master timing control circuit;

the forward input drive circuit registers the forward input values; its channels are connected to the respective input lines and input line bars and to the row enable lines and row enable line bars, so that it is connected through the input lines and input line bars to the control terminals of the input switch and input bar switch of each charge computing unit, and through the row enable lines and row enable line bars to the control terminals of the row enable switch and row enable bar switch of each charge computing unit; each channel of the read-write peripheral circuit is connected to a respective global bit line and global bit line bar; each global bit line is also connected to a corresponding input port of an 8-to-1 multiplexer, whose output port is connected to a column analog-to-digital converter; each row analog-to-digital converter corresponds to one local array, and each local array is connected through its row compute line to the input port of the corresponding row analog-to-digital converter; the word line driver registers the back-propagation input values and also serves as the back-propagation input driver, each of its channels being connected to a corresponding word line; the master timing control circuit is connected to the forward input drive circuit, the read-write peripheral circuit and the word line driver; the master timing control circuit is also connected to the precharge line, and through it to the precharge transistors of each charge computing unit; during computation the word line driver has two configuration modes, namely a forward-propagation word line drive mode and a back-propagation word line drive mode.
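With 128 global bit lines shared among 16 8-to-1 multiplexers, each multiplexer serves 8 global bit lines. The sketch below shows one possible assignment, assuming that consecutive global bit lines are grouped onto the same multiplexer; the description does not state the grouping, so this ordering and the function name are hypothetical.

```python
MUX_INPUTS = 8  # each multiplexer is 8-to-1


def mux_assignment(global_bit_line, num_global_bit_lines=128):
    """Map a global bit line index to (multiplexer index, input port).

    Assumes contiguous grouping: lines 0-7 go to multiplexer 0, and so on.
    """
    if not 0 <= global_bit_line < num_global_bit_lines:
        raise ValueError("global bit line index out of range")
    return divmod(global_bit_line, MUX_INPUTS)


num_muxes = 128 // MUX_INPUTS
print(num_muxes, mux_assignment(0), mux_assignment(127))
```

Whatever the actual grouping, the ratio implies that each column analog-to-digital converter is shared by 8 columns.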

The method for implementing the transposable in-memory computing circuit of this embodiment comprises the following steps:

1) initial state:

2) forward-propagation mode, as shown in FIG. 3:

a) precharge stage: the precharge line is at a low level (0.0 V), so that the local bit line and the local bit line bar are precharged to the precharge voltage (0.9 V) through the precharge transistors in the charge computing unit, as shown in FIG. 3(a);

b) the master timing control circuit applies a high level (0.9 V) to the precharge line, ending the precharge stage; the word line driver then simultaneously applies a high level to the word line and word line bar of the row, in each of the 16 local arrays, whose weight values are to be read; according to the weight value stored in the 6T storage cell on the selected word line and word line bar, the local bit line and the local bit line bar are each either discharged to ground or kept at the precharge voltage; the word line and word line bar are then returned to a low level, completing the weight read operation, as shown in FIG. 3(b);

c) according to the registered forward input value, the forward input drive circuit applies a high level to the input line or input line bar of each channel to control the input switch or input bar switch, which is closed when the forward input value is 1 and open when it is 0; the closed input switch or input bar switch discharges the corresponding local bit line or local bit line bar to ground;

the input lines and input line bars that were driven high are then returned to a low level, completing the multiplication of the forward input value and the weight value, as shown in FIG. 3(c);

d) the forward input drive circuit applies a high level to all row enable lines and row enable line bars, closing the row enable switches and row enable bar switches; the results of multiplying the 128 forward input values by the corresponding weight values are accumulated on the row compute line and transmitted to the row analog-to-digital converter, which quantizes them, as shown in FIG. 3(d);

3) back-propagation mode, as shown in FIG. 4:

a) precharge stage: the precharge line is at a low level, so that the local bit line and the local bit line bar are precharged to the precharge voltage (0.9 V) through the precharge transistors in the charge computing unit, as shown in FIG. 4(a);

b) the master timing control circuit applies a high level to the precharge line, ending the precharge stage; according to the registered back-propagation input value, the word line driver either applies a high level to the word line of the row in the local array whose weight value is to be read or keeps that word line at a low level: if the back-propagation input value is 1, the word line is at a high level (0.9 V), and if it is 0, the word line stays at a low level; if the back-propagation input value and the weight value are both 1, the local bit line is discharged to ground, otherwise the local bit line keeps the precharge voltage; the word lines that were driven high are then returned to a low level, completing the multiplication of the back-propagation input value and the weight value, as shown in FIG. 4(b);

c) the master timing control circuit applies a high level to all input lines, closing the input switches, so that all local bit lines located in the same column of the different local arrays are connected together and the accumulation is completed; the result is transmitted to the column analog-to-digital converters, which quantize it, as shown in FIG. 4(c).
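The forward-mode steps of this embodiment can be written down as a sequence of control-signal levels. The sketch below is a simplified reading of steps a) to d) for a single storage-compute column: each signal and its complementary (bar) line are collapsed into one entry, the input line is assumed to go high only when the forward input value is 1, and the signal names are hypothetical. Only the 0.0 V and 0.9 V levels are taken from the embodiment.

```python
V_LOW, V_HIGH = 0.0, 0.9  # levels quoted in the embodiment, in volts


def forward_phases(forward_input):
    """Return (phase, signal levels) pairs for one forward-mode operation.

    forward_input: forward input bit (0 or 1) for this column.
    """
    input_level = V_HIGH if forward_input == 1 else V_LOW
    # Levels outside the active window of each signal. The precharge line is
    # active low, so it sits at the high level once precharging has ended.
    idle = {"precharge": V_HIGH, "word_line": V_LOW,
            "input": V_LOW, "row_enable": V_LOW}
    return [
        ("a) precharge", {**idle, "precharge": V_LOW}),
        ("b) weight read", {**idle, "word_line": V_HIGH}),
        ("c) multiply", {**idle, "input": input_level}),
        ("d) accumulate", {**idle, "row_enable": V_HIGH}),
    ]


phases = forward_phases(1)
for name, levels in phases:
    print(name, levels)
```

The back-propagation mode could be tabulated in the same way, with the word line level depending on the back-propagation input value and the input line driven high unconditionally in step c).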

Finally, it is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. A transposable memory computing circuit, wherein the transposable memory computing circuit comprises a transposable memory computing array and peripheral circuitry;

2. The transposable memory computing circuit of claim 1, wherein the precharge voltage is 0.7-0.9 V; the low level is 0.0 V, and the high level is 0.7-0.9 V.

3. The transposable memory computing circuit of claim 1, wherein N satisfies 1 ≤ N ≤ 4, M satisfies 1 ≤ M ≤ 9, and K satisfies 1 ≤ K ≤ 4.

4. A method for implementing the transposable in-memory computing circuit of claim 1, wherein the method comprises the following steps:

1) initial state:

2) a forward mode:

3) back-propagation mode:

4) c) the master timing control circuit applies a high level to all input lines, closing the input switches, so that all local bit lines located in the same column of the different local arrays are connected together and the accumulation is completed; the result is transmitted to the column analog-to-digital converters, which quantize it and output 16 × M column outputs, thereby realizing the transposed computation.

5. The implementation method of claim 4, wherein in step 2) c), when the forward input value is 1, the input switch and the input bar switch are closed; when the forward input value is 0, the input switch and the input bar switch are open.

6. The implementation method of claim 4, wherein in step 3) b), when the back-propagation input value is 1, the word line is driven to a high level, and when the back-propagation input value is 0, the word line is kept at a low level; when the back-propagation input value and the weight value are both 1, the local bit line is discharged to ground; otherwise the local bit line keeps the precharge voltage.