CN112926733B - Special chip for voice keyword detection - Google Patents


Info

Publication number
CN112926733B
Authority
CN
China
Prior art keywords: module, input, layer calculation, neural network, layer
Legal status: Active
Application number
CN202110265116.8A
Other languages: Chinese (zh)
Other versions: CN112926733A
Inventors: 黄科杰, 杨树园, 陆凯晨, 沈海斌
Current Assignee: Zhejiang University ZJU; Zhejiang Lab
Original Assignee: Zhejiang University ZJU; Zhejiang Lab
Application filed by Zhejiang University ZJU and Zhejiang Lab
Priority to CN202110265116.8A
Publication of CN112926733A
Application granted
Publication of CN112926733B

Classifications

    • G06N 3/02 — Neural networks
    • G06N 3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/045 — Architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/08 — Learning methods
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a chip dedicated to voice keyword detection. The chip comprises a control module, a clock generation module, an MFCC feature extraction module, a streaming neural network acceleration module, an input interface and an output interface; the control module is respectively connected with the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface; the streaming neural network acceleration module comprises a residual pre-convolution layer calculation module, three residual layer calculation modules, an average pooling layer module and a full-connection layer calculation module which are connected in sequence. The invention designs a voice keyword detection chip architecture that supports a lightweight streaming convolutional neural network with a small on-chip memory capacity; the layer calculations form a pipeline, realizing a low-power, real-time voice keyword detection task that recognizes many voice keywords with high accuracy.

Description

Special chip for voice keyword detection
Technical Field
The invention relates to a streaming convolutional neural network chip, in particular to a chip dedicated to voice keyword detection, and involves in-memory computing technology and the hardware-software co-design of a streaming convolutional neural network algorithm and a chip architecture.
Background
In the conventional von Neumann architecture, the computing unit is separated from the memory: the operands involved in a computation must first be read from memory and sent to the computing unit, and after the computation completes the result is written back to memory. In this process, most of the energy is consumed by the memory access operations and the computing unit's operations. Unlike the von Neumann architecture, in-memory computing embeds computation in the memory, which performs computation in addition to storage, greatly reducing the energy consumed by data movement and memory access; moreover, the computation is implemented by analog circuits, greatly reducing computation power consumption.
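To make the contrast concrete, the following minimal behavioral sketch (Python, with illustrative sizes; not the chip's circuit) models how an in-memory array performs a matrix-vector multiply in place: the weights live in the array as conductances, and applying the input as word line voltages yields all column outputs in one step.

```python
import numpy as np

# Behavioral sketch of an in-memory matrix-vector multiply: weights are
# stored as conductances G; driving the word lines with voltages V yields
# per-column currents I = V @ G (Kirchhoff's current law), so the multiply
# happens where the data is stored. Array sizes are illustrative.
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, size=(64, 16))   # conductance array (rows x cols)
V = rng.uniform(0.0, 0.5, size=64)         # input voltages on word lines

I = V @ G                                  # one analog MAC per column
print(I.shape)                             # (16,)
```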
Neural network algorithms show excellent performance in fields such as image recognition and natural language processing. Current neural-network-based voice keyword detection algorithms mainly include multilayer perceptrons, recurrent neural networks and convolutional neural networks. The multilayer perceptron has too many parameters and too much computation; the control logic of recurrent and convolutional neural networks is not concise, and they require large partial-sum storage. The streaming convolutional neural network algorithm is a special convolutional neural network whose convolution kernel moves in only one direction during convolution. When the algorithm is implemented in hardware, because the input features stream into the hardware, each partial sum that is generated can quickly be accumulated with the next one without long-term storage, so the partial-sum storage required is greatly reduced and the control logic is greatly simplified.
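A minimal sketch of this streaming property (Python; the kernel values and the CNN-style no-flip convention are illustrative) shows that only len(kernel)−1 partial sums ever need to be held:

```python
import numpy as np

def stream_conv1d(samples, kernel):
    """1-D streaming convolution: the kernel slides in one direction only,
    so each incoming sample extends or completes a few partial sums; only
    len(kernel)-1 partial sums are ever stored."""
    w = len(kernel)
    partial = np.zeros(w - 1)          # the only state the hardware keeps
    outputs = []
    for x in samples:
        sums = np.concatenate([partial, [0.0]]) + x * kernel[::-1]
        outputs.append(sums[0])        # oldest partial sum is now complete
        partial = sums[1:]             # the rest wait for future samples
    return np.array(outputs)

# out[t] = k2*x[t] + k1*x[t-1] + k0*x[t-2] (CNN convention, no kernel flip)
print(stream_conv1d(np.arange(8.0), np.array([1.0, 2.0, 3.0])))
```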
Voice keyword detection detects whether specified keywords appear in a segment of a voice signal and where they appear; it is a subfield of speech recognition. It has rich application scenarios: it can be applied to waking up and controlling intelligent Internet-of-Things devices, and also to identifying sensitive words in criminal investigation or public security. A voice keyword detection algorithm must be deployed on hardware that is kept always-on so that it can respond in time, so low power consumption and real-time performance are very important in a voice keyword detection hardware system. Existing voice keyword detection chips leave room for improvement in the number of recognized keywords, recognition accuracy, power consumption, latency and other aspects.
Disclosure of Invention
In order to solve the problems in the background art, the invention aims to design a chip dedicated to voice keyword detection that exploits the low power consumption and high performance-per-watt of in-memory computing.
Based on in-memory computing technology and hardware-software co-optimization, the invention designs a dedicated chip supporting the lightweight streaming convolutional neural network algorithm TC-resnet8, implements all on-chip memory with register groups, and realizes low-power, real-time voice keyword detection.
The technical scheme adopted by the invention is as follows:
S01: erase resistances: erase all resistance values in the in-memory computation array of the chip, restoring them to their initial resistance values;
S02: read resistances: under the action of the control module, input vectors to the in-memory computation array and execute computations to obtain the resistance values read from the in-memory computation array;
S03: mapping optimization: according to the resistance values read in S02, use a network mapping algorithm to set the resistance values to be written into the in-memory computation array; these resistance values represent the parameter values of the current neural network;
S04: write resistances: send the mapping-optimized resistance values of S03 to the chip through the input interface and write them into the corresponding positions of the in-memory computation array;
S05: in normal operation, the chip receives a voice signal from the input interface, processes it, and outputs the detection result from the output interface.
The chip comprises a control module, a clock generation module, an MFCC feature extraction module, a streaming neural network acceleration module, an input interface and an output interface;
the input interface is respectively connected with the MFCC feature extraction module, the streaming neural network acceleration module, the clock generation module and the control module; the input interface receives resistance values input from outside the chip and sends them to the streaming neural network acceleration module to be assigned to its internal in-memory computation array, receives the voice signal input from outside the chip and sends it to the MFCC feature extraction module, and receives the high-frequency Clock_sys input from outside the chip and sends it to the clock generation module;
the output interface is respectively connected with the streaming neural network acceleration module, the clock generation module and the control module; the output interface receives the initial resistance value from the streaming neural network acceleration module and outputs the initial resistance value to the outside of the chip, and receives the prediction result from the streaming neural network acceleration module and outputs the prediction result to the outside of the chip;
the MFCC feature extraction module is respectively connected with the input interface, the clock generation module, the control module and the streaming neural network acceleration module; the MFCC feature extraction module receives the voice signals from the input interface, processes the voice signals to generate voice features, and receives control signals from the control module to control the work;
the clock generation module is respectively connected with the input interface, the MFCC feature extraction module, the control module, the streaming neural network acceleration module and the output interface; the clock generation module receives the high-frequency Clock_sys from the input interface and divides its frequency with counters to generate the first clock signal group Clock_CIM and second clock signal group Clock_dig required by the streaming neural network acceleration module, as well as the clock signals required by the input interface, the output interface, the MFCC feature extraction module and the control module; it sends the clock signal groups Clock_CIM and Clock_dig to the streaming neural network acceleration module and correspondingly sends the respective clock signals to the input interface, the output interface, the MFCC feature extraction module and the control module;
the control module is respectively connected with the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface; the control module generates control signals of the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface, and then correspondingly sends the respective control signals to the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface;
the streaming neural network acceleration module is respectively connected with the input interface, the MFCC feature extraction module, the clock generation module, the control module and the output interface; the streaming neural network acceleration module sends the initial resistance values of its internal in-memory computation array to the output interface, and writes the resistance values received from the input interface into the array; it receives voice features from the MFCC feature extraction module, processes them to generate a prediction result, and sends the prediction result to the output interface.
The streaming neural network acceleration module is mainly formed by sequentially connecting a residual pre-convolution layer calculation module, a first residual layer calculation module, a second residual layer calculation module, a third residual layer calculation module, an average pooling layer module and a full-connection layer calculation module;
the residual pre-convolution layer calculation module receives the voice features sent by the MFCC feature extraction module, is responsible for executing the first convolution layer, generates the first convolution layer's calculation result, and sends it to the first residual layer calculation module;
the first residual layer calculation module is responsible for executing a first residual layer, generating a first residual layer calculation result and sending the first residual layer calculation result to the second residual layer calculation module;
the second residual error layer calculation module is responsible for executing a second residual error layer, generating a second residual error layer calculation result and sending the second residual error layer calculation result to the third residual error layer calculation module;
the third residual error layer calculation module is responsible for executing a third residual error layer, generating a third residual error layer calculation result and sending the third residual error layer calculation result to the average pooling module;
the average pooling layer module is responsible for executing the average pooling layer, generating a result of the average pooling layer and sending the result to the full-connection layer calculating module;
and the full-connection layer calculation module is responsible for executing the full-connection layer, generating a full-connection layer calculation result and sending the full-connection layer calculation result to the output interface.
The average pooling layer module comprises an adder, a register group and a divider which are connected in sequence, and the adder is connected with the register group; the third residual layer calculation module outputs a third residual layer calculation result to the adder, and the adder adds the third residual layer calculation result and an addition result stored in the register group and outputs and stores the addition result in the register group; the register group sends the addition result stored in the register group to the divider, the divider executes division on the addition result in the register group to obtain a division result, and then the division result is sent to the full-connection layer calculation module.
The first residual layer calculation module, the second residual layer calculation module and the third residual layer calculation module have the same topological structure and each comprises a first convolution layer calculation module, a second convolution layer calculation module, a third convolution layer calculation module, a multi-stage register and an adder; the first convolution layer calculation module is respectively connected with the second convolution layer calculation module and the third convolution layer calculation module, the third convolution layer calculation module is connected to the adder through the multi-stage register, and the second convolution layer calculation module is directly connected to the adder;
the first, second and third convolution layer calculation modules respectively execute the first, second and third convolution layers of the residual layer. The result output by the previous residual layer calculation module, or by the residual pre-convolution layer calculation module, is input to the first convolution layer calculation module, whose calculation result is output to the second convolution layer calculation module; the first convolution layer calculation module also delays the input received on its data input interface D_in and outputs the delayed input from its data delay output interface Reg_out to the third convolution layer calculation module. The calculation result of the third convolution layer calculation module is output to the multi-stage register for delaying, and the delayed result of the multi-stage register is output to the adder; the calculation result of the second convolution layer calculation module is output to the adder, where it is added to the delayed result of the multi-stage register and output as the final result of the residual layer calculation module.
The convolution layer calculation module mainly comprises an in-memory computation core, a shift register group, a first multiplexer MUX1, a second multiplexer MUX2, a rectification circuit Regu, a capacitor-resistor circuit CR and an analog-to-digital converter ADC;
the shift register group receives input data from the MFCC feature extraction module on its data input interface D_in and stores it; it mainly consists of a number of registers reg, each of which is connected, through a first multiplexer MUX1 and a second multiplexer MUX2, to an input interface of the in-memory computation core. Each output interface of the in-memory computation core is connected to a rectification circuit Regu; among all rectification circuits Regu, a fixed number of them form a rectification circuit group, every rectification circuit Regu within one group is connected to the same capacitor-resistor circuit CR, and rectification circuits Regu of different groups are connected to different capacitor-resistor circuits CR; the output terminals of the capacitor-resistor circuits CR are connected to respective analog-to-digital converters ADC.
The register reg outputs its multi-bit data to the first multiplexer MUX1; the first multiplexer MUX1 traverses the bit address Baddr to select one bit at a time from the register's multi-bit data and inputs it to the second multiplexer MUX2, completing the serial input of the input data; the second multiplexer MUX2 converts the one-bit data received from the first multiplexer MUX1 from a digital level to an analog level and inputs it to one of the input interfaces of the in-memory computation core;
each output interface of the in-memory computation core outputs a current signal to a rectification circuit Regu, and the outputs of the rectification circuits Regu of the same group are sent to the same capacitor-resistor circuit CR for integration;
the topological structure of the full-connection layer calculation module is basically the same as that of the convolution layer calculation module; specifically, on the basis of the convolution layer calculation module's topology, the shift register group is replaced by a register group.
The in-memory computation core comprises a write circuit A, a write circuit B, an in-memory computation array, n first multiplexers MUX1 and n second multiplexers MUX2; the write circuit A is connected to the word line ports of the in-memory computation array through the n first multiplexers MUX1, and the write circuit B is connected to the bit line ports of the in-memory computation array through the n second multiplexers MUX2; the in-memory computation array stores a weight matrix W, an input vector X input into the in-memory computation core is multiplied by the weight matrix, and the resulting output vector Y is output;
the write circuit A receives a row address signal ROW_addr and a write enable signal WE sent by the control module, and its internal row decoder then outputs an n-bit row strobe signal, each bit of which is input to one input of a first multiplexer MUX1; the write circuit B receives a column address signal COL_addr and write data W_data sent by the control module and outputs an m-bit column strobe signal, each bit of which is input to one input of a second multiplexer MUX2. Each input signal X_0, X_1, ..., X_{n-1} of the input vector X is input to the other input of one of the n first multiplexers MUX1; under the action of the same control signal WE, the n first multiplexers output the word line signals WL_0, WL_1, ..., WL_{n-1} into the in-memory computation array; the bit line signals BL_0, BL_1, ..., BL_{n-1} output by the in-memory computation array enter the other inputs of the n second multiplexers, whose outputs Y_0, Y_1, ..., Y_{n-1} form the output vector Y.
The input interface and the output interface receive and send data using an SPI serial protocol; the input interface is also responsible for receiving the externally input clock, which it receives directly rather than through the SPI protocol.
The invention has the following advantages and beneficial effects:
the voice keyword detection chip architecture supporting the lightweight stream convolutional neural network TC-resnet8 is constructed based on in-memory calculation, and the capacity of an on-chip memory is small.
The invention forms a production line among each layer of calculation of the flow convolution neural network, and adopts power gating, clock gating and other low-power consumption technologies, thereby realizing the voice keyword detection task with low power consumption and real time, and having more recognized voice keywords and high recognition accuracy.
Drawings
FIG. 1 is a flowchart of the operation of the chip dedicated to voice keyword detection;
FIG. 2 is the overall architecture diagram of the chip dedicated to voice keyword detection;
FIG. 3 is a structure diagram of the streaming neural network acceleration module;
FIG. 4 is a structure diagram of the average pooling layer module;
FIG. 5 is a structure diagram of a residual layer calculation module;
FIG. 6 shows the calculation modules: (a) structure of the convolution layer calculation module; (b) structure of the full-connection layer calculation module;
FIG. 7 is a structure diagram of the in-memory computation core;
FIG. 8 is a schematic diagram of the stream convolution im2col transform;
FIG. 9 is a structure diagram of the streaming convolutional neural network TC-resnet8.
Detailed Description
The present invention is further illustrated by the following examples.
As shown in fig. 2, the embodied chip includes a control module, a clock generation module, an MFCC feature extraction module, a streaming neural network acceleration module, an input interface, and an output interface;
the input interface and the output interface adopt SPI serial port protocol to respectively receive and send data, the input interface is also responsible for receiving external input clock, and the input interface does not pass through the SPI protocol when receiving the external input clock, but directly receives the external input clock.
The input interface is respectively connected with the MFCC feature extraction module, the streaming neural network acceleration module, the clock generation module and the control module; in the S04 stage the input interface receives the neural network mapping resistance values input from outside the chip and sends them to the streaming neural network acceleration module to be assigned to its internal in-memory computation array; in the S05 stage it receives the voice signal input from outside the chip and sends it to the MFCC feature extraction module; and it receives the high-frequency Clock_sys input from outside the chip and sends it to the clock generation module;
the output interface is respectively connected with the streaming neural network acceleration module, the clock generation module and the control module; the output interface receives the initial resistance value from the streaming neural network acceleration module and outputs the initial resistance value to the outside of the chip in the stage of S03, and receives the prediction result from the streaming neural network acceleration module and outputs the prediction result to the outside of the chip in the stage of S05;
the MFCC feature extraction module is respectively connected with the input interface, the clock generation module, the control module and the streaming neural network acceleration module; the MFCC feature extraction module receives a voice signal from the input interface, generates voice features through the processing of a Mel frequency cepstrum coefficient method, and receives a control signal from the control module to carry out work control;
the clock generation module is respectively connected with the input interface, the MFCC feature extraction module, the control module, the streaming neural network acceleration module and the output interface; the clock generation module receives the high-frequency Clock_sys from the input interface and divides its frequency with counters to generate the first clock signal group Clock_CIM and second clock signal group Clock_dig required by the streaming neural network acceleration module, as well as the clock signals required by the input interface, the output interface, the MFCC feature extraction module and the control module; it sends the clock signal groups Clock_CIM and Clock_dig to the streaming neural network acceleration module and correspondingly sends the respective clock signals to the input interface, the output interface, the MFCC feature extraction module and the control module. The first clock signal group Clock_CIM drives and controls the in-memory computation core, the rectification circuit Regu, the capacitor-resistor circuit CR and the analog-to-digital converter ADC; the second clock signal group Clock_dig drives the remaining circuits of the streaming neural network acceleration module.
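A behavioral sketch of counter-based frequency division of this kind (Python; the divide ratios are illustrative, the patent does not give them):

```python
# Each output clock toggles when a free-running counter of Clock_sys
# rising edges reaches that clock's divide ratio (half-period count).
def divided_clocks(n_cycles, ratios):
    counters = {name: 0 for name in ratios}
    levels = {name: 0 for name in ratios}
    trace = {name: [] for name in ratios}
    for _ in range(n_cycles):              # one Clock_sys rising edge
        for name, ratio in ratios.items():
            counters[name] += 1
            if counters[name] == ratio:    # half-period reached: toggle
                counters[name] = 0
                levels[name] ^= 1
            trace[name].append(levels[name])
    return trace

trace = divided_clocks(16, {"clk_CIM": 4, "clk_dig": 2})
print(trace["clk_dig"])   # toggles every 2 Clock_sys cycles
```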
The control module is respectively connected with the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface; the control module generates control signals of the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface, and then correspondingly sends the respective control signals to the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface so as to control whether each interface/module works or not and the working time period;
as shown in fig. 3, the streaming neural network acceleration module is responsible for performing the inferential computation of the streaming neural network TC-resnet 8.
The streaming neural network acceleration module is respectively connected with the input interface, the MFCC feature extraction module, the clock generation module, the control module and the output interface; in the S03 stage the streaming neural network acceleration module sends the initial resistance values of its internal in-memory computation array to the output interface, and in the S04 stage it writes the neural network mapping resistance values received from the input interface into the array; in the S05 stage it receives voice features from the MFCC feature extraction module, processes them to generate a prediction result, and sends the prediction result to the output interface.
The streaming neural network acceleration module is mainly formed by sequentially connecting a residual pre-convolution layer calculation module, a first residual layer calculation module, a second residual layer calculation module, a third residual layer calculation module, an average pooling layer module and a full-connection layer calculation module;
the residual pre-convolution layer calculation module receives the voice features sent by the MFCC feature extraction module, is responsible for executing the first convolution layer of the streaming neural network TC-resnet8, generates the first convolution layer's calculation result, and sends it to the first residual layer calculation module;
the first residual layer calculation module is responsible for executing the first residual layer of the streaming neural network TC-resnet8, generates the first residual layer's calculation result, and sends it from its data output interface D_out to the data input interface D_in of the second residual layer calculation module;
the second residual layer calculation module is responsible for executing the second residual layer of the streaming neural network TC-resnet8, generates the second residual layer's calculation result, and sends it from its data output interface D_out to the data input interface D_in of the third residual layer calculation module;
the third residual layer calculation module is responsible for executing the third residual layer of the streaming neural network TC-resnet8, generates the third residual layer's calculation result, and sends it from its data output interface D_out to the data input interface D_in of the average pooling layer module;
the average pooling layer module is responsible for executing the average pooling layer of the streaming neural network TC-resnet8, generates the average pooling layer's result, and sends it from its data output interface D_out to the data input interface D_in of the full-connection layer calculation module;
the full-connection layer calculation module is responsible for executing the full-connection layer of the streaming neural network TC-resnet8, generates the full-connection layer's calculation result, and sends it from its data output interface D_out to the output interface.
The residual pre-convolution layer calculation module, the first residual layer calculation module, the second residual layer calculation module, the third residual layer calculation module and the full-connection layer calculation module receive their respective clock signals clk_CIM and clk_dig sent by the clock generation module; the average pooling layer module receives clk_dig sent by the clock generation module; the residual pre-convolution layer calculation module, the three residual layer calculation modules, the full-connection layer calculation module and the average pooling layer module each receive their respective control signals ctrl sent by the control module.
As shown in fig. 4, the average pooling layer module includes an adder, a register set and a divider connected in sequence, and the adder is connected to the register set; the third residual layer calculation module outputs a third residual layer calculation result to the adder, and the adder adds the third residual layer calculation result and an addition result stored in the register group and outputs and stores the addition result in the register group; the register group sends the addition result stored in the register group to the divider, the divider performs division on the addition result in the register group by using a fixed divisor to obtain a division result, and then the division result is sent to the full-connection layer calculation module.
The adder, the register group and the divider each receive their respective control signals ctrl sent by the control module, and all of them receive the clock signal clk_dig sent by the clock generation module.
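As a behavioral sketch of this accumulate-then-divide structure (Python; the channel count and pooling length are assumptions — the patent fixes only that the divisor is constant):

```python
import numpy as np

class AvgPool:
    """Adder accumulates each incoming vector into the register group;
    the divider divides by a fixed pooling length."""
    def __init__(self, channels, pool_len):
        self.acc = np.zeros(channels)   # register group contents
        self.pool_len = pool_len        # fixed divisor of the divider
        self.count = 0

    def push(self, vec):
        self.acc += vec                 # adder: new result + stored sum
        self.count += 1
        if self.count == self.pool_len:
            out = self.acc / self.pool_len   # divider, fixed divisor
            self.acc[:] = 0.0
            self.count = 0
            return out                  # goes to the full-connection layer
        return None                     # still accumulating

pool = AvgPool(channels=32, pool_len=13)
for _ in range(13):
    result = pool.push(np.ones(32))
print(result)                           # all ones: average of 13 identical vectors
```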
The first residual layer calculation module, the second residual layer calculation module and the third residual layer calculation module have the same topological structure; as shown in fig. 5, each comprises a first convolution layer calculation module, a second convolution layer calculation module, a third convolution layer calculation module, a multi-stage register and an adder; the first convolution layer calculation module is respectively connected with the second convolution layer calculation module and the third convolution layer calculation module, the third convolution layer calculation module is connected to the adder through the multi-stage register, and the second convolution layer calculation module is directly connected to the adder;
the first, second and third convolution layer calculation modules respectively execute the first, second and third convolution layers in the residual layer network structure. The result output by the previous residual layer calculation module, or by the residual pre-convolution layer calculation module, is input to the data input interface D_in of the first convolution layer calculation module; the calculation result of the first convolution layer calculation module is output through its data output interface D_out to the data input interface D_in of the second convolution layer calculation module; the first convolution layer calculation module also delays the input received on its own data input interface D_in and outputs it from its data delay output interface Reg_out to the data input interface D_in of the third convolution layer calculation module; the calculation result of the third convolution layer calculation module is output through its data output interface D_out to the multi-stage register for delaying, and the delayed result of the multi-stage register is output to the adder; the calculation result of the second convolution layer calculation module is output through its data output interface D_out to the adder, where it is added to the delayed result of the multi-stage register and output as the final result of the residual layer calculation module.
The first convolution layer calculation module, the second convolution layer calculation module, the third convolution layer calculation module, the multi-stage register and the adder respectively receive respective control signals ctrl from the control module. The first convolutional layer computing module, the second convolutional layer computing module and the third convolutional layer computing module respectively receive clock signals clk _ CIM and clk _ dig from the clock generator. The multi-stage registers and the adder receive a clock signal clk _ dig from the clock generator, respectively.
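The dataflow of one residual layer calculation module can be sketched as follows (Python; each convolution is reduced to a matrix multiply per window, and the delay depths are placeholders chosen only to show how the two paths are re-aligned before the adder):

```python
from collections import deque
import numpy as np

class ResidualLayerDataflow:
    """Main path: conv1 -> conv2. Branch path: delayed input (Reg_out) ->
    conv3 -> multi-stage register. Adder combines the two paths."""
    def __init__(self, dim, d_in=1, d_reg=1):
        rng = np.random.default_rng(1)
        self.w1, self.w2, self.w3 = (rng.standard_normal((dim, dim))
                                     for _ in range(3))
        self.in_delay = deque([np.zeros(dim)] * d_in, maxlen=d_in)   # Reg_out path
        self.ms_reg = deque([np.zeros(dim)] * d_reg, maxlen=d_reg)   # multi-stage register

    def step(self, x):
        main = self.w2 @ (self.w1 @ x)             # conv1 then conv2 (main path)
        x_d = self.in_delay.popleft(); self.in_delay.append(x)
        b = self.w3 @ x_d                          # conv3 on the delayed input
        b_d = self.ms_reg.popleft(); self.ms_reg.append(b)
        return main + b_d                          # adder: final residual output

layer = ResidualLayerDataflow(dim=8)
print(layer.step(np.ones(8)).shape)                # (8,)
```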
The residual pre-convolution layer calculation module and the first, second and third convolution layer calculation modules all have the same topological structure. As shown in fig. 6(a), the convolution layer calculation module mainly comprises an in-memory computation core, a shift register group, a first multiplexer MUX1, a second multiplexer MUX2, a rectification circuit Regu, a capacitor-resistor circuit CR and an analog-to-digital converter ADC;
the shift register group receives input data from the MFCC feature extraction module on its data input interface D_in and stores it; it mainly consists of a number of registers reg. Each register reg is connected, through a first multiplexer MUX1 and a second multiplexer MUX2, to an input interface of the in-memory computation core; the numbers of registers reg, first multiplexers MUX1, second multiplexers MUX2 and input interfaces of the in-memory computation core are all the same. Each output interface of the in-memory computation core is connected to a rectification circuit Regu; among all rectification circuits Regu, a fixed number of them form a rectification circuit group, every rectification circuit Regu within one group is connected to the same capacitor-resistor circuit CR, and rectification circuits Regu of different groups are connected to different capacitor-resistor circuits CR; the output terminal of each capacitor-resistor circuit CR is connected to its own analog-to-digital converter ADC, which produces the output.
The register reg outputs its multi-bit data to the first multiplexer MUX1; the first multiplexer MUX1 traverses the bit address Baddr to select one bit at a time from the register's multi-bit data and inputs it to the second multiplexer MUX2, completing the serial input of the input data; the second multiplexer MUX2 converts the one-bit data received from the first multiplexer MUX1 from a digital level to an analog level and inputs it to one of the input interfaces of the in-memory computation core;
each output interface of the in-memory computation core outputs a current signal to a rectification circuit Regu, and the outputs of the rectification circuits Regu of the same group are sent to the same capacitor-resistor circuit CR for integration; the rectification circuit Regu is responsible for voltage regulation, the capacitor-resistor circuit CR performs the integration operation, and the analog-to-digital converter ADC converts the analog signal output by the capacitor-resistor circuit CR into a digital signal.
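This bit-serial scheme can be modeled digitally as follows (Python sketch; the bit width, MSB-first order and binary weighting are assumptions — on the chip the weighting and accumulation happen in the rectifier/RC/ADC chain rather than in digital logic):

```python
import numpy as np

def bit_serial_mvm(x_int, G, n_bits=8):
    """MUX1 walks the bit address Baddr, applying one bit of every input
    word per cycle; per-bit results are combined with binary weighting."""
    acc = np.zeros(G.shape[1])
    for baddr in range(n_bits - 1, -1, -1):   # traverse bit addresses, MSB first
        bits = (x_int >> baddr) & 1           # MUX1: one bit per input word
        acc = 2.0 * acc + bits @ G            # accumulate with weight 2^baddr
    return acc

rng = np.random.default_rng(2)
x = rng.integers(0, 256, size=16)
G = rng.standard_normal((16, 4))
assert np.allclose(bit_serial_mvm(x, G), x @ G)   # matches direct multiply
```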
In the convolution layer calculation module, because the input data of two adjacent calculations of the convolution layer overlap, a shift register group is adopted, and it performs shift operations of different widths according to the convolution stride. On each update the shift register group shifts out the data that is no longer needed, keeps the data shared by adjacent calculations, and shifts in the updated data: each register reg shifts one data word to the next register reg, the last register reg shifts its word out to be discarded, and the first register reg shifts in the newly received word.
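A minimal sketch of that window update (Python; the register count and stride are illustrative):

```python
from collections import deque

class ShiftRegisterGroup:
    """Each new convolution window reuses most of the previous one: only
    `stride` new words are shifted in, and the oldest words fall off."""
    def __init__(self, n_regs, stride):
        self.regs = deque([0] * n_regs, maxlen=n_regs)  # maxlen drops oldest
        self.stride = stride

    def update(self, new_words):
        assert len(new_words) == self.stride
        for w in new_words:
            self.regs.append(w)        # oldest word discarded automatically
        return list(self.regs)         # current convolution window

srg = ShiftRegisterGroup(n_regs=8, stride=2)
srg.update([1, 2])
print(srg.update([3, 4]))              # window now ends ... 1, 2, 3, 4
```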
The topological structure of the full-connection layer calculation module is basically the same as that of the convolution layer calculation module: on the basis of the convolution layer calculation module's topology, the shift register group is replaced by a register group, as shown in fig. 6(b), and the register group stores the input data. Because the input data of two adjacent calculations of the full-connection layer do not overlap, an ordinary register group is used instead of a shift register group; the remaining structure is identical to the convolution layer calculation module.
The structure of the in-memory computation core is shown in fig. 7 and comprises a write circuit A, a write circuit B, an in-memory computation array, n first multiplexers MUX1 and n second multiplexers MUX2; the write circuit A is connected to the word line ports of the in-memory computation array through the n first multiplexers MUX1, and the write circuit B is connected to the bit line ports of the in-memory computation array through the n second multiplexers MUX2; the in-memory computation array stores a weight matrix W, the input vector X input into the in-memory computation core is multiplied by the weight matrix, and the resulting output vector Y is output;
specifically, the write circuit A receives the row address signal ROW_addr and write enable signal WE sent by the control module, and its internal row decoder then outputs an n-bit row strobe signal, each bit of which is input to one input of a respective first multiplexer MUX1; the write circuit B receives the column address signal COL_addr and write data W_data sent by the control module and outputs an m-bit column strobe signal, each bit of which is input to one input of a respective second multiplexer MUX2. Each input signal X_0, X_1, ..., X_{n-1} of the input vector X is input to the other input of one of the n first multiplexers MUX1; under the action of the same control signal WE, the n first multiplexers output the word line signals WL_0, WL_1, ..., WL_{n-1} into the in-memory computation array; the bit line signals BL_0, BL_1, ..., BL_{n-1} output by the in-memory computation array enter the other inputs of the n second multiplexers, whose outputs Y_0, Y_1, ..., Y_{n-1} form the output vector Y. The resistance value at a specified position of the in-memory computation array is programmed through the row strobe and column strobe signals.
X_0, X_1, ..., X_{n-1} are the input signals/interfaces of the in-memory computation core and form its input vector X; Y_0, Y_1, ..., Y_{n-1} are the output signals/interfaces of the in-memory computation core and form its output vector Y. WL_0, WL_1, ..., WL_{n-1} form the word line signals WL of the in-memory computation array, and BL_0, BL_1, ..., BL_{n-1} form its bit line signals BL.
During the resistance writing of S04, the first multiplexer MUX1 selects the output of write circuit A as the value of WL, and the second multiplexer MUX2 selects the output of write circuit B as the value of BL.
During normal operation in S05 and the resistance reading of S02, the first multiplexer MUX1 selects the input vector X as the value of WL, and the second multiplexer MUX2 selects BL as the value of the output vector Y.
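The two modes can be summarized in a behavioral model (Python; the resistance is treated as an ideal weight, ignoring analog effects):

```python
import numpy as np

class CIMCore:
    """Write mode: the multiplexers route the write circuits' row/column
    strobes onto WL/BL to program one cell. Compute mode: they route the
    input vector onto the word lines and the bit line results out."""
    def __init__(self, n, m):
        self.R = np.zeros((n, m))            # in-memory computation array

    def write(self, row_addr, col_addr, w_data):
        # S04: write circuits A/B strobe one cell via MUX1/MUX2
        self.R[row_addr, col_addr] = w_data

    def compute(self, x):
        # S05/S02: MUX1 drives WL with X; MUX2 selects BL as outputs Y
        return x @ self.R

core = CIMCore(8, 4)
core.write(0, 0, 0.5)
print(core.compute(np.ones(8)))              # [0.5, 0.0, 0.0, 0.0]
```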
When the streaming neural network acceleration module executes the streaming neural network, its layers execute as a pipeline. Because the feature map size and convolution stride differ from layer to layer, the control module schedules the calculation frequency of the different modules so that the execution time of every pipeline stage is consistent. The specific scheduling takes the frame shift of x ms as the basic time interval:
the residual pre-convolution layer calculation module executes one calculation, i.e. completes one convolution window, every x ms; the first residual layer calculation module executes one calculation every 2x ms; the second residual layer calculation module executes one calculation every 4x ms; the third residual layer calculation module, the adder, the divider and the full-connection layer calculation module execute one calculation every 8x ms (one calculation of a residual layer calculation module corresponds to one calculation of each of its three internal convolution layer calculation modules). The whole streaming neural network acceleration module outputs a prediction result every 8x ms.
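The cadence can be sketched as follows (Python; the modulo tests are an illustrative reading of the schedule above, with intervals counted in units of x ms):

```python
def active_stages(i):
    """Which pipeline stages fire in interval i (one interval = x ms)."""
    stages = ["pre-conv (every x ms)"]    # fires every interval
    if i % 2 == 0: stages.append("residual layer 1 (every 2x ms)")
    if i % 4 == 0: stages.append("residual layer 2 (every 4x ms)")
    if i % 8 == 0: stages.append("residual layer 3 + pooling + FC (every 8x ms)")
    return stages

for i in range(8):
    print(i, active_stages(i))            # a prediction completes once per 8 intervals
```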
After a voice signal input to the chip passes through the input interface, it first enters the MFCC feature extraction module, which outputs voice features to the streaming neural network acceleration module once every x ms. Within each x ms interval, the streaming neural network acceleration module determines, according to the control signals generated by the control module, whether each of its internal modules executes a calculation in the current interval:
if the calculation is executed, activating from a dormant state, starting a power supply of a calculation array in the memory, executing the calculation, and entering the dormant state after the calculation is finished;
if the calculation is not required to be executed, continuously keeping the dormant state;
in the dormant state, the power supply of the computing array in the memory is turned off, namely power gating, the peripheral digital circuit performs clock gating and low-voltage power supply, and the overall power consumption of the chip is extremely low. In each time interval xms, because the calculation core in the memory has high calculation throughput and high calculation speed, the time required by calculation is extremely short, and therefore, most of the time of the chip is in a low-power-consumption sleep state.
The process of the method is shown in figure 1:
S01: erase resistances: erase all resistance values in the in-memory computation array of the chip, restoring them to their initial resistance values;
S02: read resistances: under the action of the control module, input vectors to the in-memory computation array, execute computations, and obtain through the output interface the resistance values read from the in-memory computation array;
S03: mapping optimization: according to the resistance values read in S02, use a network mapping algorithm to set the resistance values to be written into the in-memory computation array; these serve as the neural network mapping resistance values and represent the parameter values of the current neural network;
S04: write resistances: send the mapping-optimized resistance values of S03 to the chip through the input interface and write them into the corresponding positions of the in-memory computation array;
S05: in normal operation, the chip receives a voice signal from the input interface; the MFCC feature extraction module processes it into voice features that are input to the streaming neural network acceleration module, which processes the voice features to generate a prediction result and outputs it from the output interface.
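Seen from the host side, the S01–S05 flow might look like the following sketch (Python; the SpiLink class and its methods are hypothetical placeholders — the patent specifies only that the interfaces use an SPI serial protocol):

```python
class SpiLink:
    """Hypothetical stand-in for a real SPI driver; methods are stubs."""
    def command(self, name):          print("cmd:", name)
    def read_resistances(self):       return [0.0]        # stub read-back
    def write_resistances(self, r):   print("wrote", len(r), "values")
    def send_audio(self, frame):      pass
    def read_prediction(self):        return "keyword_0"  # stub result

def deploy_and_run(spi, weights, audio_stream, map_to_resistance):
    spi.command("ERASE")                             # S01: erase to initial resistances
    measured = spi.read_resistances()                # S02: read back actual values
    r_target = map_to_resistance(weights, measured)  # S03: mapping optimization
    spi.write_resistances(r_target)                  # S04: program the CIM array
    for frame in audio_stream:                       # S05: normal operation
        spi.send_audio(frame)
        yield spi.read_prediction()

preds = deploy_and_run(SpiLink(), weights=[1.0], audio_stream=[b""],
                       map_to_resistance=lambda w, m: w)
print(list(preds))
```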
The specific embodiment is as follows:
The network mapping scheme of the invention is as follows. Taking a convolution layer of TC-resnet8 as an example, as shown in fig. 8, the input feature map size is 1×c×t and the weight size is 1×h×w×k, where c is the number of input channels, t is the input feature map width, w is the weight width and k is the number of output channels. After the im2col transformation, the weights become a matrix of size (c×w)×k, corresponding to the weight matrix of the in-memory computation core, and each convolution window of the input feature map becomes a vector of size 1×(c×w), corresponding to the input vector of the in-memory computation core; the calculation of one convolution window thus becomes a vector-matrix multiplication that the in-memory computation core can execute directly. The in-memory computation array uses differential computation, each signed weight of n bits corresponding to 2×(n−1) resistors, so the two-dimensional weight matrix corresponds to an in-memory computation array of size (w×c+1)×(2×k×(n−1)), with the (w×c+1)-th row storing the weight bias. The shift register group comprises (w×c+1) registers of size n. Because the calculations of two adjacent convolution windows reuse data, as shown by the hatched overlapping portion in fig. 8, the shift register group shifts by a distance h each time when the stride is 1. The (w×c+1)-th register stores the input bias and does not participate in shifting. Before each calculation, the shift register group updates the input data by shifting, then inputs the data into the in-memory computation core through the first multiplexer MUX1 and second multiplexer MUX2 of fig. 6; after the in-memory computation core completes the calculation, the analog-to-digital converter ADC outputs the result in digital form.
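A compact sketch of this im2col mapping (Python/NumPy; the (k, c, w) weight layout and all sizes are illustrative, and the bias row is omitted for brevity):

```python
import numpy as np

def im2col_stream(feat, weight, stride):
    """feat: (c, t) feature map; weight: (k, c, w) kernels. Each convolution
    window becomes a 1 x (c*w) vector and the kernels a (c*w) x k matrix,
    so one window is one vector-matrix multiply on the CIM core."""
    c, t = feat.shape
    k, _, w = weight.shape
    Wmat = weight.reshape(k, c * w).T          # (c*w) x k weight matrix
    cols = [feat[:, i:i + w].reshape(-1)       # one window -> 1 x (c*w)
            for i in range(0, t - w + 1, stride)]
    return np.stack(cols) @ Wmat               # (num windows) x k outputs

rng = np.random.default_rng(3)
y = im2col_stream(rng.standard_normal((4, 10)),
                  rng.standard_normal((8, 4, 3)), stride=1)
print(y.shape)   # (8 windows, 8 output channels)
```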
The invention optimizes the streaming convolutional neural network TC-resnet8: on the premise of preserving accuracy, the channel counts are adjusted to a uniform value, which simplifies the control logic, and the weights are quantized to a low bit width using neural network quantization, which reduces storage and computation. The structure of the optimized TC-resnet8 network is shown in fig. 9, where white squares represent input/output feature maps, shaded squares represent weights, and the numbers give the sizes of the respective dimensions. TC-resnet8 comprises a first convolution layer, three residual layers, an average pooling layer and a full-connection layer. The first convolution layer has stride 1 and convolution kernel width 3. Each residual layer divides into a main path and a branch path: the main path comprises a first convolution layer with stride 2 and a second convolution layer with stride 1, both with convolution kernel width 9; the branch path comprises a third convolution layer with stride 2 and convolution kernel width 1. The output feature maps of the main path and the branch path have exactly the same size, and the residual layer's output feature map is obtained by element-wise addition of the two feature maps.
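For orientation, the described topology can be sketched as follows (Python/PyTorch; the uniform channel count, MFCC dimension, class count, padding and activation placement are assumptions — the patent fixes only the kernel widths and strides):

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Main path: width-9 convs, strides 2 and 1; branch: width-1, stride 2."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, 9, stride=2, padding=4)
        self.conv2 = nn.Conv1d(ch, ch, 9, stride=1, padding=4)
        self.conv3 = nn.Conv1d(ch, ch, 1, stride=2)   # branch path
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv2(self.relu(self.conv1(x))) + self.conv3(x))

class TCResNet8Sketch(nn.Module):
    def __init__(self, mfcc_dim=40, ch=32, n_classes=12):
        super().__init__()
        self.pre = nn.Conv1d(mfcc_dim, ch, 3, stride=1, padding=1)  # width-3 pre-conv
        self.res = nn.Sequential(*[ResidualLayer(ch) for _ in range(3)])
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(ch, n_classes)

    def forward(self, x):                 # x: (batch, mfcc_dim, frames)
        h = self.pool(self.res(self.pre(x))).squeeze(-1)
        return self.fc(h)

print(TCResNet8Sketch()(torch.randn(1, 40, 101)).shape)   # torch.Size([1, 12])
```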

Claims (9)

1. A special chip for voice keyword detection is characterized in that:
the system comprises a control module, a clock generation module, an MFCC feature extraction module, a streaming neural network acceleration module, an input interface and an output interface;
the input interface is respectively connected with the MFCC feature extraction module, the streaming neural network acceleration module, the clock generation module and the control module; the input interface receives resistance values input from outside the chip and sends them to the in-memory computation array inside the streaming neural network acceleration module, receives the voice signal input from outside the chip and sends it to the MFCC feature extraction module, and receives the high-frequency Clock_sys input from outside the chip and sends it to the clock generation module;
the output interface is respectively connected with the streaming neural network acceleration module, the clock generation module and the control module; the output interface receives the initial resistance value from the streaming neural network acceleration module and outputs the initial resistance value to the outside of the chip, and receives the prediction result from the streaming neural network acceleration module and outputs the prediction result to the outside of the chip;
the MFCC feature extraction module is respectively connected with the input interface, the clock generation module, the control module and the streaming neural network acceleration module; the MFCC feature extraction module receives the voice signal from the input interface, generates voice features through processing, and receives a control signal from the control module to perform work control;
the clock generation module is respectively connected with the input interface, the MFCC feature extraction module, the control module, the streaming neural network acceleration module and the output interface; the clock generation module receives the high-frequency Clock_sys from the input interface, performs counter-based frequency division to generate the first clock signal group Clock_CIM and second clock signal group Clock_dig required by the streaming neural network acceleration module and the clock signals required by the input interface, the output interface, the MFCC feature extraction module and the control module, sends the clock signal groups Clock_CIM and Clock_dig to the streaming neural network acceleration module, and correspondingly sends the respective clock signals to the input interface, the output interface, the MFCC feature extraction module and the control module;
the control module is respectively connected with the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface; the control module generates control signals of the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface, and then correspondingly sends the respective control signals to the input interface, the MFCC feature extraction module, the clock generation module, the streaming neural network acceleration module and the output interface;
the streaming neural network acceleration module is respectively connected with the input interface, the MFCC feature extraction module, the clock generation module, the control module and the output interface; the streaming neural network acceleration module sends the initial resistance values of its internal in-memory computation array to the output interface to realize resistance reading, receives resistance values from the input interface to realize resistance writing, and receives voice features from the MFCC feature extraction module, processes them to generate a prediction result, and sends the prediction result to the output interface.
2. The chip dedicated to speech keyword detection according to claim 1, wherein:
the streaming neural network acceleration module is mainly formed by sequentially connecting a pre-residual convolutional layer calculation module, a first residual layer calculation module, a second residual layer calculation module, a third residual layer calculation module, an average pooling layer module and a full-connection layer calculation module;
the pre-residual convolutional layer calculation module receives the voice features sent by the MFCC feature extraction module, is responsible for executing the first convolutional layer, generates the calculation result of the first convolutional layer, and sends it to the first residual layer calculation module;
the first residual layer calculation module is responsible for executing a first residual layer, generating a first residual layer calculation result and sending the first residual layer calculation result to the second residual layer calculation module;
the second residual layer calculation module is responsible for executing the second residual layer, generating the second residual layer calculation result and sending it to the third residual layer calculation module;
the third residual layer calculation module is responsible for executing the third residual layer, generating the third residual layer calculation result and sending it to the average pooling layer module;
the average pooling layer module is responsible for executing the average pooling layer, generating the average pooling layer result and sending it to the full-connection layer calculation module;
and the full-connection layer calculation module is responsible for executing the full-connection layer, generating the full-connection layer calculation result and sending it to the output interface.
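As a purely illustrative composition of the claimed module chain (every stage below is a software stub with made-up arithmetic, not the hardware itself), the dataflow of this claim can be read as:

```python
# Placeholder stages standing in for the claimed hardware modules.
pre_residual_conv = lambda x: [2.0 * v for v in x]            # first conv layer
residual_layers = [lambda x: [v + 0.1 for v in x]] * 3        # three residual layers
average_pool = lambda x: sum(x) / len(x)                      # average pooling layer
fully_connected = lambda s: [s * w for w in (0.2, 0.5, 0.3)]  # class scores

def streaming_accelerator(features):
    x = pre_residual_conv(features)   # pre-residual convolutional module
    for layer in residual_layers:     # first, second, third residual modules
        x = layer(x)
    return fully_connected(average_pool(x))  # pooling, then full connection

print(streaming_accelerator([0.4, 0.6, 0.8]))
```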
3. The chip dedicated to speech keyword detection according to claim 2, wherein:
the average pooling layer module comprises an adder, a register group and a divider connected in sequence, with the register group also fed back to the adder; the third residual layer calculation module outputs the third residual layer calculation result to the adder, and the adder adds it to the running sum stored in the register group and stores the new sum back into the register group; the register group sends the accumulated sum to the divider, the divider divides it to obtain the division result, and the division result is then sent to the full-connection layer calculation module.
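A minimal software sketch of this accumulate-then-divide behavior (the frame count and per-channel vector width below are assumptions; the adder/register feedback is modeled as a running sum):

```python
def average_pool(frames):
    """Adder + register group + divider: each incoming third-residual-layer
    result is added to the sums held in the register group, and the divider
    produces the mean once all frames have arrived."""
    registers = [0.0] * len(frames[0])   # register group, one sum per channel
    for frame in frames:
        registers = [r + x for r, x in zip(registers, frame)]  # adder feedback
    n = len(frames)
    return [r / n for r in registers]    # divider

print(average_pool([[1, 2], [3, 4], [5, 6]]))  # [3.0, 4.0]
```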
4. The chip dedicated to speech keyword detection according to claim 2, wherein:
the first residual layer calculation module, the second residual layer calculation module and the third residual layer calculation module share the same topology, each comprising a first convolutional layer calculation module, a second convolutional layer calculation module, a third convolutional layer calculation module, a multi-stage register and an adder; the first convolutional layer calculation module is connected to both the second and the third convolutional layer calculation modules, the third convolutional layer calculation module is connected to the adder through the multi-stage register, and the second convolutional layer calculation module is connected directly to the adder;
the first, second and third convolutional layer calculation modules execute the first, second and third convolutional layers of the residual layer respectively; the result output by the preceding residual layer calculation module (or by the pre-residual convolutional layer calculation module) enters the first convolutional layer calculation module through its data input interface D_in, and the delayed input is also output from the data delay output interface Reg_out; the calculation result of the first convolutional layer calculation module is output to the second convolutional layer calculation module; the calculation result of the third convolutional layer calculation module is output to the multi-stage register for delaying, and the delayed result of the multi-stage register is output to the adder; the calculation result of the second convolutional layer calculation module is output to the adder and added to the delayed result of the multi-stage register, the sum being the final result of the residual layer calculation module.
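The wiring above can be mirrored behaviorally; in this sketch the scalar "convolutions" and the three-stage register depth are assumptions made only so the delay alignment is visible:

```python
def make_delay(stages):
    """Multi-stage register: a FIFO whose output is the input from
    `stages` cycles earlier (zeros until it fills)."""
    pipe = [0.0] * stages
    def step(x):
        pipe.append(x)
        return pipe.pop(0)
    return step

conv1 = lambda x: 2.0 * x        # first convolutional module (toy gain)
conv2 = lambda x: x + 1.0        # second convolutional module
conv3 = lambda x: 0.5 * x        # third convolutional module
delay = make_delay(3)            # assumed register depth

def residual_layer_step(x):
    y1 = conv1(x)                # D_in -> first module
    main = conv2(y1)             # straight into the adder
    shortcut = delay(conv3(y1))  # aligned by the multi-stage register
    return main + shortcut       # adder: final residual-layer result

print([residual_layer_step(x) for x in (1.0, 2.0, 3.0, 4.0)])
```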
5. The chip dedicated to speech keyword detection according to claim 2 or 4, wherein:
the convolutional layer calculation module mainly comprises an in-memory computation core, a shift register group, a first multi-way gate MUX1, a second multi-way gate MUX2, a rectification circuit Regu, a capacitor-resistor circuit CR and an analog-to-digital converter ADC;
the shift register group stores the input data received from the data input interface D_in and mainly comprises a plurality of registers reg; each register reg is connected to an input interface of the in-memory computation core through a first multi-way gate MUX1 and a second multi-way gate MUX2; each output interface of the in-memory computation core is connected to a rectification circuit Regu; among all the rectification circuits Regu, a fixed number of them form a rectification circuit group, every rectification circuit Regu within one group is connected to the same capacitor-resistor circuit CR, and rectification circuits Regu in different groups are connected to different capacitor-resistor circuits CR; the output of each capacitor-resistor circuit CR is connected to its own analog-to-digital converter ADC;
the register reg outputs its multi-bit data to the first multi-way gate MUX1; the first multi-way gate MUX1 selects one bit of the multi-bit data by traversing the bit address Baddr and inputs it to the second multi-way gate MUX2, thereby completing bit-serial input of the data; the second multi-way gate MUX2 converts the one-bit data received from the first multi-way gate MUX1 from a digital level to an analog level and inputs it to one input interface of the in-memory computation core;
each output interface of the in-memory computation core outputs a current signal to its rectification circuit Regu, and the outputs of the rectification circuits Regu within the same group are sent to the same capacitor-resistor circuit CR for integration.
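A digital analogue of this bit-serial path (the 8-bit word width is an assumption, and the binary re-weighting below stands in for the analog Regu/CR integration and the ADC):

```python
def bit_serial_mac(inputs, weights, bits=8):
    """MUX1 traverses the bit address Baddr from LSB to MSB; the selected
    bit of every register drives one core input, and the per-bit products
    are accumulated with binary weights, reproducing the full dot product."""
    acc = 0
    for baddr in range(bits):                # Baddr traversal
        for x, w in zip(inputs, weights):
            bit = (x >> baddr) & 1           # MUX1: pick one bit of reg
            acc += bit * w * (1 << baddr)    # weighted partial product
    return acc

print(bit_serial_mac([3, 5], [2, 4]))  # 3*2 + 5*4 = 26
```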
6. The chip dedicated to speech keyword detection according to claim 5, wherein:
the topology of the full-connection layer calculation module is basically the same as that of the convolutional layer calculation module, except that a plain register group replaces the shift register group.
7. The chip dedicated to speech keyword detection according to claim 6, wherein:
the in-memory computation core comprises a write circuit A, a write circuit B, an in-memory computing array, n first multi-way gates MUX1 and n second multi-way gates MUX2; the write circuit A is connected to the bit line ports of the in-memory computing array through the n first multi-way gates MUX1, and the write circuit B is connected to the word line ports of the in-memory computing array through the n second multi-way gates MUX2; the in-memory computing array stores a weight matrix W, an input vector X enters the in-memory computation core to be multiplied by the weight matrix, and the result is output as an output vector Y;
the write circuit A receives the row address signal ROW_addr and the write enable signal WE sent by the control module and, through an internal row decoder, outputs an n-bit row strobe signal, each bit of which is input to one input end of a first multi-way gate MUX1; the write circuit B receives the column address signal COL_addr and the write data W_data sent by the control module and outputs an m-bit column strobe signal, each bit of which is input to one input end of a second multi-way gate MUX2; the input signals X_0, X_1, ..., X_{n-1} of the input vector X entering the in-memory computation core are input to the other input ends of the n first multi-way gates MUX1 respectively; under the action of the same control signal WE, the n first multi-way gates output the bit line signals WL_0, WL_1, ..., WL_{n-1} into the in-memory computing array; the word line signals BL_0, BL_1, ..., BL_{n-1} output by the in-memory computing array enter the other input ends of the n second multi-way gates respectively, and the output ends of the n second multi-way gates output the signals Y_0, Y_1, ..., Y_{n-1} forming the output vector Y.
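Functionally, the core computes the vector-matrix product Y = X * W by current summation on the array lines. A behavioral sketch follows; the array size and stored conductance values are arbitrary choices for illustration:

```python
def in_memory_matmul(X, W):
    """Each input X_i drives line i of the array, each stored conductance
    W[i][j] contributes a current X_i * W[i][j], and the currents summing
    on output line j form Y_j (Kirchhoff current summation)."""
    m = len(W[0])
    Y = [0.0] * m
    for i, x in enumerate(X):
        for j in range(m):
            Y[j] += x * W[i][j]   # current adds on output line j
    return Y

X = [1.0, 0.5, 2.0]
W = [[0.2, 0.4],
     [0.6, 0.1],
     [0.3, 0.5]]
print(in_memory_matmul(X, W))  # [1.1, 1.45]
```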
8. The chip dedicated to speech keyword detection according to claim 1, wherein:
the input interface and the output interface use the SPI serial protocol to receive and send data respectively; the input interface is also responsible for receiving the externally input clock, which it receives directly without going through the SPI protocol.
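For orientation only, a toy mode-0 SPI receiver (sampling MOSI on rising SCLK edges, MSB first; the 8-bit frame width is an assumption, as the claim only names the protocol):

```python
def spi_receive(sclk_samples, mosi_samples):
    """Sample the data line on every rising clock edge and assemble an
    8-bit word MSB-first, as a mode-0 SPI slave would."""
    bits, prev = [], 0
    for clk, d in zip(sclk_samples, mosi_samples):
        if prev == 0 and clk == 1:   # rising edge of SCLK
            bits.append(d)
        prev = clk
    word = 0
    for b in bits[:8]:
        word = (word << 1) | b
    return word

# Drive the byte 0b10110010 MSB-first, two samples per bit period.
data_bits = [1, 0, 1, 1, 0, 0, 1, 0]
sclk = [lvl for _ in data_bits for lvl in (0, 1)]
mosi = [b for b in data_bits for _ in (0, 1)]
print(hex(spi_receive(sclk, mosi)))  # 0xb2
```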
9. The work control method of the chip dedicated to voice keyword detection according to any one of claims 1 to 8, characterized in that the method comprises the following steps (an illustrative host-side sketch follows the steps):
S01: resistance erasing, namely erasing all resistance values in the in-memory computing array of the chip and restoring them to the initial resistance value;
S02: resistance reading, namely inputting a vector into the in-memory computing array under the control of the control module, executing the computation, and obtaining the resistance values read from the in-memory computing array through the output interface;
S03: mapping optimization, namely using a network mapping algorithm to determine the resistance values to be written into the in-memory computing array according to the resistance values read in S02;
S04: resistance writing, namely sending the mapping-optimized resistance values of S03 to the chip through the input interface and writing them into the corresponding positions of the in-memory computing array;
S05: normal operation, namely the chip receives a voice signal from the input interface, the MFCC feature extraction module processes it into voice features and inputs them into the streaming neural network acceleration module, and the streaming neural network acceleration module processes the voice features to generate a prediction result, which is output from the output interface.
CN202110265116.8A 2021-03-10 2021-03-10 Special chip for voice keyword detection Active CN112926733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110265116.8A CN112926733B (en) 2021-03-10 2021-03-10 Special chip for voice keyword detection

Publications (2)

Publication Number Publication Date
CN112926733A CN112926733A (en) 2021-06-08
CN112926733B (en) 2022-09-16

Family

ID=76172663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110265116.8A Active CN112926733B (en) 2021-03-10 2021-03-10 Special chip for voice keyword detection

Country Status (1)

Country Link
CN (1) CN112926733B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543939B (en) * 2019-06-12 2022-05-03 电子科技大学 Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN110378468B (en) * 2019-07-08 2020-11-20 浙江大学 Neural network accelerator based on structured pruning and low bit quantization

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019004582A1 (en) * 2017-06-28 2019-01-03 POSTECH Academy-Industry Foundation (포항공과대학교 산학협력단) Real-time voice recognition apparatus equipped with ASIC chip and smartphone
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN110738984A (en) * 2019-05-13 2020-01-31 苏州闪驰数控系统集成有限公司 Artificial intelligence CNN, LSTM neural network speech recognition system
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA
CN110991633A (en) * 2019-12-04 2020-04-10 电子科技大学 Residual error neural network model based on memristor network and application method thereof
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
JP6834097B1 (en) * 2020-05-15 2021-02-24 エッジコーティックス ピーティーイー. リミテッド Hardware-specific partitioning of inference neural network accelerators

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Always-On, Sub-300-nW, Event-Driven Spiking Neural Network based on Spike-Driven Clock-Generation and Clock- and Power-Gating for an Ultra-Low-Power Intelligent Device; Dewei Wang et al.; 2020 IEEE Asian Solid-State Circuits Conference (A-SSCC); 2021-02-01 *
An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs; Chaoyang Zhu et al.; IEEE Transactions on Very Large Scale Integration (VLSI) Systems; 2020-07-01; Vol. 28, No. 9 *
Few-Shot Keyword Spotting With Prototypical Networks; Archit Parnami et al.; https://arxiv.org/abs/2007.14463; 2020-07-25 *
Driver design for the clock voice chip YF017 under embedded Linux; Wang Jihao et al.; Microcontrollers & Embedded Systems (单片机与嵌入式系统应用); 2015-12-31; Vol. 15, No. 10 *

Also Published As

Publication number Publication date
CN112926733A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN110990060B (en) Embedded processor, instruction set and data processing method of storage and computation integrated chip
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
Conti et al. Chipmunk: A systolically scalable 0.9 mm², 3.08 Gop/s/mW @ 1.2 mW accelerator for near-sensor recurrent neural network inference
CN111210019A (en) Neural network inference method based on software and hardware cooperative acceleration
CN113361695A (en) Convolutional neural network accelerator
CN112153139A (en) Control system and method based on sensor network and in-memory computing neural network
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN114758699A (en) Data processing method, system, device and medium
CN113157638B (en) Low-power-consumption in-memory calculation processor and processing operation method
CN112486907B (en) Hardware realization method for multi-layer circulation task on reconfigurable processor
Liu et al. An energy-efficient mixed-bit CNN accelerator with column parallel readout for ReRAM-based in-memory computing
CN113127407A (en) Chip architecture for AI calculation based on NVM
CN112926733B (en) Special chip for voice keyword detection
CN112988082A (en) Chip system for AI calculation based on NVM and operation method thereof
CN112183744A (en) Neural network pruning method and device
CN115879530A (en) Method for optimizing array structure of RRAM (resistive random access memory) memory computing system
US20220318610A1 (en) Programmable in-memory computing accelerator for low-precision deep neural network inference
CN113378115B (en) Near-memory sparse vector multiplier based on magnetic random access memory
CN115312090A (en) Memory computing circuit and method
JP2024525333A (en) An In-Memory Computation Architecture for Depthwise Convolution
CN111198714B (en) Retraining method and related product
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
CN110580548A (en) Multi-step traffic speed prediction method based on class integration learning
Yu et al. Implementation of convolutional neural network with co-design of high-level synthesis and verilog HDL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant