CN111258544A

CN111258544A - Multiplier, data processing method, chip and electronic equipment

Info

Publication number: CN111258544A
Application number: CN201811450728.9A
Authority: CN
Inventors: Not disclosed
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2020-06-09
Anticipated expiration: 2038-11-30
Also published as: CN111258544B

Abstract

The application provides a multiplier, a data processing method, a chip and an electronic device. The multiplier comprises a judgment circuit, a data expansion circuit, a coding circuit and a compression circuit, wherein the output end of the judgment circuit is connected with the input end of the data expansion circuit and with the first input end of the coding circuit, the output end of the data expansion circuit is connected with the second input end of the coding circuit, and the output end of the coding circuit is connected with the input end of the compression circuit. The multiplier can expand received low-bit-width data so that the expanded data meets the bit-width requirement of the multiplier, while the final multiplication result remains the product of the original-bit-width data. The multiplier can therefore process low-bit-width data, and the area of an AI chip occupied by the multiplier is effectively reduced.

Description

Multiplier, data processing method, chip and electronic equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a multiplier, a data processing method, a chip, and an electronic device.

Background

With the continuous development of digital electronic technology, the rapid development of various Artificial Intelligence (AI) chips places ever higher requirements on high-performance digital multipliers. Neural network algorithms are among the algorithms most widely used by intelligent chips, and multiplication performed by a multiplier is a common operation in them.

In general, when data with different bit widths are multiplied, an existing multiplier with the corresponding bit width must be used for the operation. A conventional multiplier that processes high-bit-width data cannot be used to multiply low-bit-width data, so the multipliers occupy a large area of the AI chip.

Disclosure of Invention

In view of the above, it is desirable to provide a multiplier, a data processing method, a chip and an electronic device.

An embodiment of the present invention provides a multiplier, where the multiplier includes a judgment circuit, a data expansion circuit, a coding circuit and a compression circuit; the output end of the judgment circuit is connected with the input end of the data expansion circuit, the output end of the judgment circuit is connected with the first input end of the coding circuit, the output end of the data expansion circuit is connected with the second input end of the coding circuit, and the output end of the coding circuit is connected with the input end of the compression circuit;

the judgment circuit is used for judging whether the received data needs to be processed through a data expansion circuit connected with the output end of the judgment circuit, the data expansion circuit is used for carrying out expansion processing on the received data, the coding circuit is used for carrying out coding processing on the received data to obtain a partial product of a target code, and the compression circuit is used for carrying out accumulation processing on the partial product of the target code.

In one embodiment, the encoding circuit comprises a third input terminal for receiving an input function selection mode signal; the compression circuit includes a first input for receiving an input function selection mode signal.

In one embodiment, the determining circuit includes: a data input port and a data output port; the data input port is configured to receive data to be subjected to multiplication, the data output port is configured to output the received data, and the fourth data input port is configured to output a second received data.

In one embodiment, the data expansion circuit includes: a data input port, a data expansion mode selection signal input port, a function selection mode signal output port and an expanded data output port; the data input port is used for receiving the data output by the judgment circuit, the data expansion mode selection signal input port is used for receiving a data expansion mode selection signal that determines how the received data is expanded, the function selection mode signal output port is used for outputting a function selection mode signal determined according to the manner in which the data expansion circuit expanded the received data, and the expanded data output port is used for outputting the expanded data.

In one embodiment, the encoding circuit includes a Booth encoding sub-circuit and a partial product obtaining sub-circuit, wherein the output end of the Booth encoding sub-circuit is connected with the first input end of the partial product obtaining sub-circuit;

the Booth coding sub-circuit is used for carrying out Booth coding on the received data to obtain a coded signal, and the partial product obtaining sub-circuit is used for obtaining a partial product of a target code according to the coded signal.

In one embodiment, the Booth encoding sub-circuit comprises a data input port and an encoded signal output port; the data input port is used for receiving the data to be Booth encoded, and the encoded signal output port is used for outputting the encoded signal obtained by Booth encoding the received data.

In one embodiment, the partial product obtaining sub-circuit comprises an encoded signal input port, a data input port and a partial product output port, wherein the encoded signal input port is used for receiving the encoded signal, the data input port is used for receiving the data, and the partial product output port is used for outputting the partial product of the target code obtained according to the encoded signal and the received data.

In one embodiment, the compression circuit comprises: a Wallace tree group sub-circuit and an accumulation sub-circuit; the output end of the Wallace tree group sub-circuit is connected with the input end of the accumulation sub-circuit; the Wallace tree group sub-circuit is configured to accumulate the partial products of the target code, and the accumulation sub-circuit is configured to accumulate the received input data.

In one embodiment, the Wallace tree group sub-circuit comprises: a Wallace tree unit, used to accumulate each column of the partial products of the target code.

In one embodiment, the accumulation sub-circuit comprises an adder, which is used for performing an addition operation on two received data with the same bit width.

In one embodiment, the adder comprises a carry signal input port, a sum signal input port and an operation result output port; the carry signal input port is used for receiving a carry signal, the sum signal input port is used for receiving a sum signal, and the operation result output port is used for outputting the result of accumulating the carry signal and the sum signal.
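The division of labor between the Wallace tree group sub-circuit and the adder can be illustrated in software. The following Python sketch is only an illustration under stated assumptions: it models the Wallace tree as repeated 3:2 carry-save compression applied to whole rows (bitwise operations act on every column at once), and the names `compress_3_to_2`, `wallace_reduce` and `final_add` are hypothetical and do not come from this application.

```python
def compress_3_to_2(a, b, c):
    """Full-adder array: returns (sum_signal, carry_signal) with a + b + c == sum + carry."""
    sum_signal = a ^ b ^ c
    carry_signal = ((a & b) | (a & c) | (b & c)) << 1
    return sum_signal, carry_signal


def wallace_reduce(rows, width):
    """Reduce any number of partial-product rows to one sum row and one carry row."""
    mask = (1 << width) - 1
    rows = [r & mask for r in rows]  # two's-complement patterns, width bits each
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            s, c = compress_3_to_2(rows[i], rows[i + 1], rows[i + 2])
            next_rows += [s & mask, c & mask]
        next_rows += rows[len(rows) - len(rows) % 3:]  # rows left over this stage
        rows = next_rows
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def final_add(sum_signal, carry_signal, width):
    """The adder stage: adds the sum signal and the carry signal (modulo 2**width)."""
    return (sum_signal + carry_signal) & ((1 << width) - 1)
```

In this model only the last step propagates carries across the full width; every earlier stage is carry-free, which is the usual motivation for placing a compression tree in front of a single adder.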

According to the multiplier provided by the embodiment, the multiplier can be used for expanding received low-bit-width data, the expanded data meets the bit-width requirement of the multiplier for data processing, and the final multiplication result is still the result of multiplication of the original bit-width data, so that the multiplier can be used for processing the low-bit-width data, and the area of an AI chip occupied by the multiplier is effectively reduced.

The embodiment of the invention provides a data processing method, which comprises the following steps:

receiving data to be processed;

judging whether the bit width of the data to be processed is equal to the bit width of the data which can be processed by the multiplier or not;

if the bit widths are not equal, performing data expansion processing on the data to be processed to obtain expanded data;

coding the expanded data to obtain a partial product after sign bit expansion;

and accumulating the partial products after the sign bit is expanded to obtain an operation result.

In one embodiment, after determining whether the bit width of the data to be processed is equal to the bit width of the data processable by the multiplier, the method further includes: if the bit widths are equal, coding the data to be processed to obtain a partial product after sign bit expansion.

In one embodiment, the encoding the extended data to obtain a sign-bit-extended partial product includes:

performing Booth coding processing on the expanded data to obtain a coded signal;

and obtaining the partial product after the sign bit is expanded according to the data to be processed and the coding signal.

In one embodiment, the obtaining the sign-bit-extended partial product according to the data to be processed and the encoded signal includes:

obtaining an original partial product according to the data to be processed and the coded signal;

and sign bit expansion processing is carried out on the original partial product to obtain the partial product after sign bit expansion.

In one embodiment, the performing data expansion processing on the data to be processed to obtain expanded data includes: expanding the data to be processed by filling the added bits with 0 or with the sign bit value of the data to be processed, to obtain the expanded data.

In one embodiment, the bit width of the expanded data is equal to the bit width of data currently processed by the multiplier.
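The two fill options named above (0, or the sign bit value) correspond to zero extension and sign extension. The following Python sketch is a minimal illustration that assumes the added bits are placed above the most significant bit; the function names `expand` and `to_signed` and the `use_sign_bit` flag are hypothetical and are not taken from this application.

```python
def expand(bits, src_width, dst_width, use_sign_bit):
    """Expand a src_width-bit pattern to dst_width bits.

    use_sign_bit=True fills the new high bits with the sign bit value;
    use_sign_bit=False fills them with 0.
    """
    bits &= (1 << src_width) - 1
    sign = (bits >> (src_width - 1)) & 1
    if use_sign_bit and sign:
        high_fill = ((1 << (dst_width - src_width)) - 1) << src_width
        return bits | high_fill
    return bits


def to_signed(bits, width):
    """Interpret a width-bit pattern as a two's-complement integer."""
    return bits - (1 << width) if (bits >> (width - 1)) & 1 else bits
```

For example, the 8-bit pattern 0xF6 (the value -10) expands to 0xFFF6 with sign-bit filling, which is still -10 as a 16-bit value, whereas filling with 0 yields 0x00F6 (246), which preserves the unsigned value instead.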

In one embodiment, the accumulating the partial products after sign bit extension to obtain an operation result includes:

accumulating the partial product after the sign bit is expanded through a Wallace tree group sub-circuit to obtain a first operation result;

and performing accumulation processing on the first operation result through an accumulation sub-circuit to obtain an operation result.

According to the data processing method provided by the embodiment, the received low bit width data can be expanded, the expanded data meets the bit width requirement of the data which can be processed by the multiplier, and the final multiplication result is still the result of the multiplication of the original bit width data, so that the multiplier can process the operation of the low bit width data, and the area of an AI chip occupied by the multiplier is effectively reduced.
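The steps of the method can be traced end to end in software. The following Python sketch is a behavioral model only, under several assumptions that are not stated in this application: the multiplier is taken to be 16 bits wide, expansion uses the sign bit, Booth coding is radix-4, and accumulation is done with ordinary additions rather than a Wallace tree. All function names are hypothetical.

```python
def to_signed(bits, width):
    """Interpret a width-bit pattern as a two's-complement integer."""
    return bits - (1 << width) if (bits >> (width - 1)) & 1 else bits


def sign_extend(bits, src_width, dst_width):
    """Expand a src_width-bit pattern to dst_width bits using its sign bit."""
    return to_signed(bits & ((1 << src_width) - 1), src_width) & ((1 << dst_width) - 1)


def booth_digits(multiplier_bits, width):
    """Radix-4 Booth coding: one digit in {-2..2} per overlapping 3-bit group."""
    padded = (multiplier_bits & ((1 << width) - 1)) << 1  # append 0 below the LSB
    digits = []
    for i in range(0, width, 2):
        y_lo = (padded >> i) & 1
        y_mid = (padded >> (i + 1)) & 1
        y_hi = (padded >> (i + 2)) & 1
        digits.append(y_lo + y_mid - 2 * y_hi)
    return digits


def multiply(a_bits, b_bits, data_width, multiplier_width=16):
    """Judge the bit width, expand if needed, encode, then accumulate."""
    if data_width != multiplier_width:  # judgment step
        a_bits = sign_extend(a_bits, data_width, multiplier_width)
        b_bits = sign_extend(b_bits, data_width, multiplier_width)
    out_width = 2 * multiplier_width
    out_mask = (1 << out_width) - 1
    multiplicand = to_signed(a_bits, multiplier_width)
    result = 0
    for j, digit in enumerate(booth_digits(b_bits, multiplier_width)):
        # Partial product, sign-extended to the full output width by masking.
        partial = ((digit * multiplicand) << (2 * j)) & out_mask
        result = (result + partial) & out_mask
    return to_signed(result, out_width)
```

With these assumptions, `multiply(0xF6, 0x07, 8)` multiplies the 8-bit values -10 and 7 on the 16-bit datapath and still returns -70, the product of the original-bit-width data.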

The embodiment of the present invention provides a machine learning arithmetic device, which includes one or more multipliers described in the first aspect; the machine learning arithmetic device is used for acquiring data to be operated and control information from other processing devices, executing specified machine learning arithmetic and transmitting an execution result to other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of multipliers, the multipliers can be linked through a specific structure and transmit data;

the multipliers are interconnected through a PCIe bus and transmit data so as to support larger-scale machine learning operation; a plurality of multipliers share the same control system or own respective control systems; a plurality of multipliers share a memory or own respective memories; the interconnection mode of a plurality of multipliers is any interconnection topology.

The combined processing device provided by the embodiment of the invention comprises the machine learning arithmetic device, a universal interconnection interface and other processing devices; the machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user; the combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices, respectively, and is configured to store data of the machine learning arithmetic device and the other processing devices.

The neural network chip provided by the embodiment of the invention comprises the multiplier, the machine learning arithmetic device or the combined processing device.

The neural network chip packaging structure provided by the embodiment of the invention comprises the neural network chip.

The board card provided by the embodiment of the invention comprises the neural network chip packaging structure.

The embodiment of the application provides an electronic device, which comprises the neural network chip or the board card.

An embodiment of the present invention provides a chip, including at least one multiplier as described in any one of the above.

The electronic equipment provided by the embodiment of the invention comprises the chip.

Drawings

FIG. 1 is a schematic structural diagram of a multiplier according to an embodiment;

FIG. 2 is a schematic diagram of another multiplier according to another embodiment;

FIG. 3 is a circuit diagram of an embodiment of a multiplier;

FIG. 4 is a schematic diagram illustrating a distribution rule of partial products obtained by 16-bit data multiplication according to an embodiment;

FIG. 5 is a circuit diagram of another embodiment of a multiplier;

FIG. 6 is a specific circuit diagram of a compression circuit for 8-bit data operation according to another embodiment;

FIG. 7 is a flowchart illustrating a data processing method according to an embodiment;

FIG. 8 is a flowchart illustrating a method for obtaining an encoded signal according to an embodiment;

FIG. 9 is a flowchart illustrating a method for obtaining a partial product of a target code according to an embodiment;

FIG. 10 is a flowchart illustrating a method for obtaining an operation result according to an embodiment;

FIG. 11 is a flowchart illustrating a specific method for obtaining an operation result according to an embodiment;

FIG. 12 is a flow chart illustrating another data processing method according to an embodiment;

FIG. 13 is a flowchart illustrating a method for obtaining a partial product after sign bit expansion according to another embodiment;

FIG. 14 is a flowchart illustrating a specific method for obtaining a partial product after sign bit expansion according to another embodiment;

FIG. 15 is a block diagram of a combined processing device according to an embodiment;

FIG. 16 is a block diagram of another combined processing device according to an embodiment;

FIG. 17 is a schematic structural diagram of a board card according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The multiplier provided by the application can be applied to an AI chip, a Field-Programmable Gate Array (FPGA) chip or other hardware circuit devices for multiplication processing, and the specific structural schematic diagram of the multiplier is shown in fig. 1 and 2.

FIG. 1 is a structural diagram of a multiplier according to an embodiment. As shown in FIG. 1, the multiplier includes: a correction encoding circuit 11 and a correction compression circuit 12; the output end of the correction encoding circuit 11 is connected with the input end of the correction compression circuit 12; the correction encoding circuit 11 is configured to perform encoding processing on the received data to obtain a partial product after sign bit extension, and to obtain a partial product of a target code according to the partial product after sign bit extension, and the correction compression circuit 12 is configured to perform accumulation processing on the partial product of the target code.

Specifically, the correction encoding circuit 11 may include a plurality of data processing units having different functions, and the data received by the correction encoding circuit 11 may be used as a multiplier in a multiplication operation or may be used as a multiplicand in a multiplication operation. Optionally, the data may be fixed point numbers. Optionally, the correction encoding circuit 11 may receive data with a plurality of different bit widths, that is, the multiplier provided in this embodiment may process multiplication operations of data with a plurality of different bit widths. However, in the same multiplication, the multiplier and the multiplicand received by the correction encoding circuit 11 have the same bit width. For example, the multiplier provided in this embodiment may process a multiplication operation of 8 bits by 8 bits data, a multiplication operation of 16 bits by 16 bits data, a multiplication operation of 32 bits by 32 bits data, and a multiplication operation of 64 bits by 64 bits data, which is not limited in this embodiment.

Optionally, the correction encoding circuit 11 may perform binary coding on the received data, which is equivalent to performing binary coding on the received multiplier, and obtain the sign-bit-extended partial products according to the received multiplicand, where the bit width of each sign-bit-extended partial product may be equal to 2 times the bit width of the data currently processed by the multiplier. Illustratively, suppose the correction encoding circuit 11 receives data with a bit width of 16 bits. If the multiplier is currently performing 8-bit data multiplication, the correction encoding circuit 11 needs to divide the 16-bit data into two groups, the high 8 bits and the low 8 bits, and the bit width of the sign-bit-extended partial product may be equal to 2 times the bit width of the data currently processed, that is, 16 bits. If the multiplier is currently performing 16-bit data multiplication, the correction encoding circuit 11 needs to operate on the 16-bit data as a whole, and the bit width of the sign-bit-extended partial product may be equal to 2 times the bit width of the data currently processed, that is, 32 bits.

Optionally, the modified encoding circuit 11 includes a first input end for receiving an input function selection mode signal; the modified compression circuit 12 includes a first input terminal for receiving the input function selection mode signal. Optionally, the function selection mode signal is used to determine a data bit width processed by the multiplier.

It should be noted that the function selection mode signal may take multiple values, and different function selection mode signals correspond to the multiplier currently processing multiplication operations on data with different bit widths. Optionally, in the same multiplication, the function selection mode signals received by the correction encoding circuit 11 and the correction compression circuit 12 are the same.

For example, the correction encoding circuit 11 and the correction compression circuit 12 can receive a plurality of function selection mode signals. Taking three function selection mode signals as an example, the mode signals may be 00, 01 and 10, where mode 00 may indicate that the multiplier processes 16-bit data, mode 01 that it processes 32-bit data, and mode 10 that it processes 64-bit data. Alternatively, mode 00 may indicate that the multiplier processes 64-bit data, mode 01 that it processes 16-bit data, and mode 10 that it processes 32-bit data; the assignment of mode values to bit widths is not fixed.

In the multiplier provided by this embodiment, the sign bit extended partial product is obtained by encoding the received data through the correction encoding circuit, the target encoded partial product is obtained according to the sign bit extended partial product, and the target encoded partial product is accumulated through the correction compression circuit to obtain the multiplication result.

Fig. 2 is a structural diagram of a multiplier according to another embodiment. As shown in fig. 2, the multiplier includes: a judgment circuit 11, a data expansion circuit 12, an encoding circuit 13, and a compression circuit 14; the output end of the judging circuit 11 is connected with the input end of the data expanding circuit 12, the output end of the judging circuit 11 is connected with the first input end of the coding circuit 13, the output end of the data expanding circuit 12 is connected with the second input end of the coding circuit 13, and the output end of the coding circuit 13 is connected with the input end of the compressing circuit 14. The judging circuit 11 is configured to judge whether the received data needs to be processed by a data expansion circuit 12 connected to an output end of the judging circuit 11, the data expansion circuit 12 is configured to perform expansion processing on the received data, the encoding circuit 13 is configured to perform encoding processing on the received data to obtain a partial product of a target code, and the compressing circuit 14 is configured to perform accumulation processing on the partial product of the target code.

Specifically, the judgment circuit 11 may be a circuit that compares the bit width of the received data with the bit width of the data processable by the multiplier, which is 2N bits. Optionally, the encoding circuit 13 may include a plurality of data processing units with different functions, and the data received by the encoding circuit 13 may be used as a multiplier in a multiplication operation, and may also be used as a multiplicand in a multiplication operation. The data received by the encoding circuit 13 may be the two data output by the judgment circuit 11, or may be the data obtained after the data expansion circuit 12 performs expansion processing on the two received data. Optionally, the data processing units with different functions may be data processing units with a binary encoding function. Optionally, the multiplier and multiplicand may be multi-bit wide floating point numbers. Optionally, the compression circuit 14 may perform accumulation processing on the partial product of the target code obtained by the encoding circuit 13 to obtain a multiplication result.

It should be noted that the multiplier may perform multiplication on data with a fixed 2N-bit width, and it is also understood that the encoding circuit and the compression circuit in the multiplier may perform multiplication on data with a 2N-bit width. However, in the same multiplication, the multiplier and the multiplicand received by the encoding circuit 13 are data having the same bit width. For example, the multiplier provided in this embodiment may process a data multiplication operation of 8 bits by 8 bits, a data multiplication operation of 16 bits by 16 bits, a data multiplication operation of 32 bits by 32 bits, and a data multiplication operation of 64 bits by 64 bits, which is not limited in this embodiment. Optionally, there may be one input port of the data processing unit with different functions, the function of each input port of each data processing unit may be the same, there may also be one output port, the function of each output port of each data processing unit may be different, and the circuit structures of the data processing units with different functions may be different.

Optionally, the encoding circuit 13 includes a third input end, configured to receive an input function selection mode signal; the compression circuit 14 includes a first input terminal for receiving an input function selection mode signal.

In the multiplier provided by this embodiment, the judgment circuit determines whether the received data needs to be processed by the data expansion circuit that follows it. If not, the judgment circuit inputs the received data directly to the encoding circuit, which encodes it to obtain the partial product of the target code. Otherwise, the received data is input to the data expansion circuit for expansion, and the data expansion circuit inputs the expanded data to the encoding circuit, which encodes it to obtain the partial product of the target code. The compression circuit then performs accumulation processing on the partial product of the target code to obtain the final operation result. The multiplier can thus expand received low-bit-width data so that the expanded data satisfies the bit-width requirement of the multiplier, while the final multiplication result is still the multiplication result of the original-bit-width data. This ensures that the multiplier can process low-bit-width data, and the area of the AI chip occupied by the multiplier is effectively reduced.

Fig. 3 is a schematic diagram of a specific structure of a multiplier according to another embodiment, where the multiplier includes the correction coding circuit 11, and the correction coding circuit 11 includes: a low booth encoding unit 111, a low partial product acquisition unit 112, a selector 113, a high booth encoding unit 114, a high partial product acquisition unit 115, a low selector bank unit 116, and a high selector bank unit 117; a first output terminal of the low booth encoding unit 111 is connected to the input terminal of the selector 113, a second output terminal of the low booth encoding unit 111 is connected to the first input terminal of the low partial product acquisition unit 112, an output terminal of the selector 113 is connected to a first input terminal of the high booth encoding unit 114, an output terminal of the high booth encoding unit 114 is connected to a first input terminal of the high partial product acquisition unit 115, an output terminal of the low selector bank unit 116 is connected to a second input terminal of the low partial product acquisition unit 112, and an output terminal of the high selector bank unit 117 is connected to a second input terminal of the high partial product acquisition unit 115.

The low booth encoding unit 111 is configured to perform booth coding processing on the low-order data in the received data to obtain a low-order coded signal; the low partial product acquisition unit 112 is configured to obtain a low-order partial product of the target code according to the low-order coded signal; the selector 113 is configured to gate the complement bit value used when the high-order data is booth coded; the high booth encoding unit 114 is configured to perform booth coding on the received high-order data and the complement bit value to obtain a high-order coded signal; the high partial product acquisition unit 115 is configured to obtain a high-order partial product of the target code according to the high-order coded signal; the low selector bank unit 116 is used to gate the values in the low-order partial product of the target code; and the high selector bank unit 117 is used to gate the values in the high-order partial product of the target code.

Specifically, the correction encoding circuit 11 may receive a multiplier and a multiplicand in the multiplication, perform booth encoding on the multiplier to obtain an encoded signal, and obtain a partial product of a target code from the encoded signal and the received multiplicand. Before the low-order data is subjected to the booth encoding process, the low-order booth encoding unit 111 may automatically perform a bit complementing process on the low-order data in the data received by the correction encoding circuit 11, and perform the booth encoding process on the low-order data after the bit complementing process to obtain a low-order encoded signal, where the data may be a multiplier in a multiplication operation. Optionally, if the bit width of the multiplier received by the correction encoding circuit 11 is N, the low-order data may be the low N/2 bits of the data, and the bit complementing process may consist of appending one bit with value 0 below the lowest bit of the low-order data. Illustratively, if the multiplier currently handles 8-bit by 8-bit fixed point multiplication and the multiplier operand is "y₇y₆y₅y₄y₃y₂y₁y₀", then before performing the booth encoding process the low-order booth encoding unit 111 may automatically perform the bit complementing process on the operand, converting it into the bit-complemented data "y₇y₆y₅y₄y₃y₂y₁y₀0". Optionally, the number of the low-order coded signals may be equal to 1/2 of the low-order data bit width, and the number of the low-order coded signals may be equal to the number of partial products after sign bit expansion corresponding to the low-order data. It should be noted that, regardless of whether the bit width of the data currently processed by the multiplier is the same as the bit width of the data received by the multiplier, the low-order booth encoding unit 111 needs to perform the bit complementing process on the low-order data when implementing the booth encoding process.
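The bit complementing step and the count of encoded signals can be checked with a short model. The following Python sketch assumes the standard radix-4 Booth table, which this passage does not spell out; `BOOTH_TABLE` and `booth_encode` are hypothetical names. Each encoded signal is represented by the multiple of the multiplicand that it selects.

```python
# Assumed standard radix-4 Booth table: (y[i+1], y[i], y[i-1]) -> multiple of the multiplicand.
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_encode(multiplier_bits, width):
    """Append a 0 below the lowest bit, then encode overlapping 3-bit groups."""
    padded = (multiplier_bits & ((1 << width) - 1)) << 1  # "y7...y0" becomes "y7...y0 0"
    signals = []
    for i in range(0, width, 2):
        group = ((padded >> (i + 2)) & 1, (padded >> (i + 1)) & 1, (padded >> i) & 1)
        signals.append(BOOTH_TABLE[group])
    return signals
```

For the 8-bit operand 0b10011010 (the value -102), this yields four signals, that is, half the bit width, and weighting the j-th signal by 4**j recovers -102.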

Meanwhile, the high-order booth coding unit 114 may perform booth coding on the high-order data in the multiplier received by the correction coding circuit 11 to obtain a high-order coded signal. Before the high-order data is booth coded, the selector 113 needs to obtain a strobe value, which serves as the complement bit value for the booth coding of the high-order data. The high-order data is then combined with the complement bit value to obtain the bit-complemented high-order data, and the high-order booth coding unit 114 performs booth coding on the bit-complemented high-order data to obtain the high-order coded signal. Optionally, the selector 113 may be a two-way selector, and the strobe value may be 0 or may be the highest bit value of the low-order data in the multiplier. Illustratively, a multiplier may process multiplication of data with bit widths of N bits and 2N bits, where the bit width of the data received by the correction coding circuit 11 is 2N bits. If the multiplier is currently processing N-bit data, the value gated by the selector 113 is 0, that is, the multiplier divides the received 2N-bit data into the high N bits and the low N bits and processes them separately. If the multiplier is currently processing 2N-bit data, the value gated by the selector 113 is the highest bit value of the low-order data, which corresponds to the multiplier performing booth coding on the received 2N-bit data as a whole. In addition, the selector 113 may determine the gated complement bit value according to the different function selection mode signals it receives.
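The effect of the selector's two choices can be checked numerically. The following Python sketch assumes radix-4 Booth coding and models only the complement bit selection; `booth_signals`, `encode_2n`, `value_of` and the `full_width_mode` flag are hypothetical names standing in for the units and the function selection mode signal.

```python
def booth_signals(bits, width, pad_bit):
    """Radix-4 Booth digits of a width-bit field whose complement bit is pad_bit."""
    padded = ((bits & ((1 << width) - 1)) << 1) | pad_bit
    return [((padded >> i) & 1) + ((padded >> (i + 1)) & 1) - 2 * ((padded >> (i + 2)) & 1)
            for i in range(0, width, 2)]


def encode_2n(data_bits, n, full_width_mode):
    """Encode 2N-bit data as low and high halves, as a model of units 111, 113 and 114."""
    low = data_bits & ((1 << n) - 1)
    high = (data_bits >> n) & ((1 << n) - 1)
    low_msb = (low >> (n - 1)) & 1
    selected = low_msb if full_width_mode else 0  # value gated by the selector
    return booth_signals(low, n, 0), booth_signals(high, n, selected)


def value_of(signals):
    """Recover the signed value represented by a list of Booth digits."""
    return sum(d * 4 ** j for j, d in enumerate(signals))
```

Taking the 16-bit input 0x9AF6 with N = 8: when 0 is selected, the two halves decode independently to -10 (0xF6) and -102 (0x9A); when the highest bit of the low-order data is selected, the concatenated digits decode to -25866, the value of 0x9AF6 as a single 16-bit number.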

It should be noted that the low-order partial product obtaining unit 112 may obtain the low-order partial product of the target code from the sign-bit-extended partial product corresponding to the low-order data, which it derives according to each low-order coded signal, and from the values in the low-order partial product of the target code gated by the low selector bank unit 116. Optionally, the high-order partial product obtaining unit 115 may likewise obtain the high-order partial product of the target code from the sign-bit-extended partial product corresponding to the high-order data, which it derives according to each high-order coded signal, and from the values in the high-order partial product of the target code gated by the high selector bank unit 117. Optionally, in the booth encoding process, the number of the obtained low-order coded signals may be equal to the number of the obtained high-order coded signals, and may also be equal to the number of partial products after sign bit extension corresponding to the low-order data, or to the number of partial products after sign bit extension corresponding to the high-order data. Optionally, the correction coding circuit 11 may include N/4 low-order booth coding units 111 and may further include N/4 high-order booth coding units 114. Optionally, the correction coding circuit 11 may include N/4 low-order partial product obtaining units 112, and may further include N/4 high-order partial product obtaining units 115. Optionally, each low-order partial product obtaining unit 112 and each high-order partial product obtaining unit 115 may include 2N value generating sub-units, and each value generating sub-unit may obtain one value of the partial product after sign bit extension. Here, N may represent the bit width of the data received by the multiplier.

In the multiplier provided by this embodiment, the low-order booth encoding unit, the selector, and the high-order booth encoding unit in the modified encoding circuit perform booth encoding processing on received data to obtain low-order and high-order encoded signals, and the low-order partial product obtaining unit and the high-order partial product obtaining unit obtain a partial product of a target code according to the low-order and high-order encoded signals, and then accumulate the partial product of the target code to obtain a multiplication result.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 3, the multiplier includes the low-order booth encoding unit 111, and the low-order booth encoding unit 111 includes: a low-order data input port 1111 and a low-order encoded signal output port 1112. The low-order data input port 1111 is configured to receive the low-order data to be subjected to booth encoding processing, and the low-order encoded signal output port 1112 is configured to output the low-order encoded signal obtained by performing booth encoding processing on the low-order data.

Specifically, in the multiplication operation, the correction coding circuit 11 in the multiplier needs to perform booth coding processing on the multiplier. Each low-order booth coding unit 111 in the correction coding circuit 11 may receive, through the low-order data input port 1111, three adjacent bit values of the low-order data of the multiplier as one group of data to be coded. Each low-order booth coding unit 111 processes the received data to be coded and outputs the resulting low-order coded signal through the low-order coded signal output port 1112. In addition, the first low-order booth coding unit 111 in the correction coding circuit 11 may receive the complement value 0 and the lowest two bit values of the low-order data through the low-order data input port 1111.

Illustratively, if the multiplier receives 16-bit data "y₁₅y₁₄y₁₃y₁₂y₁₁y₁₀y₉y₈y₇y₆y₅y₄y₃y₂y₁y₀", whose bits are numbered 0, …, 15 from the lowest bit to the highest bit, the lower booth encoding unit 111 may perform booth encoding on the lower data y₇y₆y₅y₄y₃y₂y₁y₀. Before booth encoding, the 8-bit lower data is bit-complemented to obtain the 9-bit data y₇y₆y₅y₄y₃y₂y₁y₀0. The lower booth encoding units 111 may then respectively perform booth encoding on the four groups y₇y₆y₅, y₅y₄y₃, y₃y₂y₁ and y₁y₀0 of the 9-bit data, and each lower booth encoding unit 111 may receive the three adjacent bit values of its group through the lower-bit data input port 1111.

Each time booth coding is performed, the data obtained by bit-complementing the lower-order data may be divided into a plurality of groups of data to be coded, and the lower-order booth coding units 111 may perform booth coding on the groups at the same time. Optionally, the groups may be divided such that every three adjacent bit values in the bit-complemented data form one group of data to be coded, and the highest bit value of each group is also the lowest bit value of the next group. Optionally, the encoding rules of booth encoding can be seen in table 1, where y₂ᵢ₊₁, y₂ᵢ and y₂ᵢ₋₁ represent the bit values of each group of data to be coded, X represents the multiplicand received by the correction coding circuit 11, and PPᵢ (i = 0, 1, 2, ..., n) is the coded signal obtained after booth coding of the corresponding group. Optionally, as shown in table 1, the coded signal obtained after booth coding may belong to one of five classes: -2X, 2X, -X, X and 0. Illustratively, if the multiplicand received by the correction coding circuit 11 is "x₇x₆x₅x₄x₃x₂x₁x₀", then X may be represented as "x₇x₆x₅x₄x₃x₂x₁x₀".

TABLE 1

Illustratively, continuing with the above example, when i = 0, y₂ᵢ₊₁ = y₁, y₂ᵢ = y₀ and y₂ᵢ₋₁ = y₋₁, where y₋₁ represents the complement value 0 appended after y₀ (i.e., the multiplier after bit complementing is expressed as y₇y₆y₅y₄y₃y₂y₁y₀y₋₁). In the booth encoding process, the four groups of data to be coded y₋₁y₀y₁, y₁y₂y₃, y₃y₄y₅ and y₅y₆y₇ may be respectively encoded to obtain 4 low-order coded signals, where the highest bit value of each group is also the lowest bit value of the next group.
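The grouping above can be illustrated with a short Python sketch. It assumes that table 1 follows the standard radix-4 booth rule, under which a group (y₂ᵢ₊₁, y₂ᵢ, y₂ᵢ₋₁) selects the multiple (-2·y₂ᵢ₊₁ + y₂ᵢ + y₂ᵢ₋₁)·X. The names `booth_digits` and `to_signed` and the sample value are illustrative only and do not come from this application.

```python
def booth_digits(field, width, pad=0):
    """Return the radix-4 booth digits of a `width`-bit field, lowest group first.

    `pad` is the complement value placed below bit 0 (y_-1 in the text).
    Each digit is in {-2, -1, 0, 1, 2} and selects -2X, -X, 0, X or 2X.
    """
    bits = [pad] + [(field >> k) & 1 for k in range(width)]  # bits[k + 1] holds y_k
    digits = []
    for i in range(width // 2):
        y_lo, y_mid, y_hi = bits[2 * i], bits[2 * i + 1], bits[2 * i + 2]
        digits.append(-2 * y_hi + y_mid + y_lo)  # assumed standard booth rule
    return digits


def to_signed(value, width):
    """Interpret a `width`-bit pattern as a two's-complement integer."""
    return value - (1 << width) if value & (1 << (width - 1)) else value


low = 0b10110110                 # sample low-order data y7..y0
digits = booth_digits(low, 8)    # four groups -> four coded signals
print(digits)                    # [-2, 2, -1, -1]
print(sum(d * 4 ** i for i, d in enumerate(digits)), to_signed(low, 8))  # -74 -74
```

The weighted sum of the four digits equals the signed value of the low-order data, which is why accumulating the four selected multiples of X yields the product.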

In the multiplier provided by this embodiment, the low-order booth coding unit performs booth coding on the low-order data to obtain the corresponding low-order coded signal, the low-order partial product obtaining unit obtains the low-order partial product of the target code according to the low-order coded signal, and the low-order and high-order partial products of the target code are then accumulated to obtain the multiplication result.

In one embodiment, continuing with the specific structural diagram of the multiplier shown in fig. 3, the multiplier includes the low-order partial product obtaining unit 112, and the low-order partial product obtaining unit 112 includes: a low-order encoded signal input port 1121, a strobe value input port 1122, a data input port 1123, and a partial product value output port 1124; the low-order encoded signal input port 1121 is configured to receive the low-order encoded signal output by the low-order booth encoding unit 111, the strobe value input port 1122 is configured to receive a value in the low-order partial product of the target code output after being gated by the low-order selector bank unit 116, the data input port 1123 is configured to receive the data of the multiplication operation, and the partial product value output port 1124 is configured to output a value in the low-order partial product of the target code.

Specifically, the low-order partial product obtaining unit 112 may receive the low-order encoded signal output by the low-order booth encoding unit 111 through the low-order encoded signal input port 1121, and may receive the multiplicand of the multiplication operation through the data input port 1123. Optionally, the low-order partial product obtaining unit 112 may obtain the partial product after sign bit extension corresponding to the low-order data according to the received low-order encoded signal and the received multiplicand. Optionally, if the bit width of the multiplicand received by the data input port 1123 is N, the bit width of the partial product after sign bit extension may be equal to 2N. For example, if the low-order partial product obtaining unit 112 receives a multiplicand X with a bit width of N bits, it may directly obtain the corresponding sign-extended partial product according to the multiplicand X and the five types of encoded signals -2X, 2X, -X, X and 0, where the lower (N+1) bit values of the sign-extended partial product may be equal to the values of the original partial product, and the upper (N-1) bit values may be equal to the sign bit value of the original partial product, that is, its highest bit value.
When the encoded signal is -2X, the original partial product may be obtained by shifting X left by one bit, inverting the result bit by bit and adding 1; when the encoded signal is 2X, the original partial product may be obtained by shifting X left by one bit; when the encoded signal is -X, the original partial product may be obtained by inverting X bit by bit and adding 1; when the encoded signal is X, the original partial product may be the sign bit value of X (i.e., the most significant bit value of X) combined with X; and when the encoded signal is 0, the original partial product may be 0, i.e., each bit value in the original partial product (9 bits when N is 8) is equal to 0.
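As a minimal sketch of the sign extension described above, the following Python function forms the (N+1)-bit original partial product and fills the upper (N-1) bits with its sign bit. The function name and the integer digit encoding (-2, -1, 0, 1, 2 for -2X, -X, 0, X, 2X) are assumptions made for illustration. One corner case is left out of the sketch: for the most negative X, the multiple -2X does not fit in N+1 bits.

```python
def sign_extended_partial_product(x, digit, n):
    """Return the 2n-bit sign-extended partial product for digit * X.

    x     : n-bit two's-complement pattern of the multiplicand X
    digit : assumed integer encoding of the coded signal, one of -2, -1, 0, 1, 2
    """
    x_signed = x - (1 << n) if x & (1 << (n - 1)) else x
    # Original partial product, kept to n + 1 bits (two's complement).
    original = (digit * x_signed) & ((1 << (n + 1)) - 1)
    sign = (original >> n) & 1  # highest bit of the original partial product
    # Upper n - 1 bits replicate the sign bit.
    extension = (((1 << (n - 1)) - 1) << (n + 1)) if sign else 0
    return original | extension


# X = 5, coded signal -2X, N = 8: -10 as a 16-bit pattern.
print(hex(sign_extended_partial_product(0b00000101, -2, 8)))  # 0xfff6
# X = 5, coded signal X: sign bit 0, so the upper bits stay 0.
print(hex(sign_extended_partial_product(0b00000101, 1, 8)))   # 0x5
```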

It should be noted that the low-order partial product obtaining unit 112 may receive, through the gated value input port 1122, a corresponding bit value in the partial product after sign bit extension corresponding to the data with different bit widths gated by the low-order selector group unit 116, and obtain the low-order partial product of the target code according to the partial product after sign bit extension corresponding to the low-order data currently obtained by the multiplier and the corresponding bit value after gating.

In the multiplier provided by this embodiment, the low-order partial product obtaining unit obtains the low-order partial product of the target code according to each low-order coded signal, and the low-order and high-order partial products of the target code are then accumulated to obtain the multiplication result.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 3, the multiplier includes the selector 113, and the selector 113 includes: a function selection mode signal input port 1131(mode), a first strobe value input port 1132, a second strobe value input port 1133, and an operation result output port 1134; the function selection mode signal input port 1131 is configured to receive a function selection mode signal corresponding to data with different bit widths that needs to be processed by a multiplier, the first strobe value input port 1132 is configured to receive a first strobe value, the second strobe value input port 1133 is configured to receive a second strobe value, and the operation result output port 1134 outputs the first strobe value or the second strobe value after being strobed.

Specifically, the selector 113 may determine, from the function selection mode signal received through the function selection mode signal input port 1131, the data bit width currently processed by the multiplier, and accordingly determine whether the operation result output port 1134 outputs the first strobe value or the second strobe value. Optionally, one of the first strobe value and the second strobe value may be 0, and the other may be the highest bit value of the low-order data.

For example, during the multiplication, suppose the multiplier and the multiplicand received by the correction coding circuit 11 are both 16-bit data, and the function selection mode signal input port 1131 (mode) of the selector 113 can receive two different function selection mode signals, mode = 0 and mode = 1, where mode = 0 indicates that the multiplier processes 8-bit data and mode = 1 indicates that the multiplier processes 16-bit data. When the mode received by the function selection mode signal input port 1131 (mode) of the selector 113 is 0, the multiplier currently processes 8-bit data operations, and the selector 113 may receive the second strobe value through the second strobe value input port 1133, where the second strobe value may be equal to 0. When the mode received by the function selection mode signal input port 1131 (mode) of the selector 113 is 1, the multiplier currently processes 16-bit data operations, and the selector 113 may receive the first strobe value through the first strobe value input port 1132, where the first strobe value may be equal to the highest bit value of the low-order data.

It should be noted that, if the multiplier currently processes 8-bit data multiplication, the 16-bit multiplier and the 16-bit multiplicand are each treated as two 8-bit data, and 8-bit by 8-bit multiplications are performed: the high 8 bits of the multiplier are booth coded by the high-order booth coding unit 114, and the low 8 bits of the multiplier are booth coded by the low-order booth coding unit 111. In this case, the selector 113 may receive the second strobe value 0 through the second strobe value input port 1133, so that the complement value used when bit-complementing the high 8-bit data is equal to 0. If the multiplier currently processes 16-bit data multiplication, the multiplier directly multiplies the 16-bit multiplier by the 16-bit multiplicand, that is, the correction coding circuit 11 performs booth coding on the 16-bit multiplier as a whole; in this case, the selector 113 may receive the first strobe value through the first strobe value input port 1132, where the first strobe value is the highest bit value of the low 8-bit data.
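The effect of the gated complement value can be checked with a small Python model. This is a sketch under the assumption that booth coding follows the standard radix-4 rule; the helper names (`booth_digits`, `encode_multiplier`, `digits_value`) are hypothetical and serve only to illustrate the two modes.

```python
def booth_digits(field, width, pad):
    """Radix-4 booth digits of a `width`-bit field; `pad` sits below bit 0."""
    bits = [pad] + [(field >> k) & 1 for k in range(width)]
    return [-2 * bits[2 * i + 2] + bits[2 * i + 1] + bits[2 * i]
            for i in range(width // 2)]


def encode_multiplier(y, mode):
    """Booth-code a 16-bit multiplier y.

    mode 0: two independent 8-bit operands, gated complement value is 0.
    mode 1: one 16-bit operand, gated complement value is bit 7 of the low data.
    """
    low, high = y & 0xFF, (y >> 8) & 0xFF
    gate = (low >> 7) & 1 if mode == 1 else 0  # value gated by the selector
    return booth_digits(low, 8, 0), booth_digits(high, 8, gate)


def digits_value(digits):
    """Signed value represented by a list of radix-4 digits, lowest first."""
    return sum(d * 4 ** i for i, d in enumerate(digits))


def to_signed(value, width):
    return value - (1 << width) if value & (1 << (width - 1)) else value


y = 0x9C5A
low_d, high_d = encode_multiplier(y, mode=1)
# 16-bit mode: the eight digits together represent the signed 16-bit value.
print(digits_value(low_d + high_d) == to_signed(y, 16))   # True
low_d, high_d = encode_multiplier(y, mode=0)
# 8-bit mode: each half represents its own signed 8-bit value.
print(digits_value(low_d) == to_signed(y & 0xFF, 8),
      digits_value(high_d) == to_signed(y >> 8, 8))        # True True
```

Gating 0 cuts the overlap between the lowest high-order group and the low-order data, so the two halves are coded independently; gating the highest bit of the low-order data restores the overlap needed to code the full-width operand.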

According to the multiplier provided by the embodiment, the function selection mode signal received by the selector can determine the bit complement value of the high-order data during Booth encoding processing, so that Booth encoding processing is performed on the data after bit complement, the multiplier can perform multiplication operation on data with various bit widths, and the area of an AI chip occupied by the multiplier is effectively reduced.

Fig. 3 is a schematic diagram illustrating a specific structure of a multiplier according to another embodiment, wherein the multiplier includes the high-order booth encoding unit 114, and the high-order booth encoding unit 114 includes: a high-order data input port 1141 and a high-order coded signal output port 1142; the high-order data input port 1141 is configured to receive the high-order data to be subjected to booth coding, and the high-order coded signal output port 1142 is configured to output the high-order coded signal obtained by performing booth coding on the high-order data.

Specifically, in the multiplication operation, the correction coding circuit 11 in the multiplier needs to perform booth coding processing on the multiplier, and the high-order booth coding unit 114 in the correction coding circuit 11 may receive three-bit values in high-order data in the multiplier through the high-order data input port 1141, where the three-bit values are used as a group of data to be coded, and the three values may be adjacent three-bit values in the high-order data.

Illustratively, continuing with the example of a 16-bit data multiplication operation, the high-order booth encoding units 114 may respectively perform booth encoding on the four groups y₇y₆y₅, y₅y₄y₃, y₃y₂y₁ and y₁y₀0 of the 9-bit data y₇y₆y₅y₄y₃y₂y₁y₀0, and each high-order booth encoding unit 114 may receive the three consecutive bit values of its group through the high-bit data input port 1141.

It should be noted that the principle of the higher booth encoding unit 114 processing the higher data to be encoded at each booth encoding process may be the same as the principle of the lower booth encoding unit 111 processing the lower data to be encoded. The internal circuit configuration of the higher booth encoding unit 114 and the lower booth encoding unit 111 may be the same, and the function of the external output port may be the same.

In the multiplier provided by this embodiment, the high-order booth coding unit performs booth coding on the high-order data to obtain the corresponding high-order coded signal, the high-order partial product obtaining unit obtains the high-order partial product of the target code according to the high-order coded signal, and the high-order and low-order partial products of the target code are then accumulated to obtain the multiplication result.

In one embodiment, continuing with the specific structural diagram of the multiplier shown in fig. 3, the multiplier includes the high-order partial product obtaining unit 115, and the high-order partial product obtaining unit 115 includes: a high-order coded signal input port 1151, a strobe value input port 1152, a data input port 1153, and a partial product value output port 1154; the high-order coded signal input port 1151 is configured to receive the high-order coded signal output by the high-order booth coding unit 114, the strobe value input port 1152 is configured to receive a value in the high-order partial product of the target code output after being gated by the high-order selector bank unit 117, the data input port 1153 is configured to receive the data of the multiplication operation, and the partial product value output port 1154 is configured to output a value in the high-order partial product of the target code.

Specifically, the high-order partial product obtaining unit 115 may receive the high-order coded signal output by the high-order booth coding unit 114 through the high-order coded signal input port 1151, and may receive a multiplicand in the multiplication operation through the data input port 1153. Optionally, the high-order partial product obtaining unit 115 may obtain a partial product after sign bit extension corresponding to the high-order data according to the received high-order coded signal and the received multiplicand in the multiplication operation. Optionally, if the multiplicand bit width received by the data input port 1153 is N, the bit width of the partial product after sign bit extension may be equal to 2N.

It should be noted that the high-order partial product obtaining unit 115 may receive, through the strobe value input port 1152, the corresponding bit values of the sign-extended partial product gated by the high-order selector group unit 117 for the different data bit widths, and obtain the high-order partial product of the target code according to the sign-extended partial product currently obtained for the high-order data and the gated bit values.

In the multiplier provided by this embodiment, the high-order partial product obtaining unit obtains the high-order partial product of the target code according to each high-order coded signal, and the high-order and low-order partial products of the target code are then accumulated to obtain the multiplication result.

In one embodiment, continuing with the specific structure diagram of the multiplier shown in fig. 3, the multiplier includes the low selector bank unit 116, and the low selector bank unit 116 includes: a low selector 1161, a plurality of said low selectors 1161 are used for gating the value in the low bit partial product of the target code.

Specifically, the number of low selectors 1161 in the low selector bank unit 116 may be equal to 3/8 of the square of the bit width of the data currently received by the multiplier, and the internal circuit structures of the low selectors 1161 in the low selector bank unit 116 may be the same. Optionally, during the multiplication, the low-order partial product obtaining unit 112 connected to each low-order booth encoding unit 111 may include 2N value generating sub-units, of which N value generating sub-units may be connected to N low selectors 1161, one low selector 1161 per value generating sub-unit, where N represents the bit width of the data currently received by the multiplier. Optionally, these N value generating sub-units may be the ones corresponding to the high N values of the low-order partial product of the target code. The internal circuit structure of the N low selectors 1161 may be completely the same as that of the selector 113, and each of the N low selectors 1161 has two other input ports besides the function selection mode signal input port (mode). Optionally, if the multiplier can process data operations of several different bit widths and the bit width of the data received by the multiplier is N, the signals received by the two other input ports of each low selector 1161 may be, respectively, 0 and the sign bit value of the sign-extended partial product corresponding to the low-order booth encoding unit 111 when the multiplier performs an N-bit data operation.
The N/4 low-order partial product obtaining units 112 may be connected to N/4 groups of N low selectors 1161. The sign bit values received by different groups may be the same or different, but the N low selectors 1161 within one group receive the same sign bit value, which may be taken from the sign-extended partial product obtained by the low-order partial product obtaining unit 112 connected to that group.

In addition, among the 2N value generating sub-units included in each low-order partial product obtaining unit 112, N/2 value generating sub-units may not be connected to any low selector 1161. The values obtained by these N/2 value generating sub-units may be the corresponding bit values of the sign-extended partial product of the low-order data, whichever bit width the multiplier is currently processing; in other words, they may be all the values from bit N/2-1 down to the lowest bit of the sign-extended partial product.

In addition, among the 2N value generating sub-units included in each low-order partial product obtaining unit 112, the remaining N/2 value generating sub-units may also be connected to N/2 low selectors 1161, one low selector 1161 per value generating sub-unit. The internal circuit structure of these N/2 low selectors 1161 may be the same as that of the selector 113, and each of them has two other input ports besides the function selection mode signal input port (mode). The signals received by these two input ports may be, respectively, the sign bit value of the sign-extended partial product obtained when the multiplier performs an N/2-bit data operation, and the corresponding bit value of the sign-extended partial product obtained when the multiplier performs an N-bit data operation. The N/4 low-order partial product obtaining units 112 may be connected to N/4 groups of N/2 low selectors 1161. The sign bit values received by different groups may be the same or different, but the N/2 low selectors 1161 within one group receive the same sign bit value, which may be taken from the sign-extended partial product obtained by the low-order partial product obtaining unit 112 connected to that group.

In addition, the corresponding bit values of the sign-extended partial product received by each group of N/2 low selectors 1161 may be determined from the sign-extended partial product obtained by the low-order partial product obtaining unit 112 connected to that group, and the bit values received by the individual low selectors 1161 within a group may be the same or different. The positions of the 2N value generating sub-units in each low-order partial product obtaining unit 112 may be shifted to the left by two value generating sub-units relative to those in the previous low-order partial product obtaining unit 112. Optionally, among the low-order partial products of the target code, only the first may have a bit width equal to 2N; each subsequent low-order partial product may be two bits narrower than the preceding one, and the last may have a bit width equal to (3N/2+2).
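The counts stated above are mutually consistent, as the following Python sketch of the arithmetic shows. It only restates figures given in the text (N/4 units, N plus N/2 selectors per unit, widths decreasing by two bits from 2N); the function name is hypothetical.

```python
def low_partial_product_layout(n):
    """Return (bit widths of the low-order partial products, low selector count)."""
    rows = n // 4                                   # N/4 low-order partial products
    widths = [2 * n - 2 * r for r in range(rows)]   # 2N, then two bits narrower each row
    selectors_per_row = n + n // 2                  # N selectors plus N/2 selectors
    return widths, rows * selectors_per_row


widths, selector_count = low_partial_product_layout(16)
print(widths)           # [32, 30, 28, 26]; the last equals 3N/2 + 2 = 26
print(selector_count)   # 96, which equals 3/8 * 16**2
```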

In the multiplier provided by this embodiment, the low selector set unit in the multiplier may gate the value in the low-order partial product to obtain the low-order partial product of the target code, and then accumulate the low-order partial product and the high-order partial product of the target code by the correction compression circuit to obtain the multiplication result.

In one embodiment, continuing with the specific structure diagram of the multiplier shown in fig. 3, the multiplier includes the high selector bank unit 117, and the high selector bank unit 117 includes: a high bit selector 1171, a plurality of said high bit selectors 1171 for gating the value in the high bit partial product of the target code.

Specifically, the number of high selectors 1171 in the high selector bank unit 117 may be equal to 3/8 of the square of the bit width of the data currently received by the multiplier, and the internal circuit structures of the high selectors 1171 in the high selector bank unit 117 may be the same. Optionally, during the multiplication, the high-order partial product obtaining unit 115 connected to each high-order booth encoding unit 114 may include 2N value generating sub-units, of which N value generating sub-units may be connected to N high selectors 1171, one high selector 1171 per value generating sub-unit, where N represents the bit width of the data currently received by the multiplier. Optionally, these N value generating sub-units may be the ones corresponding to the low N values of the high-order partial product of the target code. The internal circuit structure of the N high selectors 1171 may be completely the same as that of the selector 113, and each of the N high selectors 1171 has two other input ports besides the function selection mode signal input port (mode). Optionally, if the multiplier can process data operations of several different bit widths and the bit width of the data received by the multiplier is N, the signals received by the two other input ports of each high selector 1171 may be, respectively, 0 and the corresponding bit value of the sign-extended partial product corresponding to the high-order booth encoding unit 114 when the multiplier performs an N-bit data operation. The N/4 high-order partial product obtaining units 115 may be connected to N/4 groups of N high selectors 1171, and the corresponding bit values received by the N high selectors 1171 of each group may be the same or different.

In addition, among the 2N value generating sub-units included in each high-order partial product obtaining unit 115, N/2 value generating sub-units may be connected to N/2 high selectors 1171, one high selector 1171 per value generating sub-unit. The internal circuit structure of these N/2 high selectors 1171 may be the same as that of the selector 113, and each of them has two other input ports besides the function selection mode signal input port (mode). The signals received by these two input ports may be, respectively, the sign bit value of the sign-extended partial product obtained when the multiplier performs an N/2-bit data operation, and the sign bit value of the sign-extended partial product obtained when the multiplier performs an N-bit data operation. The N/4 high-order partial product obtaining units 115 may be connected to N/4 groups of N/2 high selectors 1171. The sign bit values received by different groups may be the same or different, but the N/2 high selectors 1171 within one group receive the same sign bit value, which may be taken from the sign-extended partial product obtained by the high-order partial product obtaining unit 115 connected to that group. In addition, the corresponding bit values of the sign-extended partial product received by each group of N/2 high selectors 1171 may be determined from the sign bit value of the sign-extended partial product obtained by the high-order partial product obtaining unit 115 connected to that group, and the bit values received by the individual high selectors 1171 within a group may be the same or different.

It should be noted that, among the 2N value generating sub-units included in each high-order partial product obtaining unit 115, the remaining N/2 value generating sub-units may not be connected to any high selector 1171. The values obtained by these N/2 value generating sub-units may be the corresponding bit values of the sign-extended partial product of the high-order data, whichever bit width the multiplier is currently processing; in other words, they may be all the values from bit 3N/2-1 down to bit N+1 of the sign-extended partial product. The positions of the 2N value generating sub-units in each high-order partial product obtaining unit 115 may be shifted to the left by two value generating sub-units relative to those in the previous high-order partial product obtaining unit 115. Optionally, among the high-order partial products of the target code, only the first may have a bit width equal to 3N/2, and each subsequent high-order partial product may have two fewer high-order values than the preceding one.

In the multiplier provided by this embodiment, the high selector set unit in the multiplier can gate the value in the high-order partial product to obtain the high-order partial product of the target code, and then the high-order partial product and the low-order partial product of the target code are accumulated by the correction compression circuit to obtain the multiplication result.

Fig. 3 is a schematic diagram of a specific structure of a multiplier according to another embodiment, where the multiplier includes the modified compression circuit 12, and the modified compression circuit 12 includes: a modified Wallace tree group circuit 121 and an accumulation circuit 122, wherein the output end of the modified Wallace tree group circuit 121 is connected with the input end of the accumulation circuit 122; the modified wallace tree group circuit 121 is configured to accumulate values in each column of a partial product of a target code obtained when data with different bit widths are calculated, and the accumulation circuit 122 is configured to accumulate received input data.

Specifically, the modified wallace tree group circuit 121 may perform accumulation processing on each column number value in the partial product of the target code obtained by the modified encoding circuit 11, and the accumulation circuit 122 may perform accumulation processing on two operation results obtained by the modified wallace tree group circuit 121 to obtain a final result of multiplication. When the modified wallace tree group circuit 121 performs the accumulation processing, the distribution rule of all partial products of the target code can be characterized in that the position of the lowest bit value of the corresponding partial product of each row is staggered by two bits to the right compared with the position of the lowest bit value of the corresponding partial product of the next row, and the modified wallace tree group circuit 121 performs the accumulation processing on each column number value in all partial products of the target code according to the distribution rule. Optionally, the partial product of the target code may include a lower bit partial product of the target code and an upper bit partial product of the target code. Optionally, the two operation results obtained by the modified wallace tree group circuit 121 may include a sum output signal S and a carry output signal C.
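The accumulation flow (compress all partial products to a sum word S and a carry word C, then add the two) can be sketched in Python at word level. This is an illustrative model only: it uses generic 3:2 carry-save compression on whole words rather than the per-column sub-circuits of the modified wallace tree group circuit 121, it assumes the standard radix-4 booth rule, and all function names are hypothetical.

```python
def booth_partial_products(x, y, n):
    """2n-bit partial products digit_i * X * 4**i for n-bit two's-complement x, y."""
    mask = (1 << (2 * n)) - 1
    xs = x - (1 << n) if (x >> (n - 1)) & 1 else x
    bits = [0] + [(y >> k) & 1 for k in range(n)]     # bits[0] is the complement value
    rows = []
    for i in range(n // 2):
        digit = -2 * bits[2 * i + 2] + bits[2 * i + 1] + bits[2 * i]
        rows.append(((digit * xs) << (2 * i)) & mask)  # each row is staggered by two bits
    return rows


def compress_3_to_2(a, b, c, mask):
    """Carry-save step: a + b + c == s + carry (modulo 2**width)."""
    s = a ^ b ^ c
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return s, carry


def wallace_reduce(rows, width):
    """Reduce the rows to two words (S, C) with repeated 3:2 compression."""
    mask = (1 << width) - 1
    rows = [r & mask for r in rows] + [0] * max(0, 2 - len(rows))
    while len(rows) > 2:
        nxt = []
        for k in range(0, len(rows) - 2, 3):
            nxt.extend(compress_3_to_2(rows[k], rows[k + 1], rows[k + 2], mask))
        nxt.extend(rows[len(rows) - len(rows) % 3:])   # rows left over this pass
        rows = nxt
    return rows[0], rows[1]


def multiply(x, y, n):
    """Signed n-bit by n-bit product as a 2n-bit pattern."""
    s, c = wallace_reduce(booth_partial_products(x, y, n), 2 * n)
    return (s + c) & ((1 << (2 * n)) - 1)              # final accumulation of S and C


print(hex(multiply(0xFFFD, 0x0007, 16)))  # -3 * 7 = -21 -> 0xffffffeb
```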

For example, if the multiplier currently processes a 16-bit by 16-bit fixed-point multiplication, the distribution rule of the 4 lower partial products and the 4 upper partial products of the target code obtained by the modified encoding circuit 11 is shown in fig. 4, where "○" represents each bit value in the lower partial products, [symbol not reproduced] represents each bit value in the upper partial products, and "●" represents the sign-extended bit values of either the lower or the upper partial products.

According to the multiplier provided by the embodiment, the low-order part and the high-order part of the target code can be accumulated through the modified Wallace tree group circuit, the accumulated result is accumulated again through the accumulation circuit, and the final result of multiplication is obtained.
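The two-stage accumulation described above (compress the partial products to a sum signal S and a carry signal C, then add the two) can be illustrated with a minimal Python sketch. It is not the circuit of fig. 3: it assumes plain unsigned AND-array partial products instead of the target code, and it reduces whole rows with 3:2 carry-save steps. The names `carry_save_reduce` and `multiply_unsigned` are hypothetical.

```python
def carry_save_reduce(rows, width):
    """Reduce non-negative integers to a (sum, carry) pair, modulo 2**width.

    Each step replaces three rows by two: a per-column sum bit and a
    per-column carry bit that is worth one column more.
    """
    mask = (1 << width) - 1
    rows = [r & mask for r in rows]
    while len(rows) > 2:
        a, b, c = rows[0], rows[1], rows[2]
        s = a ^ b ^ c                              # sum bit of every column
        cy = ((a & b) | (a & c) | (b & c)) << 1    # carry bit, moved one column up
        rows = rows[3:] + [s & mask, cy & mask]
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def multiply_unsigned(x, y, n):
    """Multiply two n-bit unsigned values via carry-save reduction (illustrative only)."""
    width = 2 * n
    # Assumption: simple shifted copies of x stand in for the encoded partial products.
    rows = [x << i for i in range(n) if (y >> i) & 1]
    s, c = carry_save_reduce(rows, width)
    return (s + c) & ((1 << width) - 1)            # role of the accumulation circuit


result = multiply_unsigned(200, 123, 8)
```

Only the last line of `multiply_unsigned` performs a carry-propagating addition; everything before it keeps the intermediate value as two separate words.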

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 3, the multiplier includes the modified Wallace tree group circuit 121, and the modified Wallace tree group circuit 121 includes: low-order Wallace tree sub-circuits 1211, a selector 1212 and high-order Wallace tree sub-circuits 1213, wherein an output terminal of the low-order Wallace tree sub-circuits 1211 is connected with an input terminal of the selector 1212, and an output terminal of the selector 1212 is connected with an input terminal of the high-order Wallace tree sub-circuits 1213; the low-order Wallace tree sub-circuits 1211 are configured to accumulate each column value of the partial products of the target code, the selector 1212 is configured to gate the carry input signal received by the high-order Wallace tree sub-circuits 1213, and the high-order Wallace tree sub-circuits 1213 are configured to accumulate each column value of the partial products of the target code.

Specifically, the circuit structures of the plurality of low-order Wallace tree sub-circuits 1211 and the plurality of high-order Wallace tree sub-circuits 1213 may be implemented by a combination of full adders and half adders, or by a combination of 4-2 compressors; each sub-circuit may be understood as a circuit capable of processing a multi-bit input signal and adding it to obtain a two-bit output signal. Optionally, the number of high-order Wallace tree sub-circuits 1213 in the modified Wallace tree group circuit 121 may be equal to the bit width N of the data currently received by the multiplier, or may be equal to the number of low-order Wallace tree sub-circuits 1211; the low-order Wallace tree sub-circuits 1211 may be connected in series, and the high-order Wallace tree sub-circuits 1213 may likewise be connected in series. Optionally, the output terminal of the last low-order Wallace tree sub-circuit 1211 is connected to the input terminal of the selector 1212, and the output terminal of the selector 1212 is connected to the input terminal of the first high-order Wallace tree sub-circuit 1213. Optionally, each low-order Wallace tree sub-circuit 1211 of the modified Wallace tree group circuit 121 may add the values of its corresponding column in all partial products of the target code, and each low-order Wallace tree sub-circuit 1211 may output two signals, namely a carry signal Carry_i and a sum signal Sum_i, where i may represent the number corresponding to each low-order Wallace tree sub-circuit 1211, and the number of the first low-order Wallace tree sub-circuit 1211 is 0. Optionally, the number of input signals received by each low-order Wallace tree sub-circuit 1211 may be equal to the number of encoded signals or the number of partial products of the target code.
The sum of the numbers of the upper-order Wallace tree sub-circuits 1213 and the lower-order Wallace tree sub-circuits 1211 in the modified Wallace tree group circuit 121 may be equal to 2N, the total number of columns from the lowest column to the highest column in all partial products of the target code may be equal to 2N, the N lower-order Wallace tree sub-circuits 1211 may perform the accumulation operation on each of the lower N columns of all partial products of the target code, and the N upper-order Wallace tree sub-circuits 1213 may perform the accumulation operation on each of the upper N columns of all partial products of the target code.

Illustratively, suppose the bit width of the data received by the multiplier is N bits. If the multiplier currently performs an N-bit data multiplication, the selector 1212 may gate the carry output signal Cout_N output by the last low-order Wallace tree sub-circuit 1211 in the modified Wallace tree group circuit 121 as the carry input signal Cin_N+1 received by the first high-order Wallace tree sub-circuit 1213 in the modified Wallace tree group circuit 121; in other words, the multiplier operates on the received N-bit data as a whole. If the multiplier currently performs an N/2-bit data multiplication, the selector 1212 may gate 0 as the carry input signal Cin_N+1 received by the first high-order Wallace tree sub-circuit 1213 in the modified Wallace tree group circuit 121; in other words, the multiplier divides the received N-bit data into upper N/2-bit data and lower N/2-bit data and multiplies them separately. Here the numbers i of the low-order Wallace tree sub-circuits 1211, from the first to the last, are 1, 2, …, N, and the numbers i of the high-order Wallace tree sub-circuits 1213, from the first to the last, are N+1, N+2, …, 2N.
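The effect of the selector can be seen in a small Python model. This is a sketch under stated assumptions, not the patented circuit: the partial-product rows come from a plain two's-complement shift-and-add scheme rather than the target code, columns are numbered from 0, and the names `signed_rows` and `column_accumulate` are hypothetical. With N = 8, the 16 columns either carry through column 8 (whole-width mode) or have the carry into column 8 forced to 0 (two independent 4-bit multiplications).

```python
def signed_rows(a, b, n, width):
    """Partial-product rows of a*b for an n-bit two's-complement b, truncated to width bits."""
    mask = (1 << width) - 1
    rows = []
    for i in range(n):
        if (b >> i) & 1:                      # >> is arithmetic for negative Python ints
            weight = -(1 << i) if i == n - 1 else (1 << i)
            rows.append((a * weight) & mask)
    return rows


def column_accumulate(rows, width, gate_col=None):
    """Add rows column by column; optionally force the carry into gate_col to 0."""
    carry, result = 0, 0
    for col in range(width):
        if col == gate_col:
            carry = 0                         # selector gates 0 instead of the low half's carry
        total = carry + sum((r >> col) & 1 for r in rows)
        result |= (total & 1) << col
        carry = total >> 1
    return result


# Two independent 4-bit signed products sharing one 16-column array:
low = signed_rows(-3, -2, 4, 8)                       # low columns 0..7, product 6
high = [r << 8 for r in signed_rows(5, 3, 4, 8)]      # high columns 8..15, product 15
gated = column_accumulate(low + high, 16, gate_col=8)
ungated = column_accumulate(low + high, 16)

# Whole-width mode: one 8-bit signed product, carry chain left intact.
full = column_accumulate(signed_rows(-100, 77, 8, 16), 16)
```

In this model the truncated sign-extended rows of the low product overflow past column 7; with the gate the high half is unaffected (`gated` is 0x0F06), while without it the stray carry corrupts the high product (`ungated` is 0x1106).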

It should be noted that, for each low-order Wallace tree sub-circuit 1211 and each high-order Wallace tree sub-circuit 1213 of the modified Wallace tree group circuit 121, the signals involved may include the carry input signal Cin_i, the partial product value input signal, and the carry output signal Cout_i. Optionally, the partial product value input signals received by each low-order Wallace tree sub-circuit 1211 and each high-order Wallace tree sub-circuit 1213 may be the values of the corresponding column in all partial products of the target code, and the number of bits of the carry output signal Cout_i output by each low-order Wallace tree sub-circuit 1211 and each high-order Wallace tree sub-circuit 1213 may be equal to N_Cout = floor((N_I + N_Cin) / 2) - 1, where N_I may represent the number of data input bits of the Wallace tree sub-circuit, N_Cin may represent the number of carry input bits of the Wallace tree sub-circuit, N_Cout may represent the minimum number of carry output bits of the Wallace tree sub-circuit, and floor(·) may represent the round-down function. Optionally, the carry input signal received by each low-order Wallace tree sub-circuit 1211 or high-order Wallace tree sub-circuit 1213 in the modified Wallace tree group circuit 121 may be the carry output signal output by the preceding low-order Wallace tree sub-circuit 1211 or high-order Wallace tree sub-circuit 1213, and the carry input signal received by the first low-order Wallace tree sub-circuit 1211 is 0. The carry input signal received by the first high-order Wallace tree sub-circuit 1213 may be determined by the bit width of the data currently processed by the multiplier and the bit width of the data received by the multiplier.
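As a quick check of the carry-width relation, the following Python sketch evaluates N_Cout = floor((N_I + N_Cin) / 2) - 1 for a few input counts. The function name `min_carry_out_bits` is hypothetical. The relation agrees with two familiar building blocks: a full adder used as a three-input column needs no extra carry output besides its Sum and Carry, and a 4-2 compressor with four data inputs and one carry input needs exactly one.

```python
def min_carry_out_bits(n_inputs, n_carry_in):
    """Minimum number of carry output bits of one Wallace tree sub-circuit.

    n_inputs   -- N_I, number of data input bits of the column
    n_carry_in -- N_Cin, number of carry input bits from the preceding column
    """
    return (n_inputs + n_carry_in) // 2 - 1   # // is floor division


full_adder = min_carry_out_bits(3, 0)         # 3 inputs, no carry input
compressor_4_2 = min_carry_out_bits(4, 1)     # 4 inputs, 1 carry input
eight_rows = min_carry_out_bits(8, 3)         # e.g. 8 partial-product bits, 3 carry inputs
```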

According to the multiplier provided by the embodiment, the partial product of the target code can be accumulated by the modified Wallace tree group circuit to obtain two paths of output signals, and the two paths of output signals are accumulated again by the accumulation circuit to obtain a multiplication result.

Fig. 3 is a schematic diagram of a specific structure of a multiplier according to another embodiment, in which the multiplier includes the accumulation circuit 122, and the accumulation circuit 122 includes an adder 1221, where the adder 1221 is configured to perform an addition operation on two received data of the same bit width.

Specifically, the adder 1221 may be a carry adder with different bit widths. Optionally, the adder 1221 may receive the two paths of signals output by the modified wallace tree group circuit 121, perform addition operation on the two paths of output signals, and output a multiplication result. Alternatively, the adder 1221 may be a carry look ahead adder.

According to the multiplier provided by the embodiment, the two paths of signals output by the modified Wallace tree group circuit can be accumulated through the accumulation circuit, the multiplication result is output, the process can be used for carrying out multiplication on data with different bit widths, and the area of an AI chip occupied by the multiplier is effectively reduced.

In one embodiment, continuing with the specific structural diagram of the multiplier shown in fig. 3, the multiplier includes the adder 1221, and the adder 1221 includes: a carry signal input port 1221a, a sum bit signal input port 1221b, and an operation result output port 1221c; the carry signal input port 1221a is configured to receive a carry signal, the sum bit signal input port 1221b is configured to receive a sum bit signal, and the operation result output port 1221c is configured to output a result of performing accumulation processing on the carry signal and the sum bit signal.

Specifically, the adder 1221 may receive the Carry signal Carry output by the modified wallace tree group circuit 121 through the Carry signal input port 1221a, receive the Sum bit signal Sum output by the modified wallace tree group circuit 121 through the Sum bit signal input port 1221b, add the Carry signal Carry and the Sum bit signal Sum, and output the result through the operation result output port 1221 c.

It should be noted that, during multiplication, the multiplier may adopt adders 1221 with different bit widths to add the carry output signal Carry and the sum bit output signal Sum output by the modified Wallace tree group circuit 121, where the bit width of the data that the adder 1221 can process may be equal to 2 times the bit width M of the data currently processed by the multiplier. Optionally, each low-order Wallace tree sub-circuit 1211 and each high-order Wallace tree sub-circuit 1213 of the modified Wallace tree group circuit 121 may output a carry output signal Carry_i and a sum bit output signal Sum_i (i = 1, …, 2M, where i is the number of each low-order or high-order Wallace tree sub-circuit, starting from 1). Optionally, the carry output signal received by the adder 1221 is Carry = {[Carry_1 : Carry_2M-1], 0}; that is, the bit width of the carry output signal Carry received by the adder 1221 is 2M, the first 2M-1 bit values in Carry correspond to the carry output signals of the first 2M-1 low-order and high-order Wallace tree sub-circuits in the modified Wallace tree group circuit 121, and the last bit value in Carry may be replaced by 0. Optionally, the bit width of the sum bit output signal Sum received by the adder 1221 is 2M, and the values in Sum may be equal to the sum bit output signals of the low-order and high-order Wallace tree sub-circuits of the modified Wallace tree group circuit 121.

For example, if the multiplier is currently processing an 8-bit by 8-bit fixed-point multiplication, the adder 1221 may be a 16-bit carry adder. As shown in fig. 11, the modified Wallace tree group circuit 121 may output the sum output signals Sum and the carry output signals Carry of the 16 low-order and high-order Wallace tree sub-circuits; the sum signal received by the 16-bit carry adder may be the complete sum signal Sum output by the modified Wallace tree group circuit 121, and the carry signal received may be formed by combining all carry output signals of the modified Wallace tree group circuit 121, except the one output by the last high-order Wallace tree sub-circuit 1213, with 0.
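The way the adder's two operands are assembled can be sketched in Python as follows. The sketch assumes each column delivers one sum bit and one carry bit, uses 0-based list indices instead of the 1-based numbering in the text, and feeds the adder from a trivial half-adder array rather than a Wallace tree; `final_add` and `half_adder_columns` are hypothetical names.

```python
def final_add(sum_bits, carry_bits):
    """Combine per-column outputs into the adder result (index 0 = lowest column)."""
    width = len(sum_bits)
    sum_word = sum(bit << i for i, bit in enumerate(sum_bits))
    # Carry = {[Carry_first : Carry_(last-1)], 0}: every carry moves one column up,
    # the lowest position is filled with 0 and the last column's carry is dropped.
    carry_word = sum(bit << (i + 1) for i, bit in enumerate(carry_bits[:-1]))
    return (sum_word + carry_word) & ((1 << width) - 1)


def half_adder_columns(x, y, width):
    """Stand-in for the tree: per-column sum and carry bits of x + y."""
    sum_bits = [((x ^ y) >> i) & 1 for i in range(width)]
    carry_bits = [((x & y) >> i) & 1 for i in range(width)]
    return sum_bits, carry_bits


total = final_add(*half_adder_columns(0xBEEF, 0x1234, 16))
wrapped = final_add(*half_adder_columns(0xFFFF, 0x0001, 16))
```

Dropping the carry of the last column is what keeps the result within the adder's bit width; in the second call the sum wraps around to 0.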

According to the multiplier provided by the embodiment, the accumulation circuit can accumulate two paths of signals output by the modified Wallace tree group circuit and output a multiplication result, the process can multiply data with different bit widths, and the area of an AI chip occupied by the multiplier is effectively reduced.

Fig. 5 is a schematic diagram of a specific structure of a multiplier according to another embodiment, where the multiplier includes the determining circuit 11, and the determining circuit 11 includes: a data input port 111 and a data output port 112; the data input port 111 is configured to receive data to be multiplied, the data output port 112 is configured to output the received data, and the fourth data input port 114 is configured to output a second received data.

Specifically, the judgment circuit 11 receives two data to be multiplied through the data input port 111. Optionally, the data received by the determining circuit 11 may be a multiplier and a multiplicand in a multiplication operation, and bit widths of the multiplier and the multiplicand may be the same. Alternatively, the judgment circuit 11 may output the received two data through the data output port 112 and input the two data to the data expansion circuit 12 at the same time, or input the two data to the encoding circuit 13 at the same time.

It should be noted that, if the determining circuit 11 determines that the bit width of the two received data is N and is smaller than the bit width 2N of the data that can be processed by the multiplier, at this time, the determining circuit 11 needs to input the two received data with the bit width of N bits to the data expanding circuit 12 for expansion processing, so as to obtain two data with the bit width of 2N bits; if the determining circuit 11 determines that the bit width of the two received data is 2N and is equal to the bit width 2N of the data that can be processed by the multiplier, at this time, the determining circuit 11 may directly input the two received data with bit widths of 2N to the encoding circuit 13 for encoding.

Fig. 5 is a schematic diagram of a specific structure of a multiplier according to another embodiment, where the multiplier includes the data expansion circuit 12, and the data expansion circuit 12 includes: a data input port 121, a data expansion mode selection signal input port 122, a function selection mode signal output port 123, and an expanded data output port 124; the data input port 121 is configured to receive the data output by the determining circuit 11, the data expansion mode selection signal input port 122 is configured to receive a data expansion mode selection signal corresponding to performing expansion processing on the received data, the function selection mode signal output port 123 is configured to output a function selection mode signal determined according to a mode in which the data expansion circuit 12 performs expansion processing on the received data, and the expanded data output port 124 is configured to output the data after the expansion processing.

Specifically, the data expansion mode selection signal received by the data expansion mode selection signal input port 122 may be three, and three different data expansion mode selection signals may be 00, 01, and 10, where the signal 00 indicates that the data expansion circuit 12 may expand the received N-bit data into 2N-bit data, a high N-bit value in the 2N-bit data may be equal to a value of the received N-bit data, and low N-bit values may all be equal to an expanded value 0, at this time, the function selection mode signal output port 123 may output the function selection mode signal 00, and in an operation result with a 4N-bit wide obtained by the multiplier, the high 2N-bit value may be a final operation result; signal 01 indicates that the data expansion circuit 12 can expand the received N-bit data into 2N-bit data, the lower N-bit value of the 2N-bit data can be equal to the value of the received N-bit data, and the upper N-bit values can be equal to the expanded value 0, at this time, the function selection mode signal output port 123 can output a function selection mode signal 00, and the lower 2N-bit value of the operation result with 4N-bit width obtained by the multiplier can be the final operation result; the signal 10 indicates that the data expansion circuit 12 can expand the received N-bit data into 2N-bit data, the lower N-bit value of the 2N-bit data can be equal to the value of the received N-bit data, and the upper N-bit values can be equal to the sign bit value of the data received by the data expansion circuit 12, at this time, the function selection mode signal output port 123 can output the function selection mode signal 01, and the lower 2N-bit value of the operation result with 4N-bit width obtained by the multiplier can be the final operation result.
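The three expansion modes and the matching choice of result bits can be checked with a short Python sketch. It assumes the 2N-bit datapath behaves as a two's-complement (signed) multiplier, which the text does not state explicitly at this point, and it models only the arithmetic, not the function selection mode signal. The names `expand`, `to_signed` and `multiply_expanded` are hypothetical.

```python
def expand(value, n, mode):
    """Expand an n-bit pattern to 2n bits according to the expansion mode signal."""
    mask_n = (1 << n) - 1
    value &= mask_n
    if mode == 0b00:                       # data in the high N bits, low N bits are 0
        return value << n
    if mode == 0b01:                       # data in the low N bits, high N bits are 0
        return value
    if mode == 0b10:                       # data in the low N bits, high N bits copy the sign
        sign = value >> (n - 1)
        return value | ((mask_n << n) if sign else 0)
    raise ValueError("unknown expansion mode")


def to_signed(value, bits):
    """Interpret a bit pattern as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def multiply_expanded(x, y, n, mode):
    """Multiply two n-bit patterns on a 2n-bit signed datapath; return the 2n-bit result field."""
    w = 2 * n
    a = to_signed(expand(x, n, mode), w)
    b = to_signed(expand(y, n, mode), w)
    product = (a * b) & ((1 << 2 * w) - 1)           # 4N-bit operation result
    return product >> w if mode == 0b00 else product & ((1 << w) - 1)


high_mode = multiply_expanded(0xFD, 0x05, 8, 0b00)    # 0xFD read as -3
zero_mode = multiply_expanded(0xFD, 0x05, 8, 0b01)    # 0xFD read as 253
sign_mode = multiply_expanded(0xFD, 0x05, 8, 0b10)    # 0xFD read as -3
```

Under this assumption, modes 00 and 10 both yield the signed product (-15, pattern 0xFFF1), taken from the high and the low 2N bits respectively, while mode 01 yields the unsigned product 1265.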

It should be noted that, if the bit width of the two data received by the multiplier is 2N and is equal to the bit width 2N of the data that can be processed by the multiplier, the determining circuit 11 may directly input the two received data into the encoding circuit 13 for booth encoding; if the bit width of the two data received by the multiplier is N, which is smaller than the bit width 2N of the data that can be processed by the multiplier, and the data expansion mode selection signal received by the data expansion circuit 12 is 10, the judgment circuit 11 may input the two received data to the data expansion circuit 12 for expansion processing, and input the expanded data to the encoding circuit 13 for booth encoding processing.

In the multiplier provided by this embodiment, the data expansion circuit may perform expansion processing on received data, input the expanded data to the encoding circuit to perform encoding processing to obtain a partial product of a target code, and perform accumulation processing on the partial product of the target code by using the compression circuit to obtain a final operation result.

Fig. 5 is a schematic structural diagram of a multiplier according to another embodiment, where the multiplier includes the encoding circuit 13, and the encoding circuit 13 includes: a booth coding sub-circuit 131 and a partial product acquisition sub-circuit 132, an output of the booth coding sub-circuit 131 being connected to a first input of the partial product acquisition sub-circuit 132. The booth coding sub-circuit 131 is configured to perform booth coding on the received data to obtain a coded signal, and the partial product obtaining sub-circuit 132 is configured to obtain a partial product of the target code according to the coded signal.

Specifically, the data received by the Booth encoding sub-circuit 131 may be input by the judgment circuit 11 or by the data expansion circuit 12; the received data may be the multiplier in the multiplication operation, and the Booth encoding processing may be performed on the multiplier to obtain encoded signals. Before the Booth encoding processing, the Booth encoding sub-circuit 131 may automatically perform bit-padding on the received multiplier, where the bit-padding may append a bit value 0 after the lowest bit value of the data. Illustratively, if the multiplier circuit is currently processing an 8-bit by 8-bit fixed-point multiplication and the multiplier operand is y₇y₆y₅y₄y₃y₂y₁y₀, then, before Booth encoding, the Booth encoding sub-circuit 131 may automatically perform bit-padding on the multiplier operand to convert it into y₇y₆y₅y₄y₃y₂y₁y₀0. Optionally, the number of encoded signals may be equal to 1/2 of the bit width of the data currently processed by the multiplier, the number of encoded signals may be equal to the number of original partial products, and the partial product obtaining sub-circuit 132 may obtain the corresponding sign-bit-extended partial product according to each encoded signal.
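The bit-padding and grouping step can be sketched in Python as below. Table 1 is not reproduced in this part of the description, so the sketch assumes the standard radix-4 Booth mapping from three overlapping bits to a digit in {-2, -1, 0, 1, 2}; the function name `booth_radix4_digits` is hypothetical.

```python
# Assumed standard radix-4 Booth table: (y[2k+1], y[2k], y[2k-1]) -> digit
BOOTH_TABLE = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_radix4_digits(multiplier, width):
    """Return width/2 Booth digits of a width-bit pattern, least significant first."""
    if width % 2:
        raise ValueError("width must be even")
    padded = (multiplier & ((1 << width) - 1)) << 1   # append 0 after the lowest bit
    digits = []
    for k in range(width // 2):
        triple = (padded >> (2 * k)) & 0b111          # overlapping 3-bit group
        digits.append(BOOTH_TABLE[triple])
    return digits


digits = booth_radix4_digits(0b10110101, 8)           # pattern of -75 in 8 bits
value = sum(d * 4 ** k for k, d in enumerate(digits))
```

An 8-bit multiplier operand produces four digits, matching the statement that the number of encoded signals is half the processed bit width, and the weighted digits reproduce the two's-complement value of the operand.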

According to the multiplier provided by the embodiment, the coding circuit can be used for coding the received data to obtain the partial product of the target code, the compression circuit is used for accumulating the partial product of the target code to obtain the final operation result, the process can be used for expanding the received low-bit-width data, and the expanded data meets the bit-width requirement of the multiplier for processing the data, so that the final multiplication result is still the result of multiplying the original bit-width data, the multiplier can be used for processing the low-bit-width data, and the area of an AI chip occupied by the multiplier is effectively reduced.

Fig. 5 is a schematic diagram of a specific structure of a multiplier according to another embodiment, where the multiplier includes the booth coding sub-circuit 131, and the booth coding sub-circuit 131 includes: a data input port 1311 and an encoded signal output port 1312; the data input port 1311 is configured to receive data subjected to booth encoding processing, and the encoded signal output port 1312 is configured to output an encoded signal obtained by performing booth encoding processing on the received data.

Specifically, if the data input port 1311 receives a piece of data, the booth coding sub-circuit 131 may automatically perform bit padding on the piece of data to obtain a piece of data having a bit width that is greater than the bit width of the original data by one bit, and at the same time, the booth coding sub-circuit 131 may perform booth coding on the piece of data after bit padding to obtain a plurality of coded signals, and output the plurality of coded signals through the coded signal output port 1312. Optionally, the booth encoding sub-circuit 131 may receive a multiplier in the multiplication operation through the data input port 1311, and the booth encoding sub-circuit 131 may perform booth encoding processing on the multiplier.

In the multiplier provided by this embodiment, the booth coding sub-circuit may perform booth coding on received data to obtain coded signals, then the partial product obtaining sub-circuit may obtain a corresponding partial product of a target code according to each coded signal, and may perform accumulation processing on the partial product of the target code through the compression circuit to obtain a multiplication result, the multiplier may perform expansion processing on received low-bit-width data, the expanded data satisfies a bit-width requirement of the multiplier for being able to process the data, so that a final multiplication result is still a result of multiplication on the original bit-width data, thereby ensuring that the multiplier can process operation on the low-bit-width data, and effectively reducing an area of an AI chip occupied by the multiplier.

Fig. 5 is a schematic diagram of a specific structure of a multiplier according to another embodiment, where the multiplier includes the partial product obtaining sub-circuit 132, and the partial product obtaining sub-circuit 132 includes: an encoded signal input port 1321, a data input port 1322, and a partial product output port 1323; the code signal input port 1321 is configured to receive the code signal, the data input port 1322 is configured to receive the data, and the partial product output port 1323 is configured to output a partial product of a target code obtained from the code signal and the received data.

Specifically, as can be seen from table 1, the partial product obtaining sub-circuit 132 may receive, through the encoded signal input port 1321, five different types of encoded signals output by the Booth encoding sub-circuit 131, where the encoded signals are defined as -2X, 2X, -X, X, and 0, and may obtain the partial product of the corresponding target code according to the received encoded signal. Optionally, the data input port 1322 may receive data in the multiplication operation, which may be the multiplicand. Optionally, the partial product obtaining sub-circuit 132 may obtain a corresponding original partial product according to the encoded signal, and perform sign bit extension processing on the original partial product to obtain a sign-bit-extended partial product. The bit width of the sign-bit-extended partial product may be equal to 2 times the bit width 2N of the data currently processed by the multiplier, the bit width of the original partial product may be equal to 2N+1, and the upper 2N-1 bits of the sign-bit-extended partial product may all be equal to the sign bit value of the original partial product. Optionally, the partial product of the target code may be the sign-bit-extended partial product, and the original partial product may be the partial product without sign bit extension.

Optionally, in the distribution rule of all partial products of the target codes acquired by the partial product acquisition sub-circuit 132, starting from the partial product of the second target code, the partial product of each target code may be shifted by two bits to the left compared with the partial product of the previous target code, and starting from the partial product of the second target code, the two-bit higher value is not accumulated.
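The distribution rule above (each partial product sign-extended to twice the operand width, shifted two bits to the left relative to the previous one, with bits pushed past the top discarded) can be illustrated with the following Python sketch. It assumes the standard radix-4 Booth digit table, since Table 1 is not reproduced here, and all function names are hypothetical.

```python
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def to_signed(value, bits):
    """Interpret a bit pattern as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def booth_digits(y, width):
    """Radix-4 Booth digits of a width-bit pattern, least significant first."""
    padded = (y & ((1 << width) - 1)) << 1
    return [BOOTH_TABLE[(padded >> (2 * k)) & 0b111] for k in range(width // 2)]


def partial_products(x, y, width):
    """Sign-extended partial products, each 2*width bits wide."""
    out_mask = (1 << (2 * width)) - 1
    x_val = to_signed(x, width)
    rows = []
    for k, digit in enumerate(booth_digits(y, width)):
        # digit * x is one of -2X, -X, 0, X, 2X; masking performs the sign extension
        # and discards the bits shifted beyond the top column.
        rows.append(((digit * x_val) << (2 * k)) & out_mask)
    return rows


def booth_multiply(x, y, width):
    """Accumulate the partial products modulo 2**(2*width)."""
    return sum(partial_products(x, y, width)) & ((1 << (2 * width)) - 1)


rows = partial_products(0xFD, 0xB5, 8)      # -3 times -75
product = booth_multiply(0xFD, 0xB5, 8)
```

Here the four rows are the 16-bit patterns of -3, -12, 48 and 192, and their sum modulo 2^16 is 225, the signed product of the original 8-bit operands.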

In the multiplier provided by this embodiment, the partial product obtaining sub-circuit can obtain the corresponding partial product of the target code according to each code signal, and the compression circuit can accumulate the partial products of the target code to obtain the multiplication result, the multiplier can perform expansion processing on the received low-bit-width data, and the expanded data meets the bit-width requirement for processing the data of the multiplier, so that the final multiplication result is still the result of performing multiplication on the original bit-width data, thereby ensuring that the multiplier can process the operation on the low-bit-width data, and effectively reducing the area of the AI chip occupied by the multiplier.

Fig. 5 is a schematic diagram of a specific structure of a multiplier according to another embodiment, where the multiplier includes the compression circuit 14, and the compression circuit 14 includes: a wallace tree group sub-circuit 141 and an accumulation sub-circuit 142; wherein, the output terminal of the wallace tree group sub-circuit 141 is connected with the input terminal of the accumulation sub-circuit 142; the wallace tree group sub-circuit 141 is configured to perform an accumulation process on the partial product of the target code, and the accumulation sub-circuit 142 is configured to perform an accumulation process on the received input data.

Specifically, the wallace tree group sub-circuit 141 may accumulate the values in all partial products of the target code obtained by the encoding circuit 13, and accumulate two output results obtained by the wallace tree group sub-circuit 141 through the accumulation sub-circuit 142 to obtain the final result of the multiplication.

According to the multiplier provided by the embodiment, the Wallace tree group sub-circuit can accumulate partial products of target codes, and the accumulation sub-circuit accumulates the accumulated results again to obtain the final result of multiplication, the multiplier can expand the received low-bit-width data, and the expanded data meets the bit-width requirement of the multiplier for processing the data, so that the final result of multiplication is still the result of multiplication of the original bit-width data, the multiplier can process the low-bit-width data, and the area of an AI chip occupied by the multiplier is effectively reduced.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 5, the multiplier includes the Wallace tree group sub-circuit 141, and the Wallace tree group sub-circuit 141 includes: Wallace tree units 1411 to 141n, where each of the Wallace tree units 1411 to 141n is configured to accumulate the values of one column of the partial products of the target code.

Specifically, the circuit structures of the Wallace tree units 1411 to 141n may be implemented by a combination of full adders and half adders, or by a combination of 4-2 compressors; each Wallace tree unit may be understood as a circuit capable of processing a multi-bit input signal and adding it to obtain a two-bit output signal. Optionally, the number n of Wallace tree units included in the Wallace tree group sub-circuit 141 may be equal to 2 times the bit width of the data currently processed by the multiplier, and the Wallace tree units may be connected in series. Optionally, each Wallace tree unit in the Wallace tree group sub-circuit 141 may add the values of its corresponding column in all partial products of the target code, and each Wallace tree unit may output two signals, namely a carry signal Carry_i and a sum signal Sum_i, where i may represent the number corresponding to each Wallace tree unit, and the number of the first Wallace tree unit is 0. Optionally, the number of input signals received by each Wallace tree unit may be equal to the number of encoded signals or the number of partial products of the target code.

It should be noted that the signals of each Wallace tree unit in the Wallace tree group sub-circuit 141 may include a carry input signal Cin_i, a partial product input signal, and a carry output signal Cout_i. Optionally, the partial product input signal received by each Wallace tree unit may be the values of the corresponding column in all partial products of the target code, and the number of bits of the carry output signal Cout_i output by each Wallace tree unit may be equal to N_Cout = floor((N_I + N_Cin) / 2) - 1, where N_I may represent the number of data input bits of the Wallace tree unit, N_Cin may represent the number of carry input bits of the Wallace tree unit, N_Cout may represent the minimum number of carry output bits of the Wallace tree unit, and floor(·) may represent the round-down function. Optionally, the carry input signal received by each Wallace tree unit in the Wallace tree group sub-circuit 141 may be the carry output signal output by the preceding Wallace tree unit, and the carry input signal received by the first Wallace tree unit is 0.

In one embodiment, continuing with the detailed structural diagram of the multiplier shown in fig. 5, the multiplier includes the accumulation sub-circuit 142, and the accumulation sub-circuit 142 includes an adder 1421, where the adder 1421 is configured to add two received data of the same bit width.

Specifically, the adder 1421 can be an adder with different bit widths. Optionally, the adder 1421 may receive two signals output by the wallace tree group sub-circuit 141, perform addition operation on the two output signals, and output a multiplication result. Optionally, the adder 1421 may be a carry look ahead adder.

Optionally, the adder 1421 includes: a carry signal input port 1421a, a sum signal input port 1421b, and an operation result output port 1421 c; the carry signal input port 1421a is configured to receive a carry signal, the sum bit signal input port 1421b is configured to receive a sum bit signal, and the operation result output port 1421c is configured to output a result of performing accumulation processing on the carry signal and the sum bit signal.

Optionally, the adder 1421 may receive the Carry signal Carry output by the wallace tree group sub-circuit 141 through the Carry signal input port 1421a, receive the Sum bit signal Sum output by the wallace tree group sub-circuit 141 through the Sum bit signal input port 1421b, add the result of the Sum bit signal Sum and the Carry signal Carry, and output the result through the operation result output port 1421 c.

It should be noted that, during multiplication, the multiplier may adopt adders 1421 with different bit widths to add the carry output signal Carry and the sum bit output signal Sum output by the Wallace tree group sub-circuit 141, where the bit width of the data that the adder 1421 can process may be equal to 2 times the bit width 2N of the data currently processed by the multiplier. Optionally, each Wallace tree unit in the Wallace tree group sub-circuit 141 may output a carry output signal Carry_i and a sum bit output signal Sum_i (i = 0, …, 4N-1, where i is the number of each Wallace tree unit, starting from 0). Optionally, the carry output signal received by the adder 1421 is Carry = {[Carry_0 : Carry_4N-2], 0}; that is, the bit width of the carry output signal Carry received by the adder 1421 is 4N, the first 4N-1 bit values in Carry correspond to the carry output signals of the first 4N-1 Wallace tree units in the Wallace tree group sub-circuit 141, and the last bit value in Carry may be replaced by 0. Optionally, the bit width of the sum bit output signal Sum received by the adder 1421 is 4N, and the values in Sum may be equal to the sum bit output signals of the Wallace tree units in the Wallace tree group sub-circuit 141.

Illustratively, if the multiplier is currently processing an 8 × 8 multiplication, the adder 1421 may be a 16-bit carry-lookahead adder. As shown in fig. 6, the Wallace tree group sub-circuit 141 may output the sum bit output signals Sum and the carry output signals Carry of 16 Wallace tree units. The sum bit output signal received by the 16-bit carry-lookahead adder may be the complete sum bit signal Sum output by the Wallace tree group sub-circuit 141, whereas the received carry output signal may be the carry output signals of all Wallace tree units except the last one, combined with 0. In fig. 6, Wallace_i denotes a Wallace tree unit, where i is the number of the Wallace tree unit starting from 0; a solid line between two Wallace tree units indicates that the Wallace tree unit with the higher number has a carry output signal, a dotted line indicates that the Wallace tree unit with the higher number has no carry output signal, and the trapezoidal circuit symbol denotes a two-way selector.
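The way the adder combines the two signals can be illustrated with a minimal Python sketch. It assumes that the appended 0 is the least significant bit of Carry (as in a Verilog-style concatenation), so that Carry_i is added one column above Sum_i; the function name `final_add` and the list-based bit representation are hypothetical and chosen only for illustration.

```python
def final_add(sum_bits, carry_bits):
    """Combine per-unit Sum_i / Carry_i outputs into one result word.

    Both lists are indexed by Wallace tree unit number i (index 0 is the
    lowest column). The alignment of Carry is an assumption of this sketch.
    """
    width = len(sum_bits)
    if len(carry_bits) != width:
        raise ValueError("Sum and Carry must come from the same number of units")
    # Sum is used as-is: Sum_i has weight 2**i.
    sum_word = sum(bit << i for i, bit in enumerate(sum_bits))
    # Carry = {[Carry_0 : Carry_(width-2)], 0}: bit 0 is filled with 0,
    # Carry_i lands on column i+1, and the carry of the last unit is dropped.
    carry_word = sum(bit << (i + 1) for i, bit in enumerate(carry_bits[:width - 1]))
    # The adder is `width` bits wide, so the result wraps at that width.
    return (sum_word + carry_word) & ((1 << width) - 1)
```

For an 8 × 8 multiplication, both lists would hold 16 entries, matching the 16-bit adder of the example.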

According to the multiplier provided by the embodiment, the two paths of signals output by the Wallace tree group sub-circuit can be subjected to accumulation operation through the accumulation sub-circuit, and a multiplication operation result is output.

Fig. 7 is a flowchart illustrating a data processing method according to an embodiment, where the method may be processed by the multipliers shown in fig. 1 and fig. 3, and this embodiment relates to a process of performing a multiplication operation on data with different bit widths. As shown in fig. 7, the method includes:

S101, receiving data to be processed.

Specifically, the multiplier may receive data to be processed, which may be a multiplier and a multiplicand in a multiplication operation, through the correction coding circuit. The multiplier can also receive different function selection mode signals through all the selectors in the correction coding circuit and the correction compression circuit during each multiplication operation, and the function selection mode signals received by all the selectors in the correction coding circuit and all the selectors in the correction compression circuit during the same operation can be the same. Optionally, the data may be fixed point numbers. If the multiplier receives different function selection mode signals, the multiplier can process data operations with different bit widths, and meanwhile, the corresponding relation between the different selection mode signals and the data with different bit widths processed by the multiplier can be flexibly set, and the embodiment is not limited at all.

It should be noted that, if the bit width of the multiplier and multiplicand to be processed received by the correction coding circuit is not equal to the bit width of the processable data corresponding to the function selection mode signal received by the multiplier, the multiplier divides the received data to be processed into a plurality of groups whose bit width equals the bit width of the data currently processable by the multiplier, and processes the groups in parallel; in this case, the bit width of the data to be processed received by the correction coding circuit may be greater than the bit width of the data currently processable by the multiplier. Optionally, parallel processing means that all divided groups of data to be processed are processed at the same time. If the bit width of the data to be processed received by the correction coding circuit is equal to the bit width of the processable data corresponding to the function selection mode signal received by the multiplier, the multiplier directly processes the received data to be processed. Optionally, the data to be processed may include high-order data to be processed and low-order data to be processed. If the bit width of the data to be processed is 2N, the upper N bits are the high-order data to be processed, and the lower N bits are the low-order data to be processed.
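The division of the received data into groups of the currently processable bit width can be sketched as follows. This is only an illustrative software model; the function name `split_into_groups` and its parameters are hypothetical, and the hardware processes all groups at the same time rather than producing a list.

```python
def split_into_groups(value, total_width, group_width):
    """Split a total_width-bit word into group_width-bit groups, lowest group first."""
    if total_width % group_width != 0:
        raise ValueError("total_width must be a multiple of group_width")
    mask = (1 << group_width) - 1
    # Each shift selects one group; the first entry is the low-order data.
    return [(value >> shift) & mask for shift in range(0, total_width, group_width)]
```

With a 16-bit input and an 8-bit processable width, the first entry is the low-order data to be processed and the second entry is the high-order data to be processed; when both widths are equal, the data is returned as a single group.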

Optionally, the bit width of the multiplier and the multiplicand to be processed received by the correction coding circuit may be 8 bits, 16 bits, 32 bits, or 64 bits, which is not limited in this embodiment. Wherein, the bit width of the multiplier to be processed can be equal to the bit width of the multiplicand to be processed.

S102, gating a signal to be coded, and performing Booth coding processing on the data to be processed according to the signal to be coded to obtain a coded signal.

Specifically, the multiplier may determine the signal to be encoded that is gated by the selector according to the function selection mode signal received by the correction coding circuit, and perform Booth encoding on the data to be processed according to the determined signal to be encoded to obtain the encoded signal. Optionally, the data to be processed may be the multiplier in a multiplication operation, and may include high-order data to be processed and low-order data to be processed; if the bit width of the data to be processed is 2N, the upper N bits may be the high-order data to be processed, and the lower N bits may be the low-order data to be processed. Optionally, the signal to be encoded may be 0, or may be the highest bit value of the low-order data to be processed.

It should be noted that, if the bit width of the data received by the multiplier is 2N, and the bit width of the data currently processed by the multiplier is also 2N, the correction coding circuit may gate the highest bit value in the lower bit data to be processed through the selector, as the complement bit value in the higher bit data, and at this time, the multiplier may perform multiplication operation on the received 2N bit data as a whole; if the bit width of the data currently processed by the multiplier is N, the multiplier needs to divide the received 2N-bit data into high N-bit data and low N-bit data for parallel processing, and at this time, the correction coding circuit may gate 0 through the selector as a complementary bit value in the high-bit data.
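The selector behaviour described above can be modelled in a few lines of Python. This is a sketch under stated assumptions: the boolean `full_width_mode` stands in for the function selection mode signal (whose actual encoding is not fixed by the embodiment), and the function name is hypothetical.

```python
def gate_signal_to_encode(data, n, full_width_mode):
    """Return the complement bit placed below the high N bits before Booth encoding.

    data: the 2N-bit multiplier, n: N.
    full_width_mode: True when the multiplier currently processes 2N-bit data,
    False when it processes two N-bit data in parallel (assumed mode flag).
    """
    low_msb = (data >> (n - 1)) & 1  # highest bit value of the low-order data
    # 2N-bit mode: the halves are linked through the low half's highest bit.
    # N-bit mode: 0 is gated, so the high half is encoded independently.
    return low_msb if full_width_mode else 0
```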

S103, obtaining a partial product of the target code according to the code signal and the data to be processed.

Specifically, the partial product obtaining unit in the multiplier may obtain a partial product of a target code corresponding to the function selection mode signal received by the current multiplier according to the multiplicand to be processed and the code signal. Alternatively, the partial products of the target code may be partial products obtained by expanding corresponding sign bits by the multiplier, and the number of the partial products after expanding the sign bits may be equal to the number of the code signals.

For example, if the bit width of the data received by the multiplier is 2N and the multiplier processes N-bit wide data currently, the partial product of the target code may be a partial product obtained by expanding a corresponding sign bit of the upper N-bit data and a partial product obtained by expanding a corresponding sign bit of the lower N-bit data.

S104, accumulating the partial product of the target code to obtain an operation result.

Specifically, the multiplier may perform accumulation processing on the partial product of the target code by the correction compression circuit, and obtain an operation result.

In the data processing method provided by this embodiment, data to be processed is received, a signal to be encoded is gated, booth encoding processing is performed on the data to be processed according to the signal to be encoded to obtain an encoded signal, a partial product of a target code is obtained according to the encoded signal and the data to be processed, and the partial product of the target code is accumulated to obtain an operation result.

As shown in fig. 8, in a data processing method according to another embodiment, the step in S102 of gating the signal to be encoded and performing Booth encoding processing on the data to be processed according to the signal to be encoded to obtain the encoded signal includes:

S1021, obtaining high-order data to be encoded and low-order data to be encoded according to the signal to be encoded and the data to be processed.

Specifically, the correction coding circuit may determine a plurality of groups of high-order data to be encoded corresponding to the high-order data to be processed according to the signal to be encoded. Optionally, before performing Booth encoding on the data to be processed, the correction coding circuit needs to perform bit-complementing on the received multiplier to be processed, that is, append a bit value 0 below the lowest bit of the multiplier. Optionally, a plurality of groups of low-order data to be encoded may be obtained from the low-order data to be processed and the complement value 0, and a plurality of groups of high-order data to be encoded may be obtained from the high-order data to be processed and the gated signal to be encoded. Optionally, the number of groups of low-order data to be encoded may be equal to the number of groups of high-order data to be encoded, and may also be equal to 1/4 of the bit width of the data received by the multiplier.

It should be noted that the groups of low-order data to be encoded may be divided as follows: every 3 adjacent bit values in the low-order data after bit-complementing form one group of low-order data to be encoded, and the highest bit value of each group serves as the lowest bit value of the next group. Optionally, the groups of high-order data to be encoded may be divided as follows: the gated signal to be encoded is used as the complement bit value of the high-order data, every 3 adjacent bit values in the high-order data after bit-complementing form one group of high-order data to be encoded, and the highest bit value of each group serves as the lowest bit value of the next group.
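The grouping rule can be made concrete with a short Python sketch. The function name `booth_groups` and the integer representation of each 3-bit group are hypothetical; the complement bit is passed in as a parameter, so the same function serves the low-order data (complement value 0) and the high-order data (gated signal to be encoded).

```python
def booth_groups(half, n, complement_bit):
    """Split N-bit data into overlapping 3-bit groups, lowest group first.

    half: N-bit low-order or high-order data (N assumed even).
    complement_bit: the bit value placed below bit 0 (0 for the low-order data,
    the gated signal to be encoded for the high-order data).
    """
    padded = (half << 1) | complement_bit  # N+1 bits after bit-complementing
    # Group k covers padded bits 2k+2 .. 2k, so the highest bit of each group
    # is also the lowest bit of the next group.
    return [(padded >> (2 * k)) & 0b111 for k in range(n // 2)]
```

For N = 4, each half yields N/2 = 2 groups, which is 1/4 of the 2N = 8 bit width of the received data, consistent with the count stated above.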

S1022, performing Booth encoding processing on the high-order data and the low-order data to be encoded to obtain a high-order encoded signal and a low-order encoded signal.

Specifically, the encoding rule of the Booth encoding process may be found in table 1. As can be seen from table 1, by Booth encoding the divided low-order data to be encoded and high-order data to be encoded with the low-order Booth encoding unit and the high-order Booth encoding unit respectively, five different types of encoded signals can be obtained, namely -2X, 2X, -X, X, and 0.
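Table 1 is not reproduced in this passage, so the following Python sketch assumes the standard radix-4 Booth recoding, in which a group (b2 b1 b0) maps to the coefficient -2·b2 + b1 + b0; the coefficients -2, -1, 0, 1, 2 then correspond to the encoded signals -2X, -X, 0, X, and 2X. All function names are hypothetical.

```python
def booth_digit(group):
    """Map a 3-bit group (b2 b1 b0) to a coefficient in {-2, -1, 0, 1, 2}."""
    b2, b1, b0 = (group >> 2) & 1, (group >> 1) & 1, group & 1
    return -2 * b2 + b1 + b0  # assumed standard radix-4 Booth rule


def booth_encode(value, n, complement_bit=0):
    """Return the Booth coefficients of an N-bit value, lowest group first."""
    padded = (value << 1) | complement_bit
    return [booth_digit((padded >> (2 * k)) & 0b111) for k in range(n // 2)]


def to_signed(value, n):
    """Interpret an N-bit pattern as a two's-complement number."""
    return value - (1 << n) if value >> (n - 1) else value
```

With complement bit 0, weighting the k-th coefficient by 4^k recovers the signed value of the encoded data. Encoding the high N bits with the highest bit of the low N bits as complement bit makes the two coefficient lists together represent the whole 2N-bit value, which is why that bit is gated in 2N-bit mode.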

The data processing method provided by this embodiment receives data to be processed, obtains high-order data and low-order data to be encoded according to the signal to be encoded and the data to be processed, performs booth encoding processing on the high-order data and the low-order data to be encoded, obtains a high-order encoded signal and a low-order encoded signal, obtains a partial product of a target code according to the low-order encoded signal, the high-order encoded signal and the data to be processed, and performs accumulation processing on the partial product of the target code to obtain an operation result.

With reference to fig. 8, the step of obtaining the partial product of the target code according to the code signal and the data to be processed in S103 includes:

S1031, obtaining a low bit partial product of the target code according to the low bit coded signal and the data to be processed.

It should be noted that, if the bit width of the data to be processed received by the multiplier is 2N and the multiplier currently processes N-bit data, the multiplier needs to divide the 2N-bit data to be processed into high N-bit data and low N-bit data for parallel operation; in this case, the multiplier may obtain, through the correction coding circuit, the low bit partial product of the target code according to the low bit coded signal and the low N-bit data to be processed. If the multiplier currently processes 2N-bit data, the multiplier obtains the low bit partial product of the target code according to the low bit coded signal and the 2N-bit data to be processed. The bit width of the low bit partial product of the target code may be 4N, and the number of low bit partial products of the target code may be equal to N/2.

S1032, obtaining a high-order partial product of the target code according to the high-order coded signal and the data to be processed.

It should be noted that, if the bit width of the data to be processed received by the multiplier is 2N and the multiplier currently processes N-bit data, the multiplier needs to divide the 2N-bit data to be processed into high N-bit data and low N-bit data for parallel operation; in this case, the multiplier may obtain, through the correction coding circuit, the high bit partial product of the target code according to the high bit coded signal and the high N-bit data to be processed. If the multiplier currently processes 2N-bit data, the multiplier obtains the high bit partial product of the target code according to the high bit coded signal and the 2N-bit data to be processed. The bit width of the high bit partial product of the target code may be 4N, and the number of high bit partial products of the target code may be equal to N/2.

According to the data processing method provided by this embodiment, a low-order partial product of a target code is obtained according to the low-order coded signal and the data to be processed, a high-order partial product of the target code is obtained according to the high-order coded signal and the data to be processed, and the low-order partial product and the high-order partial product of the target code are accumulated to obtain an operation result.

In one embodiment, as shown in fig. 9, the step of obtaining the lower partial product of the target code according to the lower coded signal and the data to be processed in S1031 includes:

S1031a, obtaining a lower bit partial product after sign bit expansion according to the lower bit coded signal and the data to be processed.

Specifically, the multiplier obtains, according to the received function selection mode signal, the low-order coded signal, and the data to be processed, the original low-order partial product corresponding to the bit width of the data currently processed by the multiplier, and performs sign bit extension on the original low-order partial product to obtain the sign-bit-extended low-order partial product. Optionally, the original low-order partial product is the low-order partial product before sign bit extension, that is, the partial product obtained from the corresponding low-order data without sign bit extension. Optionally, the bit width of the sign-bit-extended low-order partial product may be equal to 2 times the bit width M of the data received by the multiplier, and the bit width of the original low-order partial product may be equal to M+1. Optionally, the sign-bit-extended low-order partial product may include the M+1 bit values of the original low-order partial product and M-1 copies of the sign bit value of the original low-order partial product.

It should be noted that, if the low-order partial product obtaining unit receives an 8-bit multiplicand x₇x₆x₅x₄x₃x₂x₁x₀ (i.e., X), the low-order partial product obtaining unit may directly obtain the corresponding original low-order partial product from the multiplicand X and the five types of low-order encoded signals -2X, 2X, -X, X, and 0. When the low-order encoded signal is -2X, the original low-order partial product may be obtained by shifting X left by one bit, inverting the result bitwise, and adding 1; when the low-order encoded signal is 2X, the original low-order partial product may be obtained by shifting X left by one bit; when the low-order encoded signal is -X, the original low-order partial product may be obtained by inverting X bitwise and adding 1; when the low-order encoded signal is X, the original low-order partial product may be X combined with one additional bit above the most significant bit of X, where the value of this additional bit may be equal to the sign bit value of X; when the low-order encoded signal is 0, the original low-order partial product may be 0, that is, every bit value in the 9-bit partial product is equal to 0.
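The five cases, together with the sign bit extension from M+1 bits to 2M bits, can be illustrated by the following Python sketch. The function names and the string labels for the encoded signals are hypothetical, and the sketch assumes that the negated cases operate on the (M+1)-bit sign-extended form of X so that every result fits the (M+1)-bit original partial product.

```python
def original_partial_product(x, m, signal):
    """(M+1)-bit original partial product for one encoded signal.

    x: M-bit two's-complement multiplicand, given as an unsigned bit pattern.
    signal: one of "-2X", "2X", "-X", "X", "0" (labels assumed for this sketch).
    """
    mask = (1 << (m + 1)) - 1
    sign = (x >> (m - 1)) & 1
    x_ext = x | (sign << m)  # X with its sign bit repeated one position higher
    if signal == "X":
        return x_ext
    if signal == "2X":
        return (x << 1) & mask            # shift left by one bit
    if signal == "-X":
        return (~x_ext + 1) & mask        # invert bitwise, add 1
    if signal == "-2X":
        return (~(x << 1) + 1) & mask     # shift left, invert bitwise, add 1
    if signal == "0":
        return 0
    raise ValueError(f"unknown encoded signal: {signal}")


def sign_extend(partial_product, m):
    """Extend an (M+1)-bit partial product to 2M bits with M-1 sign bit copies."""
    sign = (partial_product >> m) & 1
    extension = ((1 << (m - 1)) - 1) << (m + 1) if sign else 0
    return partial_product | extension
```

For M = 8, the original partial product is 9 bits wide and the extended one is 16 bits wide, matching the widths M+1 and 2M given above.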

S1031b, gating the value in the lower partial product of the target code by the lower selector bank unit.

Specifically, each of the low selectors in the low selector bank unit may gate the corresponding bit value in the low partial product of the target code according to the received different function selection signals.

S1031c, obtaining the lower bit partial product of the target code according to the value in the lower bit partial product of the target code and the value in the lower bit partial product after sign bit expansion.

Specifically, the low-order partial product obtaining unit may obtain, according to the value in the low-order partial product of the target code obtained after the gating by the low-order selector bank unit and the partial bit value in the low-order partial product after the sign bit expansion obtained by the multiplier currently processing the corresponding bit width data, the low-order partial product of the target code corresponding to the bit width data currently processed by the multiplier.

According to the data processing method provided by this embodiment, a low-order partial product after sign bit extension is obtained according to the low-order coded signal and the data to be processed, a value in the low-order partial product of a target code is gated through a low-order selector bank unit, a low-order partial product of the target code is obtained according to the value in the low-order partial product of the target code and the value in the low-order partial product after sign bit extension, and the low-order partial product of the target code and the high-order partial product of the target code are accumulated to obtain an operation result.

In one embodiment, with reference to fig. 9, the step of obtaining the upper partial product of the target code according to the upper coded signal and the data to be processed in S1032 includes:

S1032a, obtaining the high-order bit partial product after sign bit expansion according to the high-order bit coded signal and the data to be processed.

Specifically, the multiplier obtains, according to the received function selection mode signal, the high-order coded signal, and the data to be processed, the original high-order partial product corresponding to the bit width of the data currently processed by the multiplier, and performs sign bit extension on the original high-order partial product to obtain the sign-bit-extended high-order partial product. Optionally, the original high-order partial product is the high-order partial product before sign bit extension, that is, the partial product obtained from the corresponding high-order data without sign bit extension. Optionally, the bit width of the sign-bit-extended high-order partial product may be equal to 2 times the bit width M of the data received by the multiplier, and the bit width of the original high-order partial product may be equal to M+1. Optionally, the sign-bit-extended high-order partial product may include the M+1 bit values of the original high-order partial product and M-1 copies of the sign bit value of the original high-order partial product.

S1032b, gating the value in the upper partial product of the target code by the upper selector bank unit.

Specifically, each of the high selectors in the high selector bank unit may gate the corresponding bit value in the high partial product of the target code according to the received different function selection signals.

S1032c, obtaining the upper partial product of the target code according to the value of the upper partial product of the target code and the value of the upper partial product after sign bit extension.

Specifically, the high-order partial product obtaining unit may obtain, according to the value in the high-order partial product of the target code obtained after the gating by the high-order selector bank unit and the partial bit value in the high-order partial product after the sign bit extension obtained by the multiplier currently processing the corresponding bit width data, the high-order partial product of the target code corresponding to the bit width data currently processed by the multiplier.

According to the data processing method provided by this embodiment, the high-order partial product of the target code after sign bit extension is obtained according to the high-order coded signal and the data to be processed, the value in the high-order partial product of the target code is gated through the high-order selector bank unit, the high-order partial product of the target code is obtained according to the value in the high-order partial product of the target code and the value in the high-order partial product of the target code after sign bit extension, and the high-order partial product of the target code and the low-order partial product of the target code are accumulated to obtain the operation result.

As shown in fig. 10, in a data processing method according to another embodiment, the step in S104 of performing accumulation processing on the partial product of the target code to obtain an operation result includes:

S1041, accumulating the low-order partial product and the high-order partial product of the target code by a modified Wallace tree group circuit to obtain a first operation result.

Specifically, the multiplier may, through the modified Wallace tree group circuit, accumulate the values in each column of all low-order partial products and all high-order partial products of the target code, arranged according to their distribution rule, to obtain a first operation result. Optionally, the first operation result may include a sum bit output signal Sum and a carry output signal Carry, where the bit widths of the sum bit output signal Sum and the carry output signal Carry may be the same.
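The idea of compressing many partial products into one Sum word and one Carry word can be sketched with word-level 3:2 compression (carry-save addition). This is a simplified software model and not the column-by-column Wallace tree units of the embodiment; the function name is hypothetical, and the carry word returned here is already moved one column up.

```python
def carry_save_reduce(rows, width):
    """Reduce a list of width-bit rows to a (Sum, Carry) pair of equal width."""
    mask = (1 << width) - 1
    rows = [row & mask for row in rows]
    while len(rows) > 2:
        a, b, c = rows[:3]
        sum_word = a ^ b ^ c  # per-column sum bits of a full adder
        # Per-column carry bits, moved one column up and truncated to width.
        carry_word = (((a & b) | (b & c) | (a & c)) << 1) & mask
        rows = rows[3:] + [sum_word, carry_word]
    while len(rows) < 2:
        rows.append(0)  # pad so that a pair is always returned
    return rows[0], rows[1]
```

Adding the two returned words with an ordinary adder of the same width gives the same result, modulo 2^width, as adding all rows directly, which is the role of the accumulation circuit in S1042.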

S1042, accumulating the first operation result through an accumulation circuit to obtain an operation result.

Specifically, the multiplier may add the carry output signal Carry and the sum bit output signal Sum output by the modified Wallace tree group circuit through an adder in the accumulation circuit, and output the addition result. Optionally, each Wallace tree unit in the modified Wallace tree group circuit may output a carry output signal Carry_i and a sum bit output signal Sum_i (i = 0, …, N-1, where i is the number of each Wallace tree unit, starting from 0). Optionally, the carry output signal received by the adder is Carry = {[Carry_0 : Carry_(N-2)], 0}; that is, the bit width of the carry output signal Carry received by the adder is N, the first N-1 bit values of Carry correspond to the carry output signals of the first N-1 Wallace tree units in the modified Wallace tree group circuit, and the last bit value of Carry may be replaced by 0. Optionally, the sum bit output signal Sum received by the adder has a bit width N, and its value may be equal to the sum bit output signals of the Wallace tree units in the modified Wallace tree group circuit.

For example, if the multiplier is currently processing an 8 × 8 multiplication, the adder may be a 16-bit carry-lookahead adder. As shown in fig. 6, the modified Wallace tree group circuit may output the sum bit output signals Sum and the carry output signals Carry of 16 Wallace tree units. The sum bit output signal received by the 16-bit carry-lookahead adder may be the complete sum bit signal Sum output by the modified Wallace tree group circuit, whereas the received carry output signal may be the carry output signals of all Wallace tree units in the modified Wallace tree group circuit except the last one, combined with 0.

In the data processing method provided by this embodiment, the modified wallace tree group circuit performs accumulation processing on the low-order part and the high-order part of the target code to obtain a first operation result, and the accumulation circuit performs accumulation processing on the first operation result to obtain an operation result.

As shown in fig. 11, in a multiplication method according to another embodiment, the step in S1041 of accumulating the low-order partial product and the high-order partial product of the target code through the modified Wallace tree group circuit to obtain a first operation result includes:

S1041a, accumulating the column values in the partial product of the target code through the low-order Wallace tree group sub-circuit to obtain an accumulation operation result.

Specifically, according to the distribution rule of all the lower bit partial products and all the upper bit partial products of the target code, the total column number of the corresponding numerical values of all the partial products of the target code is 2N (N is the bit width of the data currently processed by the multiplier), and the number corresponding to each column of numerical values from the lowest bit numerical value may be 0, …, 2N-1, where the numbers 0 to N-1 may be referred to as the lower N column of numerical values. Optionally, the accumulation operation result may be a carry output signal Cout output by the last wallace tree unit in the lower wallace tree group circuit.

It should be noted that the N wallace tree units included in the lower wallace tree group sub-circuit may perform the accumulation operation on the low N column numbers according to the numbering order to obtain the accumulation operation result. Optionally, the accumulation operation result may include Carry output signals Carry, Sum of each wallace tree unit, and output signal Cout of the last wallace tree unit in the lower wallace tree group sub-circuit.

S1041b, gating the accumulation operation result through a selector to obtain a carry gating signal.

Specifically, the selector in the modified compression circuit may gate the output signal Cout or 0 of the last wallace tree unit in the low-order wallace tree group circuit according to the received function selection mode signal to obtain a carry gate signal.

S1041c, accumulating, by a high-order Wallace tree group circuit, according to the carry gating signal and the column values in the partial product of the target code, to obtain an operation result.

Specifically, according to the distribution rule of all partial products of the target code, the total number of columns of the corresponding numerical values of all partial products of the target code is 2N (N is the bit width of the data currently processed by the multiplier), and the number corresponding to each column of numerical values from the lowest bit numerical value may be 0, …, 2N-1, where the numbers N to 2N-1 may be referred to as high N columns of numerical values.

It should be noted that N wallace tree units included in the high-order wallace tree group circuit may perform an accumulation operation on the high N column numbers according to the numbering order, and output a second operation result. The carry input signal received by the first wallace tree unit in the high-order wallace tree group circuit may be a carry strobe signal output by the selector.
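The effect of gating the carry between the low-order and high-order column groups can be illustrated with a simplified Python model. It is a sketch under strong assumptions: only two rows are accumulated, so that the carry out of the low N columns is a single bit, and the boolean `full_width_mode` stands in for the function selection mode signal; the function name is hypothetical.

```python
def gated_two_half_add(a, b, n, full_width_mode):
    """Add two 2N-column rows with a selectable carry between the column halves."""
    mask = (1 << n) - 1
    low_total = (a & mask) + (b & mask)          # accumulate the low N columns
    cout = low_total >> n                        # Cout of the last low-order unit
    carry_gate = cout if full_width_mode else 0  # selector gates Cout or 0
    high_total = ((a >> n) & mask) + ((b >> n) & mask) + carry_gate
    return ((high_total & mask) << n) | (low_total & mask)
```

When Cout is gated, the 2N columns behave as one wide accumulation; when 0 is gated, the two halves produce independent N-column results that do not disturb each other.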

In the data processing method provided by this embodiment, the low-order wallace tree group sub-circuit performs accumulation processing on the column number values in the partial products of the target codes to obtain accumulated operation results, the selector gates the accumulated operation results to obtain carry gating signals, and the high-order wallace tree group circuit performs accumulation processing on the column number values in the partial products of the target codes according to the carry gating signals and the column number values in the partial products of the target codes to obtain operation results.

Fig. 12 is a flowchart illustrating a data processing method according to another embodiment, which can be processed by the multipliers shown in fig. 2 and fig. 5, where the embodiment relates to a process of performing a multiplication operation on data with different bit widths. As shown in fig. 12, the method includes:

S201, receiving data to be processed.

Specifically, the number of the data to be processed received by the multiplier may be two, and the data is a multiplier and a multiplicand in a multiplication operation.

S202, judging whether the bit width of the data to be processed is equal to the bit width of the data which can be processed by the multiplier.

Specifically, the multiplier determines whether the bit width of the received two pieces of data to be processed is equal to the bit width of the data that can be processed by the multiplier. In this embodiment, the bit width of the data that can be processed by the multiplier is fixed, i.e., 2N bits. Optionally, the bit width of the data to be processed received by the determining circuit may be N, or may also be 2N.

S203, if the bit widths are not equal, performing data expansion processing on the data to be processed to obtain expanded data.

Specifically, if the bit width of the data to be processed received by the determining circuit is not equal to the bit width 2N of the data that can be processed by the multiplier, the multiplier may perform data expansion processing on the data to be processed through the data expansion circuit, and expand the data to be processed into data with a bit width of 2N.

Optionally, the performing data expansion processing on the data to be processed to obtain expanded data includes: and performing data expansion processing on the data to be processed through 0 or the sign bit value of the data to be processed to obtain expanded data. Optionally, the bit width of the expanded data is equal to the bit width of the data currently processed by the multiplier.

It should be noted that the data expansion circuit may receive three data expansion mode selection signals, denoted 00, 01, and 10. Signal 00 indicates that the data expansion circuit may expand the received N-bit data to be processed into 2N-bit data in which the upper N bit values are equal to the received N-bit data and the lower N bit values are all equal to the expansion value 0; in this case, the data expansion circuit may output the function selection mode signal 00, and the upper 2N bits of the 4N-bit operation result obtained by the multiplier may be the final operation result. Signal 01 indicates that the data expansion circuit may expand the received N-bit data into 2N-bit data in which the lower N bit values are equal to the received N-bit data and the upper N bit values are all equal to the expansion value 0; in this case, the data expansion circuit may output the function selection mode signal 00, and the lower 2N bits of the 4N-bit operation result obtained by the multiplier may be the final operation result. Signal 10 indicates that the data expansion circuit may expand the received N-bit data into 2N-bit data in which the lower N bit values are equal to the received N-bit data and the upper N bit values are all equal to the sign bit value of the received data; in this case, the data expansion circuit may output the function selection mode signal 01, and the lower 2N bits of the 4N-bit operation result obtained by the multiplier may be the final operation result.

S204, encoding the expanded data to obtain a partial product after sign bit expansion.

Specifically, the multiplier may perform binary encoding on the expanded data through the encoding circuit, and obtain the partial product after sign bit expansion according to the received multiplicand to be processed and the result of the binary encoding. Optionally, the number of partial products after sign bit expansion may be equal to N.

S205, accumulating the partial product after sign bit expansion to obtain an operation result.

Specifically, the multiplier may accumulate the partial product after sign bit expansion by using a compression circuit, and obtain an operation result.

For example, suppose a multiplier that can process 16-bit data receives two 8-bit data. The multiplier may expand the two 8-bit data into two 16-bit data through the data expansion circuit and, after multiplying them, obtain one 32-bit result. If the data expansion circuit expands each 8-bit datum so that the low 8 bits are 0 and the high 8 bits are the received data, the data expansion mode selection signal received by the data expansion circuit is 00, the output function selection mode signal is 00, and the multiplier may take the high 16 bits of the 32-bit result as the final operation result. If the data expansion circuit expands each 8-bit datum so that the high 8 bits are 0 and the low 8 bits are the received data, the data expansion mode selection signal is 01, the output function selection mode signal is 00, and the multiplier may take the low 16 bits of the 32-bit result as the final operation result. If the data expansion circuit expands each 8-bit datum so that the high 8 bits are all equal to the sign bit value of the received 8-bit data and the low 8 bits are the received data, the data expansion mode selection signal is 10, the output function selection mode signal is 01, and the multiplier may take the low 16 bits of the 32-bit result as the final operation result.
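The three expansion modes and the matching choice of result bits can be checked with a small software model. The following is a minimal sketch, not the circuit itself: the function names `expand`, `multiply_low_width`, and `to_signed` are hypothetical, and the 4N-bit product is modeled with ordinary integer multiplication truncated to 4N bits.

```python
def expand(value: int, n: int, mode: str) -> tuple[int, str]:
    """Expand an n-bit pattern to 2n bits.

    Returns (expanded_pattern, function_selection_mode_signal).
    """
    mask = (1 << n) - 1
    value &= mask
    if mode == "00":  # data in the high n bits, zeros in the low n bits
        return value << n, "00"
    if mode == "01":  # zeros in the high n bits, data in the low n bits
        return value, "00"
    if mode == "10":  # sign bit replicated in the high n bits
        sign = (value >> (n - 1)) & 1
        return ((mask if sign else 0) << n) | value, "01"
    raise ValueError(f"unknown data expansion mode: {mode}")


def multiply_low_width(a: int, b: int, n: int, mode: str) -> int:
    """Multiply two n-bit patterns on a 2n-bit multiplier model.

    Returns the 2n-bit slice of the 4n-bit product that holds the result.
    """
    expanded_a, _ = expand(a, n, mode)
    expanded_b, _ = expand(b, n, mode)
    product = (expanded_a * expanded_b) & ((1 << (4 * n)) - 1)
    if mode == "00":
        return product >> (2 * n)          # high 2n bits
    return product & ((1 << (2 * n)) - 1)  # low 2n bits


def to_signed(value: int, bits: int) -> int:
    """Interpret a bit pattern as a two's-complement integer."""
    return value - (1 << bits) if (value >> (bits - 1)) & 1 else value


# 0xFD is 253 unsigned and -3 as a signed 8-bit value.
print(multiply_low_width(0xFD, 0x05, 8, "00"))                 # 1265
print(multiply_low_width(0xFD, 0x05, 8, "01"))                 # 1265
print(to_signed(multiply_low_width(0xFD, 0x05, 8, "10"), 16))  # -15
```

Modes 00 and 01 both yield the unsigned product 253 × 5 = 1265, while mode 10 yields the signed product (-3) × 5 = -15, in each case with the original 8-bit operands.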

The data processing method provided in this embodiment receives data to be processed and determines whether the bit width of the data to be processed is equal to the bit width of data that can be processed by the multiplier. If the two are not equal, the method performs data expansion processing on the data to be processed to obtain expanded data, encodes the expanded data to obtain a partial product after sign bit expansion, and accumulates the partial product after sign bit expansion to obtain an operation result. The method can thus expand the received low-bit-width data so that the expanded data meets the bit-width requirement of the multiplier, while the final multiplication result is still the result of multiplying the original-bit-width data. This ensures that the multiplier can process low-bit-width data and effectively reduces the area of the AI chip occupied by the multiplier.

In a data processing method according to another embodiment, after determining whether the bit width of the data to be processed is equal to the bit width of the data processable by the multiplier, the method further includes: if they are equal, encoding the data to be processed to obtain a partial product after sign bit expansion.

Specifically, if the bit width of the data to be processed received by the multiplier is equal to the bit width 2N of the data currently processed by the multiplier, the judgment circuit in the multiplier may input the received data to be processed to the encoding circuit, and the encoding circuit directly performs binary encoding on the data to be processed to obtain the partial product after sign bit expansion. In this case, the multiplier does not need to perform data expansion processing on the data to be processed.

According to the data processing method provided by the embodiment, if the bit width of the data to be processed received by the multiplier is equal to the bit width of the data currently processed by the multiplier, the coding circuit can directly code the data to be processed to obtain the partial product after sign bit expansion, and accumulate the partial product after sign bit expansion to obtain the operation result.

As shown in fig. 13, in a data processing method according to another embodiment, the step in S204 of encoding the expanded data to obtain a partial product after sign bit expansion includes:

S2041, performing Booth coding processing on the expanded data to obtain a coded signal.

Specifically, the multiplier may perform Booth coding processing on the expanded multiplier to be processed through a Booth coding sub-circuit to obtain a coded signal. Optionally, in the Booth coding process, every 3 adjacent bits of the input multiplier may be encoded into one coded signal. The coding rule of the Booth coding process may refer to table 1, from which it can be seen that Booth coding of the multiplier by the Booth coding sub-circuit yields five different types of coded signals, defined as -2X, 2X, -X, X, and 0, respectively.
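As an illustration of how three-bit groups map to the five coded signals, the sketch below assumes that table 1 follows the standard radix-4 (modified) Booth rule, in which each overlapping group (y₂ᵢ₊₁, y₂ᵢ, y₂ᵢ₋₁) of the multiplier encodes the digit y₂ᵢ₋₁ + y₂ᵢ - 2·y₂ᵢ₊₁. The function name `booth_radix4_digits` is hypothetical, and the digits -2, -1, 0, 1, 2 stand for the coded signals -2X, -X, 0, X, 2X.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Return the radix-4 Booth digits of a two's-complement multiplier.

    Digits are listed least significant first; each is in {-2, -1, 0, 1, 2}.
    """
    if bits % 2:
        raise ValueError("bits must be even for radix-4 Booth coding")
    # Append the implicit bit y_(-1) = 0 below the least significant bit.
    padded = (multiplier & ((1 << bits) - 1)) << 1
    digits = []
    for i in range(0, bits, 2):
        group = (padded >> i) & 0b111  # three adjacent bits, overlapping by one
        y_low = group & 1
        y_mid = (group >> 1) & 1
        y_high = (group >> 2) & 1
        digits.append(y_low + y_mid - 2 * y_high)
    return digits


# 0xFD is -3 as a signed 8-bit value: -3 = 1*4**0 + (-1)*4**1.
print(booth_radix4_digits(0xFD, 8))  # [1, -1, 0, 0]
```

Under this rule a 2N-bit multiplier produces N digits, which agrees with the statement that the number of partial products may be equal to N.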

S2042, according to the data to be processed and the coded signal, obtaining a partial product after sign bit expansion.

Specifically, the partial product obtaining sub-circuit may obtain the partial product after sign bit expansion by data expansion processing according to the expanded multiplicand to be processed and the encoded signal.

The data processing method provided by this embodiment performs Booth coding processing on the expanded data to obtain a coded signal, obtains a partial product after sign bit expansion according to the data to be processed and the coded signal, and performs accumulation processing on the partial product after sign bit expansion to obtain an operation result.

In one embodiment, as shown in fig. 14, the step of obtaining the partial product after sign bit expansion according to the data to be processed and the encoded signal in S2042 includes:

S2042a, obtaining an original partial product according to the data to be processed and the coded signal.

Specifically, the number of original partial products may be equal to the number of coded signals. Optionally, the original partial product may be a partial product without sign bit extension.

Illustratively, if the partial product obtaining sub-circuit receives an 8-bit multiplicand x₇x₆x₅x₄x₃x₂x₁x₀ (i.e., X), it may directly obtain the corresponding original partial product from the multiplicand X and the five types of coded signals -2X, 2X, -X, X, and 0. When the coded signal is -2X, the original partial product may be obtained by shifting X left by one bit, inverting the result bit by bit, and adding 1. When the coded signal is 2X, the original partial product may be obtained by shifting X left by one bit. When the coded signal is -X, the original partial product may be obtained by inverting X bit by bit and adding 1. When the coded signal is X, the original partial product may be the data obtained by combining X with one additional bit above its highest bit, where the value of the additional bit may be equal to the sign bit value of X. When the coded signal is 0, the original partial product may be 0, i.e., each bit value in the 9-bit partial product is equal to 0.
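The five cases can be modeled with shifts, bitwise inversion, and an add of 1. The sketch below is an assumption-laden illustration: the function name `original_partial_product` is hypothetical, the coded signal is passed as an integer digit in {-2, -1, 0, 1, 2}, and the -X case is modeled by negating the (N+1)-bit form of X in which the sign bit has been prepended.

```python
def original_partial_product(x: int, n: int, digit: int) -> int:
    """Return the (n+1)-bit original partial product for one coded signal.

    x is the n-bit two's-complement multiplicand pattern; digit selects
    -2X, -X, 0, X or 2X.
    """
    mask = (1 << (n + 1)) - 1
    x &= (1 << n) - 1
    sign = (x >> (n - 1)) & 1
    x_ext = (sign << n) | x          # X with its sign bit prepended
    if digit == 0:
        return 0
    if digit == 1:
        return x_ext
    if digit == 2:
        return (x << 1) & mask       # shift left by one bit
    if digit == -1:
        return (~x_ext + 1) & mask   # invert bit by bit, then add 1
    if digit == -2:
        return (~(x << 1) + 1) & mask  # shift left, invert, add 1
    raise ValueError(f"unsupported coded signal: {digit}")


# X = 0xFD (-3 as a signed 8-bit value), 9-bit results.
print(hex(original_partial_product(0xFD, 8, 1)))   # 0x1fd  (-3)
print(hex(original_partial_product(0xFD, 8, 2)))   # 0x1fa  (-6)
print(hex(original_partial_product(0xFD, 8, -1)))  # 0x3    (+3)
print(hex(original_partial_product(0xFD, 8, -2)))  # 0x6    (+6)
```

One corner case of this model is worth noting: for the most negative multiplicand (X = -2^(N-1)), the value -2X does not fit in N+1 signed bits, so the sketch wraps modulo 2^(N+1).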

S2042b, sign bit expansion processing is carried out on the original partial product, and the partial product after sign bit expansion is obtained.

Specifically, the partial product obtaining sub-circuit may perform sign bit extension processing on the original partial product according to the sign bit value of the original partial product, so as to obtain the partial product after sign bit extension. Optionally, the bit width of the original partial product may be equal to N+1, and the bit width of the partial product after sign bit extension may be equal to 2N. Optionally, the low N+1 bits of the partial product after sign bit extension are the N+1 bits of the original partial product, and the high N-1 bits of the partial product after sign bit extension are all equal to the sign bit value of the original partial product.
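The extension from N+1 bits to 2N bits amounts to copying the sign bit into the N-1 high bits. A minimal sketch follows; the function name `sign_extend` and its parameters are hypothetical.

```python
def sign_extend(pattern: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a from_bits-wide two's-complement pattern to to_bits."""
    pattern &= (1 << from_bits) - 1
    sign = (pattern >> (from_bits - 1)) & 1
    # The high (to_bits - from_bits) bits all take the sign bit value.
    high = ((1 << (to_bits - from_bits)) - 1) if sign else 0
    return (high << from_bits) | pattern


# N = 8: a 9-bit original partial product becomes a 16-bit one.
print(hex(sign_extend(0x1FD, 9, 16)))  # 0xfffd
print(hex(sign_extend(0x003, 9, 16)))  # 0x3
```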

According to the data processing method provided by the embodiment, the original partial product is obtained according to the data to be processed and the coded signal, sign bit expansion processing is performed on the original partial product to obtain the partial product after sign bit expansion, and accumulation processing is performed on the partial product after sign bit expansion to obtain the operation result.

In another embodiment of the data processing method, the step of accumulating the partial product after sign bit extension in S205 to obtain an operation result includes:

S2051, accumulating the partial product after sign bit expansion through the Wallace tree group sub-circuit to obtain a first operation result.

Specifically, the multiplier may accumulate all partial products after sign bit expansion by the Wallace tree group sub-circuit according to a distribution rule to obtain a first operation result. Optionally, the first operation result may include a Sum output signal Sum and a Carry output signal Carry, where bit widths of the Sum output signal Sum and the Carry output signal Carry may be the same.

S2052, accumulating the first operation result through the accumulation sub-circuit to obtain an operation result.

Specifically, the multiplier may add the Carry output signal Carry and the Sum output signal Sum output by the Wallace tree group sub-circuit by an adder in the accumulation sub-circuit, and output an addition result.
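The two-stage accumulation (reduction to a Sum word and a Carry word, followed by one final addition) can be sketched in software with carry-save adders. This is a simplified model under stated assumptions: it compresses whole rows three at a time rather than wiring one Wallace tree unit per column, and the names `carry_save_add`, `wallace_reduce`, and `final_add` are hypothetical.

```python
def carry_save_add(a: int, b: int, c: int, width: int) -> tuple[int, int]:
    """3:2 compression: three words in, a sum word and a carry word out."""
    mask = (1 << width) - 1
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word & mask, carry_word & mask


def wallace_reduce(rows: list[int], width: int) -> tuple[int, int]:
    """Reduce partial-product rows to (Sum, Carry) without carry propagation."""
    mask = (1 << width) - 1
    rows = [row & mask for row in rows]
    while len(rows) > 2:
        next_rows = []
        # Compress each complete group of three rows into two rows.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2], width))
        # Rows that did not fill a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


def final_add(sum_word: int, carry_word: int, width: int) -> int:
    """Carry-propagating addition of the Sum and Carry words."""
    return (sum_word + carry_word) & ((1 << width) - 1)


rows = [3, 5, 7, 11, 13]
sum_word, carry_word = wallace_reduce(rows, 16)
print(final_add(sum_word, carry_word, 16))  # 39
```

The Sum and Carry words have the same bit width, and only the last step needs a carry-propagating adder.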

According to the data processing method provided by the embodiment, the Wallace tree group sub-circuit is used for accumulating the partial product after sign bit expansion to obtain a first operation result, and the accumulation sub-circuit is used for accumulating the first operation result to obtain an operation result.

The embodiment of the application also provides a machine learning operation device, which comprises one or more multipliers mentioned in the application, and is used for acquiring data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to peripheral equipment through an I/O interface. Examples of peripheral equipment include cameras, displays, mice, keyboards, network cards, wifi interfaces, and servers. When more than one multiplier is included, the multipliers can be linked and can transmit data through a specific structure, for example, interconnected through a PCIE bus, to support larger-scale machine learning operations. In this case, the same control system may be shared, or there may be separate control systems; the memory may be shared, or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.

The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.

The embodiment of the application also provides a combined processing device which comprises the machine learning arithmetic device, the universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user. Fig. 15 is a schematic view of a combined processing apparatus.

Other processing devices include one or more types of general-purpose/special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing data transfer and basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices can also cooperate with the machine learning arithmetic device to complete computation tasks.

And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device obtains the required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.

Optionally, as shown in fig. 16, the configuration may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing device, and is particularly suitable for data to be computed that cannot be held entirely in the internal storage of the machine learning arithmetic device or the other processing device.

The combined processing device can be used as an SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed, and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a wifi interface.

In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or the combined processing device.

In some embodiments, a chip package structure is provided, which includes the above chip.

In some embodiments, a board card is provided, which includes the above chip package structure. As shown in fig. 17, in addition to the chip 389, the board card may include other supporting components, including but not limited to: a memory device 390, a receiving device 391, and a control device 392;

the memory device 390 is connected to the chip in the chip package structure through a bus for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).

DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be transferred on both the rising and falling edges of the clock pulse. DDR is therefore twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of the storage units. Each group of the storage units may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, in which 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are adopted in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
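The 25600 MB/s figure follows from the transfer rate and the data width of one controller. The short calculation below is a sketch with a hypothetical helper name, assuming DDR4-3200 denotes 3200 million transfers per second and counting only the 64 data bits (the 8 ECC bits carry no payload).

```python
def ddr_peak_bandwidth_mb_s(mega_transfers_per_s: int, data_bits: int) -> int:
    """Peak bandwidth in MB/s for one controller: transfers/s x bytes/transfer."""
    return mega_transfers_per_s * data_bits // 8


print(ddr_peak_bandwidth_mb_s(3200, 64))  # 25600
```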

In one embodiment, each group of the memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.

The receiving device is electrically connected with the chip in the chip package structure. The receiving device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the receiving device may be a standard PCIE interface, and the data to be processed is transmitted from the server to the chip through the standard PCIE interface, so as to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the receiving device may also be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the receiving device.
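The 16000 MB/s figure corresponds to the raw line rate of a PCIE 3.0 x16 link. The sketch below, using a hypothetical helper name, computes that raw rate from 8 GT/s per lane and 16 lanes, and also the slightly lower rate that remains after the 128b/130b line coding used by PCIE 3.0.

```python
def pcie_bandwidth_mb_s(giga_transfers_per_s: float, lanes: int,
                        payload_bits: int = 128,
                        coded_bits: int = 130) -> tuple[float, float]:
    """Return (raw, after line coding) bandwidth in MB/s, one direction."""
    raw = giga_transfers_per_s * 1000 * lanes / 8  # one bit per transfer per lane
    effective = raw * payload_bits / coded_bits
    return raw, effective


raw, effective = pcie_bandwidth_mb_s(8.0, 16)
print(raw)               # 16000.0
print(round(effective))  # 15754
```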

The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may carry a plurality of loads. Therefore, the chip can be in different working states such as heavy load and light load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the chip.

In some embodiments, an electronic device is provided that includes the above board card.

The electronic device may be a data processor, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle comprises an airplane, a ship and/or a motor vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.

It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of circuit combinations, but those skilled in the art should understand that the present application is not limited by the described circuit combinations, because some circuits may be implemented in other ways or structures according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all alternative embodiments, and that the devices and modules referred to are not necessarily required for this application.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A multiplier, characterized in that it comprises: the output end of the judgment circuit is connected with the input end of the data expansion circuit, the output end of the judgment circuit is connected with the first input end of the coding circuit, the output end of the data expansion circuit is connected with the second input end of the coding circuit, and the output end of the coding circuit is connected with the input end of the compression circuit;

2. The multiplier of claim 1, wherein the coding circuit comprises a third input terminal for receiving an input function selection mode signal; the compression circuit includes a first input for receiving an input function selection mode signal.

3. The multiplier of claim 1, wherein the decision circuit comprises: a data input port and a data output port; the data input port is configured to receive data to be subjected to multiplication, the data output port is configured to output the received data, and the fourth data input port is configured to output a second received data.

4. The multiplier of claim 1, wherein the data spreading circuit comprises: the data expansion module comprises a data input port, a data expansion mode selection signal input port, a function selection mode signal output port and an expanded data output port; the data input port is used for receiving the data output by the judging circuit, the data expansion mode selection signal input port is used for receiving a data expansion mode selection signal corresponding to the received data through expansion processing, the function selection mode signal output port is used for outputting a function selection mode signal determined according to the mode of the data expansion circuit through expansion processing of the received data, and the expanded data output port is used for outputting the data after the expansion processing.

5. The multiplier of claim 1, wherein the encoding circuit comprises: the Booth encoding circuit comprises a Booth encoding sub-circuit and a partial product obtaining sub-circuit, wherein the output end of the Booth encoding sub-circuit is connected with the first input end of the partial product obtaining sub-circuit;

6. The multiplier of claim 5, wherein the Booth encoding subcircuit comprises: the data input port is used for receiving data subjected to Booth coding processing, and the coding signal output port is used for outputting a coding signal obtained after the Booth coding processing is performed on the received data.

7. The multiplier of claim 5, wherein the partial product acquisition sub-circuit comprises: the device comprises an encoding signal input port, a data input port and a partial product output port, wherein the encoding signal input port is used for receiving the encoding signal, the data input port is used for receiving the data, and the partial product output port is used for outputting a partial product of a target code acquired according to the encoding signal and the received data.

8. The multiplier of claim 1, wherein the compression circuit comprises: a Wallace tree group sub-circuit and an accumulation sub-circuit; the output end of the Wallace tree group sub-circuit is connected with the input end of the accumulation sub-circuit; the Wallace tree group sub-circuit is configured to accumulate the partial products of the target code, and the accumulation sub-circuit is configured to accumulate the received input data.

9. The multiplier of claim 8, wherein the Wallace tree group sub-circuit comprises: a Wallace tree unit to accumulate each column of the partial product of the target code.

10. The multiplier of claim 8, wherein the accumulation sub-circuit comprises: and the adder is used for performing addition operation on the two received data with the same bit width.

11. The multiplier of claim 10, wherein the adder comprises: the carry signal input port is used for receiving a carry signal, the sum signal input port is used for receiving a sum signal, and the operation result output port is used for outputting the result of the accumulation processing of the carry signal and the sum signal.

12. A method of data processing, the method comprising:

receiving data to be processed;

coding the expanded data to obtain a partial product after sign bit expansion;

13. The method according to claim 12, further comprising, after determining whether the bit width of the data to be processed is equal to the bit width of the data processable by the multiplier: if they are equal, encoding the data to be processed to obtain a partial product after sign bit expansion.

14. The method of claim 12, wherein said encoding said data after spreading to obtain a sign-bit-spread partial product comprises:

15. The method of claim 14, wherein obtaining the sign-bit-extended partial product based on the data to be processed and the encoded signal comprises:

16. The method according to claim 12, wherein the performing data expansion processing on the data to be processed to obtain expanded data includes: and performing data expansion processing on the data to be processed through 0 or the sign bit value of the data to be processed to obtain expanded data.

17. The method of claim 16, wherein the bit width of the expanded data is equal to the bit width of data currently processed by the multiplier.

18. The method of claim 12, wherein accumulating the sign-bit-extended partial product to obtain an operation result comprises:

19. A machine learning operation device, wherein the machine learning operation device comprises one or more multipliers according to any one of claims 1 to 11, and is used for acquiring input data and control information to be operated from other processing devices, executing specified machine learning operation, and transmitting the execution result to other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of multipliers, the plurality of multipliers can be connected through a specific structure and transmit data;

20. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning arithmetic apparatus according to claim 19, a universal interconnect interface and other processing apparatus;

and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.

21. The combined processing device according to claim 20, further comprising: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.

22. A neural network chip, wherein the neural network chip comprises the machine learning arithmetic device of claim 19 or the combined processing device of claim 20.

23. An electronic device, characterized in that it comprises a chip according to claim 20.

24. A board card, characterized in that the board card includes: a storage device, a receiving device, a control device, and the neural network chip as claimed in claim 22;

wherein the neural network chip is respectively connected with the storage device, the control device and the receiving device;

the storage device is used for storing data;

the receiving device is used for realizing data transmission between the chip and external equipment;

and the control device is used for monitoring the state of the chip.

25. The board of claim 24,

the memory device includes: a plurality of groups of memory cells, each group of memory cells is connected with the chip through a bus, and the memory cells are: DDR SDRAM;

the chip includes: the DDR controller is used for controlling data transmission and data storage of each memory unit;

the receiving device is as follows: a standard PCIE interface.