CN109523019B - Accelerator, accelerating system based on FPGA, control method and CNN network system - Google Patents


Info

Publication number
CN109523019B
CN109523019B (application number CN201811639964.5A)
Authority
CN
China
Prior art keywords
accelerator
configuration information
mux
fpga
pes
Prior art date
Legal status
Active
Application number
CN201811639964.5A
Other languages
Chinese (zh)
Other versions
CN109523019A (en)
Inventor
邬志影
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811639964.5A priority Critical patent/CN109523019B/en
Publication of CN109523019A publication Critical patent/CN109523019A/en
Application granted granted Critical
Publication of CN109523019B publication Critical patent/CN109523019B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation using electronic means
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Power Sources (AREA)

Abstract

The present disclosure provides an accelerator comprising a plurality of computing units (PEs) and at least one multiplexer (MUX). Each MUX is connected to two interconnected PEs, and the number of PEs performing operations is determined by changing the connection state of the MUX. The accelerator thus reduces power consumption and balances computing-power demands. The disclosure also provides an FPGA-based acceleration system and its control method, and a CNN network system and its control method.

Description

Accelerator, accelerating system based on FPGA, control method and CNN network system
Technical Field
The embodiments of the present disclosure relate to the field of Internet technology, and in particular to an accelerator, an FPGA-based acceleration system and its control method, and a CNN network system and its control method.
Background
An FPGA (Field-Programmable Gate Array) is a further development of programmable devices such as PALs, GALs, and CPLDs. As a semi-custom circuit in the field of Application-Specific Integrated Circuits (ASICs), it both avoids the inflexibility of fully custom circuits and overcomes the limited gate count of earlier programmable devices. The core of the FPGA design is the accelerator, which carries out the corresponding operations.
In the prior art, an accelerator consists of a plurality of computing units (PEs). To meet computing-power demands, as many PEs as possible are instantiated when the underlying FPGA is designed, and the computation is then carried out in a systolic fashion.
Disclosure of Invention
The embodiment of the disclosure provides an accelerator, an acceleration system and a control method based on FPGA, a CNN network system and a control method.
According to one aspect of the embodiments of the present disclosure, an accelerator is provided, comprising: a plurality of computing units (PEs) and at least one multiplexer (MUX), wherein each MUX is connected to two interconnected PEs, and the number of PEs performing operations is determined by changing the connection state of the MUX.
In some embodiments, there are a plurality of MUXes, and any two adjacent PEs are connected by one MUX.
In some embodiments, the PEs are arranged in a matrix array.
In some embodiments, the connection states include an output state, a cascade state, and a disconnected state.
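The three connection states can be sketched as a small Python enumeration (the class and member names are illustrative, not taken from the patent):

```python
from enum import Enum

class MuxState(Enum):
    """Hypothetical names for the three MUX connection states
    described in the disclosure."""
    OUTPUT = "output"              # the PE's result is driven out of the array
    CASCADE = "cascade"            # two adjacent PEs are chained together
    DISCONNECTED = "disconnected"  # the downstream PE is cut off from the chain
```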
According to another aspect of the embodiments of the present disclosure, there is also provided an FPGA-based acceleration system, comprising: a register, a computing controller, a memory, a transmitter, and the accelerator of any of the above embodiments, wherein the computing controller is connected to the accelerator and the register, respectively, and the transmitter is connected to the accelerator and the memory, respectively.
According to another aspect of the embodiments of the present disclosure, there is also provided a method for controlling an FPGA-based acceleration system, the method being based on the FPGA-based acceleration system as described above, including:
determining configuration information according to the acquired computing-power demand information, wherein the configuration information includes the number of PEs and the connection state of each MUX;
and sending the configuration information to the accelerator so that the accelerator can determine the connection states of the MUXes according to the configuration information.
According to another aspect of the embodiments of the present disclosure, there is also provided an FPGA-based acceleration system, comprising the accelerator of any of the above embodiments, and further comprising a computing controller connected to the accelerator.
The computing controller is configured to: determine configuration information according to the acquired computing-power demand information, wherein the configuration information includes the number of PEs and the connection state of each MUX, and transmit the configuration information to the accelerator.
According to another aspect of the embodiments of the present disclosure, there is also provided a CNN network system including: an FPGA-based acceleration system as described above, and a processor coupled to the FPGA-based acceleration system.
In some embodiments, the processor is configured to: determine configuration information according to the acquired computing-power demand information, wherein the configuration information includes the number of PEs and the connection state of each MUX, and transmit the configuration information to the FPGA-based acceleration system.
According to another aspect of the embodiments of the present disclosure, there is also provided a method for controlling a CNN network system, the method being based on the system as described above, including:
acquiring computing-power demand information;
determining configuration information from the computing-power demand information and a preset configuration information table, wherein the configuration information includes the number of PEs and the connection state of each MUX;
and sending the configuration information to the FPGA-based acceleration system so that the FPGA-based acceleration system can determine the connection states of the MUXes according to the configuration information.
The accelerator provided by the embodiments of the present disclosure reduces power consumption and balances computing-power demands.
Drawings
The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure.
The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
fig. 1 is a schematic structural view of an accelerator according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a computing module provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a computing module according to another embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an acceleration system based on FPGA according to an embodiment of the present disclosure;
Fig. 5 is a flow chart of a control method of an acceleration system based on an FPGA according to an embodiment of the disclosure;
fig. 6 is a schematic structural diagram of a CNN network system according to an embodiment of the present disclosure;
Fig. 7 is a flow chart of a control method of a CNN network system according to an embodiment of the present disclosure;
Reference numerals:
1. a memory; 2. a transmitter; 3. a register; 4. a computing controller; 5. an accelerator;
6. a processor.
Detailed Description
In order to enable those skilled in the art to better understand the technical scheme of the present disclosure, the accelerator, the FPGA-based acceleration system and its control method, and the CNN network system and its control method provided by the present disclosure are described in detail below with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Embodiments described herein may be described with reference to plan and/or cross-sectional views with the aid of idealized schematic diagrams of the present disclosure. Accordingly, the example illustrations may be modified in accordance with manufacturing techniques and/or tolerances. Thus, the embodiments are not limited to the embodiments shown in the drawings, but include modifications of the configuration formed based on the manufacturing process. Thus, the regions illustrated in the figures have schematic properties and the shapes of the regions illustrated in the figures illustrate the particular shapes of the regions of the elements, but are not intended to be limiting.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
According to one aspect of the embodiments of the present disclosure, an accelerator is provided.
The accelerator includes: a plurality of computing units (PEs) and at least one multiplexer (MUX), wherein each MUX is connected to two interconnected PEs, and the number of PEs performing operations is determined by changing the connection states of the MUXes.
In the prior art, accelerators are designed using either a systolic-array design or a pseudo-instruction-set approach. Once the accelerator is built, its computing power is fixed, so supply cannot be balanced against the computing-power demand.
For example, if an accelerator includes m PEs, every PE participates in an operation even when only a fraction of the maximum computing power is required. This generates enormous power consumption and wastes computing power.
In the present embodiment, by contrast, at least one MUX is provided. For example, when there is a single MUX, it may be placed between any two adjacently connected PEs, say between PE10 and PE11. By changing the connection state of this MUX, the number of PEs performing the operation can be changed: when the MUX is in the cascade state, PE10 and PE11 are interconnected, whereas when the MUX is disconnected, only the PEs up to and including PE10 perform the operation.
That is, assuming that ResNet requires 0.5 T of computing power while the accelerator's maximum computing power reaches 1 T, the scheme provided by this embodiment allows the second half of the PEs to be disconnected through the MUXes and thereby shut out of the operation, reducing power consumption and achieving a balanced allocation of computing power.
When there are several MUXes, for example n PEs and m MUXes with n > m + 1, a MUX can be placed between any two adjacent PEs.
In one possible implementation, there are a plurality of MUXes, and any two adjacent PEs are connected through one MUX.
In this embodiment, when there are multiple MUXes, for example n PEs and m MUXes with n = m + 1, a MUX is placed between every two adjacent PEs, so that each pair of adjacent PEs is connected by a MUX.
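A minimal Python model of a one-dimensional chain of n PEs with n - 1 MUXes (all names are illustrative, not from the patent) shows how a single disconnected MUX limits the number of PEs taking part in an operation:

```python
CASCADE, DISCONNECTED, OUTPUT = "cascade", "disconnected", "output"

def active_pe_count(mux_states):
    """Count the PEs reachable from the head of the chain, where
    mux_states[i] is the MUX between PE i and PE i+1."""
    count = 1  # the first PE always participates
    for state in mux_states:
        if state != CASCADE:
            break  # a non-cascaded MUX ends the chain here
        count += 1
    return count

# With the second MUX disconnected, only the first two PEs compute,
# regardless of how long the chain is.
chain = [CASCADE, DISCONNECTED, CASCADE]
```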
In one possible implementation, the PEs are arranged in a matrix array.
Referring specifically to fig. 1, fig. 1 is a schematic structural diagram of an accelerator according to an embodiment of the disclosure.
The operating principle of the accelerator provided by the embodiments of the present disclosure is now described in detail with reference to fig. 1. The PEs are arranged in a matrix of M rows and N columns, so the number of PEs is M×N. If the minimum computation parallelism of one PE is 32 (i.e. 32 MACs are computed in parallel in one clock cycle) and all PEs work, the total computing power is M×N×f_clock×32, in units of MACs/s. Each PE is connected to its adjacent PEs through MUXes, and each MUX has three states: cascade, disconnected, and output. The computing power reaches its maximum when the rows are cascaded and the columns disconnected, or the columns cascaded and the rows disconnected. When the algorithm requires only half of the computing power, the array can be cut longitudinally (all MUXes in the right half disconnected) or laterally (all MUXes in the lower half disconnected).
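The throughput formula above can be checked with a short Python sketch (the clock frequency used is an arbitrary example, not a figure from the patent):

```python
def peak_macs_per_second(rows, cols, f_clock_hz, macs_per_pe=32):
    """Peak throughput of an M x N PE array in which each PE computes
    `macs_per_pe` MACs per clock cycle: M * N * f_clock * 32."""
    return rows * cols * f_clock_hz * macs_per_pe

# e.g. a 3 x 12 array at an assumed 300 MHz clock:
peak = peak_macs_per_second(3, 12, 300_000_000)  # MACs/s
```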
Specifically, assume the algorithm requires 0.5 TOPS and the accelerator's maximum computing power is 1.5 TOPS, the value obtained when all PEs compute simultaneously. The PEs are distributed in 3 rows and 12 columns, i.e. 3×12 = 36 PEs working simultaneously yield 1.5 TOPS. The algorithm needs only 0.5 TOPS, one third of the maximum. Letting all PEs compute simultaneously would still satisfy the requirement, but it would increase the accelerator's power consumption while the data computed by 2/3 of the PEs would be discarded. In this embodiment, by changing the connection states of the MUXes, two PE connection patterns provide the required computing power: a 1-row, 12-column computing module (shown in FIG. 2) and a 3-row, 4-column computing module (shown in FIG. 3).
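Both reduced configurations activate 12 of the 36 PEs, one third of the array. A quick Python check (illustrative, assuming computing power scales linearly with the number of active PEs) confirms that each yields the required 0.5 TOPS:

```python
def achieved_tops(active_rows, active_cols, total_pes=36, max_tops=1.5):
    """Computing power delivered by an active sub-array, assuming it
    scales linearly with the number of active PEs."""
    return max_tops * (active_rows * active_cols) / total_pes

config_a = achieved_tops(1, 12)  # 1 row x 12 columns (FIG. 2)
config_b = achieved_tops(3, 4)   # 3 rows x 4 columns (FIG. 3)
```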
According to another aspect of the disclosed embodiments, the disclosed embodiments provide an FPGA-based acceleration system.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an acceleration system based on FPGA according to an embodiment of the disclosure.
As shown in fig. 4, the FPGA-based acceleration system includes: the system comprises a register, a calculation controller, a memory, a transmitter and an accelerator as described above, wherein the calculation controller is respectively connected with the accelerator and the register, and the transmitter is respectively connected with the accelerator and the memory.
Wherein the dashed line of the memory connection indicates that the memory can be connected to an external device. Similarly, the dashed lines of the register connection indicate that the register may be connected to an external device.
Referring to fig. 5, fig. 5 is a flowchart of a control method of an acceleration system based on an FPGA according to an embodiment of the disclosure.
As shown in fig. 5, the method includes:
S1: the computing controller determines configuration information according to the acquired computing power demand information, the configuration information including the number of PEs and the connection state of each MUX.
In this step, when the computing controller learns the calculation force demand information, the number of PEs required to complete the calculation force corresponding to the calculation force demand information may be determined from the calculation force demand information.
When the number of PEs is known, then the connection state of each MUX may be determined to achieve that the number of PEs is the same as the number of PEs required for computing power when performing the arithmetic operation.
S2: the computing controller sends the configuration information to the accelerator so that the accelerator determines the connection state of the MUX based on the configuration information.
It can be appreciated that when the accelerator receives the configuration information, it places each MUX in the corresponding cascade or disconnected state to connect the corresponding number of PEs, thereby obtaining a computing module that delivers the corresponding computing power.
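Steps S1 and S2 can be sketched for a one-dimensional PE chain as follows (the helper names are hypothetical; the patent does not specify the controller's internal logic):

```python
def mux_states_for(required_pes, total_pes):
    """Derive chain MUX states so that exactly `required_pes` PEs are
    cascaded; every MUX after the last active PE is disconnected."""
    if not 1 <= required_pes <= total_pes:
        raise ValueError("required PE count out of range")
    return ["cascade" if i < required_pes - 1 else "disconnected"
            for i in range(total_pes - 1)]  # one MUX per adjacent PE pair

def configure_accelerator(demand_pes, total_pes):
    """S1: build the configuration information; S2 sends it to the accelerator."""
    return {"num_pes": demand_pes,
            "mux_states": mux_states_for(demand_pes, total_pes)}
```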
According to another aspect of the embodiments of the present disclosure, there is further provided an FPGA-based acceleration system, comprising the accelerator as described above and a computing controller connected to the accelerator, wherein the computing controller is configured to: determine configuration information according to the acquired computing-power demand information, wherein the configuration information includes the number of PEs and the connection state of each MUX, and transmit the configuration information to the accelerator.
According to another aspect of the embodiments of the present disclosure, the embodiments of the present disclosure further provide a CNN network system.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a CNN network system according to an embodiment of the disclosure.
As shown in fig. 6, the system includes: the FPGA-based acceleration system as described above, and a processor coupled to the FPGA-based acceleration system.
The processor is configured to: determine configuration information according to the acquired computing-power demand information, wherein the configuration information includes the number of PEs and the connection state of each MUX, and transmit the configuration information to the FPGA-based acceleration system.
The scheme of this embodiment will now be described in detail with reference to fig. 6:
in the processor, a configuration information table mapped to computing-power demand information is stored in advance. That is, from the configuration information table one can read, for a given amount of demanded computing power, how many PEs to select and which MUXes should be in the cascade state and which in the disconnected state.
Therefore, when the processor receives (or acquires) the computing-power demand information, it traverses the configuration information table, or matches the demand information against the table, to obtain the corresponding configuration information, namely the number of PEs and the connection state of each MUX. The processor then transmits the configuration information to the FPGA-based acceleration system via the AXI bus.
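The table lookup described here might look like the following sketch (the table contents and helper names are hypothetical, chosen to mirror the 0.5/1.5 TOPS example earlier in the description):

```python
# Hypothetical preset table: computing-power demand (TOPS) -> configuration.
CONFIG_TABLE = {
    0.5: {"pe_rows": 1, "pe_cols": 12},  # 12 active PEs
    1.5: {"pe_rows": 3, "pe_cols": 12},  # all 36 PEs cascaded
}

def lookup_config(demand_tops):
    """Return the smallest preset configuration whose computing power
    still satisfies the demand (a simple matching strategy)."""
    for tops in sorted(CONFIG_TABLE):
        if tops >= demand_tops:
            return CONFIG_TABLE[tops]
    raise ValueError("demand exceeds the accelerator's maximum computing power")
```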
Specifically, the AXI bus sends configuration information to registers in the FPGA-based acceleration system. The registers send configuration information to the compute controller.
After receiving the configuration information, the computing controller configures the accelerator accordingly, i.e. it places the MUXes in the accelerator into the states that satisfy the corresponding computing-power demand. After completing the configuration, the computing controller sends a data-fetch instruction to the memory.
After receiving the data-fetch instruction, the memory acquires the data to be computed, corresponding to the computing-power demand information, from the processor through the AXIS bus, and transmits it to the transmitter (specifically, to the input module of the transmitter).
The accelerator extracts the data to be computed from the transmitter's input module, performs the computation, and stores the result in the transmitter's output module.
The memory then obtains the computation result from the transmitter and transmits it to the processor through the AXIS bus.
According to another aspect of the embodiments of the present disclosure, the embodiments of the present disclosure further provide a control method of a CNN network system, which is based on the system as described above.
Referring to fig. 7, fig. 7 is a flowchart illustrating a control method of a CNN network system according to an embodiment of the disclosure. The method comprises the following steps:
S10: the processor obtains the power demand information.
S20: the processor determines configuration information including the number of PEs and the connection status of each MUX from the power demand information and a preset configuration information table.
S30: the processor sends the configuration information to the FPGA-based acceleration system so that the FPGA-based acceleration system determines the connection state of the MUX according to the configuration information.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. 
Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (10)

1. An accelerator, comprising: a plurality of computing units (PEs) and a plurality of multiplexers (MUXes), each MUX being connected to two interconnected PEs, wherein the number of PEs performing an operation is determined by changing the connection states of the MUXes; and wherein, in the case that there are a plurality of MUXes and the number of PEs exceeds the number of MUXes by at least two, a MUX may be disposed between any two adjacent PEs.
2. The accelerator of claim 1, wherein there are a plurality of MUXes, and any two adjacent PEs are connected by one MUX.
3. The accelerator of claim 1, wherein the PEs are arranged in a matrix array.
4. The accelerator according to any one of claims 1 to 3, wherein the connection states include an output state, a cascade state, and a disconnected state.
5. An FPGA-based acceleration system, comprising: a register, a computation controller, a memory, a transmitter, and the accelerator of any one of claims 1 to 4, wherein the computation controller is connected to the accelerator and the register, respectively, and the transmitter is connected to the accelerator and the memory, respectively.
6. A method of controlling an FPGA-based acceleration system, the method being based on the FPGA-based acceleration system of claim 5, comprising:
determining configuration information according to the acquired computing-power demand information, wherein the configuration information includes the number of PEs and the connection state of each MUX;
and sending the configuration information to the accelerator so that the accelerator can determine the connection states of the MUXes according to the configuration information.
7. An FPGA-based acceleration system, comprising: the accelerator of any one of claims 1 to 4, and further comprising a computing controller configured to: determine configuration information according to the acquired computing-power demand information, wherein the configuration information includes the number of PEs and the connection state of each MUX, and transmit the configuration information to the accelerator.
8. A CNN network system, comprising: the FPGA-based acceleration system of claim 5, and a processor coupled to the FPGA-based acceleration system.
9. The system of claim 8, wherein,
The processor is configured to: and determining configuration information according to the acquired calculation force demand information, wherein the configuration information comprises the number of PE and the connection state of each MUX, and transmitting the configuration information to the acceleration system based on the FPGA.
10. A control method of a CNN network system, the method being based on the system of claim 8, comprising:
acquiring computing-power demand information;
determining configuration information from the computing-power demand information and a preset configuration information table, wherein the configuration information includes the number of PEs and the connection state of each MUX;
and sending the configuration information to the FPGA-based acceleration system so that the FPGA-based acceleration system can determine the connection states of the MUXes according to the configuration information.
CN201811639964.5A 2018-12-29 2018-12-29 Accelerator, accelerating system based on FPGA, control method and CNN network system Active CN109523019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811639964.5A CN109523019B (en) 2018-12-29 2018-12-29 Accelerator, accelerating system based on FPGA, control method and CNN network system


Publications (2)

Publication Number Publication Date
CN109523019A CN109523019A (en) 2019-03-26
CN109523019B true CN109523019B (en) 2024-05-21

Family

ID=65798565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811639964.5A Active CN109523019B (en) 2018-12-29 2018-12-29 Accelerator, accelerating system based on FPGA, control method and CNN network system

Country Status (1)

Country Link
CN (1) CN109523019B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931911B (en) * 2020-07-30 2022-07-08 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN114691346A (en) * 2020-12-25 2022-07-01 华为技术有限公司 Configuration method and equipment of computing power resources

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1271437A (en) * 1997-10-10 2000-10-25 博普斯公司 Method and apparatus for manifold array processing
JP2001312481A (en) * 2000-02-25 2001-11-09 Nec Corp Array type processor
CN1659540A (en) * 2002-06-03 2005-08-24 皇家飞利浦电子股份有限公司 Reconfigurable integrated circuit
CN1722130A (en) * 2004-07-12 2006-01-18 富士通株式会社 Reconfigurable operation apparatus
CN101576894A (en) * 2008-05-09 2009-11-11 中国科学院半导体研究所 Real-time image content retrieval system and image feature extraction method
CN105378651A (en) * 2013-05-24 2016-03-02 相干逻辑公司 Memory-network processor with programmable optimizations
CN106228238A (en) * 2016-07-27 2016-12-14 中国科学技术大学苏州研究院 The method and system of degree of depth learning algorithm is accelerated on field programmable gate array platform
CN107229463A (en) * 2016-03-24 2017-10-03 联发科技股份有限公司 Computing device and corresponding computational methods
CN107609641A (en) * 2017-08-30 2018-01-19 清华大学 Sparse neural network framework and its implementation
US9886072B1 (en) * 2013-06-19 2018-02-06 Altera Corporation Network processor FPGA (npFPGA): multi-die FPGA chip for scalable multi-gigabit network processing
CN108052449A (en) * 2017-12-14 2018-05-18 北京百度网讯科技有限公司 Operating system condition detection method and device
CN108228094A (en) * 2016-12-09 2018-06-29 英特尔公司 Access waits for an opportunity to increase in memory side cache
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN208283943U (en) * 2018-06-08 2018-12-25 南京信息工程大学 A kind of CNN acceleration optimization device based on FPGA
CN209118339U (en) * 2018-12-29 2019-07-16 百度在线网络技术(北京)有限公司 Accelerator, the acceleration system based on FPGA and CNN network system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE212007000102U1 (en) * 2007-09-11 2010-03-18 Core Logic, Inc. Reconfigurable array processor for floating-point operations
US9449257B2 (en) * 2012-12-04 2016-09-20 Institute Of Semiconductors, Chinese Academy Of Sciences Dynamically reconstructable multistage parallel single instruction multiple data array processing system
CN104734998B (en) * 2013-12-20 2018-11-06 华为技术有限公司 A kind of network equipment and information transferring method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Low-power high-level synthesis for FPGA architectures; Chen, D.M. et al.; ISLPED'03: Proceedings of the 2003 International Symposium on Low Power Electronics and Design; August 2003; vol. 2003; pp. 134-139 *
Design and implementation of a fine-grained parallel CYK algorithm accelerator based on FPGA; Xia Fei; Dou Yong; Song Jian; Lei Guoqing; Chinese Journal of Computers; 2010-05-15 (No. 05); full text *


Similar Documents

Publication Publication Date Title
US7478222B2 (en) Programmable pipeline array
US20170214405A1 (en) Clock Circuit and Clock Signal Transmission Method Thereof
CN109523019B (en) Accelerator, accelerating system based on FPGA, control method and CNN network system
US20150022236A1 (en) Apparatus and Methods for Time-Multiplex Field-Programmable Gate Arrays
WO2006039710A9 (en) Computer-based tool and method for designing an electronic circuit and related system and library for same
US9584128B2 (en) Structure of multi-mode supported and configurable six-input LUT, and FPGA device
CN108140067B (en) Method and system for circuit design optimization
JP5071707B2 (en) Data processing apparatus and control method thereof
US9292640B1 (en) Method and system for dynamic selection of a memory read port
Cardona et al. AC_ICAP: A flexible high speed ICAP controller
US20030040898A1 (en) Method and apparatus for simulation processor
US10489116B1 (en) Programmable integrated circuits with multiplexer and register pipelining circuitry
US11966736B2 (en) Interconnect device for selectively accumulating read data and aggregating processing results transferred between a processor core and memory
CN112970036B (en) Convolutional block array for implementing neural network applications and methods of use thereof
CN109857024B (en) Unit performance test method and system chip of artificial intelligence module
CN109196465B (en) Double precision floating point operation
CN209118339U (en) Accelerator, FPGA-based acceleration system, and CNN network system
EP3293883B1 (en) Programmable logic device, method for verifying error of programmable logic device, and method for forming circuit of programmable logic device
JP2005184262A (en) Semiconductor integrated circuit and its fabricating process
US20180076803A1 (en) Clock-distribution device of ic and method for arranging clock-distribution device
Malhotra et al. Novel field programmable embryonic cell for adder and multiplier
CN109271202B (en) Asynchronous Softmax hardware acceleration method and accelerator
US7861197B2 (en) Method of verifying design of logic circuit
Martínez-Alvarez et al. High performance implementation of an FPGA-based sequential DT-CNN
CN115564032A (en) Logic gate device, processing core and many-core system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant