CN108921289B - FPGA heterogeneous acceleration method, device and system - Google Patents


Info

Publication number
CN108921289B
CN108921289B
Authority
CN
China
Prior art keywords
data
fpga
host
unit
feature data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810635674.7A
Other languages
Chinese (zh)
Other versions
CN108921289A (en)
Inventor
张新
李龙
赵雅倩
陈继承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810635674.7A priority Critical patent/CN108921289B/en
Publication of CN108921289A publication Critical patent/CN108921289A/en
Application granted granted Critical
Publication of CN108921289B publication Critical patent/CN108921289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an FPGA heterogeneous acceleration method, device and system. In this scheme, the host end provides interfaces facing different machine learning platforms so as to obtain the network description files and data of different networks, and the network description files are used to configure different execution units in the underlying library at the FPGA end so as to realize the calculation of the data. With this acceleration mode, the FPGA heterogeneous acceleration system can support a new network without secondary development, and recompilation of the acceleration system and its source code is avoided, thereby saving human and time resources.

Description

FPGA heterogeneous acceleration method, device and system
Technical Field
The invention relates to the technical field of heterogeneous acceleration, and in particular to an FPGA (Field Programmable Gate Array) heterogeneous acceleration method, device and system.
Background
The Convolutional Neural Network (CNN) is an efficient recognition method developed in recent years that has attracted wide attention. While studying the neurons responsible for local sensitivity and direction selection in the feline cerebral cortex, Hubel and Wiesel discovered a unique network structure that can effectively reduce the complexity of feedback neural networks; the convolutional neural network, comprising convolutional layers, activation layers, pooling layers and fully-connected layers, was subsequently proposed. At present, CNN has become a research hotspot in many scientific fields, especially pattern classification; because the network avoids complex image preprocessing and can take the original image directly as input, it has been widely applied. FPGA acceleration in the data center has been studied extensively with good results, and the FPGA has become one of the mainstream heterogeneous acceleration platforms. The FPGA is also used for CNN acceleration, where it achieves a good acceleration effect relative to the CPU/GPU. Moreover, since the FPGA offers low latency and low power consumption, it is receiving more and more attention in the field of deep learning.
At present, although the GPU is widely used for heterogeneous acceleration of CNN, its power consumption is large. More and more research shows that the FPGA can play a great role in the field of high-performance computing, and its speed on convolution problems is fully comparable with that of the GPU. However, although the FPGA is already used for CNN acceleration, most implementations accelerate one specific network, such as AlexNet, ResNet-50 or ResNet-101; if a new network is substituted, great modification, secondary development and debugging are required, wasting a large amount of manpower and time resources.
Therefore, how to avoid secondary development and debugging of the original acceleration system when an FPGA heterogeneous acceleration system accelerates different networks is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide an FPGA heterogeneous acceleration method, device and system, so as to realize acceleration of different networks and avoid secondary development and debugging of an FPGA heterogeneous acceleration platform.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:
an FPGA heterogeneous acceleration method is based on an FPGA terminal and comprises the following steps:
acquiring data and a control instruction from a host; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
and scheduling different execution units in a bottom library to calculate the data through the control instruction, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
Wherein, the data and control instruction are obtained from the host computer terminal, including:
the first-level scheduling unit of the FPGA terminal analyzes the control instruction and calculates the access addresses of feature data and filter data in each clock period;
acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
Wherein the feature data and the filter data acquired by the first-stage scheduling unit from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The data are calculated by scheduling different execution units in a bottom library through the control instruction, and the method comprises the following steps:
the secondary scheduling unit performs resource allocation on the execution unit in the bottom library by using the hardware description file; the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit;
and calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle.
An FPGA heterogeneous acceleration device, based on an FPGA end, comprises:
the acquisition module is used for acquiring data and control instructions from the host computer; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
and the data calculation module is used for scheduling different execution units in the bottom library to calculate the data through the control instruction and sending the calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
Wherein, the acquisition module comprises a primary scheduling unit:
the first-level scheduling unit is used for analyzing the control instruction and calculating the access addresses of feature data and filter data in each clock cycle; acquiring feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring feature data from the host end by using the access address of the filter data, and caching the feature data to the primary scheduling unit.
Wherein the feature data and the filter data acquired by the first-stage scheduling unit from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The data calculation module comprises a secondary scheduling unit and a bottom library;
the secondary scheduling unit is used for carrying out resource allocation on the execution unit in the bottom library by using the hardware description file; calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle;
wherein the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit.
An FPGA heterogeneous acceleration system comprises a host terminal and an FPGA terminal;
the system comprises a host terminal, a network description file and a FPGA (field programmable gate array), wherein the host terminal is used for providing interfaces of different machine learning platforms, acquiring data and the network description file of a target machine learning platform through the interfaces and analyzing the network description file into control instructions which can be identified by the FPGA;
the FPGA end is used for acquiring data and control instructions from the host end, calculating the data by scheduling different execution units in a bottom library through the control instructions, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
The host end further receives the calculation result sent by the FPGA end, and performs subsequent data processing on the calculation result by using a softmax unit.
According to the scheme, the FPGA heterogeneous acceleration method provided by the embodiment of the invention is based on an FPGA terminal and comprises the following steps: acquiring data and a control instruction from a host; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform; and scheduling different execution units in a bottom library to calculate the data through the control instruction, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
Therefore, in the application, interfaces facing different machine learning platforms are provided at the host end so as to obtain the network description files and data of different networks, and different execution units in the underlying library can be configured through the network description files so as to realize the calculation of the data. With this acceleration mode, the FPGA heterogeneous acceleration system needs no secondary development for a new network, and recompilation of the acceleration system and its source code is avoided, thereby saving human and time resources. The invention also discloses an FPGA heterogeneous acceleration device and system, which achieve the same technical effects.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of an FPGA heterogeneous acceleration method disclosed in the embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall architecture of a heterogeneous acceleration system based on an FPGA according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an FPGA end architecture disclosed in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an FPGA heterogeneous acceleration device disclosed in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a method, a device and a system for accelerating heterogeneous FPGA (field programmable gate array) so as to accelerate different networks and avoid secondary development and debugging of an FPGA heterogeneous acceleration platform.
Referring to fig. 1, an FPGA heterogeneous acceleration method provided in an embodiment of the present invention is based on an FPGA side, and includes:
s101, acquiring data and a control command from a host; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
specifically, in this embodiment, the host provides an interface to the machine learning platform, and the interface specifically includes an interface of a mainstream machine learning framework such as TensorFlow, Caffe, MXNet, and the like. That is to say, the scheme can realize the compatibility of the scheme to different networks by providing interfaces of different machine learning platforms. Therefore, the heterogeneous acceleration system based on the FPGA disclosed by the embodiment is compatible with the flexibility of machine learning frameworks such as TensorFlow and the advantages of low power consumption and low delay of the FPGA.
Referring to fig. 2, a schematic diagram of the overall architecture of the FPGA-based heterogeneous acceleration system provided in this embodiment, the HOST side includes an upper-layer interface providing interfaces to different machine learning platforms and a HOST module, and the FPGA side includes a hardware description module and an FPGA module. It should be noted that the FPGA is a heterogeneous acceleration platform intended to accelerate the parts on which the host performs slowly, so data processing is implemented jointly by the host and the FPGA. Usually, a CNN network includes a convolution unit, an activation unit, a norm unit, a pooling unit and a softmax unit; what actually runs in the underlying library for accelerating the computation in this embodiment is the convolution unit, the activation unit, the norm unit and the pooling unit.
The work at the host end is mainly initialization. In this embodiment, a new network that needs to be accelerated is referred to as a target network; it comes from a machine learning framework such as TensorFlow, Caffe or MXNet. The upper-layer interface mainly connects to the different deep learning frameworks and reorganizes the network description file of the target network according to the requirements of the FPGA end, that is, converts the network description file into instructions which can be recognized by the FPGA, which is equivalent to the function of a compiler. The network description file can be a prototxt file, which defines the structure of the network and the information of each layer; a new network can easily be described with prototxt, so the scheme offers strong support for new networks.
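The compiler-like role of the upper-layer interface can be sketched as follows. This is a hypothetical illustration only: the simplified one-layer-per-line description format, the opcode values, and the 32-bit control-word layout are all assumptions for demonstration, not the patent's actual instruction encoding.

```python
# Illustrative sketch: converting a minimal prototxt-style network
# description into packed control words an FPGA could recognize.
# Opcodes and the word layout are assumed for demonstration.

OPCODES = {"Convolution": 0x1, "ReLU": 0x2, "LRN": 0x3, "Pooling": 0x4}

def parse_description(text):
    """Parse one layer per line, e.g. 'type: Convolution kernel: 3 stride: 1'."""
    layers = []
    for line in text.strip().splitlines():
        tokens = line.split()
        layers.append({tokens[i].rstrip(":"): tokens[i + 1]
                       for i in range(0, len(tokens), 2)})
    return layers

def compile_instructions(layers):
    """Pack each layer into a word: opcode in bits 16+, kernel size in bits 8+, stride low."""
    words = []
    for layer in layers:
        word = (OPCODES[layer["type"]] << 16) \
             | (int(layer.get("kernel", 0)) << 8) \
             | int(layer.get("stride", 0))
        words.append(word)
    return words
```

A two-layer description such as `"type: Convolution kernel: 3 stride: 1"` followed by `"type: ReLU"` would compile to two words, one per layer, which the host can then stream to the FPGA end.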
Furthermore, the feature data and the filter data that the first-stage scheduling unit of the FPGA acquires from the host end are FPGA-readable data generated by the host end by encapsulating the raw feature data and raw filter data. It can be understood that, in order to enable the FPGA to recognize the data acquired from the new network, the host needs to initialize the data, specifically, reorganize the original feature data and filter data into the format read by the FPGA.
It should be noted that the CNN data stream mainly consists of feature data and filter data, which account for more than 99.9% of the data amount; other data, such as α and μ in norm, and scale, are very small in amount and need no special processing. Reorganization can be understood as repackaging: the data are encapsulated in a structure recognizable by the FPGA. In a CNN, the convolution part generally accounts for more than 90% of the operation amount, so the convolution unit is the most important; feature data and filter data are used for convolution, and different CNNs are combinations of convolution, activation, norm and pooling modules of different sizes and types. Moreover, the host side also needs to implement buffer creation, DDR reading, kernel-side management, and so on.
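The repackaging step described above can be sketched in software. This is a hypothetical example: the vector width `VEC` and the channel-interleaved layout are assumptions chosen to illustrate the idea of reorganizing data into a format the hardware reads efficiently, not the patent's actual encapsulation format.

```python
# Illustrative sketch of host-side data reorganization: repacking a flat
# CHW feature map so that groups of VEC channels are interleaved per pixel,
# a common layout for vectorized FPGA convolution. VEC is an assumption.

VEC = 4  # assumed hardware vector width (channels consumed per cycle)

def repack_chw(feature, channels, height, width):
    """Flat CHW list -> blocks of VEC channels interleaved per pixel."""
    assert channels % VEC == 0
    out = []
    for cb in range(0, channels, VEC):           # channel blocks
        for y in range(height):
            for x in range(width):
                for c in range(cb, cb + VEC):    # VEC channels side by side
                    out.append(feature[(c * height + y) * width + x])
    return out
```

With this layout, one burst read delivers `VEC` channel values for the same pixel, matching the parallelism of the compute units.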
Wherein, the data and control instruction are obtained from the host computer terminal, including:
the first-level scheduling unit of the FPGA terminal analyzes the control instruction and calculates the access addresses of feature data and filter data in each clock period;
acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
It should be noted that the FPGA side is the core and most complex part of the whole heterogeneous acceleration system, and its design directly determines the performance of the whole architecture; the present invention therefore provides the schematic diagram of the FPGA-side architecture shown in fig. 3. The whole framework can be abstracted into 3 layers comprising 8 modules, all connected by data flow and control flow; the master controller is responsible for all scheduling work, the dark arrows in the figure are the control flow, and the light arrows are the data flow.
As can be seen from fig. 3, the 3 layers of the FPGA end are a primary scheduling unit, a secondary scheduling unit and an underlying library. The primary scheduling unit comprises a master controller and an M20K cache. After acquiring the control instruction, the master controller first parses it, then calculates the access addresses of the feature data and the filter data in each clock cycle, and, after fetching the feature data and the filter data through these access addresses, transmits them to the module scheduler in the secondary scheduling unit through the channel. It should be noted that the primary scheduling unit initially reads the feature data directly from the DDR; once the secondary scheduling unit and the M20K cache are later filled with feature data, the primary scheduling unit can read the data directly from M20K. In addition, the master controller also needs to transmit control instructions to the module scheduler for scheduling the underlying library, because the different layers CONV, RELU, NORM and POOLING are configured differently.
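The per-cycle address calculation performed by the master controller can be modeled as a generator that, for each clock cycle of a convolution, emits one feature address and one filter address. This is a sketch under assumed row-major memory layouts and a simple sliding-window scan order; the actual hardware address logic is not specified in the patent.

```python
# Illustrative model of the master controller's per-clock-cycle address
# generation for a convolution layer. Row-major layouts and the window
# scan order are assumptions for demonstration.

def conv_addresses(in_w, in_h, k, stride):
    """Yield (feature_addr, filter_addr) pairs, one per clock cycle."""
    out_w = (in_w - k) // stride + 1
    out_h = (in_h - k) // stride + 1
    for oy in range(out_h):                  # scan output positions
        for ox in range(out_w):
            for ky in range(k):              # scan the k x k window
                for kx in range(k):
                    feat = (oy * stride + ky) * in_w + (ox * stride + kx)
                    filt = ky * k + kx
                    yield feat, filt
```

For a 3x3 input with a 2x2 filter at stride 1, this produces 16 address pairs (4 output pixels times 4 window taps), one pair per cycle.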
And S102, scheduling different execution units in a bottom library to calculate the data through the control instruction, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
The data are calculated by scheduling different execution units in a bottom library through the control instruction, and the method comprises the following steps:
the secondary scheduling unit performs resource allocation on the execution unit in the bottom library by using the hardware description file; the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit; and calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle.
As can be seen from fig. 3, the secondary scheduling unit at the FPGA end includes a module scheduler, an adapter and a hardware description module. The module scheduler and the adapter are configured to receive the control instructions and data transmitted by the master controller, and the hardware description module is configured to send a hardware description file to the module scheduler and the adapter. The hardware description file is predetermined; when different networks are accelerated, the hardware description files they use may be the same or different. The hardware description file is mainly used for allocating resources to the execution units in the underlying library, specifying the hardware resource allocation of each underlying library, such as the number of convolution units and the processing capacity of the POOLING unit.
During operation, the secondary scheduling unit receives control instructions and data from the master controller in each clock cycle. The data are the feature and filter data; the control instructions are complex, covering, for example, whether activation is needed, the activation function type (sigmoid, relu), whether pooling is needed, and the pooling type (max pooling, average pooling). The underlying library includes minimum execution units such as CONV, RELU, NORM and POOLING; these minimum execution units are relatively comprehensive and can cover common networks such as AlexNet, GoogleNet, VGG and ResNet. In principle, RTL should be used for development so that the resources are controllable.
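The decode-and-dispatch behavior of the module scheduler can be sketched in software. This is a hypothetical model: the control-word bit layout and the Python stand-ins for the RELU/POOLING hardware units are illustrative assumptions, not the patent's actual control encoding.

```python
# Illustrative sketch: decoding a per-layer control word and dispatching
# to software stand-ins for the minimum execution units. The bit fields
# (activation in bits [1:0], pooling in bits [3:2]) are assumed.

import math

def relu(xs):
    return [max(0.0, x) for x in xs]

def sigmoid(xs):
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

ACTIVATIONS = {0: None, 1: relu, 2: sigmoid}
POOLINGS = {0: None,
            1: lambda xs: [max(xs)],            # max pooling
            2: lambda xs: [sum(xs) / len(xs)]}  # average pooling

def dispatch(ctrl, data):
    """Apply the activation and pooling selected by the control word."""
    act = ACTIVATIONS[ctrl & 0x3]
    pool = POOLINGS[(ctrl >> 2) & 0x3]
    if act is not None:
        data = act(data)
    if pool is not None:
        data = pool(data)
    return data
```

For instance, a control word selecting relu plus max pooling (here `0b0101`) would turn the window `[-1.0, 2.0]` into `[2.0]`.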
It should be noted that the data interaction between the host side and the FPGA side not only includes the host side transmitting the initialized feature and filter data and the parsed control flow (i.e. the control instructions) to the FPGA side through PCIe, but also includes: after the FPGA side schedules the different execution units in the underlying library through the control instructions to calculate the data, the calculation result is sent back to the host side through PCIe so that the host side can perform subsequent data processing on it. The subsequent data processing specifically refers to applying softmax to the calculated data through the softmax module, together with work such as recognition-rate calculation and performance analysis.
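The host-side softmax step applied to the calculation result returned over PCIe can be written as a short, numerically stable routine:

```python
# Host-side post-processing: numerically stable softmax over the logits
# returned from the FPGA, turning them into class probabilities.

import math

def softmax(logits):
    """Return probabilities that sum to 1, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum logit before exponentiating avoids overflow for large values without changing the result.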
In conclusion, the scheme provides interfaces facing different machine learning platforms at the host end so as to acquire the network description files and data of different networks, and the different execution units in the underlying library can be configured through the network description files so as to realize the calculation of the data. With this acceleration mode, the FPGA heterogeneous acceleration system can support a new network without secondary development, and recompilation of the acceleration system and its source code is avoided, thereby saving human and time resources.
For example, if the target network is ResNet-50, operating it under the framework provided by this embodiment is simple: only one prototxt network description file is needed. The prototxt is parsed and converted into instructions at the host side and transmitted to the FPGA side; the data (mainly features and filters) are initialized and transmitted to the FPGA side; and FPGA operation can then be performed with these instructions and data.
In the following, the FPGA heterogeneous acceleration apparatus provided by the embodiment of the present invention is introduced, and the FPGA heterogeneous acceleration apparatus described below and the FPGA heterogeneous acceleration method described above may refer to each other.
Referring to fig. 4, an FPGA heterogeneous acceleration device provided in an embodiment of the present invention is based on an FPGA side, and includes:
the acquisition module is used for acquiring data and control instructions from the host computer; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
and the data calculation module 200 is configured to schedule different execution units in the underlying library to calculate the data through the control instruction, and send a calculation result to the host end, so as to perform subsequent data processing on the calculation result through the host end.
Wherein, the acquisition module comprises a primary scheduling unit:
the first-level scheduling unit is used for parsing the control instruction and calculating the access addresses of the feature data and the filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
Wherein the feature data and the filter data acquired by the first-stage scheduling unit from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The data calculation module comprises a secondary scheduling unit and a bottom library;
the secondary scheduling unit is used for carrying out resource allocation on the execution unit in the bottom library by using the hardware description file; calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle;
wherein the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit.
The FPGA heterogeneous acceleration system provided by the embodiment of the invention comprises a host terminal and an FPGA terminal;
wherein the host end is used for providing interfaces of different machine learning platforms, acquiring the data and the network description file of the target machine learning platform through the interfaces, and parsing the network description file into control instructions which can be identified by the FPGA;
the FPGA end is used for acquiring data and control instructions from the host end, calculating the data by scheduling different execution units in a bottom library through the control instructions, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
The host end further receives the calculation result sent by the FPGA end, and performs subsequent data processing on the calculation result by using a softmax unit.
The FPGA end is specifically used for parsing the control instruction and calculating the access addresses of the feature data and the filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to the secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
Wherein the feature data and the filter data acquired by the FPGA end from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The FPGA end is specifically used for performing resource configuration on the execution units in the bottom library by using the hardware description file, and for calling the execution units configured in the bottom library to calculate the feature data and the filter data according to the control instruction sent by the primary scheduling unit in each clock cycle; the execution units comprise at least one of a convolution unit, an activation unit, a norm unit and a pooling unit.
The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An FPGA heterogeneous acceleration method, characterized in that, based on an FPGA end, the method comprises:
acquiring data and a control instruction from a host end; wherein the control instruction is a control instruction recognizable by the FPGA, generated after the host end parses a network description file; the host end provides interfaces to different machine learning platforms, and the data and the network description file are acquired by the host end through the interface of a target machine learning platform;
scheduling different execution units in a bottom library through the control instruction to calculate the data, and sending the calculation result to the host end so that the host end performs subsequent data processing on the calculation result;
wherein acquiring the data and the control instruction from the host end comprises:
a primary scheduling unit of the FPGA end parsing the control instruction and calculating the access addresses of feature data and filter data in each clock cycle;
acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
2. The FPGA heterogeneous acceleration method according to claim 1, wherein the primary scheduling unit encapsulates the original feature data and the original filter data acquired from the host end to generate FPGA-readable data.
3. The FPGA heterogeneous acceleration method according to claim 2, wherein scheduling different execution units in the bottom library through the control instruction to calculate the data comprises:
the secondary scheduling unit performing resource configuration on the execution units in the bottom library by using a hardware description file, wherein the execution units comprise at least one of a convolution unit, an activation unit, a norm unit and a pooling unit;
and calling the execution units configured in the bottom library to calculate the feature data and the filter data according to the control instruction sent by the primary scheduling unit in each clock cycle.
4. An FPGA heterogeneous acceleration device, characterized in that, based on an FPGA end, the device comprises:
an acquisition module, used for acquiring data and a control instruction from a host end; wherein the control instruction is a control instruction recognizable by the FPGA, generated after the host end parses a network description file; the host end provides interfaces to different machine learning platforms, and the data and the network description file are acquired by the host end through the interface of a target machine learning platform;
a data calculation module, used for scheduling different execution units in a bottom library through the control instruction to calculate the data, and sending the calculation result to the host end so that the host end performs subsequent data processing on the calculation result;
wherein the acquisition module comprises a primary scheduling unit:
the primary scheduling unit is used for parsing the control instruction and calculating the access addresses of feature data and filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
5. The FPGA heterogeneous acceleration device according to claim 4, wherein the primary scheduling unit encapsulates the original feature data and the original filter data acquired from the host end to generate FPGA-readable data.
6. The FPGA heterogeneous acceleration device according to claim 5, wherein the data calculation module comprises a secondary scheduling unit and a bottom library;
the secondary scheduling unit is used for performing resource configuration on the execution units in the bottom library by using a hardware description file, and for calling the execution units configured in the bottom library to calculate the feature data and the filter data according to the control instruction sent by the primary scheduling unit in each clock cycle;
wherein the execution units comprise at least one of a convolution unit, an activation unit, a norm unit and a pooling unit.
7. An FPGA heterogeneous acceleration system, characterized by comprising a host end and an FPGA end;
the host end is used for providing interfaces to different machine learning platforms, acquiring data and a network description file of a target machine learning platform through the corresponding interface, and parsing the network description file into control instructions recognizable by the FPGA;
the FPGA end is used for acquiring the data and the control instructions from the host end, scheduling different execution units in a bottom library through the control instructions to calculate the data, and sending the calculation result to the host end so that the host end performs subsequent data processing on the calculation result;
wherein the FPGA end is further used for: parsing the control instruction through a primary scheduling unit of the FPGA end, and calculating the access addresses of feature data and filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
8. The FPGA heterogeneous acceleration system according to claim 7, wherein the host end receives the calculation result sent by the FPGA end and performs subsequent data processing on the calculation result by using a softmax unit.
CN201810635674.7A 2018-06-20 2018-06-20 FPGA heterogeneous acceleration method, device and system Active CN108921289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810635674.7A CN108921289B (en) 2018-06-20 2018-06-20 FPGA heterogeneous acceleration method, device and system


Publications (2)

Publication Number Publication Date
CN108921289A CN108921289A (en) 2018-11-30
CN108921289B true CN108921289B (en) 2021-10-29

Family

ID=64422044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810635674.7A Active CN108921289B (en) 2018-06-20 2018-06-20 FPGA heterogeneous acceleration method, device and system

Country Status (1)

Country Link
CN (1) CN108921289B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021031137A1 (en) * 2019-08-21 2021-02-25 深圳鲲云信息科技有限公司 Artificial intelligence application development system, computer device and storage medium
CN112685159B (en) * 2020-12-30 2022-11-29 深圳致星科技有限公司 Federal learning calculation task processing scheme based on FPGA heterogeneous processing system
CN112732638B (en) * 2021-01-22 2022-05-06 上海交通大学 Heterogeneous acceleration system and method based on CTPN network

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106020425A (en) * 2016-05-27 2016-10-12 浪潮(北京)电子信息产业有限公司 FPGA heterogeneous acceleration calculating system
CN106776466A (en) * 2016-11-30 2017-05-31 郑州云海信息技术有限公司 A kind of FPGA isomeries speed-up computation apparatus and system
CN107103113A (en) * 2017-03-23 2017-08-29 中国科学院计算技术研究所 Towards the Automation Design method, device and the optimization method of neural network processor
CN107425957A (en) * 2017-08-31 2017-12-01 郑州云海信息技术有限公司 A kind of cryptographic attack method, apparatus and isomery accelerate platform
CN107491317A (en) * 2017-10-10 2017-12-19 郑州云海信息技术有限公司 A kind of symmetrical encryption and decryption method and systems of AES for accelerating platform based on isomery

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9891935B2 (en) * 2015-08-13 2018-02-13 Altera Corporation Application-based dynamic heterogeneous many-core systems and methods


Non-Patent Citations (2)

Title
Acceleration of RSA Processes based on Hybrid ARM-FPGA Cluster;Xu Bai等;《2017 IEEE Symposium on Computers and Communications (ISCC) 》;20170904;第1-7页 *
Research on OpenCL Architecture and Performance Optimization of FPGA-based Heterogeneous Accelerator Card; Zhao Hehui; Computer Engineering Application Technology; 20180615 (Issue 11); pp. 30-32 *

Also Published As

Publication number Publication date
CN108921289A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN109376843B (en) FPGA-based electroencephalogram signal rapid classification method, implementation method and device
Chen et al. Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs
Otterness et al. An evaluation of the NVIDIA TX1 for supporting real-time computer-vision workloads
CN108268943B (en) Hardware accelerator engine
JP2020537784A (en) Machine learning runtime library for neural network acceleration
CN108921289B (en) FPGA heterogeneous acceleration method, device and system
EP3766018A1 (en) Hardware accelerated neural network subgraphs
CN108268940B (en) Tool for creating reconfigurable interconnect frameworks
CN106325967B (en) A kind of hardware-accelerated method, compiler and equipment
JP2020537789A (en) Static block scheduling in massively parallel software-defined hardware systems
CN106951926A (en) The deep learning systems approach and device of a kind of mixed architecture
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
CN110751676A (en) Heterogeneous computing system and method based on target detection and readable storage medium
CN106780149A (en) A kind of equipment real-time monitoring system based on timed task scheduling
CN205486304U (en) Portable realtime graphic object detection of low -power consumption and tracking means
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
CN111814959A (en) Model training data processing method, device and system and storage medium
Cheng et al. Accelerating end-to-end deep learning workflow with codesign of data preprocessing and scheduling
CN115146582A (en) Simulation method, simulation device, electronic apparatus, and computer-readable storage medium
US11461662B1 (en) Compilation time reduction for memory and compute bound neural networks
CN108549935A (en) A kind of device and method for realizing neural network model
Fakhi et al. New optimized GPU version of the k-means algorithm for large-sized image segmentation
WO2023123266A1 (en) Subgraph compilation method, subgraph execution method and related device
Cui et al. Design and Implementation of OpenCL-Based FPGA Accelerator for YOLOv2
CN110633493A (en) OpenCL transaction data processing method based on Intel FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant