CN108921289B - FPGA heterogeneous acceleration method, device and system - Google Patents


Info

Publication number
CN108921289B
CN108921289B
Authority
CN
China
Prior art keywords
data
fpga
host
unit
feature data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810635674.7A
Other languages
Chinese (zh)
Other versions
CN108921289A (en)
Inventor
张新
李龙
赵雅倩
陈继承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810635674.7A priority Critical patent/CN108921289B/en
Publication of CN108921289A publication Critical patent/CN108921289A/en
Application granted granted Critical
Publication of CN108921289B publication Critical patent/CN108921289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an FPGA heterogeneous acceleration method, device and system. In this scheme, the host end provides interfaces facing different machine learning platforms so as to obtain the network description files and data of different networks, and the network description files are used to configure different execution units in the underlying library at the FPGA end so as to realize the calculation of the data. With this acceleration mode, the FPGA heterogeneous acceleration system can support a new network without secondary development, and recompilation of the acceleration system and its source code is avoided, thereby saving human and time resources.

Description

FPGA heterogeneous acceleration method, device and system
Technical Field
The invention relates to the technical field of heterogeneous acceleration, and in particular to an FPGA (Field Programmable Gate Array) heterogeneous acceleration method, device and system.
Background
The Convolutional Neural Network (CNN) is an efficient recognition method developed in recent years that has attracted wide attention. While studying the neurons responsible for local sensitivity and direction selection in the feline cerebral cortex, Hubel and Wiesel discovered a unique network structure that can effectively reduce the complexity of feedback neural networks; the convolutional neural network, comprising convolutional layers, activation layers, pooling layers and fully-connected layers, was subsequently proposed. At present, CNN has become a research hotspot in many scientific fields, especially pattern classification; because the network avoids complex image preprocessing and can take the original image directly as input, it has been widely applied. FPGA acceleration in the data center has been studied extensively with good results, and the FPGA has become one of the mainstream heterogeneous acceleration platforms. The FPGA is also used for CNN acceleration, where it achieves a good acceleration effect relative to the CPU/GPU. Moreover, since the FPGA offers low latency and low power consumption, it is receiving more and more attention in the field of deep learning.
At present, although the GPU is widely used for heterogeneous acceleration of CNN, its power consumption is large. More and more research shows that the FPGA can play a great role in the field of high-performance computing, and its speed on convolution problems is fully comparable with that of the GPU. However, although the FPGA is already used for CNN acceleration, most implementations accelerate one specific network, such as AlexNet, ResNet-50 or ResNet-101; if a new network is substituted, great modification, secondary development and debugging are required, wasting a large amount of manpower and time resources.
Therefore, how to avoid secondary development and debugging of the original acceleration system when an FPGA heterogeneous acceleration system accelerates different networks is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide an FPGA heterogeneous acceleration method, device and system, so as to realize acceleration of different networks and avoid secondary development and debugging of an FPGA heterogeneous acceleration platform.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:
an FPGA heterogeneous acceleration method is based on an FPGA terminal and comprises the following steps:
acquiring data and a control instruction from a host; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
and scheduling different execution units in a bottom library to calculate the data through the control instruction, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
Wherein, the data and control instruction are obtained from the host computer terminal, including:
the first-level scheduling unit of the FPGA terminal analyzes the control instruction and calculates the access addresses of feature data and filter data in each clock period;
acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
Wherein the feature data and the filter data acquired by the first-stage scheduling unit from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The data are calculated by scheduling different execution units in a bottom library through the control instruction, and the method comprises the following steps:
the secondary scheduling unit performs resource allocation on the execution unit in the bottom library by using the hardware description file; the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit;
and calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle.
An FPGA heterogeneous acceleration device, based on an FPGA end, comprises:
the acquisition module is used for acquiring data and control instructions from the host computer; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
and the data calculation module is used for scheduling different execution units in the bottom library to calculate the data through the control instruction and sending the calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
Wherein, the acquisition module comprises a primary scheduling unit:
the first-level scheduling unit is used for analyzing the control instruction and calculating the access addresses of feature data and filter data in each clock cycle; acquiring feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring feature data from the host end by using the access address of the filter data, and caching the feature data to the primary scheduling unit.
Wherein the feature data and the filter data acquired by the first-stage scheduling unit from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The data calculation module comprises a secondary scheduling unit and a bottom library;
the secondary scheduling unit is used for carrying out resource allocation on the execution unit in the bottom library by using the hardware description file; calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle;
wherein the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit.
An FPGA heterogeneous acceleration system comprises a host terminal and an FPGA terminal;
the system comprises a host terminal, a network description file and a FPGA (field programmable gate array), wherein the host terminal is used for providing interfaces of different machine learning platforms, acquiring data and the network description file of a target machine learning platform through the interfaces and analyzing the network description file into control instructions which can be identified by the FPGA;
the FPGA end is used for acquiring data and control instructions from the host end, calculating the data by scheduling different execution units in a bottom library through the control instructions, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
The host end further receives the calculation result sent by the FPGA end, and performs subsequent data processing on the calculation result by using a softmax unit.
According to the scheme, the FPGA heterogeneous acceleration method provided by the embodiment of the invention is based on an FPGA terminal and comprises the following steps: acquiring data and a control instruction from a host; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform; and scheduling different execution units in a bottom library to calculate the data through the control instruction, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
Therefore, in the application, interfaces facing different machine learning platforms are provided at the host end so as to obtain the network description files and data of different networks, and different execution units in the underlying library can be configured through the network description files so as to realize the calculation of the data. With this acceleration mode, the FPGA heterogeneous acceleration system needs no secondary development for a new network, and recompilation of the acceleration system and its source code is avoided, thereby saving human and time resources. The invention also discloses an FPGA heterogeneous acceleration device and system, which achieve the same technical effects.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of an FPGA heterogeneous acceleration method disclosed in the embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall architecture of a heterogeneous acceleration system based on an FPGA according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an FPGA end architecture disclosed in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an FPGA heterogeneous acceleration device disclosed in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a method, a device and a system for accelerating heterogeneous FPGA (field programmable gate array) so as to accelerate different networks and avoid secondary development and debugging of an FPGA heterogeneous acceleration platform.
Referring to fig. 1, an FPGA heterogeneous acceleration method provided in an embodiment of the present invention is based on an FPGA side, and includes:
s101, acquiring data and a control command from a host; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
specifically, in this embodiment, the host provides an interface to the machine learning platform, and the interface specifically includes an interface of a mainstream machine learning framework such as TensorFlow, Caffe, MXNet, and the like. That is to say, the scheme can realize the compatibility of the scheme to different networks by providing interfaces of different machine learning platforms. Therefore, the heterogeneous acceleration system based on the FPGA disclosed by the embodiment is compatible with the flexibility of machine learning frameworks such as TensorFlow and the advantages of low power consumption and low delay of the FPGA.
Referring to fig. 2, a schematic diagram of the overall architecture of the FPGA-based heterogeneous acceleration system provided in this embodiment, the HOST side includes an upper-layer interface providing interfaces to different machine learning platforms and a HOST module, and the FPGA side includes a hardware description module and an FPGA module. It should be noted that the FPGA is a heterogeneous acceleration platform intended to accelerate the parts on which the host performs slowly, so data processing is implemented jointly by the host and the FPGA. Usually, a CNN network includes a convolution unit, an activation unit, a norm unit, a pooling unit and a softmax unit; what actually runs in the underlying library for accelerating the computation in this embodiment is the convolution unit, the activation unit, the norm unit and the pooling unit.
The work at the host end is mainly initialization. In this embodiment, a new network that needs to be accelerated is referred to as a target network; it comes from a machine learning framework such as TensorFlow, Caffe or MXNet. The upper-layer interface mainly connects to the different deep learning frameworks and reorganizes the network description file of the target network according to the requirements of the FPGA end, that is, converts the network description file into instructions which can be recognized by the FPGA, which is equivalent to the function of a compiler. The network description file can be a prototxt file, which defines the structure of the network and the information of each layer; a new network can easily be described with prototxt, so the scheme offers strong support for new networks.
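The compiler-like role of the upper-layer interface can be sketched as follows. This is a hypothetical illustration only: the simplified one-layer-per-line description format, the opcode values, and the 32-bit control-word layout are all assumptions for demonstration, not the patent's actual instruction encoding.

```python
# Illustrative sketch: converting a minimal prototxt-style network
# description into packed control words an FPGA could recognize.
# Opcodes and the word layout are assumed for demonstration.

OPCODES = {"Convolution": 0x1, "ReLU": 0x2, "LRN": 0x3, "Pooling": 0x4}

def parse_description(text):
    """Parse one layer per line, e.g. 'type: Convolution kernel: 3 stride: 1'."""
    layers = []
    for line in text.strip().splitlines():
        tokens = line.split()
        layers.append({tokens[i].rstrip(":"): tokens[i + 1]
                       for i in range(0, len(tokens), 2)})
    return layers

def compile_instructions(layers):
    """Pack each layer into a word: opcode in bits 16+, kernel size in bits 8+, stride low."""
    words = []
    for layer in layers:
        word = (OPCODES[layer["type"]] << 16) \
             | (int(layer.get("kernel", 0)) << 8) \
             | int(layer.get("stride", 0))
        words.append(word)
    return words
```

A two-layer description such as `"type: Convolution kernel: 3 stride: 1"` followed by `"type: ReLU"` would compile to two words, one per layer, which the host can then stream to the FPGA end.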
Furthermore, the feature data and the filter data that the first-stage scheduling unit of the FPGA acquires from the host end are FPGA-readable data generated by the host end by encapsulating the raw feature data and raw filter data. It can be understood that, in order to enable the FPGA to recognize the data acquired from the new network, the host needs to initialize the data, specifically, reorganize the original feature data and filter data into the format read by the FPGA.
It should be noted that the CNN data stream mainly consists of feature data and filter data, which account for more than 99.9% of the data amount; other data, such as α and μ in norm, and scale, are very small in amount and need no special processing. Reorganization can be understood as repackaging: the data are encapsulated in a structure recognizable by the FPGA. In a CNN, the convolution part generally accounts for more than 90% of the operation amount, so the convolution unit is the most important; feature data and filter data are used for convolution, and different CNNs are combinations of convolution, activation, norm and pooling modules of different sizes and types. Moreover, the host side also needs to implement buffer creation, DDR reading, kernel-side management, and so on.
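The repackaging step described above can be sketched in software. This is a hypothetical example: the vector width `VEC` and the channel-interleaved layout are assumptions chosen to illustrate the idea of reorganizing data into a format the hardware reads efficiently, not the patent's actual encapsulation format.

```python
# Illustrative sketch of host-side data reorganization: repacking a flat
# CHW feature map so that groups of VEC channels are interleaved per pixel,
# a common layout for vectorized FPGA convolution. VEC is an assumption.

VEC = 4  # assumed hardware vector width (channels consumed per cycle)

def repack_chw(feature, channels, height, width):
    """Flat CHW list -> blocks of VEC channels interleaved per pixel."""
    assert channels % VEC == 0
    out = []
    for cb in range(0, channels, VEC):           # channel blocks
        for y in range(height):
            for x in range(width):
                for c in range(cb, cb + VEC):    # VEC channels side by side
                    out.append(feature[(c * height + y) * width + x])
    return out
```

With this layout, one burst read delivers `VEC` channel values for the same pixel, matching the parallelism of the compute units.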
Wherein, the data and control instruction are obtained from the host computer terminal, including:
the first-level scheduling unit of the FPGA terminal analyzes the control instruction and calculates the access addresses of feature data and filter data in each clock period;
acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
It should be noted that the FPGA side is the core and most complex part of the whole heterogeneous acceleration system, and its design directly determines the performance of the whole architecture; the present invention therefore provides the schematic diagram of the FPGA-side architecture shown in fig. 3. The whole framework can be abstracted into 3 layers comprising 8 modules, all connected by data flow and control flow; the master controller is responsible for all scheduling work, the dark arrows in the figure are the control flow, and the light arrows are the data flow.
As can be seen from fig. 3, the 3 layers of the FPGA end are a primary scheduling unit, a secondary scheduling unit and an underlying library. The primary scheduling unit comprises a master controller and an M20K cache. After acquiring the control instruction, the master controller first parses it, then calculates the access addresses of the feature data and the filter data in each clock cycle, and, after fetching the feature data and the filter data through these access addresses, transmits them to the module scheduler in the secondary scheduling unit through the channel. It should be noted that the primary scheduling unit initially reads the feature data directly from the DDR; once the secondary scheduling unit and the M20K cache are later filled with feature data, the primary scheduling unit can read the data directly from M20K. In addition, the master controller also needs to transmit control instructions to the module scheduler for scheduling the underlying library, because the different layers CONV, RELU, NORM and POOLING are configured differently.
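The per-cycle address calculation performed by the master controller can be modeled as a generator that, for each clock cycle of a convolution, emits one feature address and one filter address. This is a sketch under assumed row-major memory layouts and a simple sliding-window scan order; the actual hardware address logic is not specified in the patent.

```python
# Illustrative model of the master controller's per-clock-cycle address
# generation for a convolution layer. Row-major layouts and the window
# scan order are assumptions for demonstration.

def conv_addresses(in_w, in_h, k, stride):
    """Yield (feature_addr, filter_addr) pairs, one per clock cycle."""
    out_w = (in_w - k) // stride + 1
    out_h = (in_h - k) // stride + 1
    for oy in range(out_h):                  # scan output positions
        for ox in range(out_w):
            for ky in range(k):              # scan the k x k window
                for kx in range(k):
                    feat = (oy * stride + ky) * in_w + (ox * stride + kx)
                    filt = ky * k + kx
                    yield feat, filt
```

For a 3x3 input with a 2x2 filter at stride 1, this produces 16 address pairs (4 output pixels times 4 window taps), one pair per cycle.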
And S102, scheduling different execution units in a bottom library to calculate the data through the control instruction, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
The data are calculated by scheduling different execution units in a bottom library through the control instruction, and the method comprises the following steps:
the secondary scheduling unit performs resource allocation on the execution unit in the bottom library by using the hardware description file; the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit; and calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle.
As can be seen from fig. 3, the secondary scheduling unit at the FPGA end includes a module scheduler, an adapter and a hardware description module. The module scheduler and the adapter are configured to receive the control instructions and data transmitted by the master controller, and the hardware description module is configured to send a hardware description file to the module scheduler and the adapter. The hardware description file is predetermined; when different networks are accelerated, the hardware description files they use may be the same or different. The hardware description file is mainly used for allocating resources to the execution units in the underlying library, specifying the hardware resource allocation of each underlying library, such as the number of convolution units and the processing capacity of the POOLING unit.
During operation, the secondary scheduling unit receives control instructions and data from the master controller in each clock cycle. The data are the feature and filter data; the control instructions are complex, covering, for example, whether activation is needed, the activation function type (sigmoid, relu), whether pooling is needed, and the pooling type (max pooling, average pooling). The underlying library includes minimum execution units such as CONV, RELU, NORM and POOLING; these minimum execution units are relatively comprehensive and can cover common networks such as AlexNet, GoogleNet, VGG and ResNet. In principle, RTL should be used for development so that the resources are controllable.
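The decode-and-dispatch behavior of the module scheduler can be sketched in software. This is a hypothetical model: the control-word bit layout and the Python stand-ins for the RELU/POOLING hardware units are illustrative assumptions, not the patent's actual control encoding.

```python
# Illustrative sketch: decoding a per-layer control word and dispatching
# to software stand-ins for the minimum execution units. The bit fields
# (activation in bits [1:0], pooling in bits [3:2]) are assumed.

import math

def relu(xs):
    return [max(0.0, x) for x in xs]

def sigmoid(xs):
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

ACTIVATIONS = {0: None, 1: relu, 2: sigmoid}
POOLINGS = {0: None,
            1: lambda xs: [max(xs)],            # max pooling
            2: lambda xs: [sum(xs) / len(xs)]}  # average pooling

def dispatch(ctrl, data):
    """Apply the activation and pooling selected by the control word."""
    act = ACTIVATIONS[ctrl & 0x3]
    pool = POOLINGS[(ctrl >> 2) & 0x3]
    if act is not None:
        data = act(data)
    if pool is not None:
        data = pool(data)
    return data
```

For instance, a control word selecting relu plus max pooling (here `0b0101`) would turn the window `[-1.0, 2.0]` into `[2.0]`.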
It should be noted that the data interaction between the host side and the FPGA side not only includes the host side transmitting the initialized feature and filter data and the parsed control flow (i.e. the control instructions) to the FPGA side through PCIe, but also includes: after the FPGA side schedules the different execution units in the underlying library through the control instructions to calculate the data, the calculation result is sent back to the host side through PCIe so that the host side can perform subsequent data processing on it. The subsequent data processing specifically refers to applying softmax to the calculated data through the softmax module, together with work such as recognition-rate calculation and performance analysis.
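The host-side softmax step applied to the calculation result returned over PCIe can be written as a short, numerically stable routine:

```python
# Host-side post-processing: numerically stable softmax over the logits
# returned from the FPGA, turning them into class probabilities.

import math

def softmax(logits):
    """Return probabilities that sum to 1, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum logit before exponentiating avoids overflow for large values without changing the result.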
In conclusion, the scheme provides interfaces facing different machine learning platforms at the host end so as to acquire the network description files and data of different networks, and the different execution units in the underlying library can be configured through the network description files so as to realize the calculation of the data. With this acceleration mode, the FPGA heterogeneous acceleration system can support a new network without secondary development, and recompilation of the acceleration system and its source code is avoided, thereby saving human and time resources.
For example, if the target network is ResNet-50, operating it under the framework provided by this embodiment is simple: only one prototxt network description file is needed. The prototxt is parsed and converted into instructions at the host side and transmitted to the FPGA side; the data (mainly features and filters) are initialized and transmitted to the FPGA side; and FPGA operation can then be performed with these instructions and data.
In the following, the FPGA heterogeneous acceleration apparatus provided by the embodiment of the present invention is introduced, and the FPGA heterogeneous acceleration apparatus described below and the FPGA heterogeneous acceleration method described above may refer to each other.
Referring to fig. 4, an FPGA heterogeneous acceleration device provided in an embodiment of the present invention is based on an FPGA side, and includes:
the acquisition module is used for acquiring data and control instructions from the host computer; the control instruction is a control instruction which can be identified by the FPGA and is generated after the host terminal analyzes the network description file; the host side provides interfaces of different machine learning platforms, and the data and the network description file are acquired by the host side through the interface of the target machine learning platform;
and the data calculation module 200 is configured to schedule different execution units in the underlying library to calculate the data through the control instruction, and send a calculation result to the host end, so as to perform subsequent data processing on the calculation result through the host end.
Wherein, the acquisition module comprises a primary scheduling unit:
the first-level scheduling unit is used for parsing the control instruction and calculating the access addresses of the feature data and the filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
Wherein the feature data and the filter data acquired by the first-stage scheduling unit from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The data calculation module comprises a secondary scheduling unit and a bottom library;
the secondary scheduling unit is used for carrying out resource allocation on the execution unit in the bottom library by using the hardware description file; calling an execution unit configured in a bottom library to calculate feature data and filter data by using a control instruction sent by the primary scheduling unit in each clock cycle;
wherein the execution unit comprises at least one execution unit of a convolution unit, an activation unit, a norm unit and a pooling unit.
The FPGA heterogeneous acceleration system provided by the embodiment of the invention comprises a host terminal and an FPGA terminal;
wherein the host end is used for providing interfaces of different machine learning platforms, acquiring the data and the network description file of the target machine learning platform through the interfaces, and parsing the network description file into control instructions which can be identified by the FPGA;
the FPGA end is used for acquiring data and control instructions from the host end, calculating the data by scheduling different execution units in a bottom library through the control instructions, and sending a calculation result to the host end so as to perform subsequent data processing on the calculation result through the host end.
The host end further receives the calculation result sent by the FPGA end, and performs subsequent data processing on the calculation result by using a softmax unit.
The FPGA end is specifically used for parsing the control instruction and calculating the access addresses of the feature data and the filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to the secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
Wherein the feature data and the filter data acquired by the FPGA end from the host end are FPGA-readable data generated by the host end by encapsulating the original feature data and the original filter data.
The FPGA end is specifically used for performing resource configuration on the execution units in the bottom library by using the hardware description file, and for calling the execution units configured in the bottom library to calculate the feature data and the filter data according to the control instruction sent by the primary scheduling unit in each clock cycle; the execution units comprise at least one of a convolution unit, an activation unit, a norm unit and a pooling unit.
The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An FPGA heterogeneous acceleration method, characterized in that, based on an FPGA end, the method comprises:
acquiring data and a control instruction from a host end; wherein the control instruction is a control instruction recognizable by the FPGA, generated after the host end parses a network description file; the host end provides interfaces to different machine learning platforms, and the data and the network description file are acquired by the host end through the interface of a target machine learning platform;
scheduling different execution units in a bottom library through the control instruction to calculate the data, and sending the calculation result to the host end so that the host end performs subsequent data processing on the calculation result;
wherein acquiring the data and the control instruction from the host end comprises:
a primary scheduling unit of the FPGA end parsing the control instruction and calculating the access addresses of feature data and filter data in each clock cycle;
acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
2. The FPGA heterogeneous acceleration method according to claim 1, wherein the primary scheduling unit encapsulates the original feature data and the original filter data acquired from the host end to generate FPGA-readable data.
3. The FPGA heterogeneous acceleration method according to claim 2, wherein scheduling different execution units in the bottom library through the control instruction to calculate the data comprises:
the secondary scheduling unit performing resource configuration on the execution units in the bottom library by using a hardware description file, wherein the execution units comprise at least one of a convolution unit, an activation unit, a norm unit and a pooling unit;
and calling the execution units configured in the bottom library to calculate the feature data and the filter data according to the control instruction sent by the primary scheduling unit in each clock cycle.
4. An FPGA heterogeneous acceleration device, characterized in that, based on an FPGA end, the device comprises:
an acquisition module, used for acquiring data and a control instruction from a host end; wherein the control instruction is a control instruction recognizable by the FPGA, generated after the host end parses a network description file; the host end provides interfaces to different machine learning platforms, and the data and the network description file are acquired by the host end through the interface of a target machine learning platform;
a data calculation module, used for scheduling different execution units in a bottom library through the control instruction to calculate the data, and sending the calculation result to the host end so that the host end performs subsequent data processing on the calculation result;
wherein the acquisition module comprises a primary scheduling unit:
the primary scheduling unit is used for parsing the control instruction and calculating the access addresses of feature data and filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
5. The FPGA heterogeneous acceleration device according to claim 4, wherein the primary scheduling unit encapsulates the original feature data and the original filter data acquired from the host end to generate FPGA-readable data.
6. The FPGA heterogeneous acceleration device according to claim 5, wherein the data calculation module comprises a secondary scheduling unit and a bottom library;
the secondary scheduling unit is used for performing resource configuration on the execution units in the bottom library by using a hardware description file, and for calling the execution units configured in the bottom library to calculate the feature data and the filter data according to the control instruction sent by the primary scheduling unit in each clock cycle;
wherein the execution units comprise at least one of a convolution unit, an activation unit, a norm unit and a pooling unit.
7. An FPGA heterogeneous acceleration system, characterized by comprising a host end and an FPGA end;
the host end is used for providing interfaces to different machine learning platforms, acquiring data and a network description file of a target machine learning platform through the corresponding interface, and parsing the network description file into control instructions recognizable by the FPGA;
the FPGA end is used for acquiring the data and the control instructions from the host end, scheduling different execution units in a bottom library through the control instructions to calculate the data, and sending the calculation result to the host end so that the host end performs subsequent data processing on the calculation result;
wherein the FPGA end is further used for: parsing the control instruction through a primary scheduling unit of the FPGA end, and calculating the access addresses of feature data and filter data in each clock cycle; acquiring the feature data from the host end by using the access address of the feature data, and sending the feature data and the control instruction to a secondary scheduling unit of the FPGA end; and acquiring the filter data from the host end by using the access address of the filter data, and caching the filter data in the primary scheduling unit.
8. The FPGA heterogeneous acceleration system according to claim 7, wherein the host end receives the calculation result sent by the FPGA end and performs subsequent data processing on the calculation result by using a softmax unit.
CN201810635674.7A 2018-06-20 2018-06-20 FPGA heterogeneous acceleration method, device and system Active CN108921289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810635674.7A CN108921289B (en) 2018-06-20 2018-06-20 FPGA heterogeneous acceleration method, device and system


Publications (2)

Publication Number Publication Date
CN108921289A CN108921289A (en) 2018-11-30
CN108921289B true CN108921289B (en) 2021-10-29

Family

ID=64422044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810635674.7A Active CN108921289B (en) 2018-06-20 2018-06-20 FPGA heterogeneous acceleration method, device and system

Country Status (1)

Country Link
CN (1) CN108921289B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021031137A1 (en) * 2019-08-21 2021-02-25 深圳鲲云信息科技有限公司 Artificial intelligence application development system, computer device and storage medium
CN112685159B (en) * 2020-12-30 2022-11-29 深圳致星科技有限公司 Federal learning calculation task processing scheme based on FPGA heterogeneous processing system
CN112732638B (en) * 2021-01-22 2022-05-06 上海交通大学 Heterogeneous acceleration system and method based on CTPN network

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106020425A (en) * 2016-05-27 2016-10-12 浪潮(北京)电子信息产业有限公司 FPGA heterogeneous acceleration calculating system
CN106776466A (en) * 2016-11-30 2017-05-31 郑州云海信息技术有限公司 A kind of FPGA isomeries speed-up computation apparatus and system
CN107103113A (en) * 2017-03-23 2017-08-29 中国科学院计算技术研究所 Towards the Automation Design method, device and the optimization method of neural network processor
CN107425957A (en) * 2017-08-31 2017-12-01 郑州云海信息技术有限公司 A kind of cryptographic attack method, apparatus and isomery accelerate platform
CN107491317A (en) * 2017-10-10 2017-12-19 郑州云海信息技术有限公司 A kind of symmetrical encryption and decryption method and systems of AES for accelerating platform based on isomery

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9891935B2 (en) * 2015-08-13 2018-02-13 Altera Corporation Application-based dynamic heterogeneous many-core systems and methods


Non-Patent Citations (2)

Title
Acceleration of RSA Processes based on Hybrid ARM-FPGA Cluster;Xu Bai等;《2017 IEEE Symposium on Computers and Communications (ISCC) 》;20170904;第1-7页 *
Research on OpenCL Architecture and Performance Optimization of FPGA-based Heterogeneous Accelerator Card; Zhao Hehui; Computer Engineering Application Technology; 20180615 (Issue 11); pp. 30-32 *

Also Published As

Publication number Publication date
CN108921289A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN109376843B (en) FPGA-based electroencephalogram signal rapid classification method, implementation method and device
Chen et al. Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs
Otterness et al. An evaluation of the NVIDIA TX1 for supporting real-time computer-vision workloads
CN108268943B (en) Hardware accelerator engine
JP2020537784A (en) Machine learning runtime library for neural network acceleration
CN108921289B (en) FPGA heterogeneous acceleration method, device and system
EP3766018A1 (en) Hardware accelerated neural network subgraphs
CN108268940B (en) Tool for creating reconfigurable interconnect frameworks
CN106325967B (en) A kind of hardware-accelerated method, compiler and equipment
JP2020537789A (en) Static block scheduling in massively parallel software-defined hardware systems
CN106951926A (en) The deep learning systems approach and device of a kind of mixed architecture
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
CN110751676A (en) Heterogeneous computing system and method based on target detection and readable storage medium
CN106780149A (en) A kind of equipment real-time monitoring system based on timed task scheduling
CN205486304U (en) Portable realtime graphic object detection of low -power consumption and tracking means
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
CN111814959A (en) Model training data processing method, device and system and storage medium
Cheng et al. Accelerating end-to-end deep learning workflow with codesign of data preprocessing and scheduling
CN115146582A (en) Simulation method, simulation device, electronic apparatus, and computer-readable storage medium
US11461662B1 (en) Compilation time reduction for memory and compute bound neural networks
CN108549935A (en) A kind of device and method for realizing neural network model
Fakhi et al. New optimized GPU version of the k-means algorithm for large-sized image segmentation
WO2023123266A1 (en) Subgraph compilation method, subgraph execution method and related device
Cui et al. Design and Implementation of OpenCL-Based FPGA Accelerator for YOLOv2
CN110633493A (en) OpenCL transaction data processing method based on Intel FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant