CN113673704A - Relational network reasoning optimization method based on software and hardware cooperative acceleration - Google Patents

Relational network reasoning optimization method based on software and hardware cooperative acceleration

Info

Publication number
CN113673704A
CN113673704A (application number CN202110757032.6A)
Authority
CN
China
Prior art keywords
simd
layer
calculation
output
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110757032.6A
Other languages
Chinese (zh)
Other versions
CN113673704B (en)
Inventor
张志超
刘忠麟
蒋丽婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110757032.6A priority Critical patent/CN113673704B/en
Publication of CN113673704A publication Critical patent/CN113673704A/en
Application granted granted Critical
Publication of CN113673704B publication Critical patent/CN113673704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a relational network reasoning optimization method based on software and hardware cooperative acceleration, which addresses the speed and efficiency problems of relational network inference computation. It comprises a support set feature extraction process running on an X86/GPU platform and a relational network image classification inference design running on an FPGA chip. The support set feature extraction process comprises the following steps: support set image data of different categories are received; feature extraction is performed on the support sets of the different categories, and different feature pools are constructed as the offline support set features. The relational network image classification inference design running on the FPGA chip adopts a test image feature extraction module and a relation computation module constructed on the FPGA chip: the offline support set features are received and stored in DRAM; the test image feature extraction module receives the test image and performs feature extraction to obtain the test image features; the relation computation module performs relational network inference computation to obtain a relation score. The FPGA-based relational network image classification inference design adopts a multi-core interconnection design.

Description

Relational network reasoning optimization method based on software and hardware cooperative acceleration
Technical Field
The invention relates to the technical field of software and hardware cooperative acceleration, in particular to a relational network reasoning optimization method based on software and hardware cooperative acceleration.
Background
The convolutional neural network technology based on deep learning is widely applied to image processing tasks, and large-scale deep models such as AlexNet, VGG16 and ResNet, trained on the large-scale labeled Imagenet data set, show high identification accuracy. However, in few-sample learning applications, especially for classification tasks on new unknown classes, new learning modes and methods are needed, including few-sample learning methods such as Matching Nets, Meta Nets, MAML, Prototypical Nets and Relation Nets. These models are trained by constructing multiple batches of tasks over different classes, and a support set is introduced as prior knowledge to handle classification of unknown-class tasks; the relation network achieves higher recognition accuracy than the other models on the Omniglot data set and the miniImagenet data set.
The relation network constructs its feature extraction module and relation computation module from shallow convolution blocks. Common inference computation uses a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU); CPU processing is slow, and GPU processing has low energy efficiency. Relation networks and few-sample learning techniques [4-8] are limited by the over-fitting problem of a small number of samples in large model networks, so the feature extraction module and relation computation module are usually built from shallow convolution blocks; the computational complexity and model parameter storage are relatively low, which suits an acceleration mode based on Field Programmable Gate Array (FPGA) processing.
Typical relation network inference takes a support set and a test image as input and outputs a relation score; the category of the test image is finally selected from the relation scores over multiple support sets, and the same test image can produce different results under relation computations with different support sets. Different C-Way K-Shot inference tasks can be formed according to the number of support set categories (C-Way) and the number of support images per category (K-Shot). Relational network inference computation comprises the convolution computation of two modules, a feature extraction module and a relation computation module, and mainly consists of convolution, maximum pooling and fully-connected computation. The feature extraction module extracts features of both the support set and the test image. For support set feature extraction in the C-Way K-Shot mode, when K = 1 the convolution output feature map of a single image is output, and when K > 1 the element-wise accumulation of the convolution output features of the K images is output. After the support set features and test image features are extracted, the test image features and support set features are concatenated into an input feature map and sent to the relation computation module, which computes the feature similarity as an output value; over the C-Way feature values, the category with the largest score is selected as the output category of the test image.
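To make the K-Shot accumulation concrete, the following C++ sketch (illustrative only, not taken from the patent; extract_features stands in for the four-convolution-block extractor and is stubbed as the identity so the example runs) builds the support feature of one class by element-wise accumulation when K > 1:

#include <cstddef>
#include <vector>

// Flattened H x W x C feature map.
using FeatureMap = std::vector<float>;

// Placeholder for the convolution-block feature extractor described above;
// stubbed as the identity so this sketch compiles and runs.
FeatureMap extract_features(const std::vector<float>& image) { return image; }

// C-Way K-Shot support feature for one class (k_images assumed non-empty):
// for K == 1, the feature map of the single image; for K > 1, the
// element-wise accumulation over the K images of the class.
FeatureMap support_class_feature(const std::vector<std::vector<float>>& k_images) {
    FeatureMap acc = extract_features(k_images[0]);
    for (std::size_t k = 1; k < k_images.size(); ++k) {
        FeatureMap f = extract_features(k_images[k]);
        for (std::size_t i = 0; i < acc.size(); ++i)
            acc[i] += f[i];  // accumulate corresponding elements
    }
    return acc;
}

The C per-class features produced this way form the offline feature pool that the FPGA side later reads from DRAM.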
Relational network inference involves relation computations between the test image and multiple batches of support sets, and the accuracy and confidence of the relation score depend on the number of support sets. Multi-batch support set relation computation introduces very large computation and storage overhead; relational network inference based on general-purpose CPU and GPU processing is slow and energy-hungry, so an efficient inference accelerator needs to be designed to improve the efficiency of relational network inference computation.
Disclosure of Invention
In view of this, the present invention provides a relational network reasoning optimization method based on software and hardware cooperative acceleration, which addresses the speed and efficiency problems of relational network inference computation: a software and hardware cooperative acceleration mode improves processing throughput and efficiency without reducing the accuracy of the relational network inference computation.
In order to achieve the purpose, the technical scheme of the invention is as follows: a relational network reasoning optimization method based on software and hardware cooperative acceleration comprises a support set feature extraction process running on an X86/GPU platform and a relational network image classification reasoning process running on an FPGA chip.
The support set feature extraction process specifically comprises the following steps:
receiving image data of support sets of different categories, wherein the support sets are standard data sets labeled by experts, the image data are consistent in size, and classification labels are fixed; the image data has C types, and each type contains K images.
Carrying out feature extraction on support sets of different categories, and constructing different feature pools as features of the offline support sets; the offline support set features contain K feature pools, each of which contains class C features.
The relational network image classification reasoning design running on the FPGA chip adopts a test image feature extraction module and a relational computation module constructed on the FPGA chip, and specifically executes the following procedures:
and receiving the characteristics of the offline support set and storing the characteristics into a Dynamic Random Access Memory (DRAM) on the FPGA board card.
And the test image feature extraction module receives the test image and performs feature extraction on the test image to obtain the test image features.
And the relation calculation module performs relation network reasoning calculation by using the test image characteristics and the offline support set characteristics to obtain a relation score.
The FPGA-based relational network image classification inference process adopts a multi-core interconnection design, wherein M relational computation modules, N test image feature extraction modules, a DRAM (dynamic random access memory) and a control interface are arranged in the relational network image classification inference process, and all the modules are interconnected through an AXI (advanced extensible interface) interconnection bus.
Further, the relational network inference computation using the test image features and the offline support set features to obtain a relation score is specifically as follows:
The relation network comprises a test image feature extraction module and a relation computation module.
The feature extraction module consists of 4 convolution blocks and 2 maximum pooling computation layers connected in sequence.
The relation computation module consists of 2 convolution blocks, 2 maximum pooling computation layers and 2 fully-connected computation layers connected in sequence.
Further, the convolution block adopts the following flow:
Set the weight data W of the convolution block; the input feature map of the convolution block is In, and the output feature map of the convolution block is Out.
The input feature map In of the convolution block has dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels of the convolution block. The weight data W of the convolution block has dimension OC × K² × IC, where OC is the number of output feature map channels of the convolution block and K is the convolution kernel size. The output of the convolution block is the output feature map Out with dimension OH × OW × OC, where OH is the output feature map height and OW the output feature map width of the convolution block.
PE processing units are set, and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data; the input feature map is scheduled as data of width PE × SIMD, the output of each PE computation is the output of one neuron, and SIMD computation units are used inside each PE.
Five nested loops are set and only the innermost ones are unrolled; from the outermost to the innermost they are the first through fifth convolution loops, with loop variables h, w, c, pe and simd respectively.
The initial values of h, w, c, pe and simd are set to 0 and each is incremented by 1 per iteration; the upper bound of h is OH, of w is OW, of c is OC, of pe is PE, and of simd is SIMD.
The fifth (innermost) convolution loop is set to
Out[w][h][c/PE+pe] += In[pe][simd] × W[pe][simd]
where In[pe][simd] is the parameter at row pe, column simd of the convolution block input feature map; W[pe][simd] is the parameter at row pe, column simd of the convolution block weight data W; Out[w][h][c/PE+pe] is the parameter of the convolution block output feature map at indices w, h and c/PE+pe; and += accumulates the right-hand side into the left-hand side.
Further, the maximum pooling computation layer specifically adopts the following flow:
The pooling input feature map is In1; weight data W1 is set; the output feature map is Out1.
The pooling input feature map In1 has dimension IH1 × IW1 × IC1, where IH1 is the pooling input feature map height, IW1 the width, and IC1 the number of channels. The pooling weight data W1 has dimension OC1 × K1² × IC1, where OC1 is the number of pooling output feature map channels and K1 is the pooling kernel size. The pooling output is the output feature map Out1 with dimension OH1 × OW1 × OC1, where OH1 is the pooling output feature map height and OW1 the width.
Pool_Size is set as the pooling size of the maximum pooling computation layer, and max as the maximum-value selection.
Six nested loops are set and only the innermost loop is unrolled, with parallelism PE1; from the outermost to the innermost they are the first through sixth pooling loops, with loop variables h1, ph, w1, pw, c1 and pe1 respectively.
The initial values of h1, ph, w1, pw, c1 and pe1 are set to 0 and each is incremented by 1 per iteration; the upper bound of h1 is OH1, of ph is Pool_Size, of w1 is OW1, of pw is Pool_Size, of c1 is OC1, and of pe1 is PE1.
The sixth (innermost) pooling loop is set to
Out1[w1][h1][c1/PE1+pe1] = max(In1[pe1], Out1[w1][h1][c1/PE1+pe1])
where Out1[w1][h1][c1/PE1+pe1] is the parameter of the pooling output feature map at indices [w1][h1][c1/PE1+pe1]; the Out1 matrix initially stores negative infinity; max() returns the maximum of the values in parentheses; and In1[pe1] denotes row pe1 of the pooling input feature map.
Further, the fully-connected computation layer specifically adopts the following flow:
The input feature map of the fully-connected computation layer is In2; the weight data of the fully-connected computation layer is W2; the output feature map of the fully-connected computation layer is Out2.
The input feature map In2 of the fully-connected computation layer has dimension IH2 × IW2 × IC2, where IH2 is the input feature map height, IW2 the width, and IC2 the number of input feature map channels of the fully-connected computation layer. The weight data W2 of the fully-connected computation layer has dimension OC2 × K2² × IC2, where OC2 is the number of output feature map channels and K2 is the convolution kernel size of the fully-connected computation layer. The output of the fully-connected computation layer is the output feature map Out2 with dimension OH2 × OW2 × OC2, where OH2 is the output feature map height and OW2 the width.
The fully-connected computation layer is set to comprise 3 nested loops, from the outer layer to the inner layer the c loop, the pe loop and the simd loop; only the pe loop and the simd loop are unrolled, giving a processing module with PE2 × SIMD2 parallelism, where each of the PE2 units computes the output of one fully-connected neuron with SIMD2 multiply-accumulate units computing in parallel inside.
The loop variable of the c loop is set to c2, of the pe loop to pe2, and of the simd loop to simd2.
The initial values of c2, pe2 and simd2 are set to 0 and each is incremented by 1 per iteration; the upper bound of c2 is OC2, of pe2 is PE2, and of simd2 is SIMD2.
The simd loop is set to:
Out2[c2/PE2+pe2] += In2[pe2][simd2] × W2[pe2][simd2]
where Out2[c2/PE2+pe2] is the (c2/PE2+pe2)-th parameter of the fully-connected computation layer output feature map Out2; In2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer input feature map In2; W2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer weight data; and += accumulates the right-hand side into the left-hand side.
Advantageous effects:
The embodiment of the invention provides a relational network reasoning optimization method based on software and hardware cooperative acceleration. Aiming at the requirement for efficient relational network inference computation, a software and hardware cooperative acceleration mode improves processing throughput and efficiency without reducing the accuracy of the relational network inference computation. For support set feature extraction, a CPU/GPU mode builds a support set feature pool shared by the subsequent FPGA inference accelerator, saving computation overhead; for the design of the on-chip FPGA computing units of the relation network, high-level synthesis (HLS) loop optimization and a heterogeneous multi-core mode improve processing energy efficiency and processing throughput.
Drawings
FIG. 1 is a schematic diagram of a relational network inference calculation process based on software and hardware cooperative acceleration processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of multi-core interconnection design on a relational network reasoning computing chip;
FIG. 3 is a schematic diagram of a generic relational network algorithm inference calculation module;
FIG. 4 is a schematic diagram of an example on-chip multi-core interconnection design of Omniglot 28;
FIG. 5 is a schematic diagram of an example of the design of multi-core interconnection on a miniImagenet84 chip.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a relational network reasoning optimization method based on software and hardware cooperative acceleration, which comprises a feature extraction process running on an X86/GPU platform and a relational network image classification inference process running on an FPGA chip.
the characteristic extraction process specifically comprises the following steps:
receiving image data of support sets of different categories, wherein the support sets are standard data sets labeled by experts, the image data are consistent in size, and classification labels are fixed; the image data has C types, and each type contains K images.
Carrying out feature extraction on support sets of different categories, and constructing different feature pools as features of the offline support sets; the offline support set features contain K feature pools, each of which contains class C features.
The relational network image classification reasoning process adopts a test image feature extraction module and a relational computation module, and specifically executes the following processes:
the offline support set features are received and stored in a Dynamic Random Access Memory (DRAM).
The test image feature extraction module receives the test image and performs feature extraction on it to obtain the test image features.
The relation computation module performs relational network inference computation using the test image features and the offline support set features to obtain a relation score.
A multi-core interconnection design is adopted in a relational network image classification inference process in an FPGA chip, wherein M relational computation modules, N test image feature extraction modules, a DRAM and a control interface are arranged in the relational network image classification inference process, and all the modules are interconnected through an AXI interconnection bus.
Aiming at the slow speed and high energy consumption of relational network inference computation, a software and hardware cooperative acceleration computing mode improves processing speed and efficiency. The CPU or GPU extracts the reusable support set features for inference computation, reducing computation cost; a feature extraction module and a relation computation module designed with high-level synthesis (HLS) loop optimization improve the speed and efficiency of relational network inference computation; and a heterogeneous multi-core design comprehensively utilizes the processing capacity of multiple cores to further improve the relational network inference computation speed. The specific relational network inference computation flow based on software and hardware cooperative acceleration processing is shown in fig. 1.
C-Way K-Shot classification inference tasks based on the relation network are accelerated through cooperative computation between an X86/GPU platform and an FPGA platform. The X86/GPU platform performs support set feature extraction, and the FPGA platform performs test image feature extraction and relation computation under the support set features. The support set is generally a standard data set labeled by experts with fixed classification labels (consistent image data and image size), so the support set features can be computed in advance on an X86 or GPU platform (with an existing feature extraction algorithm, namely the feature extraction algorithm of the relation network) to form a support set feature pool, which is convenient for subsequent computation and saves computation time and energy. For different C-Way K-Shot tasks (C image categories, K images per category), different feature pools can be constructed; typically 1-Shot and K-Shot feature pools are built, where K is 1, 5 or 20 for the Omniglot data set and 1 or 5 for the miniImagenet data set.
Relational network image classification reasoning process
The relational network FPGA inference computation process caches the offline support set features directly in a Dynamic Random Access Memory (DRAM), saving the computation time and energy of support set feature extraction; a feature extraction module and a relation computation module are configured on chip, computation modules of different sizes are constructed according to the required computation speed and the on-chip resource constraints, and the multiple computation modules are configured as a heterogeneous multi-core system for cooperative acceleration, further improving the inference computation capability.
Heterogeneous multi-core based on-chip processing system design
The relational network FPGA inference computation module likewise caches the offline support set features directly in DRAM to save the computation time and energy of support set feature extraction; a test image feature extraction module and a relation computation module are configured on chip, computation modules of different sizes are constructed according to the required computation speed and the on-chip resource constraints, and multiple computation modules are configured as a heterogeneous multi-core system for cooperative acceleration. The specific on-chip multi-core interconnection design for relational network inference computation is shown in fig. 2 and comprises M relation computation modules, N test image feature extraction modules, a DRAM and a control interface; the modules are interconnected through an AXI (Advanced eXtensible Interface) bus, image data and module control instructions are transmitted to the modules through the control interface, and output results are returned through the control interface.
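As a rough host-side illustration of how the N feature cores and M relation cores cooperate (the run_* functions are hypothetical stand-ins for control-interface transactions over the AXI interconnect, stubbed here so the sketch runs; they are not the patent's API):

#include <cstddef>
#include <vector>

// Stub: pretend the feature-extraction core returns the image itself.
std::vector<float> run_feature_core(std::size_t /*core*/,
                                    const std::vector<float>& image) {
    return image;
}

// Stub: pretend the relation core scores a pair by dot product.
float run_relation_core(std::size_t /*core*/, const std::vector<float>& a,
                        const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) s += a[i] * b[i];
    return s;
}

// Classify one test image against the C offline support-set features cached
// in DRAM: extract the test feature once, score it against every class on
// the relation cores (round-robin over the M cores), return the arg-max class.
std::size_t classify(const std::vector<float>& image,
                     const std::vector<std::vector<float>>& support_features,
                     std::size_t m_relation_cores) {
    std::vector<float> test_feature = run_feature_core(0, image);
    std::size_t best_class = 0;
    float best_score = -1e30f;
    for (std::size_t c = 0; c < support_features.size(); ++c) {
        float score = run_relation_core(c % m_relation_cores,
                                        test_feature, support_features[c]);
        if (score > best_score) { best_score = score; best_class = c; }
    }
    return best_class;
}

With several test images in flight, the images themselves would be spread over the N feature-extraction cores in the same round-robin fashion.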
FPGA reasoning calculation module design of relational network
The inference module of the relation network comprises a feature extraction module and a relation computation module, each composed of basic convolution blocks. A convolution block comprises a 3 × 3 convolution with 64 output neurons, followed by BatchNorm and ReLU to produce the block output. The feature extraction module is composed of 4 convolution blocks and 2 maximum pooling computations, and the relation computation module is composed of 2 convolution blocks, 2 maximum pooling computations and 2 fully-connected computations. Fully-connected layer 1 has dimension H × 8 and fully-connected layer 2 has dimension 8 × 1, finally outputting the relation score. The specific structure of the relational network inference computation module is shown in fig. 3.
FPGA convolution, pooling and fully-connected acceleration optimization design based on HLS loop optimization
The relation network uses multiple convolution blocks, maximum pooling layers and fully-connected layers to form the feature extraction module and the relation computation module. This part applies HLS-based optimization to unroll the convolution, pooling and fully-connected computation loops, increasing the parallel processing capacity of the computation modules and thereby improving their throughput. For the multi-layer computation of the feature extraction module and the relation computation module, a dataflow optimization mode forms a single computation unit, which simplifies scheduling outside the computation module: the input is an image or feature map and the output is a feature map or relation score.
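The dataflow composition can be pictured with the following Vivado/Vitis HLS-style sketch (the layer functions and stream widths are assumptions and their bodies are omitted; hls::stream and the DATAFLOW pragma are the real HLS constructs being illustrated):

#include "hls_stream.h"  // Vivado/Vitis HLS stream type

// Placeholder layer stages: each reads its input stream and writes its
// output stream, implemented internally with the loop structures of
// Algorithms 1-3.
void conv_block(hls::stream<float>& in, hls::stream<float>& out);
void max_pool(hls::stream<float>& in, hls::stream<float>& out);

// Abbreviated feature-extraction unit; the real module chains 4 convolution
// blocks and 2 maximum pooling layers in this same style, so the unit
// consumes an image stream and emits a feature-map stream with no
// intermediate off-chip transfers.
void feature_extractor_top(hls::stream<float>& image,
                           hls::stream<float>& feature) {
#pragma HLS DATAFLOW  // run the stages below as a task-level pipeline
    hls::stream<float> s1, s2;
    conv_block(image, s1);
    max_pool(s1, s2);
    conv_block(s2, feature);
}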
(1) Convolution block
The convolution block performs convolution multiply-accumulate computation of the input feature map and the weight data. The computation input of the convolution multiply-accumulate module is the input feature map with dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels, together with the weight data W of dimension OC × K² × IC, where OC is the number of output feature map channels and K is the convolution kernel size. The output is the output feature map with dimension OH × OW × OC.
The convolution multiply-accumulate computation module includes multiple processing elements (PEs), and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data. The input feature map is scheduled as data of width PE × SIMD; the output of each PE computation is the output of one neuron, SIMD computation units inside each PE further increase the parallelism, and the SIMD-related data complete the multiply-accumulate computation within a neuron. The convolution weights are likewise scheduled as data of PE × SIMD width, corresponding to the input. The specific convolution multiply-accumulate computation is shown as Algorithm 1 and has five nested loops, of which the two loops over pe and simd are unrolled, giving PE × SIMD parallelism. Each convolution layer can configure PE and SIMD independently according to its computation amount and the input and output rates of the preceding and following layers, improving processing speed while saving on-chip resources as much as possible.
(Algorithm 1, the convolution multiply-accumulate pseudocode, is presented as an image in the original publication.)
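In place of the image, the following plain C++ sketch reconstructs the five-loop structure from the textual description. Two points are assumptions rather than the patent's text: the c loop is taken over OC/PE channel groups with output channel index c*PE + pe (the prose writes c/PE + pe, which does not enumerate channels uniquely as literally written), and read_beat is a stand-in for the on-chip scheduling that delivers one PE × SIMD-wide slice of inputs and weights:

#include <cstring>

constexpr int OH = 28, OW = 28, OC = 64;  // example layer sizes (assumed)
constexpr int PE = 4, SIMD = 8;           // per-layer parallelism configuration

// Scheduling stub: fills one PE x SIMD beat of inputs and weights for the
// current (h, w, c) position; a real design streams this from on-chip buffers.
void read_beat(int h, int w, int c, float in[PE][SIMD], float wgt[PE][SIMD]) {
    (void)h; (void)w; (void)c;
    std::memset(in, 0, sizeof(float) * PE * SIMD);
    std::memset(wgt, 0, sizeof(float) * PE * SIMD);
}

// Algorithm 1 (reconstruction): five nested loops; in hardware the pe and
// simd loops are unrolled, giving PE x SIMD multiply-accumulates per cycle.
// Out is assumed pre-initialized (e.g. to zero) by the caller.
void conv_mac(float Out[OW][OH][OC]) {
    for (int h = 0; h < OH; ++h)
        for (int w = 0; w < OW; ++w)
            for (int c = 0; c < OC / PE; ++c) {  // channel groups (assumed)
                float in[PE][SIMD], wgt[PE][SIMD];
                read_beat(h, w, c, in, wgt);
                for (int pe = 0; pe < PE; ++pe)              // unrolled in hardware
                    for (int simd = 0; simd < SIMD; ++simd)  // unrolled in hardware
                        Out[w][h][c * PE + pe] += in[pe][simd] * wgt[pe][simd];
            }
}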
The convolution block adopts the following flow:
Set the weight data W of the convolution block; the input feature map of the convolution block is In, and the output feature map of the convolution block is Out.
The input feature map In of the convolution block has dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels of the convolution block. The weight data W of the convolution block has dimension OC × K² × IC, where OC is the number of output feature map channels of the convolution block and K is the convolution kernel size. The output of the convolution block is the output feature map Out with dimension OH × OW × OC, where OH is the output feature map height and OW the output feature map width of the convolution block.
PE processing units are set, and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data; the input feature map is scheduled as data of width PE × SIMD, the output of each PE computation is the output of one neuron, and SIMD computation units are used inside each PE.
Five nested loops are set and only the innermost ones are unrolled; from the outermost to the innermost they are the first through fifth convolution loops, with loop variables h, w, c, pe and simd respectively.
The initial values of h, w, c, pe and simd are set to 0 and each is incremented by 1 per iteration; the upper bound of h is OH, of w is OW, of c is OC, of pe is PE, and of simd is SIMD.
The fifth (innermost) convolution loop is set to
Out[w][h][c/PE+pe] += In[pe][simd] × W[pe][simd]
where In[pe][simd] is the parameter at row pe, column simd of the convolution block input feature map; W[pe][simd] is the parameter at row pe, column simd of the convolution block weight data W; Out[w][h][c/PE+pe] is the parameter of the convolution block output feature map at indices w, h and c/PE+pe; and += accumulates the right-hand side into the left-hand side.
(2) Fully connected computing layer
The fully-connected computation layer performs fully-connected multiply-accumulate computation of the input feature map and the weight data, and outputs the result as an output feature map. The input and output feature maps of the fully-connected computation module are consistent with those of the convolution computation module, with H × W × C data dimensions. The fully-connected computation can be abstracted as a convolution computation with K = 1; the specific fully-connected multiply-accumulate computation, shown as Algorithm 2, comprises three nested loops. The algorithm unrolls the pe loop and the simd loop, giving a processing module with PE × SIMD parallelism, where each PE computes the output of one fully-connected neuron and SIMD multiply-accumulate units compute in parallel inside the neuron. Each fully-connected layer instance can use a specific PE and SIMD configuration according to the input and output rates of the preceding and following layers, satisfying the processing requirement while saving on-chip resource consumption.
(Algorithm 2, the fully-connected multiply-accumulate pseudocode, is presented as an image in the original publication.)
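In place of the image, this sketch reconstructs the three-loop structure under the same assumptions as the convolution sketch (PE2-sized channel groups with output index c2*PE2 + pe2, and a scheduling stub read_fc_beat):

#include <cstring>

constexpr int OC2 = 8, PE2 = 2, SIMD2 = 4;  // example configuration (assumed)

// Scheduling stub: one PE2 x SIMD2 beat of inputs and weights for channel
// group c2.
void read_fc_beat(int c2, float in2[PE2][SIMD2], float w2[PE2][SIMD2]) {
    (void)c2;
    std::memset(in2, 0, sizeof(float) * PE2 * SIMD2);
    std::memset(w2, 0, sizeof(float) * PE2 * SIMD2);
}

// Algorithm 2 (reconstruction): c loop outside; the pe and simd loops are
// unrolled in hardware, each pe computing one fully-connected neuron output.
// Out2 is assumed pre-initialized by the caller.
void fc_mac(float Out2[OC2]) {
    for (int c2 = 0; c2 < OC2 / PE2; ++c2) {
        float in2[PE2][SIMD2], w2[PE2][SIMD2];
        read_fc_beat(c2, in2, w2);
        for (int pe2 = 0; pe2 < PE2; ++pe2)              // unrolled in hardware
            for (int simd2 = 0; simd2 < SIMD2; ++simd2)  // unrolled in hardware
                Out2[c2 * PE2 + pe2] += in2[pe2][simd2] * w2[pe2][simd2];
    }
}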
The fully-connected computation layer specifically adopts the following flow:
The input feature map of the fully-connected computation layer is In2; the weight data of the fully-connected computation layer is W2; the output feature map of the fully-connected computation layer is Out2.
The input feature map In2 of the fully-connected computation layer has dimension IH2 × IW2 × IC2, where IH2 is the input feature map height, IW2 the width, and IC2 the number of input feature map channels of the fully-connected computation layer. The weight data W2 of the fully-connected computation layer has dimension OC2 × K2² × IC2, where OC2 is the number of output feature map channels and K2 is the convolution kernel size of the fully-connected computation layer. The output of the fully-connected computation layer is the output feature map Out2 with dimension OH2 × OW2 × OC2, where OH2 is the output feature map height and OW2 the width.
The fully-connected computation layer is set to comprise 3 nested loops, from the outer layer to the inner layer the c loop, the pe loop and the simd loop; only the pe loop and the simd loop are unrolled, giving a processing module with PE2 × SIMD2 parallelism, where each of the PE2 units computes the output of one fully-connected neuron with SIMD2 multiply-accumulate units computing in parallel inside.
The loop variable of the c loop is set to c2, of the pe loop to pe2, and of the simd loop to simd2.
The initial values of c2, pe2 and simd2 are set to 0 and each is incremented by 1 per iteration; the upper bound of c2 is OC2, of pe2 is PE2, and of simd2 is SIMD2.
The simd loop is set to:
Out2[c2/PE2+pe2] += In2[pe2][simd2] × W2[pe2][simd2]
where Out2[c2/PE2+pe2] is the (c2/PE2+pe2)-th parameter of the fully-connected computation layer output feature map Out2; In2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer input feature map In2; W2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer weight data; and += accumulates the right-hand side into the left-hand side.
(3) Maximum pooling computation layer
The maximum pooling computation layer performs maximum pooling over the input feature map; its input and output feature maps are consistent with those of the convolution and fully-connected layers, and it is combined into the computation unit for dataflow scheduling in the multi-layer network. The maximum pooling computation is shown as Algorithm 3 and comprises six nested loops; only the innermost loop is unrolled, with unroll factor PE, i.e. parallelism PE, and each pooling layer is configured independently according to its computation requirements. Pool_Size is the pooling size and max is the maximum-value selection. Input scheduling arranges the data of the input feature map In to meet the computation requirements of the six-loop pseudocode.
(Algorithm 3, the maximum pooling pseudocode, is presented as an image in the original publication.)
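In place of the image, the six-loop structure reconstructed from the text looks as follows (same channel-group assumption as the convolution sketch; Out1 is initialized to negative infinity before the max reduction, as the text states, and read_pool_beat stands in for the input scheduling):

#include <algorithm>
#include <cstring>
#include <limits>

constexpr int OH1 = 14, OW1 = 14, OC1 = 64;  // example layer sizes (assumed)
constexpr int PE1 = 4;                       // innermost-loop parallelism
constexpr int POOL_SIZE = 2;                 // Pool_Size

// Scheduling stub: PE1 input values for the current pooling-window position
// and channel group.
void read_pool_beat(int h1, int ph, int w1, int pw, int c1, float in1[PE1]) {
    (void)h1; (void)ph; (void)w1; (void)pw; (void)c1;
    std::memset(in1, 0, sizeof(float) * PE1);
}

// Algorithm 3 (reconstruction): six nested loops; only the pe1 loop is
// unrolled in hardware.
void max_pool_layer(float Out1[OW1][OH1][OC1]) {
    for (int w = 0; w < OW1; ++w)          // initialize to negative infinity
        for (int h = 0; h < OH1; ++h)
            for (int c = 0; c < OC1; ++c)
                Out1[w][h][c] = -std::numeric_limits<float>::infinity();

    for (int h1 = 0; h1 < OH1; ++h1)
        for (int ph = 0; ph < POOL_SIZE; ++ph)
            for (int w1 = 0; w1 < OW1; ++w1)
                for (int pw = 0; pw < POOL_SIZE; ++pw)
                    for (int c1 = 0; c1 < OC1 / PE1; ++c1) {
                        float in1[PE1];
                        read_pool_beat(h1, ph, w1, pw, c1, in1);
                        for (int pe1 = 0; pe1 < PE1; ++pe1)  // unrolled in hardware
                            Out1[w1][h1][c1 * PE1 + pe1] = std::max(
                                in1[pe1], Out1[w1][h1][c1 * PE1 + pe1]);
                    }
}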
That is, the maximum pooling computation layer specifically adopts the following flow:
The pooling input feature map is In1; weight data W1 is set; the output feature map is Out1.
The pooling input feature map In1 has dimension IH1 × IW1 × IC1, where IH1 is the pooling input feature map height, IW1 the width, and IC1 the number of channels. The pooling weight data W1 has dimension OC1 × K1² × IC1, where OC1 is the number of pooling output feature map channels and K1 is the pooling kernel size. The pooling output is the output feature map Out1 with dimension OH1 × OW1 × OC1, where OH1 is the pooling output feature map height and OW1 the width.
Pool_Size is set as the pooling size of the maximum pooling computation layer, and max as the maximum-value selection.
Six nested loops are set and only the innermost loop is unrolled, with parallelism PE1; from the outermost to the innermost they are the first through sixth pooling loops, with loop variables h1, ph, w1, pw, c1 and pe1 respectively.
The initial values of h1, ph, w1, pw, c1 and pe1 are set to 0 and each is incremented by 1 per iteration; the upper bound of h1 is OH1, of ph is Pool_Size, of w1 is OW1, of pw is Pool_Size, of c1 is OC1, and of pe1 is PE1.
The sixth (innermost) pooling loop is set to
Out1[w1][h1][c1/PE1+pe1] = max(In1[pe1], Out1[w1][h1][c1/PE1+pe1])
where Out1[w1][h1][c1/PE1+pe1] is the parameter of the pooling output feature map at indices [w1][h1][c1/PE1+pe1]; the Out1 matrix initially stores negative infinity; max() returns the maximum of the values in parentheses; and In1[pe1] denotes row pe1 of the pooling input feature map.
Relational network inference tasks are constructed for two data sets of different sizes, Omniglot and miniImagenet, with input image dimensions of 28 × 28 × 1 and 84 × 84 × 3 respectively, and H in the fully-connected layer of 64 and 576 respectively. Because the two data sets differ in input size and internal configuration, the computation amount of each module differs, so separate accelerators are designed for the Omniglot and miniImagenet data sets for inference acceleration.
Example 1 Omniglot28 relational network inference design
The Omniglot28 on-chip multi-core design is shown in fig. 4 and includes a control interface, an AXI (Advanced eXtensible Interface) interconnect, DRAM and computation modules. The computation modules comprise one feature extraction module and 4 relation computation modules. The control interface handles the host's control instruction transmission and data transmission and controls the specific relation computation and feature extraction modules. The multiple computation modules share the 4 GB memory space of one DRAM and can run in parallel simultaneously, using the multi-core capability to provide higher operation throughput.
The Omniglot28 accelerator configuration is shown in Table 1; for simplicity of description, the PE and SIMD configurations of the feature extraction module and the relation computation modules are presented in one table. The feature extraction module comprises the parts cnv1 to cnv4, and the relation computation module comprises the parts cnv5 to fc2. The parallel granularity, and hence the throughput, of each module can be controlled by setting the PE and SIMD sizes, but it is limited by the 32-bit width of the floating-point data and by the wide lines and large computation arrays that large PE and SIMD values bring, which make timing closure difficult during Vivado synthesis; therefore resources are used in a multi-core mode to further improve throughput. The feature extraction module of the Omniglot28 accelerator is designed for 2198.39 FPS, and the relation computation module for 7430.56 FPS.
TABLE 1 Omniglot28 Accelerator configuration
(Table 1 is presented as an image in the original publication.)
Example 2 MiniImagenet84 relationship network inference design
The miniImagenet84 on-chip multi-core interconnection design is shown in FIG. 5, and includes a heterogeneous multi-core acceleration on-chip system composed of a feature extraction module and two relation calculation modules.
The miniImagenet84 accelerator configuration is shown in Table 2; for simplicity of description, the PE and SIMD configurations of the feature extraction module and the relation computation modules are presented in one table. The feature extraction module comprises the parts cnv1 to cnv4, and the relation computation module comprises the parts cnv5 to fc2. The parallel granularity, and hence the throughput, of each module can be controlled by setting the PE and SIMD sizes. The feature extraction module of the miniImagenet84 accelerator is designed for 221.02 FPS, and the relation computation module for 642.78 FPS.
Table 2 miniImagenet84 accelerator configuration
(Table 2 is presented as an image in the original publication.)
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A relational network reasoning optimization method based on software and hardware cooperative acceleration is characterized by comprising a support set feature extraction process running on an X86/GPU platform and a relational network image classification reasoning design running on an FPGA platform;
the support set feature extraction process specifically comprises the following steps:
receiving image data of support sets of different categories, wherein the support sets are standard data sets labeled by experts, the image data are consistent in size, and classification labels are fixed; the image data has C types, and each type comprises K images;
carrying out feature extraction on support sets of different categories, and constructing different feature pools as features of the offline support sets; the offline support set features comprise K feature pools, and each feature pool comprises C-type features;
the relational network image classification reasoning design running on the FPGA platform adopts a test image feature extraction module and a relational computation module which are constructed on an FPGA chip to specifically execute the following procedures
Receiving the characteristics of the offline support set, storing the characteristics into a Dynamic Random Access Memory (DRAM) on the FPGA board card;
the test image feature extraction module receives the test image and performs feature extraction on the test image to obtain test image features;
the relation calculation module performs relation network reasoning calculation by using the test image characteristics and the offline support set characteristics to obtain a relation score;
the relational network image classification reasoning process in the FPGA chip adopts a multi-core interconnection design, wherein M relational computation modules, N test image feature extraction modules, a DRAM and a control interface are arranged in the relational network image classification reasoning process, and all the modules are interconnected through an AXI interconnection bus.
2. The method of claim 1, wherein the test image feature extraction module consists of 4 convolution blocks and 2 maximum pooling computation layers connected in sequence;
the relation computation module consists of 2 convolution blocks, 2 maximum pooling computation layers and 2 fully-connected computation layers connected in sequence.
3. The method of claim 2, wherein the convolution block is designed as follows:
set the weight data W of the convolution block; the input feature map of the convolution block is In, and the output feature map of the convolution block is Out;
the input feature map In of the convolution block has dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels of the convolution block; the weight data W of the convolution block has dimension OC × K² × IC, where OC is the number of output feature map channels of the convolution block and K is the convolution kernel size; the output of the convolution block is the output feature map Out with dimension OH × OW × OC, where OH is the output feature map height and OW the output feature map width of the convolution block;
PE processing units are set, and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data; the input feature map is scheduled as data of width PE × SIMD, the output of each PE computation is the output of one neuron, and SIMD computation units are used inside each PE;
five nested loops are set and only the innermost ones are unrolled; from the outermost to the innermost they are the first through fifth convolution loops, with loop variables h, w, c, pe and simd respectively;
the initial values of h, w, c, pe and simd are set to 0 and each is incremented by 1 per iteration; the upper bound of h is OH, of w is OW, of c is OC, of pe is PE, and of simd is SIMD;
the fifth (innermost) convolution loop is set to
Out[w][h][c/PE+pe] += In[pe][simd] × W[pe][simd]
where In[pe][simd] is the parameter at row pe, column simd of the convolution block input feature map; W[pe][simd] is the parameter at row pe, column simd of the convolution block weight data W; Out[w][h][c/PE+pe] is the parameter of the convolution block output feature map at indices w, h and c/PE+pe; and += accumulates the right-hand side into the left-hand side.
4. A method according to claim 2 or 3, wherein the maximum pooling computation layer is specifically designed as follows:
the pooling input feature map is In1; weight data W1 is set; the output feature map is Out1;
the pooling input feature map In1 has dimension IH1 × IW1 × IC1, where IH1 is the pooling input feature map height, IW1 the width, and IC1 the number of channels; the pooling weight data W1 has dimension OC1 × K1² × IC1, where OC1 is the number of pooling output feature map channels and K1 is the pooling kernel size; the pooling output is the output feature map Out1 with dimension OH1 × OW1 × OC1, where OH1 is the pooling output feature map height and OW1 the width;
Pool_Size is set as the pooling size of the maximum pooling computation layer, and max as the maximum-value selection;
six nested loops are set and only the innermost loop is unrolled, with parallelism PE1; from the outermost to the innermost they are the first through sixth pooling loops, with loop variables h1, ph, w1, pw, c1 and pe1 respectively;
the initial values of h1, ph, w1, pw, c1 and pe1 are set to 0 and each is incremented by 1 per iteration; the upper bound of h1 is OH1, of ph is Pool_Size, of w1 is OW1, of pw is Pool_Size, of c1 is OC1, and of pe1 is PE1;
the sixth (innermost) pooling loop is set to
Out1[w1][h1][c1/PE1+pe1] = max(In1[pe1], Out1[w1][h1][c1/PE1+pe1])
where Out1[w1][h1][c1/PE1+pe1] is the parameter of the pooling output feature map at indices [w1][h1][c1/PE1+pe1]; the Out1 matrix initially stores negative infinity; max() returns the maximum of the values in parentheses; and In1[pe1] denotes row pe1 of the pooling input feature map.
5. The method according to claim 2 or 3, wherein the fully-connected computation layer is specifically designed as follows:
the input feature map of the fully-connected computation layer is In2; the weight data of the fully-connected computation layer is W2; the output feature map of the fully-connected computation layer is Out2;
the input feature map In2 of the fully-connected computation layer has dimension IH2 × IW2 × IC2, where IH2 is the input feature map height, IW2 the width, and IC2 the number of input feature map channels of the fully-connected computation layer; the weight data W2 of the fully-connected computation layer has dimension OC2 × K2² × IC2, where OC2 is the number of output feature map channels and K2 is the convolution kernel size of the fully-connected computation layer; the output of the fully-connected computation layer is the output feature map Out2 with dimension OH2 × OW2 × OC2, where OH2 is the output feature map height and OW2 the width;
the fully-connected computation layer is set to comprise 3 nested loops, from the outer layer to the inner layer the c loop, the pe loop and the simd loop; only the pe loop and the simd loop are unrolled, giving a processing module with PE2 × SIMD2 parallelism, where each of the PE2 units computes the output of one fully-connected neuron with SIMD2 multiply-accumulate units computing in parallel inside;
the loop variable of the c loop is set to c2, of the pe loop to pe2, and of the simd loop to simd2;
the initial values of c2, pe2 and simd2 are set to 0 and each is incremented by 1 per iteration; the upper bound of c2 is OC2, of pe2 is PE2, and of simd2 is SIMD2;
the simd loop is set to:
Out2[c2/PE2+pe2] += In2[pe2][simd2] × W2[pe2][simd2];
where Out2[c2/PE2+pe2] is the (c2/PE2+pe2)-th parameter of the fully-connected computation layer output feature map Out2; In2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer input feature map In2; W2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer weight data; and += accumulates the right-hand side into the left-hand side.
CN202110757032.6A 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration Active CN113673704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110757032.6A CN113673704B (en) 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110757032.6A CN113673704B (en) 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration

Publications (2)

Publication Number Publication Date
CN113673704A true CN113673704A (en) 2021-11-19
CN113673704B CN113673704B (en) 2022-07-01

Family

ID=78538663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110757032.6A Active CN113673704B (en) 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration

Country Status (1)

Country Link
CN (1) CN113673704B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138830A1 (en) * 2015-01-09 2019-05-09 Irvine Sensors Corp. Methods and Devices for Cognitive-based Image Data Analytics in Real Time Comprising Convolutional Neural Network
US20160380819A1 (en) * 2015-06-26 2016-12-29 Microsoft Technology Licensing, Llc Configuring acceleration components over a network
CN109284250A (en) * 2017-09-11 2019-01-29 南京弹跳力信息技术有限公司 A kind of calculating acceleration system and its accelerated method based on large-scale F PGA chip
CN112464930A (en) * 2019-09-09 2021-03-09 华为技术有限公司 Target detection network construction method, target detection method, device and storage medium
CN111210019A (en) * 2020-01-16 2020-05-29 电子科技大学 Neural network inference method based on software and hardware cooperative acceleration
CN111626403A (en) * 2020-05-14 2020-09-04 北京航空航天大学 Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN112116084A (en) * 2020-09-15 2020-12-22 中国科学技术大学 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform
CN112232058A (en) * 2020-10-15 2021-01-15 济南大学 False news identification method and system based on deep learning three-layer semantic extraction framework
CN112990454A (en) * 2021-02-01 2021-06-18 国网安徽省电力有限公司检修分公司 Neural network calculation acceleration method and device based on integrated DPU multi-core isomerism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIXING LI 等: "A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks", 《ACM JOURNAL ON EMERGING TECHNOLOGIES IN COMPUTING SYSTEMS》 *
ZHANG Kunning et al.: "Design of a multi-core scalable convolution accelerator based on FPGA", 《计算机工程与设计》 (Computer Engineering and Design) *
XU Chang et al.: "Design and implementation of a DPU-accelerated CNN inference system", 《电脑编程技巧与维护》 (Computer Programming Skills & Maintenance) *

Also Published As

Publication number Publication date
CN113673704B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US20180164866A1 (en) Low-power architecture for sparse neural network
Li et al. Laius: An 8-bit fixed-point CNN hardware inference engine
CN112633490B (en) Data processing device, method and related product for executing neural network model
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
Yu et al. Instruction driven cross-layer cnn accelerator for fast detection on fpga
Wang et al. Briefly Analysis about CNN Accelerator based on FPGA
Lin et al. A high-speed low-cost CNN inference accelerator for depthwise separable convolution
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
Feng et al. The implementation of LeNet-5 with NVDLA on RISC-V SoC
Lin et al. High utilization energy-aware real-time inference deep convolutional neural network accelerator
Kim et al. A low-power graph convolutional network processor with sparse grouping for 3d point cloud semantic segmentation in mobile devices
Lu et al. An 176.3 GOPs object detection CNN accelerator emulated in a 28nm CMOS technology
Hu et al. On-chip instruction generation for cross-layer CNN accelerator on FPGA
CN113673704B (en) Relational network reasoning optimization method based on software and hardware cooperative acceleration
Gong et al. RAODAT: An energy-efficient reconfigurable AI-based object detection and tracking processor with online learning
Hu et al. High-performance reconfigurable DNN accelerator on a bandwidth-limited embedded system
Islam et al. An uninterrupted processing technique-based high-throughput and energy-efficient hardware accelerator for convolutional neural networks
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
Wang et al. DLA+: A light aggregation network for object classification and detection
Yang et al. Implementation of Reconfigurable CNN-LSTM Accelerator Based on FPGA
Cui et al. Design and Implementation of OpenCL-Based FPGA Accelerator for YOLOv2
Wen FPGA-Based Deep Convolutional Neural Network Optimization Method
Tsai et al. Hardware Architecture Design for Hand Gesture Recognition System on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant