CN113673704A - Relational network reasoning optimization method based on software and hardware cooperative acceleration - Google Patents

Relational network reasoning optimization method based on software and hardware cooperative acceleration

Info

Publication number
CN113673704A
CN113673704A (application number CN202110757032.6A)
Authority
CN
China
Prior art keywords
simd
layer
calculation
output
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110757032.6A
Other languages
Chinese (zh)
Other versions
CN113673704B (en)
Inventor
张志超
刘忠麟
蒋丽婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110757032.6A priority Critical patent/CN113673704B/en
Publication of CN113673704A publication Critical patent/CN113673704A/en
Application granted granted Critical
Publication of CN113673704B publication Critical patent/CN113673704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a relational network reasoning optimization method based on software and hardware cooperative acceleration, which addresses the speed and efficiency problems of relational network inference computation. It comprises a support set feature extraction process running on an X86/GPU platform and a relational network image classification inference design running on an FPGA chip. The support set feature extraction process comprises the following steps: support set image data of different categories are received; feature extraction is performed on the support sets of the different categories, and different feature pools are constructed as the offline support set features. The relational network image classification inference design running on the FPGA chip adopts a test image feature extraction module and a relation computation module constructed on the FPGA chip: the offline support set features are received and stored in DRAM; the test image feature extraction module receives the test image and performs feature extraction to obtain the test image features; the relation computation module performs relational network inference computation to obtain a relation score. The FPGA-based relational network image classification inference design adopts a multi-core interconnection design.

Description

Relational network reasoning optimization method based on software and hardware cooperative acceleration
Technical Field
The invention relates to the technical field of software and hardware cooperative acceleration, in particular to a relational network reasoning optimization method based on software and hardware cooperative acceleration.
Background
The convolutional neural network technology based on deep learning is widely applied to image processing tasks, and large-scale deep models such as AlexNet, VGG16 and ResNet, trained on the large-scale labeled Imagenet data set, show high identification accuracy. However, in few-sample learning applications, especially for classification tasks on new unknown classes, new learning modes and methods are needed, including few-sample learning methods such as Matching Nets, Meta Nets, MAML, Prototypical Nets and Relation Nets. These models are trained by constructing multiple batches of tasks over different classes, and a support set is introduced as prior knowledge to handle classification of unknown-class tasks; the relation network achieves higher recognition accuracy than the other models on the Omniglot data set and the miniImagenet data set.
The relation network constructs its feature extraction module and relation computation module from shallow convolution blocks. Common inference computation uses a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU); CPU processing is slow, and GPU processing has low energy efficiency. Relation networks and few-sample learning techniques [4-8] are limited by the over-fitting problem of a small number of samples in large model networks, so the feature extraction module and relation computation module are usually built from shallow convolution blocks; the computational complexity and model parameter storage are relatively low, which suits an acceleration mode based on Field Programmable Gate Array (FPGA) processing.
Typical relation network inference takes a support set and a test image as input and outputs a relation score; the category of the test image is finally selected from the relation scores over multiple support sets, and the same test image can produce different results under relation computations with different support sets. Different C-Way K-Shot inference tasks can be formed according to the number of support set categories (C-Way) and the number of support images per category (K-Shot). Relational network inference computation comprises the convolution computation of two modules, a feature extraction module and a relation computation module, and mainly consists of convolution, maximum pooling and fully-connected computation. The feature extraction module extracts features of both the support set and the test image. For support set feature extraction in the C-Way K-Shot mode, when K = 1 the convolution output feature map of a single image is output, and when K > 1 the element-wise accumulation of the convolution output features of the K images is output. After the support set features and test image features are extracted, the test image features and support set features are concatenated into an input feature map and sent to the relation computation module, which computes the feature similarity as an output value; over the C-Way feature values, the category with the largest score is selected as the output category of the test image.
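To make the K-Shot accumulation concrete, the following C++ sketch (illustrative only, not taken from the patent; extract_features stands in for the four-convolution-block extractor and is stubbed as the identity so the example runs) builds the support feature of one class by element-wise accumulation when K > 1:

#include <cstddef>
#include <vector>

// Flattened H x W x C feature map.
using FeatureMap = std::vector<float>;

// Placeholder for the convolution-block feature extractor described above;
// stubbed as the identity so this sketch compiles and runs.
FeatureMap extract_features(const std::vector<float>& image) { return image; }

// C-Way K-Shot support feature for one class (k_images assumed non-empty):
// for K == 1, the feature map of the single image; for K > 1, the
// element-wise accumulation over the K images of the class.
FeatureMap support_class_feature(const std::vector<std::vector<float>>& k_images) {
    FeatureMap acc = extract_features(k_images[0]);
    for (std::size_t k = 1; k < k_images.size(); ++k) {
        FeatureMap f = extract_features(k_images[k]);
        for (std::size_t i = 0; i < acc.size(); ++i)
            acc[i] += f[i];  // accumulate corresponding elements
    }
    return acc;
}

The C per-class features produced this way form the offline feature pool that the FPGA side later reads from DRAM.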
Relational network inference involves relation computations between the test image and multiple batches of support sets, and the accuracy and confidence of the relation score depend on the number of support sets. Multi-batch support set relation computation introduces very large computation and storage overhead; relational network inference based on general-purpose CPU and GPU processing is slow and energy-hungry, so an efficient inference accelerator needs to be designed to improve the efficiency of relational network inference computation.
Disclosure of Invention
In view of this, the present invention provides a relational network reasoning optimization method based on software and hardware cooperative acceleration, which addresses the speed and efficiency problems of relational network inference computation: a software and hardware cooperative acceleration mode improves processing throughput and efficiency without reducing the accuracy of the relational network inference computation.
In order to achieve the purpose, the technical scheme of the invention is as follows: a relational network reasoning optimization method based on software and hardware cooperative acceleration comprises a support set feature extraction process running on an X86/GPU platform and a relational network image classification reasoning process running on an FPGA chip.
The support set feature extraction process specifically comprises the following steps:
receiving image data of support sets of different categories, wherein the support sets are standard data sets labeled by experts, the image data are consistent in size, and classification labels are fixed; the image data has C types, and each type contains K images.
Carrying out feature extraction on support sets of different categories, and constructing different feature pools as features of the offline support sets; the offline support set features contain K feature pools, each of which contains class C features.
The relational network image classification reasoning design running on the FPGA chip adopts a test image feature extraction module and a relational computation module constructed on the FPGA chip, and specifically executes the following procedures:
and receiving the characteristics of the offline support set and storing the characteristics into a Dynamic Random Access Memory (DRAM) on the FPGA board card.
And the test image feature extraction module receives the test image and performs feature extraction on the test image to obtain the test image features.
And the relation calculation module performs relation network reasoning calculation by using the test image characteristics and the offline support set characteristics to obtain a relation score.
The FPGA-based relational network image classification inference process adopts a multi-core interconnection design, wherein M relational computation modules, N test image feature extraction modules, a DRAM (dynamic random access memory) and a control interface are arranged in the relational network image classification inference process, and all the modules are interconnected through an AXI (advanced extensible interface) interconnection bus.
Further, the relational network inference computation using the test image features and the offline support set features to obtain a relation score is specifically as follows:
The relation network comprises a test image feature extraction module and a relation computation module.
The feature extraction module consists of 4 convolution blocks and 2 maximum pooling computation layers connected in sequence.
The relation computation module consists of 2 convolution blocks, 2 maximum pooling computation layers and 2 fully-connected computation layers connected in sequence.
Further, the convolution block adopts the following flow:
Set the weight data W of the convolution block; the input feature map of the convolution block is In, and the output feature map of the convolution block is Out.
The input feature map In of the convolution block has dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels of the convolution block. The weight data W of the convolution block has dimension OC × K² × IC, where OC is the number of output feature map channels of the convolution block and K is the convolution kernel size. The output of the convolution block is the output feature map Out with dimension OH × OW × OC, where OH is the output feature map height and OW the output feature map width of the convolution block.
PE processing units are set, and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data; the input feature map is scheduled as data of width PE × SIMD, the output of each PE computation is the output of one neuron, and SIMD computation units are used inside each PE.
Five nested loops are set and only the innermost ones are unrolled; from the outermost to the innermost they are the first through fifth convolution loops, with loop variables h, w, c, pe and simd respectively.
The initial values of h, w, c, pe and simd are set to 0 and each is incremented by 1 per iteration; the upper bound of h is OH, of w is OW, of c is OC, of pe is PE, and of simd is SIMD.
The fifth (innermost) convolution loop is set to
Out[w][h][c/PE+pe] += In[pe][simd] × W[pe][simd]
where In[pe][simd] is the parameter at row pe, column simd of the convolution block input feature map; W[pe][simd] is the parameter at row pe, column simd of the convolution block weight data W; Out[w][h][c/PE+pe] is the parameter of the convolution block output feature map at indices w, h and c/PE+pe; and += accumulates the right-hand side into the left-hand side.
Further, the maximum pooling computation layer specifically adopts the following flow:
The pooling input feature map is In1; weight data W1 is set; the output feature map is Out1.
The pooling input feature map In1 has dimension IH1 × IW1 × IC1, where IH1 is the pooling input feature map height, IW1 the width, and IC1 the number of channels. The pooling weight data W1 has dimension OC1 × K1² × IC1, where OC1 is the number of pooling output feature map channels and K1 is the pooling kernel size. The pooling output is the output feature map Out1 with dimension OH1 × OW1 × OC1, where OH1 is the pooling output feature map height and OW1 the width.
Pool_Size is set as the pooling size of the maximum pooling computation layer, and max as the maximum-value selection.
Six nested loops are set and only the innermost loop is unrolled, with parallelism PE1; from the outermost to the innermost they are the first through sixth pooling loops, with loop variables h1, ph, w1, pw, c1 and pe1 respectively.
The initial values of h1, ph, w1, pw, c1 and pe1 are set to 0 and each is incremented by 1 per iteration; the upper bound of h1 is OH1, of ph is Pool_Size, of w1 is OW1, of pw is Pool_Size, of c1 is OC1, and of pe1 is PE1.
The sixth (innermost) pooling loop is set to
Out1[w1][h1][c1/PE1+pe1] = max(In1[pe1], Out1[w1][h1][c1/PE1+pe1])
where Out1[w1][h1][c1/PE1+pe1] is the parameter of the pooling output feature map at indices [w1][h1][c1/PE1+pe1]; the Out1 matrix initially stores negative infinity; max() returns the maximum of the values in parentheses; and In1[pe1] denotes row pe1 of the pooling input feature map.
Further, the fully-connected computation layer specifically adopts the following flow:
The input feature map of the fully-connected computation layer is In2; the weight data of the fully-connected computation layer is W2; the output feature map of the fully-connected computation layer is Out2.
The input feature map In2 of the fully-connected computation layer has dimension IH2 × IW2 × IC2, where IH2 is the input feature map height, IW2 the width, and IC2 the number of input feature map channels of the fully-connected computation layer. The weight data W2 of the fully-connected computation layer has dimension OC2 × K2² × IC2, where OC2 is the number of output feature map channels and K2 is the convolution kernel size of the fully-connected computation layer. The output of the fully-connected computation layer is the output feature map Out2 with dimension OH2 × OW2 × OC2, where OH2 is the output feature map height and OW2 the width.
The fully-connected computation layer is set to comprise 3 nested loops, from the outer layer to the inner layer the c loop, the pe loop and the simd loop; only the pe loop and the simd loop are unrolled, giving a processing module with PE2 × SIMD2 parallelism, where each of the PE2 units computes the output of one fully-connected neuron with SIMD2 multiply-accumulate units computing in parallel inside.
The loop variable of the c loop is set to c2, of the pe loop to pe2, and of the simd loop to simd2.
The initial values of c2, pe2 and simd2 are set to 0 and each is incremented by 1 per iteration; the upper bound of c2 is OC2, of pe2 is PE2, and of simd2 is SIMD2.
The simd loop is set to:
Out2[c2/PE2+pe2] += In2[pe2][simd2] × W2[pe2][simd2]
where Out2[c2/PE2+pe2] is the (c2/PE2+pe2)-th parameter of the fully-connected computation layer output feature map Out2; In2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer input feature map In2; W2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer weight data; and += accumulates the right-hand side into the left-hand side.
Advantageous effects:
The embodiment of the invention provides a relational network reasoning optimization method based on software and hardware cooperative acceleration. Aiming at the requirement for efficient relational network inference computation, a software and hardware cooperative acceleration mode improves processing throughput and efficiency without reducing the accuracy of the relational network inference computation. For support set feature extraction, a CPU/GPU mode builds a support set feature pool shared by the subsequent FPGA inference accelerator, saving computation overhead; for the design of the on-chip FPGA computing units of the relation network, high-level synthesis (HLS) loop optimization and a heterogeneous multi-core mode improve processing energy efficiency and processing throughput.
Drawings
FIG. 1 is a schematic diagram of a relational network inference calculation process based on software and hardware cooperative acceleration processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of multi-core interconnection design on a relational network reasoning computing chip;
FIG. 3 is a schematic diagram of a generic relational network algorithm inference calculation module;
FIG. 4 is a schematic diagram of an example on-chip multi-core interconnection design of Omniglot 28;
FIG. 5 is a schematic diagram of an example of the design of multi-core interconnection on a miniImagenet84 chip.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a relational network reasoning optimization method based on software and hardware cooperative acceleration, which comprises a feature extraction process running on an X86/GPU platform and a relational network image classification inference process running on an FPGA chip.
the characteristic extraction process specifically comprises the following steps:
receiving image data of support sets of different categories, wherein the support sets are standard data sets labeled by experts, the image data are consistent in size, and classification labels are fixed; the image data has C types, and each type contains K images.
Carrying out feature extraction on support sets of different categories, and constructing different feature pools as features of the offline support sets; the offline support set features contain K feature pools, each of which contains class C features.
The relational network image classification reasoning process adopts a test image feature extraction module and a relational computation module, and specifically executes the following processes:
the offline support set features are received and stored in a Dynamic Random Access Memory (DRAM).
The test image feature extraction module receives the test image and performs feature extraction on it to obtain the test image features.
The relation computation module performs relational network inference computation using the test image features and the offline support set features to obtain a relation score.
A multi-core interconnection design is adopted in a relational network image classification inference process in an FPGA chip, wherein M relational computation modules, N test image feature extraction modules, a DRAM and a control interface are arranged in the relational network image classification inference process, and all the modules are interconnected through an AXI interconnection bus.
Aiming at the slow speed and high energy consumption of relational network inference computation, a software and hardware cooperative acceleration computing mode improves processing speed and efficiency. The CPU or GPU extracts the reusable support set features for inference computation, reducing computation cost; a feature extraction module and a relation computation module designed with high-level synthesis (HLS) loop optimization improve the speed and efficiency of relational network inference computation; and a heterogeneous multi-core design comprehensively utilizes the processing capacity of multiple cores to further improve the relational network inference computation speed. The specific relational network inference computation flow based on software and hardware cooperative acceleration processing is shown in fig. 1.
C-Way K-Shot classification inference tasks based on the relation network are accelerated through cooperative computation between an X86/GPU platform and an FPGA platform. The X86/GPU platform performs support set feature extraction, and the FPGA platform performs test image feature extraction and relation computation under the support set features. The support set is generally a standard data set labeled by experts with fixed classification labels (consistent image data and image size), so the support set features can be computed in advance on an X86 or GPU platform (with an existing feature extraction algorithm, namely the feature extraction algorithm of the relation network) to form a support set feature pool, which is convenient for subsequent computation and saves computation time and energy. For different C-Way K-Shot tasks (C image categories, K images per category), different feature pools can be constructed; typically 1-Shot and K-Shot feature pools are built, where K is 1, 5 or 20 for the Omniglot data set and 1 or 5 for the miniImagenet data set.
Relational network image classification reasoning process
The relational network FPGA inference computation process caches the offline support set features directly in a Dynamic Random Access Memory (DRAM), saving the computation time and energy of support set feature extraction; a feature extraction module and a relation computation module are configured on chip, computation modules of different sizes are constructed according to the required computation speed and the on-chip resource constraints, and the multiple computation modules are configured as a heterogeneous multi-core system for cooperative acceleration, further improving the inference computation capability.
Heterogeneous multi-core based on-chip processing system design
The relational network FPGA inference computation module likewise caches the offline support set features directly in DRAM to save the computation time and energy of support set feature extraction; a test image feature extraction module and a relation computation module are configured on chip, computation modules of different sizes are constructed according to the required computation speed and the on-chip resource constraints, and multiple computation modules are configured as a heterogeneous multi-core system for cooperative acceleration. The specific on-chip multi-core interconnection design for relational network inference computation is shown in fig. 2 and comprises M relation computation modules, N test image feature extraction modules, a DRAM and a control interface; the modules are interconnected through an AXI (Advanced eXtensible Interface) bus, image data and module control instructions are transmitted to the modules through the control interface, and output results are returned through the control interface.
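As a rough host-side illustration of how the N feature cores and M relation cores cooperate (the run_* functions are hypothetical stand-ins for control-interface transactions over the AXI interconnect, stubbed here so the sketch runs; they are not the patent's API):

#include <cstddef>
#include <vector>

// Stub: pretend the feature-extraction core returns the image itself.
std::vector<float> run_feature_core(std::size_t /*core*/,
                                    const std::vector<float>& image) {
    return image;
}

// Stub: pretend the relation core scores a pair by dot product.
float run_relation_core(std::size_t /*core*/, const std::vector<float>& a,
                        const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) s += a[i] * b[i];
    return s;
}

// Classify one test image against the C offline support-set features cached
// in DRAM: extract the test feature once, score it against every class on
// the relation cores (round-robin over the M cores), return the arg-max class.
std::size_t classify(const std::vector<float>& image,
                     const std::vector<std::vector<float>>& support_features,
                     std::size_t m_relation_cores) {
    std::vector<float> test_feature = run_feature_core(0, image);
    std::size_t best_class = 0;
    float best_score = -1e30f;
    for (std::size_t c = 0; c < support_features.size(); ++c) {
        float score = run_relation_core(c % m_relation_cores,
                                        test_feature, support_features[c]);
        if (score > best_score) { best_score = score; best_class = c; }
    }
    return best_class;
}

With several test images in flight, the images themselves would be spread over the N feature-extraction cores in the same round-robin fashion.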
FPGA reasoning calculation module design of relational network
The inference module of the relation network comprises a feature extraction module and a relation computation module, each composed of basic convolution blocks. A convolution block comprises a 3 × 3 convolution with 64 output neurons, followed by BatchNorm and ReLU to produce the block output. The feature extraction module is composed of 4 convolution blocks and 2 maximum pooling computations, and the relation computation module is composed of 2 convolution blocks, 2 maximum pooling computations and 2 fully-connected computations. Fully-connected layer 1 has dimension H × 8 and fully-connected layer 2 has dimension 8 × 1, finally outputting the relation score. The specific structure of the relational network inference computation module is shown in fig. 3.
FPGA convolution, pooling and fully-connected acceleration optimization design based on HLS loop optimization
The relation network uses multiple convolution blocks, maximum pooling layers and fully-connected layers to form the feature extraction module and the relation computation module. This part applies HLS-based optimization to unroll the convolution, pooling and fully-connected computation loops, increasing the parallel processing capacity of the computation modules and thereby improving their throughput. For the multi-layer computation of the feature extraction module and the relation computation module, a dataflow optimization mode forms a single computation unit, which simplifies scheduling outside the computation module: the input is an image or feature map and the output is a feature map or relation score.
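The dataflow composition can be pictured with the following Vivado/Vitis HLS-style sketch (the layer functions and stream widths are assumptions and their bodies are omitted; hls::stream and the DATAFLOW pragma are the real HLS constructs being illustrated):

#include "hls_stream.h"  // Vivado/Vitis HLS stream type

// Placeholder layer stages: each reads its input stream and writes its
// output stream, implemented internally with the loop structures of
// Algorithms 1-3.
void conv_block(hls::stream<float>& in, hls::stream<float>& out);
void max_pool(hls::stream<float>& in, hls::stream<float>& out);

// Abbreviated feature-extraction unit; the real module chains 4 convolution
// blocks and 2 maximum pooling layers in this same style, so the unit
// consumes an image stream and emits a feature-map stream with no
// intermediate off-chip transfers.
void feature_extractor_top(hls::stream<float>& image,
                           hls::stream<float>& feature) {
#pragma HLS DATAFLOW  // run the stages below as a task-level pipeline
    hls::stream<float> s1, s2;
    conv_block(image, s1);
    max_pool(s1, s2);
    conv_block(s2, feature);
}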
(1) Convolution block
The convolution block performs convolution multiply-accumulate computation of the input feature map and the weight data. The computation input of the convolution multiply-accumulate module is the input feature map with dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels, together with the weight data W of dimension OC × K² × IC, where OC is the number of output feature map channels and K is the convolution kernel size. The output is the output feature map with dimension OH × OW × OC.
The convolution multiply-accumulate computation module includes multiple processing elements (PEs), and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data. The input feature map is scheduled as data of width PE × SIMD; the output of each PE computation is the output of one neuron, SIMD computation units inside each PE further increase the parallelism, and the SIMD-related data complete the multiply-accumulate computation within a neuron. The convolution weights are likewise scheduled as data of PE × SIMD width, corresponding to the input. The specific convolution multiply-accumulate computation is shown as Algorithm 1 and has five nested loops, of which the two loops over pe and simd are unrolled, giving PE × SIMD parallelism. Each convolution layer can configure PE and SIMD independently according to its computation amount and the input and output rates of the preceding and following layers, improving processing speed while saving on-chip resources as much as possible.
(Algorithm 1, the convolution multiply-accumulate pseudocode, is presented as an image in the original publication.)
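In place of the image, the following plain C++ sketch reconstructs the five-loop structure from the textual description. Two points are assumptions rather than the patent's text: the c loop is taken over OC/PE channel groups with output channel index c*PE + pe (the prose writes c/PE + pe, which does not enumerate channels uniquely as literally written), and read_beat is a stand-in for the on-chip scheduling that delivers one PE × SIMD-wide slice of inputs and weights:

#include <cstring>

constexpr int OH = 28, OW = 28, OC = 64;  // example layer sizes (assumed)
constexpr int PE = 4, SIMD = 8;           // per-layer parallelism configuration

// Scheduling stub: fills one PE x SIMD beat of inputs and weights for the
// current (h, w, c) position; a real design streams this from on-chip buffers.
void read_beat(int h, int w, int c, float in[PE][SIMD], float wgt[PE][SIMD]) {
    (void)h; (void)w; (void)c;
    std::memset(in, 0, sizeof(float) * PE * SIMD);
    std::memset(wgt, 0, sizeof(float) * PE * SIMD);
}

// Algorithm 1 (reconstruction): five nested loops; in hardware the pe and
// simd loops are unrolled, giving PE x SIMD multiply-accumulates per cycle.
// Out is assumed pre-initialized (e.g. to zero) by the caller.
void conv_mac(float Out[OW][OH][OC]) {
    for (int h = 0; h < OH; ++h)
        for (int w = 0; w < OW; ++w)
            for (int c = 0; c < OC / PE; ++c) {  // channel groups (assumed)
                float in[PE][SIMD], wgt[PE][SIMD];
                read_beat(h, w, c, in, wgt);
                for (int pe = 0; pe < PE; ++pe)              // unrolled in hardware
                    for (int simd = 0; simd < SIMD; ++simd)  // unrolled in hardware
                        Out[w][h][c * PE + pe] += in[pe][simd] * wgt[pe][simd];
            }
}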
The convolution block adopts the following flow:
Set the weight data W of the convolution block; the input feature map of the convolution block is In, and the output feature map of the convolution block is Out.
The input feature map In of the convolution block has dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels of the convolution block. The weight data W of the convolution block has dimension OC × K² × IC, where OC is the number of output feature map channels of the convolution block and K is the convolution kernel size. The output of the convolution block is the output feature map Out with dimension OH × OW × OC, where OH is the output feature map height and OW the output feature map width of the convolution block.
PE processing units are set, and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data; the input feature map is scheduled as data of width PE × SIMD, the output of each PE computation is the output of one neuron, and SIMD computation units are used inside each PE.
Five nested loops are set and only the innermost ones are unrolled; from the outermost to the innermost they are the first through fifth convolution loops, with loop variables h, w, c, pe and simd respectively.
The initial values of h, w, c, pe and simd are set to 0 and each is incremented by 1 per iteration; the upper bound of h is OH, of w is OW, of c is OC, of pe is PE, and of simd is SIMD.
The fifth (innermost) convolution loop is set to
Out[w][h][c/PE+pe] += In[pe][simd] × W[pe][simd]
where In[pe][simd] is the parameter at row pe, column simd of the convolution block input feature map; W[pe][simd] is the parameter at row pe, column simd of the convolution block weight data W; Out[w][h][c/PE+pe] is the parameter of the convolution block output feature map at indices w, h and c/PE+pe; and += accumulates the right-hand side into the left-hand side.
(2) Fully connected computing layer
The fully-connected computation layer performs fully-connected multiply-accumulate computation of the input feature map and the weight data, and outputs the result as an output feature map. The input and output feature maps of the fully-connected computation module are consistent with those of the convolution computation module, with H × W × C data dimensions. The fully-connected computation can be abstracted as a convolution computation with K = 1; the specific fully-connected multiply-accumulate computation, shown as Algorithm 2, comprises three nested loops. The algorithm unrolls the pe loop and the simd loop, giving a processing module with PE × SIMD parallelism, where each PE computes the output of one fully-connected neuron and SIMD multiply-accumulate units compute in parallel inside the neuron. Each fully-connected layer instance can use a specific PE and SIMD configuration according to the input and output rates of the preceding and following layers, satisfying the processing requirement while saving on-chip resource consumption.
(Algorithm 2, the fully-connected multiply-accumulate pseudocode, is presented as an image in the original publication.)
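In place of the image, this sketch reconstructs the three-loop structure under the same assumptions as the convolution sketch (PE2-sized channel groups with output index c2*PE2 + pe2, and a scheduling stub read_fc_beat):

#include <cstring>

constexpr int OC2 = 8, PE2 = 2, SIMD2 = 4;  // example configuration (assumed)

// Scheduling stub: one PE2 x SIMD2 beat of inputs and weights for channel
// group c2.
void read_fc_beat(int c2, float in2[PE2][SIMD2], float w2[PE2][SIMD2]) {
    (void)c2;
    std::memset(in2, 0, sizeof(float) * PE2 * SIMD2);
    std::memset(w2, 0, sizeof(float) * PE2 * SIMD2);
}

// Algorithm 2 (reconstruction): c loop outside; the pe and simd loops are
// unrolled in hardware, each pe computing one fully-connected neuron output.
// Out2 is assumed pre-initialized by the caller.
void fc_mac(float Out2[OC2]) {
    for (int c2 = 0; c2 < OC2 / PE2; ++c2) {
        float in2[PE2][SIMD2], w2[PE2][SIMD2];
        read_fc_beat(c2, in2, w2);
        for (int pe2 = 0; pe2 < PE2; ++pe2)              // unrolled in hardware
            for (int simd2 = 0; simd2 < SIMD2; ++simd2)  // unrolled in hardware
                Out2[c2 * PE2 + pe2] += in2[pe2][simd2] * w2[pe2][simd2];
    }
}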
The fully-connected computation layer specifically adopts the following flow:
The input feature map of the fully-connected computation layer is In2; the weight data of the fully-connected computation layer is W2; the output feature map of the fully-connected computation layer is Out2.
The input feature map In2 of the fully-connected computation layer has dimension IH2 × IW2 × IC2, where IH2 is the input feature map height, IW2 the width, and IC2 the number of input feature map channels of the fully-connected computation layer. The weight data W2 of the fully-connected computation layer has dimension OC2 × K2² × IC2, where OC2 is the number of output feature map channels and K2 is the convolution kernel size of the fully-connected computation layer. The output of the fully-connected computation layer is the output feature map Out2 with dimension OH2 × OW2 × OC2, where OH2 is the output feature map height and OW2 the width.
The fully-connected computation layer is set to comprise 3 nested loops, from the outer layer to the inner layer the c loop, the pe loop and the simd loop; only the pe loop and the simd loop are unrolled, giving a processing module with PE2 × SIMD2 parallelism, where each of the PE2 units computes the output of one fully-connected neuron with SIMD2 multiply-accumulate units computing in parallel inside.
The loop variable of the c loop is set to c2, of the pe loop to pe2, and of the simd loop to simd2.
The initial values of c2, pe2 and simd2 are set to 0 and each is incremented by 1 per iteration; the upper bound of c2 is OC2, of pe2 is PE2, and of simd2 is SIMD2.
The simd loop is set to:
Out2[c2/PE2+pe2] += In2[pe2][simd2] × W2[pe2][simd2]
where Out2[c2/PE2+pe2] is the (c2/PE2+pe2)-th parameter of the fully-connected computation layer output feature map Out2; In2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer input feature map In2; W2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer weight data; and += accumulates the right-hand side into the left-hand side.
(3) Maximum pooling computation layer
The maximum pooling computation layer performs maximum pooling over the input feature map; its input and output feature maps are consistent with those of the convolution and fully-connected layers, and it is combined into the computation unit for dataflow scheduling in the multi-layer network. The maximum pooling computation is shown as Algorithm 3 and comprises six nested loops; only the innermost loop is unrolled, with unroll factor PE, i.e. parallelism PE, and each pooling layer is configured independently according to its computation requirements. Pool_Size is the pooling size and max is the maximum-value selection. Input scheduling arranges the data of the input feature map In to meet the computation requirements of the six-loop pseudocode.
(Algorithm 3, the maximum pooling pseudocode, is presented as an image in the original publication.)
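In place of the image, the six-loop structure reconstructed from the text looks as follows (same channel-group assumption as the convolution sketch; Out1 is initialized to negative infinity before the max reduction, as the text states, and read_pool_beat stands in for the input scheduling):

#include <algorithm>
#include <cstring>
#include <limits>

constexpr int OH1 = 14, OW1 = 14, OC1 = 64;  // example layer sizes (assumed)
constexpr int PE1 = 4;                       // innermost-loop parallelism
constexpr int POOL_SIZE = 2;                 // Pool_Size

// Scheduling stub: PE1 input values for the current pooling-window position
// and channel group.
void read_pool_beat(int h1, int ph, int w1, int pw, int c1, float in1[PE1]) {
    (void)h1; (void)ph; (void)w1; (void)pw; (void)c1;
    std::memset(in1, 0, sizeof(float) * PE1);
}

// Algorithm 3 (reconstruction): six nested loops; only the pe1 loop is
// unrolled in hardware.
void max_pool_layer(float Out1[OW1][OH1][OC1]) {
    for (int w = 0; w < OW1; ++w)          // initialize to negative infinity
        for (int h = 0; h < OH1; ++h)
            for (int c = 0; c < OC1; ++c)
                Out1[w][h][c] = -std::numeric_limits<float>::infinity();

    for (int h1 = 0; h1 < OH1; ++h1)
        for (int ph = 0; ph < POOL_SIZE; ++ph)
            for (int w1 = 0; w1 < OW1; ++w1)
                for (int pw = 0; pw < POOL_SIZE; ++pw)
                    for (int c1 = 0; c1 < OC1 / PE1; ++c1) {
                        float in1[PE1];
                        read_pool_beat(h1, ph, w1, pw, c1, in1);
                        for (int pe1 = 0; pe1 < PE1; ++pe1)  // unrolled in hardware
                            Out1[w1][h1][c1 * PE1 + pe1] = std::max(
                                in1[pe1], Out1[w1][h1][c1 * PE1 + pe1]);
                    }
}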
That is, the maximum pooling computation layer specifically adopts the following flow:
The pooling input feature map is In1; weight data W1 is set; the output feature map is Out1.
The pooling input feature map In1 has dimension IH1 × IW1 × IC1, where IH1 is the pooling input feature map height, IW1 the width, and IC1 the number of channels. The pooling weight data W1 has dimension OC1 × K1² × IC1, where OC1 is the number of pooling output feature map channels and K1 is the pooling kernel size. The pooling output is the output feature map Out1 with dimension OH1 × OW1 × OC1, where OH1 is the pooling output feature map height and OW1 the width.
Pool_Size is set as the pooling size of the maximum pooling computation layer, and max as the maximum-value selection.
Six nested loops are set and only the innermost loop is unrolled, with parallelism PE1; from the outermost to the innermost they are the first through sixth pooling loops, with loop variables h1, ph, w1, pw, c1 and pe1 respectively.
The initial values of h1, ph, w1, pw, c1 and pe1 are set to 0 and each is incremented by 1 per iteration; the upper bound of h1 is OH1, of ph is Pool_Size, of w1 is OW1, of pw is Pool_Size, of c1 is OC1, and of pe1 is PE1.
The sixth (innermost) pooling loop is set to
Out1[w1][h1][c1/PE1+pe1] = max(In1[pe1], Out1[w1][h1][c1/PE1+pe1])
where Out1[w1][h1][c1/PE1+pe1] is the parameter of the pooling output feature map at indices [w1][h1][c1/PE1+pe1]; the Out1 matrix initially stores negative infinity; max() returns the maximum of the values in parentheses; and In1[pe1] denotes row pe1 of the pooling input feature map.
Relational network inference tasks are constructed for two data sets of different sizes, Omniglot and miniImagenet, with input image dimensions of 28 × 28 × 1 and 84 × 84 × 3 respectively, and H in the fully-connected layer of 64 and 576 respectively. Because the two data sets differ in input size and internal configuration, the computation amount of each module differs, so separate accelerators are designed for the Omniglot and miniImagenet data sets for inference acceleration.
Example 1 Omniglot28 relational network inference design
The Omniglot28 on-chip multi-core design is shown in fig. 4 and includes a control interface, an AXI (Advanced eXtensible Interface) interconnect, DRAM and computation modules. The computation modules comprise one feature extraction module and 4 relation computation modules. The control interface handles the host's control instruction transmission and data transmission and controls the specific relation computation and feature extraction modules. The multiple computation modules share the 4 GB memory space of one DRAM and can run in parallel simultaneously, using the multi-core capability to provide higher operation throughput.
The Omniglot28 accelerator configuration is shown in Table 1; for simplicity of description, the PE and SIMD configurations of the feature extraction module and the relation computation modules are presented in one table. The feature extraction module comprises the parts cnv1 to cnv4, and the relation computation module comprises the parts cnv5 to fc2. The parallel granularity, and hence the throughput, of each module can be controlled by setting the PE and SIMD sizes, but it is limited by the 32-bit width of the floating-point data and by the wide lines and large computation arrays that large PE and SIMD values bring, which make timing closure difficult during Vivado synthesis; therefore resources are used in a multi-core mode to further improve throughput. The feature extraction module of the Omniglot28 accelerator is designed for 2198.39 FPS, and the relation computation module for 7430.56 FPS.
TABLE 1 Omniglot28 Accelerator configuration
(Table 1 is presented as an image in the original publication.)
Example 2 MiniImagenet84 relationship network inference design
The miniImagenet84 on-chip multi-core interconnection design is shown in FIG. 5, and includes a heterogeneous multi-core acceleration on-chip system composed of a feature extraction module and two relation calculation modules.
The miniImagenet84 accelerator configuration is shown in Table 2; for simplicity of description, the PE and SIMD configurations of the feature extraction module and the relation computation modules are presented in one table. The feature extraction module comprises the parts cnv1 to cnv4, and the relation computation module comprises the parts cnv5 to fc2. The parallel granularity, and hence the throughput, of each module can be controlled by setting the PE and SIMD sizes. The feature extraction module of the miniImagenet84 accelerator is designed for 221.02 FPS, and the relation computation module for 642.78 FPS.
Table 2 miniImagenet84 accelerator configuration
(Table 2 is presented as an image in the original publication.)
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A relational network reasoning optimization method based on software and hardware cooperative acceleration is characterized by comprising a support set feature extraction process running on an X86/GPU platform and a relational network image classification reasoning design running on an FPGA platform;
the support set feature extraction process specifically comprises the following steps:
receiving image data of support sets of different categories, wherein the support sets are standard data sets labeled by experts, the image data are consistent in size, and classification labels are fixed; the image data has C types, and each type comprises K images;
carrying out feature extraction on support sets of different categories, and constructing different feature pools as features of the offline support sets; the offline support set features comprise K feature pools, and each feature pool comprises C-type features;
the relational network image classification reasoning design running on the FPGA platform adopts a test image feature extraction module and a relational computation module which are constructed on an FPGA chip to specifically execute the following procedures
Receiving the characteristics of the offline support set, storing the characteristics into a Dynamic Random Access Memory (DRAM) on the FPGA board card;
the test image feature extraction module receives the test image and performs feature extraction on the test image to obtain test image features;
the relation calculation module performs relation network reasoning calculation by using the test image characteristics and the offline support set characteristics to obtain a relation score;
the relational network image classification reasoning process in the FPGA chip adopts a multi-core interconnection design, wherein M relational computation modules, N test image feature extraction modules, a DRAM and a control interface are arranged in the relational network image classification reasoning process, and all the modules are interconnected through an AXI interconnection bus.
2. The method of claim 1, wherein the test image feature extraction module consists of 4 convolution blocks and 2 maximum pooling computation layers connected in sequence;
the relation computation module consists of 2 convolution blocks, 2 maximum pooling computation layers and 2 fully-connected computation layers connected in sequence.
3. The method of claim 2, wherein the convolution block is designed as follows:
set the weight data W of the convolution block; the input feature map of the convolution block is In, and the output feature map of the convolution block is Out;
the input feature map In of the convolution block has dimension IH × IW × IC, where IH is the input feature map height, IW the input feature map width, and IC the number of input feature map channels of the convolution block; the weight data W of the convolution block has dimension OC × K² × IC, where OC is the number of output feature map channels of the convolution block and K is the convolution kernel size; the output of the convolution block is the output feature map Out with dimension OH × OW × OC, where OH is the output feature map height and OW the output feature map width of the convolution block;
PE processing units are set, and each neuron internally adopts a parallel computing design over Single Instruction Multiple Data (SIMD) data; the input feature map is scheduled as data of width PE × SIMD, the output of each PE computation is the output of one neuron, and SIMD computation units are used inside each PE;
five nested loops are set and only the innermost ones are unrolled; from the outermost to the innermost they are the first through fifth convolution loops, with loop variables h, w, c, pe and simd respectively;
the initial values of h, w, c, pe and simd are set to 0 and each is incremented by 1 per iteration; the upper bound of h is OH, of w is OW, of c is OC, of pe is PE, and of simd is SIMD;
the fifth (innermost) convolution loop is set to
Out[w][h][c/PE+pe] += In[pe][simd] × W[pe][simd]
where In[pe][simd] is the parameter at row pe, column simd of the convolution block input feature map; W[pe][simd] is the parameter at row pe, column simd of the convolution block weight data W; Out[w][h][c/PE+pe] is the parameter of the convolution block output feature map at indices w, h and c/PE+pe; and += accumulates the right-hand side into the left-hand side.
4. A method according to claim 2 or 3, wherein the maximum pooling computation layer is specifically designed as follows:
the pooling input feature map is In1; weight data W1 is set; the output feature map is Out1;
the pooling input feature map In1 has dimension IH1 × IW1 × IC1, where IH1 is the pooling input feature map height, IW1 the width, and IC1 the number of channels; the pooling weight data W1 has dimension OC1 × K1² × IC1, where OC1 is the number of pooling output feature map channels and K1 is the pooling kernel size; the pooling output is the output feature map Out1 with dimension OH1 × OW1 × OC1, where OH1 is the pooling output feature map height and OW1 the width;
Pool_Size is set as the pooling size of the maximum pooling computation layer, and max as the maximum-value selection;
six nested loops are set and only the innermost loop is unrolled, with parallelism PE1; from the outermost to the innermost they are the first through sixth pooling loops, with loop variables h1, ph, w1, pw, c1 and pe1 respectively;
the initial values of h1, ph, w1, pw, c1 and pe1 are set to 0 and each is incremented by 1 per iteration; the upper bound of h1 is OH1, of ph is Pool_Size, of w1 is OW1, of pw is Pool_Size, of c1 is OC1, and of pe1 is PE1;
the sixth (innermost) pooling loop is set to
Out1[w1][h1][c1/PE1+pe1] = max(In1[pe1], Out1[w1][h1][c1/PE1+pe1])
where Out1[w1][h1][c1/PE1+pe1] is the parameter of the pooling output feature map at indices [w1][h1][c1/PE1+pe1]; the Out1 matrix initially stores negative infinity; max() returns the maximum of the values in parentheses; and In1[pe1] denotes row pe1 of the pooling input feature map.
5. The method according to claim 2 or 3, wherein the fully-connected computation layer is specifically designed as follows:
the input feature map of the fully-connected computation layer is In2; the weight data of the fully-connected computation layer is W2; the output feature map of the fully-connected computation layer is Out2;
the input feature map In2 of the fully-connected computation layer has dimension IH2 × IW2 × IC2, where IH2 is the input feature map height, IW2 the width, and IC2 the number of input feature map channels of the fully-connected computation layer; the weight data W2 of the fully-connected computation layer has dimension OC2 × K2² × IC2, where OC2 is the number of output feature map channels and K2 is the convolution kernel size of the fully-connected computation layer; the output of the fully-connected computation layer is the output feature map Out2 with dimension OH2 × OW2 × OC2, where OH2 is the output feature map height and OW2 the width;
the fully-connected computation layer is set to comprise 3 nested loops, from the outer layer to the inner layer the c loop, the pe loop and the simd loop; only the pe loop and the simd loop are unrolled, giving a processing module with PE2 × SIMD2 parallelism, where each of the PE2 units computes the output of one fully-connected neuron with SIMD2 multiply-accumulate units computing in parallel inside;
the loop variable of the c loop is set to c2, of the pe loop to pe2, and of the simd loop to simd2;
the initial values of c2, pe2 and simd2 are set to 0 and each is incremented by 1 per iteration; the upper bound of c2 is OC2, of pe2 is PE2, and of simd2 is SIMD2;
the simd loop is set to:
Out2[c2/PE2+pe2] += In2[pe2][simd2] × W2[pe2][simd2];
where Out2[c2/PE2+pe2] is the (c2/PE2+pe2)-th parameter of the fully-connected computation layer output feature map Out2; In2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer input feature map In2; W2[pe2][simd2] is the parameter at row pe2, column simd2 of the fully-connected computation layer weight data; and += accumulates the right-hand side into the left-hand side.
CN202110757032.6A 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration Active CN113673704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110757032.6A CN113673704B (en) 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110757032.6A CN113673704B (en) 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration

Publications (2)

Publication Number Publication Date
CN113673704A true CN113673704A (en) 2021-11-19
CN113673704B CN113673704B (en) 2022-07-01

Family

ID=78538663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110757032.6A Active CN113673704B (en) 2021-07-05 2021-07-05 Relational network reasoning optimization method based on software and hardware cooperative acceleration

Country Status (1)

Country Link
CN (1) CN113673704B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138830A1 (en) * 2015-01-09 2019-05-09 Irvine Sensors Corp. Methods and Devices for Cognitive-based Image Data Analytics in Real Time Comprising Convolutional Neural Network
US20160380819A1 (en) * 2015-06-26 2016-12-29 Microsoft Technology Licensing, Llc Configuring acceleration components over a network
CN109284250A (en) * 2017-09-11 2019-01-29 南京弹跳力信息技术有限公司 A kind of calculating acceleration system and its accelerated method based on large-scale F PGA chip
CN112464930A (en) * 2019-09-09 2021-03-09 华为技术有限公司 Target detection network construction method, target detection method, device and storage medium
CN111210019A (en) * 2020-01-16 2020-05-29 电子科技大学 Neural network inference method based on software and hardware cooperative acceleration
CN111626403A (en) * 2020-05-14 2020-09-04 北京航空航天大学 Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN112116084A (en) * 2020-09-15 2020-12-22 中国科学技术大学 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform
CN112232058A (en) * 2020-10-15 2021-01-15 济南大学 False news identification method and system based on deep learning three-layer semantic extraction framework
CN112990454A (en) * 2021-02-01 2021-06-18 国网安徽省电力有限公司检修分公司 Neural network calculation acceleration method and device based on integrated DPU multi-core isomerism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIXING LI 等: "A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks", 《ACM JOURNAL ON EMERGING TECHNOLOGIES IN COMPUTING SYSTEMS》 *
ZHANG Kunning et al.: "Design of a multi-core scalable convolution accelerator based on FPGA", 《计算机工程与设计》 (Computer Engineering and Design) *
XU Chang et al.: "Design and implementation of a DPU-accelerated CNN inference system", 《电脑编程技巧与维护》 (Computer Programming Skills & Maintenance) *

Also Published As

Publication number Publication date
CN113673704B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US20180164866A1 (en) Low-power architecture for sparse neural network
Li et al. Laius: An 8-bit fixed-point CNN hardware inference engine
CN112633490B (en) Data processing device, method and related product for executing neural network model
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
Yu et al. Instruction driven cross-layer cnn accelerator for fast detection on fpga
Wang et al. Briefly Analysis about CNN Accelerator based on FPGA
Lin et al. A high-speed low-cost CNN inference accelerator for depthwise separable convolution
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
Feng et al. The implementation of LeNet-5 with NVDLA on RISC-V SoC
Lin et al. High utilization energy-aware real-time inference deep convolutional neural network accelerator
Kim et al. A low-power graph convolutional network processor with sparse grouping for 3d point cloud semantic segmentation in mobile devices
Lu et al. An 176.3 GOPs object detection CNN accelerator emulated in a 28nm CMOS technology
Hu et al. On-chip instruction generation for cross-layer CNN accelerator on FPGA
CN113673704B (en) Relational network reasoning optimization method based on software and hardware cooperative acceleration
Gong et al. RAODAT: An energy-efficient reconfigurable AI-based object detection and tracking processor with online learning
Hu et al. High-performance reconfigurable DNN accelerator on a bandwidth-limited embedded system
Islam et al. An uninterrupted processing technique-based high-throughput and energy-efficient hardware accelerator for convolutional neural networks
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
Wang et al. DLA+: A light aggregation network for object classification and detection
Yang et al. Implementation of Reconfigurable CNN-LSTM Accelerator Based on FPGA
Cui et al. Design and Implementation of OpenCL-Based FPGA Accelerator for YOLOv2
Wen FPGA-Based Deep Convolutional Neural Network Optimization Method
Tsai et al. Hardware Architecture Design for Hand Gesture Recognition System on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant