CN113516235A - Deformable convolution accelerator and deformable convolution acceleration method - Google Patents

Info

Publication number
CN113516235A
Authority
CN
China
Prior art keywords
module
value
convolution
input
input value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110788017.8A
Other languages
Chinese (zh)
Inventor
王中风
于悦
罗嘉鹏
毛文东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110788017.8A
Publication of CN113516235A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to the technical field of convolutional neural networks, and provides a deformable convolution accelerator and a deformable convolution acceleration method. In the FPGA-based hardware architecture design, the mapping operation of the value-fetching stage provides regular accesses for the convolution calculation; a register array is designed to match the processing rates of the two stages and to optimize the storage space; and the convolution operation is executed on the regularized input values to obtain the output result. The method accelerates the original deformable convolution layer without any adjustment to the algorithm and without restricting the offsets, so the accuracy of the original model is preserved to the greatest extent. Irregular receptive fields are regularized by the mapping module, and the operation rates of the mapping module and the convolution module are matched through the ping-pong operation of the register module, improving hardware utilization. Intermediate data do not need to be stored off chip, which reduces the number of accesses to the off-chip storage structure.

Description

Deformable convolution accelerator and deformable convolution acceleration method
Technical Field
The application relates to the technical field of convolutional neural networks, in particular to a deformable convolution accelerator and a deformable convolution acceleration method.
Background
Deformable convolution has important applications in object detection. Since deformable convolution theory was proposed, it has been applied in various object detection models to achieve higher detection accuracy and speed. In an ordinary convolutional network, a window of the same size as the convolution kernel slides over the input feature map, and the values at the positions covered by the window are multiplied by the convolution kernel and accumulated to obtain the output at the current position. Deformable convolution builds on ordinary convolution by dynamically generating offsets from the input feature map: when a value is to be selected from the input feature map and multiplied by the convolution kernel, the position corresponding to the sliding window is taken as a base address, the corresponding offset is added to the base address to obtain a new address, and the value is then fetched from the input feature map at the new address. Because the offsets are not integers, the new address is usually not an integer either. When the new address falls outside the address range of the input feature map, the returned input value is zero; when the new address lies within the address range, bilinear interpolation over the four points closest to the new address coordinate is used to compute the input value at the new address. This input value is multiplied by the convolution kernel and accumulated to obtain the output of the deformable convolution.
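To make the value-fetching procedure concrete, the following Python sketch models how one sample of a deformable convolution is fetched (an illustrative behavioural model only, not the patented hardware; the feature-map layout and all names are assumptions):

    def deform_sample(feat, base_y, base_x, off_y, off_x):
        """Fetch one input value of a deformable convolution.
        feat is a 2-D feature map (list of rows); (base_y, base_x) is the
        integer base address from the sliding window; (off_y, off_x) are
        the fractional offsets generated from the input feature map."""
        H, W = len(feat), len(feat[0])
        ry, rx = base_y + off_y, base_x + off_x       # real (non-integer) address
        if not (0 <= ry <= H - 1 and 0 <= rx <= W - 1):
            return 0.0                                # outside the map: return zero
        Y, X = int(ry), int(rx)                       # integer parts
        y, x = ry - Y, rx - X                         # fractional parts
        Y1, X1 = min(Y + 1, H - 1), min(X + 1, W - 1)
        # bilinear interpolation over the 4 nearest neighbours
        return ((1 - y) * (1 - x) * feat[Y][X] + (1 - y) * x * feat[Y][X1]
                + y * (1 - x) * feat[Y1][X] + y * x * feat[Y1][X1])

The deformable convolution output at a position is then the accumulation of such samples, each multiplied by its convolution-kernel weight.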
In order to facilitate hardware design, the prior art adjusts the deformable convolution algorithm, specifically: the magnitude of the offset is limited to a certain range and the offset direction is fixed, turning the irregular receptive field into a regular square; meanwhile, the offset is changed from a floating-point number to an integer, so that bilinear interpolation is skipped and the value is fetched directly. Through these operations the receptive field is regularized, and the adjusted deformable convolution layer is structurally close to a dilated convolution. However, this approach alters the deformable convolution algorithm substantially, resulting in reduced output accuracy.
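The kind of simplification described above can be sketched as follows (illustrative only; the clamping radius r and all names are assumptions, and the fixing of the offset direction is not modelled):

    def simplified_sample(feat, base_y, base_x, off_y, off_x, r):
        """Prior-art simplification: clamp the offset to [-r, r] and round
        it to an integer, so bilinear interpolation is skipped and the
        receptive field collapses to a regular, dilated-convolution-like grid."""
        Y = base_y + max(-r, min(r, round(off_y)))
        X = base_x + max(-r, min(r, round(off_x)))
        H, W = len(feat), len(feat[0])
        return feat[Y][X] if 0 <= Y < H and 0 <= X < W else 0.0

Compared with deform_sample above, the interpolation and the unrestricted offset range are lost, which is precisely the source of the accuracy drop.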
Other prior art limits the size of the receptive field by choosing smaller offsets during training, stores the values fetched at the offset addresses of the input feature map off chip as intermediate variables, and reads them back from off chip when the convolution is computed, which greatly increases the number of off-chip memory accesses.
Disclosure of Invention
In order to overcome the defects of the prior art, the present application aims to provide a deformable convolution accelerator and a deformable convolution acceleration method that solve the irregular accesses and reduced output accuracy caused by irregular receptive fields, while reducing the number of off-chip memory accesses by optimizing the storage space and the access operations.
In order to achieve the above object, in one aspect, the present application provides a deformable convolution accelerator, which includes a software end and a hardware end.
The software end is used for realizing the overall framework of the deformable convolution accelerator and for loading the original model parameters from an off-chip memory to the hardware end; the original model parameters include the original input values, offsets, quantization scales, masks and convolution weights. The software end is also used for reading the output values of the hardware end back to the off-chip memory.
The hardware end comprises:
The control module is used for acquiring the original input values, offsets, quantization scales, masks and convolution weights from the software end; it obtains the real address from the base address and the offset and separates it into an integer part and a fractional part; it outputs the control signals that govern the mapping module, the register module, the mask module, the convolution module, the input cache module, the output cache module, the weight cache module and the data interaction among these modules; and it transmits the output values to the software end.
And the input buffer module is used for storing the original input value.
And the register module is used for converting the access parallel dimensionality of the original input value, obtaining a parallel input value and storing the parallel input value.
The mapping module is used for obtaining the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part; and for carrying out bilinear interpolation according to the parallel input values and the interpolation weights to obtain the real input value.
And the mask module is used for storing the mask and obtaining a regularized input value according to the real input value and the mask.
And the weight cache module is used for storing the convolution weight.
And the convolution module is used for performing convolution calculation according to the convolution weight and the regularized input value to obtain an output value.
And the output buffer module is used for storing the output value obtained by the calculation of the convolution module.
Further, the convolution module comprises a multiplication unit, an addition unit and an accumulation unit.
The multiplication unit comprises a multiplier and is used for multiplying the regularized input value and the convolution weight to obtain a unit input value.
The addition unit comprises a carry-save adder and an adder and is used for carrying out addition operation on all unit input values step by step to obtain a unit accumulated value.
The accumulation unit comprises an accumulator and is used for accumulating all the unit accumulated values to obtain output values.
Further, the addition unit includes a preprocessing layer, an intermediate layer, and an output layer.
The preprocessing layer is used for selecting a corresponding carry save adder according to the size of the convolution kernel and grouping all unit input values according to the requirement of the carry save adder on the input values; and the carry save adder is used for carrying out addition operation on the grouped unit input values and outputting a first-level sum value and a first-level carry value; and the temporary storage value is obtained through an adder according to the primary carry value and the primary sum value.
And the intermediate layer is used for taking the temporary storage value output by the previous layer as the input of the next layer and carrying out progressive accumulation operation through the carry save adder and the adder to obtain a unit accumulated value.
And the output layer is used for transmitting the unit accumulated value to an accumulation unit.
Further, the register module comprises a read array and a write array; the read array is used for providing parallel input values for the mapping module to perform bilinear interpolation, the write array is used for storing the parallel input values, and the functions of the read array and the write array can be interchanged.
Furthermore, ping-pong operation is adopted by the read array and the write array to realize parallel processing of read action and storage action.
Further, the access parallel dimension of the register module is set to 36.
Further, the control module comprises a counter, and the counter is used for calculating the interpolation weights from the fractional part.
In a second aspect, the present application further provides a deformable convolution acceleration method, specifically including:
the control module obtains an original input value, an offset, a quantization scale, a mask and a convolution weight through a software end.
The control module adds a base address and the offset to obtain a real address, separates the real address into an integer part and a fractional part, and transmits the integer part, the fractional part and the quantization scale to a mapping module; and storing the original input value into an input cache module, storing the mask into a mask module, and caching the convolution weight into a weight cache module.
And the mapping module obtains the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part.
And the control module acquires a corresponding original input value from the input cache module according to the actual address and transmits the original input value to the register module.
And the register module converts the parallel dimensionality of the original input value and outputs a parallel input value.
And the mapping module carries out bilinear interpolation processing on the parallel input value and the interpolation weight to obtain a real input value.
And the mask module performs product operation on the real input value and the mask to obtain a regularized input value.
The convolution module obtains the convolution weight of the weight cache module and the regularization input value of the mask module through the control module, performs convolution calculation according to the regularization input value and the convolution weight to obtain an output value, and stores the output value to the output cache module.
And the control module controls the output cache module to transmit the output value to an off-chip memory through a software end.
Further, the register module includes a read array and a write array, and a specific method for converting parallel dimensions by the register module is as follows:
the register module obtains the original input value input into the cache module through the control module.
And the control module sends a control signal to store the original input value into a write array of the register module.
The control module controls the mapping module to read the stored original input value from the read array of the register module for performing the bilinear interpolation operation.
And when the writing array is full of the original input values and all the original input values in the reading array are processed by the mapping module, the functions of the reading array and the writing array are interchanged.
Further, the convolution module includes a multiplier, a carry save adder, an adder and an accumulator, and the specific method for the convolution module to perform convolution calculation is as follows:
and acquiring the convolution weight of the weight cache module and the regularized input value of the mask module through the control module.
And multiplying the convolution weight and the regularized input value through a multiplier to obtain a unit input value.
And selecting a corresponding carry save adder according to the size of the convolution kernel.
All unit input values are grouped according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped unit input values through a carry save adder, and outputting a first-level sum value and a first-level carry value.
And summing twice the first-level carry value with the first-level sum value through an adder to obtain a first-level temporary storage value.
And grouping all the first-level temporary storage values according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped first-level temporary storage values through a carry save adder, and outputting a second-level sum value and a second-level carry value.
And summing twice the second-level carry value with the second-level sum value through an adder to obtain a second-level temporary storage value.
The grouping and addition operations are repeated until the unit accumulated value is obtained.
And accumulating all the unit accumulated values through an accumulator to obtain an output value.
The application provides a deformable convolution accelerator and a deformable convolution acceleration method that accelerate the original deformable convolution layer without any adjustment to the algorithm: the original offset range is used and the offsets are not restricted, so the accuracy of the original model is preserved to the greatest extent. Irregular receptive fields are regularized by the mapping module, and the operation rates of the mapping module and the convolution module are matched through the ping-pong operation of the register module, improving hardware utilization. Meanwhile, intermediate data do not need to be stored off chip, which reduces the number of accesses to the off-chip storage structure.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; it is obvious that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the overall architecture of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 2 is a block diagram of a mapping module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 3 is a block diagram of a register module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an input buffer module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a convolution module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 6 is a schematic diagram comparing the numbers of off-chip accesses provided in the embodiments of the present application;
FIG. 7 is a flowchart of a deformable convolution acceleration method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be fully and clearly described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A Field Programmable Gate Array (FPGA) is a semi-custom circuit in the field of Application Specific Integrated Circuits (ASICs), developed from programmable devices such as PAL, GAL and CPLD. Because FPGAs offer short development cycles, strong functionality, high reliability and good confidentiality, FPGA-based designs for accelerating convolutional neural networks now abound, yet research on accelerating deformable convolutional networks with FPGAs is still lacking. The embodiments of the present application therefore study a deformable convolution accelerator on an FPGA platform to remedy the deficiencies of the prior art.
Referring to fig. 1, a schematic diagram of the overall architecture of a deformable convolution accelerator according to an embodiment of the present application is shown. A first aspect of the embodiments of the present application provides a deformable convolution accelerator, which specifically includes a software end and a hardware end.
The software end is used for realizing the overall framework of the deformable convolution accelerator and for loading the original model parameters from an off-chip memory to the hardware end; the original model parameters include the original input values, offsets, quantization scales, masks and convolution weights. The software end is also used for reading the output values of the hardware end back to the off-chip memory.
The hardware end comprises:
The control module is used for acquiring the original input values, offsets, quantization scales, masks and convolution weights from the software end; it obtains the real address from the base address and the offset and separates it into an integer part and a fractional part; it outputs the control signals that govern the mapping module, the register module, the mask module, the convolution module, the input cache module, the output cache module, the weight cache module and the data interaction among these modules; and it transmits the output values to the software end.
And the input buffer module is used for storing the original input value.
And the register module is used for converting the access parallel dimensionality of the original input value, obtaining a parallel input value and storing the parallel input value.
The mapping module is used for obtaining the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part; and for carrying out bilinear interpolation according to the parallel input values and the interpolation weights to obtain the real input value.
And the mask module is used for storing the mask and obtaining a regularized input value according to the real input value and the mask.
And the weight cache module is used for storing the convolution weight.
And the convolution module is used for performing convolution calculation according to the convolution weight and the regularized input value to obtain an output value.
And the output buffer module is used for storing the output value obtained by the calculation of the convolution module.
Compared with an ordinary convolution layer, deformable convolution obtains an irregularly shaped receptive field by offsetting the input addresses and can thus adapt to targets of different sizes and shapes, which is why it has important applications in object detection tasks. But the offset addresses also cause irregular accesses, and a conventional parallel strategy would produce access conflicts. The embodiments of the present application are a hardware architecture design for deformable convolution that divides the computation into two stages: a value-fetching stage and a convolution stage. In the value-fetching stage, the storage space is optimized to provide regular accesses for the convolution calculation; the computation of this stage is mainly completed by the mapping module. In the convolution stage, the convolution operation is executed on the regularized input values obtained from the previous stage to produce the output result. Compared with other hardware architectures, the embodiments of the present application accelerate the original deformable convolution for the first time and greatly reduce energy consumption. In addition, the application does not compress or regularize the receptive field, so the original accuracy of the model is preserved to the greatest extent; meanwhile, the optimization of the storage space reduces the number of accesses to the off-chip storage space.
Specifically, in order to verify the feasibility of the scheme, both the software end and the hardware end are verified on an FPGA platform.
Further, the control module comprises a counter, and the counter is used for calculating the interpolation weights from the fractional part.
Specifically, referring to fig. 2, a block diagram of the mapping module of a deformable convolution accelerator according to an embodiment of the present application is shown. The mapping module is organized as an array of feature-map cells, whose number is denoted by H; in the embodiment of the present application the convolution kernel size is 3 × 3 and the number of feature-map cells is 9. In each feature-map cell, the inputs X and Y are the integer parts of the abscissa and ordinate of the real address, which is obtained by adding the offset to the base address, and x and y are the fractional parts. For quantized values, the two parts can be obtained by shifting the offset address according to the quantization scale. From the integer parts X and Y, the 4 adjacent values required for bilinear interpolation are fetched from the feature map at the coordinates (X, Y), (X+1, Y), (X+1, Y+1) and (X, Y+1). The interpolation weights w_1 to w_4 of these 4 adjacent values are calculated from the fractional parts x and y by a counter in the control module. The fetched values are multiplied by the corresponding interpolation weights and accumulated to obtain the value at the offset address, which is the input value required by the convolution operation of the next stage.
In conclusion, in the mapping stage, values are fetched from the feature map at the coordinates of the actual addresses and are arranged regularly to be supplied to the convolution calculation of the next stage.
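For quantized fixed-point addresses, the split into integer and fractional parts and the weight computation reduce to simple shift, mask and multiply operations. A minimal sketch follows (assuming a quantization scale of Q fractional bits; the weight formulas are the standard bilinear-interpolation ones, which the text does not spell out, and all names are illustrative):

    def split_address(addr_fixed, Q):
        """Split a fixed-point address with Q fractional bits into its
        integer and fractional parts using only shift and mask."""
        return addr_fixed >> Q, addr_fixed & ((1 << Q) - 1)

    def interp_weights(x, y):
        """Interpolation weights w1..w4 of the 4 neighbours
        (X, Y), (X+1, Y), (X+1, Y+1), (X, Y+1) from fractional parts x, y."""
        w1 = (1 - x) * (1 - y)   # weight of (X, Y)
        w2 = x * (1 - y)         # weight of (X+1, Y)
        w3 = x * y               # weight of (X+1, Y+1)
        w4 = (1 - x) * y         # weight of (X, Y+1)
        return w1, w2, w3, w4

The interpolated input value is then w1*v1 + w2*v2 + w3*v3 + w4*v4 over the four fetched neighbour values.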
Specifically, if the convolution-stage calculation only began after the value-fetching stage had completed entirely, a large delay would result. In the embodiment of the present application the two stages therefore run synchronously: each value-fetching address in the feature map is multiplexed along the input dimension, and values at the same position of the feature map but on different input channels are fetched at the same time. In the calculation of the next stage, however, values at different positions of the same input channel must be input at the same time. Because the access dimensions differ, a register module is placed between the value-fetching stage and the convolution stage to store the intermediate data and convert the parallel dimension of the data.
Further, the register module comprises a read array and a write array; the read array is used for providing parallel input values for the mapping module to perform bilinear interpolation, the write array is used for storing the parallel input values, and the functions of the read array and the write array can be interchanged.
Furthermore, ping-pong operation is adopted by the read array and the write array to realize parallel processing of read action and storage action.
Further, the access parallel dimension of the register module is set to 36.
Referring to fig. 3, a schematic diagram of the register module of a deformable convolution accelerator according to an embodiment of the present application is shown. Specifically, so that the access operations do not affect each other and the calculations of the two stages can execute simultaneously, the register module adopts ping-pong operation: reading and writing proceed at the same time without mutual interference. One address in the mapping module fetches, simultaneously, the values at the corresponding position on multiple input channels, and these values, marked w_1 to w_36, are stored into the write array of the register module. Since the convolution kernel size is 3 × 3, each bilinear interpolation needs 4 adjacent values, and 36 original input values from the buffer must be supplied each time the convolution unit computes an output value, so the access-parallel dimension of the register module in the embodiment of the present application is 36. When all 36 addresses in the mapping module have fetched their original input values and stored them into the register module, 36 × 36 original input values are held in the write array, coming from 36 input channels with 36 original input values each. The read array likewise outputs 36 original input values at a time, marked r_1 to r_36; values output at the same time belong to the same input channel, and after 36 outputs all the original input values have been delivered to the mapping module, where bilinear interpolation is performed before the convolution module reuses the results across output channels. When the write array is full and all the original input values in the read array have been consumed by the subsequent modules, the read array and the write array are exchanged.
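The behaviour of the register module can be modelled as two 36 × 36 arrays whose roles swap once the write array is full and the read array is drained. A minimal sketch (method names and the row/column convention are assumptions for illustration):

    class PingPongRegisters:
        """Behavioural model of the ping-pong register arrays: writes store
        36 channel values of one address (one row); reads return 36 address
        values of one channel (one column), i.e. the parallel dimension is
        transposed between the value-fetching stage and the convolution stage."""
        DIM = 36

        def __init__(self):
            self.banks = [[[0] * self.DIM for _ in range(self.DIM)] for _ in range(2)]
            self.wr, self.rd = 0, 1          # which bank is written / read
            self.w_row = self.r_col = 0

        def write_row(self, values):          # one address, 36 input channels
            self.banks[self.wr][self.w_row] = list(values)
            self.w_row += 1

        def read_col(self):                   # one input channel, 36 addresses
            col = [self.banks[self.rd][r][self.r_col] for r in range(self.DIM)]
            self.r_col += 1
            return col

        def swap_if_done(self):               # ping-pong: exchange the arrays
            if self.w_row == self.DIM and self.r_col == self.DIM:
                self.wr, self.rd = self.rd, self.wr
                self.w_row = self.r_col = 0

Because one bank is only written while the other is only read, the two stages never contend for the same storage, which is what allows their rates to be matched.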
In order to match the access scheme of the register module, the arrangement of the original input values in the input cache module is also adjusted in the embodiment of the present application: the original input values are grouped by input channel, and the number of groups equals the access-parallel dimension of the register module.
Specifically, referring to fig. 4, a schematic structural diagram of the input buffer module of a deformable convolution accelerator according to an embodiment of the present application is shown. C represents an input channel. Each time the mapping module outputs an address to the input cache module, the input cache module outputs the values at the corresponding position of the 36 currently selected input channels. After a group of addresses in the mapping module has been fetched, the selection of input channels in the input cache module changes; the black box represents the selected input channels and moves in the direction of the arrow shown in the figure.
Further, the convolution module comprises a multiplication unit, an addition unit and an accumulation unit.
The multiplication unit comprises a multiplier and is used for multiplying the regularized input value and the convolution weight to obtain a unit input value.
The addition unit comprises a carry-save adder and an adder and is used for carrying out addition operation on all unit input values step by step to obtain a unit accumulated value.
The accumulation unit comprises an accumulator and is used for accumulating all the unit accumulated values to obtain output values.
The carry-save adder (CSA) has a very small carry-propagation delay when adding multiple numbers. Its basic idea is to reduce the sum of 3 addends to the sum of 2 addends, computing and saving the carry value and the sum value separately; since every bit position computes its carry and sum independently, the operation is very fast, the number of accumulations is reduced, and the critical path is shortened.
Specifically, when Verilog is used to implement an equation such as Sum = A + B + C + D, full adders can be designed and combined into an N-bit CSA structure: the addends are combined and summed through two stages of CSAs, and the carry value and the sum value are finally added by an ordinary adder. Whenever the carry value is transmitted to any module it must be multiplied by 2, because only after this weighting does the carry represent its true numerical value.
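A bit-level software model of this 3:2 carry-save step may make the idea concrete (illustrative Python, not the Verilog design itself; small integer operands are assumed):

    def csa(a, b, c):
        """3:2 carry-save adder: compress three addends into a sum word and
        a carry word; every bit position is computed independently, so there
        is no carry propagation inside this step."""
        s = a ^ b ^ c                        # per-bit sum
        cy = (a & b) | (a & c) | (b & c)     # per-bit carry (majority function)
        return s, cy

    # Sum = A + B + C + D via two CSA stages and one ordinary adder:
    A, B, C, D = 9, 14, 3, 7
    s1, c1 = csa(A, B, C)        # first stage: three addends -> sum, carry
    s2, c2 = csa(s1, 2 * c1, D)  # the reused carry is weighted by 2
    total = s2 + 2 * c2          # single ordinary addition at the end
    assert total == A + B + C + D

The factor of 2 is exactly the multiplication mentioned above: the carry word represents the next-higher bit position, so it must be shifted left by one bit before it is combined with a sum value.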
Further, the addition unit includes a preprocessing layer, an intermediate layer, and an output layer.
The preprocessing layer is used for selecting a corresponding carry save adder according to the size of the convolution kernel and grouping all unit input values according to the requirement of the carry save adder on the input values; and the carry save adder is used for carrying out addition operation on the grouped unit input values and outputting a first-level sum value and a first-level carry value; and the temporary storage value is obtained through an adder according to the primary carry value and the primary sum value.
And the intermediate layer is used for taking the temporary storage value output by the previous layer as the input of the next layer and carrying out progressive accumulation operation through the carry save adder and the adder to obtain a unit accumulated value.
The output layer is used for transmitting the unit accumulated value to an accumulator.
Specifically, for the second-stage calculation, the embodiment of the present application designs a convolution array dedicated to convolution computation. Referring to fig. 5, a schematic diagram of the convolution module of a deformable convolution accelerator according to an embodiment of the present application is shown. The parallel dimension of the output channels is N, which may vary with the design size; in the embodiment of the present application N is set to 32. Each convolution calculation unit is responsible for the convolution of one output channel, and the weights input at the same time share the same input channel but belong to different output channels. The values fetched at the offset addresses in the previous stage are reused by the different output channels. Here w_1 to w_9 are the input weight values, and v_1 to v_9 are the input values fetched by the mapping module in the previous stage. Since the convolution kernel size in the embodiment of the present application is 3 × 3, regularized inputs of the same size are taken and multiplied element-wise with the weights. To shorten the critical path and simplify the calculation, the 9 products are not accumulated directly; instead they are processed in groups of 3 by carry-save adders, and the 3 resulting values are added by a further carry-save adder stage. The two stages of carry-save adders compute the accumulated value of the 9 products, giving the output of the current position for the current input channel.
Because each convolution calculation unit is responsible for the calculation of one output channel, the output results of each convolution calculation unit are accumulated by an accumulator along the input-channel dimension to obtain the output value of the current position on the corresponding output channel.
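Putting the pieces together, one convolution calculation unit can be modelled as below (a behavioural sketch under the embodiment's 3 × 3 kernel, reusing the csa() helper from the earlier sketch; integer operands and the per-channel example data are assumptions):

    def csa(a, b, c):  # 3:2 carry-save step, as in the earlier sketch
        return a ^ b ^ c, (a & b) | (a & c) | (b & c)

    def conv_unit(v, w):
        """One output position of one output channel for a 3x3 kernel:
        multiply the 9 regularized inputs v by the 9 weights w, then reduce
        the 9 products with two CSA stages instead of a straight accumulation."""
        p = [vi * wi for vi, wi in zip(v, w)]    # 9 unit input values
        level1 = []
        for g in range(0, 9, 3):                 # groups of 3 products
            s, c = csa(p[g], p[g + 1], p[g + 2])
            level1.append(s + 2 * c)             # first-level temporary value
        s, c = csa(*level1)                      # second CSA stage over 3 values
        return s + 2 * c                         # output for this input channel

    # Accumulator over the input-channel dimension (two hypothetical channels):
    inputs  = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [9, 8, 7, 6, 5, 4, 3, 2, 1]]
    weights = [[1, 0, 2, 1, 0, 2, 1, 0, 2]] * 2
    out = sum(conv_unit(v, w) for v, w in zip(inputs, weights))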
The embodiment of the application provides a deformable convolution accelerator in which the mapping operation of the first stage regularizes the data accesses of the second-stage convolution calculation, resolving access conflicts and enabling large-scale parallelism in the architecture. The register module is designed to match the processing rates of the two stages, improving hardware utilization; it converts the parallel dimension of the accesses and, combined with the storage optimization of the input cache, keeps the storage requirement from growing significantly under parallel operation while reducing the number of off-chip accesses. The application does not compress or regularize the receptive field, so the accuracy of the original model is preserved to the greatest extent.
The deformable convolution accelerator and the method provided by the embodiments of the present application will be described in detail by specific embodiments.
As can be seen from Table 1, the accuracy of the embodiment of the present application on the COCO data set is closer to that of the original model than the prior art. The embodiment does not compress or regularize the receptive field; it accelerates the original deformable convolution layer and preserves the accuracy of the original model to the greatest extent.
Table 1. Accuracy comparison on the COCO data set (table reproduced as an image in the original publication)
As can be seen from Table 2, the embodiment of the present application occupies few on-board resources.
Table 2. Resource occupancy (table reproduced as an image in the original publication)
As can be seen from Table 3, the energy-efficiency ratio of the embodiment of the present application is greatly improved compared with a GPU.
Table 3. Energy consumption comparison (table reproduced as an image in the original publication)
Referring to fig. 6, a comparison of the numbers of off-chip accesses provided by the embodiment of the present application is shown, where S denotes the size (side length) of the input and N denotes the number of input channels. As can be seen from the figure, the deformable convolution accelerator provided by the embodiment of the present application greatly reduces the number of accesses to the off-chip storage structure compared with the DCL accelerator of the prior art.
In summary, the deformable convolution accelerator in the embodiment of the present application has the following features:
(1) The embodiment of the application designs a mapping module to complete the irregular value-fetching operation of deformable convolution, regularizing the accesses of the convolution stage and resolving access conflicts. The register module matches the computation rates of the two stages, improving hardware utilization; at the same time, it changes the parallel dimension of data access, reducing the storage-space requirement.
(2) The computation modules designed in the embodiment of the application constitute an accelerator architecture for the original deformable convolution computation; the original algorithm is not modified, and the accuracy of the original model is preserved to the greatest extent.
(3) The embodiment of the application is implemented on an FPGA platform and reaches 200 MHz and 126.32 GOPS. The energy efficiency on the FPGA is 50.33 GOPS/W, 2.5 times that of a GPU. In addition, the number of off-chip memory accesses is greatly reduced compared with the prior art.
Referring to fig. 7, a schematic flowchart of a deformable convolution acceleration method according to an embodiment of the present application is provided. A second aspect of the embodiments of the present application provides a deformable convolution acceleration method, which is used to guide the operation of the deformable convolution accelerator provided in the first aspect of the embodiments of the present application, and for details that are not disclosed in the deformable convolution acceleration method provided in the second aspect of the embodiments of the present application, please refer to the deformable convolution accelerator provided in the first aspect of the embodiments of the present application.
The deformable convolution acceleration method specifically comprises the following steps:
the control module obtains an original input value, an offset, a quantization scale, a mask and a convolution weight through a software end.
The control module adds a base address and the offset to obtain a real address, separates the real address into an integer part and a fractional part, and transmits the integer part, the fractional part and the quantization scale to a mapping module; and storing the original input value into an input cache module, storing the mask into a mask module, and caching the convolution weight into a weight cache module.
And the mapping module obtains the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part.
And the control module acquires a corresponding original input value from the input cache module according to the actual address and transmits the original input value to the register module.
And the register module converts the parallel dimensionality of the original input value and outputs a parallel input value.
And the mapping module carries out bilinear interpolation processing on the parallel input value and the interpolation weight to obtain a real input value.
And the mask module performs product operation on the real input value and the mask to obtain a regularized input value.
The convolution module obtains the convolution weight of the weight cache module and the regularization input value of the mask module through the control module, performs convolution calculation according to the regularization input value and the convolution weight to obtain an output value, and stores the output value to the output cache module.
And the control module controls the output cache module to transmit the output value to an off-chip memory through a software end.
Further, the register module includes a read array and a write array, and a specific method for converting parallel dimensions by the register module is as follows:
the register module obtains the original input value input into the cache module through the control module.
And the control module sends a control signal to store the original input value into a write array of the register module.
The control module controls the mapping module to read the stored original input value from the read array of the register module for performing the bilinear interpolation operation.
And when the writing array is full of the original input values and all the original input values in the reading array are processed by the mapping module, the functions of the reading array and the writing array are interchanged.
Further, the convolution module includes a multiplier, a carry save adder, an adder and an accumulator, and the specific method for the convolution module to perform convolution calculation is as follows:
and acquiring the convolution weight of the weight cache module and the regularized input value of the mask module through the control module.
And multiplying the convolution weight and the regularized input value through a multiplier to obtain a unit input value.
And selecting a corresponding carry save adder according to the size of the convolution kernel.
All unit input values are grouped according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped unit input values through a carry save adder, and outputting a first-level sum value and a first-level carry value.
And summing twice the first-level carry value with the first-level sum value through an adder to obtain a first-level temporary storage value.
And grouping all the first-level temporary storage values according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped first-level temporary storage values through a carry save adder, and outputting a second-level sum value and a second-level carry value.
And summing twice the second-level carry value with the second-level sum value through an adder to obtain a second-level temporary storage value.
The grouping and addition operations are repeated until the unit accumulated value is obtained.
And accumulating all the unit accumulated values through an accumulator to obtain an output value.
Specifically, referring to fig. 5, for a 3 × 3 convolution kernel and the inputs at a given position, the weights in the convolution kernel are multiplied element-wise with the inputs; the 9 resulting temporary values are added in groups of 3 by carry-save adders to obtain 3 accumulated values, which are then combined again through a carry-save adder to obtain the output value.
According to the technical scheme, the deformable convolution accelerator and the deformable convolution acceleration method accelerate the original deformable convolution layer without any adjustment to the algorithm: the original offset range is used and the offsets are not restricted, so the accuracy of the original model is preserved to the greatest extent. Irregular receptive fields are regularized by the mapping module, and the operation rates of the mapping module and the convolution module are matched through the ping-pong operation of the register module, improving hardware utilization. Meanwhile, intermediate data do not need to be stored off chip, which reduces the number of accesses to the off-chip storage structure.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.

Claims (10)

1. A deformable convolution accelerator is characterized by comprising a software end and a hardware end;
the software end is used for realizing the whole framework of the deformable convolution accelerator and loading original model parameters from an off-chip memory to the hardware end, and the original model parameters comprise original input values, offsets, quantization scales, masks and convolution weights; the device is also used for reading the output value of the hardware end to an off-chip memory;
the hardware end comprises:
the control module is used for acquiring the original input values, offsets, quantization scales, masks and convolution weights from the software end; for obtaining the real address from the base address and the offset and separating it into an integer part and a fractional part; for outputting the control signals that govern the mapping module, the register module, the mask module, the convolution module, the input cache module, the output cache module, the weight cache module and the data interaction among these modules; and for transmitting the output values to the software end;
the input buffer module is used for storing an original input value;
the register module is used for converting the access parallel dimensionality of the original input value, obtaining a parallel input value and storing the parallel input value;
the mapping module is used for obtaining an actual address and an interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part; and for carrying out bilinear interpolation according to the parallel input values and the interpolation weights to obtain a real input value;
the mask module is used for storing the mask and obtaining a regularized input value according to the real input value and the mask;
the weight cache module is used for storing the convolution weight;
the convolution module is used for carrying out convolution calculation according to the convolution weight and the regularized input value to obtain an output value;
and the output buffer module is used for storing the output value obtained by the calculation of the convolution module.
2. A deformable convolution accelerator according to claim 1 wherein said convolution module includes a multiplication unit, an addition unit and an accumulation unit;
the multiplication unit comprises a multiplier and is used for multiplying the regularized input value by the convolution weight to obtain a unit input value;
the addition unit comprises a carry-save adder and an adder and is used for carrying out step-by-step addition operation on all unit input values to obtain a unit accumulated value;
the accumulation unit comprises an accumulator and is used for accumulating all the unit accumulated values to obtain output values.
3. A deformable convolution accelerator according to claim 2 wherein said adding unit includes a preprocessing layer, an intermediate layer and an output layer;
the preprocessing layer is used for selecting a corresponding carry save adder according to the size of the convolution kernel and grouping all unit input values according to the requirement of the carry save adder on the input values; and the carry save adder is used for carrying out addition operation on the grouped unit input values and outputting a first-level sum value and a first-level carry value; the temporary storage value is obtained through an adder according to the primary carry value and the primary sum value;
the intermediate layer is used for taking the temporary storage value output by the previous layer as the input of the next layer and performing progressive accumulation operation through the carry save adder and the adder to obtain a unit accumulated value;
the output layer is used for transmitting the unit accumulated value to an accumulator.
4. A deformable convolution accelerator according to claim 1, characterized in that said register module comprises a read array for providing parallel input values for said mapping module for bilinear interpolation and a write array for storing parallel input values, the functions of said read array and said write array being interchangeable.
5. A deformable convolution accelerator according to claim 4 wherein said read array and said write array implement parallel processing of read and store actions using ping-pong operations.
6. A deformable convolution accelerator according to claim 5, wherein the access-parallel dimension of said register module is set to 36.
7. A deformable convolution accelerator according to claim 1 wherein said control module includes a counter for calculating said interpolation weight from said fractional portion.
8. A deformable convolution acceleration method applied to a deformable convolution accelerator according to any one of claims 1 to 7, comprising:
the control module acquires an original input value, an offset, a quantization scale, a mask and a convolution weight through a software end;
the control module adds a base address and the offset to obtain a real address, separates the real address into an integer part and a fractional part, and transmits the integer part, the fractional part and the quantization scale to a mapping module; storing the original input value into an input cache module, storing the mask into a mask module, and caching the convolution weight into a weight cache module;
the mapping module obtains an actual address and an interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part;
the control module acquires a corresponding original input value from the input cache module according to the actual address and transmits the original input value to the register module;
the register module converts the parallel dimensionality of the original input value and outputs a parallel input value;
the mapping module carries out bilinear interpolation processing on the parallel input value and the interpolation weight to obtain a real input value;
the mask module performs product operation on the real input value and the mask to obtain a regularized input value;
the convolution module acquires the convolution weight of the weight cache module and the regularized input value of the mask module through the control module, performs convolution calculation according to the regularized input value and the convolution weight to acquire an output value, and stores the output value to the output cache module;
and the control module controls the output cache module to transmit the output value to an off-chip memory through a software end.
9. The method of claim 8, wherein the register module comprises a read array and a write array, and the specific method for the register module to convert the parallel dimension is as follows:
the register module acquires an original input value input into the cache module through the control module;
the control module sends a control signal to store the original input value into a write array of the register module;
the control module controls the mapping module to read the stored original input value from the read array of the register module so as to execute bilinear interpolation operation;
and when the writing array is full of the original input values and all the original input values in the reading array are processed by the mapping module, the functions of the reading array and the writing array are interchanged.
10. The method according to claim 8, wherein the convolution module comprises a multiplier, a carry save adder, an adder and an accumulator, and the convolution module performs convolution calculation by the specific method comprising:
acquiring the convolution weight of the weight cache module and the regularized input value of the mask module through a control module;
multiplying the convolution weight and the regularized input value by a multiplier to obtain a unit input value;
selecting a corresponding carry save adder according to the size of the convolution kernel;
grouping all unit input values according to the requirement of the carry save adder on the input values;
carrying out addition operation on the grouped unit input values through a carry save adder, and outputting a first-level sum value and a first-level carry value;
summing twice the first-level carry value with the first-level sum value through an adder to obtain a first-level temporary storage value;
grouping all the first-level temporary storage values according to the requirement of the carry save adder on the input values;
carrying out addition operation on the grouped first-level temporary storage values through a carry save adder, and outputting a second-level sum value and a second-level carry value;
summing twice the second-level carry value with the second-level sum value through an adder to obtain a second-level temporary storage value;
repeating the grouping and addition operations until the unit accumulated value is obtained;
and accumulating all the unit accumulated values through an accumulator to obtain an output value.
CN202110788017.8A 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method Pending CN113516235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110788017.8A CN113516235A (en) 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110788017.8A CN113516235A (en) 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method

Publications (1)

Publication Number Publication Date
CN113516235A true CN113516235A (en) 2021-10-19

Family

ID=78067228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110788017.8A Pending CN113516235A (en) 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method

Country Status (1)

Country Link
CN (1) CN113516235A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN108229645A (en) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 Convolution accelerates and computation processing method, device, electronic equipment and storage medium
CN108133270A (en) * 2018-01-12 2018-06-08 清华大学 Convolutional neural networks accelerating method and device
US20190354835A1 (en) * 2018-05-17 2019-11-21 International Business Machines Corporation Action detection by exploiting motion in receptive fields
WO2019232836A1 (en) * 2018-06-04 2019-12-12 江南大学 Multi-scale sensing pedestrian detection method based on improved full convolutional network
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN110197255A (en) * 2019-04-29 2019-09-03 杰创智能科技股份有限公司 A kind of deformable convolutional network based on deep learning
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
US20210142448A1 (en) * 2019-11-07 2021-05-13 Intel Corporation Adaptive deformable kernel prediction network for image de-noising
CN111368699A (en) * 2020-02-28 2020-07-03 交叉信息核心技术研究院(西安)有限公司 Convolutional neural network pruning method based on patterns and pattern perception accelerator
CN111563582A (en) * 2020-05-06 2020-08-21 哈尔滨理工大学 Method for realizing and optimizing accelerated convolution neural network on FPGA (field programmable Gate array)
CN112766479A (en) * 2021-01-26 2021-05-07 东南大学 Neural network accelerator supporting channel separation convolution based on FPGA
CN113011562A (en) * 2021-03-18 2021-06-22 华为技术有限公司 Model training method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUE YU ET AL.: "A Memory-Efficient Hardware Architecture for Deformable Convolutional Networks", IEEE Workshop on Signal Processing Systems, 11 November 2021 (2021-11-11), pages 140-145 *
YANG WEI: "Research on the FPGA Parallel Structure of Convolutional Neural Networks", Digital Technology & Application, vol. 12, 15 December 2015 (2015-12-15), page 51 *

Similar Documents

Publication Publication Date Title
WO2021036905A1 (en) Data processing method and apparatus, computer equipment, and storage medium
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
TW201913460A (en) Chip device and related products
CN109146067B (en) Policy convolution neural network accelerator based on FPGA
CN110852416A (en) CNN accelerated computing method and system based on low-precision floating-point data expression form
CN110175670B (en) Method and system for realizing YOLOv2 detection network based on FPGA
CN110688088A (en) General nonlinear activation function computing device and method for neural network
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN113792621B (en) FPGA-based target detection accelerator design method
CN112862091B (en) Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112668708A (en) Convolution operation device for improving data utilization rate
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN114003198B (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
CN111091183A (en) Neural network acceleration system and method
CN113052299B (en) Neural network memory computing device based on lower communication bound and acceleration method
CN114492753A (en) Sparse accelerator applied to on-chip training
CN113516235A (en) Deformable convolution accelerator and deformable convolution acceleration method
CN116611488A (en) Vector processing unit, neural network processor and depth camera
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN113869446A (en) CNN target identification system and method based on FPGA
CN115935888A (en) Neural network accelerating system
Huang et al. A low-bit quantized and HLS-based neural network FPGA accelerator for object detection

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination