CN113516235A - Deformable convolution accelerator and deformable convolution acceleration method - Google Patents

Info

Publication number
CN113516235A
Authority
CN
China
Prior art keywords
module
value
convolution
input
input value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110788017.8A
Other languages
Chinese (zh)
Inventor
王中风
于悦
罗嘉鹏
毛文东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110788017.8A
Publication of CN113516235A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to the technical field of convolutional neural networks, and provides a deformable convolution accelerator and a deformable convolution acceleration method. In the FPGA-based hardware architecture design, the mapping operation of the value-fetching stage provides regular accesses for the convolution calculation; a register array is designed to match the processing rates of the two stages and to optimize the storage space; and the convolution operation is executed on the regularized input values to obtain the output result. The method accelerates the original deformable convolution layer without any adjustment to the algorithm and without restricting the offsets, so the accuracy of the original model is preserved to the greatest extent. Irregular receptive fields are regularized by the mapping module, and the operation rates of the mapping module and the convolution module are matched through the ping-pong operation of the register module, improving hardware utilization. Intermediate data do not need to be stored off chip, which reduces the number of accesses to the off-chip storage structure.

Description

Deformable convolution accelerator and deformable convolution acceleration method
Technical Field
The application relates to the technical field of convolutional neural networks, in particular to a deformable convolution accelerator and a deformable convolution acceleration method.
Background
Deformable convolution has important applications in object detection. Since deformable convolution theory was proposed, it has been applied in various object detection models to achieve higher detection accuracy and speed. In an ordinary convolutional network, a window of the same size as the convolution kernel slides over the input feature map, and the values at the positions covered by the window are multiplied by the convolution kernel and accumulated to obtain the output at the current position. Deformable convolution builds on ordinary convolution by dynamically generating offsets from the input feature map: when a value is to be selected from the input feature map and multiplied by the convolution kernel, the position corresponding to the sliding window is taken as a base address, the corresponding offset is added to the base address to obtain a new address, and the value is then fetched from the input feature map at the new address. Because the offsets are not integers, the new address is usually not an integer either. When the new address falls outside the address range of the input feature map, the returned input value is zero; when the new address lies within the address range, bilinear interpolation over the four points closest to the new address coordinate is used to compute the input value at the new address. This input value is multiplied by the convolution kernel and accumulated to obtain the output of the deformable convolution.
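To make the value-fetching procedure concrete, the following Python sketch models how one sample of a deformable convolution is fetched (an illustrative behavioural model only, not the patented hardware; the feature-map layout and all names are assumptions):

    def deform_sample(feat, base_y, base_x, off_y, off_x):
        """Fetch one input value of a deformable convolution.
        feat is a 2-D feature map (list of rows); (base_y, base_x) is the
        integer base address from the sliding window; (off_y, off_x) are
        the fractional offsets generated from the input feature map."""
        H, W = len(feat), len(feat[0])
        ry, rx = base_y + off_y, base_x + off_x       # real (non-integer) address
        if not (0 <= ry <= H - 1 and 0 <= rx <= W - 1):
            return 0.0                                # outside the map: return zero
        Y, X = int(ry), int(rx)                       # integer parts
        y, x = ry - Y, rx - X                         # fractional parts
        Y1, X1 = min(Y + 1, H - 1), min(X + 1, W - 1)
        # bilinear interpolation over the 4 nearest neighbours
        return ((1 - y) * (1 - x) * feat[Y][X] + (1 - y) * x * feat[Y][X1]
                + y * (1 - x) * feat[Y1][X] + y * x * feat[Y1][X1])

The deformable convolution output at a position is then the accumulation of such samples, each multiplied by its convolution-kernel weight.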
In order to facilitate hardware design, the prior art adjusts the deformable convolution algorithm, specifically: the magnitude of the offset is limited to a certain range and the offset direction is fixed, turning the irregular receptive field into a regular square; meanwhile, the offset is changed from a floating-point number to an integer, so that bilinear interpolation is skipped and the value is fetched directly. Through these operations the receptive field is regularized, and the adjusted deformable convolution layer is structurally close to a dilated convolution. However, this approach alters the deformable convolution algorithm substantially, resulting in reduced output accuracy.
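The kind of simplification described above can be sketched as follows (illustrative only; the clamping radius r and all names are assumptions, and the fixing of the offset direction is not modelled):

    def simplified_sample(feat, base_y, base_x, off_y, off_x, r):
        """Prior-art simplification: clamp the offset to [-r, r] and round
        it to an integer, so bilinear interpolation is skipped and the
        receptive field collapses to a regular, dilated-convolution-like grid."""
        Y = base_y + max(-r, min(r, round(off_y)))
        X = base_x + max(-r, min(r, round(off_x)))
        H, W = len(feat), len(feat[0])
        return feat[Y][X] if 0 <= Y < H and 0 <= X < W else 0.0

Compared with deform_sample above, the interpolation and the unrestricted offset range are lost, which is precisely the source of the accuracy drop.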
Other prior art limits the size of the receptive field by choosing smaller offsets during training, stores the values fetched at the offset addresses of the input feature map off chip as intermediate variables, and reads them back from off chip when the convolution is computed, which greatly increases the number of off-chip memory accesses.
Disclosure of Invention
In order to overcome the defects of the prior art, the present application aims to provide a deformable convolution accelerator and a deformable convolution acceleration method that solve the irregular accesses and reduced output accuracy caused by irregular receptive fields, while reducing the number of off-chip memory accesses by optimizing the storage space and the access operations.
In order to achieve the above object, in one aspect, the present application provides a deformable convolution accelerator, which includes a software end and a hardware end.
The software end is used for realizing the overall framework of the deformable convolution accelerator and for loading the original model parameters from an off-chip memory to the hardware end; the original model parameters include the original input values, offsets, quantization scales, masks and convolution weights. The software end is also used for reading the output values of the hardware end back to the off-chip memory.
The hardware end comprises:
The control module is used for acquiring the original input values, offsets, quantization scales, masks and convolution weights from the software end; it obtains the real address from the base address and the offset and separates it into an integer part and a fractional part; it outputs the control signals that govern the mapping module, the register module, the mask module, the convolution module, the input cache module, the output cache module, the weight cache module and the data interaction among these modules; and it transmits the output values to the software end.
And the input buffer module is used for storing the original input value.
And the register module is used for converting the access parallel dimensionality of the original input value, obtaining a parallel input value and storing the parallel input value.
The mapping module is used for obtaining the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part; and for carrying out bilinear interpolation according to the parallel input values and the interpolation weights to obtain the real input value.
And the mask module is used for storing the mask and obtaining a regularized input value according to the real input value and the mask.
And the weight cache module is used for storing the convolution weight.
And the convolution module is used for performing convolution calculation according to the convolution weight and the regularized input value to obtain an output value.
And the output buffer module is used for storing the output value obtained by the calculation of the convolution module.
Further, the convolution module comprises a multiplication unit, an addition unit and an accumulation unit.
The multiplication unit comprises a multiplier and is used for multiplying the regularized input value and the convolution weight to obtain a unit input value.
The addition unit comprises a carry-save adder and an adder and is used for carrying out addition operation on all unit input values step by step to obtain a unit accumulated value.
The accumulation unit comprises an accumulator and is used for accumulating all the unit accumulated values to obtain output values.
Further, the addition unit includes a preprocessing layer, an intermediate layer, and an output layer.
The preprocessing layer is used for selecting a corresponding carry save adder according to the size of the convolution kernel and grouping all unit input values according to the requirement of the carry save adder on the input values; and the carry save adder is used for carrying out addition operation on the grouped unit input values and outputting a first-level sum value and a first-level carry value; and the temporary storage value is obtained through an adder according to the primary carry value and the primary sum value.
And the intermediate layer is used for taking the temporary storage value output by the previous layer as the input of the next layer and carrying out progressive accumulation operation through the carry save adder and the adder to obtain a unit accumulated value.
And the output layer is used for transmitting the unit accumulated value to an accumulation unit.
Further, the register module comprises a read array and a write array; the read array is used for providing parallel input values for the mapping module to perform bilinear interpolation, the write array is used for storing the parallel input values, and the functions of the read array and the write array can be interchanged.
Furthermore, ping-pong operation is adopted by the read array and the write array to realize parallel processing of read action and storage action.
Further, the access parallel dimension of the register module is set to 36.
Further, the control module comprises a counter, and the counter is used for calculating the interpolation weights from the fractional part.
In a second aspect, the present application further provides a deformable convolution acceleration method, specifically including:
the control module obtains an original input value, an offset, a quantization scale, a mask and a convolution weight through a software end.
The control module adds a base address and the offset to obtain a real address, separates the real address into an integer part and a fractional part, and transmits the integer part, the fractional part and the quantization scale to a mapping module; and storing the original input value into an input cache module, storing the mask into a mask module, and caching the convolution weight into a weight cache module.
And the mapping module obtains the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part.
And the control module acquires a corresponding original input value from the input cache module according to the actual address and transmits the original input value to the register module.
And the register module converts the parallel dimensionality of the original input value and outputs a parallel input value.
And the mapping module carries out bilinear interpolation processing on the parallel input value and the interpolation weight to obtain a real input value.
And the mask module performs product operation on the real input value and the mask to obtain a regularized input value.
The convolution module obtains the convolution weight of the weight cache module and the regularization input value of the mask module through the control module, performs convolution calculation according to the regularization input value and the convolution weight to obtain an output value, and stores the output value to the output cache module.
And the control module controls the output cache module to transmit the output value to an off-chip memory through a software end.
Further, the register module includes a read array and a write array, and a specific method for converting parallel dimensions by the register module is as follows:
the register module obtains the original input value input into the cache module through the control module.
And the control module sends a control signal to store the original input value into a write array of the register module.
The control module controls the mapping module to read the stored original input value from the read array of the register module for performing the bilinear interpolation operation.
And when the writing array is full of the original input values and all the original input values in the reading array are processed by the mapping module, the functions of the reading array and the writing array are interchanged.
Further, the convolution module includes a multiplier, a carry save adder, an adder and an accumulator, and the specific method for the convolution module to perform convolution calculation is as follows:
and acquiring the convolution weight of the weight cache module and the regularized input value of the mask module through the control module.
And multiplying the convolution weight and the regularized input value through a multiplier to obtain a unit input value.
And selecting a corresponding carry save adder according to the size of the convolution kernel.
All unit input values are grouped according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped unit input values through a carry save adder, and outputting a first-level sum value and a first-level carry value.
And summing twice the first-level carry value with the first-level sum value through an adder to obtain a first-level temporary storage value.
And grouping all the first-level temporary storage values according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped first-level temporary storage values through a carry save adder, and outputting a second-level sum value and a second-level carry value.
And summing twice the second-level carry value with the second-level sum value through an adder to obtain a second-level temporary storage value.
The grouping and addition operations are repeated until the unit accumulated value is obtained.
And accumulating all the unit accumulated values through an accumulator to obtain an output value.
The application provides a deformable convolution accelerator and a deformable convolution acceleration method that accelerate the original deformable convolution layer without any adjustment to the algorithm: the original offset range is used and the offsets are not restricted, so the accuracy of the original model is preserved to the greatest extent. Irregular receptive fields are regularized by the mapping module, and the operation rates of the mapping module and the convolution module are matched through the ping-pong operation of the register module, improving hardware utilization. Meanwhile, intermediate data do not need to be stored off chip, which reduces the number of accesses to the off-chip storage structure.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; it is obvious that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the overall architecture of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 2 is a block diagram of a mapping module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 3 is a block diagram of a register module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an input buffer module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a convolution module of a deformable convolution accelerator according to an embodiment of the present application;
FIG. 6 is a schematic diagram comparing the numbers of off-chip accesses provided in the embodiments of the present application;
FIG. 7 is a flowchart of a deformable convolution acceleration method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be fully and clearly described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A Field Programmable Gate Array (FPGA) is a semi-custom circuit in the field of Application Specific Integrated Circuits (ASICs), developed from programmable devices such as PAL, GAL and CPLD. Because FPGAs offer short development cycles, strong functionality, high reliability and good confidentiality, FPGA-based designs for accelerating convolutional neural networks now abound, yet research on accelerating deformable convolutional networks with FPGAs is still lacking. The embodiments of the present application therefore study a deformable convolution accelerator on an FPGA platform to remedy the deficiencies of the prior art.
Referring to fig. 1, a schematic diagram of the overall architecture of a deformable convolution accelerator according to an embodiment of the present application is shown. A first aspect of the embodiments of the present application provides a deformable convolution accelerator, which specifically includes a software end and a hardware end.
The software end is used for realizing the overall framework of the deformable convolution accelerator and for loading the original model parameters from an off-chip memory to the hardware end; the original model parameters include the original input values, offsets, quantization scales, masks and convolution weights. The software end is also used for reading the output values of the hardware end back to the off-chip memory.
The hardware end comprises:
The control module is used for acquiring the original input values, offsets, quantization scales, masks and convolution weights from the software end; it obtains the real address from the base address and the offset and separates it into an integer part and a fractional part; it outputs the control signals that govern the mapping module, the register module, the mask module, the convolution module, the input cache module, the output cache module, the weight cache module and the data interaction among these modules; and it transmits the output values to the software end.
And the input buffer module is used for storing the original input value.
And the register module is used for converting the access parallel dimensionality of the original input value, obtaining a parallel input value and storing the parallel input value.
The mapping module is used for obtaining the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part; and for carrying out bilinear interpolation according to the parallel input values and the interpolation weights to obtain the real input value.
And the mask module is used for storing the mask and obtaining a regularized input value according to the real input value and the mask.
And the weight cache module is used for storing the convolution weight.
And the convolution module is used for performing convolution calculation according to the convolution weight and the regularized input value to obtain an output value.
And the output buffer module is used for storing the output value obtained by the calculation of the convolution module.
Compared with an ordinary convolution layer, deformable convolution obtains an irregularly shaped receptive field by offsetting the input addresses and can thus adapt to targets of different sizes and shapes, which is why it has important applications in object detection tasks. But the offset addresses also cause irregular accesses, and a conventional parallel strategy would produce access conflicts. The embodiments of the present application are a hardware architecture design for deformable convolution that divides the computation into two stages: a value-fetching stage and a convolution stage. In the value-fetching stage, the storage space is optimized to provide regular accesses for the convolution calculation; the computation of this stage is mainly completed by the mapping module. In the convolution stage, the convolution operation is executed on the regularized input values obtained from the previous stage to produce the output result. Compared with other hardware architectures, the embodiments of the present application accelerate the original deformable convolution for the first time and greatly reduce energy consumption. In addition, the application does not compress or regularize the receptive field, so the original accuracy of the model is preserved to the greatest extent; meanwhile, the optimization of the storage space reduces the number of accesses to the off-chip storage space.
Specifically, in order to verify the feasibility of the scheme, both the software end and the hardware end are verified on an FPGA platform.
Further, the control module comprises a counter, and the counter is used for calculating the interpolation weights from the fractional part.
Specifically, referring to fig. 2, a block diagram of the mapping module of a deformable convolution accelerator according to an embodiment of the present application is shown. The mapping module is organized as an array of feature-map cells, whose number is denoted by H; in the embodiment of the present application the convolution kernel size is 3 × 3 and the number of feature-map cells is 9. In each feature-map cell, the inputs X and Y are the integer parts of the abscissa and ordinate of the real address, which is obtained by adding the offset to the base address, and x and y are the fractional parts. For quantized values, the two parts can be obtained by shifting the offset address according to the quantization scale. From the integer parts X and Y, the 4 adjacent values required for bilinear interpolation are fetched from the feature map at the coordinates (X, Y), (X+1, Y), (X+1, Y+1) and (X, Y+1). The interpolation weights w_1 to w_4 of these 4 adjacent values are calculated from the fractional parts x and y by a counter in the control module. The fetched values are multiplied by the corresponding interpolation weights and accumulated to obtain the value at the offset address, which is the input value required by the convolution operation of the next stage.
In conclusion, in the mapping stage, values are fetched from the feature map at the coordinates of the actual addresses and are arranged regularly to be supplied to the convolution calculation of the next stage.
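For quantized fixed-point addresses, the split into integer and fractional parts and the weight computation reduce to simple shift, mask and multiply operations. A minimal sketch follows (assuming a quantization scale of Q fractional bits; the weight formulas are the standard bilinear-interpolation ones, which the text does not spell out, and all names are illustrative):

    def split_address(addr_fixed, Q):
        """Split a fixed-point address with Q fractional bits into its
        integer and fractional parts using only shift and mask."""
        return addr_fixed >> Q, addr_fixed & ((1 << Q) - 1)

    def interp_weights(x, y):
        """Interpolation weights w1..w4 of the 4 neighbours
        (X, Y), (X+1, Y), (X+1, Y+1), (X, Y+1) from fractional parts x, y."""
        w1 = (1 - x) * (1 - y)   # weight of (X, Y)
        w2 = x * (1 - y)         # weight of (X+1, Y)
        w3 = x * y               # weight of (X+1, Y+1)
        w4 = (1 - x) * y         # weight of (X, Y+1)
        return w1, w2, w3, w4

The interpolated input value is then w1*v1 + w2*v2 + w3*v3 + w4*v4 over the four fetched neighbour values.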
Specifically, if the convolution-stage calculation only began after the value-fetching stage had completed entirely, a large delay would result. In the embodiment of the present application the two stages therefore run synchronously: each value-fetching address in the feature map is multiplexed along the input dimension, and values at the same position of the feature map but on different input channels are fetched at the same time. In the calculation of the next stage, however, values at different positions of the same input channel must be input at the same time. Because the access dimensions differ, a register module is placed between the value-fetching stage and the convolution stage to store the intermediate data and convert the parallel dimension of the data.
Further, the register module comprises a read array and a write array; the read array is used for providing parallel input values for the mapping module to perform bilinear interpolation, the write array is used for storing the parallel input values, and the functions of the read array and the write array can be interchanged.
Furthermore, ping-pong operation is adopted by the read array and the write array to realize parallel processing of read action and storage action.
Further, the access parallel dimension of the register module is set to 36.
Referring to fig. 3, a schematic diagram of the register module of a deformable convolution accelerator according to an embodiment of the present application is shown. Specifically, so that the access operations do not affect each other and the calculations of the two stages can execute simultaneously, the register module adopts ping-pong operation: reading and writing proceed at the same time without mutual interference. One address in the mapping module fetches, simultaneously, the values at the corresponding position on multiple input channels, and these values, marked w_1 to w_36, are stored into the write array of the register module. Since the convolution kernel size is 3 × 3, each bilinear interpolation needs 4 adjacent values, and 36 original input values from the buffer must be supplied each time the convolution unit computes an output value, so the access-parallel dimension of the register module in the embodiment of the present application is 36. When all 36 addresses in the mapping module have fetched their original input values and stored them into the register module, 36 × 36 original input values are held in the write array, coming from 36 input channels with 36 original input values each. The read array likewise outputs 36 original input values at a time, marked r_1 to r_36; values output at the same time belong to the same input channel, and after 36 outputs all the original input values have been delivered to the mapping module, where bilinear interpolation is performed before the convolution module reuses the results across output channels. When the write array is full and all the original input values in the read array have been consumed by the subsequent modules, the read array and the write array are exchanged.
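The behaviour of the register module can be modelled as two 36 × 36 arrays whose roles swap once the write array is full and the read array is drained. A minimal sketch (method names and the row/column convention are assumptions for illustration):

    class PingPongRegisters:
        """Behavioural model of the ping-pong register arrays: writes store
        36 channel values of one address (one row); reads return 36 address
        values of one channel (one column), i.e. the parallel dimension is
        transposed between the value-fetching stage and the convolution stage."""
        DIM = 36

        def __init__(self):
            self.banks = [[[0] * self.DIM for _ in range(self.DIM)] for _ in range(2)]
            self.wr, self.rd = 0, 1          # which bank is written / read
            self.w_row = self.r_col = 0

        def write_row(self, values):          # one address, 36 input channels
            self.banks[self.wr][self.w_row] = list(values)
            self.w_row += 1

        def read_col(self):                   # one input channel, 36 addresses
            col = [self.banks[self.rd][r][self.r_col] for r in range(self.DIM)]
            self.r_col += 1
            return col

        def swap_if_done(self):               # ping-pong: exchange the arrays
            if self.w_row == self.DIM and self.r_col == self.DIM:
                self.wr, self.rd = self.rd, self.wr
                self.w_row = self.r_col = 0

Because one bank is only written while the other is only read, the two stages never contend for the same storage, which is what allows their rates to be matched.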
In order to match the access scheme of the register module, the arrangement of the original input values in the input cache module is also adjusted in the embodiment of the present application: the original input values are grouped by input channel, and the number of groups equals the access-parallel dimension of the register module.
Specifically, referring to fig. 4, a schematic structural diagram of the input buffer module of a deformable convolution accelerator according to an embodiment of the present application is shown. C represents an input channel. Each time the mapping module outputs an address to the input cache module, the input cache module outputs the values at the corresponding position of the 36 currently selected input channels. After a group of addresses in the mapping module has been fetched, the selection of input channels in the input cache module changes; the black box represents the selected input channels and moves in the direction of the arrow shown in the figure.
Further, the convolution module comprises a multiplication unit, an addition unit and an accumulation unit.
The multiplication unit comprises a multiplier and is used for multiplying the regularized input value and the convolution weight to obtain a unit input value.
The addition unit comprises a carry-save adder and an adder and is used for carrying out addition operation on all unit input values step by step to obtain a unit accumulated value.
The accumulation unit comprises an accumulator and is used for accumulating all the unit accumulated values to obtain output values.
The carry-save adder (CSA) has a very small carry-propagation delay when adding multiple numbers. Its basic idea is to reduce the sum of 3 addends to the sum of 2 addends, computing and saving the carry value and the sum value separately; since every bit position computes its carry and sum independently, the operation is very fast, the number of accumulations is reduced, and the critical path is shortened.
Specifically, when Verilog is used to implement an equation such as Sum = A + B + C + D, full adders can be designed and combined into an N-bit CSA structure: the addends are combined and summed through two stages of CSAs, and the carry value and the sum value are finally added by an ordinary adder. Whenever the carry value is transmitted to any module it must be multiplied by 2, because only after this weighting does the carry represent its true numerical value.
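A bit-level software model of this 3:2 carry-save step may make the idea concrete (illustrative Python, not the Verilog design itself; small integer operands are assumed):

    def csa(a, b, c):
        """3:2 carry-save adder: compress three addends into a sum word and
        a carry word; every bit position is computed independently, so there
        is no carry propagation inside this step."""
        s = a ^ b ^ c                        # per-bit sum
        cy = (a & b) | (a & c) | (b & c)     # per-bit carry (majority function)
        return s, cy

    # Sum = A + B + C + D via two CSA stages and one ordinary adder:
    A, B, C, D = 9, 14, 3, 7
    s1, c1 = csa(A, B, C)        # first stage: three addends -> sum, carry
    s2, c2 = csa(s1, 2 * c1, D)  # the reused carry is weighted by 2
    total = s2 + 2 * c2          # single ordinary addition at the end
    assert total == A + B + C + D

The factor of 2 is exactly the multiplication mentioned above: the carry word represents the next-higher bit position, so it must be shifted left by one bit before it is combined with a sum value.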
Further, the addition unit includes a preprocessing layer, an intermediate layer, and an output layer.
The preprocessing layer is used for selecting a corresponding carry save adder according to the size of the convolution kernel and grouping all unit input values according to the requirement of the carry save adder on the input values; and the carry save adder is used for carrying out addition operation on the grouped unit input values and outputting a first-level sum value and a first-level carry value; and the temporary storage value is obtained through an adder according to the primary carry value and the primary sum value.
And the intermediate layer is used for taking the temporary storage value output by the previous layer as the input of the next layer and carrying out progressive accumulation operation through the carry save adder and the adder to obtain a unit accumulated value.
The output layer is used for transmitting the unit accumulated value to an accumulator.
Specifically, for the second-stage calculation, the embodiment of the present application designs a convolution array dedicated to convolution computation. Referring to fig. 5, a schematic diagram of the convolution module of a deformable convolution accelerator according to an embodiment of the present application is shown. The parallel dimension of the output channels is N, which may vary with the design size; in the embodiment of the present application N is set to 32. Each convolution calculation unit is responsible for the convolution of one output channel, and the weights input at the same time share the same input channel but belong to different output channels. The values fetched at the offset addresses in the previous stage are reused by the different output channels. Here w_1 to w_9 are the input weight values, and v_1 to v_9 are the input values fetched by the mapping module in the previous stage. Since the convolution kernel size in the embodiment of the present application is 3 × 3, regularized inputs of the same size are taken and multiplied element-wise with the weights. To shorten the critical path and simplify the calculation, the 9 products are not accumulated directly; instead they are processed in groups of 3 by carry-save adders, and the 3 resulting values are added by a further carry-save adder stage. The two stages of carry-save adders compute the accumulated value of the 9 products, giving the output of the current position for the current input channel.
Because each convolution calculation unit is responsible for the calculation of one output channel, the output results of each convolution calculation unit are accumulated by an accumulator along the input-channel dimension to obtain the output value of the current position on the corresponding output channel.
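Putting the pieces together, one convolution calculation unit can be modelled as below (a behavioural sketch under the embodiment's 3 × 3 kernel, reusing the csa() helper from the earlier sketch; integer operands and the per-channel example data are assumptions):

    def csa(a, b, c):  # 3:2 carry-save step, as in the earlier sketch
        return a ^ b ^ c, (a & b) | (a & c) | (b & c)

    def conv_unit(v, w):
        """One output position of one output channel for a 3x3 kernel:
        multiply the 9 regularized inputs v by the 9 weights w, then reduce
        the 9 products with two CSA stages instead of a straight accumulation."""
        p = [vi * wi for vi, wi in zip(v, w)]    # 9 unit input values
        level1 = []
        for g in range(0, 9, 3):                 # groups of 3 products
            s, c = csa(p[g], p[g + 1], p[g + 2])
            level1.append(s + 2 * c)             # first-level temporary value
        s, c = csa(*level1)                      # second CSA stage over 3 values
        return s + 2 * c                         # output for this input channel

    # Accumulator over the input-channel dimension (two hypothetical channels):
    inputs  = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [9, 8, 7, 6, 5, 4, 3, 2, 1]]
    weights = [[1, 0, 2, 1, 0, 2, 1, 0, 2]] * 2
    out = sum(conv_unit(v, w) for v, w in zip(inputs, weights))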
The embodiment of the application provides a deformable convolution accelerator in which the mapping operation of the first stage regularizes the data accesses of the second-stage convolution calculation, resolving access conflicts and enabling large-scale parallelism in the architecture. The register module is designed to match the processing rates of the two stages, improving hardware utilization; it converts the parallel dimension of the accesses and, combined with the storage optimization of the input cache, keeps the storage requirement from growing significantly under parallel operation while reducing the number of off-chip accesses. The application does not compress or regularize the receptive field, so the accuracy of the original model is preserved to the greatest extent.
The deformable convolution accelerator and the method provided by the embodiments of the present application will be described in detail by specific embodiments.
As can be seen from Table 1, the accuracy of the embodiment of the present application on the COCO data set is closer to that of the original model than the prior art. The embodiment does not compress or regularize the receptive field; it accelerates the original deformable convolution layer and preserves the accuracy of the original model to the greatest extent.
Table 1. Accuracy comparison on the COCO data set (table reproduced as an image in the original publication)
As can be seen from Table 2, the embodiment of the present application occupies few on-board resources.
Table 2. Resource occupancy (table reproduced as an image in the original publication)
As can be seen from Table 3, the energy-efficiency ratio of the embodiment of the present application is greatly improved compared with a GPU.
Table 3. Energy consumption comparison (table reproduced as an image in the original publication)
Referring to fig. 6, a comparison of the numbers of off-chip accesses provided by the embodiment of the present application is shown, where S denotes the size (side length) of the input and N denotes the number of input channels. As can be seen from the figure, the deformable convolution accelerator provided by the embodiment of the present application greatly reduces the number of accesses to the off-chip storage structure compared with the DCL accelerator of the prior art.
In summary, the deformable convolution accelerator in the embodiment of the present application has the following features:
(1) The embodiment of the application designs a mapping module to complete the irregular value-fetching operation of deformable convolution, regularizing the accesses of the convolution stage and resolving access conflicts. The register module matches the computation rates of the two stages, improving hardware utilization; at the same time, it changes the parallel dimension of data access, reducing the storage-space requirement.
(2) The computation modules designed in the embodiment of the application constitute an accelerator architecture for the original deformable convolution computation; the original algorithm is not modified, and the accuracy of the original model is preserved to the greatest extent.
(3) The embodiment of the application is implemented on an FPGA platform and reaches 200 MHz and 126.32 GOPS. The energy efficiency on the FPGA is 50.33 GOPS/W, 2.5 times that of a GPU. In addition, the number of off-chip memory accesses is greatly reduced compared with the prior art.
Referring to fig. 7, a schematic flowchart of a deformable convolution acceleration method according to an embodiment of the present application is provided. A second aspect of the embodiments of the present application provides a deformable convolution acceleration method, which is used to guide the operation of the deformable convolution accelerator provided in the first aspect of the embodiments of the present application, and for details that are not disclosed in the deformable convolution acceleration method provided in the second aspect of the embodiments of the present application, please refer to the deformable convolution accelerator provided in the first aspect of the embodiments of the present application.
The deformable convolution acceleration method specifically comprises the following steps:
the control module obtains an original input value, an offset, a quantization scale, a mask and a convolution weight through a software end.
The control module adds a base address and the offset to obtain a real address, separates the real address into an integer part and a fractional part, and transmits the integer part, the fractional part and the quantization scale to a mapping module; and storing the original input value into an input cache module, storing the mask into a mask module, and caching the convolution weight into a weight cache module.
And the mapping module obtains the actual address and the interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part.
And the control module acquires a corresponding original input value from the input cache module according to the actual address and transmits the original input value to the register module.
And the register module converts the parallel dimensionality of the original input value and outputs a parallel input value.
And the mapping module carries out bilinear interpolation processing on the parallel input value and the interpolation weight to obtain a real input value.
And the mask module performs product operation on the real input value and the mask to obtain a regularized input value.
The convolution module obtains the convolution weight of the weight cache module and the regularization input value of the mask module through the control module, performs convolution calculation according to the regularization input value and the convolution weight to obtain an output value, and stores the output value to the output cache module.
And the control module controls the output cache module to transmit the output value to an off-chip memory through a software end.
Further, the register module includes a read array and a write array, and a specific method for converting parallel dimensions by the register module is as follows:
the register module obtains the original input value input into the cache module through the control module.
And the control module sends a control signal to store the original input value into a write array of the register module.
The control module controls the mapping module to read the stored original input value from the read array of the register module for performing the bilinear interpolation operation.
And when the writing array is full of the original input values and all the original input values in the reading array are processed by the mapping module, the functions of the reading array and the writing array are interchanged.
Further, the convolution module includes a multiplier, a carry save adder, an adder and an accumulator, and the specific method for the convolution module to perform convolution calculation is as follows:
and acquiring the convolution weight of the weight cache module and the regularized input value of the mask module through the control module.
And multiplying the convolution weight and the regularized input value through a multiplier to obtain a unit input value.
And selecting a corresponding carry save adder according to the size of the convolution kernel.
All unit input values are grouped according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped unit input values through a carry save adder, and outputting a first-level sum value and a first-level carry value.
And summing twice the first-level carry value with the first-level sum value through an adder to obtain a first-level temporary storage value.
And grouping all the first-level temporary storage values according to the requirement of the carry save adder on the input values.
And carrying out addition operation on the grouped first-level temporary storage values through a carry save adder, and outputting a second-level sum value and a second-level carry value.
And summing twice the second-level carry value with the second-level sum value through an adder to obtain a second-level temporary storage value.
The grouping and addition operations are repeated until the unit accumulated value is obtained.
And accumulating all the unit accumulated values through an accumulator to obtain an output value.
Specifically, referring to fig. 5, for a 3 × 3 convolution kernel and the inputs at a given position, the weights in the convolution kernel are multiplied element-wise with the inputs; the 9 resulting temporary values are added in groups of 3 by carry-save adders to obtain 3 accumulated values, which are then combined again through a carry-save adder to obtain the output value.
According to the technical scheme, the deformable convolution accelerator and the deformable convolution acceleration method accelerate the original deformable convolution layer without any adjustment to the algorithm: the original offset range is used and the offsets are not restricted, so the accuracy of the original model is preserved to the greatest extent. Irregular receptive fields are regularized by the mapping module, and the operation rates of the mapping module and the convolution module are matched through the ping-pong operation of the register module, improving hardware utilization. Meanwhile, intermediate data do not need to be stored off chip, which reduces the number of accesses to the off-chip storage structure.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.

Claims (10)

1. A deformable convolution accelerator is characterized by comprising a software end and a hardware end;
the software end is used for realizing the whole framework of the deformable convolution accelerator and loading original model parameters from an off-chip memory to the hardware end, and the original model parameters comprise original input values, offsets, quantization scales, masks and convolution weights; the device is also used for reading the output value of the hardware end to an off-chip memory;
the hardware end comprises:
the control module is used for acquiring the original input values, offsets, quantization scales, masks and convolution weights from the software end; for obtaining the real address from the base address and the offset and separating it into an integer part and a fractional part; for outputting the control signals that govern the mapping module, the register module, the mask module, the convolution module, the input cache module, the output cache module, the weight cache module and the data interaction among these modules; and for transmitting the output values to the software end;
the input buffer module is used for storing an original input value;
the register module is used for converting the access parallel dimensionality of the original input value, obtaining a parallel input value and storing the parallel input value;
the mapping module is used for obtaining an actual address and an interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part; and for carrying out bilinear interpolation according to the parallel input values and the interpolation weights to obtain a real input value;
the mask module is used for storing the mask and obtaining a regularized input value according to the real input value and the mask;
the weight cache module is used for storing the convolution weight;
the convolution module is used for carrying out convolution calculation according to the convolution weight and the regularized input value to obtain an output value;
and the output buffer module is used for storing the output value obtained by the calculation of the convolution module.
2. A deformable convolution accelerator according to claim 1 wherein said convolution module includes a multiplication unit, an addition unit and an accumulation unit;
the multiplication unit comprises a multiplier and is used for multiplying the regularized input value by the convolution weight to obtain a unit input value;
the addition unit comprises a carry-save adder and an adder and is used for carrying out step-by-step addition operation on all unit input values to obtain a unit accumulated value;
the accumulation unit comprises an accumulator and is used for accumulating all the unit accumulated values to obtain output values.
3. A deformable convolution accelerator according to claim 2 wherein said adding unit includes a preprocessing layer, an intermediate layer and an output layer;
the preprocessing layer is used for selecting a corresponding carry save adder according to the size of the convolution kernel and grouping all unit input values according to the requirement of the carry save adder on the input values; and the carry save adder is used for carrying out addition operation on the grouped unit input values and outputting a first-level sum value and a first-level carry value; the temporary storage value is obtained through an adder according to the primary carry value and the primary sum value;
the intermediate layer is used for taking the temporary storage value output by the previous layer as the input of the next layer and performing progressive accumulation operation through the carry save adder and the adder to obtain a unit accumulated value;
the output layer is used for transmitting the unit accumulated value to an accumulator.
4. A deformable convolution accelerator according to claim 1, characterized in that said register module comprises a read array for providing parallel input values for said mapping module for bilinear interpolation and a write array for storing parallel input values, the functions of said read array and said write array being interchangeable.
5. A deformable convolution accelerator according to claim 4 wherein said read array and said write array implement parallel processing of read and store actions using ping-pong operations.
6. A deformable convolution accelerator according to claim 5, wherein the access-parallel dimension of said register module is set to 36.
7. A deformable convolution accelerator according to claim 1 wherein said control module includes a counter for calculating said interpolation weight from said fractional portion.
8. A deformable convolution acceleration method applied to a deformable convolution accelerator according to any one of claims 1 to 7, comprising:
the control module acquires an original input value, an offset, a quantization scale, a mask and a convolution weight through a software end;
the control module adds a base address and the offset to obtain a real address, separates the real address into an integer part and a fractional part, and transmits the integer part, the fractional part and the quantization scale to a mapping module; storing the original input value into an input cache module, storing the mask into a mask module, and caching the convolution weight into a weight cache module;
the mapping module obtains an actual address and an interpolation weight corresponding to the actual address according to the quantization scale, the integer part and the fractional part;
the control module acquires a corresponding original input value from the input cache module according to the actual address and transmits the original input value to the register module;
the register module converts the parallel dimensionality of the original input value and outputs a parallel input value;
the mapping module carries out bilinear interpolation processing on the parallel input value and the interpolation weight to obtain a real input value;
the mask module performs product operation on the real input value and the mask to obtain a regularized input value;
the convolution module acquires the convolution weight of the weight cache module and the regularized input value of the mask module through the control module, performs convolution calculation according to the regularized input value and the convolution weight to acquire an output value, and stores the output value to the output cache module;
and the control module controls the output cache module to transmit the output value to an off-chip memory through a software end.
9. The method of claim 8, wherein the register module comprises a read array and a write array, and the specific method for the register module to convert the parallel dimension is as follows:
the register module acquires an original input value input into the cache module through the control module;
the control module sends a control signal to store the original input value into a write array of the register module;
the control module controls the mapping module to read the stored original input value from the read array of the register module so as to execute bilinear interpolation operation;
and when the writing array is full of the original input values and all the original input values in the reading array are processed by the mapping module, the functions of the reading array and the writing array are interchanged.
10. The method according to claim 8, wherein the convolution module comprises a multiplier, a carry save adder, an adder and an accumulator, and the convolution module performs convolution calculation by the specific method comprising:
acquiring the convolution weight of the weight cache module and the regularized input value of the mask module through a control module;
multiplying the convolution weight and the regularized input value by a multiplier to obtain a unit input value;
selecting a corresponding carry save adder according to the size of the convolution kernel;
grouping all unit input values according to the requirement of the carry save adder on the input values;
carrying out addition operation on the grouped unit input values through a carry save adder, and outputting a first-level sum value and a first-level carry value;
summing twice the first-level carry value with the first-level sum value through an adder to obtain a first-level temporary storage value;
grouping all the first-level temporary storage values according to the requirement of the carry save adder on the input values;
carrying out addition operation on the grouped first-level temporary storage values through a carry save adder, and outputting a second-level sum value and a second-level carry value;
summing twice the second-level carry value with the second-level sum value through an adder to obtain a second-level temporary storage value;
repeating the grouping and addition operations until the unit accumulated value is obtained;
and accumulating all the unit accumulated values through an accumulator to obtain an output value.
CN202110788017.8A 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method Pending CN113516235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110788017.8A CN113516235A (en) 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110788017.8A CN113516235A (en) 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method

Publications (1)

Publication Number Publication Date
CN113516235A true CN113516235A (en) 2021-10-19

Family

ID=78067228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110788017.8A Pending CN113516235A (en) 2021-07-13 2021-07-13 Deformable convolution accelerator and deformable convolution acceleration method

Country Status (1)

Country Link
CN (1) CN113516235A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN108229645A (en) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 Convolution accelerates and computation processing method, device, electronic equipment and storage medium
CN108133270A (en) * 2018-01-12 2018-06-08 清华大学 Convolutional neural networks accelerating method and device
US20190354835A1 (en) * 2018-05-17 2019-11-21 International Business Machines Corporation Action detection by exploiting motion in receptive fields
WO2019232836A1 (en) * 2018-06-04 2019-12-12 江南大学 Multi-scale sensing pedestrian detection method based on improved full convolutional network
CN109032781A (en) * 2018-07-13 2018-12-18 重庆邮电大学 A kind of FPGA parallel system of convolutional neural networks algorithm
CN110197255A (en) * 2019-04-29 2019-09-03 杰创智能科技股份有限公司 A kind of deformable convolutional network based on deep learning
CN110738317A (en) * 2019-10-17 2020-01-31 中国科学院上海高等研究院 FPGA-based deformable convolution network operation method, device and system
US20210142448A1 (en) * 2019-11-07 2021-05-13 Intel Corporation Adaptive deformable kernel prediction network for image de-noising
CN111368699A (en) * 2020-02-28 2020-07-03 交叉信息核心技术研究院(西安)有限公司 Convolutional neural network pruning method based on patterns and pattern perception accelerator
CN111563582A (en) * 2020-05-06 2020-08-21 哈尔滨理工大学 Method for realizing and optimizing accelerated convolution neural network on FPGA (field programmable Gate array)
CN112766479A (en) * 2021-01-26 2021-05-07 东南大学 Neural network accelerator supporting channel separation convolution based on FPGA
CN113011562A (en) * 2021-03-18 2021-06-22 华为技术有限公司 Model training method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUE YU ET AL.: "A Memory-Efficient Hardware Architecture for Deformable Convolutional Networks", IEEE Workshop on Signal Processing Systems, 11 November 2021 (2021-11-11), pages 140-145 *
YANG WEI: "Research on the FPGA Parallel Structure of Convolutional Neural Networks", Digital Technology & Application, vol. 12, 15 December 2015 (2015-12-15), page 51 *

Similar Documents

Publication Publication Date Title
WO2021036905A1 (en) Data processing method and apparatus, computer equipment, and storage medium
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
TW201913460A (en) Chip device and related products
CN109146067B (en) Policy convolution neural network accelerator based on FPGA
CN110852416A (en) CNN accelerated computing method and system based on low-precision floating-point data expression form
CN110175670B (en) Method and system for realizing YOLOv2 detection network based on FPGA
CN110688088A (en) General nonlinear activation function computing device and method for neural network
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN113792621B (en) FPGA-based target detection accelerator design method
CN112862091B (en) Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112668708A (en) Convolution operation device for improving data utilization rate
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN114003198B (en) Inner product processing unit, arbitrary precision calculation device, method, and readable storage medium
CN111091183A (en) Neural network acceleration system and method
CN113052299B (en) Neural network memory computing device based on lower communication bound and acceleration method
CN114492753A (en) Sparse accelerator applied to on-chip training
CN113516235A (en) Deformable convolution accelerator and deformable convolution acceleration method
CN116611488A (en) Vector processing unit, neural network processor and depth camera
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN113869446A (en) CNN target identification system and method based on FPGA
CN115935888A (en) Neural network accelerating system
Huang et al. A low-bit quantized and HLS-based neural network FPGA accelerator for object detection

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination