CN113435337A - Video target detection method and device based on deformable convolution and attention mechanism - Google Patents

Video target detection method and device based on deformable convolution and attention mechanism

Info

Publication number
CN113435337A
CN113435337A (application number CN202110720136.XA)
Authority
CN
China
Prior art keywords
model data
detection
convolution
preset network
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110720136.XA
Other languages
Chinese (zh)
Inventor
李成钢
詹建文
李忠
李金岭
杜忠田
王彦君
夏海轮
张碧昭
余清华
卜理超
张天正
李凤文
袁福碧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Group System Integration Co Ltd
Original Assignee
China Telecom Group System Integration Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Group System Integration Co Ltd
Priority to CN202110720136.XA
Publication of CN113435337A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target detection method and device based on deformable convolution and an attention mechanism, belonging to the field of image detection. The method comprises the following steps: acquiring original image data; inputting the original image data into a preset network that includes a deformable convolution to obtain first model data; obtaining second model data by adding a preset network structure according to the first model data; and generating a detection result according to the second model data. The invention solves the problem of accurately identifying, in real time, video surveillance targets such as engineering vehicles and workers in scenes with complex on-site conditions, such as open-pit mining areas.

Description

Video target detection method and device based on deformable convolution and attention mechanism
Technical Field
The invention belongs to the field of image detection, and particularly relates to a video target detection method and device based on a deformable convolution and attention mechanism.
Background
With the continuous development of intelligent technology, people increasingly use intelligent devices in daily life, work, and study. Intelligent technology has improved quality of life and increased the efficiency of study and work.
Current target detection algorithms are fairly mature, and although vehicle detection has been studied extensively, there is very little research on detecting engineering vehicles in optical cable line scenes, especially special-purpose machines such as excavators. For intelligent monitoring of open-pit mining areas, the prior art uses an algorithm that automatically identifies engineering vehicles in large-scene, long-distance, multi-angle environments, but it relies on histogram of oriented gradients (HOG) features as the image description, so its accuracy is low and its detection speed is slow. For real-time monitoring of illegal land use, the prior art provides a real-time excavator monitoring method for natural scenes, but it mainly addresses the difficulty of accurately detecting working excavators under uneven illumination, occlusion, and similar conditions. The existing target detection process therefore has the following technical defects: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
Disclosure of Invention
The embodiments of the invention provide a video target detection method and device based on deformable convolution and an attention mechanism, which at least overcome the following defects of the prior-art target detection process: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
In one aspect of the present invention, a method for detecting a video object based on deformable convolution and attention mechanism is provided, which includes: acquiring original image data; inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; and generating a detection result according to the second model data.
Further, the inputting the original image data into a preset network to obtain first model data includes: inputting the original image data into the preset network to obtain model data to be perfected; and replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Further, after obtaining second model data by adding a preset network structure according to the first model data, the method further includes: optimizing the scale detection parameters in the second model data.
Further, the optimizing the scale detection parameters in the second model data includes: increasing the detection range of the scale detection of the second model data.
In another aspect of the present invention, there is also provided a video object detection apparatus based on deformable convolution and attention mechanism, including: the acquisition module is used for acquiring original image data; an input module, configured to input the original image data into a preset network to obtain first model data, where the preset network includes: a deformable convolution; the adding module is used for obtaining second model data by adding a preset network structure according to the first model data; and the generating module is used for generating a detection result according to the second model data.
Further, the input module includes: an input unit used for inputting the original image data into the preset network to obtain model data to be perfected; and a replacing unit used for replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Further, the apparatus further includes: an optimization module used for optimizing the scale detection parameters in the second model data.
Further, the optimization module includes: an increasing unit used for increasing the detection range of the scale detection of the second model data.
In another aspect of the present invention, a non-volatile storage medium is also provided, which includes a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the video object detection method based on deformable convolution and an attention mechanism.
In another aspect of the present invention, an electronic device is further provided, which includes a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions, when executed, perform the video object detection method based on deformable convolution and an attention mechanism.
Compared with the prior art, the invention has the beneficial effects that:
the invention solves the problems of accurate identification and real-time identification of video monitoring targets such as engineering vehicles, workers and the like in a strip mine area and the like with complex situations on site. The method comprises the steps of obtaining original image data, inputting the original image data into a preset network, and obtaining first model data. Wherein, the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; the mode of generating the detection result according to the second model data solves the following defects in the target detection process in the prior art: (1) the application scene is on the optical cable road in the video monitoring, mostly is the construction site, and the background is very complicated. (2) The excavator is changeable in shape, high in detection difficulty and low in detection accuracy, and the problem that the target deformation adaptability of the current detection algorithm is poor needs to be solved. (3) The excavator has different scales in the video monitoring image, and the detection frame has different sizes. (4) The real-time performance of the current detection algorithm is poor, and the real-time performance of the algorithm needs to be guaranteed in an actual application scene.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is the network structure of the YOLOv3-tiny detection algorithm according to an embodiment of the invention;
FIG. 2 is a comparison of different simulation effects when extracting excavator features according to standard convolution and deformable convolution of the embodiment of the invention;
FIG. 3 is the basic structure of a SENet network according to an embodiment of the invention;
FIG. 4 is a YOLOv3-monitor network structure based on a deformable convolution and attention mechanism according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method for video object detection based on deformable convolution and attention mechanism according to an embodiment of the present invention;
fig. 6 is a block diagram of a video object detection apparatus based on deformable convolution and attention mechanism according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, there is provided a method embodiment of a method for video object detection based on deformable convolution and attention mechanism, it is noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system such as a set of computer executable instructions and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than that illustrated herein.
Example one
Fig. 5 is a flowchart of a video object detection method based on deformable convolution and attention mechanism according to an embodiment of the present invention, as shown in fig. 5, the method includes the following steps:
Step S102, original image data is acquired.
Specifically, the embodiment of the invention needs to avoid the prior-art defects in target detection: the application scenario is optical cable routes under video surveillance, mostly construction sites with very complex backgrounds; the excavator is variable in form, difficult to detect, and detected with low accuracy, so the poor adaptability of current detection algorithms to target deformation needs to be solved; the excavators in the surveillance images differ in size, so the detection boxes differ in size; and current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios. To this end, real-time original image data is first obtained through image acquisition equipment and stored through a storage module for subsequent analysis and recognition, for example as in the sketch below.
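The following is a purely illustrative sketch of step S102: frames are read from a surveillance stream with OpenCV and written to storage for later analysis. The stream URL, the output folder, and the choice of OpenCV as the acquisition backend are assumptions, not part of the claimed method.

```python
import os
import cv2

def acquire_frames(source="rtsp://example.invalid/stream", out_dir="frames", max_frames=100):
    """Read frames from a surveillance stream and store them for later analysis."""
    os.makedirs(out_dir, exist_ok=True)          # stand-in for the storage module
    cap = cv2.VideoCapture(source)               # stand-in for the image acquisition device
    saved = 0
    while saved < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
        saved += 1
    cap.release()
    return saved
```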
Step S104, inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution.
Optionally, the inputting the original image data into a preset network to obtain first model data includes: inputting the original image data into the preset network to obtain model data to be perfected; and replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Specifically, in order to efficiently and accurately output the recognition result from the input original image data, the embodiment of the present invention inputs the original image data into a preset network, where the preset network may be a YOLOv3-tiny network. Meanwhile, in order to increase the recognition accuracy for deformable objects such as excavators, the fourth, fifth, and sixth convolution layers in the YOLOv3-tiny network need to be replaced with deformable convolution, so as to increase the accuracy of the whole model and reduce misrecognition.
It should be noted that YOLOv3-tiny is a simplified version of YOLOv3. Compared with YOLOv3, its structure is simpler, it has fewer network layers, and its total number of parameters is greatly reduced, so it does not place high demands on hardware resources and its target detection speed is much faster. The network structure of YOLOv3-tiny is shown in fig. 1. The network has 24 layers in total; the backbone consists of 7 convolutional layers and 6 pooling layers, the input image size is 416 × 416, and the input is processed by successive convolution and pooling operations. The network outputs features at two scales, 13 × 13 and 26 × 26; the 13 × 13 output is upsampled back to 26 × 26 and combined with the 26 × 26 feature map from the backbone, so the 26 × 26 output is the comprehensive output of the two features. In this way multi-scale prediction is realized, and predictions are produced at both scales. Although YOLOv3-tiny inherits the good performance of the YOLOv3 algorithm and is much faster, the simplified backbone reduces detection accuracy and cannot effectively handle excavator detection in optical cable area scenes. The YOLOv3-tiny algorithm therefore needs to be improved and optimized to raise detection accuracy as much as possible without affecting detection speed. A minimal sketch of such a backbone is given below.
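To make the backbone layout concrete, the following is a minimal PyTorch sketch of a YOLOv3-tiny-style backbone matching the description above (7 convolutional layers, 6 pooling layers, 416 × 416 input, 13 × 13 and 26 × 26 feature outputs). The channel widths and the exact pooling arrangement are assumptions taken from common YOLOv3-tiny implementations, not from the patent itself.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k=3):
    # Conv + BatchNorm + LeakyReLU block, as used throughout YOLOv3-tiny-style backbones.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class TinyBackbone(nn.Module):
    """Sketch of a 7-conv / 6-pool backbone with a 416x416 input and 13x13 / 26x26 outputs."""
    def __init__(self):
        super().__init__()
        self.c1, self.c2, self.c3 = conv_bn_leaky(3, 16), conv_bn_leaky(16, 32), conv_bn_leaky(32, 64)
        self.c4, self.c5, self.c6 = conv_bn_leaky(64, 128), conv_bn_leaky(128, 256), conv_bn_leaky(256, 512)
        self.c7 = conv_bn_leaky(512, 1024)
        self.pool = nn.MaxPool2d(2, 2)                             # five stride-2 pools
        self.pool_s1 = nn.Sequential(nn.ZeroPad2d((0, 1, 0, 1)),   # sixth pool keeps 13x13
                                     nn.MaxPool2d(2, 1))

    def forward(self, x):
        x = self.pool(self.c1(x))     # 416 -> 208
        x = self.pool(self.c2(x))     # 208 -> 104
        x = self.pool(self.c3(x))     # 104 -> 52
        x = self.pool(self.c4(x))     # 52  -> 26
        route26 = self.c5(x)          # 26x26 map, later merged with the upsampled 13x13 branch
        x = self.pool(route26)        # 26  -> 13
        x = self.pool_s1(self.c6(x))  # stays 13x13
        return route26, self.c7(x)    # (26x26 feature, 13x13 feature)

if __name__ == "__main__":
    r26, d13 = TinyBackbone()(torch.randn(1, 3, 416, 416))
    print(r26.shape, d13.shape)       # torch.Size([1, 256, 26, 26]) torch.Size([1, 1024, 13, 13])
```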
Regarding the added deformable convolution: unlike an ordinary vehicle, an excavator changes form during operation because its boom and bucket are movable. A traditional convolutional neural network samples at fixed positions during convolution, and the activation units of a convolutional layer all share the same receptive field, so object features cannot be characterized correctly when the object deforms, which limits feature expression. A traditional convolutional neural network therefore has difficulty adapting to the various forms the excavator takes in optical cable line construction scenes; to solve this problem, the backbone of the YOLOv3-tiny algorithm is reconstructed with deformable convolution. In conventional convolution, a convolution kernel of size 3 × 3 can be represented by the sampling grid R:
R={(-1,-1),(-1,0),…,(0,1),(1,1)} (1)
then, the mapping of the conventional convolution can be defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn)  (2)
where y denotes the output feature map, P0 denotes the center position on the feature map, Pn enumerates the corresponding coordinate positions in R, w denotes the weights of the convolution kernel, and x denotes the input feature map. As equation (2) shows, once the convolution kernel is fixed, the sampling positions are fixed, i.e., the receptive field is fixed. As shown in fig. 2a, when the standard convolution samples features of the excavator image, the receptive field is fixed and the geometric deformation of the excavator cannot be represented accurately.
Whereas in a deformable convolution, the mapping of the convolution is defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn + ΔPn)  (3)
where ΔPn denotes the added offset. As equation (3) shows, the deformable convolution adds an offset to each sampling location, so that each convolution kernel has its own offset direction, as shown in fig. 2b. Since these offsets are learned by the convolutional layer from the preceding feature maps, the free deformation depends entirely on the input features and can therefore better accommodate the geometric deformation of the excavator.
In the backbone of YOLOv3-tiny, the deformable convolution of equation (3) is introduced to replace the traditional convolution in the network structure. Since the offsets of the deformable convolution also depend on the original image, replacing all conventional convolutions would not only lose the original image features but also create computational redundancy. Experiments show that detection is best when the fourth layer (Conv4), the fifth layer (Conv5), and the sixth layer (Conv6) of the YOLOv3-tiny backbone are replaced; the positions of Conv4, Conv5, and Conv6 are shown in fig. 1. Deformable convolution is therefore introduced into YOLOv3-tiny to replace the conventional convolution of the fourth, fifth, and sixth layers in the backbone network, as sketched below.
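As a hedged illustration of this replacement, the sketch below implements one deformable convolution block with torchvision's DeformConv2d, where a plain convolution predicts the per-location offsets ΔPn of equation (3). The channel sizes and the block layout are assumptions for illustration rather than the patented configuration; Conv4, Conv5, and Conv6 of a YOLOv3-tiny-style backbone could be swapped for such blocks.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformConvBlock(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # A small plain conv predicts 2*k*k offsets (a dx, dy pair per kernel sample point).
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)    # start from the regular sampling grid of eq. (2)
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(c_in, c_out, k, padding=k // 2)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        # Offsets are learned from the incoming feature map, so the sampling grid
        # (and hence the receptive field) deforms with the input, as in eq. (3).
        return self.act(self.bn(self.deform(x, self.offset(x))))

if __name__ == "__main__":
    x = torch.randn(1, 64, 52, 52)            # e.g. a feature map entering the fourth layer
    print(DeformConvBlock(64, 128)(x).shape)  # torch.Size([1, 128, 52, 52])
```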
Step S106, obtaining second model data by adding a preset network structure according to the first model data.
Specifically, in deep learning according to the embodiment of the present invention, the attention mechanism actively screens key information out of a large amount of external information and has been widely applied to different kinds of deep learning tasks such as speech processing, natural language processing, and image classification. In this embodiment, the backbone of the YOLOv3-tiny detection algorithm extracts features with conventional convolutional layers: during convolution only the features of a local image region within the receptive field are extracted, and different regions of the image are only related through multiple convolutional layers. Therefore, a SENet network structure is added to the first model data of this embodiment to gather global information of the image at the feature-channel level; SENet models the relevance between different channels and automatically adjusts the feature strength corresponding to each channel. The basic structure is shown in fig. 3, the basic structure of the SENet network according to the embodiment of the present invention; in the figure, F_tr denotes a conventional convolution structure, X is the input of the convolution, and U is its output. SENet adds operations after U, mainly a squeeze operation and an excitation operation. The squeeze operation, F_sq(·) in the figure, collects different mapping relations; the resulting function covers the whole distribution of the channel feature responses, so every layer of the network can use global mapping information. The excitation operation, F_ex(·, W) in the figure, superimposes a weight on each channel and controls the feature strength of each channel through these weights. Finally, the output is applied to the C channels of the vector U as the input of the next stage.
In addition, the SE module is a substructure of SENet; it is easy to implement, computationally lightweight, and has little influence on model complexity and computation. In this embodiment the SE module is adopted in the network structure and added after the output of the convolutional layer at the 13 × 13 prediction scale, so that the network can learn the feature weights according to the loss and the whole trained model achieves a better effect. A minimal sketch of such an SE block is given below.
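The following is a minimal SE block in the spirit of fig. 3: a squeeze step (global average pooling, F_sq) followed by an excitation step (two fully connected layers with a sigmoid, F_ex) whose output re-weights the channels of U. The reduction ratio r = 16 is an assumption; the patent does not state it.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)    # F_sq: one global statistic per channel
        self.excite = nn.Sequential(              # F_ex: learn per-channel weights
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        s = self.squeeze(u).view(b, c)            # (B, C)
        w = self.excite(s).view(b, c, 1, 1)       # channel weights in [0, 1]
        return u * w                              # re-scale each channel of U

if __name__ == "__main__":
    u = torch.randn(1, 512, 13, 13)   # e.g. a feature map at the 13x13 prediction scale
    print(SEBlock(512)(u).shape)      # torch.Size([1, 512, 13, 13])
```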
Optionally, after obtaining second model data by adding a preset network structure according to the first model data, the method further includes: optimizing the scale detection parameters in the second model data.
Optionally, the optimizing the scale detection parameters in the second model data includes: increasing the detection range of the scale detection of the second model data.
Specifically, in order to increase the range and accuracy of scale detection, the embodiment of the invention further optimizes the second model data; through this optimization the number of detection scales can be increased, so that in practical applications the embodiment can recognize and detect targets in images or surveillance footage more accurately.
For example, as the network structure diagram of YOLOv3-tiny shows, YOLOv3-tiny outputs features at only two scales, 13 × 13 and 26 × 26. The invention mainly detects excavators and pedestrians in video surveillance; their sizes in the surveillance video are not fixed and differ greatly, so relying on only the two prediction scales of YOLOv3-tiny would cause many missed detections. The invention therefore optimizes the scale detection part of the YOLOv3-tiny network structure, extends the original two-scale detection to three-scale detection, and adds a 52 × 52 feature-scale output, so that the algorithm can detect small targets in the surveillance video more accurately. With this improvement, the detection of engineering vehicles and pedestrians in optical cable line scenes is addressed in a targeted way. The network structure of the improved algorithm, YOLOv3-monitor, is shown in fig. 4. The input size is still 416 × 416; features are extracted by combining convolutional and pooling layers in the backbone; the convolutions of layers 4, 5, and 6 are deformable convolutions; there are three prediction scales, 13 × 13, 26 × 26, and 52 × 52; and an SE module is added after the 13 × 13 convolutional layer and then connected to the prediction part. A hedged sketch of such a three-scale head is given below.
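The following sketch illustrates a three-scale prediction head (13 × 13, 26 × 26, 52 × 52). The channel counts, the 1 × 1 reduction convolutions, the two-class setting (excavator, pedestrian), and the wiring of the upsample merges are assumptions made only to illustrate the three-scale idea; the patented YOLOv3-monitor network is defined by fig. 4.

```python
import torch
import torch.nn as nn

class ThreeScaleHead(nn.Module):
    def __init__(self, num_classes=2, anchors_per_scale=3):
        super().__init__()
        out_ch = anchors_per_scale * (5 + num_classes)   # (x, y, w, h, objectness) + class scores
        self.reduce13 = nn.Conv2d(1024, 256, 1)          # 1x1 reduce before the upsample merge
        self.reduce26 = nn.Conv2d(256, 128, 1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # An SE block (as sketched earlier) would sit on the 13x13 branch before pred13.
        self.pred13 = nn.Conv2d(1024, out_ch, 1)
        self.pred26 = nn.Conv2d(256 + 256, out_ch, 1)
        self.pred52 = nn.Conv2d(128 + 128, out_ch, 1)

    def forward(self, f52, f26, f13):
        # f52/f26/f13: backbone features at 52x52 (128 ch), 26x26 (256 ch), 13x13 (1024 ch).
        p13 = self.pred13(f13)
        m26 = torch.cat([self.up(self.reduce13(f13)), f26], dim=1)   # 13x13 -> 26x26 merge
        p26 = self.pred26(m26)
        m52 = torch.cat([self.up(self.reduce26(f26)), f52], dim=1)   # 26x26 -> 52x52 merge
        p52 = self.pred52(m52)
        return p13, p26, p52

if __name__ == "__main__":
    f52, f26, f13 = (torch.randn(1, 128, 52, 52),
                     torch.randn(1, 256, 26, 26),
                     torch.randn(1, 1024, 13, 13))
    for p in ThreeScaleHead()(f52, f26, f13):
        print(p.shape)   # (1, 21, 13, 13), (1, 21, 26, 26), (1, 21, 52, 52)
```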
Step S108, generating a detection result according to the second model data.
Specifically, once the optimized second model data can produce output, the detection result can be acquired and collected at its output end; the collected output data is displayed and stored as the detection result data so that the user can perform subsequent analysis and review. A hedged sketch of a typical post-processing step is given below.
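As an illustrative sketch of turning raw detections collected at the model output into final results, the candidates can be filtered by a confidence threshold and then deduplicated by class-wise non-maximum suppression. The thresholds and the use of torchvision's nms are assumptions, not requirements of the invention.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, labels, conf_thr=0.25, iou_thr=0.45):
    """Confidence filtering followed by class-wise non-maximum suppression."""
    # boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,); labels: (N,) integer class ids.
    keep = scores > conf_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    results = []
    for cls in labels.unique():
        m = labels == cls
        kept = nms(boxes[m], scores[m], iou_thr)
        results.append((boxes[m][kept], scores[m][kept], labels[m][kept]))
    return results

if __name__ == "__main__":
    boxes = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.], [200., 200., 260., 260.]])
    scores = torch.tensor([0.9, 0.6, 0.8])
    labels = torch.tensor([0, 0, 1])
    print(postprocess(boxes, scores, labels))   # the overlapping class-0 box is suppressed
```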
Through this embodiment, the following defects of the prior-art target detection process are overcome: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
Example two
Fig. 6 is a block diagram of a video object detection apparatus based on deformable convolution and attention mechanism according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes:
an obtaining module 60 is configured to obtain raw image data.
Specifically, the embodiment of the invention needs to avoid the prior-art defects in target detection: the application scenario is optical cable routes under video surveillance, mostly construction sites with very complex backgrounds; the excavator is variable in form, difficult to detect, and detected with low accuracy, so the poor adaptability of current detection algorithms to target deformation needs to be solved; the excavators in the surveillance images differ in size, so the detection boxes differ in size; and current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios. To this end, real-time original image data is first obtained through image acquisition equipment and stored through a storage module for subsequent analysis and recognition.
An input module 62, configured to input the original image data into a preset network to obtain first model data, where the preset network includes: a deformable convolution.
Optionally, the input module includes: an input unit used for inputting the original image data into the preset network to obtain model data to be perfected; and a replacing unit used for replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Specifically, in order to efficiently and accurately output the recognition result from the input original image data, the original image data obtained in the above embodiment needs to be input into a preset network, where the preset network may be a YOLOv3-tiny network. Meanwhile, in order to increase the recognition accuracy for deformable objects such as excavators, the fourth, fifth, and sixth convolution layers in the YOLOv3-tiny network need to be replaced with deformable convolution, so that the accuracy of the whole model is increased and misrecognition is reduced.
It should be noted that YOLOv3-tiny is a simplified version of YOLOv3. Compared with YOLOv3, its structure is simpler, it has fewer network layers, and its total number of parameters is greatly reduced, so it does not place high demands on hardware resources and its target detection speed is much faster. The network structure of YOLOv3-tiny is shown in fig. 1. The network has 24 layers in total; the backbone consists of 7 convolutional layers and 6 pooling layers, the input image size is 416 × 416, and the input is processed by successive convolution and pooling operations. The network outputs features at two scales, 13 × 13 and 26 × 26; the 13 × 13 output is upsampled back to 26 × 26 and combined with the 26 × 26 feature map from the backbone, so the 26 × 26 output is the comprehensive output of the two features. In this way multi-scale prediction is realized, and predictions are produced at both scales. Although YOLOv3-tiny inherits the good performance of the YOLOv3 algorithm and is much faster, the simplified backbone reduces detection accuracy and cannot effectively handle excavator detection in optical cable area scenes. The YOLOv3-tiny algorithm therefore needs to be improved and optimized to raise detection accuracy as much as possible without affecting detection speed.
Regarding the added deformable convolution: unlike an ordinary vehicle, an excavator changes form during operation because its boom and bucket are movable. A traditional convolutional neural network samples at fixed positions during convolution, and the activation units of a convolutional layer all share the same receptive field, so object features cannot be characterized correctly when the object deforms, which limits feature expression. A traditional convolutional neural network therefore has difficulty adapting to the various forms the excavator takes in optical cable line construction scenes; to solve this problem, the backbone of the YOLOv3-tiny algorithm is reconstructed with deformable convolution. In conventional convolution, a convolution kernel of size 3 × 3 can be represented by the sampling grid R:
R={(-1,-1),(-1,0),…,(0,1),(1,1)} (1)
then, the mapping of the conventional convolution can be defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn)  (2)
where y denotes the output feature map, P0 denotes the center position on the feature map, Pn enumerates the corresponding coordinate positions in R, w denotes the weights of the convolution kernel, and x denotes the input feature map. As equation (2) shows, once the convolution kernel is fixed, the sampling positions are fixed, i.e., the receptive field is fixed. As shown in fig. 2a, when the standard convolution samples features of the excavator image, the receptive field is fixed and the geometric deformation of the excavator cannot be represented accurately.
Whereas in a deformable convolution, the mapping of the convolution is defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn + ΔPn)  (3)
where ΔPn denotes the added offset. As equation (3) shows, the deformable convolution adds an offset to each sampling location, so that each convolution kernel has its own offset direction, as shown in fig. 2b. Since these offsets are learned by the convolutional layer from the preceding feature maps, the free deformation depends entirely on the input features and can therefore better accommodate the geometric deformation of the excavator.
In the backbone of YOLOv3-tiny, the deformable convolution of equation (3) is introduced to replace the traditional convolution in the network structure. Since the offsets of the deformable convolution also depend on the original image, replacing all conventional convolutions would not only lose the original image features but also create computational redundancy. Research and experiments show that detection is best when the fourth layer (Conv4), the fifth layer (Conv5), and the sixth layer (Conv6) of the YOLOv3-tiny backbone are replaced. Deformable convolution is therefore introduced into YOLOv3-tiny to replace the conventional convolution of the fourth, fifth, and sixth layers in the backbone network.
And an adding module 64, configured to obtain second model data by adding a preset network structure according to the first model data.
Specifically, in deep learning according to the embodiment of the present invention, the attention mechanism actively screens key information out of a large amount of external information and has been widely applied to different kinds of deep learning tasks such as speech processing, natural language processing, and image classification. In this embodiment, the backbone of the YOLOv3-tiny detection algorithm extracts features with conventional convolutional layers: during convolution only the features of a local image region within the receptive field are extracted, and different regions of the image are only related through multiple convolutional layers. Therefore, a SENet network structure is added to the first model data of this embodiment to gather global information of the image at the feature-channel level; SENet models the relevance between different channels and automatically adjusts the feature strength corresponding to each channel. The basic structure is shown in fig. 3, the basic structure of the SENet network according to the embodiment of the present invention; in the figure, F_tr denotes a conventional convolution structure, X is the input of the convolution, and U is its output. SENet adds operations after U, mainly a squeeze operation and an excitation operation. The squeeze operation, F_sq(·) in the figure, collects different mapping relations; the resulting function covers the whole distribution of the channel feature responses, so every layer of the network can use global mapping information. The excitation operation, F_ex(·, W) in the figure, superimposes a weight on each channel and controls the feature strength of each channel through these weights. Finally, the output is applied to the C channels of the vector U as the input of the next stage.
In addition, the SE module is a substructure of SENet; it is easy to implement, computationally lightweight, and has little influence on model complexity and computation. In this embodiment the SE module is adopted in the network structure and added after the output of the convolutional layer at the 13 × 13 prediction scale, so that the network can learn the feature weights according to the loss and the whole trained model achieves a better effect.
Optionally, the apparatus further includes: an optimization module used for optimizing the scale detection parameters in the second model data.
Optionally, the optimization module includes: an increasing unit used for increasing the detection range of the scale detection of the second model data.
Specifically, in order to increase the range and accuracy of scale detection, the embodiment of the invention further optimizes the second model data; through this optimization the number of detection scales can be increased, so that in practical applications the embodiment can recognize and detect targets in images or surveillance footage more accurately.
For example, as the network structure diagram of YOLOv3-tiny shows, YOLOv3-tiny outputs features at only two scales, 13 × 13 and 26 × 26. The invention mainly detects excavators and pedestrians in video surveillance; their sizes in the surveillance video are not fixed and differ greatly, so relying on only the two prediction scales of YOLOv3-tiny would cause many missed detections. The invention therefore optimizes the scale detection part of the YOLOv3-tiny network structure, extends the original two-scale detection to three-scale detection, and adds a 52 × 52 feature-scale output, so that the algorithm can detect small targets in the surveillance video more accurately. With this improvement, the detection of engineering vehicles and pedestrians in optical cable line scenes is addressed in a targeted way. The network structure of the improved algorithm, YOLOv3-monitor, is shown in fig. 4. The input size is still 416 × 416; features are extracted by combining convolutional and pooling layers in the backbone; the convolutions of layers 4, 5, and 6 are deformable convolutions; there are three prediction scales, 13 × 13, 26 × 26, and 52 × 52; and an SE module is added after the 13 × 13 convolutional layer and then connected to the prediction part.
And a generating module 66, configured to generate a detection result according to the second model data.
Specifically, once the optimized second model data can produce output, the detection result can be acquired and collected at its output end; the collected output data is displayed and stored as the detection result data so that the user can perform subsequent analysis and review.
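Purely as an illustration of how the four modules of fig. 6 compose, the sketch below chains them as plain callables; the class and parameter names are hypothetical, and the real modules would wrap the networks sketched in the method embodiment.

```python
class DetectionDevice:
    """Chains the four modules of fig. 6: acquisition -> input -> adding -> generating."""
    def __init__(self, acquire, build_first_model, add_structure, generate):
        self.acquire = acquire                       # acquisition module (raw image data)
        self.build_first_model = build_first_model   # input module (preset network with deformable conv)
        self.add_structure = add_structure           # adding module (SE structure, extra scale)
        self.generate = generate                     # generating module (detection result)

    def run(self):
        raw = self.acquire()
        first_model = self.build_first_model(raw)
        second_model = self.add_structure(first_model)
        return self.generate(second_model)
```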
According to another aspect of the embodiments of the present invention, there is also provided a non-volatile storage medium including a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the video object detection method based on deformable convolution and an attention mechanism.
Specifically, the method comprises the following steps: acquiring original image data; inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; and generating a detection result according to the second model data.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions, when executed, perform the video object detection method based on deformable convolution and an attention mechanism.
Specifically, the method comprises the following steps: acquiring original image data; inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; and generating a detection result according to the second model data.
Through this embodiment, the following defects of the prior-art target detection process are overcome: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (10)

1. A video object detection method based on deformable convolution and attention mechanism is characterized by comprising the following steps:
acquiring original image data;
inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution;
according to the first model data, second model data are obtained by adding a preset network structure;
and generating a detection result according to the second model data.
2. The method of claim 1, wherein inputting the original image data into a preset network to obtain first model data comprises:
inputting the original image data into the preset network to obtain model data to be perfected;
and replacing the fourth, fifth and sixth layers of convolution parameters in the model data to be perfected by deformable convolution.
3. The method of claim 1, wherein after obtaining second model data by adding a preset network structure according to the first model data, the method further comprises:
and optimizing the scale detection parameters in the second model data.
4. The method of claim 3, wherein optimizing the scale detection parameter in the second model data comprises:
and increasing the detection range of the second model data scale detection.
5. A video object detection apparatus based on deformable convolution and attention mechanism, comprising:
the acquisition module is used for acquiring original image data;
an input module, configured to input the original image data into a preset network to obtain first model data, where the preset network includes: a deformable convolution;
the adding module is used for obtaining second model data by adding a preset network structure according to the first model data;
and the generating module is used for generating a detection result according to the second model data.
6. The apparatus of claim 5, wherein the input module comprises:
the input unit is used for inputting the original image data into the preset network to obtain model data to be perfected;
and the replacing unit is used for replacing the fourth, fifth and sixth layers of convolution parameters in the model data to be perfected by the deformable convolution.
7. The apparatus of claim 5, further comprising:
and the optimization module is used for optimizing the scale detection parameters in the second model data.
8. The apparatus of claim 7, wherein the optimization module comprises:
and the increasing unit is used for increasing the detection range of the second model data scale detection.
9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the method of any one of claims 1 to 4.
10. An electronic device comprising a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions when executed perform the method of any one of claims 1 to 4.
CN202110720136.XA 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism Pending CN113435337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110720136.XA CN113435337A (en) 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110720136.XA CN113435337A (en) 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism

Publications (1)

Publication Number Publication Date
CN113435337A true CN113435337A (en) 2021-09-24

Family

ID=77754944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110720136.XA Pending CN113435337A (en) 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism

Country Status (1)

Country Link
CN (1) CN113435337A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163836A (en) * 2018-11-14 2019-08-23 宁波大学 Based on deep learning for the excavator detection method under the inspection of high-altitude
CN109409443A (en) * 2018-11-28 2019-03-01 北方工业大学 Multi-scale deformable convolution network target detection method based on deep learning
CN112329658A (en) * 2020-11-10 2021-02-05 江苏科技大学 Method for improving detection algorithm of YOLOV3 network
CN112396002A (en) * 2020-11-20 2021-02-23 重庆邮电大学 Lightweight remote sensing target detection method based on SE-YOLOv3
CN112560918A (en) * 2020-12-07 2021-03-26 杭州电子科技大学 Dish identification method based on improved YOLO v3


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant after: China Telecom Digital Intelligence Technology Co.,Ltd.

Address before: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant before: CHINA TELECOM GROUP SYSTEM INTEGRATION Co.,Ltd.