CN113435337A - Video target detection method and device based on deformable convolution and attention mechanism - Google Patents

Video target detection method and device based on deformable convolution and attention mechanism

Info

Publication number
CN113435337A
CN113435337A (application number CN202110720136.XA)
Authority
CN
China
Prior art keywords
model data
detection
convolution
preset network
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110720136.XA
Other languages
Chinese (zh)
Inventor
李成钢
詹建文
李忠
李金岭
杜忠田
王彦君
夏海轮
张碧昭
余清华
卜理超
张天正
李凤文
袁福碧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Group System Integration Co Ltd
Original Assignee
China Telecom Group System Integration Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Group System Integration Co Ltd
Priority to CN202110720136.XA
Publication of CN113435337A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target detection method and device based on deformable convolution and an attention mechanism, belonging to the field of image detection. The method comprises the following steps: acquiring original image data; inputting the original image data into a preset network that includes a deformable convolution to obtain first model data; obtaining second model data by adding a preset network structure according to the first model data; and generating a detection result according to the second model data. The invention solves the problem of accurately identifying, in real time, video surveillance targets such as engineering vehicles and workers in scenes with complex on-site conditions, such as open-pit mining areas.

Description

Video target detection method and device based on deformable convolution and attention mechanism
Technical Field
The invention belongs to the field of image detection, and particularly relates to a video target detection method and device based on a deformable convolution and attention mechanism.
Background
With the continuous development of intelligent technology, people increasingly use intelligent devices in daily life, work, and study. Intelligent technology has improved quality of life and increased the efficiency of study and work.
Current target detection algorithms are fairly mature, and although vehicle detection has been studied extensively, there is very little research on detecting engineering vehicles in optical cable line scenes, especially special-purpose machines such as excavators. For intelligent monitoring of open-pit mining areas, the prior art uses an algorithm that automatically identifies engineering vehicles in large-scene, long-distance, multi-angle environments, but it relies on histogram of oriented gradients (HOG) features as the image description, so its accuracy is low and its detection speed is slow. For real-time monitoring of illegal land use, the prior art provides a real-time excavator monitoring method for natural scenes, but it mainly addresses the difficulty of accurately detecting working excavators under uneven illumination, occlusion, and similar conditions. The existing target detection process therefore has the following technical defects: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
Disclosure of Invention
The embodiments of the invention provide a video target detection method and device based on deformable convolution and an attention mechanism, which at least overcome the following defects of the prior-art target detection process: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
In one aspect of the present invention, a method for detecting a video object based on deformable convolution and attention mechanism is provided, which includes: acquiring original image data; inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; and generating a detection result according to the second model data.
Further, the inputting the original image data into a preset network to obtain first model data includes: inputting the original image data into the preset network to obtain model data to be perfected; and replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Further, after obtaining second model data by adding a preset network structure according to the first model data, the method further includes: optimizing the scale detection parameters in the second model data.
Further, the optimizing the scale detection parameters in the second model data includes: increasing the detection range of the scale detection of the second model data.
In another aspect of the present invention, there is also provided a video object detection apparatus based on deformable convolution and attention mechanism, including: the acquisition module is used for acquiring original image data; an input module, configured to input the original image data into a preset network to obtain first model data, where the preset network includes: a deformable convolution; the adding module is used for obtaining second model data by adding a preset network structure according to the first model data; and the generating module is used for generating a detection result according to the second model data.
Further, the input module includes: an input unit used for inputting the original image data into the preset network to obtain model data to be perfected; and a replacing unit used for replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Further, the apparatus further includes: an optimization module used for optimizing the scale detection parameters in the second model data.
Further, the optimization module includes: an increasing unit used for increasing the detection range of the scale detection of the second model data.
In another aspect of the present invention, a non-volatile storage medium is also provided, which includes a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the video object detection method based on deformable convolution and an attention mechanism.
In another aspect of the present invention, an electronic device is further provided, which includes a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions, when executed, perform the video object detection method based on deformable convolution and an attention mechanism.
Compared with the prior art, the invention has the beneficial effects that:
the invention solves the problems of accurate identification and real-time identification of video monitoring targets such as engineering vehicles, workers and the like in a strip mine area and the like with complex situations on site. The method comprises the steps of obtaining original image data, inputting the original image data into a preset network, and obtaining first model data. Wherein, the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; the mode of generating the detection result according to the second model data solves the following defects in the target detection process in the prior art: (1) the application scene is on the optical cable road in the video monitoring, mostly is the construction site, and the background is very complicated. (2) The excavator is changeable in shape, high in detection difficulty and low in detection accuracy, and the problem that the target deformation adaptability of the current detection algorithm is poor needs to be solved. (3) The excavator has different scales in the video monitoring image, and the detection frame has different sizes. (4) The real-time performance of the current detection algorithm is poor, and the real-time performance of the algorithm needs to be guaranteed in an actual application scene.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is the network structure of the YOLOv3-tiny detection algorithm according to an embodiment of the invention;
FIG. 2 is a comparison of different simulation effects when extracting excavator features according to standard convolution and deformable convolution of the embodiment of the invention;
FIG. 3 is the basic structure of a SENet network according to an embodiment of the invention;
FIG. 4 is a YOLOv3-monitor network structure based on a deformable convolution and attention mechanism according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method for video object detection based on deformable convolution and attention mechanism according to an embodiment of the present invention;
fig. 6 is a block diagram of a video object detection apparatus based on deformable convolution and attention mechanism according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, there is provided a method embodiment of a method for video object detection based on deformable convolution and attention mechanism, it is noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system such as a set of computer executable instructions and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than that illustrated herein.
Example one
Fig. 5 is a flowchart of a video object detection method based on deformable convolution and attention mechanism according to an embodiment of the present invention, as shown in fig. 5, the method includes the following steps:
Step S102, original image data is acquired.
Specifically, the embodiment of the invention needs to avoid the prior-art defects in target detection: the application scenario is optical cable routes under video surveillance, mostly construction sites with very complex backgrounds; the excavator is variable in form, difficult to detect, and detected with low accuracy, so the poor adaptability of current detection algorithms to target deformation needs to be solved; the excavators in the surveillance images differ in size, so the detection boxes differ in size; and current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios. To this end, real-time original image data is first obtained through image acquisition equipment and stored through a storage module for subsequent analysis and recognition, for example as in the sketch below.
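The following is a purely illustrative sketch of step S102: frames are read from a surveillance stream with OpenCV and written to storage for later analysis. The stream URL, the output folder, and the choice of OpenCV as the acquisition backend are assumptions, not part of the claimed method.

```python
import os
import cv2

def acquire_frames(source="rtsp://example.invalid/stream", out_dir="frames", max_frames=100):
    """Read frames from a surveillance stream and store them for later analysis."""
    os.makedirs(out_dir, exist_ok=True)          # stand-in for the storage module
    cap = cv2.VideoCapture(source)               # stand-in for the image acquisition device
    saved = 0
    while saved < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
        saved += 1
    cap.release()
    return saved
```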
Step S104, inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution.
Optionally, the inputting the original image data into a preset network to obtain first model data includes: inputting the original image data into the preset network to obtain model data to be perfected; and replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Specifically, in order to efficiently and accurately output the recognition result from the input original image data, the embodiment of the present invention inputs the original image data into a preset network, where the preset network may be a YOLOv3-tiny network. Meanwhile, in order to increase the recognition accuracy for deformable objects such as excavators, the fourth, fifth, and sixth convolution layers in the YOLOv3-tiny network need to be replaced with deformable convolution, so as to increase the accuracy of the whole model and reduce misrecognition.
It should be noted that YOLOv3-tiny is a simplified version of YOLOv3. Compared with YOLOv3, its structure is simpler, it has fewer network layers, and its total number of parameters is greatly reduced, so it does not place high demands on hardware resources and its target detection speed is much faster. The network structure of YOLOv3-tiny is shown in fig. 1. The network has 24 layers in total; the backbone consists of 7 convolutional layers and 6 pooling layers, the input image size is 416 × 416, and the input is processed by successive convolution and pooling operations. The network outputs features at two scales, 13 × 13 and 26 × 26; the 13 × 13 output is upsampled back to 26 × 26 and combined with the 26 × 26 feature map from the backbone, so the 26 × 26 output is the comprehensive output of the two features. In this way multi-scale prediction is realized, and predictions are produced at both scales. Although YOLOv3-tiny inherits the good performance of the YOLOv3 algorithm and is much faster, the simplified backbone reduces detection accuracy and cannot effectively handle excavator detection in optical cable area scenes. The YOLOv3-tiny algorithm therefore needs to be improved and optimized to raise detection accuracy as much as possible without affecting detection speed. A minimal sketch of such a backbone is given below.
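To make the backbone layout concrete, the following is a minimal PyTorch sketch of a YOLOv3-tiny-style backbone matching the description above (7 convolutional layers, 6 pooling layers, 416 × 416 input, 13 × 13 and 26 × 26 feature outputs). The channel widths and the exact pooling arrangement are assumptions taken from common YOLOv3-tiny implementations, not from the patent itself.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k=3):
    # Conv + BatchNorm + LeakyReLU block, as used throughout YOLOv3-tiny-style backbones.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class TinyBackbone(nn.Module):
    """Sketch of a 7-conv / 6-pool backbone with a 416x416 input and 13x13 / 26x26 outputs."""
    def __init__(self):
        super().__init__()
        self.c1, self.c2, self.c3 = conv_bn_leaky(3, 16), conv_bn_leaky(16, 32), conv_bn_leaky(32, 64)
        self.c4, self.c5, self.c6 = conv_bn_leaky(64, 128), conv_bn_leaky(128, 256), conv_bn_leaky(256, 512)
        self.c7 = conv_bn_leaky(512, 1024)
        self.pool = nn.MaxPool2d(2, 2)                             # five stride-2 pools
        self.pool_s1 = nn.Sequential(nn.ZeroPad2d((0, 1, 0, 1)),   # sixth pool keeps 13x13
                                     nn.MaxPool2d(2, 1))

    def forward(self, x):
        x = self.pool(self.c1(x))     # 416 -> 208
        x = self.pool(self.c2(x))     # 208 -> 104
        x = self.pool(self.c3(x))     # 104 -> 52
        x = self.pool(self.c4(x))     # 52  -> 26
        route26 = self.c5(x)          # 26x26 map, later merged with the upsampled 13x13 branch
        x = self.pool(route26)        # 26  -> 13
        x = self.pool_s1(self.c6(x))  # stays 13x13
        return route26, self.c7(x)    # (26x26 feature, 13x13 feature)

if __name__ == "__main__":
    r26, d13 = TinyBackbone()(torch.randn(1, 3, 416, 416))
    print(r26.shape, d13.shape)       # torch.Size([1, 256, 26, 26]) torch.Size([1, 1024, 13, 13])
```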
Regarding the added deformable convolution: unlike an ordinary vehicle, an excavator changes form during operation because its boom and bucket are movable. A traditional convolutional neural network samples at fixed positions during convolution, and the activation units of a convolutional layer all share the same receptive field, so object features cannot be characterized correctly when the object deforms, which limits feature expression. A traditional convolutional neural network therefore has difficulty adapting to the various forms the excavator takes in optical cable line construction scenes; to solve this problem, the backbone of the YOLOv3-tiny algorithm is reconstructed with deformable convolution. In conventional convolution, a convolution kernel of size 3 × 3 can be represented by the sampling grid R:
R={(-1,-1),(-1,0),…,(0,1),(1,1)} (1)
then, the mapping of the conventional convolution can be defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn)  (2)
where y denotes the output feature map, P0 denotes the center position on the feature map, Pn enumerates the corresponding coordinate positions in R, w denotes the weights of the convolution kernel, and x denotes the input feature map. As equation (2) shows, once the convolution kernel is fixed, the sampling positions are fixed, i.e., the receptive field is fixed. As shown in fig. 2a, when the standard convolution samples features of the excavator image, the receptive field is fixed and the geometric deformation of the excavator cannot be represented accurately.
Whereas in a deformable convolution, the mapping of the convolution is defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn + ΔPn)  (3)
where ΔPn denotes the added offset. As equation (3) shows, the deformable convolution adds an offset to each sampling location, so that each convolution kernel has its own offset direction, as shown in fig. 2b. Since these offsets are learned by the convolutional layer from the preceding feature maps, the free deformation depends entirely on the input features and can therefore better accommodate the geometric deformation of the excavator.
In the backbone of YOLOv3-tiny, the deformable convolution of equation (3) is introduced to replace the traditional convolution in the network structure. Since the offsets of the deformable convolution also depend on the original image, replacing all conventional convolutions would not only lose the original image features but also create computational redundancy. Experiments show that detection is best when the fourth layer (Conv4), the fifth layer (Conv5), and the sixth layer (Conv6) of the YOLOv3-tiny backbone are replaced; the positions of Conv4, Conv5, and Conv6 are shown in fig. 1. Deformable convolution is therefore introduced into YOLOv3-tiny to replace the conventional convolution of the fourth, fifth, and sixth layers in the backbone network, as sketched below.
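As a hedged illustration of this replacement, the sketch below implements one deformable convolution block with torchvision's DeformConv2d, where a plain convolution predicts the per-location offsets ΔPn of equation (3). The channel sizes and the block layout are assumptions for illustration rather than the patented configuration; Conv4, Conv5, and Conv6 of a YOLOv3-tiny-style backbone could be swapped for such blocks.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformConvBlock(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # A small plain conv predicts 2*k*k offsets (a dx, dy pair per kernel sample point).
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)    # start from the regular sampling grid of eq. (2)
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(c_in, c_out, k, padding=k // 2)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        # Offsets are learned from the incoming feature map, so the sampling grid
        # (and hence the receptive field) deforms with the input, as in eq. (3).
        return self.act(self.bn(self.deform(x, self.offset(x))))

if __name__ == "__main__":
    x = torch.randn(1, 64, 52, 52)            # e.g. a feature map entering the fourth layer
    print(DeformConvBlock(64, 128)(x).shape)  # torch.Size([1, 128, 52, 52])
```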
Step S106, obtaining second model data by adding a preset network structure according to the first model data.
Specifically, in deep learning according to the embodiment of the present invention, the attention mechanism actively screens key information out of a large amount of external information and has been widely applied to different kinds of deep learning tasks such as speech processing, natural language processing, and image classification. In this embodiment, the backbone of the YOLOv3-tiny detection algorithm extracts features with conventional convolutional layers: during convolution only the features of a local image region within the receptive field are extracted, and different regions of the image are only related through multiple convolutional layers. Therefore, a SENet network structure is added to the first model data of this embodiment to gather global information of the image at the feature-channel level; SENet models the relevance between different channels and automatically adjusts the feature strength corresponding to each channel. The basic structure is shown in fig. 3, the basic structure of the SENet network according to the embodiment of the present invention; in the figure, F_tr denotes a conventional convolution structure, X is the input of the convolution, and U is its output. SENet adds operations after U, mainly a squeeze operation and an excitation operation. The squeeze operation, F_sq(·) in the figure, collects different mapping relations; the resulting function covers the whole distribution of the channel feature responses, so every layer of the network can use global mapping information. The excitation operation, F_ex(·, W) in the figure, superimposes a weight on each channel and controls the feature strength of each channel through these weights. Finally, the output is applied to the C channels of the vector U as the input of the next stage.
In addition, the SE module is a substructure of SENet; it is easy to implement, computationally lightweight, and has little influence on model complexity and computation. In this embodiment the SE module is adopted in the network structure and added after the output of the convolutional layer at the 13 × 13 prediction scale, so that the network can learn the feature weights according to the loss and the whole trained model achieves a better effect. A minimal sketch of such an SE block is given below.
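The following is a minimal SE block in the spirit of fig. 3: a squeeze step (global average pooling, F_sq) followed by an excitation step (two fully connected layers with a sigmoid, F_ex) whose output re-weights the channels of U. The reduction ratio r = 16 is an assumption; the patent does not state it.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)    # F_sq: one global statistic per channel
        self.excite = nn.Sequential(              # F_ex: learn per-channel weights
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        s = self.squeeze(u).view(b, c)            # (B, C)
        w = self.excite(s).view(b, c, 1, 1)       # channel weights in [0, 1]
        return u * w                              # re-scale each channel of U

if __name__ == "__main__":
    u = torch.randn(1, 512, 13, 13)   # e.g. a feature map at the 13x13 prediction scale
    print(SEBlock(512)(u).shape)      # torch.Size([1, 512, 13, 13])
```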
Optionally, after obtaining second model data by adding a preset network structure according to the first model data, the method further includes: optimizing the scale detection parameters in the second model data.
Optionally, the optimizing the scale detection parameters in the second model data includes: increasing the detection range of the scale detection of the second model data.
Specifically, in order to increase the range and accuracy of scale detection, the embodiment of the invention further optimizes the second model data; through this optimization the number of detection scales can be increased, so that in practical applications the embodiment can recognize and detect targets in images or surveillance footage more accurately.
For example, as the network structure diagram of YOLOv3-tiny shows, YOLOv3-tiny outputs features at only two scales, 13 × 13 and 26 × 26. The invention mainly detects excavators and pedestrians in video surveillance; their sizes in the surveillance video are not fixed and differ greatly, so relying on only the two prediction scales of YOLOv3-tiny would cause many missed detections. The invention therefore optimizes the scale detection part of the YOLOv3-tiny network structure, extends the original two-scale detection to three-scale detection, and adds a 52 × 52 feature-scale output, so that the algorithm can detect small targets in the surveillance video more accurately. With this improvement, the detection of engineering vehicles and pedestrians in optical cable line scenes is addressed in a targeted way. The network structure of the improved algorithm, YOLOv3-monitor, is shown in fig. 4. The input size is still 416 × 416; features are extracted by combining convolutional and pooling layers in the backbone; the convolutions of layers 4, 5, and 6 are deformable convolutions; there are three prediction scales, 13 × 13, 26 × 26, and 52 × 52; and an SE module is added after the 13 × 13 convolutional layer and then connected to the prediction part. A hedged sketch of such a three-scale head is given below.
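The following sketch illustrates a three-scale prediction head (13 × 13, 26 × 26, 52 × 52). The channel counts, the 1 × 1 reduction convolutions, the two-class setting (excavator, pedestrian), and the wiring of the upsample merges are assumptions made only to illustrate the three-scale idea; the patented YOLOv3-monitor network is defined by fig. 4.

```python
import torch
import torch.nn as nn

class ThreeScaleHead(nn.Module):
    def __init__(self, num_classes=2, anchors_per_scale=3):
        super().__init__()
        out_ch = anchors_per_scale * (5 + num_classes)   # (x, y, w, h, objectness) + class scores
        self.reduce13 = nn.Conv2d(1024, 256, 1)          # 1x1 reduce before the upsample merge
        self.reduce26 = nn.Conv2d(256, 128, 1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # An SE block (as sketched earlier) would sit on the 13x13 branch before pred13.
        self.pred13 = nn.Conv2d(1024, out_ch, 1)
        self.pred26 = nn.Conv2d(256 + 256, out_ch, 1)
        self.pred52 = nn.Conv2d(128 + 128, out_ch, 1)

    def forward(self, f52, f26, f13):
        # f52/f26/f13: backbone features at 52x52 (128 ch), 26x26 (256 ch), 13x13 (1024 ch).
        p13 = self.pred13(f13)
        m26 = torch.cat([self.up(self.reduce13(f13)), f26], dim=1)   # 13x13 -> 26x26 merge
        p26 = self.pred26(m26)
        m52 = torch.cat([self.up(self.reduce26(f26)), f52], dim=1)   # 26x26 -> 52x52 merge
        p52 = self.pred52(m52)
        return p13, p26, p52

if __name__ == "__main__":
    f52, f26, f13 = (torch.randn(1, 128, 52, 52),
                     torch.randn(1, 256, 26, 26),
                     torch.randn(1, 1024, 13, 13))
    for p in ThreeScaleHead()(f52, f26, f13):
        print(p.shape)   # (1, 21, 13, 13), (1, 21, 26, 26), (1, 21, 52, 52)
```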
Step S108, generating a detection result according to the second model data.
Specifically, once the optimized second model data can produce output, the detection result can be acquired and collected at its output end; the collected output data is displayed and stored as the detection result data so that the user can perform subsequent analysis and review. A hedged sketch of a typical post-processing step is given below.
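As an illustrative sketch of turning raw detections collected at the model output into final results, the candidates can be filtered by a confidence threshold and then deduplicated by class-wise non-maximum suppression. The thresholds and the use of torchvision's nms are assumptions, not requirements of the invention.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, labels, conf_thr=0.25, iou_thr=0.45):
    """Confidence filtering followed by class-wise non-maximum suppression."""
    # boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,); labels: (N,) integer class ids.
    keep = scores > conf_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    results = []
    for cls in labels.unique():
        m = labels == cls
        kept = nms(boxes[m], scores[m], iou_thr)
        results.append((boxes[m][kept], scores[m][kept], labels[m][kept]))
    return results

if __name__ == "__main__":
    boxes = torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.], [200., 200., 260., 260.]])
    scores = torch.tensor([0.9, 0.6, 0.8])
    labels = torch.tensor([0, 0, 1])
    print(postprocess(boxes, scores, labels))   # the overlapping class-0 box is suppressed
```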
Through this embodiment, the following defects of the prior-art target detection process are overcome: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
Example two
Fig. 6 is a block diagram of a video object detection apparatus based on deformable convolution and attention mechanism according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes:
an obtaining module 60 is configured to obtain raw image data.
Specifically, the embodiment of the invention needs to avoid the prior-art defects in target detection: the application scenario is optical cable routes under video surveillance, mostly construction sites with very complex backgrounds; the excavator is variable in form, difficult to detect, and detected with low accuracy, so the poor adaptability of current detection algorithms to target deformation needs to be solved; the excavators in the surveillance images differ in size, so the detection boxes differ in size; and current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios. To this end, real-time original image data is first obtained through image acquisition equipment and stored through a storage module for subsequent analysis and recognition.
An input module 62, configured to input the original image data into a preset network to obtain first model data, where the preset network includes: a deformable convolution.
Optionally, the input module includes: an input unit used for inputting the original image data into the preset network to obtain model data to be perfected; and a replacing unit used for replacing the fourth, fifth, and sixth convolution layers in the model data to be perfected with deformable convolution.
Specifically, in order to efficiently and accurately output the recognition result from the input original image data, the original image data obtained in the above embodiment needs to be input into a preset network, where the preset network may be a YOLOv3-tiny network. Meanwhile, in order to increase the recognition accuracy for deformable objects such as excavators, the fourth, fifth, and sixth convolution layers in the YOLOv3-tiny network need to be replaced with deformable convolution, so that the accuracy of the whole model is increased and misrecognition is reduced.
It should be noted that YOLOv3-tiny is a simplified version of YOLOv3. Compared with YOLOv3, its structure is simpler, it has fewer network layers, and its total number of parameters is greatly reduced, so it does not place high demands on hardware resources and its target detection speed is much faster. The network structure of YOLOv3-tiny is shown in fig. 1. The network has 24 layers in total; the backbone consists of 7 convolutional layers and 6 pooling layers, the input image size is 416 × 416, and the input is processed by successive convolution and pooling operations. The network outputs features at two scales, 13 × 13 and 26 × 26; the 13 × 13 output is upsampled back to 26 × 26 and combined with the 26 × 26 feature map from the backbone, so the 26 × 26 output is the comprehensive output of the two features. In this way multi-scale prediction is realized, and predictions are produced at both scales. Although YOLOv3-tiny inherits the good performance of the YOLOv3 algorithm and is much faster, the simplified backbone reduces detection accuracy and cannot effectively handle excavator detection in optical cable area scenes. The YOLOv3-tiny algorithm therefore needs to be improved and optimized to raise detection accuracy as much as possible without affecting detection speed.
Regarding the added deformable convolution: unlike an ordinary vehicle, an excavator changes form during operation because its boom and bucket are movable. A traditional convolutional neural network samples at fixed positions during convolution, and the activation units of a convolutional layer all share the same receptive field, so object features cannot be characterized correctly when the object deforms, which limits feature expression. A traditional convolutional neural network therefore has difficulty adapting to the various forms the excavator takes in optical cable line construction scenes; to solve this problem, the backbone of the YOLOv3-tiny algorithm is reconstructed with deformable convolution. In conventional convolution, a convolution kernel of size 3 × 3 can be represented by the sampling grid R:
R={(-1,-1),(-1,0),…,(0,1),(1,1)} (1)
then, the mapping of the conventional convolution can be defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn)  (2)
where y denotes the output feature map, P0 denotes the center position on the feature map, Pn enumerates the corresponding coordinate positions in R, w denotes the weights of the convolution kernel, and x denotes the input feature map. As equation (2) shows, once the convolution kernel is fixed, the sampling positions are fixed, i.e., the receptive field is fixed. As shown in fig. 2a, when the standard convolution samples features of the excavator image, the receptive field is fixed and the geometric deformation of the excavator cannot be represented accurately.
Whereas in a deformable convolution, the mapping of the convolution is defined as:
y(P0) = Σ_{Pn ∈ R} w(Pn) · x(P0 + Pn + ΔPn)  (3)
where ΔPn denotes the added offset. As equation (3) shows, the deformable convolution adds an offset to each sampling location, so that each convolution kernel has its own offset direction, as shown in fig. 2b. Since these offsets are learned by the convolutional layer from the preceding feature maps, the free deformation depends entirely on the input features and can therefore better accommodate the geometric deformation of the excavator.
In the backbone of YOLOv3-tiny, the deformable convolution of equation (3) is introduced to replace the traditional convolution in the network structure. Since the offsets of the deformable convolution also depend on the original image, replacing all conventional convolutions would not only lose the original image features but also create computational redundancy. Research and experiments show that detection is best when the fourth layer (Conv4), the fifth layer (Conv5), and the sixth layer (Conv6) of the YOLOv3-tiny backbone are replaced. Deformable convolution is therefore introduced into YOLOv3-tiny to replace the conventional convolution of the fourth, fifth, and sixth layers in the backbone network.
And an adding module 64, configured to obtain second model data by adding a preset network structure according to the first model data.
Specifically, in deep learning according to the embodiment of the present invention, the attention mechanism actively screens key information out of a large amount of external information and has been widely applied to different kinds of deep learning tasks such as speech processing, natural language processing, and image classification. In this embodiment, the backbone of the YOLOv3-tiny detection algorithm extracts features with conventional convolutional layers: during convolution only the features of a local image region within the receptive field are extracted, and different regions of the image are only related through multiple convolutional layers. Therefore, a SENet network structure is added to the first model data of this embodiment to gather global information of the image at the feature-channel level; SENet models the relevance between different channels and automatically adjusts the feature strength corresponding to each channel. The basic structure is shown in fig. 3, the basic structure of the SENet network according to the embodiment of the present invention; in the figure, F_tr denotes a conventional convolution structure, X is the input of the convolution, and U is its output. SENet adds operations after U, mainly a squeeze operation and an excitation operation. The squeeze operation, F_sq(·) in the figure, collects different mapping relations; the resulting function covers the whole distribution of the channel feature responses, so every layer of the network can use global mapping information. The excitation operation, F_ex(·, W) in the figure, superimposes a weight on each channel and controls the feature strength of each channel through these weights. Finally, the output is applied to the C channels of the vector U as the input of the next stage.
In addition, the SE module is a substructure of SENet; it is easy to implement, computationally lightweight, and has little influence on model complexity and computation. In this embodiment the SE module is adopted in the network structure and added after the output of the convolutional layer at the 13 × 13 prediction scale, so that the network can learn the feature weights according to the loss and the whole trained model achieves a better effect.
Optionally, the apparatus further includes: an optimization module used for optimizing the scale detection parameters in the second model data.
Optionally, the optimization module includes: an increasing unit used for increasing the detection range of the scale detection of the second model data.
Specifically, in order to increase the range and accuracy of scale detection, the embodiment of the invention further optimizes the second model data; through this optimization the number of detection scales can be increased, so that in practical applications the embodiment can recognize and detect targets in images or surveillance footage more accurately.
For example, as the network structure diagram of YOLOv3-tiny shows, YOLOv3-tiny outputs features at only two scales, 13 × 13 and 26 × 26. The invention mainly detects excavators and pedestrians in video surveillance; their sizes in the surveillance video are not fixed and differ greatly, so relying on only the two prediction scales of YOLOv3-tiny would cause many missed detections. The invention therefore optimizes the scale detection part of the YOLOv3-tiny network structure, extends the original two-scale detection to three-scale detection, and adds a 52 × 52 feature-scale output, so that the algorithm can detect small targets in the surveillance video more accurately. With this improvement, the detection of engineering vehicles and pedestrians in optical cable line scenes is addressed in a targeted way. The network structure of the improved algorithm, YOLOv3-monitor, is shown in fig. 4. The input size is still 416 × 416; features are extracted by combining convolutional and pooling layers in the backbone; the convolutions of layers 4, 5, and 6 are deformable convolutions; there are three prediction scales, 13 × 13, 26 × 26, and 52 × 52; and an SE module is added after the 13 × 13 convolutional layer and then connected to the prediction part.
And a generating module 66, configured to generate a detection result according to the second model data.
Specifically, once the optimized second model data can produce output, the detection result can be acquired and collected at its output end; the collected output data is displayed and stored as the detection result data so that the user can perform subsequent analysis and review.
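Purely as an illustration of how the four modules of fig. 6 compose, the sketch below chains them as plain callables; the class and parameter names are hypothetical, and the real modules would wrap the networks sketched in the method embodiment.

```python
class DetectionDevice:
    """Chains the four modules of fig. 6: acquisition -> input -> adding -> generating."""
    def __init__(self, acquire, build_first_model, add_structure, generate):
        self.acquire = acquire                       # acquisition module (raw image data)
        self.build_first_model = build_first_model   # input module (preset network with deformable conv)
        self.add_structure = add_structure           # adding module (SE structure, extra scale)
        self.generate = generate                     # generating module (detection result)

    def run(self):
        raw = self.acquire()
        first_model = self.build_first_model(raw)
        second_model = self.add_structure(first_model)
        return self.generate(second_model)
```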
According to another aspect of the embodiments of the present invention, there is also provided a non-volatile storage medium including a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the video object detection method based on deformable convolution and an attention mechanism.
Specifically, the method comprises the following steps: acquiring original image data; inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; and generating a detection result according to the second model data.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions, when executed, perform the video object detection method based on deformable convolution and an attention mechanism.
Specifically, the method comprises the following steps: acquiring original image data; inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution; according to the first model data, second model data are obtained by adding a preset network structure; and generating a detection result according to the second model data.
Through this embodiment, the following defects of the prior-art target detection process are overcome: (1) the application scenario is optical cable routes under video surveillance, mostly construction sites, where the background is very complex; (2) the excavator is variable in form and difficult to detect, detection accuracy is low, and the poor adaptability of current detection algorithms to target deformation needs to be solved; (3) excavators appear at different scales in surveillance images, so the detection boxes have different sizes; (4) current detection algorithms have poor real-time performance, while real-time operation must be guaranteed in practical application scenarios.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (10)

1. A video object detection method based on deformable convolution and attention mechanism is characterized by comprising the following steps:
acquiring original image data;
inputting the original image data into a preset network to obtain first model data, wherein the preset network comprises: a deformable convolution;
according to the first model data, second model data are obtained by adding a preset network structure;
and generating a detection result according to the second model data.
2. The method of claim 1, wherein inputting the original image data into a preset network to obtain first model data comprises:
inputting the original image data into the preset network to obtain model data to be perfected;
and replacing the fourth, fifth and sixth layers of convolution parameters in the model data to be perfected by deformable convolution.
3. The method of claim 1, wherein after obtaining second model data by adding a preset network structure according to the first model data, the method further comprises:
and optimizing the scale detection parameters in the second model data.
4. The method of claim 3, wherein optimizing the scale detection parameter in the second model data comprises:
and increasing the detection range of the second model data scale detection.
5. A video object detection apparatus based on deformable convolution and attention mechanism, comprising:
the acquisition module is used for acquiring original image data;
an input module, configured to input the original image data into a preset network to obtain first model data, where the preset network includes: a deformable convolution;
the adding module is used for obtaining second model data by adding a preset network structure according to the first model data;
and the generating module is used for generating a detection result according to the second model data.
6. The apparatus of claim 5, wherein the input module comprises:
the input unit is used for inputting the original image data into the preset network to obtain model data to be perfected;
and the replacing unit is used for replacing the fourth, fifth and sixth layers of convolution parameters in the model data to be perfected by the deformable convolution.
7. The apparatus of claim 5, further comprising:
and the optimization module is used for optimizing the scale detection parameters in the second model data.
8. The apparatus of claim 7, wherein the optimization module comprises:
and the increasing unit is used for increasing the detection range of the second model data scale detection.
9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the method of any one of claims 1 to 4.
10. An electronic device comprising a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions when executed perform the method of any one of claims 1 to 4.
CN202110720136.XA 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism Pending CN113435337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110720136.XA CN113435337A (en) 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110720136.XA CN113435337A (en) 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism

Publications (1)

Publication Number Publication Date
CN113435337A true CN113435337A (en) 2021-09-24

Family

ID=77754944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110720136.XA Pending CN113435337A (en) 2021-06-28 2021-06-28 Video target detection method and device based on deformable convolution and attention mechanism

Country Status (1)

Country Link
CN (1) CN113435337A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163836A (en) * 2018-11-14 2019-08-23 宁波大学 Based on deep learning for the excavator detection method under the inspection of high-altitude
CN109409443A (en) * 2018-11-28 2019-03-01 北方工业大学 Multi-scale deformable convolution network target detection method based on deep learning
CN112329658A (en) * 2020-11-10 2021-02-05 江苏科技大学 Method for improving detection algorithm of YOLOV3 network
CN112396002A (en) * 2020-11-20 2021-02-23 重庆邮电大学 Lightweight remote sensing target detection method based on SE-YOLOv3
CN112560918A (en) * 2020-12-07 2021-03-26 杭州电子科技大学 Dish identification method based on improved YOLO v3


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant after: China Telecom Digital Intelligence Technology Co.,Ltd.

Address before: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant before: CHINA TELECOM GROUP SYSTEM INTEGRATION Co.,Ltd.