CN111814696A - Video ship target detection method based on improved YOLOv3 - Google Patents
Video ship target detection method based on improved YOLOv3
- Publication number: CN111814696A
- Application number: CN202010667301.5A
- Authority: CN (China)
- Prior art keywords: video, ship, frame, data, image
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Abstract
The invention discloses a video ship target detection method based on improved YOLOv3, which comprises the following steps: step 1, acquiring video data of water surface ships, performing frame extraction, extracting video ship image data, and building a video ship data set in the VOC data set format; step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame; step 3, performing pixel-level processing on the video ship data obtained in step 1. The invention integrates a spatio-temporal attention module that mines temporal information across adjacent frames and spatial information within each frame and fuses them, which addresses the unequal information content of adjacent frames in video of ships on complex water surfaces and gives the temporal and spatial dimensions distinct feature maps and distinct weights, thereby improving the precision and overall performance of video ship target detection.
Description
Technical Field
The invention relates to the field of target detection and computer vision, in particular to a video ship target detection method based on improved YOLOv3.
Background
As the cost of camera equipment keeps falling and video processing technology matures, video-based water-surface target detection is gradually being applied to many ocean management activities, such as water environment protection, water area surveying, maritime rights enforcement, and island protection, effectively compensating for the shortcomings of traditional detection means. Current intelligent video surveillance systems are mainly deployed in land environments, and achieving video target detection against a complex water-surface background remains a very challenging task.
In the existing method for detecting ships in inland waterways with active-staring technology, an electronic tag must first be installed on the hull; the tag stores various information about the ship, and when the ship passes through a detection area, relevant information such as ship class, passage time, and course can be obtained simply by reading the tag, after which the information is processed on a computer. Separately, an Inception module has been introduced into YOLOv3 to design a ship vision system that can accurately identify and track multiple sea and air targets in real time. These methods are unstable in the face of complicated inland-river and sea conditions, ship shaking caused by vessel motion, overlap between multiple ships, frequent and unusual weather changes, and large differences in imaging scale among ships, all of which greatly reduce the ship detection and recognition rate. How to effectively eliminate interference and achieve accurate target detection is therefore the main problem to be solved.
Target detection algorithms based on deep learning generally fall into two categories: first, the R-CNN family of algorithms based on region proposals; second, the YOLO and SSD families, which require no region proposals. The R-CNN and YOLO detection frameworks provide two basic network architectures for target detection research, improve the detection efficiency of convolutional neural networks, and offer efficient learning tools for multi-scale, multi-class target detection.
In summary, deep-learning-based target detection remains a challenging subject. In complex water-surface environments, detection still suffers from the unequal information content of adjacent video frames and from low video ship detection precision; moreover, the scarcity of data for existing detection methods hurts the accuracy of the detection results.
Disclosure of Invention
The invention aims to design a video ship target detection method based on improved YOLOv3. According to the characteristics of the complex water-surface environment and a self-made video ship data set, information from the temporal domain of adjacent video frames and the spatial domain within each frame is mined, and a spatio-temporal attention module is fused into the YOLOv3 model for information fusion, improving the precision and overall performance of video ship target detection. Because strong associations exist among the image sequences, each image reflects a specific target area, and change within that area is slow and regular, a pixel-level module is added to the YOLOv3 model. This improves the sensitivity of video ship target monitoring, raises detection accuracy, speeds up detection, and better copes with the influence of the complex water-surface environment.
The technical scheme of the invention is realized as follows: a video ship target detection method based on improved YOLOv3 comprises the following steps: step 1, acquiring video data of a water surface ship, performing frame extraction processing, extracting video ship image data, and making a video ship data set according to a VOC data set format;
step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame;
step 3, performing pixel-level processing on the video ship data obtained in step 1: generating a labeling matrix indicating whether the image pair has changed;
step 4, according to the self-made video ship data set from step 1, performing feature extraction with the Darknet-53 feature network model, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file;
step 5, detecting and analyzing video ship targets with the improved YOLOv3 according to the weight file obtained in step 4.
As a preferred technical solution of the present invention, in step 1, acquiring video data of surface ships, performing frame extraction, extracting ship image data, and making a video ship data set in the VOC data set format comprises:
step 1-1, performing frame extraction on the video ship data with a VS program to obtain the video frame data of ships;
step 1-2, annotating the images with the LabelImg visual image annotation tool and generating corresponding annotation files in xml format.
As a preferred technical scheme of the invention, in step 2 the information of the temporal domain of adjacent video frames and the spatial domain within the same frame is mined and fused; the spatio-temporal attention module introduces an attention mechanism so that the temporal and spatial dimensions have distinct feature maps and distinct weights;
in the embedding space, the temporal attention mechanism computes the similarity between adjacent frames (adjacent frames share parameters); intuitively, neighboring frames that are more similar to the reference frame should receive more attention. For each frame i ∈ [-N, +N], the similarity distance d can be calculated by formula (1):
d(F_{t+i}^a, F_t^a) = sigmoid(θ(F_{t+i}^a)^T · φ(F_t^a))    (1)
where θ(F_{t+i}^a) and φ(F_t^a) are two embeddings that can be realized with convolution kernels; the sigmoid activation restricts the output to [0, 1], stabilizing gradient back-propagation. The temporal attention map is spatially specific, with the same spatial size as F_{t+i}^a;
the temporal attention maps are multiplied pixel-wise with the original aligned features, and an extra fusion convolution layer then adjusts the attention-modulated features:
F'_{t+i}^a = d_{t+i} ⊙ F_{t+i}^a,  F_fusion = Conv([F'_{t-N}^a, …, F'_t^a, …, F'_{t+N}^a])
where ⊙ and [·, ·, ·] denote element-wise multiplication and concatenation, respectively;
a spatial attention mechanism is then computed from the fused features; a pyramid model enlarges the receptive range of the attention mechanism over different target scales, a Mask operation performs feature adjustment and fusion by element-wise multiplication and addition, and up-sampling finally yields the final feature map.
As a preferred technical solution of the present invention, for pixel-level change detection in step 3, two registered sequence image pairs are input, and the output is a 0/1 matrix of the corresponding image size, where 0 and 1 represent inconsistency and consistency, respectively. Training proceeds in two stages: in the first stage, a large number of varied image pairs with a certain degree of similarity and difference are used for training and learning; in the second stage, fine-tuning is performed for a particular image sequence, with the image pairs obtained by combining the sequence images.
As a preferred technical solution of the present invention, in step 4, training the YOLOv3 model on the Darknet-53 feature network according to the self-made video ship data set from step 1, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file comprises:
step 4-1, the Darknet-53 network borrows from the ResNet architecture and introduces residual structures; it comprises convolution layers, pooling layers, and a softmax layer. Following the regression idea, the YOLOv3 algorithm extracts features directly from the input image through the Darknet-53 feature extraction network to obtain multi-scale feature maps; the spatio-temporal attention module and the pixel-level module are fused via a Concat operation to obtain the fused multi-scale feature map; the input image is then divided into grids of corresponding sizes, the bounding boxes predicted from the grids are matched against the center coordinates of the target objects in the ground-truth boxes, and the targets are detected and classified on this basis;
step 4-2, the improved YOLOv3 network model is trained on the Darknet-53 feature network with a learning rate of 0.001, momentum of 0.9, weight decay of 0.0005, and 50200 training iterations; when the loss function fluctuates around a stable value, training stops and the trained weight file is obtained.
As a preferred technical solution of the present invention, the Concat operation in step 4-1 splices two or more feature maps along the channel or num dimension. Suppose the two inputs have channels X_1, X_2, …, X_c and Y_1, Y_2, …, Y_c, and let * denote convolution; then a single output channel of a convolution applied after Concat is:
Z = Σ_{i=1}^{c} X_i * K_i + Σ_{i=1}^{c} Y_i * K_{i+c}
where K_1, …, K_{2c} are the channels of the convolution kernel.
as a preferred technical solution of the present invention, in the step 5, according to the weight file obtained in the step 4, a command of "dark net. exe detector demo data/co.data yolov3.cfgyolov3.weights-i 0-depth 0.25-ext _ output test. mp4-out _ file xxx. mp4> xxx.txt" is run in the cmd environment, and the video ship target is detected and stored.
Compared with the prior art, the method integrates a spatio-temporal attention module that mines temporal information across adjacent frames and spatial information within each frame and fuses them, which addresses the unequal information content of adjacent frames in video of ships on complex water surfaces and gives the temporal and spatial dimensions distinct feature maps and distinct weights, thereby improving the precision and overall performance of video ship target detection. The invention also introduces a pixel-level module in which image pairs are obtained by combining the sequence images, easing the problem of scarce data to some extent and benefiting the accuracy of the detection results; the pixel-to-pixel deep network achieves high-precision sequence image detection, allows targets to be classified on the basis of change, and improves the sensitivity of video ship target detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without paying creative efforts.
FIG. 1 is a basic framework of a video ship detection method based on improved YOLOv3 according to an embodiment of the present invention;
FIG. 2 is a Darknet-53 network framework of the present invention;
FIG. 3 is a schematic diagram of the structure of a depth network of an image pair to a consistency matrix according to the present invention;
FIG. 4 is a schematic diagram of the space-time attention module structure according to the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-4, the present invention provides a technical solution: a video ship target detection method based on improved YOLOv3 comprises the following steps:
in step 1, collecting video data of a water surface ship and performing frame extraction processing, extracting ship image data, and making a video ship data set by referring to a VOC data set format, wherein the method comprises the following steps:
step 1-1, performing frame extraction on the video ship data with a VS program to obtain the video frame data of ships;
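The frame-extraction step above amounts to choosing a sampling schedule over the video. The helper below is a hypothetical stand-in (the patent uses a VS program and does not specify a sampling stride): it only illustrates which frame indices to keep and one possible naming convention for the extracted images.

```python
def frame_indices(total_frames: int, stride: int) -> list:
    """Indices of the frames kept when sampling every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, total_frames, stride))

def frame_filename(index: int) -> str:
    """Illustrative naming convention for an extracted frame image."""
    return "frame_%06d.jpg" % index

# e.g. keep every 10th frame of a 50-frame clip
kept = [frame_filename(i) for i in frame_indices(50, 10)]
```

The actual decoding of video into images would be done by the extraction program; this sketch covers only the bookkeeping.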
Step 1-2, during network model training, all collected images must be labeled so that the position of each image target can be obtained accurately; this yields the information about the targets in the image: the number of targets, their poses, their categories, the coordinates of the four vertices of each bounding box, and so on. The images are annotated with the LabelImg visual image annotation tool, which generates corresponding annotation files in xml format.
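The xml files produced in this step follow the VOC annotation layout. As a minimal sketch of that layout (a reduced subset of the fields LabelImg emits, using the standard library only):

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, boxes):
    """Build a minimal VOC-style XML annotation string.

    `boxes` is a list of (name, xmin, ymin, xmax, ymax) tuples in pixels.
    Only the core fields are emitted; a full LabelImg file carries more.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for name, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"),
                            (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(root, encoding="unicode")

# one annotated ship in a hypothetical extracted frame
xml_text = voc_annotation("frame_000042.jpg", 1920, 1080,
                          [("ship", 100, 200, 400, 350)])
```

The class name "ship" and the frame filename here are illustrative, not taken from the patent's data set.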
Step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame;
In step 2, information from the temporal domain of adjacent frames and the spatial domain within the same frame is mined and fused; for video target detection, the inter-frame temporal relations and intra-frame spatial relations of the video cannot be ignored. Because of shaking, occlusion, target motion, and similar factors, the information content of adjacent video frames is unequal, which inflates the weight parameters and adversely affects the subsequent target features. The spatio-temporal attention module introduces an attention mechanism so that the temporal and spatial dimensions have distinct feature maps and distinct weights.
In the embedding space, the temporal attention mechanism computes the similarity between adjacent frames (adjacent frames share parameters); intuitively, neighboring frames that are more similar to the reference frame should receive more attention. For each frame i ∈ [-N, +N], the similarity distance d can be calculated by formula (1):
d(F_{t+i}^a, F_t^a) = sigmoid(θ(F_{t+i}^a)^T · φ(F_t^a))    (1)
where θ(F_{t+i}^a) and φ(F_t^a) are two embeddings that can be realized with convolution kernels; the sigmoid activation restricts the output to [0, 1], stabilizing gradient back-propagation. The temporal attention map is spatially specific, with the same spatial size as F_{t+i}^a.
The temporal attention maps are multiplied pixel-wise with the original aligned features, and an extra fusion convolution layer then adjusts the attention-modulated features:
F'_{t+i}^a = d_{t+i} ⊙ F_{t+i}^a,  F_fusion = Conv([F'_{t-N}^a, …, F'_t^a, …, F'_{t+N}^a])
where ⊙ and [·, ·, ·] denote element-wise multiplication and concatenation, respectively.
A spatial attention mechanism is then computed from the fused features; a pyramid model enlarges the receptive range of the attention mechanism over different target scales, a Mask operation performs feature adjustment and fusion by element-wise multiplication and addition, and up-sampling finally yields the final feature map.
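The temporal-attention computation above can be sketched numerically. The NumPy function below is an illustrative stand-in in which the learned convolutional embeddings θ and φ are assumed to have already been applied; it computes the per-pixel sigmoid similarity and modulates the aligned neighbour features with it, matching formula (1) and the pixel-wise multiplication step.

```python
import numpy as np

def temporal_attention(ref_emb, nbr_emb, nbr_feat):
    """Spatially-specific temporal attention (sketch).

    ref_emb, nbr_emb: (C, H, W) embeddings of the reference and neighbour
    frames (outputs of the two convolutional embeddings in the text).
    nbr_feat: (C, H, W) aligned neighbour features to be modulated.
    Returns the neighbour features weighted pixel-wise by the sigmoid of
    the per-pixel dot-product similarity, so weights lie in (0, 1).
    """
    sim = np.sum(ref_emb * nbr_emb, axis=0)   # (H, W) dot product per pixel
    attn = 1.0 / (1.0 + np.exp(-sim))         # sigmoid -> weights in (0, 1)
    return nbr_feat * attn[None, :, :]        # broadcast over channels
```

In the patent's pipeline the modulated features from all 2N+1 frames would then be concatenated and passed through the extra fusion convolution layer.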
Step 3, performing pixel-level processing on the video ship data obtained in step 1: generating a labeling matrix indicating whether the image pair has changed;
In step 3, pixel-level change detection borrows the pixel-to-pixel deep learning network structure used for image segmentation and, facing the task directly, adapts it into a network mapping image pairs to a changed-or-not labeling matrix: the input is two registered sequence image pairs and the output is a 0/1 matrix of the corresponding image size, where 0 and 1 represent inconsistency and consistency, respectively. Training proceeds in two stages: in the first stage, a large number of varied image pairs with a certain degree of similarity and difference are used for training and learning; in the second stage, fine-tuning is performed for a particular image sequence, with the image pairs obtained by combining the sequence images, which eases the problem of scarce data to some extent. The pixel-to-pixel deep network achieves high-precision sequence image detection and allows targets to be classified on the basis of change.
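The input/output contract of that pixel-level module can be illustrated with a toy stand-in. The patent learns this mapping with a deep network; here a simple thresholded absolute difference substitutes for the learned model, but it produces exactly the kind of 0/1 consistency matrix described (1 = consistent, 0 = changed).

```python
import numpy as np

def consistency_matrix(img_a, img_b, thresh=0.1):
    """0/1 matrix the size of the image pair: 1 where consistent, 0 where changed.

    Toy stand-in for the learned pixel-to-pixel network: pixels whose
    absolute difference exceeds `thresh` are marked inconsistent (0).
    """
    diff = np.abs(np.asarray(img_a, float) - np.asarray(img_b, float))
    return (diff <= thresh).astype(np.uint8)
```

The threshold value is arbitrary; in the patent the decision boundary is learned from registered image pairs rather than fixed.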
Step 4, according to the self-made video ship data set from step 1, performing feature extraction with the Darknet-53 feature network model, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file;
In step 4, according to the self-made video ship data set from step 1, training the YOLOv3 model on the Darknet-53 feature network, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file comprises:
step 4-1, the Darknet-53 network borrows from the ResNet architecture and introduces residual structures; it comprises convolution layers, pooling layers, and a softmax layer. Following the regression idea, the YOLOv3 algorithm extracts features directly from the input image through the Darknet-53 feature extraction network to obtain multi-scale feature maps; the spatio-temporal attention module and the pixel-level module are fused via a Concat operation to obtain the fused multi-scale feature map; the input image is then divided into grids of corresponding sizes, the bounding boxes predicted from the grids are matched against the center coordinates of the target objects in the ground-truth boxes, and the targets are detected and classified on this basis.
The Concat operation in step 4-1 splices two or more feature maps along the channel or num dimension. Suppose the two inputs have channels X_1, X_2, …, X_c and Y_1, Y_2, …, Y_c, and let * denote convolution; then a single output channel of a convolution applied after Concat is:
Z = Σ_{i=1}^{c} X_i * K_i + Σ_{i=1}^{c} Y_i * K_{i+c}
where K_1, …, K_{2c} are the channels of the convolution kernel.
The Concat function is mostly used to exploit the semantic information of feature maps at different scales, fusing features by increasing the number of channels; splicing along the num dimension is mostly used for multi-task problems.
Step 4-2, the improved YOLOv3 network model is trained on the Darknet-53 feature network with a learning rate of 0.001, momentum of 0.9, weight decay of 0.0005, and 50200 training iterations; when the loss function fluctuates around a stable value, training stops and the trained weight file is obtained.
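The hyper-parameters of step 4-2 can be gathered into a single configuration object. This is only an illustrative sketch: the key names are ours, not Darknet's actual .cfg field names.

```python
# Training hyper-parameters from step 4-2, collected into one dict.
# Key names are illustrative; Darknet's .cfg file uses its own fields.
train_cfg = {
    "learning_rate": 0.001,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "max_batches": 50200,  # total training iterations
}

def should_stop(iteration: int, cfg: dict) -> bool:
    """Stop once the configured number of iterations is reached."""
    return iteration >= cfg["max_batches"]
```

In the patent, training may also stop earlier, once the loss settles around a stable value.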
Step 5, detecting and analyzing video ship targets with the improved YOLOv3 according to the weight file obtained in step 4.
In step 5, according to the weight file obtained in step 4, the command "darknet.exe detector demo data/co.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 -ext_output test.mp4 -out_filename xxx.mp4 > xxx.txt" is run in the cmd environment, and the video ship targets are detected and saved.
The parts of the invention not disclosed here are all prior art; their specific structures, materials, and working principles are not described in detail. Although embodiments of the present invention have been shown and described, those skilled in the art will appreciate that changes, modifications, substitutions, and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims (7)
1. A video ship target detection method based on improved YOLOv3 is characterized by comprising the following steps:
step 1, acquiring video data of a water surface ship, performing frame extraction processing, extracting video ship image data, and making a video ship data set according to a VOC data set format;
step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame;
step 3, performing pixel-level processing on the video ship data obtained in step 1: generating a labeling matrix indicating whether the image pair has changed;
step 4, according to the self-made video ship data set from step 1, performing feature extraction with the Darknet-53 feature network model, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file;
step 5, detecting and analyzing video ship targets with the improved YOLOv3 according to the weight file obtained in step 4.
2. The method for detecting the video ship target based on the improved YOLOv3 as claimed in claim 1, wherein in the step 1, the video data of the ship on the water surface are collected and processed by frame extraction, the ship image data are extracted, and the video ship data set is self-made according to the VOC data set format, which includes:
step 1-1, performing frame extraction on the video ship data with a VS program to obtain the video frame data of ships;
step 1-2, annotating the images with the LabelImg visual image annotation tool and generating corresponding annotation files in xml format.
3. The improved YOLOv3-based video ship target detection method of claim 1, wherein: in step 2, information from the temporal domain of adjacent video frames and the spatial domain within the same frame is mined and fused, and the spatio-temporal attention module introduces an attention mechanism so that the temporal and spatial dimensions have distinct feature maps and distinct weights;
in the embedding space, the temporal attention mechanism computes the similarity between adjacent frames (adjacent frames share parameters); intuitively, neighboring frames that are more similar to the reference frame should receive more attention. For each frame i ∈ [-N, +N], the similarity distance d can be calculated by formula (1):
d(F_{t+i}^a, F_t^a) = sigmoid(θ(F_{t+i}^a)^T · φ(F_t^a))    (1)
where θ(F_{t+i}^a) and φ(F_t^a) are two embeddings that can be realized with convolution kernels; the sigmoid activation restricts the output to [0, 1], stabilizing gradient back-propagation; the temporal attention map is spatially specific, with the same spatial size as F_{t+i}^a;
the temporal attention maps are multiplied pixel-wise with the original aligned features, and an extra fusion convolution layer then adjusts the attention-modulated features:
F'_{t+i}^a = d_{t+i} ⊙ F_{t+i}^a,  F_fusion = Conv([F'_{t-N}^a, …, F'_t^a, …, F'_{t+N}^a])
where ⊙ and [·, ·, ·] denote element-wise multiplication and concatenation, respectively;
a spatial attention mechanism is then computed from the fused features, a pyramid model enlarges the receptive range of the attention mechanism over different target scales, a Mask operation performs feature adjustment and fusion by element-wise multiplication and addition, and up-sampling finally yields the final feature map.
4. The method for detecting video ship targets based on improved YOLOv3 of claim 1, wherein for the pixel-level change detection in step 3, two registered sequence image pairs are input and the output is a 0/1 matrix of the corresponding image size, where 0 and 1 represent inconsistency and consistency, respectively; training proceeds in two stages: in the first stage, a large number of varied image pairs with a certain degree of similarity and difference are used for training and learning; in the second stage, fine-tuning is performed for a particular image sequence, with the image pairs obtained by combining the sequence images.
5. The video ship target detection method based on improved YOLOv3 according to claim 1, wherein in step 4, according to the self-made video ship data set of step 1, the YOLOv3 model is trained based on the Darknet-53 feature network model, a Concat operation is performed on the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and a trained weight file is finally obtained; the method comprises:
step 4-1, the Darknet-53 network references the ResNet network structure and introduces residual blocks; the Darknet-53 network comprises convolutional layers, pooling layers, and a softmax layer. Adopting a regression idea, the YOLOv3 algorithm directly extracts features from the input image through the Darknet-53 feature extraction network to obtain multi-scale feature maps; the feature maps of the spatio-temporal attention module and the pixel-level module are fused through a Concat operation to obtain the fused multi-scale feature maps; the input image is then divided into grids of corresponding sizes, the bounding boxes predicted by each grid cell are matched and localized against the center coordinates of the target objects in the ground-truth boxes, and the targets are detected and classified on this basis;
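The grid-matching step can be sketched numerically: in a YOLO-style assignment, the grid cell containing a ground-truth box center is the one whose predictions are matched to that box. The function below is an illustrative sketch, not the patented matching logic; the 416×416 input and 13×13 grid are one standard YOLOv3 scale used here as an example.

```python
def responsible_cell(cx, cy, img_w, img_h, grid):
    """Return the (row, col) grid cell matched to a ground-truth box whose
    center is (cx, cy) in pixels, for a grid x grid division of the image.
    min() clamps centers lying exactly on the right/bottom border."""
    col = min(grid - 1, int(cx / img_w * grid))
    row = min(grid - 1, int(cy / img_h * grid))
    return row, col

# A 416x416 input divided into a 13x13 grid (one YOLOv3 detection scale):
print(responsible_cell(208, 208, 416, 416, 13))  # (6, 6)
```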
and step 4-2, training the improved YOLOv3 network model based on the Darknet-53 feature network model, with a learning rate of 0.001, momentum set to 0.9, weight decay set to 0.0005, and 50200 training iterations; when the loss function fluctuates around a stable value, training is stopped to obtain the trained weight file.
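The stopping rule of step 4-2 — halt once the loss "fluctuates around a certain value" — can be approximated by a simple plateau detector. This is a hypothetical helper for illustration only; the window size and tolerance are assumed values, not taken from the patent.

```python
def loss_plateaued(losses, window=5, tol=1e-3):
    """Heuristic stand-in for the stopping rule in step 4-2: report True
    once the last `window` loss values fluctuate within `tol` of their mean."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    mean = sum(recent) / window
    return all(abs(v - mean) <= tol for v in recent)

history = [1.0, 0.5, 0.30, 0.2995, 0.3002, 0.2999, 0.3001]
print(loss_plateaued(history))  # True
```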
6. The method as claimed in claim 5, wherein the Concat operation in step 4-1 concatenates two or more feature maps along the channel (or num) dimension; assuming the channels of the two inputs are X_1, X_2, …, X_c and Y_1, Y_2, …, Y_c, and * represents convolution with kernels K_1, …, K_{2c}, then a single output channel of a convolution applied after Concat is:

Z_concat = Σ_{i=1}^{c} X_i * K_i + Σ_{i=1}^{c} Y_i * K_{c+i}
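The Concat identity in claim 6 can be verified numerically for the 1×1-kernel case, where each convolution reduces to a scalar weight per channel: a single convolution over the concatenated stack equals the two channel-wise sums. This sketch uses 1×1 kernels purely to keep the check simple; the identity extends to larger kernels term by term.

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 3, 4, 4
X = rng.standard_normal((c, h, w))   # first input, channels X_1..X_c
Y = rng.standard_normal((c, h, w))   # second input, channels Y_1..Y_c
K = rng.standard_normal(2 * c)       # one 1x1 kernel weight per concatenated channel

# Single output channel of a 1x1 convolution over Concat([X, Y]):
Z = np.tensordot(K, np.concatenate([X, Y], axis=0), axes=1)

# Equivalent sum form from claim 6: sum_i X_i*K_i + sum_i Y_i*K_{c+i}
Z_sum = sum(X[i] * K[i] for i in range(c)) + sum(Y[i] * K[c + i] for i in range(c))

print(np.allclose(Z, Z_sum))  # True
```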
7. The video ship target detection method based on improved YOLOv3 according to claim 1, wherein, according to the weight file obtained in step 4, the command "darknet.exe detector demo data/coco.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 -ext_output test.mp4 -out_filename xxx.mp4 > xxx.txt" is executed in a cmd environment to detect and save the video ship targets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010667301.5A CN111814696A (en) | 2020-07-13 | 2020-07-13 | Video ship target detection method based on improved YOLOv3 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111814696A true CN111814696A (en) | 2020-10-23 |
Family
ID=72842310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010667301.5A Withdrawn CN111814696A (en) | 2020-07-13 | 2020-07-13 | Video ship target detection method based on improved YOLOv3 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814696A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800838A (en) * | 2020-12-28 | 2021-05-14 | 浙江万里学院 | Channel ship detection and identification method based on deep learning |
CN113205151A (en) * | 2021-05-25 | 2021-08-03 | 上海海事大学 | Ship target real-time detection method and terminal based on improved SSD model |
CN113205151B (en) * | 2021-05-25 | 2024-02-27 | 上海海事大学 | Ship target real-time detection method and terminal based on improved SSD model |
CN117974734A (en) * | 2024-03-29 | 2024-05-03 | 阿里巴巴(中国)有限公司 | Image processing method, apparatus, storage medium, and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
de Silva et al. | Automated rip current detection with region based convolutional neural networks | |
CN111814696A (en) | Video ship target detection method based on improved YOLOv3 | |
CN113609896A (en) | Object-level remote sensing change detection method and system based on dual-correlation attention | |
Liu et al. | Subtler mixed attention network on fine-grained image classification | |
Hidaka et al. | Pixel-level image classification for detecting beach litter using a deep learning approach | |
CN116579616B (en) | Risk identification method based on deep learning | |
Golovko et al. | Development of solar panels detector | |
Wang et al. | A feature-supervised generative adversarial network for environmental monitoring during hazy days | |
Li et al. | Small target deep convolution recognition algorithm based on improved YOLOv4 | |
Li et al. | A review of deep learning methods for pixel-level crack detection | |
CN110827320A (en) | Target tracking method and device based on time sequence prediction | |
Cao et al. | Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks | |
Lowphansirikul et al. | 3D Semantic segmentation of large-scale point-clouds in urban areas using deep learning | |
Chen et al. | Multi-scale attention networks for pavement defect detection | |
CN116168240A (en) | Arbitrary-direction dense ship target detection method based on attention enhancement | |
Wang et al. | SAR ship detection in complex background based on multi-feature fusion and non-local channel attention mechanism | |
Luo et al. | RBD-Net: robust breakage detection algorithm for industrial leather | |
Yu et al. | Automatic segmentation of golden pomfret based on fusion of multi-head self-attention and channel-attention mechanism | |
Li et al. | Class-aware tiny object recognition over large-scale 3D point clouds | |
CN112785629A (en) | Aurora motion characterization method based on unsupervised deep optical flow network | |
Zhao et al. | Ocean ship detection and recognition algorithm based on aerial image | |
CN116824488A (en) | Target detection method based on transfer learning | |
CN114037737B (en) | Neural network-based offshore submarine fish detection and tracking statistical method | |
CN115578364A (en) | Weak target detection method and system based on mixed attention and harmonic factor | |
CN113313091B (en) | Density estimation method based on multiple attention and topological constraints under warehouse logistics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201023 |