CN111814696A - Video ship target detection method based on improved YOLOv3 - Google Patents
Video ship target detection method based on improved YOLOv3
- Publication number: CN111814696A
- Application number: CN202010667301.5A
- Authority: CN (China)
- Prior art keywords: video, ship, frame, data, image
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Abstract
The invention discloses a video ship target detection method based on improved YOLOv3, which comprises the following steps: step 1, acquiring video data of water surface ships, performing frame extraction, extracting video ship image data, and building a video ship data set in the VOC data set format; step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame; step 3, performing pixel-level processing on the video ship data obtained in step 1. The invention integrates a spatio-temporal attention module that mines temporal information across adjacent frames and spatial information within each frame and fuses them, which addresses the unequal information content of adjacent frames in video of ships on complex water surfaces and gives the temporal and spatial dimensions distinct feature maps and distinct weights, thereby improving the precision and overall performance of video ship target detection.
Description
Technical Field
The invention relates to the field of target detection and computer vision, in particular to a video ship target detection method based on improved YOLOv3.
Background
As the cost of camera equipment keeps falling and video processing technology matures, video-based water-surface target detection is gradually being applied to many ocean management activities, such as water environment protection, water area surveying, maritime rights enforcement, and island protection, effectively compensating for the shortcomings of traditional detection means. Current intelligent video surveillance systems are mainly deployed in land environments, and achieving video target detection against a complex water-surface background remains a very challenging task.
In the existing method for detecting ships in inland waterways with active-staring technology, an electronic tag must first be installed on the hull; the tag stores various information about the ship, and when the ship passes through a detection area, relevant information such as ship class, passage time, and course can be obtained simply by reading the tag, after which the information is processed on a computer. Separately, an Inception module has been introduced into YOLOv3 to design a ship vision system that can accurately identify and track multiple sea and air targets in real time. These methods are unstable in the face of complicated inland-river and sea conditions, ship shaking caused by vessel motion, overlap between multiple ships, frequent and unusual weather changes, and large differences in imaging scale among ships, all of which greatly reduce the ship detection and recognition rate. How to effectively eliminate interference and achieve accurate target detection is therefore the main problem to be solved.
Target detection algorithms based on deep learning generally fall into two categories: first, the R-CNN family of algorithms based on region proposals; second, the YOLO and SSD families, which require no region proposals. The R-CNN and YOLO detection frameworks provide two basic network architectures for target detection research, improve the detection efficiency of convolutional neural networks, and offer efficient learning tools for multi-scale, multi-class target detection.
In summary, deep-learning-based target detection remains a challenging subject. In complex water-surface environments, detection still suffers from the unequal information content of adjacent video frames and from low video ship detection precision; moreover, the scarcity of data for existing detection methods hurts the accuracy of the detection results.
Disclosure of Invention
The invention aims to design a video ship target detection method based on improved YOLOv3. According to the characteristics of the complex water-surface environment and a self-made video ship data set, information from the temporal domain of adjacent video frames and the spatial domain within each frame is mined, and a spatio-temporal attention module is fused into the YOLOv3 model for information fusion, improving the precision and overall performance of video ship target detection. Because strong associations exist among the image sequences, each image reflects a specific target area, and change within that area is slow and regular, a pixel-level module is added to the YOLOv3 model. This improves the sensitivity of video ship target monitoring, raises detection accuracy, speeds up detection, and better copes with the influence of the complex water-surface environment.
The technical scheme of the invention is realized as follows: a video ship target detection method based on improved YOLOv3 comprises the following steps: step 1, acquiring video data of a water surface ship, performing frame extraction processing, extracting video ship image data, and making a video ship data set according to a VOC data set format;
step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame;
step 3, performing pixel-level processing on the video ship data obtained in step 1: generating a labeling matrix indicating whether the image pair has changed;
step 4, according to the self-made video ship data set from step 1, performing feature extraction with the Darknet-53 feature network model, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file;
step 5, detecting and analyzing video ship targets with the improved YOLOv3 according to the weight file obtained in step 4.
As a preferred technical solution of the present invention, in step 1, acquiring video data of surface ships, performing frame extraction, extracting ship image data, and making a video ship data set in the VOC data set format comprises:
step 1-1, performing frame extraction on the video ship data with a VS program to obtain the video frame data of ships;
step 1-2, annotating the images with the LabelImg visual image annotation tool and generating corresponding annotation files in xml format.
As a preferred technical scheme of the invention, in step 2 the information of the temporal domain of adjacent video frames and the spatial domain within the same frame is mined and fused; the spatio-temporal attention module introduces an attention mechanism so that the temporal and spatial dimensions have distinct feature maps and distinct weights;
in the embedding space, the temporal attention mechanism computes the similarity between adjacent frames (adjacent frames share parameters); intuitively, neighboring frames that are more similar to the reference frame should receive more attention. For each frame i ∈ [-N, +N], the similarity distance d can be calculated by formula (1):
d(F_{t+i}^a, F_t^a) = sigmoid(θ(F_{t+i}^a)^T · φ(F_t^a))    (1)
where θ(F_{t+i}^a) and φ(F_t^a) are two embeddings that can be realized with convolution kernels; the sigmoid activation restricts the output to [0, 1], stabilizing gradient back-propagation. The temporal attention map is spatially specific, with the same spatial size as F_{t+i}^a;
the temporal attention maps are multiplied pixel-wise with the original aligned features, and an extra fusion convolution layer then adjusts the attention-modulated features:
F'_{t+i}^a = d_{t+i} ⊙ F_{t+i}^a,  F_fusion = Conv([F'_{t-N}^a, …, F'_t^a, …, F'_{t+N}^a])
where ⊙ and [·, ·, ·] denote element-wise multiplication and concatenation, respectively;
a spatial attention mechanism is then computed from the fused features; a pyramid model enlarges the receptive range of the attention mechanism over different target scales, a Mask operation performs feature adjustment and fusion by element-wise multiplication and addition, and up-sampling finally yields the final feature map.
As a preferred technical solution of the present invention, for pixel-level change detection in step 3, two registered sequence image pairs are input, and the output is a 0/1 matrix of the corresponding image size, where 0 and 1 represent inconsistency and consistency, respectively. Training proceeds in two stages: in the first stage, a large number of varied image pairs with a certain degree of similarity and difference are used for training and learning; in the second stage, fine-tuning is performed for a particular image sequence, with the image pairs obtained by combining the sequence images.
As a preferred technical solution of the present invention, in step 4, training the YOLOv3 model on the Darknet-53 feature network according to the self-made video ship data set from step 1, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file comprises:
step 4-1, the Darknet-53 network borrows from the ResNet architecture and introduces residual structures; it comprises convolution layers, pooling layers, and a softmax layer. Following the regression idea, the YOLOv3 algorithm extracts features directly from the input image through the Darknet-53 feature extraction network to obtain multi-scale feature maps; the spatio-temporal attention module and the pixel-level module are fused via a Concat operation to obtain the fused multi-scale feature map; the input image is then divided into grids of corresponding sizes, the bounding boxes predicted from the grids are matched against the center coordinates of the target objects in the ground-truth boxes, and the targets are detected and classified on this basis;
step 4-2, the improved YOLOv3 network model is trained on the Darknet-53 feature network with a learning rate of 0.001, momentum of 0.9, weight decay of 0.0005, and 50200 training iterations; when the loss function fluctuates around a stable value, training stops and the trained weight file is obtained.
As a preferred technical solution of the present invention, the Concat operation in step 4-1 splices two or more feature maps along the channel or num dimension. Suppose the two inputs have channels X_1, X_2, …, X_c and Y_1, Y_2, …, Y_c, and let * denote convolution; then a single output channel of a convolution applied after Concat is:
Z = Σ_{i=1}^{c} X_i * K_i + Σ_{i=1}^{c} Y_i * K_{i+c}
where K_1, …, K_{2c} are the channels of the convolution kernel.
as a preferred technical solution of the present invention, in the step 5, according to the weight file obtained in the step 4, a command of "dark net. exe detector demo data/co.data yolov3.cfgyolov3.weights-i 0-depth 0.25-ext _ output test. mp4-out _ file xxx. mp4> xxx.txt" is run in the cmd environment, and the video ship target is detected and stored.
Compared with the prior art, the method integrates a spatio-temporal attention module that mines temporal information across adjacent frames and spatial information within each frame and fuses them, which addresses the unequal information content of adjacent frames in video of ships on complex water surfaces and gives the temporal and spatial dimensions distinct feature maps and distinct weights, thereby improving the precision and overall performance of video ship target detection. The invention also introduces a pixel-level module in which image pairs are obtained by combining the sequence images, easing the problem of scarce data to some extent and benefiting the accuracy of the detection results; the pixel-to-pixel deep network achieves high-precision sequence image detection, allows targets to be classified on the basis of change, and improves the sensitivity of video ship target detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without paying creative efforts.
FIG. 1 is a basic framework of a video ship detection method based on improved YOLOv3 according to an embodiment of the present invention;
FIG. 2 is a Darknet-53 network framework of the present invention;
FIG. 3 is a schematic diagram of the structure of a depth network of an image pair to a consistency matrix according to the present invention;
FIG. 4 is a schematic diagram of the space-time attention module structure according to the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-4, the present invention provides a technical solution: a video ship target detection method based on improved YOLOv3 comprises the following steps:
in step 1, collecting video data of a water surface ship and performing frame extraction processing, extracting ship image data, and making a video ship data set by referring to a VOC data set format, wherein the method comprises the following steps:
step 1-1, performing frame extraction on the video ship data with a VS program to obtain the video frame data of ships;
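The frame-extraction step above amounts to choosing a sampling schedule over the video. The helper below is a hypothetical stand-in (the patent uses a VS program and does not specify a sampling stride): it only illustrates which frame indices to keep and one possible naming convention for the extracted images.

```python
def frame_indices(total_frames: int, stride: int) -> list:
    """Indices of the frames kept when sampling every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, total_frames, stride))

def frame_filename(index: int) -> str:
    """Illustrative naming convention for an extracted frame image."""
    return "frame_%06d.jpg" % index

# e.g. keep every 10th frame of a 50-frame clip
kept = [frame_filename(i) for i in frame_indices(50, 10)]
```

The actual decoding of video into images would be done by the extraction program; this sketch covers only the bookkeeping.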
Step 1-2, during network model training, all collected images must be labeled so that the position of each image target can be obtained accurately; this yields the information about the targets in the image: the number of targets, their poses, their categories, the coordinates of the four vertices of each bounding box, and so on. The images are annotated with the LabelImg visual image annotation tool, which generates corresponding annotation files in xml format.
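The xml files produced in this step follow the VOC annotation layout. As a minimal sketch of that layout (a reduced subset of the fields LabelImg emits, using the standard library only):

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, boxes):
    """Build a minimal VOC-style XML annotation string.

    `boxes` is a list of (name, xmin, ymin, xmax, ymax) tuples in pixels.
    Only the core fields are emitted; a full LabelImg file carries more.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for name, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"),
                            (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(root, encoding="unicode")

# one annotated ship in a hypothetical extracted frame
xml_text = voc_annotation("frame_000042.jpg", 1920, 1080,
                          [("ship", 100, 200, 400, 350)])
```

The class name "ship" and the frame filename here are illustrative, not taken from the patent's data set.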
Step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame;
In step 2, information from the temporal domain of adjacent frames and the spatial domain within the same frame is mined and fused; for video target detection, the inter-frame temporal relations and intra-frame spatial relations of the video cannot be ignored. Because of shaking, occlusion, target motion, and similar factors, the information content of adjacent video frames is unequal, which inflates the weight parameters and adversely affects the subsequent target features. The spatio-temporal attention module introduces an attention mechanism so that the temporal and spatial dimensions have distinct feature maps and distinct weights.
In the embedding space, the temporal attention mechanism computes the similarity between adjacent frames (adjacent frames share parameters); intuitively, neighboring frames that are more similar to the reference frame should receive more attention. For each frame i ∈ [-N, +N], the similarity distance d can be calculated by formula (1):
d(F_{t+i}^a, F_t^a) = sigmoid(θ(F_{t+i}^a)^T · φ(F_t^a))    (1)
where θ(F_{t+i}^a) and φ(F_t^a) are two embeddings that can be realized with convolution kernels; the sigmoid activation restricts the output to [0, 1], stabilizing gradient back-propagation. The temporal attention map is spatially specific, with the same spatial size as F_{t+i}^a.
The temporal attention maps are multiplied pixel-wise with the original aligned features, and an extra fusion convolution layer then adjusts the attention-modulated features:
F'_{t+i}^a = d_{t+i} ⊙ F_{t+i}^a,  F_fusion = Conv([F'_{t-N}^a, …, F'_t^a, …, F'_{t+N}^a])
where ⊙ and [·, ·, ·] denote element-wise multiplication and concatenation, respectively.
A spatial attention mechanism is then computed from the fused features; a pyramid model enlarges the receptive range of the attention mechanism over different target scales, a Mask operation performs feature adjustment and fusion by element-wise multiplication and addition, and up-sampling finally yields the final feature map.
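The temporal-attention computation above can be sketched numerically. The NumPy function below is an illustrative stand-in in which the learned convolutional embeddings θ and φ are assumed to have already been applied; it computes the per-pixel sigmoid similarity and modulates the aligned neighbour features with it, matching formula (1) and the pixel-wise multiplication step.

```python
import numpy as np

def temporal_attention(ref_emb, nbr_emb, nbr_feat):
    """Spatially-specific temporal attention (sketch).

    ref_emb, nbr_emb: (C, H, W) embeddings of the reference and neighbour
    frames (outputs of the two convolutional embeddings in the text).
    nbr_feat: (C, H, W) aligned neighbour features to be modulated.
    Returns the neighbour features weighted pixel-wise by the sigmoid of
    the per-pixel dot-product similarity, so weights lie in (0, 1).
    """
    sim = np.sum(ref_emb * nbr_emb, axis=0)   # (H, W) dot product per pixel
    attn = 1.0 / (1.0 + np.exp(-sim))         # sigmoid -> weights in (0, 1)
    return nbr_feat * attn[None, :, :]        # broadcast over channels
```

In the patent's pipeline the modulated features from all 2N+1 frames would then be concatenated and passed through the extra fusion convolution layer.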
Step 3, performing pixel-level processing on the video ship data obtained in step 1: generating a labeling matrix indicating whether the image pair has changed;
In step 3, pixel-level change detection borrows the pixel-to-pixel deep learning network structure used for image segmentation and, facing the task directly, adapts it into a network mapping image pairs to a changed-or-not labeling matrix: the input is two registered sequence image pairs and the output is a 0/1 matrix of the corresponding image size, where 0 and 1 represent inconsistency and consistency, respectively. Training proceeds in two stages: in the first stage, a large number of varied image pairs with a certain degree of similarity and difference are used for training and learning; in the second stage, fine-tuning is performed for a particular image sequence, with the image pairs obtained by combining the sequence images, which eases the problem of scarce data to some extent. The pixel-to-pixel deep network achieves high-precision sequence image detection and allows targets to be classified on the basis of change.
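The input/output contract of that pixel-level module can be illustrated with a toy stand-in. The patent learns this mapping with a deep network; here a simple thresholded absolute difference substitutes for the learned model, but it produces exactly the kind of 0/1 consistency matrix described (1 = consistent, 0 = changed).

```python
import numpy as np

def consistency_matrix(img_a, img_b, thresh=0.1):
    """0/1 matrix the size of the image pair: 1 where consistent, 0 where changed.

    Toy stand-in for the learned pixel-to-pixel network: pixels whose
    absolute difference exceeds `thresh` are marked inconsistent (0).
    """
    diff = np.abs(np.asarray(img_a, float) - np.asarray(img_b, float))
    return (diff <= thresh).astype(np.uint8)
```

The threshold value is arbitrary; in the patent the decision boundary is learned from registered image pairs rather than fixed.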
Step 4, according to the self-made video ship data set from step 1, performing feature extraction with the Darknet-53 feature network model, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file;
In step 4, according to the self-made video ship data set from step 1, training the YOLOv3 model on the Darknet-53 feature network, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file comprises:
step 4-1, the Darknet-53 network borrows from the ResNet architecture and introduces residual structures; it comprises convolution layers, pooling layers, and a softmax layer. Following the regression idea, the YOLOv3 algorithm extracts features directly from the input image through the Darknet-53 feature extraction network to obtain multi-scale feature maps; the spatio-temporal attention module and the pixel-level module are fused via a Concat operation to obtain the fused multi-scale feature map; the input image is then divided into grids of corresponding sizes, the bounding boxes predicted from the grids are matched against the center coordinates of the target objects in the ground-truth boxes, and the targets are detected and classified on this basis.
The Concat operation in step 4-1 splices two or more feature maps along the channel or num dimension. Suppose the two inputs have channels X_1, X_2, …, X_c and Y_1, Y_2, …, Y_c, and let * denote convolution; then a single output channel of a convolution applied after Concat is:
Z = Σ_{i=1}^{c} X_i * K_i + Σ_{i=1}^{c} Y_i * K_{i+c}
where K_1, …, K_{2c} are the channels of the convolution kernel.
The Concat function is mostly used to exploit the semantic information of feature maps at different scales, fusing features by increasing the number of channels; splicing along the num dimension is mostly used for multi-task problems.
Step 4-2, the improved YOLOv3 network model is trained on the Darknet-53 feature network with a learning rate of 0.001, momentum of 0.9, weight decay of 0.0005, and 50200 training iterations; when the loss function fluctuates around a stable value, training stops and the trained weight file is obtained.
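The hyper-parameters of step 4-2 can be gathered into a single configuration object. This is only an illustrative sketch: the key names are ours, not Darknet's actual .cfg field names.

```python
# Training hyper-parameters from step 4-2, collected into one dict.
# Key names are illustrative; Darknet's .cfg file uses its own fields.
train_cfg = {
    "learning_rate": 0.001,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "max_batches": 50200,  # total training iterations
}

def should_stop(iteration: int, cfg: dict) -> bool:
    """Stop once the configured number of iterations is reached."""
    return iteration >= cfg["max_batches"]
```

In the patent, training may also stop earlier, once the loss settles around a stable value.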
Step 5, detecting and analyzing video ship targets with the improved YOLOv3 according to the weight file obtained in step 4.
In step 5, according to the weight file obtained in step 4, the command "darknet.exe detector demo data/co.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 -ext_output test.mp4 -out_filename xxx.mp4 > xxx.txt" is run in the cmd environment, and the video ship targets are detected and saved.
The parts of the invention not disclosed here are all prior art; their specific structures, materials, and working principles are not described in detail. Although embodiments of the present invention have been shown and described, those skilled in the art will appreciate that changes, modifications, substitutions, and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims (7)
1. A video ship target detection method based on improved YOLOv3 is characterized by comprising the following steps:
step 1, acquiring video data of a water surface ship, performing frame extraction processing, extracting video ship image data, and making a video ship data set according to a VOC data set format;
step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of adjacent video frames and the spatial domain within each frame;
step 3, performing pixel-level processing on the video ship data obtained in step 1: generating a labeling matrix indicating whether the image pair has changed;
step 4, according to the self-made video ship data set from step 1, performing feature extraction with the Darknet-53 feature network model, applying a Concat operation to the feature maps produced by the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and finally obtaining a trained weight file;
step 5, detecting and analyzing video ship targets with the improved YOLOv3 according to the weight file obtained in step 4.
2. The method for detecting the video ship target based on the improved YOLOv3 as claimed in claim 1, wherein in the step 1, the video data of the ship on the water surface are collected and processed by frame extraction, the ship image data are extracted, and the video ship data set is self-made according to the VOC data set format, which includes:
step 1-1, performing frame extraction on the video ship data with a VS program to obtain the video frame data of ships;
step 1-2, annotating the images with the LabelImg visual image annotation tool and generating corresponding annotation files in xml format.
3. The improved YOLOv3-based video ship target detection method of claim 1, wherein: in step 2, information from the temporal domain of adjacent video frames and the spatial domain within the same frame is mined and fused, and the spatio-temporal attention module introduces an attention mechanism so that the temporal and spatial dimensions have distinct feature maps and distinct weights;
in the embedding space, the temporal attention mechanism computes the similarity between adjacent frames (adjacent frames share parameters); intuitively, neighboring frames that are more similar to the reference frame should receive more attention. For each frame i ∈ [-N, +N], the similarity distance d can be calculated by formula (1):
d(F_{t+i}^a, F_t^a) = sigmoid(θ(F_{t+i}^a)^T · φ(F_t^a))    (1)
where θ(F_{t+i}^a) and φ(F_t^a) are two embeddings that can be realized with convolution kernels; the sigmoid activation restricts the output to [0, 1], stabilizing gradient back-propagation; the temporal attention map is spatially specific, with the same spatial size as F_{t+i}^a;
the temporal attention maps are multiplied pixel-wise with the original aligned features, and an extra fusion convolution layer then adjusts the attention-modulated features:
F'_{t+i}^a = d_{t+i} ⊙ F_{t+i}^a,  F_fusion = Conv([F'_{t-N}^a, …, F'_t^a, …, F'_{t+N}^a])
where ⊙ and [·, ·, ·] denote element-wise multiplication and concatenation, respectively;
a spatial attention mechanism is then computed from the fused features, a pyramid model enlarges the receptive range of the attention mechanism over different target scales, a Mask operation performs feature adjustment and fusion by element-wise multiplication and addition, and up-sampling finally yields the final feature map.
4. The method for detecting video ship targets based on improved YOLOv3 of claim 1, wherein for the pixel-level change detection in step 3, two registered sequence image pairs are input and the output is a 0/1 matrix of the corresponding image size, where 0 and 1 represent inconsistency and consistency, respectively; training proceeds in two stages: in the first stage, a large number of varied image pairs with a certain degree of similarity and difference are used for training and learning; in the second stage, fine-tuning is performed for a particular image sequence, with the image pairs obtained by combining the sequence images.
5. The video ship target detection method based on improved YOLOv3 according to claim 1, wherein in step 4, according to the self-made video ship data set of step 1, the YOLOv3 model is trained based on the Darknet-53 feature network model, a Concat operation is performed on the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain a fused multi-scale feature map, and a trained weight file is finally obtained; the method comprises:
step 4-1, the Darknet-53 network references the ResNet network structure and introduces residual blocks; the Darknet-53 network comprises convolutional layers, pooling layers, and a softmax layer. Adopting a regression idea, the YOLOv3 algorithm directly extracts features from the input image through the Darknet-53 feature extraction network to obtain multi-scale feature maps; the feature maps of the spatio-temporal attention module and the pixel-level module are fused through a Concat operation to obtain the fused multi-scale feature maps; the input image is then divided into grids of corresponding sizes, the bounding boxes predicted by each grid cell are matched and localized against the center coordinates of the target objects in the ground-truth boxes, and the targets are detected and classified on this basis;
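The grid-matching step can be sketched numerically: in a YOLO-style assignment, the grid cell containing a ground-truth box center is the one whose predictions are matched to that box. The function below is an illustrative sketch, not the patented matching logic; the 416×416 input and 13×13 grid are one standard YOLOv3 scale used here as an example.

```python
def responsible_cell(cx, cy, img_w, img_h, grid):
    """Return the (row, col) grid cell matched to a ground-truth box whose
    center is (cx, cy) in pixels, for a grid x grid division of the image.
    min() clamps centers lying exactly on the right/bottom border."""
    col = min(grid - 1, int(cx / img_w * grid))
    row = min(grid - 1, int(cy / img_h * grid))
    return row, col

# A 416x416 input divided into a 13x13 grid (one YOLOv3 detection scale):
print(responsible_cell(208, 208, 416, 416, 13))  # (6, 6)
```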
and step 4-2, training the improved YOLOv3 network model based on the Darknet-53 feature network model, with a learning rate of 0.001, momentum set to 0.9, weight decay set to 0.0005, and 50200 training iterations; when the loss function fluctuates around a stable value, training is stopped to obtain the trained weight file.
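The stopping rule of step 4-2 — halt once the loss "fluctuates around a certain value" — can be approximated by a simple plateau detector. This is a hypothetical helper for illustration only; the window size and tolerance are assumed values, not taken from the patent.

```python
def loss_plateaued(losses, window=5, tol=1e-3):
    """Heuristic stand-in for the stopping rule in step 4-2: report True
    once the last `window` loss values fluctuate within `tol` of their mean."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    mean = sum(recent) / window
    return all(abs(v - mean) <= tol for v in recent)

history = [1.0, 0.5, 0.30, 0.2995, 0.3002, 0.2999, 0.3001]
print(loss_plateaued(history))  # True
```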
6. The method as claimed in claim 5, wherein the Concat operation in step 4-1 concatenates two or more feature maps along the channel (or num) dimension; assuming the channels of the two inputs are X_1, X_2, …, X_c and Y_1, Y_2, …, Y_c, and * represents convolution with kernels K_1, …, K_{2c}, then a single output channel of a convolution applied after Concat is:

Z_concat = Σ_{i=1}^{c} X_i * K_i + Σ_{i=1}^{c} Y_i * K_{c+i}
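The Concat identity in claim 6 can be verified numerically for the 1×1-kernel case, where each convolution reduces to a scalar weight per channel: a single convolution over the concatenated stack equals the two channel-wise sums. This sketch uses 1×1 kernels purely to keep the check simple; the identity extends to larger kernels term by term.

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 3, 4, 4
X = rng.standard_normal((c, h, w))   # first input, channels X_1..X_c
Y = rng.standard_normal((c, h, w))   # second input, channels Y_1..Y_c
K = rng.standard_normal(2 * c)       # one 1x1 kernel weight per concatenated channel

# Single output channel of a 1x1 convolution over Concat([X, Y]):
Z = np.tensordot(K, np.concatenate([X, Y], axis=0), axes=1)

# Equivalent sum form from claim 6: sum_i X_i*K_i + sum_i Y_i*K_{c+i}
Z_sum = sum(X[i] * K[i] for i in range(c)) + sum(Y[i] * K[c + i] for i in range(c))

print(np.allclose(Z, Z_sum))  # True
```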
7. The video ship target detection method based on improved YOLOv3 according to claim 1, wherein, according to the weight file obtained in step 4, the command "darknet.exe detector demo data/coco.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 -ext_output test.mp4 -out_filename xxx.mp4 > xxx.txt" is executed in a cmd environment to detect and save the video ship targets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010667301.5A CN111814696A (en) | 2020-07-13 | 2020-07-13 | Video ship target detection method based on improved YOLOv3 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111814696A true CN111814696A (en) | 2020-10-23 |
Family
ID=72842310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010667301.5A Withdrawn CN111814696A (en) | 2020-07-13 | 2020-07-13 | Video ship target detection method based on improved YOLOv3 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111814696A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800838A (en) * | 2020-12-28 | 2021-05-14 | 浙江万里学院 | Channel ship detection and identification method based on deep learning |
CN113205151A (en) * | 2021-05-25 | 2021-08-03 | 上海海事大学 | Ship target real-time detection method and terminal based on improved SSD model |
CN113205151B (en) * | 2021-05-25 | 2024-02-27 | 上海海事大学 | Ship target real-time detection method and terminal based on improved SSD model |
CN117974734A (en) * | 2024-03-29 | 2024-05-03 | 阿里巴巴(中国)有限公司 | Image processing method, apparatus, storage medium, and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
de Silva et al. | Automated rip current detection with region based convolutional neural networks | |
CN111814696A (en) | Video ship target detection method based on improved YOLOv3 | |
CN113609896A (en) | Object-level remote sensing change detection method and system based on dual-correlation attention | |
Liu et al. | Subtler mixed attention network on fine-grained image classification | |
Hidaka et al. | Pixel-level image classification for detecting beach litter using a deep learning approach | |
CN116579616B (en) | Risk identification method based on deep learning | |
Golovko et al. | Development of solar panels detector | |
Wang et al. | A feature-supervised generative adversarial network for environmental monitoring during hazy days | |
Li et al. | Small target deep convolution recognition algorithm based on improved YOLOv4 | |
Li et al. | A review of deep learning methods for pixel-level crack detection | |
CN110827320A (en) | Target tracking method and device based on time sequence prediction | |
Cao et al. | Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks | |
Lowphansirikul et al. | 3D Semantic segmentation of large-scale point-clouds in urban areas using deep learning | |
Chen et al. | Multi-scale attention networks for pavement defect detection | |
CN116168240A (en) | Arbitrary-direction dense ship target detection method based on attention enhancement | |
Wang et al. | SAR ship detection in complex background based on multi-feature fusion and non-local channel attention mechanism | |
Luo et al. | RBD-Net: robust breakage detection algorithm for industrial leather | |
Yu et al. | Automatic segmentation of golden pomfret based on fusion of multi-head self-attention and channel-attention mechanism | |
Li et al. | Class-aware tiny object recognition over large-scale 3D point clouds | |
CN112785629A (en) | Aurora motion characterization method based on unsupervised deep optical flow network | |
Zhao et al. | Ocean ship detection and recognition algorithm based on aerial image | |
CN116824488A (en) | Target detection method based on transfer learning | |
CN114037737B (en) | Neural network-based offshore submarine fish detection and tracking statistical method | |
CN115578364A (en) | Weak target detection method and system based on mixed attention and harmonic factor | |
CN113313091B (en) | Density estimation method based on multiple attention and topological constraints under warehouse logistics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201023 |