CN111814696A - Video ship target detection method based on improved YOLOv3 - Google Patents

Video ship target detection method based on improved YOLOv3

Info

Publication number
CN111814696A
CN111814696A
Authority
CN
China
Prior art keywords
video
ship
frame
data
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010667301.5A
Other languages
Chinese (zh)
Inventor
齐亮
吕欣妍
万振刚
齐霄磊
朱立标
陈连凯
贾璇
黄晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Xinchuanpin Intelligent Technology Co ltd
Original Assignee
Suzhou Xinchuanpin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Xinchuanpin Intelligent Technology Co ltd filed Critical Suzhou Xinchuanpin Intelligent Technology Co ltd
Priority to CN202010667301.5A priority Critical patent/CN111814696A/en
Publication of CN111814696A publication Critical patent/CN111814696A/en
Withdrawn legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/08 - Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video ship target detection method based on improved YOLOv3, which comprises the following steps: step 1, collecting video data of surface ships, performing frame extraction to obtain video ship image data, and building a video ship data set in the VOC data set format; step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of the preceding and following video frames and the spatial domain within the same frame; step 3, performing pixel-level processing on the video ship data obtained in step 1. The invention integrates a spatio-temporal attention module that mines and fuses information from the temporal domain across adjacent video frames and the spatial domain within each frame; this solves the problem of unequal information content across adjacent video frames of ships on a complex water surface and gives the temporal and spatial dimensions different feature maps and different weights, thereby improving the precision of video ship target detection and optimizing its performance.

Description

Video ship target detection method based on improved YOLOv3
Technical Field
The invention relates to the fields of target detection and computer vision, and in particular to a video ship target detection method based on improved YOLOv3.
Background
With the falling cost of camera equipment and the growing maturity of video processing technology, video-based water-surface target detection is gradually being applied to many maritime management activities such as water environment protection, water area monitoring, maritime rights enforcement and island protection, effectively overcoming the shortcomings of traditional detection means. Current intelligent video surveillance systems are mainly deployed in land environments, and achieving video target detection against a complex water-surface background remains a very challenging task.
In the existing method of detecting ships on inland waterways with active interrogation technology, an electronic tag must first be installed on the hull, in which various information about the ship is stored; when the ship passes through a detection area, its relevant information, such as ship class, passing time and course, can be obtained simply by reading the tag, and the information is finally processed on a computer. In other work, an Inception module has been introduced into YOLOv3 to design a ship vision system that can accurately identify and track multiple sea and air targets in real time. These methods are unstable in the face of complicated inland-river and maritime conditions, hull shaking caused by ship motion, ship overlap caused by multiple ships, frequent and unusual weather changes, and large differences in the imaged size of ships, all of which greatly affect the ship detection and recognition rate. How to effectively suppress interference and achieve accurate target detection is therefore the main problem to be solved.
Deep-learning-based target detection algorithms generally fall into two categories: first, the R-CNN family of algorithms based on region proposals; second, the YOLO and SSD families, which require no region proposals. The R-CNN and YOLO target detection frameworks provide two basic network structures for target detection research, improve the detection efficiency of neural networks, and provide efficient learning tools for realizing multi-scale, multi-class target detection.
In summary, deep-learning-based target detection remains a challenging subject. Facing a complex water-surface environment, detection still suffers from the unequal information content of adjacent video frames of ships on the water, and the precision of video ship target detection is low; at the same time, data for existing detection methods are scarce, which harms the accuracy of detection results.
Disclosure of Invention
The invention aims to design a video ship target detection method based on improved YOLOv3. According to the characteristics of a complex water-surface environment and a self-made video ship data set, information from the temporal domain of the preceding and following video frames and the spatial domain within the same frame is mined, and a spatio-temporal attention module is fused into the YOLOv3 model for information fusion, so that the precision of video ship target detection is improved and its performance is optimized. Strong associations exist among the image sequences: each image reflects a specific target area, and the overall change within that area is slow and regular. A pixel-level module is therefore added to the YOLOv3 model, which raises the sensitivity of video ship target monitoring, improves the accuracy of video ship target detection, accelerates detection, and better copes with the influence of the complex water-surface environment.
The technical scheme of the invention is realized as follows: a video ship target detection method based on improved YOLOv3 comprises the following steps: step 1, collecting video data of surface ships, performing frame extraction to obtain video ship image data, and building a video ship data set in the VOC data set format;
step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of the preceding and following video frames and the spatial domain within the same frame;
step 3, performing pixel-level processing on the video ship data obtained in step 1: producing a labeling matrix that indicates whether the image pair has changed;
step 4, according to the video ship data set built in step 1, performing feature extraction based on the Darknet-53 feature network model, applying a Concat operation to the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain fused multi-scale feature maps, and finally obtaining a trained weight file;
and step 5, detecting and analyzing video ship targets with the improved YOLOv3 using the weight file obtained in step 4.
As a preferred technical solution of the present invention, in step 1 the video data of surface ships are collected and frame extraction is performed, the ship image data are extracted, and the video ship data set is built by following the VOC data set format, comprising:
step 1-1, performing frame extraction on the video ship data with a VS (Visual Studio) program to obtain the ship video frame data;
and step 1-2, labeling the images with the LabelImg visual image annotation tool and generating the corresponding annotation files in xml format.
As a preferred technical scheme of the invention, in step 2 the information of the temporal domain of the preceding and following video frames and the spatial domain within the same frame is mined and fused, and the spatio-temporal attention module introduces an attention mechanism so that the two dimensions, temporal and spatial, have different feature maps and different weights;
in the embedding space, the temporal attention mechanism computes the similarity between adjacent frames, i.e. adjacent frames share parameters; intuitively, adjacent frames that are more similar to the reference frame should receive more attention; for each frame i \in [-N, +N], the similarity distance d can be calculated by the following formula (1):

d(F_{t+i}^{a}, F_{t}^{a}) = \mathrm{sigmoid}\left( \theta(F_{t+i}^{a})^{T} \phi(F_{t}^{a}) \right)    (1)

where \theta(F_{t+i}^{a}) and \phi(F_{t}^{a}) are two embeddings that can be realized with convolution kernels, and the output of the sigmoid activation function lies in [0, 1], stabilizing gradient back-propagation; the temporal attention mechanism is spatially specific: the spatial size of d(F_{t+i}^{a}, F_{t}^{a}) is the same as that of F_{t+i}^{a}; the temporal attention map is multiplied pixel-wise with the original aligned features F_{t+i}^{a}, and an additional fusion convolution layer is adopted to aggregate the attention-modulated features \tilde{F}_{t+i}^{a}:

\tilde{F}_{t+i}^{a} = F_{t+i}^{a} \odot d(F_{t+i}^{a}, F_{t}^{a})    (2)

F_{\mathrm{fusion}} = \mathrm{Conv}\left( \left[ \tilde{F}_{t-N}^{a}, \ldots, \tilde{F}_{t}^{a}, \ldots, \tilde{F}_{t+N}^{a} \right] \right)    (3)

where \odot and [\cdot, \cdot, \cdot] denote element-wise multiplication and concatenation, respectively;
and the spatial attention mechanism is calculated from the fused features, a pyramid model is adopted to enlarge the receptive range of the attention mechanism over different target scales, feature adjustment and fusion are performed through a Mask operation with element-wise multiplication and addition, and the final feature map is obtained through up-sampling.
As a preferred technical solution of the present invention, in step 3, for pixel-level change detection, two registered sequence image pairs are input, and the output is a 0-1 matrix of the corresponding image size, where 0 and 1 denote inconsistency and consistency respectively; training is performed in two stages: in the first stage, a large number of varied image pairs, which bear certain similarities and differences, are used for training and learning; in the second stage, fine-tuning is performed for a particular image sequence, the image pairs being obtained from combinations of the sequence images.
As a preferred technical solution of the present invention, in step 4, according to the video ship data set built in step 1, the YOLOv3 model is trained based on the Darknet-53 feature network model, a Concat operation is applied to the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain fused multi-scale feature maps, and a trained weight file is finally obtained, comprising:
step 4-1, the Darknet-53 network borrows from the ResNet network structure and introduces residual structures; the Darknet-53 network comprises convolution layers, pooling layers and a softmax layer; adopting a regression idea, the YOLOv3 algorithm directly extracts features from the input image through the feature extraction network Darknet-53 to obtain multi-scale feature maps, the spatio-temporal attention module and the pixel-level module are fused through the Concat operation to obtain the fused multi-scale feature maps, the input image is then divided into grids of corresponding sizes, the bounding boxes predicted from the grids are matched and located directly against the centre coordinates of the target objects in the ground-truth bounding boxes, and the target objects are detected and classified on this basis;
and step 4-2, training the improved YOLOv3 network model based on the Darknet-53 feature network model, with a learning rate of 0.001, momentum set to 0.9, weight decay set to 0.0005 and 50200 training iterations; when the loss function fluctuates around a certain value, training is stopped and the trained weight file is obtained.
As a preferred technical solution of the present invention, in the Concat operation of step 4-1, Concat splices two or more feature maps along the channel or num dimension; assume the channels of the two inputs are X_{1}, X_{2}, \ldots, X_{c} and Y_{1}, Y_{2}, \ldots, Y_{c}, and let * denote convolution; then a single output channel of Concat is:

Z_{\mathrm{Concat}} = \sum_{i=1}^{c} X_{i} * K_{i} + \sum_{i=1}^{c} Y_{i} * K_{i+c}

where K_{i} denotes the i-th channel of the convolution kernel.
As a preferred technical solution of the present invention, in step 5, according to the weight file obtained in step 4, the command "darknet.exe detector demo data/co.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 -ext_output test.mp4 -out_filename xxx.mp4 > xxx.txt" is run in the cmd environment, and the video ship targets are detected and saved.
Compared with the prior art, the invention integrates a spatio-temporal attention module, mines information from the temporal domain of the preceding and following video frames and the spatial domain within the same frame, and fuses this information; it solves the problem of unequal information content across adjacent video frames of ships on a complex water surface, and gives the temporal and spatial dimensions different feature maps and different weights, thereby improving the precision of video ship target detection and optimizing its performance. The invention also introduces a pixel-level module in which the image pairs are obtained from combinations of sequence images, alleviating the problem of scarce data to a certain extent and benefiting the accuracy of detection results; the pixel-to-pixel deep network achieves high-precision sequence image detection and can classify targets on the basis of change, improving the sensitivity of video ship target detection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a basic framework of a video ship detection method based on improved YOLOv3 according to an embodiment of the present invention;
FIG. 2 is a Darknet-53 network framework of the present invention;
FIG. 3 is a schematic diagram of the structure of a depth network of an image pair to a consistency matrix according to the present invention;
FIG. 4 is a schematic diagram of the structure of the spatio-temporal attention module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to FIGS. 1-4, the present invention provides a technical solution: a video ship target detection method based on improved YOLOv3, comprising the following steps:
Step 1, collecting video data of surface ships, performing frame extraction to obtain video ship image data, and building a video ship data set in the VOC data set format;
in step 1, collecting video data of a water surface ship and performing frame extraction processing, extracting ship image data, and making a video ship data set by referring to a VOC data set format, wherein the method comprises the following steps:
step 1-1, performing frame extraction processing on video ship data by using a VS (visual sense) program, and further extracting video frame data of a ship;
step 1-2, in the training process of the network model, in order to accurately obtain the position of the image target, all collected images need to be labeled, so that the information of the target in the image can be obtained: number of targets, pose of targets, category of targets, coordinates of four vertices of the target bounding box, etc. And (3) marking the image by using image marking software visualized by LabelImg and generating a corresponding mark file in an xml format.
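For illustration only, the two preprocessing steps above could look as follows in Python; the OpenCV-based frame extraction is an assumed stand-in for the VS program mentioned in step 1-1, and all file names and the frame interval are hypothetical:

```python
# Illustrative sketch: extract every `interval`-th frame from a ship video with OpenCV.
# The video path, output directory and interval are assumptions, not from the patent.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, interval: int = 25) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:  # keep one frame out of every `interval`
            cv2.imwrite(os.path.join(out_dir, f"ship_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

extract_frames("ship_video.mp4", "frames")
```

LabelImg writes PASCAL-VOC-style xml files; a minimal sketch of reading one such annotation back (the tag names follow the VOC convention, the file name is a hypothetical example):

```python
# Illustrative sketch: parse a PASCAL-VOC xml annotation produced by LabelImg.
import xml.etree.ElementTree as ET

def parse_voc_xml(xml_path: str):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")  # target category, e.g. "ship"
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

print(parse_voc_xml("ship_000001.xml"))
```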
Step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of the preceding and following video frames and the spatial domain within the same frame;
In step 2, information from the temporal domain of the preceding and following frames and the spatial domain within the same frame is mined and fused; for a video target detection task, the inter-frame temporal relations and the intra-frame spatial relations of the video cannot be ignored. Owing to factors such as shaking, occlusion and target motion, adjacent video frames carry unequal amounts of information, which inflates the weight parameters and adversely affects the subsequent target features. The spatio-temporal attention module introduces an attention mechanism so that the two dimensions, temporal and spatial, have different feature maps and different weights.
In the embedding space, the temporal attention mechanism computes the similarity between adjacent frames, i.e. adjacent frames share parameters; intuitively, adjacent frames that are more similar to the reference frame should receive more attention. For each frame i \in [-N, +N], the similarity distance d can be calculated by the following formula (1):

d(F_{t+i}^{a}, F_{t}^{a}) = \mathrm{sigmoid}\left( \theta(F_{t+i}^{a})^{T} \phi(F_{t}^{a}) \right)    (1)

where \theta(F_{t+i}^{a}) and \phi(F_{t}^{a}) are two embeddings that can be realized with convolution kernels, and the output of the sigmoid activation function lies in [0, 1], stabilizing gradient back-propagation. The temporal attention mechanism is spatially specific: the spatial size of d(F_{t+i}^{a}, F_{t}^{a}) is the same as that of F_{t+i}^{a}. The temporal attention map is multiplied pixel-wise with the original aligned features F_{t+i}^{a}, and an additional fusion convolution layer is adopted to aggregate the attention-modulated features \tilde{F}_{t+i}^{a}:

\tilde{F}_{t+i}^{a} = F_{t+i}^{a} \odot d(F_{t+i}^{a}, F_{t}^{a})    (2)

F_{\mathrm{fusion}} = \mathrm{Conv}\left( \left[ \tilde{F}_{t-N}^{a}, \ldots, \tilde{F}_{t}^{a}, \ldots, \tilde{F}_{t+N}^{a} \right] \right)    (3)

where \odot and [\cdot, \cdot, \cdot] denote element-wise multiplication and concatenation, respectively.

The spatial attention mechanism is then calculated from the fused features: a pyramid model is adopted to enlarge the receptive range of the attention mechanism over different target scales, feature adjustment and fusion are performed through a Mask operation with element-wise multiplication and addition, and the final feature map is obtained through up-sampling.
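A minimal PyTorch sketch of this temporal-spatial attention fusion is given below; the module name and layer sizes are assumptions (the patent publishes no source code), and the pyramid spatial attention is reduced to a single sigmoid mask for brevity:

```python
# Illustrative sketch of temporal attention fusion over 2N+1 aligned frame features,
# following formulas (1)-(3) above; all layer widths are assumptions.
import torch
import torch.nn as nn

class TemporalSpatialAttentionFusion(nn.Module):
    def __init__(self, channels: int = 64, num_frames: int = 5):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 3, padding=1)   # embedding theta
        self.phi = nn.Conv2d(channels, channels, 3, padding=1)     # embedding phi
        self.fuse = nn.Conv2d(num_frames * channels, channels, 1)  # fusion conv, formula (3)
        self.spatial_mask = nn.Sequential(                         # simplified spatial attention
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) aligned features with the reference frame at index T // 2
        b, t, c, h, w = feats.shape
        ref = self.phi(feats[:, t // 2])
        modulated = []
        for i in range(t):
            emb = self.theta(feats[:, i])
            # formula (1): per-pixel similarity to the reference frame, squashed into [0, 1]
            d = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))
            # formula (2): pixel-wise modulation of the aligned features
            modulated.append(feats[:, i] * d)
        fused = self.fuse(torch.cat(modulated, dim=1))             # formula (3)
        return fused * self.spatial_mask(fused)                    # simplified Mask operation

x = torch.randn(2, 5, 64, 32, 32)
print(TemporalSpatialAttentionFusion()(x).shape)  # torch.Size([2, 64, 32, 32])
```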
Step 3, performing pixel-level processing on the video ship data obtained in step 1: producing a labeling matrix that indicates whether the image pair has changed;
in step 3, for Pixel level change detection, an image segmentation network structure for Pixel-to-Pixel deep learning is used for reference, a task is directly faced, a mode of improving Pixel-to-Pixel is provided to be improved into a network of image pairs to a labeling matrix which is changed or not, namely two registered sequence image pairs are input and output as 0 and 1 matrixes with corresponding image sizes, 0 and 1 respectively represent inconsistency and consistency, training is carried out in two stages, a first stage adopts various large number of image pairs for training and learning, and certain similarity and difference exist between the image pairs; in the second stage, fine adjustment is carried out on a specific image sequence, and an image pair is obtained by combining sequence images, so that the problem of rare data is solved to a certain extent. The pixel-to-pixel depth network can realize high-precision sequence image detection and can classify targets on the basis of change.
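A sketch of such a pixel-level module under stated assumptions (the patent does not specify the architecture; here the registered image pair is concatenated channel-wise and mapped by a small fully convolutional network to a per-pixel consistency matrix):

```python
# Illustrative sketch: registered image pair -> 0/1 consistency matrix of the image size.
# The layer widths and depth are assumptions.
import torch
import torch.nn as nn

class PixelChangeNet(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),  # per-pixel consistency logit
        )

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        # img_a, img_b: registered image pair, each of shape (B, C, H, W)
        logits = self.net(torch.cat([img_a, img_b], dim=1))
        return (torch.sigmoid(logits) > 0.5).float()  # 1 = consistent, 0 = inconsistent

a, b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
print(PixelChangeNet()(a, b).shape)  # torch.Size([1, 1, 64, 64])
```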
Step 4, according to the video ship data set built in step 1, performing feature extraction based on the Darknet-53 feature network model, applying a Concat operation to the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain fused multi-scale feature maps, and finally obtaining a trained weight file;
In step 4, according to the video ship data set built in step 1, the YOLOv3 model is trained based on the Darknet-53 feature network model, a Concat operation is applied to the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain fused multi-scale feature maps, and a trained weight file is finally obtained, comprising:
step 4-1, the Darknet-53 network borrows from the ResNet network structure and introduces residual structures; the Darknet-53 network comprises convolution layers, pooling layers and a softmax layer. Adopting a regression idea, the YOLOv3 algorithm directly extracts features from the input image through the feature extraction network Darknet-53 to obtain multi-scale feature maps; the spatio-temporal attention module and the pixel-level module are fused through the Concat operation to obtain the fused multi-scale feature maps; the input image is then divided into grids of corresponding sizes, the bounding boxes predicted from the grids are matched and located directly against the centre coordinates of the target objects in the ground-truth bounding boxes, and the target objects are detected and classified on this basis.
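For reference, the grid-based box prediction this step relies on follows the published YOLOv3 decoding, sketched here with made-up numbers (this is the standard formula, not code from the patent):

```python
# Illustrative sketch of YOLOv3 box decoding: bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
# bw = pw * exp(tw), bh = ph * exp(th), where tx..th are network outputs, (cx, cy) is the
# grid-cell offset and (pw, ph) the anchor size; `stride` scales cells back to pixels.
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sig(tx) + cx) * stride  # box centre x in pixels
    by = (sig(ty) + cy) * stride  # box centre y in pixels
    bw = pw * math.exp(tw)        # box width in pixels
    bh = ph * math.exp(th)        # box height in pixels
    return bx, by, bw, bh

print(decode_box(0.2, -0.1, 0.3, 0.1, cx=6, cy=4, pw=116, ph=90, stride=32))
```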
In the Concat operation of step 4-1, Concat splices two or more feature maps along the channel or num dimension. Assume the channels of the two inputs are X_{1}, X_{2}, \ldots, X_{c} and Y_{1}, Y_{2}, \ldots, Y_{c}, and let * denote convolution; then a single output channel of Concat is:

Z_{\mathrm{Concat}} = \sum_{i=1}^{c} X_{i} * K_{i} + \sum_{i=1}^{c} Y_{i} * K_{i+c}

where K_{i} denotes the i-th channel of the convolution kernel.
the Concat function is mostly used for utilizing semantic information of feature maps with different scales, performing feature fusion on the semantic information in a channel increasing mode, and is mostly used for multitask problems for num dimension splicing.
And step 4-2, training the improved YOLOv3 network model based on the Darknet-53 feature network model, with a learning rate of 0.001, momentum set to 0.9, weight decay set to 0.0005 and 50200 training iterations; when the loss function fluctuates around a certain value, training is stopped and the trained weight file is obtained.
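These hyper-parameters map directly onto a standard SGD configuration; as a hedged illustration (PyTorch is used here, whereas the patent trains in the Darknet framework, and `model` is a hypothetical improved-YOLOv3 network):

```python
# Illustrative sketch: the step 4-2 hyper-parameters as an SGD configuration.
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    return torch.optim.SGD(
        model.parameters(),
        lr=0.001,             # learning rate
        momentum=0.9,         # momentum
        weight_decay=0.0005,  # weight decay
    )

MAX_ITERATIONS = 50200  # training stops once the loss fluctuates around a fixed value
```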
And step 5, detecting and analyzing video ship targets with the improved YOLOv3 using the weight file obtained in step 4.
In step 5, according to the weight file obtained in step 4, the command "darknet.exe detector demo data/co.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 -ext_output test.mp4 -out_filename xxx.mp4 > xxx.txt" is run in the cmd environment, and the video ship targets are detected and saved.
The parts of the invention not described in detail belong to the prior art, and their specific structures, materials and working principles are not described further. Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A video ship target detection method based on improved YOLOv3 is characterized by comprising the following steps:
step 1, collecting video data of surface ships, performing frame extraction to obtain video ship image data, and building a video ship data set in the VOC data set format;
step 2, based on the video ship data obtained in step 1, mining and fusing information from the temporal domain of the preceding and following video frames and the spatial domain within the same frame;
step 3, performing pixel-level processing on the video ship data obtained in step 1: producing a labeling matrix that indicates whether the image pair has changed;
step 4, according to the video ship data set built in step 1, performing feature extraction based on the Darknet-53 feature network model, applying a Concat operation to the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain fused multi-scale feature maps, and finally obtaining a trained weight file;
and step 5, detecting and analyzing video ship targets with the improved YOLOv3 using the weight file obtained in step 4.
2. The video ship target detection method based on improved YOLOv3 as claimed in claim 1, wherein in step 1 the video data of surface ships are collected and frame extraction is performed, the ship image data are extracted, and the video ship data set is built by following the VOC data set format, comprising:
step 1-1, performing frame extraction on the video ship data with a VS (Visual Studio) program to obtain the ship video frame data;
and step 1-2, labeling the images with the LabelImg visual image annotation tool and generating the corresponding annotation files in xml format.
3. The video ship target detection method based on improved YOLOv3 as claimed in claim 1, characterized in that: in step 2 the information of the temporal domain of the preceding and following video frames and the spatial domain within the same frame is mined and fused, and the spatio-temporal attention module introduces an attention mechanism so that the two dimensions, temporal and spatial, have different feature maps and different weights;
in the embedding space, the temporal attention mechanism computes the similarity between adjacent frames, i.e. adjacent frames share parameters; intuitively, adjacent frames that are more similar to the reference frame should receive more attention; for each frame i \in [-N, +N], the similarity distance d can be calculated by the following formula (1):

d(F_{t+i}^{a}, F_{t}^{a}) = \mathrm{sigmoid}\left( \theta(F_{t+i}^{a})^{T} \phi(F_{t}^{a}) \right)    (1)

where \theta(F_{t+i}^{a}) and \phi(F_{t}^{a}) are two embeddings that can be realized with convolution kernels, and the output of the sigmoid activation function lies in [0, 1], stabilizing gradient back-propagation; the temporal attention mechanism is spatially specific: the spatial size of d(F_{t+i}^{a}, F_{t}^{a}) is the same as that of F_{t+i}^{a}; the temporal attention map is multiplied pixel-wise with the original aligned features F_{t+i}^{a}, and an additional fusion convolution layer is adopted to aggregate the attention-modulated features \tilde{F}_{t+i}^{a}:

\tilde{F}_{t+i}^{a} = F_{t+i}^{a} \odot d(F_{t+i}^{a}, F_{t}^{a})    (2)

F_{\mathrm{fusion}} = \mathrm{Conv}\left( \left[ \tilde{F}_{t-N}^{a}, \ldots, \tilde{F}_{t}^{a}, \ldots, \tilde{F}_{t+N}^{a} \right] \right)    (3)

where \odot and [\cdot, \cdot, \cdot] denote element-wise multiplication and concatenation, respectively;
and the spatial attention mechanism is calculated from the fused features, a pyramid model is adopted to enlarge the receptive range of the attention mechanism over different target scales, feature adjustment and fusion are performed through a Mask operation with element-wise multiplication and addition, and the final feature map is obtained through up-sampling.
4. The video ship target detection method based on improved YOLOv3 as claimed in claim 1, wherein for the pixel-level change detection in step 3, two registered sequence image pairs are input and the output is a 0-1 matrix of the corresponding image size, where 0 and 1 denote inconsistency and consistency respectively; training is performed in two stages, the first stage using a large number of varied image pairs, which bear certain similarities and differences, for training and learning, and the second stage fine-tuning on a particular image sequence, with the image pairs obtained from combinations of the sequence images.
5. The video ship target detection method based on improved YOLOv3 as claimed in claim 1, wherein in step 4, according to the video ship data set built in step 1, the YOLOv3 model is trained based on the Darknet-53 feature network model, a Concat operation is applied to the feature maps obtained from the spatio-temporal module and the pixel-level module to obtain fused multi-scale feature maps, and a trained weight file is finally obtained, comprising:
step 4-1, the Darknet-53 network borrows from the ResNet network structure and introduces residual structures; the Darknet-53 network comprises convolution layers, pooling layers and a softmax layer; adopting a regression idea, the YOLOv3 algorithm directly extracts features from the input image through the feature extraction network Darknet-53 to obtain multi-scale feature maps, the spatio-temporal attention module and the pixel-level module are fused through the Concat operation to obtain the fused multi-scale feature maps, the input image is then divided into grids of corresponding sizes, the bounding boxes predicted from the grids are matched and located directly against the centre coordinates of the target objects in the ground-truth boxes, and the target objects are detected and classified on this basis;
and step 4-2, training the improved YOLOv3 network model based on the Darknet-53 feature network model, with a learning rate of 0.001, momentum set to 0.9, weight decay set to 0.0005 and 50200 training iterations; when the loss function fluctuates around a certain value, training is stopped and the trained weight file is obtained.
6. The video ship target detection method based on improved YOLOv3 as claimed in claim 5, wherein in the Concat operation of step 4-1, Concat splices two or more feature maps along the channel or num dimension; assume the channels of the two inputs are X_{1}, X_{2}, \ldots, X_{c} and Y_{1}, Y_{2}, \ldots, Y_{c}, and let * denote convolution; then a single output channel of Concat is:

Z_{\mathrm{Concat}} = \sum_{i=1}^{c} X_{i} * K_{i} + \sum_{i=1}^{c} Y_{i} * K_{i+c}

where K_{i} denotes the i-th channel of the convolution kernel.
7. The video ship target detection method based on improved YOLOv3 as claimed in claim 1, wherein according to the weight file obtained in step 4, the command "darknet.exe detector demo data/co.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 -ext_output test.mp4 -out_filename xxx.mp4 > xxx.txt" is run in the cmd environment to detect and save the video ship targets.
CN202010667301.5A 2020-07-13 2020-07-13 Video ship target detection method based on improved YOLOv3 Withdrawn CN111814696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010667301.5A CN111814696A (en) 2020-07-13 2020-07-13 Video ship target detection method based on improved YOLOv3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010667301.5A CN111814696A (en) 2020-07-13 2020-07-13 Video ship target detection method based on improved YOLOv3

Publications (1)

Publication Number Publication Date
CN111814696A (en) 2020-10-23

Family

ID=72842310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010667301.5A Withdrawn CN111814696A (en) 2020-07-13 2020-07-13 Video ship target detection method based on improved YOLOv3

Country Status (1)

Country Link
CN (1) CN111814696A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800838A (en) * 2020-12-28 2021-05-14 浙江万里学院 Channel ship detection and identification method based on deep learning
CN113205151A (en) * 2021-05-25 2021-08-03 上海海事大学 Ship target real-time detection method and terminal based on improved SSD model
CN113205151B (en) * 2021-05-25 2024-02-27 上海海事大学 Ship target real-time detection method and terminal based on improved SSD model
CN117974734A (en) * 2024-03-29 2024-05-03 阿里巴巴(中国)有限公司 Image processing method, apparatus, storage medium, and program product

Similar Documents

Publication Publication Date Title
de Silva et al. Automated rip current detection with region based convolutional neural networks
CN111814696A (en) Video ship target detection method based on improved YOLOv3
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
Liu et al. Subtler mixed attention network on fine-grained image classification
Hidaka et al. Pixel-level image classification for detecting beach litter using a deep learning approach
CN116579616B (en) Risk identification method based on deep learning
Golovko et al. Development of solar panels detector
Wang et al. A feature-supervised generative adversarial network for environmental monitoring during hazy days
Li et al. Small target deep convolution recognition algorithm based on improved YOLOv4
Li et al. A review of deep learning methods for pixel-level crack detection
CN110827320A (en) Target tracking method and device based on time sequence prediction
Cao et al. Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks
Lowphansirikul et al. 3D Semantic segmentation of large-scale point-clouds in urban areas using deep learning
Chen et al. Multi-scale attention networks for pavement defect detection
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
Wang et al. SAR ship detection in complex background based on multi-feature fusion and non-local channel attention mechanism
Luo et al. RBD-Net: robust breakage detection algorithm for industrial leather
Yu et al. Automatic segmentation of golden pomfret based on fusion of multi-head self-attention and channel-attention mechanism
Li et al. Class-aware tiny object recognition over large-scale 3D point clouds
CN112785629A (en) Aurora motion characterization method based on unsupervised deep optical flow network
Zhao et al. Ocean ship detection and recognition algorithm based on aerial image
CN116824488A (en) Target detection method based on transfer learning
CN114037737B (en) Neural network-based offshore submarine fish detection and tracking statistical method
CN115578364A (en) Weak target detection method and system based on mixed attention and harmonic factor
CN113313091B (en) Density estimation method based on multiple attention and topological constraints under warehouse logistics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20201023