CN111553406B - Target detection system, method and terminal based on improved YOLO-V3 - Google Patents

Target detection system, method and terminal based on improved YOLO-V3

Info

Publication number
CN111553406B
Authority
CN
China
Prior art keywords
image
feature
module
yolo
convolution layers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010333517.8A
Other languages
Chinese (zh)
Other versions
CN111553406A (en)
Inventor
田鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kaike Intelligent Technology Co ltd
Original Assignee
Shanghai Kaike Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kaike Intelligent Technology Co ltd filed Critical Shanghai Kaike Intelligent Technology Co ltd
Priority to CN202010333517.8A priority Critical patent/CN111553406B/en
Publication of CN111553406A publication Critical patent/CN111553406A/en
Application granted granted Critical
Publication of CN111553406B publication Critical patent/CN111553406B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection system based on improved YOLO-V3, comprising an image acquisition module, an image preprocessing module, a darknet-39 backbone network module, a multi-scale convolution layer feature combination module, a weighted feature fusion module and a prediction module. The darknet-39 backbone network module adopts a darknet-39 backbone network model to extract image features, obtaining feature maps of 5 convolution layers of different scales; the multi-scale convolution layer feature combination module optimally combines the feature maps of the 5 different-scale convolution layers to obtain combined feature maps; the weighted feature fusion module performs weighted feature fusion on the combined feature maps; the prediction module performs regression prediction on the fused feature maps with a YOLO-V3 algorithm to obtain the target detection result. The system has a smaller network model, accelerates target detection, enhances the network feature fusion effect and achieves better detection results.

Description

Target detection system, method and terminal based on improved YOLO-V3
Technical Field
The invention relates to the technical field of computer vision, and in particular to a target detection system, method and terminal based on improved YOLO-V3.
Background
YOLO (You Only Look Once)-V3 is currently a popular target detection algorithm that is fast and stable, but its backbone network uses the Darknet-53 structure, which requires 65.86 BFLOPs (billion floating-point operations) and has a large number of model parameters, so the algorithm slows down greatly when run on embedded devices and cannot achieve real-time detection. With a 416×416 input, the smallest feature map from which YOLO-V3 extracts features is 13×13, which is still relatively large, so the YOLO-V3 algorithm detects medium- and large-size objects poorly. YOLOv3 uses multi-scale feature maps from different layers to predict targets of different sizes and fuses high- and low-level feature information; although this improves detection accuracy to some extent, the feature maps of different layers usually contribute differently, so the feature fusion effect is poor.
Disclosure of Invention
Aiming at the defects in the prior art, the target detection system, method, terminal and medium based on improved YOLO-V3 provided by the embodiments of the invention offer a high target detection speed, improve the detection of medium- and large-size objects, improve the effect with which YOLO-V3 fuses feature maps from different layers, and raise the mAP index of target detection.
In a first aspect, an embodiment of the present invention provides a target detection system based on improved YOLO-V3, including: an image acquisition module, an image preprocessing module, a darknet-39 backbone network module, a multi-scale convolution layer feature combination module, a weighted feature fusion module and a prediction module, wherein
the image acquisition module is used for acquiring an image to be identified;
the image preprocessing module is used for preprocessing an image to be identified to obtain a preprocessed image;
the darknet-39 backbone network module obtains a darknet-39 backbone network model by improving the darknet-53 backbone network, and extracts image features with the darknet-39 backbone network model to obtain feature maps of 5 convolution layers of different scales;
the multi-scale convolution layer feature combination module is used for optimally combining the feature maps of the 5 different-scale convolution layers to obtain a combined feature map;
the weighted feature fusion module is used for carrying out weighted feature fusion on the combined feature map;
the prediction module is used for carrying out regression prediction on the fused feature maps by using a YOLO-V3 algorithm to obtain a target detection result.
In a second aspect, an embodiment of the present invention provides a target detection method based on improved YOLO-V3, including:
acquiring an image to be identified;
preprocessing an image to be identified to obtain a preprocessed image;
extracting image features with a trained darknet-39 backbone network model to obtain feature maps of 5 convolution layers of different scales;
optimally combining the feature maps of the 5 different-scale convolution layers to obtain a combined feature map;
carrying out weighted feature fusion on the combined feature map;
and carrying out regression prediction on the fused feature maps with a YOLO-V3 algorithm to obtain a target detection result.
In a third aspect, an embodiment of the present invention provides an intelligent terminal, including a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, and the memory is configured to store a computer program, where the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method steps described in the foregoing embodiments.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method steps described in the above embodiments.
The invention has the beneficial effects that:
According to the target detection system, method, terminal and medium based on improved YOLO-V3, feature extraction with the darknet-39 backbone network reduces model size and increases target detection speed; extracting feature maps from 5 convolution layers of different scales fully fuses shallow and deep feature information and improves the detection of medium- and large-size objects; and combining and weighting the feature maps of different convolution layers according to their different contributions enhances the network feature fusion effect and achieves better detection results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale.
FIG. 1 is a block diagram of a target detection system based on improved YOLO-V3 according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a target detection method based on improved YOLO-V3 according to a second embodiment of the present invention;
FIG. 3 is a block diagram of an intelligent terminal according to a third embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings; it is evident that the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention pertains.
Referring to FIG. 1, a block diagram of a target detection system based on improved YOLO-V3 according to a first embodiment of the present invention is shown. The system comprises an image acquisition module 101, an image preprocessing module 102, a darknet-39 backbone network module 103, a multi-scale convolution layer feature combination module 104, a weighted feature fusion module 105 and a prediction module 106. The image acquisition module 101 acquires the image to be identified; the image preprocessing module 102 preprocesses the image to be identified to obtain a preprocessed image; the darknet-39 backbone network module 103 obtains a darknet-39 backbone network model by improving the darknet-53 backbone network and extracts image features with it, obtaining feature maps of 5 convolution layers of different scales; the multi-scale convolution layer feature combination module 104 optimally combines the feature maps of the 5 different-scale convolution layers to obtain combined feature maps, where the optimal combination differs by layer: the first and last layers are combined in pairs, and the middle layers are combined in threes; the weighted feature fusion module 105 performs weighted feature fusion on the combined feature maps; and the prediction module 106 performs regression prediction on the fused feature maps with the YOLO-V3 algorithm to obtain the target detection result.
The image preprocessing module 102 comprises an image rotation unit and a scaling unit. The image rotation unit randomly flips, rotates and crops the image to be identified; the scaling unit performs a scale transformation on the image to be identified.
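A minimal sketch of this preprocessing, assuming OpenCV and NumPy are available; the crop ratio and the 90-degree rotation choice are illustrative rather than taken from the patent, and in training the bounding box annotations would have to undergo the same transforms (omitted here). The multi-scale size set is the one used later in this embodiment.

import random
import cv2
import numpy as np

# Candidate input sizes for multi-scale training (see the embodiment below).
MULTI_SCALE_SIZES = [256, 320, 384, 448, 512, 576, 640, 704, 768]

def preprocess(image: np.ndarray) -> np.ndarray:
    """Randomly flip, rotate and crop an image, then rescale it to a fixed size."""
    if random.random() < 0.5:
        image = cv2.flip(image, 1)  # horizontal flip
    if random.random() < 0.5:
        image = cv2.flip(image, 0)  # vertical flip
    # Rotation by a random multiple of 90 degrees (illustrative choice).
    image = np.ascontiguousarray(np.rot90(image, random.randint(0, 3)))
    # Random crop keeping at least 80% of each side (illustrative ratio).
    h, w = image.shape[:2]
    ch, cw = int(h * random.uniform(0.8, 1.0)), int(w * random.uniform(0.8, 1.0))
    y0, x0 = random.randint(0, h - ch), random.randint(0, w - cw)
    image = image[y0:y0 + ch, x0:x0 + cw]
    # Scale transform: resize to a size drawn from the multi-scale set.
    size = random.choice(MULTI_SCALE_SIZES)
    return cv2.resize(image, (size, size))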
The darknet-39 backbone network module 103 performs channel clipping on the darknet-53 network, which reduces the number of model parameters and improves operating efficiency while still fully extracting image features; the improved YOLO-V3 algorithm reduces the amount of computation by 80% compared with the original and runs 4 times faster. The structure of the darknet-39 backbone network in the darknet-39 backbone network module is shown in Table 1.
[Table 1: structure of the darknet-39 backbone network (reproduced only as an image in the original publication)]
The darknet-39 backbone network module comprises a darknet-39 backbone network training unit, which adds 2 convolution layers to the backbone network of the traditional YOLO-V3 algorithm and detects targets with feature maps from 5 convolution layers of different scales; it acquires a data set, divides the data set into a training set, a test set and a validation set, re-clusters the bounding box coordinates on the training set with a k-means clustering algorithm, and calculates 15 bounding box coordinates for the feature maps of the 5 different-scale convolution layers.
The darknet-39 backbone network module reasonably prunes the darknet-53 backbone network, optimizes the network structure and removes some redundant convolution operations to obtain the darknet-39 backbone network. Specifically, the number of channels of the Level 5 layer is halved, and the Level 5 layer is also used as a feature output layer with a stride of 4 at this point, which helps to improve the detection rate of small target objects. The Level 4, Level 3 and Level 2 layers halve both the number of channels and the number of operations, with strides of 8, 16 and 32 respectively. Finally, a 3×3 convolution layer with a stride of 64 is added, which enhances the feature extraction effect while adding almost no parameters. The darknet-39 network obtained this way cannot directly load the weight parameters of the original darknet-53 and needs to be retrained. In this embodiment, classification training is performed on the ImageNet LSVRC 2012 data set for 90 epochs, with an initial learning rate of 1e-03, the learning rate divided by ten at steps 170000 and 350000, a batch_size of 128 and a weight decay coefficient of 5e-04.
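Table 1 appears only as an image in the original publication, so the exact darknet-39 layout cannot be reproduced here. As an illustration of the channel-halving idea described above, a minimal sketch of a darknet-style building block in PyTorch follows; the class names and the stage widths in the closing comment are assumptions, not taken from Table 1.

import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Darknet-style unit: convolution + batch norm + leaky ReLU."""
    def __init__(self, in_ch: int, out_ch: int, k: int, stride: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

class Residual(nn.Module):
    """Darknet residual block: 1x1 bottleneck, 3x3 conv, skip connection."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv1 = ConvBNLeaky(ch, ch // 2, 1)
        self.conv2 = ConvBNLeaky(ch // 2, ch, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv2(self.conv1(x))

# Darknet-53 uses stage widths (64, 128, 256, 512, 1024); halving the channels
# as described would give stages of (32, 64, 128, 256, 512) built from the
# blocks above, plus one extra 3x3 convolution layer at the end.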
Taking the coco data set as an example, the coco 2017 detection data set has 118287 training images, 5000 validation images and 40670 test images covering 80 categories in total. Furthermore, because the pictures in the training set have different sizes, this process is normalized. In the field of target detection, the similarity between two bounding boxes is usually measured by the IOU (intersection over union), where DetectionResult denotes the predicted rectangular box area and GroundTruth denotes the ground-truth rectangular box area:
IOU = area(DetectionResult ∩ GroundTruth) / area(DetectionResult ∪ GroundTruth)
Then, for target detection, the distance metric can be calculated as follows:
d(box,centroid)=1-IOU(box,centroid)
Here centroid denotes the center of a bounding box cluster; the greater the IOU value between two bounding boxes, the smaller the distance between them. Before the image to be identified is input into the darknet-39 backbone network module, the image preprocessing module preprocesses it and transforms it to a fixed size. This embodiment adopts a multi-scale training method, randomly selecting one size from the set {256,320,384,448,512,576,640,704,768} as the current input size. Taking an input image size of 448×448 as an example, the 15 bounding box coordinates calculated for the feature maps of the 5-scale convolution layers are as follows:
(4,6),(7,16),(14,9),(22,17),(13,30),(28,37),(46,23),(25,70),(49,58),(86,39),(56,124),(99,83),(114,205),(199,124),(294,275)。
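The re-clustering step above can be sketched as standard anchor clustering: k-means over box widths and heights with d(box, centroid) = 1 - IOU(box, centroid) as the distance. A minimal NumPy version follows, assuming boxes are given as (w, h) pairs already rescaled to the input resolution; function names are illustrative.

import numpy as np

def iou_wh(boxes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """IOU between (w, h) boxes and (w, h) centroids, both treated as if
    centered at the origin, as is usual for anchor clustering."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes: np.ndarray, k: int = 15, iters: int = 100) -> np.ndarray:
    """k-means with d(box, centroid) = 1 - IOU(box, centroid)."""
    rng = np.random.default_rng(0)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area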
A 448×448 image to be identified is input, an image pyramid of the image is created, and the pyramid levels are input into the corresponding parts of the network, so that target detection is performed on feature maps of different depths. The feature map of a later layer is up-sampled and reused so that the current feature map also obtains information from the later layer, organically fusing low-level and high-level semantic information and thereby improving detection accuracy. The feature sizes of the pyramid network layers are 7×7, 14×14, 28×28, 56×56 and 112×112, numbered layers 1 to 5 from the smallest feature map to the largest. The first four layers perform an up-sampling operation on the feature pyramid with a step of 2 and are fused with the deep residual network of the next layer to form a deeply fused fast detection model, which strengthens the expressive capacity of the feature pyramid; compared with the traditional YOLOv3 network it covers a wider range of scales, so the detection of small targets and of larger objects improves markedly. To avoid increasing the amount of computation, this embodiment replaces the 3×3 convolution layers in the pyramid network with depthwise separable convolutions, which reduces computation significantly. With the feature pyramid network there are in total 5 feature maps from convolution layers of different scales, and selecting the optimal combination according to the experimental results can greatly reduce model parameters. For example, to fuse the 7×7 and 14×14 feature maps, the 14×14 feature map is first down-sampled, giving two feature maps of size 7×7, denoted L1 and L2. To fuse the features better, this embodiment adopts weighted feature fusion: with fused feature F1, weighting coefficient w1 for L1 and weighting coefficient w2 for L2, then:
F1 = w1·L1 + w2·L2
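A minimal sketch of this weighted fusion in PyTorch. The text does not state how w1 and w2 are obtained, so treating them as learnable scalars normalized with a softmax is an assumption, as is the use of nearest-neighbour interpolation for the down-sampling step.

import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_fuse(l1: torch.Tensor, l2: torch.Tensor,
                  w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """F1 = w1*L1 + w2*L2 on two feature maps brought to a common size."""
    if l2.shape[-2:] != l1.shape[-2:]:
        # Down-sample the larger map (e.g. 14x14) to the smaller (e.g. 7x7).
        l2 = F.interpolate(l2, size=l1.shape[-2:], mode="nearest")
    return w1 * l1 + w2 * l2

# Illustrative usage with learnable, softmax-normalized weights.
w = nn.Parameter(torch.ones(2))
w1, w2 = torch.softmax(w, dim=0)
l1 = torch.randn(1, 256, 7, 7)      # L1: already 7x7
l2 = torch.randn(1, 256, 14, 14)    # L2: 14x14, down-sampled inside the call
f1 = weighted_fuse(l1, l2, w1, w2)  # fused feature F1, shape (1, 256, 7, 7)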
the prediction module adopts YOLO-V3 to perform regression prediction on the weighted and fused feature graphs, the YOLO-V3 divides the feature graphs into n×n grids (feature graphs with different scales, N is different in size, 5 scales are shared in this embodiment, N is 7,14, 28,56 and 112 respectively, 3 different bounding boxes are predicted by each grid, the target detection result can be expressed as n×n× [3× (c+con+b) ], C represents the number of categories, con represents the confidence level, and B represents the coordinates of the bounding box.
To make the detection network converge quickly, the pruned darknet-39 network structure is pre-trained on the ImageNet data set, and the resulting weight file is loaded directly into the detection network as initialization weights. The hyper-parameters for pre-training the darknet-39 network are: 120 training epochs, an initial learning rate of 1e-04 with a cosine_decay schedule down to a final learning rate of 1e-06, momentum of 0.9, a batch_size of 32, and a weight decay coefficient of 5e-04 with l2 regularization.
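A sketch of such a cosine_decay schedule; the total number of decay steps is an assumption, since the text fixes only the initial and final learning rates. Framework schedulers such as tf.keras.optimizers.schedules.CosineDecay implement the same idea.

import math

def cosine_decay_lr(step: int, total_steps: int,
                    base_lr: float = 1e-4, final_lr: float = 1e-6) -> float:
    """Decay the learning rate from base_lr to final_lr along a half cosine."""
    progress = min(step, total_steps) / total_steps
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cos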
According to the target detection system based on improved YOLO-V3, feature extraction with the darknet-39 backbone network reduces model size and increases target detection speed; extracting feature maps from 5 convolution layers of different scales fully fuses shallow and deep feature information and improves the detection of medium- and large-size objects; and combining and weighting the feature maps of different convolution layers according to their different contributions enhances the network feature fusion effect and achieves better detection results.
The first embodiment above provides a target detection system based on improved YOLO-V3; in correspondence with it, the present application also provides a target detection method based on improved YOLO-V3. Please refer to fig. 2, a flowchart of the target detection method based on improved YOLO-V3 according to a second embodiment of the present invention. Since the method embodiments are substantially similar to the system embodiments, they are described relatively simply; for relevant points, refer to the description of the system embodiments. The method embodiments described below are merely illustrative.
Fig. 2 shows a flowchart of the target detection method based on improved YOLO-V3 according to the second embodiment of the present invention; the method includes:
s201, acquiring an image to be identified.
In the present embodiment, the input image to be recognized has a size of 448×448.
S202, preprocessing an image to be identified to obtain a preprocessed image.
Specifically, the specific method for preprocessing the image to be identified comprises the following steps:
randomly flipping the image to be identified horizontally/vertically and cropping it;
and performing scale transformation on the image to be identified.
And S203, extracting image features with a trained darknet-39 backbone network model to obtain feature maps of 5 convolution layers of different scales.
Specifically, the method further includes a step of training the darknet-39 backbone network model; the specific training method comprises the following steps:
adding 2 convolution layers to the backbone network of the traditional YOLO-V3 algorithm, and detecting targets with feature maps from 5 convolution layers of different scales.
Specifically, the darknet-53 network is reasonably pruned, the network structure is optimized, and some redundant convolution operations are removed to obtain the darknet-39 network. The number of channels of the Level 5 layer is halved, the Level 5 layer is also used as a feature output layer, and the stride is 4 at this point, which helps to improve the detection rate of small target objects. The Level 4, Level 3 and Level 2 layers halve both the number of channels and the number of operations, with strides of 8, 16 and 32 respectively. Finally, a 3×3 convolution layer with a stride of 64 is added, which enhances the feature extraction effect while adding almost no parameters. The darknet-39 network obtained this way cannot directly load the weight parameters of the original darknet-53 and needs to be retrained. In this embodiment, classification training is performed on the ImageNet LSVRC 2012 data set for 120 epochs, with an initial learning rate of 1e-03, the learning rate divided by ten at steps 170000 and 350000, a batch_size of 128 and a weight decay coefficient of 5e-04.
Acquiring a data set, and dividing the data set into a training set, a testing set and a verification set;
and re-clustering the coordinates of the boundary frames on the training set by adopting a k-means clustering algorithm, and calculating 15 boundary frame coordinates of the characteristic diagrams of the convolution layers with 5 different scales.
And S204, optimally combining the feature maps of the 5 different-scale convolution layers to obtain combined feature maps.
And S205, carrying out weighted feature fusion on the combined feature maps.
S206, carrying out regression prediction on the fused feature maps with a YOLO-V3 algorithm to obtain a target detection result.
According to the target detection method based on improved YOLO-V3, feature extraction with the darknet-39 backbone network reduces model size and increases target detection speed; extracting feature maps from 5 convolution layers of different scales fully fuses shallow and deep feature information and improves the detection of medium- and large-size objects; and combining and weighting the feature maps of different convolution layers according to their different contributions enhances the network feature fusion effect and achieves better detection results.
As shown in fig. 3, a schematic structural diagram of an intelligent terminal according to a third embodiment of the present invention is shown, where the terminal includes a processor 301, an input device 302, an output device 303, and a memory 304, where the processor 301, the input device 302, the output device 303, and the memory 304 are connected to each other, and the memory 304 is used to store a computer program, where the computer program includes program instructions, and the processor 301 is configured to invoke the program instructions to execute the method described in the second embodiment.
It should be appreciated that in embodiments of the present invention, the processor 301 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device 302 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., and the output device 303 may include a display (LCD, etc.), a speaker, etc.
The memory 304 may include read only memory and random access memory and provides instructions and data to the processor 301. A portion of memory 304 may also include non-volatile random access memory. For example, the memory 304 may also store information of device type.
In a specific implementation, the processor 301, the input device 302, and the output device 303 described in the embodiments of the present invention may perform an implementation described in the method embodiments provided in the embodiments of the present invention, or may perform an implementation described in the system embodiments of the present invention, which are not described herein again.
In a further embodiment of the invention, a computer-readable storage medium is provided, which stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method described in the above embodiment.
The computer readable storage medium may be an internal storage unit of the terminal according to the foregoing embodiments, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a SmartMedia Card (SMC), a Secure Digital (SD) card or a flash memory card (Flash Card) provided on the terminal. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the terminal. The computer readable storage medium is used to store the computer program and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the terminal and the unit described above may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In several embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention, and are intended to be included within the scope of the appended claims and description.

Claims (6)

1. An improved YOLO-V3 based target detection system, comprising: an image acquisition module, an image preprocessing module, a darknet-39 backbone network module, a multi-scale convolution layer feature combination module, a weighted feature fusion module and a prediction module, wherein
the image acquisition module is used for acquiring an image to be identified;
the image preprocessing module is used for preprocessing an image to be identified to obtain a preprocessed image;
the darknet-39 backbone network module obtains a darknet-39 backbone network model by improving the darknet-53 backbone network, and extracts image features with the darknet-39 backbone network model to obtain feature maps of 5 convolution layers of different scales;
the multi-scale convolution layer feature combination module is used for optimally combining the feature maps of the 5 different-scale convolution layers to obtain a combined feature map;
the weighted feature fusion module is used for carrying out weighted feature fusion on the combined feature map;
the prediction module is used for carrying out regression prediction on the fused feature maps by using a YOLO-V3 algorithm to obtain a target detection result;
the darknet-39 backbone network module comprises a darknet-39 backbone network training unit, which adds 2 convolution layers to the backbone network of the traditional YOLO-V3 algorithm and detects targets with feature maps from 5 convolution layers of different scales;
acquires a data set and divides the data set into a training set, a test set and a validation set,
and re-clusters the bounding box coordinates on the training set with a k-means clustering algorithm, and calculates 15 bounding box coordinates for the feature maps of the 5 different-scale convolution layers.
2. The improved YOLO-V3 based target detection system of claim 1, wherein the image preprocessing module comprises an image rotation unit and a scaling unit, the image rotation unit randomly flipping the image to be identified horizontally/vertically and cropping it, and the scaling unit performing a scale transformation on the image to be identified.
3. An improved YOLO-V3-based target detection method, comprising:
acquiring an image to be identified;
preprocessing an image to be identified to obtain a preprocessed image;
extracting image features with a trained darknet-39 backbone network model to obtain feature maps of 5 convolution layers of different scales;
optimally combining the feature maps of the 5 different-scale convolution layers to obtain a combined feature map;
carrying out weighted feature fusion on the combined feature map;
carrying out regression prediction on the fused feature maps with a YOLO-V3 algorithm to obtain a target detection result;
the method further comprising training the darknet-39 backbone network model, the specific training method comprising:
adding 2 convolution layers to the backbone network of the traditional YOLO-V3 algorithm, and detecting targets with feature maps from 5 convolution layers of different scales;
acquiring a data set and dividing the data set into a training set, a test set and a validation set,
and re-clustering the bounding box coordinates on the training set with a k-means clustering algorithm, and calculating 15 bounding box coordinates for the feature maps of the 5 different-scale convolution layers.
4. The improved YOLO-V3 based object detection method of claim 3, wherein said specific method of preprocessing the image to be identified comprises:
randomly flipping the image to be identified horizontally/vertically and cropping it;
and performing scale transformation on the image to be identified.
5. A smart terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, the memory being for storing a computer program, the computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method of any of claims 3-4.
6. A computer readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 3-4.
CN202010333517.8A 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3 Active CN111553406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010333517.8A CN111553406B (en) 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010333517.8A CN111553406B (en) 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3

Publications (2)

Publication Number Publication Date
CN111553406A CN111553406A (en) 2020-08-18
CN111553406B (en) 2023-04-28

Family

ID=72007656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010333517.8A Active CN111553406B (en) 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3

Country Status (1)

Country Link
CN (1) CN111553406B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183255A (en) * 2020-09-15 2021-01-05 西北工业大学 Underwater target visual identification and attitude estimation method based on deep learning
CN112132032A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Traffic sign detection method and device, electronic equipment and storage medium
CN112200201A (en) * 2020-10-13 2021-01-08 上海商汤智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112380921A (en) * 2020-10-23 2021-02-19 西安科锐盛创新科技有限公司 Road detection method based on Internet of vehicles
CN112307976B (en) * 2020-10-30 2024-05-10 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN112633066A (en) * 2020-11-20 2021-04-09 苏州浪潮智能科技有限公司 Aerial small target detection method, device, equipment and storage medium
CN112507896B (en) * 2020-12-14 2023-11-07 大连大学 Method for detecting cherry fruits by adopting improved YOLO-V4 model
CN112801169B (en) * 2021-01-25 2024-02-06 中国人民解放军陆军工程大学 Camouflage target detection method, system, device and storage medium based on improved YOLO algorithm
CN112949692A (en) * 2021-02-03 2021-06-11 歌尔股份有限公司 Target detection method and device
CN112966565A (en) * 2021-02-05 2021-06-15 深圳市优必选科技股份有限公司 Object detection method and device, terminal equipment and storage medium
CN112668560B (en) * 2021-03-16 2021-07-30 中国矿业大学(北京) Pedestrian detection method and system for pedestrian flow dense area
CN113838021A (en) * 2021-09-18 2021-12-24 长春理工大学 Pulmonary nodule detection system based on improved YOLOv5 network
CN114170421B (en) * 2022-02-10 2022-06-17 卡奥斯工业智能研究院(青岛)有限公司 Image detection method, device, equipment and storage medium
CN117960839B (en) * 2024-03-29 2024-06-04 山西建投临汾建筑产业有限公司 Steel structural member welding deformation correcting device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 A kind of vehicle target detection method, system and equipment based on YOLOv2
WO2019232830A1 (en) * 2018-06-06 2019-12-12 平安科技(深圳)有限公司 Method and device for detecting foreign object debris at airport, computer apparatus, and storage medium
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN110991311A (en) * 2019-11-28 2020-04-10 江南大学 Target detection method based on dense connection deep network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881234B2 (en) * 2015-11-25 2018-01-30 Baidu Usa Llc. Systems and methods for end-to-end object detection

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019232830A1 (en) * 2018-06-06 2019-12-12 平安科技(深圳)有限公司 Method and device for detecting foreign object debris at airport, computer apparatus, and storage medium
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 A kind of vehicle target detection method, system and equipment based on YOLOv2
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN110991311A (en) * 2019-11-28 2020-04-10 江南大学 Target detection method based on dense connection deep network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dai Weicong; Jin Longxu; Li Guoning; Zheng Zhiqiang. Improved YOLOv3 real-time detection algorithm for aircraft in remote sensing images. Opto-Electronic Engineering, 2018, (12), full text. *
Zhu Peng; Chen Hu; Li Ke; Cheng Binyang. A lightweight multi-scale feature face detection method. Computer Technology and Development, (04), full text. *

Also Published As

Publication number Publication date
CN111553406A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
CN111553406B (en) Target detection system, method and terminal based on improved YOLO-V3
CN110647817B (en) Real-time face detection method based on MobileNet V3
CN109671020B (en) Image processing method, device, electronic equipment and computer storage medium
CN112733749B (en) Real-time pedestrian detection method integrating attention mechanism
Zhang et al. A dense u-net with cross-layer intersection for detection and localization of image forgery
CN111814794B (en) Text detection method and device, electronic equipment and storage medium
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
CN111723786A (en) Method and device for detecting wearing of safety helmet based on single model prediction
CN109714526B (en) Intelligent camera and control system
CN111274999B (en) Data processing method, image processing device and electronic equipment
CN111062854A (en) Method, device, terminal and storage medium for detecting watermark
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN112749576B (en) Image recognition method and device, computing equipment and computer storage medium
CN111967478B (en) Feature map reconstruction method, system, storage medium and terminal based on weight overturn
CN116309612B (en) Semiconductor silicon wafer detection method, device and medium based on frequency decoupling supervision
CN116343159B (en) Unstructured scene passable region detection method, device and storage medium
CN111985487A (en) Remote sensing image target extraction method, electronic equipment and storage medium
TWI803243B (en) Method for expanding images, computer device and storage medium
CN114092813B (en) Industrial park image extraction method and system, electronic equipment and storage medium
CN115953454A (en) Water level obtaining method, device and equipment based on image restoration and storage medium
CN114529828A (en) Method, device and equipment for extracting residential area elements of remote sensing image
CN116543246A (en) Training method of image denoising model, image denoising method, device and equipment
CN114155524A (en) Single-stage 3D point cloud target detection method and device, computer equipment and medium
CN112541535B (en) Three-dimensional point cloud classification method based on complementary multi-branch deep learning
CN116311086B (en) Plant monitoring method, training method, device and equipment for plant monitoring model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant