CN116385773A - Small target detection method, storage medium and electronic equipment - Google Patents
- Publication number
- CN116385773A (application number CN202310224481.3A)
- Authority
- CN
- China
- Prior art keywords
- training
- module
- unit
- image
- detecting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides a small target detection method, a storage medium, and electronic equipment. The small target detection method comprises: step one, establishing a training data set; step two, labeling and classifying the images in the training data set, the labeled images being divided into a training set, a validation set, and a test set; step three, inputting the training set and the validation set into a deep learning neural network for training, wherein the deep learning neural network adopts the YOLO v8 framework in which the BottleNeck module of the C2f unit in the backbone component is replaced by the BottleNeck module of the HorNet unit, and/or the BottleNeck module of the C2f unit in the neck component is replaced by a COT module; and step four, detecting the image to be detected with the trained deep learning neural network. The small target detection method, storage medium, and electronic equipment improve accuracy for small target detection.
Description
Technical Field
The present invention relates to the field of computer vision detection technologies, and in particular, to a method for detecting a small target, a storage medium, and an electronic device.
Background
The main strategies for identifying small targets on the market are binocular-camera recognition schemes and monocular-camera target detection schemes. An important reason the latter performs far worse than the former is that detection speed and quality cannot be balanced effectively: under a given speed requirement, accuracy fails to reach the required standard.
Detectors such as YOLO, RTMDet, and SSD have been used in many applications. The high speed and high accuracy of the YOLO algorithm have made it especially useful in industry. An improved YOLO single-stage target detection algorithm can identify targets accurately and rapidly, reduce hardware and development costs, and offers good robustness and portability. Some YOLO-based applications have already been deployed in practice; for example, the method of application number CN202111162490.1 uses a YOLOv3 scheme for fire prediction on small targets. However, because YOLOv3 uses a Darknet backbone, its traditional convolutions make it difficult to link contexts and discover contextual relationships, and thus difficult to extract features of small objects.
Disclosure of Invention
To address these shortcomings of the prior art, the present invention provides a small target detection method, a storage medium, and electronic equipment that detect small targets well.
In order to achieve the above object, the present invention provides a method for detecting a small target, comprising:
step one, a training data set is established;
step two, labeling and classifying the images in the training data set, the labeled images being divided into a training set, a validation set, and a test set;
step three, inputting the training set and the validation set into a deep learning neural network for training, wherein the deep learning neural network adopts the YOLO v8 framework in which the BottleNeck module of the C2f unit in the backbone component is replaced by the BottleNeck module of the HorNet unit, and/or the BottleNeck module of the C2f unit in the neck component is replaced by a COT module;
and step four, detecting the image to be detected by adopting the trained deep learning neural network.
In an embodiment, in step one, the images in the training data set include near images and far images, each with resolution greater than 1280×1280.
In an embodiment, in step two, labeling is performed with the LabelImg software and saved in the YOLO or COCO data format, and the training, validation, and test sets are split in an 8:1:1 ratio.
In an embodiment, an intermediate step is further included between step two and step three, in which the training data set is augmented by any one or more of the following: flipping images top to bottom, color-gamut enhancement, image copying, image scaling, and mosaic augmentation.
In an embodiment, before the training in step three, a training strategy is set, including the number of training epochs, batch size, number of worker threads, learning rate, SGD training parameters, and loss coefficients; the loss coefficients include a predicted-box loss coefficient, a classification loss coefficient, and a DFL regression loss coefficient.
In one embodiment, step three includes: obtaining four feature heads through three up-sampling and three down-sampling operations; decoupling the features with a decoupled head into two paths, where one path passes through a CBS convolution followed by convolution-classification to produce the classification output tensor, and the other passes through a CBS convolution followed by convolution-regression to produce the target tensor with 4 × reg_max output channels; and computing and back-propagating the classification and regression losses, where the classification loss uses a BCE loss function and the regression loss uses DFL + CIoU.
In an embodiment, step four further includes: cutting the read-in test picture into 4 tiles, each with half the height and width of the original; sending the tiles into the network sequentially for four detection passes; stitching the results and images back into the original picture; merging the recognized results; and deriving the final coordinate information detx[i] and dety[i] as follows:
in an embodiment, the fourth step further includes that the image to be detected is not subjected to any data enhancement other than scaling to the specified recognition resolution.
In one embodiment, the SPPF structure in the backbone component is replaced by an SPPFCSPC structure.
The present invention also provides a storage medium storing a computer program which, when read and executed by a processor, performs the small target detection method described above.
The present invention also provides an electronic device, comprising a processor, a memory, and a communication bus, the processor communicating with the memory via the communication bus to perform the small target detection method described above.
According to the small target detection method, storage medium, and electronic equipment of the invention, the deep learning neural network is improved on the basis of the YOLO v8 framework: the BottleNeck module of the C2f unit in the backbone component is replaced by the BottleNeck module of the HorNet unit, and/or the BottleNeck module of the C2f unit in the neck component is replaced by the COT module, improving accuracy for small target detection.
Drawings
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any way. In addition, the shapes, proportional sizes, and the like of the respective components in the drawings are merely illustrative for aiding in understanding the present invention, and are not particularly limited. Those skilled in the art with access to the teachings of the present invention can select a variety of possible shapes and scale sizes to practice the present invention as the case may be. In the drawings:
fig. 1 is a schematic flow chart of a method for detecting a small target according to a first embodiment of the present invention;
fig. 2 is a schematic structural diagram of a deep learning neural network in a method for detecting a small target according to a first embodiment of the present invention;
fig. 3 is a schematic structural diagram of the C2fHB unit in the backbone component of the deep learning neural network in the small target detection method according to the first embodiment of the present invention;
fig. 4 is a schematic structural diagram of the SPPFCSPC unit in the backbone component of the deep learning neural network in the small target detection method according to the first embodiment of the present invention;
fig. 5 is a schematic structural diagram of the COT2f unit in the neck component of the deep learning neural network in the small target detection method according to the first embodiment of the present invention;
fig. 6 is a schematic diagram of a cut-map detection included in step four in a method for detecting a small target according to a first embodiment of the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, shall fall within the scope of the invention.
Referring to fig. 1, a first embodiment of the present invention provides a method for detecting a small target, including:
step one, a training data set is established;
step two, labeling and classifying the images in the training data set, the labeled images being divided into a training set, a validation set, and a test set;
step three, inputting the training set and the validation set into a deep learning neural network for training, wherein the deep learning neural network adopts the YOLO v8 framework in which the BottleNeck module of the C2f unit in the backbone component is replaced by the BottleNeck module of the HorNet unit, and/or the BottleNeck module of the C2f unit in the neck component is replaced by a COT module;
and step four, detecting the image to be detected by adopting the trained deep learning neural network.
In an embodiment, in step one, the images in the training data set include near images and far images, each with resolution greater than 1280×1280. This helps ensure good results in subsequent training.
In an embodiment, in step two, the targets in the training-data-set images are labeled with the LabelImg software and saved in the YOLO or COCO data format, and the training, validation, and test sets are split in an 8:1:1 ratio. For example, in one embodiment the training set has 1600 images, the validation set 200, and the test set 200.
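As a concrete illustration of the 8:1:1 split, the following minimal sketch may be used; the function name and fixed seed are assumptions for illustration, not part of the patent:

```python
import random

def split_dataset(image_paths, seed=0):
    """Shuffle labeled images and split them 8:1:1 into
    train / validation / test sets, as described above."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle
    n = len(paths)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# 2000 images -> 1600 / 200 / 200, matching the embodiment above
train, val, test = split_dataset([f"img_{i}.jpg" for i in range(2000)])
```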
In an embodiment, an intermediate step between step two and step three augments the training data set by any one or more of the following: flipping images top to bottom, color-gamut enhancement, image copying, image scaling, and mosaic augmentation. Flip augmentation is unsuitable for targets with orientation-specific appearance, such as digits and letters; color-gamut enhancement is unsuitable when several different target classes have similar colors; and image-copy augmentation is unnecessary when the data set is already large enough.
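The augmentations above (except mosaic, which stitches four images together) can be sketched roughly as follows; the helper name and the gain range for the color-gamut tweak are illustrative assumptions:

```python
import numpy as np

def augment(img, flip=True, gamut=True, scale=None, rng=None):
    """Illustrative pre-training augmentations: top-bottom flip,
    a simple color-gamut (per-channel gain) tweak, and
    nearest-neighbour scaling. Mosaic augmentation is omitted."""
    rng = rng if rng is not None else np.random.default_rng(0)
    out = img.astype(np.float64)
    if flip:
        out = out[::-1, :, :]                    # flip top to bottom
    if gamut:
        gains = rng.uniform(0.8, 1.2, size=3)    # assumed gain range
        out = np.clip(out * gains, 0, 255)
    if scale is not None:
        h, w = out.shape[:2]
        nh, nw = int(h * scale), int(w * scale)
        ys = np.arange(nh) * h // nh             # nearest-neighbour rows
        xs = np.arange(nw) * w // nw             # nearest-neighbour cols
        out = out[ys][:, xs]
    return out.astype(np.uint8)
```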
In an embodiment, before the training in step three, a training strategy is set: the number of training epochs, batch size, number of worker threads, learning rate, SGD training parameters, and loss coefficients, the latter including a predicted-box loss coefficient, a classification loss coefficient, and a DFL regression loss coefficient. In a specific embodiment, training runs for 300 epochs with batch size 8, 8 worker threads, and learning rate 0.01; mosaic augmentation is turned off for the last 15 epochs, during which the learning rate is reduced to 0.01 times its original value; the SGD momentum parameter is set to 0.937; and the loss coefficients are set to 7.5 for the predicted box, 0.5 for classification, and 1.5 for the DFL regression loss. The input image resolution is set to 1280×1280.
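Expressed as a configuration sketch, the embodiment's settings might look as follows; the key names follow the Ultralytics YOLOv8 hyperparameter convention and are an assumption, while the values are those stated above:

```python
# Hypothetical training-strategy settings mirroring the embodiment above.
train_cfg = {
    "epochs": 300,        # training rounds
    "batch": 8,           # batch size
    "workers": 8,         # worker threads
    "lr0": 0.01,          # initial learning rate
    "lrf": 0.01,          # final LR factor: final LR = lr0 * lrf
    "close_mosaic": 15,   # disable mosaic for the last 15 epochs
    "momentum": 0.937,    # SGD momentum parameter
    "box": 7.5,           # predicted-box loss coefficient
    "cls": 0.5,           # classification loss coefficient
    "dfl": 1.5,           # DFL regression loss coefficient
    "imgsz": 1280,        # input resolution (1280x1280)
}
final_lr = train_cfg["lr0"] * train_cfg["lrf"]  # learning rate in last epochs
```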
Referring to fig. 2, the deep learning neural network used in step three is modified from the YOLO v8 (YOLO algorithm, version 8) framework, which comprises a backbone component 10, a neck component 20, and a head component 30; the present network improves the backbone component 10 and the neck component 20. In the backbone component 10, four CBS unit + C2fHB unit combinations are connected in sequence, where the C2fHB unit replaces the C2f unit of the backbone in the YOLO v8 framework: specifically, the BottleNeck module of the original C2f unit is replaced by the BottleNeck module of the HorNet unit. The recursive gated convolution (g_nConv) in the HorNet unit realizes high-order spatial interaction modeling with a purely convolutional structure through a recursive design of gated convolutions. The module is also compatible with various convolution forms and, at very small additional computational cost, extends the interaction modeling of the self-attention mechanism in the Transformer structure from two dimensions to arbitrary order. Referring to fig. 3, the C2fHB unit includes a Split module, a Cat module, and n HorBlock modules cascaded between them, with a CBS module connected to the input of the Split module and to the output of the Cat module, respectively. The HorBlock module comprises an LN module, a g_nConv module, and an out module connected in sequence, and the input of the LN module is also connected to the out module as a residual connection. A further CBS unit precedes the first CBS unit + C2fHB unit combination, and an SPPFCSPC unit follows the fourth.
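The recursive gating idea behind g_nConv can be sketched abstractly. In this toy NumPy version the depthwise spatial convolution and channel projections of the real HorNet layer are replaced by identities, so it only shows how order-n interactions arise from repeated element-wise gating; it is not the actual HorNet module:

```python
import numpy as np

def gnconv_toy(x, order=3):
    """Toy recursive gated aggregation: split channels into
    (order + 1) groups, then gate the running feature with each
    successive group, giving an order-n multiplicative interaction."""
    c = x.shape[0]
    assert c % (order + 1) == 0, "channels must divide evenly"
    chunks = np.split(x, order + 1, axis=0)
    p = chunks[0]
    for q in chunks[1:]:
        p = p * q          # one more order of spatial interaction
    return p

feat = np.full((4, 2, 2), 2.0)      # 4 channels, 2x2 spatial positions
out = gnconv_toy(feat, order=3)     # each element: 2 * 2 * 2 * 2 = 16
```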
The SPPFCSPC unit replaces the SPPF unit of the backbone component 10 in the YOLO v8 framework, improving speed and accuracy. Referring to fig. 4, the SPPFCSPC unit includes three Conv (convolution) modules connected in sequence; the output of the third Conv module is connected both to a Cat module and to three MP (max-pooling) modules connected in sequence, whose outputs are also connected to that Cat module. The output of the Cat module passes through two Conv modules in sequence, then another Cat module and one Conv module. The input of the first Conv module is also connected to a separate Conv module, which feeds the second Cat module. The fourth CBS unit + C2fHB unit combination outputs the first intermediate data via the SPPFCSPC unit, and the first three combinations each output one intermediate data; from front to back these are the fourth, third, and second intermediate data. All four intermediate data are sent to the neck component 20.
Referring again to fig. 2, the neck component 20 includes four cascaded Cat unit + COT2f unit + SimAM unit combinations. The first intermediate data is connected to a first Up (UpSample) unit and to the Cat unit of the fourth combination. The first Up unit is connected to another Cat unit, to which the second intermediate data is also connected; that Cat unit feeds another COT2f unit, which is connected to the second Up unit and to the Cat unit of the third combination. The second Up unit is connected to another Cat unit, to which the third intermediate data is also connected; that Cat unit feeds another COT2f unit, which is connected to the third Up unit and to the Cat unit of the second combination. The third Up unit is connected to the Cat unit of the first combination, to which the fourth intermediate data is also connected. The COT2f unit replaces the C2f unit of the neck component 20 in the YOLO v8 framework; specifically, the BottleNeck module of the C2f unit is replaced by a COT module. Referring to fig. 5, the COT2f unit includes a Split module, a Cat module, and n COTBlock modules cascaded between them, with a CBS module connected to the input of the Split module and to the output of the Cat module, respectively.
The COTBlock module comprises a Conv module, an x module, a Cat module, a θ 1×1 module, a δ 1×1 module, a Mul module, a Fusion module, a y module, and an Add module connected in sequence; the input of the Conv module is also connected to the Add module, a V 1×1 module is further connected between the x module and the Mul module, a k×k module is further connected between the x module and the Fusion module, and the output of the k×k module is also connected to the Cat module.
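The data flow of the COTBlock — a static context from a k×k neighbourhood of the keys, attention weights from the (query, static-context) pair through the θ/δ projections, and a gated dynamic context fused with the static one — can be sketched in one dimension. All learned projections here are replaced by identities or fixed operations, so this is a schematic of the wiring only, not the actual COT module:

```python
import numpy as np

def cot_toy(x, k=3):
    """1-D schematic of a Contextual Transformer step:
    static context  = local k-mean over the keys,
    attention       = sigmoid of query * static context
                      (stand-in for the theta/delta 1x1 projections),
    dynamic context = value gated by the attention (Mul),
    output          = fusion (sum) of static and dynamic context."""
    pad = k // 2
    xp = np.pad(x, (pad, pad), mode="edge")
    static = np.array([xp[i:i + k].mean() for i in range(len(x))])
    att = 1.0 / (1.0 + np.exp(-(x * static)))    # attention weights
    dynamic = x * att                            # Mul: gate the values
    return static + dynamic                      # Fusion + Add
```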
Referring to fig. 2 again, the Head assembly 30 includes four Head modules respectively connected to a combination of Cat unit+cot2f unit+simam unit.
In step three, the deep learning neural network uses the HorNet + C2f structure as the backbone network to extract features. In the HorNet unit, g_nConv replaces the self-attention operator of the Transformer structure at lower computational cost, and its FLOPs satisfy:
the neg component 20 (also called PAN-FPN) is employed to extract image features. Tensors are extracted after the structure of hornet+C2f is finished, the step sizes are 4, 8, 16 and 32, and the P2, P3, P4 and P5 characteristic layers with the dimensions of [160,160,128], [80,80,256], [40,40,512], [20,20,1024] are finally obtained. The use of the C2f structure fused in the PAN-FPN structure by the COT (Contextual Transformer Networks) structure greatly enhances the context understanding capability. Four feature heads are obtained through three up-sampling operations and three down-sampling operations. After each downsampling, a SimAM attention mechanism is added, and the formula is as follows:
the attention mechanism can effectively improve the accuracy of the neural network, and neurons with airspace inhibition effect should be given higher importance by evaluating the importance of each neuron.
Then a decoupled head splits the features into two paths: one path passes through a CBS convolution followed by convolution-classification to produce the classification output tensor, and the other passes through a CBS convolution followed by convolution-regression to produce the target tensor with 4 × reg_max output channels. The classification and regression losses are then computed and back-propagated, with the classification loss using a BCE loss function and the regression loss using DFL + CIoU.
After all the configured training epochs complete, training ends and a weight-model .pt file is obtained; the test data set can then be inferred and images to be detected can be detected. In one embodiment, the corresponding model-configuration lines are:
[-1,1,nn.Upsample,[None,2,'nearest']],
[[-1,2],1,Concat,[1]],
[-1, 3, COT2f, [channel count]],
[-1, SimAM, [channel count]].
Referring to fig. 6, in an embodiment, step four further includes: cutting the read-in test picture into 4 tiles, each with half the height and width of the original; sending the tiles into the network sequentially for four detection passes; stitching the results and images back into the original picture; merging the recognized results; and deriving the final coordinate information detx[i] and dety[i] as follows:
moreover, the image to be detected is not subjected to any data enhancement other than scaling to the specified recognition resolution. And step four, small target classification and coordinate frame information can be obtained.
The inventors have made various improvements to the YOLO v8 algorithm for small-target inference in different competitions, such as the RoboMaster national collegiate robotics competition. In a specific embodiment, the present scheme improves accuracy by 12 points over the official YOLO v8. In this embodiment there are 4000 pictures in total. The following table compares floating-point operations, parameter counts, mAP values, and inference speed (TensorRT, RTX 3060 12 GB) across the schemes.
The scheme of the invention trades some inference speed for a substantially larger gain in accuracy.
A second embodiment of the present invention provides a storage medium storing a computer program which, when read and executed by a processor, performs the small target detection method described above.
A third embodiment of the present invention provides an electronic device, comprising a processor, a memory, and a communication bus, the processor communicating with the memory via the communication bus to perform the small target detection method described above.
According to the small target detection method, storage medium, and electronic equipment of the invention, the deep learning neural network is improved on the basis of the YOLO v8 framework: the BottleNeck module of the C2f unit in the backbone component is replaced by the BottleNeck module of the HorNet unit, and/or the BottleNeck module of the C2f unit in the neck component is replaced by the COT module, improving accuracy for small target detection.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the present teachings should, therefore, be determined not with reference to the above description, but instead with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. The disclosures of all articles and references, including patent applications and publications, are incorporated herein by reference for the purpose of completeness. The omission in the preceding claims of any aspect of the subject matter disclosed herein is not a disclaimer of that subject matter, nor should the applicant be deemed to have regarded such subject matter as not part of the disclosed subject matter.
Claims (11)
1. A method for detecting a small target, comprising:
step one, a training data set is established;
step two, labeling and classifying the images in the training data set, the labeled images being divided into a training set, a validation set, and a test set;
step three, inputting the training set and the validation set into a deep learning neural network for training, wherein the deep learning neural network adopts the YOLO v8 framework in which the BottleNeck module of the C2f unit in the backbone component is replaced by the BottleNeck module of the HorNet unit, and/or the BottleNeck module of the C2f unit in the neck component is replaced by a COT module;
and step four, detecting the image to be detected by adopting the trained deep learning neural network.
2. The method of claim 1, wherein in step one, the images in the training data set include near images and far images, each with resolution greater than 1280×1280.
3. The method for detecting small targets according to claim 1, wherein in step two, labeling is performed with the LabelImg software and saved in the YOLO or COCO data format, and the training, validation, and test sets are split in an 8:1:1 ratio.
4. The method of claim 1, further comprising an intermediate step between step two and step three in which the training data set is augmented by any one or more of the following: flipping images top to bottom, color-gamut enhancement, image copying, image scaling, and mosaic augmentation.
5. The method of claim 1, further comprising setting a training strategy before the training in step three, the training strategy including the number of training epochs, batch size, number of worker threads, learning rate, SGD training parameters, and loss coefficients, the loss coefficients including a predicted-box loss coefficient, a classification loss coefficient, and a DFL regression loss coefficient.
6. The method for detecting a small target according to claim 1, wherein step three comprises: obtaining four feature heads through three up-sampling and three down-sampling operations; decoupling the features with a decoupled head into two paths, where one path passes through a CBS convolution followed by convolution-classification to produce the classification output tensor, and the other passes through a CBS convolution followed by convolution-regression to produce the target tensor with 4 × reg_max output channels; and computing and back-propagating the classification and regression losses, where the classification loss uses a BCE loss function and the regression loss uses DFL + CIoU.
7. The method for detecting a small target according to claim 1, wherein the fourth step further comprises: cutting the read-in test picture into 4 sub-pictures, each with half the height and half the width of the original picture; sending the cut pictures into the network sequentially for detection, performing four detections in total; stitching the results and images back into the original picture and merging the recognized results; and deriving the final coordinate information detx[i] and dety[i] as follows:
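The tile-and-merge scheme of claim 7 can be sketched as follows: each of the four half-size tiles is detected independently, and each tile-local box is shifted by its tile's origin to recover original-image coordinates. This is one plausible interpretation of the detx[i]/dety[i] merge (the claim's formula is not reproduced in this excerpt); `detect_fn` is a hypothetical per-tile detector callback.

```python
def tile_detect(image_hw, detect_fn):
    """Detect on 4 half-height, half-width tiles and merge boxes back
    into the original image frame."""
    H, W = image_hw
    h, w = H // 2, W // 2
    # tile origins: top-left, top-right, bottom-left, bottom-right
    offsets = [(0, 0), (0, w), (h, 0), (h, w)]
    merged = []
    for oy, ox in offsets:
        # detect_fn returns tile-local (x1, y1, x2, y2, score) boxes
        for (x1, y1, x2, y2, score) in detect_fn(oy, ox, h, w):
            # shift tile-local coordinates by the tile's origin
            merged.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, score))
    return merged
```

A real pipeline would follow this with non-maximum suppression so that objects straddling a tile boundary are not counted twice.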
8. The method of detecting small objects according to claim 1, wherein in the fourth step, no data enhancement is applied to the image to be detected other than scaling it to the specified recognition resolution.
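The scaling permitted by claim 8 is typically an aspect-preserving "letterbox" resize. A sketch under that assumption, using a naive nearest-neighbour resize and the conventional gray pad value 114 (both illustrative, not stated in the claim):

```python
import numpy as np

def letterbox(img, size=640, pad_value=114):
    """Scale to the recognition resolution without other augmentation:
    resize the longer side to `size`, keep the aspect ratio, pad the rest."""
    h, w = img.shape[:2]
    r = size / max(h, w)
    nh, nw = int(round(h * r)), int(round(w * r))
    # naive nearest-neighbour resize via index sampling
    ys = (np.arange(nh) * h // nh).clip(0, h - 1)
    xs = (np.arange(nw) * w // nw).clip(0, w - 1)
    resized = img[ys][:, xs]
    # centre the resized image on a constant-valued canvas
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas, r, (left, top)
```

Returning the scale `r` and offset lets detected boxes be mapped back to the original image coordinates.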
9. The method of claim 1, wherein the SPPF structure in the backbone component is replaced with an SPPFCSPC structure.
10. A storage medium having stored thereon a computer program which, when read and executed by a processor, performs the method of detecting a small object according to any one of claims 1 to 9.
11. An electronic device, comprising: a processor, a memory, and a communication bus, wherein the processor communicates with the memory via the communication bus, and the processor is configured to perform the method of detecting a small object according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310224481.3A CN116385773A (en) | 2023-03-09 | 2023-03-09 | Small target detection method, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310224481.3A CN116385773A (en) | 2023-03-09 | 2023-03-09 | Small target detection method, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116385773A true CN116385773A (en) | 2023-07-04 |
Family
ID=86963916
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310224481.3A Pending CN116385773A (en) | 2023-03-09 | 2023-03-09 | Small target detection method, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116385773A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116681885A (en) * | 2023-08-03 | 2023-09-01 | 国网安徽省电力有限公司超高压分公司 | Infrared image target identification method and system for power transmission and transformation equipment |
CN116681885B (en) * | 2023-08-03 | 2024-01-02 | 国网安徽省电力有限公司超高压分公司 | Infrared image target identification method and system for power transmission and transformation equipment |
CN116863251A (en) * | 2023-09-01 | 2023-10-10 | 湖北工业大学 | Distributed optical fiber sensing disturbance recognition method |
CN116863251B (en) * | 2023-09-01 | 2023-11-17 | 湖北工业大学 | Distributed optical fiber sensing disturbance recognition method |
CN117173542A (en) * | 2023-10-26 | 2023-12-05 | 山东易图信息技术有限公司 | Method and system for detecting and optimizing water floaters based on YOLOV8 model |
CN117173542B (en) * | 2023-10-26 | 2024-05-28 | 山东易图信息技术有限公司 | Method and system for detecting and optimizing water floaters based on YOLOV8 model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108564029B (en) | Face attribute recognition method based on cascade multitask learning deep neural network | |
CN110084216B (en) | Face recognition model training and face recognition method, system, device and medium | |
US11328172B2 (en) | Method for fine-grained sketch-based scene image retrieval | |
CN116385773A (en) | Small target detection method, storage medium and electronic equipment | |
CN108734210B (en) | Object detection method based on cross-modal multi-scale feature fusion | |
CN112926641B (en) | Three-stage feature fusion rotating machine fault diagnosis method based on multi-mode data | |
CN108960261B (en) | Salient object detection method based on attention mechanism | |
CN110059728B (en) | RGB-D image visual saliency detection method based on attention model | |
CN113822209B (en) | Hyperspectral image recognition method and device, electronic equipment and readable storage medium | |
CN114841244B (en) | Target detection method based on robust sampling and mixed attention pyramid | |
CN113034506B (en) | Remote sensing image semantic segmentation method and device, computer equipment and storage medium | |
CN110879982A (en) | Crowd counting system and method | |
CN109409504A (en) | A kind of data processing method, device, computer and storage medium | |
CN113743417B (en) | Semantic segmentation method and semantic segmentation device | |
CN111400572A (en) | Content safety monitoring system and method for realizing image feature recognition based on convolutional neural network | |
CN111739037B (en) | Semantic segmentation method for indoor scene RGB-D image | |
CN112364747B (en) | Target detection method under limited sample | |
CN111652273A (en) | Deep learning-based RGB-D image classification method | |
CN114676776A (en) | Fine-grained image classification method based on Transformer | |
CN114022727B (en) | Depth convolution neural network self-distillation method based on image knowledge review | |
CN114898167A (en) | Multi-view subspace clustering method and system based on inter-view difference detection | |
CN112115806A (en) | Remote sensing image scene accurate classification method based on Dual-ResNet small sample learning | |
CN110111365A (en) | Training method and device and method for tracking target and device based on deep learning | |
CN117611838A (en) | Multi-label image classification method based on self-adaptive hypergraph convolutional network | |
CN117634561A (en) | Implementation method of multi-view heterogeneous hypergraph convolutional network model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||