CN113807472A - Hierarchical target detection method and device - Google Patents

Hierarchical target detection method and device

Info

Publication number
CN113807472A
Authority
CN
China
Prior art keywords
hierarchical
network model
prediction
yolov5 network
branches
Prior art date
Legal status
Granted
Application number
CN202111375392.6A
Other languages
Chinese (zh)
Other versions
CN113807472B (en)
Inventor
张雪
罗壮
张海强
李成军
Current Assignee
Zhidao Network Technology Beijing Co Ltd
Original Assignee
Zhidao Network Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhidao Network Technology Beijing Co Ltd filed Critical Zhidao Network Technology Beijing Co Ltd
Priority to CN202111375392.6A priority Critical patent/CN113807472B/en
Publication of CN113807472A publication Critical patent/CN113807472A/en
Application granted granted Critical
Publication of CN113807472B publication Critical patent/CN113807472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256 Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent


Abstract

The application relates to a hierarchical target detection method and device. The method comprises the following steps: inputting a training set into a hierarchical YOLOv5 network model; predicting the label boxes of different sizes in the training set through the 3 prediction branches of the hierarchical YOLOv5 network model, to obtain the respective prediction output of each of the 3 prediction branches; calculating the respective loss function value of each of the 3 prediction branches according to its prediction output; if the maximum of the loss function values of the 3 prediction branches is smaller than a set loss threshold and/or the number of loop iterations of training reaches a set iteration count, determining that training of the hierarchical YOLOv5 network model is complete; and inputting an image containing targets into the trained hierarchical YOLOv5 network model so that it performs target detection. The scheme provided by the application can achieve an optimal detection effect on targets of different sizes.

Description

Hierarchical target detection method and device
Technical Field
The present application relates to the field of navigation technologies, and in particular, to a hierarchical target detection method and apparatus.
Background
Automatically detecting targets of various sizes (e.g., traffic signs) in a traffic scene is a primary processing step in autonomous driving. Detecting targets of different sizes quickly and accurately provides precise environmental information for the automatic navigation of an autonomous vehicle, and is key to safe driving.
In the related art, detection of targets of different sizes usually fuses shallow and deep features. This fusion makes information redundant for large, medium and small targets, causes the scales to interfere with one another during fusion, and introduces invalid background noise, so that no target size achieves its optimal detection effect.
Disclosure of Invention
To solve or partially solve the problems in the related art, the application provides a hierarchical target detection method and apparatus that can achieve an optimal detection effect on targets of different sizes.
A first aspect of the present application provides a hierarchical target detection method, including:
inputting a training set into a hierarchical YOLOv5 network model, wherein the hierarchical YOLOv5 network model removes the information-interaction function of the 3 prediction branches of the base YOLOv5 network model, so that the 3 prediction branches of the hierarchical YOLOv5 network model directly output detection results;
predicting the label boxes of different sizes in the training set through the 3 prediction branches of the hierarchical YOLOv5 network model, to obtain the respective prediction output of each of the 3 prediction branches;
calculating the respective loss function value of each of the 3 prediction branches of the hierarchical YOLOv5 network model according to its prediction output;
if the maximum of the loss function values of the 3 prediction branches of the hierarchical YOLOv5 network model is smaller than a set loss threshold and/or the number of loop iterations of training reaches a set iteration count, determining that training of the hierarchical YOLOv5 network model is complete;
inputting an image containing targets into the trained hierarchical YOLOv5 network model, so that the trained hierarchical YOLOv5 network model performs target detection.
Preferably, before determining that training of the hierarchical YOLOv5 network model is complete when the maximum of the loss function values of the 3 prediction branches is smaller than the set loss threshold and/or the number of loop iterations reaches the set iteration count, the method further includes:
completing, within one iteration, the training of each of the 3 prediction branches of the hierarchical YOLOv5 network model according to the number of learning times of each branch in one period of that iteration.
Preferably, before completing the training of the 3 prediction branches within one iteration according to their learning times in one period of that iteration, the method further includes:
determining the number of learning times of each of the 3 prediction branches in one period of the iteration according to the loss function value of each branch.
Preferably, the predicting the label boxes of different sizes in the training set by the 3 prediction branches of the hierarchical YOLOv5 network model, to obtain the respective prediction outputs of the 3 prediction branches, includes:
splitting the label boxes of different sizes in the training set into 3 classes: a small label box class, a medium label box class and a large label box class;
predicting the label boxes of the small, medium and large label box classes respectively through the small-target prediction branch, the medium-target prediction branch and the large-target prediction branch of the hierarchical YOLOv5 network model, to obtain the prediction output of the small-target prediction branch on the label boxes of the small label box class, the prediction output of the medium-target prediction branch on the label boxes of the medium label box class, and the prediction output of the large-target prediction branch on the label boxes of the large label box class.
Preferably, the splitting the label boxes of different sizes in the training set into the 3 classes (the small, medium and large label box classes) includes:
clustering the label boxes of different sizes in the training set with a clustering algorithm to obtain 3 cluster center boxes;
determining, from the 3 cluster center boxes, the boundary lines that split the label boxes of different sizes in the training set into 3 classes;
dividing the label boxes of different sizes in the training set, according to the determined boundary lines, into the 3 classes: the small label box class, the medium label box class and the large label box class.
A second aspect of the present application provides a hierarchical target detection apparatus, the apparatus comprising:
a first input module, configured to input a training set into a hierarchical YOLOv5 network model, wherein the hierarchical YOLOv5 network model removes the information-interaction function of the 3 prediction branches of the base YOLOv5 network model, so that the 3 prediction branches directly output detection results;
a prediction output module, configured to predict the label boxes of different sizes in the training set input by the first input module through the 3 prediction branches of the hierarchical YOLOv5 network model, to obtain the respective prediction output of each of the 3 prediction branches;
a loss calculation module, configured to calculate the respective loss function value of each of the 3 prediction branches according to the prediction outputs obtained by the prediction output module;
a training completion module, configured to determine that training of the hierarchical YOLOv5 network model is complete if the maximum of the loss function values obtained by the loss calculation module is smaller than a set loss threshold and/or the number of loop iterations of training reaches a set iteration count;
a second input module, configured to input an image containing targets into the trained hierarchical YOLOv5 network model determined by the training completion module, so that the trained hierarchical YOLOv5 network model performs target detection.
Preferably, the apparatus further comprises:
a training module, configured to complete, within one iteration, the training of each of the 3 prediction branches of the hierarchical YOLOv5 network model according to the number of learning times of each branch in one period of that iteration.
Preferably, the apparatus further comprises:
a learning-times calculation module, configured to determine the number of learning times of each of the 3 prediction branches in one period of the iteration according to the loss function values obtained by the loss calculation module.
A third aspect of the present application provides an electronic device comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to perform the method as described above.
The technical solution provided by the application can have the following beneficial effects:
In the technical solution of the application, the 3 prediction branches of the hierarchical YOLOv5 network model have independent structures, independent features, independent outputs and independent loss function values, and each branch is trained independently. Large, medium and small targets are detected by the 3 prediction branches respectively, each branch outputs its detection results directly, the predictions of the 3 branches do not interfere with one another, and an optimal detection effect can be achieved on targets of all three sizes.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular description of exemplary embodiments of the application, as illustrated in the accompanying drawings, wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
Fig. 1 is a schematic flowchart of a hierarchical object detection method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of the hierarchical YOLOv5 network model of a hierarchical target detection method according to an embodiment of the present application;
FIG. 3 is another schematic flow chart diagram illustrating a hierarchical object detection method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a hierarchical object detection apparatus according to an embodiment of the present application;
fig. 5 is another schematic structural diagram of a hierarchical object detection apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In the related-art YOLOv5 network model, large, medium and small targets are detected as follows: an image containing a large target A, a medium target B and a small target C is input into the YOLOv5 network model, and semantic features F1, F2 and F3 of different levels are obtained after the input image passes through the convolutional neural network of the model.
F1 is a shallow feature, downsampled 8 times relative to the original image, so the information contained in each of its pixels covers a region of the original image of the size of the dashed box 101. The F1 feature expresses the information of the small target C most completely, but expresses the information of the large and medium targets A and B insufficiently.
F2 is a middle-level feature, downsampled 16 times relative to the original image, so each of its pixels covers a region of the size of the dashed box 102. The F2 feature expresses the information of the medium target B most completely, but expresses the large target A insufficiently and over-expresses the small target C, mixing in much background information beyond target C itself.
F3 is a deep-level feature, downsampled 32 times relative to the original image, so each of its pixels covers a region of the size of the dashed box 103. The F3 feature expresses the information of the large target A most completely, but over-expresses targets B and C, mixing in much background information beyond the targets.
To improve the detection of targets of different sizes, the YOLOv5 network model fuses the information of F1, F2 and F3 across scales, regresses the detection results of targets of different sizes jointly, back-propagates the loss of the overall detection result, and thereby optimizes the model.
However, for the large target A, F1 and F2 add information redundancy; for the medium target B, F1 adds redundancy while the F3 features introduce much background interference; and for the small target C, F2 and F3 introduce background interference. The related-art YOLOv5 network model therefore cannot achieve the optimal detection effect on large, medium and small targets simultaneously.
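The scale relationships described above can be checked with a little arithmetic. The sketch below (function name and the 640-pixel input are illustrative, not from the application) shows that at YOLOv5's three output strides one feature-map pixel summarizes an 8x8, 16x16 or 32x32 patch of the original image, which is why the shallow map suits small targets and the deep map suits large ones:

```python
# Illustrative arithmetic: feature-map size and per-pixel coverage at the
# three downsampling strides of the F1/F2/F3 feature levels.
def feature_map_size(img_w, img_h, stride):
    """Spatial size of the feature map produced at a given downsampling stride."""
    return img_w // stride, img_h // stride

for name, stride in [("F1", 8), ("F2", 16), ("F3", 32)]:
    fw, fh = feature_map_size(640, 640, stride)
    print(f"{name}: {fw}x{fh} map, each pixel covers {stride}x{stride} px")
```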
In view of the above problems, embodiments of the present application provide a hierarchical target detection method, which can achieve optimal detection effects for targets with different sizes.
The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Embodiment one:
fig. 1 is a schematic flowchart of a hierarchical target detection method according to an embodiment of the present application; fig. 2 is a schematic structural diagram of the hierarchical YOLOv5 network model of a hierarchical target detection method according to an embodiment of the present application.
Referring to fig. 1, a hierarchical object detection method includes:
in step S101, a training set is input to a hierarchical YOLOV5 network model, wherein the hierarchical YOLOV5 network model removes the information interaction function of 3 predicted branches on the basis of the YOLOV5 network model, and the 3 predicted branches of the hierarchical YOLOV5 network model directly output the detection result.
In one embodiment, as shown in fig. 2, the hierarchical YOLOV5 network model includes a backbone network 20, 3 predicted branches: small target prediction branch 201, medium target prediction branch 202, large target prediction branch 203. The hierarchical Yolov5 network model removes the information interaction function of 3 prediction branches on the basis of the Yolov5 network model, the 3 prediction branches of the hierarchical Yolov5 network model have independent structure, independent characteristics, independent output and independent loss function values, the 3 prediction branches are independently trained, the output information of the 3 prediction branches is not interacted with each other, and the detection result can be directly output.
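As a structural illustration only (class and method names such as `HierarchicalYOLOv5`, `PredictionBranch` and `detect` are hypothetical, and no real network layers are implemented), the independence of the three branches can be sketched as follows: each branch consumes only its own feature level and emits its result directly, with no cross-branch fusion step.

```python
# Structural sketch of the hierarchical model: three branches, no interaction.
class PredictionBranch:
    def __init__(self, name, stride):
        self.name, self.stride = name, stride

    def predict(self, features):
        # Each branch sees only its own feature level and outputs
        # detections directly; there is no cross-branch fusion.
        return {"branch": self.name, "stride": self.stride, "features": features}

class HierarchicalYOLOv5:
    def __init__(self):
        self.branches = [PredictionBranch("small", 8),
                         PredictionBranch("medium", 16),
                         PredictionBranch("large", 32)]

    def detect(self, feature_levels):
        # feature_levels: one feature map per branch (F1, F2, F3)
        return [b.predict(f) for b, f in zip(self.branches, feature_levels)]

outs = HierarchicalYOLOv5().detect(["F1", "F2", "F3"])
```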
In step S102, the label boxes of different sizes in the training set are predicted by the 3 prediction branches of the hierarchical YOLOv5 network model, to obtain the respective prediction output of each of the 3 prediction branches.
In an embodiment, according to the sizes of the label boxes in the training set, the label boxes of different sizes can be split by boundary lines into 3 classes: a small label box class, a medium label box class and a large label box class. The label boxes of the small label box class are predicted by the small-target prediction branch of the hierarchical YOLOv5 network model to obtain the boxes predicted by that branch; the label boxes of the medium label box class are predicted by the medium-target prediction branch to obtain the boxes predicted by that branch; and the label boxes of the large label box class are predicted by the large-target prediction branch to obtain the boxes predicted by that branch.
In step S103, the loss function values of the 3 prediction branches of the hierarchical YOLOv5 network model are calculated from the prediction outputs of the 3 prediction branches.
In one embodiment, the loss function value of the small-target prediction branch is calculated from the boxes predicted by the small-target prediction branch and the label boxes of the split-out small label box class; the loss function value of the medium-target prediction branch is calculated from the boxes predicted by the medium-target prediction branch and the label boxes of the split-out medium label box class; and the loss function value of the large-target prediction branch is calculated from the boxes predicted by the large-target prediction branch and the label boxes of the split-out large label box class.
In step S104, if the maximum of the loss function values of the 3 prediction branches of the hierarchical YOLOv5 network model is smaller than the set loss threshold and/or the number of loop iterations of training reaches the set iteration count, it is determined that training of the hierarchical YOLOv5 network model is complete.
In one embodiment, if the loss function values of the 3 prediction branches (the loss of the small-target prediction branch, the loss of the medium-target prediction branch and the loss of the large-target prediction branch) are all smaller than the set loss threshold, training of the hierarchical YOLOv5 network model can be determined to be complete; and/or, if the number of loop iterations of training of the hierarchical YOLOv5 network model reaches the set iteration count, training can be determined to be complete.
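The stopping rule of step S104 can be sketched in a few lines. This is a minimal sketch, assuming a simple "and/or" combination as either condition sufficing; the function name and signature are illustrative, not from the application:

```python
# Stop when the worst (maximum) branch loss falls below the threshold,
# or when the loop-iteration budget is exhausted.
def training_done(branch_losses, iteration, loss_threshold, max_iterations):
    return max(branch_losses) < loss_threshold or iteration >= max_iterations

# e.g. all three branch losses below 0.05 -> done before the iteration cap
done = training_done([0.03, 0.02, 0.04], iteration=120,
                     loss_threshold=0.05, max_iterations=300)
```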
In step S105, an image containing targets is input into the trained hierarchical YOLOv5 network model, so that the trained model performs target detection.
According to the hierarchical target detection method of this embodiment, the 3 prediction branches of the hierarchical YOLOv5 network model have independent structures, independent features, independent outputs and independent loss function values, and each branch is trained independently. Large, medium and small targets are detected by the 3 prediction branches respectively, each branch outputs its detection results directly, the predictions of the 3 branches do not interfere with one another, and an optimal detection effect can be achieved on targets of all three sizes.
Embodiment two:
fig. 3 is another schematic flowchart of a hierarchical target detection method according to an embodiment of the present application. Compared with fig. 1, fig. 3 describes the solution of the present application in more detail.
In step S301, a training set is obtained in which the targets of different sizes in a plurality of images are correctly labeled with label boxes of different sizes.
In an embodiment, the images contain targets of different sizes. Label boxes of the corresponding sizes can be used to correctly label these targets in the multiple images, yielding a training set of correctly labeled images. While the targets of different sizes are being labeled, the targets in the images can also be assigned their class labels.
In step S302, the training set is input into the hierarchical YOLOv5 network model, wherein the hierarchical YOLOv5 network model removes the information-interaction function of the 3 prediction branches of the base YOLOv5 network model, and its 3 prediction branches directly output detection results.
This step can be referred to the description of step S101, and is not described herein again.
In step S303, a clustering algorithm is used to cluster the label boxes of different sizes in the training set to obtain 3 cluster center boxes.
In one embodiment, a K-means clustering algorithm can be adopted to cluster the label boxes of different sizes in the training set into 3 cluster center boxes: a small-target cluster center box BoxesS, a medium-target cluster center box BoxesM and a large-target cluster center box BoxesL.
In step S304, the boundary lines that split the label boxes of different sizes in the training set into 3 classes are determined from the 3 cluster center boxes.
In one embodiment, the boundary lines splitting the label boxes of different sizes in the training set into 3 classes include a first boundary line Split_sm and a second boundary line Split_ml. Both can be determined from the widths and heights of the small-target cluster center box BoxesS, the medium-target cluster center box BoxesM and the large-target cluster center box BoxesL:
First boundary line: Split_sm = mean(max(BoxesS.w, BoxesS.h), min(BoxesM.w, BoxesM.h));
Second boundary line: Split_ml = mean(max(BoxesM.w, BoxesM.h), min(BoxesL.w, BoxesL.h)).
Here BoxesS.w and BoxesS.h are the width and height of the small-target cluster center box, BoxesM.w and BoxesM.h are the width and height of the medium-target cluster center box, and BoxesL.w and BoxesL.h are the width and height of the large-target cluster center box; max(a, b) and min(a, b) denote the maximum and minimum of their two arguments, and mean(a, b) denotes their average.
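The two boundary lines of step S304 can be computed directly from the cluster center boxes. In this sketch the (w, h) values are made-up examples, not data from the application:

```python
# Compute the small/medium and medium/large boundary lines from the
# (width, height) of the three cluster center boxes.
def mean(a, b):
    return (a + b) / 2.0

def split_lines(boxes_s, boxes_m, boxes_l):
    """Each argument is the (width, height) of one cluster center box."""
    split_sm = mean(max(boxes_s), min(boxes_m))  # small/medium boundary
    split_ml = mean(max(boxes_m), min(boxes_l))  # medium/large boundary
    return split_sm, split_ml

split_sm, split_ml = split_lines((12, 16), (40, 48), (120, 150))
# mean(16, 40) gives 28.0 and mean(48, 120) gives 84.0
```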
In step S305, the label boxes of different sizes in the training set are divided, according to the determined boundary lines, into 3 classes: the small label box class, the medium label box class and the large label box class.
In one embodiment, each label box of each image (sample) in the training set is assigned, according to the first boundary line Split_sm and the second boundary line Split_ml, to one of three classes: the small label box class targetS, the medium label box class targetM and the large label box class targetL. Let x be the narrowest side of a target's label box in the image: if x ≤ Split_sm, the label box corresponding to x belongs to the targetS class; if Split_sm < x < Split_ml, it belongs to the targetM class; and if x ≥ Split_ml, it belongs to the targetL class.
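The assignment rule of step S305 can be sketched as a small function. The boundary values 28 and 84 below are illustrative, not from the application:

```python
# Assign a label box to a size class by the shorter side x of the box,
# using the two boundary lines split_sm and split_ml.
def classify_box(w, h, split_sm, split_ml):
    x = min(w, h)  # the narrowest side of the label box
    if x <= split_sm:
        return "targetS"   # small label box class
    if x < split_ml:
        return "targetM"   # medium label box class
    return "targetL"       # large label box class

classes = [classify_box(w, h, 28, 84) for (w, h) in [(10, 20), (50, 60), (90, 200)]]
```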
In step S306, according to the labeled boxes divided into 3 classes, a clustering algorithm is respectively used to generate anchor boxes of the 3 predicted branches of the hierarchical YOLOV5 network model.
In one implementation, according to the labeling boxes belonging to the targetS class, the anchor boxes anchors of the small target prediction branch branchS are generated by a K-means clustering algorithm, and comprise 3 anchor boxes; according to the labeling boxes belonging to the targetM class, the anchor boxes anchors of the medium target prediction branch branchM are generated by a K-means clustering algorithm, and comprise 3 anchor boxes; and according to the labeling boxes belonging to the targetL class, the anchor boxes anchors of the large target prediction branch branchL are generated by a K-means clustering algorithm, and comprise 3 anchor boxes.
branchS anchors = [BoxesS1.w,BoxesS1.h,BoxesS2.w,BoxesS2.h,BoxesS3.w,BoxesS3.h];
branchM anchors = [BoxesM1.w,BoxesM1.h,BoxesM2.w,BoxesM2.h,BoxesM3.w,BoxesM3.h];
branchL anchors = [BoxesL1.w,BoxesL1.h,BoxesL2.w,BoxesL2.h, BoxesL3.w,BoxesL3.h]。
In one embodiment, the anchor frame generating step includes:
S3061: given k cluster center points (Wj, Hj), j = 1, 2, ..., k, where (Wj, Hj) is the width and height of an initial anchor box.
S3062: calculate the distance d from each labeling box to each cluster center point, and assign each labeling box to the cluster of its nearest cluster center.
S3063: after all labeling boxes have been assigned, recompute each cluster's center point as the mean width and height of all labeling boxes in that cluster.
S3064: repeat steps S3062 and S3063 until the change in the cluster center points falls below a set threshold, and generate the anchor boxes.
In one embodiment, k has a value of 3.
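Steps S3061–S3064 can be sketched as follows. This is an assumption-laden illustration: the patent only specifies "a distance d" and mean-based center updates, so plain Euclidean distance over (w, h) is used here, whereas YOLO implementations often use an IoU-based distance instead.

```python
import random

# Hypothetical sketch of steps S3061-S3064: K-means over labeling-box
# (w, h) pairs with Euclidean distance and mean-based center updates.

def kmeans_anchors(boxes, k=3, tol=1e-4, seed=0):
    random.seed(seed)
    centers = random.sample(boxes, k)            # S3061: initial centers
    while True:
        clusters = [[] for _ in range(k)]
        for w, h in boxes:                       # S3062: assign to nearest center
            j = min(range(k), key=lambda i: (w - centers[i][0]) ** 2
                                            + (h - centers[i][1]) ** 2)
            clusters[j].append((w, h))
        new_centers = [                          # S3063: mean width/height per cluster
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        shift = max(max(abs(nc[0] - c[0]), abs(nc[1] - c[1]))
                    for c, nc in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                          # S3064: centers have settled
            return sorted(centers)

# With one labeling box per cluster, the centers converge immediately.
print(kmeans_anchors([(10, 12), (50, 55), (150, 160)], k=3))
# [(10.0, 12.0), (50.0, 55.0), (150.0, 160.0)]
```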
In step S307, the 3 classes of label boxes are predicted by using the 3 predicted branches of the hierarchical YOLOV5 network model, so as to obtain the respective prediction outputs of the 3 predicted branches of the hierarchical YOLOV5 network model.
In one embodiment, the small target prediction branch, the medium target prediction branch, and the large target prediction branch of the hierarchical YOLOV5 network model may be used to respectively predict the labeling boxes of the small labeling box class, the medium labeling box class, and the large labeling box class, so as to respectively obtain the prediction output of the small target prediction branch on the labeling box of the small labeling box class, the prediction output of the medium target prediction branch on the labeling box of the medium labeling box class, and the prediction output of the large target prediction branch on the labeling box of the large labeling box class.
In a specific implementation mode, the small target prediction branch branchS performs prediction output on a marking frame belonging to a targetS class according to an anchor frame anchors of the small target prediction branch branchS to obtain a prediction frame detectS of the small target prediction branch prediction output; the medium target prediction branch branchM performs prediction output on a marking frame belonging to the targetM class according to an anchor frame anchors of the medium target prediction branch branchM to obtain a prediction frame detectM of the medium target prediction branch prediction output; and the large target prediction branch branchL performs prediction output on the marking frame belonging to the targetL class according to the anchor frame anchors of the large target prediction branch branchL to obtain a prediction frame detectL of the large target prediction branch prediction output.
In a specific embodiment, the backbone network of the hierarchical YOLOV5 model obtains small-target features by 8-fold down-sampling for the labeling boxes belonging to the targetS class in the training set, medium-target features by 16-fold down-sampling for the labeling boxes belonging to the targetM class, and large-target features by 32-fold down-sampling for the labeling boxes belonging to the targetL class. The small target prediction branch branchS predicts on the small-target features of the targetS-class labeling boxes using its three anchor boxes anchors and outputs the small-target prediction boxes detectS; the medium target prediction branch branchM predicts on the medium-target features of the targetM-class labeling boxes using its three anchor boxes anchors and outputs the medium-target prediction boxes detectM; and the large target prediction branch branchL predicts on the large-target features of the targetL-class labeling boxes using its three anchor boxes anchors and outputs the large-target prediction boxes detectL.
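For a concrete sense of the three feature scales, the grid sizes at the three down-sampling strides can be computed as below; the 640×640 input size is an assumption for illustration (a common YOLOv5 input size, not stated in the patent):

```python
# Illustrative only: feature-map grid sizes at the three down-sampling
# strides, assuming a 640x640 input image.
strides = {"branchS": 8, "branchM": 16, "branchL": 32}
for name, s in strides.items():
    print(f"{name}: {640 // s} x {640 // s}")
# branchS: 80 x 80 (small-target features)
# branchM: 40 x 40 (medium-target features)
# branchL: 20 x 20 (large-target features)
```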
In step 308, the loss function values of the 3 predicted branches of the hierarchical YOLOV5 network model are calculated according to the predicted outputs of the 3 predicted branches of the hierarchical YOLOV5 network model.
In one embodiment, the loss function Loss includes a localization loss using DIOU_Loss (Distance-IoU Loss) and a classification loss using Focal Loss. DIOU_Loss builds on IoU and GIoU by additionally considering the distance between the bounding-box center points, giving faster convergence. Focal Loss reduces the contribution of easily classified samples to the loss function and focuses training on hard-to-classify samples. According to the targetS-class, targetM-class, and targetL-class labeling boxes before prediction, the corresponding prediction outputs detectS, detectM, and detectL, and the loss function, the loss function values Loss of the 3 prediction branches of the hierarchical YOLOV5 network model are each calculated independently: loss function value of the small target prediction branch LossS = f(targetS, detectS); loss function value of the medium target prediction branch LossM = f(targetM, detectM); loss function value of the large target prediction branch LossL = f(targetL, detectL).
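A minimal sketch of the DIoU localization loss described above, using the standard Distance-IoU formulation (boxes as (x1, y1, x2, y2) corners; this is the published DIoU definition, not code from the patent):

```python
# Standard DIoU loss: 1 - IoU + d^2 / c^2, where d is the distance
# between box centers and c is the diagonal of the smallest enclosing box.

def diou_loss(b1, b2):
    """DIoU loss for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection area.
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (a1 + a2 - inter + 1e-9)
    # Squared distance between box centers.
    d2 = ((b1[0] + b1[2]) / 2 - (b2[0] + b2[2]) / 2) ** 2 \
       + ((b1[1] + b1[3]) / 2 - (b2[1] + b2[3]) / 2) ** 2
    # Squared diagonal of the smallest enclosing box.
    cw = max(b1[2], b2[2]) - min(b1[0], b2[0])
    ch = max(b1[3], b2[3]) - min(b1[1], b2[1])
    c2 = cw ** 2 + ch ** 2 + 1e-9
    return 1.0 - iou + d2 / c2

print(diou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # ~0.0 for identical boxes
```

Unlike plain IoU loss, the d²/c² penalty still provides a gradient when the boxes do not overlap, which is why the text notes faster convergence.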
In step S309, the learning times of each of the 3 prediction branches in one cycle of one iteration are respectively determined according to the loss function values of each of the 3 prediction branches of the hierarchical YOLOV5 network model.
In one embodiment, when 3 independent prediction branches of the hierarchical YOLOV5 network model are independently trained using the same training set, the loss function values of the 3 independent prediction branches are not the same size. The larger the loss function value of the prediction branch is, the larger the training difficulty of the independent prediction branch by adopting the same sample of the training set is, and the larger the training learning times are, that is, when 3 independent prediction branches of the hierarchical YOLOV5 network model are trained by using the same training set, the learning times of one epoch of one iteration of the 3 independent prediction branches should be different. In an epoch of one iteration, the learning times of the 3 independent prediction branches in the epoch of one iteration can be respectively calculated according to the training set of the input hierarchical YOLOV5 network model and the loss function values of the 3 independent prediction branches.
Learning times of the small target prediction branch, TrainS:
TrainS = round(softmax(LossS, LossM, LossL)[0] * 10, 0)
= round(exp(LossS) / (exp(LossS) + exp(LossM) + exp(LossL)) * 10, 0);
learning times of the medium target prediction branch, TrainM:
TrainM = round(softmax(LossS, LossM, LossL)[1] * 10, 0)
= round(exp(LossM) / (exp(LossS) + exp(LossM) + exp(LossL)) * 10, 0);
learning times of the large target prediction branch, TrainL:
TrainL = round(softmax(LossS, LossM, LossL)[2] * 10, 0)
= round(exp(LossL) / (exp(LossS) + exp(LossM) + exp(LossL)) * 10, 0).
In round(X, 0), if X is smaller than 1, round(X, 0) = 1. For example, TrainL = round(softmax(LossS, LossM, LossL)[2] * 10, 0) = round(0.02 * 10, 0) = round(0.2, 0) = 1.
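The learning-times rule above can be sketched directly: a softmax over the three branch losses, scaled by 10, rounded, and floored at 1 per the round(X, 0) convention. The function name and example loss values are assumptions for illustration:

```python
import math

# Hypothetical sketch: per-branch learning times in one epoch, computed
# as round(softmax(losses) * 10) with a floor of 1.

def learning_times(loss_s, loss_m, loss_l):
    exps = [math.exp(v) for v in (loss_s, loss_m, loss_l)]
    total = sum(exps)
    return [max(1, round(e / total * 10)) for e in exps]

print(learning_times(2.0, 1.0, 0.2))
# [7, 2, 1]: the branch with the largest loss gets the most learning passes
```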
In one embodiment, epoch can be understood as a "period," and one period of an iteration (epoch) is the use of the entire training set. For example, the training set has 1000 samples in total, each sample is used to train the hierarchical YOLOV5 network model in turn, and when the 1000 samples are used once, the hierarchical YOLOV5 network model training learning of one cycle of one iteration is said to be completed. The learning times of the small target prediction branch, the medium target prediction branch and the large target prediction branch of the hierarchical Yolov5 network model in an epoch of one iteration can be respectively calculated according to the loss function value of the small target prediction branch, the loss function value of the medium target prediction branch and the loss function value of the large target prediction branch of the hierarchical Yolov5 network model.
In step S310, training of the 3 predicted branches of the hierarchical YOLOV5 network model in one iteration is respectively completed according to the learning times of the 3 predicted branches of the hierarchical YOLOV5 network model in one cycle of one iteration.
In one embodiment, the epochs of one iteration of the hierarchical YOLOV5 network model may be one or more, and the learning times of each epoch of one iteration may be the same or different. Because 3 prediction branches of the hierarchical YOLOV5 network model are independent in structure, independent in characteristics, independent in loss function values and independent in training times, 3 prediction branches are independently trained and learned separately in each epoch of one iteration, 3 prediction branches can independently and reversely propagate respective loss function values according to the learning times of one epoch of one iteration respectively, respective parameters of the 3 prediction branches are respectively updated, the learning times of each epoch of one iteration of the 3 prediction branches are completed, each epoch of one iteration is completed, and training of one iteration is completed.
In step S311, it is determined whether the maximum value of the loss function values of 3 prediction branches of the hierarchical YOLOV5 network model is smaller than a set loss threshold or whether the number of times of the loop iteration training reaches a set iteration number; if yes, go to step S312; if not, step S303 is performed.
In one embodiment, the maximum value of the loss function values of 3 prediction branches of the hierarchical YOLOV5 network model and/or the number of times of the loop iteration training may be determined, and if the maximum value of the loss function values of 3 prediction branches of the hierarchical YOLOV5 network model is less than the set loss threshold value and/or the number of times of the loop iteration training reaches the set iteration number, step S312 is executed; if the maximum value of the loss function values of the 3 prediction branches of the hierarchical YOLOV5 network model is greater than or equal to the set loss threshold value or the number of times of the loop iteration training does not reach the set iteration number, step S303 is executed, and the iteration training is continuously performed on the hierarchical YOLOV5 network model until the maximum value of the loss function values of the 3 prediction branches of the hierarchical YOLOV5 network model is less than the set loss threshold value and/or the number of times of the loop iteration training reaches the set iteration number.
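The stopping test in steps S309–S311 can be sketched as follows. Everything here is an illustrative assumption: `train_until`, the toy `fake_step`, the threshold, and the decay factor are not part of the patent; the sketch uses the "either condition" form of the and/or rule described above.

```python
# Hypothetical sketch: iterate until the worst branch loss drops below a
# set loss threshold, or the iteration budget is exhausted.

def train_until(step, loss_threshold=0.05, max_iters=300):
    """step() runs one training iteration and returns the 3 branch losses."""
    for it in range(1, max_iters + 1):
        losses = step()
        if max(losses) < loss_threshold:
            return it, losses   # converged: every branch is below the threshold
    return max_iters, losses    # iteration budget reached

# Toy stand-in for one iteration: the branch losses decay geometrically.
losses = [1.0, 0.8, 0.5]
def fake_step():
    losses[:] = [v * 0.5 for v in losses]
    return list(losses)

iters, final = train_until(fake_step)
print(iters, max(final))  # converges once max(losses) falls below 0.05
```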
In step S312, it is determined that training of the hierarchical YOLOV5 network model is complete.
In one embodiment, if the loss function values of 3 prediction branches of the hierarchical YOLOV5 network model, namely the loss function value of a small target prediction branch, the loss function value of a medium target prediction branch and the loss function value of a large target prediction branch, are less than the set loss threshold, the training of the hierarchical YOLOV5 network model can be determined to be completed; and/or if the number of times of loop iteration training of the hierarchical Yolov5 network model reaches the set iteration number, the training of the hierarchical Yolov5 network model can be determined to be finished.
It can be understood that the training of the hierarchical YOLOV5 network model is determined to be completed when the maximum value of the loss function values of the 3 prediction branches of the hierarchical YOLOV5 network model is less than the set loss threshold and the number of loop iterations reaches the set iteration number; or when either condition alone holds, that is, when the maximum loss function value is less than the set loss threshold, or when the number of loop iterations reaches the set iteration number.
In step S313, an image including a target is input to the trained hierarchical YOLOV5 network model, so that the trained hierarchical YOLOV5 network model performs target detection.
According to the hierarchical target detection method shown in the embodiments of the application, the 3 prediction branches of the hierarchical YOLOV5 network model have independent structures, independent features, independent outputs, and independent loss function values, and each of the 3 prediction branches is trained independently. Large, medium, and small targets are detected separately by the 3 prediction branches of the hierarchical YOLOV5 network model, which output their detection results directly, so the predictions of the 3 branches do not interfere with each other and an optimal detection effect can be achieved for targets of different sizes.
Further, in the hierarchical target detection method shown in the embodiment of the application, a clustering algorithm is adopted to cluster the labeled boxes with different sizes in the training set to obtain 3 clustering center boxes; according to the 3 clustering center boxes, determining to divide the marking boxes with different sizes in the training set into 3 types of boundary lines; according to the determined boundary line, dividing the marking boxes with different sizes in the training set into 3 types: the small labeling frame class, the medium labeling frame class and the large labeling frame class enable 3 prediction branches of the hierarchical Yolov5 network model to obtain respective optimal anchor frames and can maximally contain the labeling frame with the size, 3 prediction branches of the hierarchical Yolov5 network model can correspondingly and accurately predict and output the 3 labeling frames, training and learning of the hierarchical Yolov5 network model are facilitated, and training efficiency of the hierarchical Yolov5 network model is improved.
Example three:
corresponding to the embodiment of the application function implementation method, the application also provides a hierarchical target detection device, electronic equipment and a corresponding embodiment.
Fig. 4 is a schematic structural diagram of a hierarchical object detection apparatus according to an embodiment of the present application.
Referring to fig. 4, a hierarchical object detection apparatus includes a first input module 401, a prediction output module 402, a loss calculation module 403, a training completion module 404, and a second input module 405.
The first input module 401 is configured to input a training set to a hierarchical YOLOV5 network model, where the hierarchical YOLOV5 network model removes information interaction functions of 3 prediction branches on the basis of the YOLOV5 network model, and 3 prediction branches of the hierarchical YOLOV5 network model directly output a detection result.
A prediction output module 402, configured to respectively predict, through 3 prediction branches of the hierarchical YOLOV5 network model, the labeling frames with different sizes in the training set input by the first input module 401, and respectively obtain respective prediction outputs of the 3 prediction branches of the hierarchical YOLOV5 network model.
And a loss calculating module 403, configured to calculate loss function values of the 3 predicted branches of the hierarchical YOLOV5 network model respectively according to the prediction outputs of the 3 predicted branches of the hierarchical YOLOV5 network model obtained by the prediction output module 402.
A training completion module 404, configured to determine that training of the hierarchical YOLOV5 network model is completed if the maximum value of the loss function values of the 3 prediction branches of the hierarchical YOLOV5 network model obtained by the loss calculation module 403 is less than a set loss threshold and/or the number of times of the loop iteration training reaches a set iteration number.
A second input module 405, configured to input an image including a target to the trained hierarchical YOLOV5 network model determined by the training completion module 404, so that the trained hierarchical YOLOV5 network model performs target detection.
According to the technical scheme shown in the embodiments of the application, the 3 prediction branches of the hierarchical YOLOV5 network model have independent structures, independent features, independent outputs, and independent loss function values, and each of the 3 prediction branches is trained independently. Large, medium, and small targets are detected separately by the 3 prediction branches of the hierarchical YOLOV5 network model, which output their detection results directly, so the predictions of the 3 branches do not interfere with each other and an optimal detection effect can be achieved for targets of different sizes.
Example four:
fig. 5 is another schematic structural diagram of the hierarchical object detection apparatus according to the embodiment of the present application.
Referring to fig. 5, a hierarchical object detection apparatus includes a first input module 401, a prediction output module 402, a loss calculation module 403, a training completion module 404, a second input module 405, a learning number calculation module 501, and a training module 502.
The first input module 401 is configured to obtain a training set that includes correctly labeling targets with different sizes in a plurality of images by using labeling boxes with different sizes; inputting a training set into a hierarchical Yolov5 network model, wherein the hierarchical Yolov5 network model removes the information interaction function of 3 prediction branches on the basis of a Yolov5 network model, and the 3 prediction branches of the hierarchical Yolov5 network model directly output detection results.
The prediction output module 402 is configured to cluster the labeled boxes with different sizes in the training set input by the first input module 401 by using a clustering algorithm to obtain 3 clustering center boxes; according to the 3 clustering center boxes, determining to divide the marking boxes with different sizes in the training set into 3 types of boundary lines; according to the determined boundary line, dividing the marking boxes with different sizes in the training set into 3 types: small labeling box class, medium labeling box class and large labeling box class; according to the label boxes divided into 3 classes, respectively adopting a clustering algorithm to generate respective anchor boxes of 3 prediction branches of the hierarchical YOLOV5 network model; and respectively predicting the 3 types of marking boxes through 3 prediction branches of the hierarchical Yolov5 network model to respectively obtain the respective prediction outputs of the 3 prediction branches of the hierarchical Yolov5 network model.
And a loss calculating module 403, configured to calculate loss function values of the 3 predicted branches of the hierarchical YOLOV5 network model respectively according to the prediction outputs of the 3 predicted branches of the hierarchical YOLOV5 network model obtained by the prediction output module 402.
A learning number calculating module 501, configured to determine, according to the loss function values of the 3 prediction branches of the hierarchical YOLOV5 network model obtained by the loss calculating module 403, the learning number of the 3 prediction branches in one cycle of one iteration.
A training module 502, configured to complete training of the 3 prediction branches of the hierarchical YOLOV5 network model in one iteration respectively according to the learning times, determined by the learning times calculation module 501, of the 3 prediction branches of the hierarchical YOLOV5 network model in one cycle of one iteration.
A training completion module 404, configured to determine whether the maximum value of the loss function values of the 3 prediction branches of the hierarchical YOLOV5 network model obtained by the loss calculation module 403 is smaller than a set loss threshold or whether the number of times of loop iteration training reaches a set iteration number; if the maximum value of the loss function values of 3 prediction branches of the hierarchical YOLOV5 network model obtained by the loss calculation module 403 is less than a set loss threshold value and/or the number of times of the cyclic iterative training reaches a set iteration number, determining that the training of the hierarchical YOLOV5 network model is completed; if the maximum value of the loss function values of 3 prediction branches of the hierarchical YOLOV5 network model is greater than or equal to the set loss threshold value or the number of times of the loop iteration training does not reach the set iteration number, the prediction output module 402, the loss calculation module 403, the learning number calculation module 501, the training module 502 and the training completion module 404 are executed, the iterative training of the hierarchical YOLOV5 network model is continued until the training completion module 404 judges that the maximum value of the loss function values of 3 prediction branches of the hierarchical YOLOV5 network model is less than the set loss threshold value and/or the number of times of the loop iteration training reaches the set iteration number, and the training of the hierarchical YOLOV5 network model is determined to be completed.
A second input module 405, configured to input an image including a target to the trained hierarchical YOLOV5 network model determined by the training completion module 404, so that the trained hierarchical YOLOV5 network model performs target detection.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Referring to fig. 6, the electronic device 60 includes a memory 601 and a processor 602.
The Processor 602 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 601 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 602 or other modules of the computer. The permanent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the permanent storage device is a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 601 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 601 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 601 has stored thereon executable code that, when processed by the processor 602, may cause the processor 602 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having executable code (or a computer program or computer instruction code) stored thereon, which, when executed by a processor of an electronic device (or server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the present application.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A hierarchical object detection method, comprising:
inputting a training set into a hierarchical Yolov5 network model, wherein the hierarchical Yolov5 network model removes information interaction functions of 3 prediction branches on the basis of a Yolov5 network model, and the 3 prediction branches of the hierarchical Yolov5 network model directly output detection results;
predicting marking frames with different sizes in a training set through 3 prediction branches of a hierarchical Yolov5 network model respectively to obtain respective prediction outputs of the 3 prediction branches of the hierarchical Yolov5 network model respectively;
calculating loss function values of the 3 prediction branches of the hierarchical Yolov5 network model respectively according to the prediction outputs of the 3 prediction branches of the hierarchical Yolov5 network model;
if the maximum value of the loss function values of 3 prediction branches of the hierarchical Yolov5 network model is smaller than a set loss threshold value and/or the number of times of loop iteration training reaches a set iteration number, determining that the training of the hierarchical Yolov5 network model is completed;
inputting images containing targets into a trained hierarchical Yolov5 network model so that the trained hierarchical Yolov5 network model can perform target detection.
2. The method of claim 1, wherein if the maximum value of the loss function values of the 3 predicted branches of the hierarchical YOLOV5 network model is less than a set loss threshold and/or the number of iterative training cycles reaches a set number of iterations, determining to complete the training of the hierarchical YOLOV5 network model further comprises:
and respectively finishing the training of the 3 predicted branches of the hierarchical Yolov5 network model in one iteration according to the learning times of the 3 predicted branches of the hierarchical Yolov5 network model in one cycle of one iteration.
3. The method of claim 2, wherein the training of the 3 predicted branches of the hierarchical YOLOV5 network model before an iteration is completed according to the learning times of the 3 predicted branches of the hierarchical YOLOV5 network model in one cycle of the iteration, respectively, further comprises:
and respectively determining the learning times of the 3 prediction branches in one period of the one iteration according to the loss function values of the 3 prediction branches of the hierarchical Yolov5 network model.
4. The method of claim 1, wherein the obtaining the prediction output of each of the 3 prediction branches of the hierarchical YOLOV5 network model by predicting the labeling boxes with different sizes in the training set by the 3 prediction branches of the hierarchical YOLOV5 network model respectively comprises:
splitting the marking boxes with different sizes in the training set into 3 types: small labeling box class, medium labeling box class and large labeling box class;
the method comprises the steps of respectively predicting a labeling frame of a small labeling frame class, a labeling frame of a medium labeling frame class and a labeling frame of a large labeling frame class through a small target prediction branch, a medium target prediction branch and a large target prediction branch of a hierarchical Yolov5 network model, respectively obtaining prediction output of the small target prediction branch on the labeling frame of the small labeling frame class and prediction output of the medium target prediction branch on the labeling frame of the medium labeling frame class, and prediction output of the large target prediction branch on the labeling frame of the large labeling frame class.
5. The method of claim 4, wherein the splitting of the annotation boxes of different sizes in the training set into the 3 classes (the small-box class, the medium-box class and the large-box class) comprises:
clustering the annotation boxes of different sizes in the training set with a clustering algorithm to obtain 3 cluster-center boxes;
determining, according to the 3 cluster-center boxes, the boundary lines for splitting the annotation boxes of different sizes in the training set into the 3 classes;
dividing the annotation boxes of different sizes in the training set into the 3 classes (the small-box class, the medium-box class and the large-box class) according to the determined boundary lines.
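The split described in claim 5 can be pictured with a small sketch: cluster the annotation-box sizes into 3 centers, place a boundary line midway between adjacent centers, then bucket every box. The use of box area as the clustering feature and the midpoint rule for the boundary lines are assumptions made for illustration; the claim only requires a clustering algorithm and boundary lines derived from the 3 cluster centers:

```python
def kmeans_1d(values, k=3, iters=100):
    """Plain 1-D k-means with a deterministic spread-out initialisation."""
    vs = sorted(values)
    centers = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest current center
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = sorted(new)
    return centers

def split_boxes(boxes):
    """Split (width, height) annotation boxes into small/medium/large classes."""
    areas = [w * h for w, h in boxes]
    c = kmeans_1d(areas, k=3)
    # boundary lines sit midway between adjacent cluster centers
    b1, b2 = (c[0] + c[1]) / 2, (c[1] + c[2]) / 2
    small = [b for b in boxes if b[0] * b[1] < b1]
    medium = [b for b in boxes if b1 <= b[0] * b[1] < b2]
    large = [b for b in boxes if b[0] * b[1] >= b2]
    return small, medium, large
```

On a toy training set such as `[(5, 5), (6, 4), (50, 40), (60, 50), (300, 200), (280, 220)]` this yields two boxes per class.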
6. A hierarchical target detection apparatus, comprising:
a first input module, configured to input a training set to a hierarchical YOLOv5 network model, wherein the hierarchical YOLOv5 network model removes, on the basis of the YOLOv5 network model, the information-interaction function among its 3 prediction branches, so that the 3 prediction branches of the hierarchical YOLOv5 network model directly output detection results;
a prediction output module, configured to predict the annotation boxes of different sizes in the training set input by the first input module through the 3 prediction branches of the hierarchical YOLOv5 network model respectively, to obtain the respective prediction outputs of the 3 prediction branches;
a loss calculation module, configured to calculate the loss function values of the 3 prediction branches of the hierarchical YOLOv5 network model according to the prediction outputs obtained by the prediction output module;
a training completion module, configured to determine that the training of the hierarchical YOLOv5 network model is completed if the maximum of the loss function values of the 3 prediction branches obtained by the loss calculation module is less than a set loss threshold and/or the number of loop-iteration training rounds reaches a set iteration number;
a second input module, configured to input an image containing a target to the trained hierarchical YOLOv5 network model determined by the training completion module, so that the trained hierarchical YOLOv5 network model performs target detection on the image.
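The training completion module's stopping rule combines two conditions: every branch's loss below the set threshold (equivalently, the maximum of the 3 loss values below it), and/or the loop-iteration budget exhausted. A minimal sketch of that check (the function and parameter names are illustrative assumptions):

```python
def training_complete(branch_losses, iteration, loss_threshold, max_iterations):
    """Return True when the maximum per-branch loss falls below the set
    loss threshold, or when the number of loop iterations reaches the
    set iteration count."""
    return max(branch_losses) < loss_threshold or iteration >= max_iterations
```

Using the maximum loss means the whole model only stops early once its weakest branch has converged, which matches the per-branch training described in claims 2 and 3.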
7. The apparatus of claim 6, further comprising:
a training module, configured to complete the training of the 3 prediction branches of the hierarchical YOLOv5 network model in one iteration according to the respective learning times of the 3 prediction branches within one cycle of the iteration.
8. The apparatus of claim 7, further comprising:
a learning-times calculation module, configured to determine the respective learning times of the 3 prediction branches within one cycle of the iteration according to the loss function values of the 3 prediction branches obtained by the loss calculation module.
9. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-5.
10. A computer-readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-5.
CN202111375392.6A 2021-11-19 2021-11-19 Hierarchical target detection method and device Active CN113807472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111375392.6A CN113807472B (en) 2021-11-19 2021-11-19 Hierarchical target detection method and device

Publications (2)

Publication Number Publication Date
CN113807472A true CN113807472A (en) 2021-12-17
CN113807472B CN113807472B (en) 2022-02-22

Family

ID=78937485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111375392.6A Active CN113807472B (en) 2021-11-19 2021-11-19 Hierarchical target detection method and device

Country Status (1)

Country Link
CN (1) CN113807472B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170795A (en) * 2022-05-13 2022-10-11 深圳大学 Image small target segmentation method, device, terminal and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978035A (en) * 2019-03-18 2019-07-05 西安电子科技大学 Pedestrian detection method based on improved k-means and loss function
US20200293891A1 (en) * 2019-04-24 2020-09-17 Jiangnan University Real-time target detection method deployed on platform with limited computing resources
CN111695463A (en) * 2020-05-29 2020-09-22 深圳数联天下智能科技有限公司 Training method of face impurity detection model and face impurity detection method
CN111860494A (en) * 2020-06-16 2020-10-30 北京航空航天大学 Optimization method and device for image target detection, electronic equipment and storage medium
CN111950329A (en) * 2019-05-16 2020-11-17 长沙智能驾驶研究院有限公司 Target detection and model training method and device, computer equipment and storage medium
CN113076804A (en) * 2021-03-09 2021-07-06 武汉理工大学 Target detection method, device and system based on YOLOv4 improved algorithm
CN113095418A (en) * 2021-04-19 2021-07-09 航天新气象科技有限公司 Target detection method and system
CN113239982A (en) * 2021-04-23 2021-08-10 北京旷视科技有限公司 Training method of detection model, target detection method, device and electronic system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YIFAN ZHANG ET AL: "A Comprehensive Review of One-stage Networks for Object Detection", 2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC) *
ZHOU PAN ET AL: "Road Target Detection Based on Improved YOLOv3", Journal of Longyan University *
XU RONG ET AL: "An Improved YOLOv3 Target Detection Method", Computer Technology and Development *
LI FANG: "Research on High-Resolution Image Target Detection Based on Deep Convolutional Neural Networks", China Masters' Theses Full-text Database (Electronic Journal) *

Also Published As

Publication number Publication date
CN113807472B (en) 2022-02-22

Similar Documents

Publication Publication Date Title
US20200193269A1 (en) Recognizer, object recognition method, learning apparatus, and learning method for domain adaptation
KR102315311B1 (en) Deep learning based object detection model training method and an object detection apparatus to execute the object detection model
CN108875072B (en) Text classification method, device, equipment and storage medium
US20200082250A1 (en) Method and device for optimizing simulation data, and computer-readable storage medium
CN111274981B (en) Target detection network construction method and device and target detection method
JP7047498B2 (en) Learning programs, learning methods and learning devices
CN112329505A (en) Method and apparatus for detecting an object
CN113807472B (en) Hierarchical target detection method and device
CN114792331A (en) Machine learning framework applied in semi-supervised environment to perform instance tracking in image frame sequences
CN109829469A (en) A kind of vehicle checking method based on deep learning
CN116670687A (en) Method and system for adapting trained object detection models to domain offsets
CN115457415A (en) Target detection method and device based on YOLO-X model, electronic equipment and storage medium
CN115828349A (en) Geometric model processing method and device, electronic equipment and storage medium
CN115439718A (en) Industrial detection method, system and storage medium combining supervised learning and feature matching technology
KR20200131185A (en) Method for verifying and learning of classification result using verification neural network, and computing device for performing the method
Yang et al. A multi-scale feature fusion spatial–channel attention model for background subtraction
CN115512428B (en) Face living body judging method, system, device and storage medium
CN116580235A (en) Target detection device, method, equipment and medium based on YOLOv4 network optimization
JP2023126130A (en) Computer-implemented method, data processing apparatus and computer program for object detection
KR102485359B1 (en) Method for enhancing in situ adaptive artfitial intelligent model
KR20130013462A (en) Foreground extraction apparatus and method using ccb and mt lbp
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
CN116777814A (en) Image processing method, apparatus, computer device, storage medium, and program product
CN114581663A (en) Gate multi-target ticket evasion detection method and device, computer equipment and storage medium
CN110222657B (en) Single-step face detector optimization system, method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant