CN116935168B - Method, device, computer equipment and storage medium for target detection

Info

Publication number: CN116935168B
Application number: CN202311176243.6A
Authority: CN
Inventors: 余燕清; 张如高; 李发成; 虞正华
Original assignee: Suzhou Moshi Intelligent Technology Co ltd
Current assignee: Suzhou Moshi Intelligent Technology Co ltd
Priority date: 2023-09-13
Filing date: 2023-09-13
Publication date: 2024-01-30
Anticipated expiration: 2043-09-13
Also published as: CN116935168A

Abstract

The invention relates to the technical field of target detection, and discloses a target detection method, a target detection device, computer equipment and a storage medium. The target detection method comprises: acquiring a classification data set, a semantic segmentation data set, a partial annotation data set and a label broad data set for training; training a target detection model according to the plurality of data sets, wherein the semantic segmentation data set is used for training the classification branch and/or the centrality branch of the target detection model, and the labeled data in the partial annotation data set and the data in the label broad data set whose labels are consistent with the required labels are used for training the classification branch, the frame regression branch and the centrality branch of the target detection model; and obtaining a trained target detection model. The invention makes full use of the labels of multiple existing data sets, can train on the data in all of them, achieves a good training effect, and does not require the data sets to be labeled manually.

Description

Method, device, computer equipment and storage medium for target detection

Technical Field

The present invention relates to the field of target detection technologies, and in particular, to a target detection method, apparatus, computer device, and storage medium.

Background

With the development of deep learning and the open-sourcing of algorithms, the acquisition and labeling of data have become the real technical barrier for projects and even companies. Self-supervised, semi-supervised, weakly supervised and unsupervised learning, transfer learning, distillation and similar algorithms are necessary measures for many small projects. Taking an autonomous driving scenario as an example, image-level public data sets are readily available, and target detection can be achieved based on them.

However, the labeling quality of existing public data sets is uneven and the definitions of the labeled categories vary widely, so these data sets rarely match the requirements of a given target detection project. Only a small part of the mass of available data can be used for target detection, so the detection effect is poor; labeling the data sets oneself, on the other hand, incurs a significant cost.

Disclosure of Invention

In view of the above, the present invention provides a method, apparatus, computer device and storage medium for object detection, so as to solve the problem that the existing data set is not suitable for object detection.

In a first aspect, the present invention provides a method of target detection, the method comprising: acquiring a plurality of data sets for training, the plurality of data sets comprising a classification data set, a semantic segmentation data set, a partial annotation data set, and a label broad data set; training a target detection model according to the plurality of data sets, wherein the target detection model is a fully convolutional one-stage target detection model, the classification data set is used for training the classification branch and/or the centrality branch of the target detection model, the semantic segmentation data set is used for training the classification branch of the target detection model, the labeled data in the partial annotation data set is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model, and the data in the label broad data set whose labels are consistent with the required labels is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model; obtaining a trained target detection model; and performing target detection on an image to be identified according to the target detection model, so as to identify an object in the image to be identified;

wherein the method further comprises: presetting an initial target detection model;

the training of the target detection model according to the plurality of data sets comprises: taking the initial target detection model as a teacher model; dividing the data in the data sets into labeled regions and unlabeled regions according to the annotation status of the data sets; taking anchor points in the labeled regions as positive sample anchor points; performing target detection on the unlabeled regions according to the teacher model, and determining the centrality of the anchor points in the unlabeled regions; taking an anchor point whose centrality is greater than a first preset threshold as a pseudo sample anchor point with a pseudo label, and taking an anchor point whose centrality is less than a second preset threshold as a negative sample anchor point, the first preset threshold being greater than or equal to the second preset threshold; and performing semi-supervised learning according to the positive sample anchor points, the pseudo sample anchor points and the negative sample anchor points to obtain a corresponding student model, the student model being a fully convolutional one-stage target detection model;

the dividing of the data in the data sets into labeled regions and unlabeled regions according to the annotation status of the data sets comprises: setting a hyperparameter representing a size, and determining the center coordinates of the classification data in the classification data set; and taking the region that is centered on the center coordinates and lies within the range of the hyperparameter as the labeled region of the classification data.

According to the target detection method provided by this embodiment, an FCOS model is adopted as the target detection model to be trained, and one or more branches of the target detection model are trained with each of several data sets, namely a classification data set, a semantic segmentation data set, a partial annotation data set and a label broad data set, so that the required target detection model can be obtained even though no single one of these data sets can train it. Because their annotations are incomplete, none of these data sets is suitable for training the target detection model on its own; the method nevertheless uses them together and, based on the FCOS model, fuses data sets with different kinds of annotation to train a target detection model capable of completing the target detection task. The method makes full use of the labels of multiple existing data sets, can train on the mass of data they contain, achieves a good training effect, and does not require the data sets to be labeled manually.

With the semi-supervised learning method, additional pseudo sample anchor points and negative sample anchor points usable for training can be extracted from the data sets, so that public data sets are used more effectively; the method is therefore suitable for target detection projects that need a large amount of data but are short on time. Taking the center coordinates of the classification data as the center allows a labeled region suitable for the classification data to be determined simply and accurately.
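The threshold rule of the first aspect can be illustrated with a minimal Python sketch. It assumes the teacher model has already produced one centrality value per anchor point of the unlabeled region; the function name `partition_unlabeled_anchors`, the example thresholds, and the choice to leave anchor points that fall between the two thresholds out of training ("ignored") are assumptions of this sketch and are not specified in the text.

```python
from typing import Dict, List


def partition_unlabeled_anchors(
    centerness: List[float],
    first_threshold: float,
    second_threshold: float,
) -> Dict[str, List[int]]:
    """Split anchor indices of an unlabeled region by teacher-predicted centrality."""
    if first_threshold < second_threshold:
        raise ValueError("first_threshold must be >= second_threshold")
    groups: Dict[str, List[int]] = {"pseudo": [], "negative": [], "ignored": []}
    for index, score in enumerate(centerness):
        if score > first_threshold:
            groups["pseudo"].append(index)      # pseudo sample anchor point
        elif score < second_threshold:
            groups["negative"].append(index)    # negative sample anchor point
        else:
            groups["ignored"].append(index)     # assumption: not used for training
    return groups


# Hypothetical teacher outputs for five anchor points.
teacher_scores = [0.92, 0.05, 0.40, 0.71, 0.10]
groups = partition_unlabeled_anchors(
    teacher_scores, first_threshold=0.7, second_threshold=0.2
)
print(groups)
```

When the two thresholds are equal, the "ignored" group can only contain anchor points whose centrality equals the threshold exactly.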

In some optional embodiments, the dividing of the data in the data sets into labeled regions and unlabeled regions according to the annotation status of the data sets comprises: performing target detection on the label broad data in the label broad data set according to the teacher model, and determining the classification scores of the frames in the label broad data; and, in the case that the classification score is greater than a third preset threshold, taking the region corresponding to the frame as a first labeled region of the label broad data; wherein the positive sample anchor points of the first labeled region in the label broad data set are used for training the classification branch, the frame regression branch and the centrality branch of the student model.

By using the classification score of the teacher model on the label broad data, positive sample anchor points suitable for training three branches of classification branches, frame regression branches and centrality branches can be extracted from the label broad data, and the training can be better performed.

In some optional embodiments, the dividing of the data in the data sets into labeled regions and unlabeled regions according to the annotation status of the data sets further comprises: in the case that the classification score is less than a fourth preset threshold, taking the region corresponding to the frame as a second labeled region of the label broad data; wherein the positive sample anchor points of the second labeled region in the label broad data set are used for training the frame regression branch and the centrality branch of the student model, and the fourth preset threshold is less than or equal to the third preset threshold.
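The two optional embodiments for label broad data amount to routing each frame to a set of branches according to the teacher model's classification score. The sketch below is one hedged reading of that rule; the function name `branches_for_broad_box`, the branch identifiers, and the decision to return no branches for scores between the fourth and third thresholds are assumptions, since the text does not state how such frames are handled.

```python
from typing import Tuple

ALL_BRANCHES: Tuple[str, ...] = ("cls", "reg", "ctr")  # first labeled region
BOX_BRANCHES: Tuple[str, ...] = ("reg", "ctr")         # second labeled region


def branches_for_broad_box(
    score: float, third_threshold: float, fourth_threshold: float
) -> Tuple[str, ...]:
    """Return the student-model branches that a label broad frame may train."""
    if fourth_threshold > third_threshold:
        raise ValueError("fourth_threshold must be <= third_threshold")
    if score > third_threshold:
        return ALL_BRANCHES
    if score < fourth_threshold:
        return BOX_BRANCHES
    return ()  # assumption: frames with intermediate scores are left unused


# Hypothetical classification scores with thresholds 0.6 and 0.3.
for score in (0.9, 0.5, 0.1):
    print(score, branches_for_broad_box(score, 0.6, 0.3))
```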

In some optional embodiments, the training of the target detection model according to the plurality of data sets further comprises: performing weakly supervised segmentation on the classification data in the classification data set, and determining the frames in the classification data, wherein the positive sample anchor points in the frames of the classification data are used for training the classification branch, the frame regression branch and the centrality branch of the student model; and performing weakly supervised segmentation on the semantic segmentation data in the semantic segmentation data set, and determining the frames in the semantic segmentation data, wherein the positive sample anchor points in the frames of the semantic segmentation data are used for training the classification branch, the frame regression branch and the centrality branch of the student model.

The frames in the classification data and the semantic segmentation data are determined by weakly supervised segmentation, and based on these frames all three branches of the student model (the classification branch, the frame regression branch and the centrality branch) can be trained, so that the public data sets are fully utilized and the generalization capability of the target detection model is improved.
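The text does not specify which weakly supervised segmentation method is used. As an illustration of the final step only, the sketch below assumes that the segmentation has already produced a binary foreground mask for one object and derives the enclosing frame from it; the function name `mask_to_box` and the (x_min, y_min, x_max, y_max) pixel-index convention are assumptions of this sketch.

```python
from typing import List, Optional, Tuple


def mask_to_box(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return the tightest (x_min, y_min, x_max, y_max) frame around nonzero pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None  # empty mask: no object, hence no frame
    return (min(cols), min(rows), max(cols), max(rows))


# Hypothetical 4x5 foreground mask produced by a segmentation step.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_box(mask))
```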

In some optional embodiments, the pseudo sample anchor points and the negative sample anchor points in the partial annotation data are used for training the classification branch of the student model.

In a second aspect, the present invention provides an apparatus for target detection, the apparatus comprising: an acquisition module for acquiring a plurality of data sets for training, the plurality of data sets comprising a classification data set, a semantic segmentation data set, a partial annotation data set, and a label broad data set; a training module for training a target detection model according to the plurality of data sets to obtain a trained target detection model, wherein the target detection model is a fully convolutional one-stage target detection model, the classification data set is used for training the classification branch and/or the centrality branch of the target detection model, the semantic segmentation data set is used for training the classification branch of the target detection model, the labeled data in the partial annotation data set is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model, and the data in the label broad data set whose labels are consistent with the required labels is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model, and for performing target detection on an image to be identified according to the target detection model, so as to identify an object in the image to be identified; and a preset module for presetting an initial target detection model;

wherein the training of the target detection model by the training module according to the plurality of data sets comprises: taking the initial target detection model as a teacher model; dividing the data in the data sets into labeled regions and unlabeled regions according to the annotation status of the data sets; taking anchor points in the labeled regions as positive sample anchor points; performing target detection on the unlabeled regions according to the teacher model, and determining the centrality of the anchor points in the unlabeled regions; taking an anchor point whose centrality is greater than a first preset threshold as a pseudo sample anchor point with a pseudo label, and taking an anchor point whose centrality is less than a second preset threshold as a negative sample anchor point, the first preset threshold being greater than or equal to the second preset threshold; and performing semi-supervised learning according to the positive sample anchor points, the pseudo sample anchor points and the negative sample anchor points to obtain a corresponding student model, the student model being a fully convolutional one-stage target detection model;

the dividing, by the training module, of the data in the data sets into labeled regions and unlabeled regions according to the annotation status of the data sets comprises: setting a hyperparameter representing a size, and determining the center coordinates of the classification data in the classification data set; and taking the region that is centered on the center coordinates and lies within the range of the hyperparameter as the labeled region of the classification data.

In a third aspect, the present invention provides a computer device comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions so as to perform the method of target detection of the first aspect or any of its corresponding embodiments.

In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of object detection of the first aspect or any of its corresponding embodiments.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow diagram of a method of training a target detection model according to an embodiment of the invention;

FIG. 2 is a flow chart of another method of training a target detection model according to an embodiment of the invention;

FIG. 3 is a schematic diagram of partial annotation data provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of target detection performed on partial annotation data according to an embodiment of the present invention;

FIG. 5 is a block diagram of an apparatus for training a target detection model according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Target detection requires the data in a public data set to be completely annotated data, that is, data that has key fields such as labels and frames (bounding boxes); however, some existing public data sets are incompletely annotated. For example, classification data is annotated only with labels and has no frames.

At present, semi-supervised methods are generally adopted, which attempt to perform target detection with a small amount of completely annotated data and a large amount of unannotated data, thereby reducing the annotation workload. However, a large gap remains between semi-supervised and fully supervised methods, so the target detection effect is poor.

Based on the above, the embodiment of the invention provides a method for training a target detection model, wherein the target detection model is a full convolution single-Stage target detection model, namely an FCOS (Fully Convolutional One-Stage) model, and the FCOS model is obtained through training, so that the trained FCOS model can be utilized for target detection.

The FCOS model is an anchor-free detection model, that is, it does not need anchor boxes. Instead of classifying and regressing anchor boxes, it classifies and regresses anchor points: for each anchor point in the prediction feature map, it predicts four distance values l, r, t, b from the anchor point to the left, right, upper and lower boundaries of the frame, wherein l represents the distance from the anchor point to the left boundary (left) of the frame, r represents the distance from the anchor point to the right boundary (right) of the frame, t represents the distance from the anchor point to the upper boundary (top) of the frame, and b represents the distance from the anchor point to the lower boundary (bottom) of the frame.
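As a small worked example of the four distance values, the Python sketch below computes l, r, t, b for one anchor point and one frame. It assumes image coordinates with the origin at the top-left corner and frames given as (x_min, y_min, x_max, y_max); the function name `ltrb_targets` and the sample numbers are illustrative only.

```python
from typing import Tuple


def ltrb_targets(
    anchor_x: float, anchor_y: float, box: Tuple[float, float, float, float]
) -> Tuple[float, float, float, float]:
    """Return (l, r, t, b): distances from the anchor point to the frame boundaries."""
    x_min, y_min, x_max, y_max = box
    l = anchor_x - x_min  # distance to the left boundary
    r = x_max - anchor_x  # distance to the right boundary
    t = anchor_y - y_min  # distance to the upper boundary
    b = y_max - anchor_y  # distance to the lower boundary
    return l, r, t, b


# Anchor point (50, 40) inside the frame (20, 10, 100, 70).
print(ltrb_targets(50, 40, (20, 10, 100, 70)))  # (30, 50, 30, 30)
```

An anchor point lies inside the frame exactly when all four values are non-negative.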

The FCOS model contains three branches: a classification branch, a frame regression (bounding box regression) branch, and a centrality (center-ness) branch. The classification branch is used for predicting the category of each anchor point; the frame regression branch is used for predicting the frame size for each anchor point, namely predicting the four distance values l, r, t, b from the anchor point to the four boundaries of the frame; the centrality branch is used for predicting the centrality of each anchor point, where each anchor point corresponds to one centrality value that represents how close the anchor point is to the center of the frame. If the four distance values of the frame corresponding to a certain anchor point are l, r, t and b, respectively, the centrality of the anchor point can be expressed as:

$$\text{centerness} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}$$
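The centrality of an anchor point can be computed directly from the four distance values. The sketch below uses the center-ness definition of the published FCOS model, sqrt(min(l, r) / max(l, r) × min(t, b) / max(t, b)); the function name `centerness` and the handling of degenerate inputs (returning 0 for a zero-size frame, rejecting negative distances) are assumptions of this sketch.

```python
import math


def centerness(l: float, r: float, t: float, b: float) -> float:
    """Center-ness of an anchor point from its distances to the frame boundaries."""
    if min(l, r, t, b) < 0:
        raise ValueError("anchor point lies outside the frame")
    if max(l, r) == 0 or max(t, b) == 0:
        return 0.0  # assumption: a zero-width or zero-height frame has centrality 0
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


print(centerness(30, 30, 20, 20))  # anchor point at the exact center: 1.0
print(centerness(10, 40, 20, 20))  # shifted toward the left boundary: 0.5
```

The value is 1 at the center of the frame and falls toward 0 as the anchor point approaches any boundary, which is why a high teacher-predicted centrality is a reasonable indicator that an anchor point lies on an object.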

Accordingly, each of the three branches of the FCOS model has a corresponding loss function; for example, the loss function of the classification branch is L_cls, the loss function of the frame regression branch is L_reg, and the loss function of the centrality branch is L_ctr. The loss of the FCOS model is composed of the loss functions of these three branches together. The loss functions are not described in detail in this embodiment.

According to the method for training the target detection model, which is provided by the embodiment, corresponding branches of the FCOS model are respectively trained by utilizing a plurality of different types of data sets, so that the target detection model capable of carrying out target detection is obtained.

In accordance with an embodiment of the present invention, there is provided a method embodiment for training a target detection model, it being noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.

In this embodiment, a method for training a target detection model is provided, which may be used in a computer or a server. FIG. 1 is a flow chart of a method of training a target detection model according to an embodiment of the invention, as shown in FIG. 1, the flow comprising the following steps.

Step S102, acquiring a plurality of data sets for training, the plurality of data sets including: a classification data set, a semantic segmentation data set, a partial annotation data set, and a label broad data set.

In this embodiment, a plurality of data sets are used for training. Each data set may be a public data set or a private data set annotated by the user, which is not limited in this embodiment; typically, training is performed with data sets published on the Internet. After the data sets are acquired, they can be categorized manually, or according to the descriptions of the data sets or the characteristics of the data they contain, so as to determine which data sets are classification data sets, which are semantic segmentation data sets, and so on.

In this embodiment, the data sets used include at least four types: a classification data set, a semantic segmentation data set, a partial annotation data set, and a label broad data set; data sets of different types generally differ in their annotated categories and annotation forms.

The classification data set includes a plurality of classification data and is typically used to train a classification model. The classification data may specifically be an image that is annotated with a label as a whole but has no frame.

The semantic segmentation data set comprises a plurality of semantic segmentation data and is typically used to train a semantic segmentation model. The semantic segmentation data may specifically be an image in which each segmented object is annotated with a label and the region occupied by each segmented object is delineated; semantic segmentation data typically does not distinguish between different objects of the same class.

The partial annotation data set comprises a plurality of partial annotation data and is generally used to train a target detection model; that is, the partial annotation data is annotated with labels and frames, but only objects of some categories are annotated. For example, suppose a target detection model suitable for an autonomous driving scene needs to be trained, and the model is required to detect vehicles, human bodies, surrounding buildings and the like in an image; if a certain target detection data set A was built for human body detection, only human bodies are annotated in data set A, and vehicles, buildings and the like are not. Data set A is therefore a partial annotation data set when training the target detection model for the autonomous driving scene. It will be appreciated that whether a data set is a partial annotation data set is relative to the requirements of the target detection model currently being trained.

The label broad data set comprises a plurality of label broad data and is also typically used to train a target detection model; that is, the label broad data is annotated with labels and frames, but, as the name implies, its labels are broader than required. For example, suppose a target detection model capable of identifying different types of vehicles needs to be trained, and the model is required to detect the vehicles in an image and label each with its specific type; if a certain target detection data set B was built for target detection in an autonomous driving scene, it simply annotates which objects are vehicles and does not subdivide them into types. In other words, data set B only broadly labels an object as a vehicle (car), and does not specify whether it is a bus, a taxi or the like. Data set B is therefore a label broad data set when training a target detection model capable of identifying different types of vehicles. It will be appreciated that whether a data set is a label broad data set is likewise relative to the requirements of the target detection model currently being trained.

Step S104, training a target detection model according to the plurality of data sets; wherein the classification data set is used for training the classification branch and/or the centrality branch of the target detection model, the semantic segmentation data set is used for training the classification branch and/or the centrality branch of the target detection model, the labeled data in the partial annotation data set is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model, and the data in the label broad data set whose labels are consistent with the required labels is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model.

Typically, a classification data set, semantic segmentation data set, partial annotation data set or label broad data set is not used on its own for target detection, because its annotations are incomplete and differ from what the target detection model requires. In this embodiment, the FCOS model is trained on a combination of these different types of data sets, so as to achieve target detection.

In this embodiment, when training the target detection model based on the data sets, that is, when training the FCOS model based on the data sets, a part or all of branches of the FCOS model are trained based on respective characteristics of each data set, so that training of the FCOS model is achieved, and the trained FCOS model can meet requirements.

Specifically, the classification data in the classification data set is annotated with a label, so the classification branch of the target detection model can be trained based on the labeled classification data; further, since the object in classification data is generally located in the center of the image, a frame located in the center may be set for the classification data, and the centrality branch of the target detection model may be trained based on this frame. Since this frame does not accurately represent the position of the object, the classification data is not suitable for training the frame regression branch of the target detection model.
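A centered frame for classification data can be sketched as follows. The first aspect describes a size hyperparameter and center coordinates; this sketch assumes that the center coordinates are the image center and that the hyperparameter is a fraction of the image width and height, neither of which is fixed by the text. The function name `center_label_region` and the parameter `size_ratio` are hypothetical.

```python
from typing import Tuple


def center_label_region(
    image_width: int, image_height: int, size_ratio: float
) -> Tuple[float, float, float, float]:
    """Return a centered (x_min, y_min, x_max, y_max) region of a classification image."""
    if not 0 < size_ratio <= 1:
        raise ValueError("size_ratio must be in (0, 1]")
    center_x, center_y = image_width / 2, image_height / 2  # assumed: image center
    half_w = image_width * size_ratio / 2
    half_h = image_height * size_ratio / 2
    return (center_x - half_w, center_y - half_h, center_x + half_w, center_y + half_h)


# A 200x100 image with the region covering half of each dimension.
print(center_label_region(200, 100, 0.5))  # (50.0, 25.0, 150.0, 75.0)
```

Anchor points inside the returned region would be treated as positive sample anchor points, and the rest of the image as the unlabeled region.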

The semantic segmentation data set is similar to the classification data set, and the objects in the semantic segmentation data are marked with labels, so that classification branches of the target detection model can be trained based on the semantic segmentation data marked with the labels. However, since the semantic segmentation data does not distinguish between different objects in the same class, for example, for two human bodies that are close to each other and partially overlap in an image, the two human bodies are labeled as one object in the semantic segmentation data, and the label is a human body (person), the semantic segmentation data is not suitable for training the centrality branch and the frame regression branch of the target detection model.

The partial annotation data set contains labeled data, that is, data annotated with both a label and a frame, which forms the labeled part of the partial annotation data; it will be appreciated that the labeled data is completely annotated data that enables complete training of the target detection model, namely training of the classification branch, the frame regression branch and the centrality branch of the target detection model.

For the label broad data set, in this embodiment it is necessary to determine the data whose labels are consistent with the labels required by the target detection model, and to train the target detection model based on that data. Although the labels of the data are broad, this does not affect the accuracy of the frames; if the broad label of a piece of data is consistent with a required label, data annotated with an accurate label and an accurate frame can be extracted from the label broad data set. Like the labeled data, this data can completely train the target detection model, namely train its classification branch, frame regression branch and centrality branch.
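Selecting the data whose labels are consistent with the required labels can be as simple as a set membership test. The sketch below assumes that each annotation is a dictionary with a "label" and a "box" entry and that consistency means an exact label match; the function name, the annotation layout and the sample labels are all hypothetical.

```python
from typing import Dict, Iterable, List


def select_consistent_annotations(
    annotations: List[Dict], required_labels: Iterable[str]
) -> List[Dict]:
    """Keep annotations whose (broad) label is itself one of the required labels."""
    required = set(required_labels)
    return [ann for ann in annotations if ann["label"] in required]


# Hypothetical label broad annotations; the model needs "bus" and "taxi", not "car".
broad_annotations = [
    {"label": "person", "box": (12, 30, 48, 120)},
    {"label": "car", "box": (60, 40, 200, 110)},
    {"label": "traffic_light", "box": (5, 5, 15, 30)},
]
required = {"person", "bus", "taxi", "traffic_light"}
usable = select_consistent_annotations(broad_annotations, required)
print([ann["label"] for ann in usable])  # ['person', 'traffic_light']
```

The "car" annotation is dropped here because its label is broader than the required "bus" and "taxi" labels, although its frame is still accurate.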

Since each of the three branches of the FCOS model has a corresponding loss function, the loss of the FCOS model is composed of the loss functions of the three branches together; for example, the loss of the FCOS model is L_FCOS = L_cls + L_reg + L_ctr. By setting different loss functions for different data sets, training of the corresponding branches can be achieved. For example, if a classification data set is used to train the classification branch and the centrality branch of the target detection model, the loss function corresponding to the classification data set may be expressed as L = L_cls + L_ctr.
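The per-data-set loss selection can be sketched as a lookup from data set type to the branch losses that are summed. The mapping below follows the branch assignments stated for the first aspect, choosing the classification and centrality branches for the classification data set and the classification branch only for the semantic segmentation data set; the dictionary keys, the function name `combined_loss` and the sample loss values are assumptions of this sketch.

```python
from typing import Dict, Tuple

# Branch losses that contribute for each data set type (one permitted choice).
BRANCHES_BY_DATASET: Dict[str, Tuple[str, ...]] = {
    "classification": ("cls", "ctr"),             # L = L_cls + L_ctr
    "semantic_segmentation": ("cls",),            # L = L_cls
    "partial_annotation": ("cls", "reg", "ctr"),  # L = L_cls + L_reg + L_ctr
    "label_broad": ("cls", "reg", "ctr"),         # L = L_cls + L_reg + L_ctr
}


def combined_loss(dataset_type: str, branch_losses: Dict[str, float]) -> float:
    """Sum only the branch losses that the given data set type is allowed to train."""
    return sum(branch_losses[name] for name in BRANCHES_BY_DATASET[dataset_type])


# Hypothetical per-branch loss values for one batch.
losses = {"cls": 0.5, "reg": 0.25, "ctr": 0.125}
for dataset_type in BRANCHES_BY_DATASET:
    print(dataset_type, combined_loss(dataset_type, losses))
```

In a real training loop the branch losses would be tensors rather than floats, and the omitted branches would simply receive no gradient for that batch.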

Step S106, obtaining a trained target detection model.

In this embodiment, the classification data set, the semantic segmentation data set, the partial annotation data set and the label broad data set are each used to train one or more branches of the target detection model, so that the data of the multiple data sets are combined and the required target detection model is obtained through training. After the target detection model is obtained, target detection can be performed based on it.

According to the method for training the target detection model provided by this embodiment, an FCOS model is adopted as the target detection model to be trained, and one or more branches of the target detection model are trained with each of several data sets, namely a classification data set, a semantic segmentation data set, a partial annotation data set and a label broad data set, so that the target detection model can be obtained through training even though no single one of these data sets can train it. Because their annotations are incomplete, none of these data sets is suitable for training the target detection model on its own; the method nevertheless uses them together and, based on the FCOS model, fuses data sets with different kinds of annotation to train a target detection model capable of completing the target detection task. The method makes full use of the labels of multiple existing data sets, can train on the mass of data they contain, achieves a good training effect, and does not require the data sets to be labeled manually.

In this embodiment, a method for training a target detection model is provided, which may be used in a computer, a server or the like. FIG. 2 is a flowchart of a method for training a target detection model according to an embodiment of the present invention; as shown in FIG. 2, the flow includes the following steps.

Step S202, acquiring a plurality of data sets for training, the plurality of data sets including: a classification data set, a semantic segmentation data set, a partial annotation data set, and a label broad data set.

For details, refer to step S102 in the embodiment shown in fig. 1; they are not repeated here.

Step S203, presetting an initial target detection model.

In this embodiment, an initial model of the target detection model, that is, an initial target detection model, may be determined in advance; it is understood that the initial target detection model is also an FCOS model. The initial target detection model may be an existing FCOS model; alternatively, it may be a model trained on a small amount of fully annotated data; alternatively, it may be a model trained by fusing part of the data in the multiple data sets, for example, a model trained with the labeled data in the partial annotation data set. The manner of obtaining the initial target detection model is not limited in this embodiment.

Step S204, training the target detection model according to the multiple data sets; wherein the classification data set is used to train the classification branch and/or the centrality branch of the target detection model, the semantic segmentation data set is used to train the classification branch and/or the centrality branch of the target detection model, the labeled data in the partial annotation data set is used to train the classification branch, the frame regression branch and the centrality branch of the target detection model, and the data in the tag-broad data set that is consistent with the required labels is used to train the classification branch, the frame regression branch and the centrality branch of the target detection model.

In this embodiment, the step S204 "training the object detection model according to various data sets" may include the following steps S2041 to S2045.

Step S2041, taking the initial target detection model as a teacher model.

In this embodiment, training of the target detection model is achieved by adopting a semi-supervised learning mode. Specifically, after an initial target detection model is determined, the initial target detection model is used as a teacher (teacher) model in semi-supervised learning.

Step S2042, dividing the data in the data set into marked areas and unmarked areas according to the marking condition of the data set; and taking the anchor points in the marked area as positive sample anchor points.

In this embodiment, each kind of data set has its own annotation status. Based on that status, it can be determined which regions are annotated with labels and frames, and those regions are taken as labeled regions; accordingly, the regions other than the labeled regions are referred to as unlabeled regions. Since the anchor points in a labeled region are fully annotated, they can be used as positive samples during training, that is, as positive sample anchor points.

For example, in the partially labeled data of the partial annotation data set, only some of the objects are annotated with a label and a frame, so the region corresponding to the frame of such an object can be used as the labeled region. Referring to fig. 3, the partially labeled data is a partially labeled image 301 containing a building and a human body; the building is annotated with a label and a frame 302, while the human body is not annotated. Therefore, the region within the frame 302 is a labeled region and the region outside the frame 302 is an unlabeled region.
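The division into labeled and unlabeled regions can be illustrated with a minimal sketch. It assumes a hypothetical frame format `(x_min, y_min, x_max, y_max)` with exclusive maximum edges and treats every pixel as a candidate anchor location; neither the helper name `labeled_region_mask` nor this representation comes from the embodiment.

```python
from typing import List, Tuple

# Hypothetical frame format: (x_min, y_min, x_max, y_max), max edges exclusive.
Box = Tuple[int, int, int, int]


def labeled_region_mask(height: int, width: int, boxes: List[Box]) -> List[List[bool]]:
    """Mark every location that falls inside at least one annotated frame."""
    mask = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip the frame to the image before filling it in.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = True
    return mask


# A 4x6 image with a single annotated frame (in the role of frame 302);
# every location left False belongs to the unlabeled region.
mask = labeled_region_mask(4, 6, [(1, 1, 3, 3)])
labeled_count = sum(v for row in mask for v in row)
```

Locations where the mask is True would supply positive sample anchor points; the remaining locations are passed on to the teacher model in the later steps.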

For the semantic segmentation data set, based on the position of the object subjected to semantic segmentation, the frame corresponding to the object can be determined, and then the region in the frame is used as the labeling region of the semantic segmentation data.
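As an illustration of deriving a frame from a segmented object, the sketch below computes the tightest axis-aligned frame around the True entries of a per-object binary mask. The mask representation and the helper name `mask_to_frame` are assumptions made for the example, not details given in this embodiment.

```python
from typing import List, Optional, Tuple


def mask_to_frame(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max), max edges exclusive, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None  # no pixel of the object is present
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


# Hypothetical 3x4 segmentation mask of a single object.
object_mask = [
    [False, False, False, False],
    [False, True,  True,  False],
    [False, False, True,  False],
]
frame = mask_to_frame(object_mask)
```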

In some alternative embodiments, the step S2042 "dividing the data in the data set into the marked area and the unmarked area according to the marking condition of the data set" may include the following steps A1 to A2.

Step A1, setting a hyperparameter representing a size, and determining the center coordinates of the classification data in the classification data set.

Step A2, taking the region within the range of the hyperparameter, centered on the center coordinates, as the labeled region of the classification data.

In this embodiment, the classification data is not annotated with a frame, but the object in the classification data is generally located in the middle of the image. A hyperparameter representing a size, for example a side length or a radius, may therefore be set for the classification data set. Based on the center coordinates of the classification data and the hyperparameter, the region within the range of the hyperparameter and centered on the center coordinates can be determined and used as the labeled region of the classification data. For example, if the hyperparameter represents a radius, the circular region centered on the center coordinates with the hyperparameter as its radius may be used as the labeled region of the classification data. By centering on the center coordinates of the classification data, this embodiment can simply and accurately determine a labeled region suitable for the classification data.
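Steps A1 and A2 can be sketched as follows. The sketch assumes that the center coordinates are the geometric center of the image and that the hyperparameter is either a radius (circular region) or a side length (square region); the function name `center_labeled_region` and the `shape` argument are hypothetical.

```python
from typing import List


def center_labeled_region(height: int, width: int, size: float,
                          shape: str = "circle") -> List[List[bool]]:
    """Labeled region of a classification image, centered on the image center."""
    # Step A1: center coordinates (assumed here to be the image center).
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Step A2: keep locations within the range given by the hyperparameter.
            if shape == "circle":    # size is interpreted as a radius
                inside = dx * dx + dy * dy <= size * size
            elif shape == "square":  # size is interpreted as a side length
                inside = abs(dx) <= size / 2 and abs(dy) <= size / 2
            else:
                raise ValueError("shape must be 'circle' or 'square'")
            row.append(inside)
        mask.append(row)
    return mask


circle = center_labeled_region(5, 5, size=1, shape="circle")
square = center_labeled_region(5, 5, size=2, shape="square")
```

In practice the hyperparameter would be tuned per data set, since a region that is too large labels background as positive and one that is too small discards usable anchor points.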

Step S2043, performing target detection on the unlabeled region according to the teacher model, and determining the centrality of the anchor points in the unlabeled region.

In this embodiment, after the unlabeled region of the data is determined, target detection is performed on the unlabeled region using the teacher model, and the centrality of each anchor point in the unlabeled region is determined from the output of the teacher model. The teacher model may perform target detection on the unlabeled region only; alternatively, the complete data (including the labeled region and the unlabeled region) may be input to the teacher model, and the centrality of the anchor points in the unlabeled region is then determined from the output of the teacher model.

In practice, some classification data in the classification data set may contain several objects while its label represents only one of them, leaving the remaining objects unlabeled; the semantic segmentation data set and the tag-broad data set may have similar problems. In other words, the classification data set, the semantic segmentation data set and the tag-broad data set may also be incompletely labeled, that is, they may contain partially labeled data, which is the same problem that the partial annotation data set has.

Therefore, if the classification data set, the semantic segmentation data set, the tag-broad data set or the like is partially labeled, unlabeled objects may be present in the unlabeled region. By performing target detection on the unlabeled region, these unlabeled objects can be preliminarily identified, and semi-supervised learning can be performed based on them.

It can be appreciated that if the classification data set or the like is completely labeled, the problem of partial labeling does not exist, and target detection on the unlabeled region with the teacher model generally detects no unlabeled object. Therefore, target detection can be performed on the unlabeled region of the classification data set or the like with the teacher model regardless of whether the data set is partially labeled.

Step S2044, using an anchor point with the centrality larger than a first preset threshold value as a pseudo sample anchor point with a pseudo tag, and using an anchor point with the centrality smaller than a second preset threshold value as a negative sample anchor point; the first preset threshold is greater than or equal to the second preset threshold.

In this embodiment, the anchor points that can be given a pseudo tag are selected according to the magnitude of their centrality. Specifically, if the centrality of an anchor point is greater than the first preset threshold, the anchor point is considered to represent the position of an object in the image well, so a pseudo tag can be set for it and it is used as a pseudo sample anchor point. Similarly, if the centrality of an anchor point is smaller than the second preset threshold, the anchor point is considered unsuitable for representing the position of an object in the image, or no object is present at the anchor point, so it is set as a negative sample, that is, a negative sample anchor point. The first preset threshold is greater than or equal to the second preset threshold; typically, the first preset threshold is greater than the second preset threshold. For example, the first preset threshold is 0.6 and the second preset threshold is 0.2.
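The thresholding in step S2044 can be sketched as below, using the example thresholds 0.6 and 0.2. The embodiment does not say how anchor points whose centrality lies between the two thresholds are treated; marking them as "ignored" is an assumption of this sketch, as is the function name `assign_unlabeled_anchors`.

```python
from typing import List


def assign_unlabeled_anchors(centrality: List[float],
                             first_threshold: float = 0.6,
                             second_threshold: float = 0.2) -> List[str]:
    """Assign each anchor in the unlabeled region a role based on teacher centrality."""
    if first_threshold < second_threshold:
        raise ValueError("first threshold must be >= second threshold")
    roles = []
    for c in centrality:
        if c > first_threshold:
            roles.append("pseudo")    # receives a pseudo tag from the teacher
        elif c < second_threshold:
            roles.append("negative")  # treated as background
        else:
            roles.append("ignored")   # assumption: excluded from the loss
    return roles


roles = assign_unlabeled_anchors([0.9, 0.4, 0.1, 0.6])
```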

For example, referring to fig. 3, if the partially labeled data is the image 301, its unlabeled region is the region outside the frame 302. Target detection is performed on the image 301 based on the teacher model, which recognizes that a human body is present; referring to fig. 4, the frame predicted by the teacher model is frame 303. Within the frame 303, the closer an anchor point is to the center of the frame 303, the greater its centrality and the more likely it is to be a pseudo sample anchor point. An anchor point within the frame 303 whose centrality is smaller than the second preset threshold is used as a negative sample anchor point; an anchor point located outside the frame 303, which belongs to the background of the image 301, may also be used as a negative sample anchor point.
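To make the relation between position and centrality concrete, the sketch below evaluates the center-ness target as defined in the original FCOS formulation, sqrt(min(l, r)/max(l, r) * min(t, b)/max(t, b)), where l, r, t and b are the distances from the anchor point to the four sides of the frame. In this embodiment the centrality is taken from the output of the teacher model, so the closed-form value here only illustrates why anchor points near the center of frame 303 score higher; the function name and frame format are assumptions.

```python
import math
from typing import Tuple


def centerness(x: float, y: float, frame: Tuple[float, float, float, float]) -> float:
    """FCOS-style center-ness of point (x, y) with respect to a frame."""
    x0, y0, x1, y1 = frame
    left, right = x - x0, x1 - x
    top, bottom = y - y0, y1 - y
    # Points on or outside the frame boundary get zero centrality.
    if min(left, right, top, bottom) <= 0:
        return 0.0
    return math.sqrt((min(left, right) / max(left, right))
                     * (min(top, bottom) / max(top, bottom)))


frame_303 = (0.0, 0.0, 10.0, 10.0)        # hypothetical coordinates
at_center = centerness(5.0, 5.0, frame_303)
off_center = centerness(2.5, 5.0, frame_303)
outside = centerness(12.0, 5.0, frame_303)
```

With the example thresholds, the point at the center would qualify as a pseudo sample anchor point, while a point close enough to the boundary would fall below the second preset threshold.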

Traditional semi-supervised methods are limited because they distinguish data at the image level: the objects in an image are treated as either all labeled or all unlabeled. In this embodiment, the data in a data set can be divided into three kinds of regions, namely the labeled region, the region where the pseudo sample anchor points are located, and the background region where the negative sample anchor points are located, and the target detection model is trained on the different anchor points in these three regions. This way of labeling expresses labeling differences at the feature point level: one image can be divided into parts that are labeled, parts that are unlabeled and parts that are imperfectly labeled, so the labeling is finer and more accurate and semi-supervised learning at the feature point level can be realized.

Step S2045, performing semi-supervised learning according to the positive sample anchor points, the pseudo sample anchor points and the negative sample anchor points to obtain a corresponding student model, wherein the student model is a full-convolution single-stage target detection model.

In this embodiment, based on the idea of semi-supervised learning, a student model is set in addition to a teacher model; in the initial stage, the student model may be the same as the teacher model, or may be a target detection model that is set separately, which is not limited in this embodiment; it will be appreciated that the student model and the teacher model are both FCOS models.

For each kind of data set, once the positive sample anchor points, pseudo sample anchor points and negative sample anchor points are determined, semi-supervised learning can be performed to train the student model. The teacher model is then updated based on the student model; for example, the parameters of the student model are periodically transferred to the teacher model by exponential moving average (EMA). The updated teacher model is then used to determine the pseudo sample anchor points and the like again, and the student model is trained again. This process is repeated several times until the trained target detection model is obtained. For example, the finally trained student model or teacher model may be used as the required target detection model. Training a student model and a teacher model by semi-supervised learning is a mature technique in the art and is not described in detail in this embodiment.
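The EMA update of the teacher model can be sketched in a framework-independent way. Parameters are represented here as plain lists of floats keyed by name, and the decay value and parameter name are illustrative assumptions; the embodiment does not specify them.

```python
from typing import Dict, List


def ema_update(teacher: Dict[str, List[float]],
               student: Dict[str, List[float]],
               decay: float = 0.999) -> None:
    """In place: teacher = decay * teacher + (1 - decay) * student."""
    for name, student_values in student.items():
        teacher[name] = [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student_values)
        ]


# Hypothetical parameter tensors flattened to lists.
teacher = {"cls_head.weight": [1.0, 2.0]}
student = {"cls_head.weight": [0.0, 4.0]}
ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which tends to keep the pseudo tags stable between rounds of student training.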

With the method for training the target detection model provided by this embodiment, semi-supervised learning allows more pseudo sample anchor points and negative sample anchor points usable for training to be extracted from the data sets, so that public data sets are used more effectively. The method is suitable for target detection projects that require a large amount of data but are on a tight schedule.

When determining the labeled region of the tag-broad data set, a process similar to that used for the partial annotation data set may be adopted; for example, the regions within all frames in the tag-broad data are taken as labeled regions. In some alternative embodiments, to better identify the data in the tag-broad data set that is consistent with the required labels, the frames in the tag-broad data may be divided using the teacher model. Specifically, the above step S2042 "dividing the data in the data set into the marked area and the unmarked area according to the marking condition of the data set" may include the following steps B1 to B2.

Step B1, performing target detection on the tag-broad data in the tag-broad data set according to the teacher model, and determining the classification scores of the frames in the tag-broad data.

Step B2, taking the area corresponding to the frame as a first labeling area of the tag broad data under the condition that the classification score is larger than a third preset threshold value; the positive sample anchor points of the first labeling region in the label-broad dataset are used to train classification branches, border regression branches, and centrality branches of the student model.

In this embodiment, the FCOS model has three branches: a classification branch, a frame regression branch and a centrality branch. When the teacher model, which is an FCOS model, performs target detection on the tag-broad data, the classification score of the teacher model for each frame in the tag-broad data is obtained. The higher the classification score, the more accurately the teacher model detects the object in the frame and the less the result is affected by the broad label of the frame itself. In this case, the region corresponding to the frame may be used as a labeled region of the tag-broad data, referred to in this embodiment as a first labeling region. The positive sample anchor points of the first labeling region are annotated with a proper frame, so they can be used to train the frame regression branch and the centrality branch; in addition, because the teacher model can accurately identify the class of the anchor points in the first labeling region, these positive sample anchor points can also be used to train the classification branch of the student model.

According to the embodiment, the classification score of the teacher model on the label broad data is utilized, positive sample anchor points suitable for training three branches of classification branches, frame regression branches and centrality branches can be extracted from the label broad data, and training can be better carried out.

In some alternative embodiments, the step S2042 "dividing the data in the data set into the marked area and the unmarked area according to the marking condition of the data set" may include the following step B3.

Step B3, taking the area corresponding to the frame as a second labeling area of the tag broad data under the condition that the classification score is smaller than a fourth preset threshold value; the positive sample anchor points of the second labeling areas in the label broad data set are used for training frame regression branches and centrality branches of the student model; the fourth preset threshold is less than or equal to the third preset threshold.

In this embodiment, after the classification score of a frame in the tag-broad data is determined according to the teacher model, a classification score smaller than the fourth preset threshold indicates that the teacher model cannot accurately determine the class of the object in the frame, so the anchor points in the frame are not suitable for training the classification branch. However, since the frame itself is still accurate, these anchor points can still be used to train the frame regression branch and the centrality branch. Thus, the positive sample anchor points within the second labeling region in the tag-broad data set are used to train the frame regression branch and the centrality branch of the student model, but not the classification branch.

The fourth preset threshold is less than or equal to the third preset threshold. In general, the fourth preset threshold may be equal to the third preset threshold, that is, the regions of the frames that are not taken as first labeling regions are taken as second labeling regions. For example, the third preset threshold and the fourth preset threshold are both 0.8.

For example, suppose a vehicle in the tag-broad data is annotated with the broad label vehicle (car). After target detection by the teacher model, the output indicates that the more precise classification label of the vehicle is bus, with a classification score above 0.8; the label of the vehicle in the tag-broad data can then be changed to bus, and the region within the corresponding frame is a first labeling region. Conversely, if the classification score given by the teacher model is less than 0.8, the region within the corresponding frame is taken as a second labeling region.
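Steps B1 to B3 can be sketched as a routing function over the teacher's detections. The dictionary keys (`frame`, `broad_label`, `teacher_label`, `score`) and the function name are hypothetical. The embodiment does not state what happens to a frame whose score lies between the fourth and third preset thresholds or equals a threshold exactly, so this sketch simply leaves such frames out.

```python
from typing import Dict, List


def route_tag_broad_frames(detections: List[Dict],
                           third_threshold: float = 0.8,
                           fourth_threshold: float = 0.8) -> List[Dict]:
    """Split tag-broad frames into first and second labeling regions."""
    routed = []
    for det in detections:
        if det["score"] > third_threshold:
            # First labeling region: trust the teacher's finer class.
            routed.append({
                "frame": det["frame"],
                "label": det["teacher_label"],
                "region": "first",
                "branches": ("classification", "regression", "centrality"),
            })
        elif det["score"] < fourth_threshold:
            # Second labeling region: the frame is usable, the class is not.
            routed.append({
                "frame": det["frame"],
                "label": det["broad_label"],
                "region": "second",
                "branches": ("regression", "centrality"),
            })
        # Otherwise: unspecified in the embodiment; skipped in this sketch.
    return routed


detections = [
    {"frame": (10, 20, 110, 90), "broad_label": "car", "teacher_label": "bus", "score": 0.92},
    {"frame": (200, 40, 260, 80), "broad_label": "car", "teacher_label": "truck", "score": 0.35},
]
routed = route_tag_broad_frames(detections)
```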

It will be appreciated that, for the other data sets, the positive sample anchor points, negative sample anchor points and the like determined from them are likewise used to train the corresponding branches of the student model. For example, positive sample anchor points extracted from the classification data set are used to train the classification branch and/or the centrality branch of the student model, and positive sample anchor points extracted from the semantic segmentation data set are used to train the classification branch of the student model.

In some alternative embodiments, the step S204 "training the object detection model according to the multiple data sets" may include the following steps C1 to C2 in addition to the steps S2041 to S2045.

Step C1, performing weak supervision segmentation processing on classified data in the classified data set, and determining frames in the classified data; positive sample anchors within the border of classification data are used to train classification branches, border regression branches, and centrality branches of the student model.

Step C2, carrying out weak supervision segmentation processing on the semantic segmentation data in the semantic segmentation data set, and determining frames in the semantic segmentation data; positive sample anchors within the border of the semantic segmentation data are used to train classification branches, border regression branches, and centrality branches of the student model.

In this embodiment, for the data in the classified data set or the semantic segmentation data set, a weak supervision segmentation processing mode may be adopted to determine the frames in the data, so that relatively accurate frames may be labeled for the classified data and the semantic segmentation data; the anchor points in the frame can train the frame regression branches and the centrality branches besides training the classification branches, namely, the positive sample anchor points in the frame can train the classification branches, the frame regression branches and the centrality branches of the student model.

In this embodiment, the classification dataset is used to train classification branches and/or centrality branches; in the case of a bounding box determined based on a weak supervised segmentation process, positive sample anchors within the bounding box may also be used to train the bounding box regression branches.

The semantic segmentation data set is used for training classification branches; similar to the classification dataset, in the case where the weak supervision segmentation process determines a border, positive sample anchors within the border may also be used to train the border regression branches, as well as the centrality branches.

The labeled data in the partial annotation data set is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model. The regions corresponding to the labeled data are labeled regions, and the anchor points in them serve as positive sample anchor points; the pseudo sample anchor points and negative sample anchor points in the unlabeled data can be determined by the teacher model and are mainly used for training the classification branch of the student model.

Data in the tag-broad data set that is consistent with the required labels (for example, anchor points in frames whose classification score is greater than the third preset threshold) serves as positive sample anchor points for training the classification branch, the frame regression branch and the centrality branch; anchor points in frames whose classification score is smaller than the fourth preset threshold also serve as positive sample anchor points, but are used only for training the frame regression branch and the centrality branch. In addition, target detection based on the teacher model can identify pseudo sample anchor points and negative sample anchor points, which are mainly used for training the classification branch of the student model.
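The assignment of anchor points to branches described above can be summarized as a lookup table that masks the per-branch losses. This is only a sketch: the key names are hypothetical, the "and/or" options are resolved in one particular way (classification data trains both the classification and centrality branches, semantic segmentation data trains the classification branch only), and summing unweighted branch losses is an assumption rather than something the embodiment specifies.

```python
from typing import Dict, Tuple

ALL_BRANCHES = ("classification", "regression", "centrality")

# Which student branches each kind of anchor point contributes to.
BRANCHES_BY_ANCHOR_SOURCE: Dict[str, Tuple[str, ...]] = {
    "classification_positive": ("classification", "centrality"),
    "segmentation_positive": ("classification",),
    "weakly_segmented_positive": ALL_BRANCHES,    # frames from steps C1/C2
    "partial_annotation_positive": ALL_BRANCHES,
    "tag_broad_first_region": ALL_BRANCHES,
    "tag_broad_second_region": ("regression", "centrality"),
    "pseudo_or_negative": ("classification",),
}


def combined_loss(branch_losses: Dict[str, float], source: str) -> float:
    """Sum only the branch losses that this kind of anchor point may train."""
    enabled = BRANCHES_BY_ANCHOR_SOURCE[source]
    return sum(branch_losses[name] for name in enabled)


# Hypothetical per-branch loss values for one anchor point.
losses = {"classification": 1.0, "regression": 2.0, "centrality": 4.0}
second_region_loss = combined_loss(losses, "tag_broad_second_region")
```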

According to the method for training the target detection model, the frames in the classification data and the semantic segmentation data are determined through weak supervision segmentation processing, and based on the frames, three branches of classification branches, frame regression branches and centrality branches of the student model can be trained, so that the public data sets can be fully utilized, and the generalization capability of the target detection model can be improved.

Step S206, obtaining a trained target detection model.

For details, refer to step S106 in the embodiment shown in fig. 1; they are not repeated here.

The method for training the target detection model provided in the embodiment essentially utilizes the partially labeled data to learn complete class detection, so that the requirement of complete class detection can be met. The method takes semi-supervised learning as an integral framework, combines algorithms such as weak supervision and the like, can realize end-to-end target detection model training, has no complex intermediate process, and is simple and easy to use.

Based on the same inventive concept, this embodiment also provides a target detection method, which is applicable to a device capable of performing target detection, such as a mobile terminal or a computer. The method includes: performing target detection on an image to be identified according to a target detection model, and identifying the objects in the image to be identified. The target detection model is obtained by the method for training a target detection model provided by any of the above embodiments.

When objects in an image to be identified need to be identified, the image is input into the trained target detection model, which identifies the objects in it. The image to be identified may be, for example, an environment image captured by a camera in an automatic driving scene, containing objects such as vehicles and human bodies.

This embodiment also provides a device for training a target detection model. The device is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

The present embodiment provides a device for training a target detection model, where the target detection model is a full convolution single-stage target detection model, as shown in fig. 5, and the device includes:

an acquisition module 501 for acquiring a plurality of data sets for training, the plurality of data sets comprising: a classification dataset, a semantic segmentation dataset, a partial annotation dataset, and a tag broad dataset;

The training module 502 is configured to train the target detection model according to the multiple data sets, so as to obtain a trained target detection model; wherein the classification data set is used for training the classification branch and/or the centrality branch of the target detection model, the semantic segmentation data set is used for training the classification branch and/or the centrality branch of the target detection model, the labeled data in the partial annotation data set is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model, and the data in the tag-broad data set consistent with the required labels is used for training the classification branch, the frame regression branch and the centrality branch of the target detection model.

In some alternative embodiments, the apparatus further comprises: the preset module is used for presetting an initial target detection model.

And, the training module 502 trains the object detection model according to the plurality of data sets, including:

taking the initial target detection model as a teacher model;

dividing the data in the data set into marked areas and unmarked areas according to the marking condition of the data set; taking the anchor point in the marked area as a positive sample anchor point;

Performing target detection on the unlabeled area according to the teacher model, and determining the centrality of an anchor point in the unlabeled area;

taking an anchor point with the centrality larger than a first preset threshold value as a pseudo sample anchor point with a pseudo tag, and taking an anchor point with the centrality smaller than a second preset threshold value as a negative sample anchor point; the first preset threshold value is larger than or equal to the second preset threshold value;

and performing semi-supervised learning according to the positive sample anchor point, the pseudo sample anchor point and the negative sample anchor point to obtain a corresponding student model, wherein the student model is a full convolution single-stage target detection model.

In some optional embodiments, the training module 502 divides the data in the data set into a marked area and an unmarked area according to the marking condition of the data set, including:

setting a hyperparameter representing a size, and determining the center coordinates of the classification data in the classification data set;

and taking the region within the range of the hyperparameter, centered on the center coordinates, as the labeled region of the classification data.

Performing target detection on the label broad data in the label broad data set according to the teacher model, and determining classification scores of frames in the label broad data;

taking the region corresponding to the frame as a first labeling region of the tag broad data under the condition that the classification score is larger than a third preset threshold value; the positive sample anchor points of the first labeling area in the label broad dataset are used for training classification branches, frame regression branches and centrality branches of the student model.

In some optional embodiments, the training module 502 divides the data in the data set into a marked area and an unmarked area according to the marking condition of the data set, and further includes:

taking the region corresponding to the frame as a second labeling region of the tag broad data under the condition that the classification score is smaller than a fourth preset threshold value; the positive sample anchor points of the second labeling areas in the label broad data set are used for training frame regression branches and centrality branches of the student model; the fourth preset threshold is less than or equal to the third preset threshold.

In some alternative embodiments, the training module 502 trains the object detection model according to the plurality of data sets, further comprising:

Performing weak supervision segmentation processing on the classified data in the classified data set, and determining frames in the classified data; positive sample anchor points in frames of the classification data are used for training classification branches, frame regression branches and centrality branches of the student model;

performing weak supervision segmentation processing on the semantic segmentation data in the semantic segmentation data set, and determining frames in the semantic segmentation data; positive sample anchor points in the frames of the semantic segmentation data are used for training classification branches, frame regression branches and centrality branches of the student model.

Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.

The device for training the target detection model in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory that execute one or more software or firmware programs, and/or other devices that can provide the above functionality.

The embodiment of the invention also provides a computer device which is provided with the device for training the target detection model shown in the figure 5.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 6, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to the interface). In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is taken as an example in fig. 6.

The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.

Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.

The memory 20 may include a storage program area and a storage data area. The storage program area may store an operating system and at least one application program required for a function; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The memory 20 may include volatile memory, such as random access memory; the memory 20 may also include non-volatile memory, such as flash memory, a hard disk, or a solid state disk; the memory 20 may also comprise a combination of the above types of memories.

The computer device further comprises an input device 30 and an output device 40. The processor 10, memory 20, input device 30, and output device 40 may be connected by a bus or other means; connection by a bus is taken as an example in fig. 6.

The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer device. Examples of the input device include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointer stick, one or more mouse buttons, a trackball, a joystick, and the like. The output device 40 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.

The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that is recorded on a storage medium, or as computer code that is originally stored in a remote storage medium or a non-transitory machine-readable storage medium, downloaded through a network, and stored in a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general purpose computer, a special purpose processor, or programmable or special purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.

Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims

1. A method of target detection, the method comprising:

acquiring a plurality of data sets for training, the plurality of data sets comprising: a classification dataset, a semantic segmentation dataset, a partial annotation dataset, and a tag broad dataset;

training a target detection model according to the plurality of data sets; the target detection model is a full-convolution single-stage target detection model, the classification dataset is used for training classification branches and/or centrality branches of the target detection model, the semantic segmentation dataset is used for training the classification branches of the target detection model, the annotated data in the partial annotation dataset is used for training the classification branches, frame regression branches and centrality branches of the target detection model, and the data consistent with the required labels in the tag broad dataset is used for training the classification branches, frame regression branches and centrality branches of the target detection model;

obtaining a trained target detection model;

performing target detection on an image to be identified according to the target detection model, and identifying an object in the image to be identified;

the training of the target detection model according to the plurality of data sets comprises:

taking the initial target detection model as a teacher model;

semi-supervised learning is carried out according to the positive sample anchor point, the pseudo sample anchor point and the negative sample anchor point, so that a corresponding student model is obtained, and the student model is a full convolution single-stage target detection model;

dividing the data in the data set into a marked area and an unmarked area according to the marking condition of the data set, wherein the method comprises the following steps:

2. The method of claim 1, wherein the dividing the data in the dataset into marked areas and unmarked areas according to the marking of the dataset, further comprises:

3. The method according to claim 2, wherein the dividing the data in the dataset into marked areas and unmarked areas according to the marking condition of the dataset, further comprises:

4. The method of claim 1, wherein the training the object detection model from the plurality of data sets further comprises:

5. An apparatus for target detection, the apparatus comprising:

an acquisition module for acquiring a plurality of data sets for training, the plurality of data sets comprising: a classification dataset, a semantic segmentation dataset, a partial annotation dataset, and a tag broad dataset;

the training module is used for training the target detection model according to the plurality of data sets to obtain a trained target detection model; the target detection model is a full-convolution single-stage target detection model, the classification dataset is used for training classification branches and/or centrality branches of the target detection model, the semantic segmentation dataset is used for training the classification branches of the target detection model, the annotated data in the partial annotation dataset is used for training the classification branches, frame regression branches and centrality branches of the target detection model, and the data consistent with the required labels in the tag broad dataset is used for training the classification branches, frame regression branches and centrality branches of the target detection model;

the preset module is used for presetting an initial target detection model;

the training module trains the target detection model according to the plurality of data sets, including:

taking the initial target detection model as a teacher model;

the training module divides the data in the data set into marked areas and unmarked areas according to the marking condition of the data set, which comprises the following steps:

6. A computer device, comprising:

a memory and a processor in communication with each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of target detection according to any one of claims 1 to 4.

7. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of target detection according to any one of claims 1 to 4.