CN113887534B - Determination method of object detection model and related device - Google Patents

Determination method of object detection model and related device

Info

Publication number
CN113887534B
CN113887534B (application CN202111462134.1A)
Authority
CN
China
Prior art keywords
domain
training
model
image
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111462134.1A
Other languages
Chinese (zh)
Other versions
CN113887534A (en)
Inventor
曾浩
邓大付
黄超
李玺
王君乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111462134.1A priority Critical patent/CN113887534B/en
Publication of CN113887534A publication Critical patent/CN113887534A/en
Application granted granted Critical
Publication of CN113887534B publication Critical patent/CN113887534B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a method for determining an object detection model and a related device, which can be applied in various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assisted driving. Model training is performed using labeled image samples with sample labels from a source domain and unlabeled images from a target domain as training samples. From the first output feature of a first intermediate layer of the feature extractor, a first loss function is determined via the prediction domain obtained by a first domain classifier. Adjustment based on the first loss function improves the ability of the first domain classifier to distinguish the source domain from the target domain, while adjustment based on the negative value of the first loss function reduces the feature distance between source-domain and target-domain features extracted by the first intermediate layer, thereby confusing the source domain and the target domain. A large amount of labeled data from the source domain can thus be used effectively for object detection in the target domain, greatly improving the training efficiency and detection performance of the object detection model.

Description

Determination method of object detection model and related device
Technical Field
The present application relates to the field of data processing, and in particular, to a method for determining an object detection model and a related apparatus.
Background
An object detection model detects a specified object in an image, for example identifying the position of an enemy in a game image or the position of a lane line in a traffic image. To meet the object detection requirement of a given domain, an object detection model must be trained for that domain, and model training relies on a large amount of labeled data, which consumes considerable time and cost, especially when the domain is new to developers.
In the related art, an object detection model is trained on a large amount of labeled data accumulated over a long period in other, related domains, and the trained model is then applied to the new domain to provide an object detection service. However, this cross-domain approach suffers from the domain gap problem: because data distributions differ between domain datasets, the detection performance of a model trained on one domain's dataset degrades on another domain's dataset. That is, the detection performance of an object detection model obtained by the related art is not ideal in the new domain.
As a result, the large amount of labeled data from other domains cannot be used in the new domain, which severely reduces the training efficiency of the object detection model in the new domain.
Disclosure of Invention
In order to solve the above technical problem, the present application provides a method for determining an object detection model and a related device, so that a large amount of labeled data from a source domain can be used effectively for object detection in a target domain, greatly improving the training efficiency and detection performance of the object detection model in the target domain.
The embodiments of the present application disclose the following technical solutions.
In one aspect, an embodiment of the present application provides a method for determining an object detection model, the method including:
acquiring a training sample set, where the training samples in the training sample set comprise a labeled image sample from a source domain and an unlabeled image from a target domain, and a sample label of the labeled image sample identifies position information of a target object in the labeled image sample;
performing model training on an initial detection model according to the training sample, where, if the training sample is the labeled image sample, model parameters of the initial detection model are adjusted according to the detection result for the target object and the sample label to obtain an object detection model; the initial detection model comprises a feature extractor for extracting image features of the training sample, and the object detection model is used for detecting the target object in images of the target domain;
in the model training process, according to a first output feature of a first intermediate layer of the feature extractor, determining a first prediction domain corresponding to the first output feature through a first domain classifier;
determining a first loss function according to a difference between the domain to which the training sample actually belongs and the first prediction domain;
and adjusting the model parameters of the first domain classifier based on the first loss function, and adjusting the model parameters of the first intermediate layer through the negative value of the first loss function.
On the other hand, an embodiment of the present application provides an apparatus for determining an object detection model, where the apparatus includes an obtaining unit and a training unit:
the acquiring unit is configured to acquire a training sample set, where the training samples in the training sample set comprise a labeled image sample from a source domain and an unlabeled image from a target domain, and a sample label of the labeled image sample identifies position information of a target object in the labeled image sample;
the training unit is configured to perform model training on an initial detection model according to the training sample, where, if the training sample is the labeled image sample, model parameters of the initial detection model are adjusted according to the detection result for the target object and the sample label to obtain an object detection model; the initial detection model includes a feature extractor for extracting image features of the training sample, and the object detection model is used for detecting the target object in images of the target domain;
the training unit is further used for determining a first prediction domain corresponding to a first output feature through a first domain classifier according to the first output feature of a first intermediate layer of the feature extractor in the model training process;
the training unit is further configured to determine a first loss function according to a difference between an actual domain to which the training sample belongs and the first prediction domain;
the training unit is further configured to adjust model parameters of the first domain classifier based on the first loss function, and adjust the model parameters of the first intermediate layer by a negative value of the first loss function.
In yet another aspect, an embodiment of the present application provides a computer device, including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the above aspect according to instructions in the program code.
In yet another aspect, an embodiment of the present application provides a computer-readable storage medium for storing a computer program for executing the method of the above aspect.
In yet another aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the above aspect.
According to the above technical solutions, both the labeled image samples with sample labels from the source domain and the unlabeled images from the target domain are used as training samples, and model training is performed on an initial detection model. The initial detection model comprises a feature extractor for extracting image features of the training samples. During model training, a first prediction domain corresponding to the first output feature of a first intermediate layer of the feature extractor is determined by a first domain classifier, and a first loss function is determined from its difference with the domain to which the training sample actually belongs. Adjustment based on the first loss function improves the ability of the first domain classifier to distinguish the source domain from the target domain, while adjustment based on the negative value of the first loss function reduces the feature distance between source-domain and target-domain features extracted by the first intermediate layer, thereby confusing the source domain and the target domain. In this way, by optimizing in exactly opposite directions, the first domain classifier and the first intermediate layer are adjusted following the idea of adversarial training: the feature extractor is guided, when extracting the features of the training samples, to weaken the information unique to the source domain and to the target domain and to reduce the information in the features that could distinguish the two, so that domain-related information in the extracted image features is weakened and a domain confusion effect is achieved. When the training sample is the labeled image sample, the model parameters of the initial detection model can be adjusted according to the detection result and the sample label, so that the trained object detection model can effectively extract image features from images of the target domain and accurately detect the target object. A large amount of labeled data from the source domain can thus be used effectively for object detection in the target domain, greatly improving the training efficiency and detection performance of the object detection model in the target domain.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic view of a determination scenario of an object detection model according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a method for determining an object detection model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of adversarial training based on one domain classifier according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of adversarial training based on two domain classifiers according to an embodiment of the present application;
fig. 5 is a schematic view of a determination scenario of an object detection model based on a domain classifier according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for determining an object detection model according to an embodiment of the present disclosure;
fig. 7 is a device configuration diagram of an object detection model determination device according to an embodiment of the present application;
fig. 8 is a structural diagram of a terminal device according to an embodiment of the present application;
fig. 9 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In the related art, for a target domain lacking labeled data, the domain gap problem means that an object detection model for the target domain can only be trained after sufficient labeled data has been accumulated in the target domain itself. This is inefficient, and the large amount of labeled data already available in other domains goes to waste.
Therefore, an embodiment of the present application provides a method for determining an object detection model, so that a large amount of labeled data from a source domain can be used effectively for object detection in a target domain, greatly improving the training efficiency and detection performance of the object detection model in the target domain.
The method for determining an object detection model provided by the embodiments of the present application can be implemented by a computer device, which may be a terminal device or a server. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing cloud computing services. Terminal devices include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, and vehicle-mounted terminals. A terminal device and a server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
The embodiments of the present application can be applied in various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, and assisted driving.
It is understood that in specific implementations of the present application, the various images used may involve user information and other related data. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The embodiments of the present application relate to Artificial Intelligence (AI). AI is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
The solutions provided by the embodiments of the present application relate to the computer vision and machine learning technologies of artificial intelligence; for example, feature extraction from images is realized through computer vision technology, and object detection is realized through machine learning, where:
computer Vision technology (CV), which is a science for researching how to make a machine "see", and further refers to using a camera and a Computer to replace human eyes to perform machine Vision such as identification, tracking and measurement on a target, and further performing image processing, so that the Computer processing becomes an image more suitable for human eyes to observe or transmitted to an instrument to detect. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can capture information from images or multidimensional data. The computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and map construction, automatic driving, intelligent transportation and other technologies, and also includes common biometric identification technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The following is a specific illustrative example.
In the scenario shown in FIG. 1, the server 100 serves as the aforementioned computer device. A training sample set is determined for the target object to be detected in the target domain.
The target object may be any of various kinds of objects, which is not limited in this application; for example, it may be a person, an animal, a vehicle, a game character, and the like.
The training sample set determined according to the target object comprises training samples, which include labeled image samples from the source domain and unlabeled images from the target domain. The sample label of a labeled image sample identifies the position information of the target object in that sample. In other words, both the source domain and the target domain contain the same type of target object, but the source domain, unlike the target domain, has sufficient annotation information, i.e., a large number of image samples in which the target object has been labeled. The target domain lacks, or entirely lacks, label information for the target object, so in the embodiments of the present application only unlabeled images of the target domain are needed, i.e., images without marks identifying whether the target object exists or where it is located.
Model training is performed on the initial detection model 101 according to the training samples in the training sample set. The initial detection model 101 comprises a feature extractor for extracting image features of the training samples. During model training, according to the first output feature of a first intermediate layer of the feature extractor, a corresponding first prediction domain is determined by a first domain classifier 103, and a first loss function is determined based on its difference with the domain to which the training sample actually belongs. Adjustment based on the first loss function improves the ability of the first domain classifier 103 to distinguish the source domain from the target domain, while adjustment based on the negative value of the first loss function guides the first intermediate layer to reduce the feature distance between source-domain and target-domain features during extraction, so as to confuse the source domain and the target domain.
In the example of FIG. 1, a Gradient Reversal Layer (GRL) may be placed on the backward link between the first intermediate layer and the first domain classifier 103, and the negation of the first loss function is implemented by the GRL.
In this way, by optimizing in exactly opposite directions, the first domain classifier 103 and the first intermediate layer are adjusted following the idea of adversarial training. The feature extractor is guided, when extracting the features of the training samples, to weaken the information unique to the source domain and to the target domain and to reduce the information in the features that could distinguish the two, so that, whether for training samples from the source domain or from the target domain, domain-related information in the image features extracted by the feature extractor is weakened and a domain confusion effect is achieved.
When training proceeds on this basis, the model parameters of the initial detection model can be adjusted according to the detection result and the sample label whenever the training sample is a labeled image sample, so that the trained object detection model 102 can effectively extract image features from images of the target domain and accurately detect the target object.
Through this domain-adaptive object detection approach based on multi-level adversarial training, a large amount of labeled data from the source domain can be used effectively for object detection in the target domain, greatly improving the training efficiency and detection performance of the object detection model in the target domain.
FIG. 2 is a flowchart of a method for determining an object detection model. The method may be executed by a server acting as the aforementioned computer device and includes the following steps.
s201: acquiring a training sample set, wherein training samples in the training sample set comprise labeled image samples in a source field and unlabeled images in a target field.
The source domain and the target domain are different domains. The granularity of what a domain identifies can be determined by the requirements of the actual application scenario; depending on the requirement, a domain may identify a product, a product type, an application type, and so on. For example, the target domain may be a new game application, a new social application, a category of games, and the like.
The target object may be any of various kinds of objects, which is not limited in this application; for example, it may be a person, an animal, a vehicle, a game character, and the like. The specific target object to be detected may be determined based on the type of the target domain, actual requirements, and so on. For example, when the target domain is a shooting game in which the game characters are human-shaped avatars, the target object may be a person; when the target domain is an autonomous driving application, whose scenes require identifying vehicles in the traffic environment, the target object may be a vehicle.
A labeled image sample from the source domain carries an annotated sample label, which identifies the position information of the target object in that sample. That is, the sample label explicitly indicates whether the target object is present in the image sample and, when it is, marks the actual position of the target object in the labeled image sample.
It should be noted that the source domain may be one or more domains other than the target domain.
For target object detection in the target domain, in order to make more accurate use of labeled image samples from other domains, the source domain and its labeled image samples can be selected according to the target object to be detected.
In one possible implementation, the method further includes:
determining the target object to be detected according to the object detection requirement of the target domain;
determining the source domain having the target object according to the target object.
That is, according to the target object that clearly needs to be detected, a domain that has the target object can be selected from the other domains as the source domain. The detection capability of the trained object detection model for the target object in the target domain is thus more targeted, and performance is better.
For example, suppose the target object to be detected in the target domain is "person" and there are three known domains: domain 1, domain 2, and domain 3. If only the images of domain 2 contain persons, and the persons in those images have been labeled, then domain 2 can be selected as the source domain based on the target object "person", and the images of domain 2 with labeled persons can be used as the labeled image samples of the source domain.
It should be noted that, depending on the granularity a domain identifies in this application, different source domains may be selected for the target domain, which is not limited here. For example, when a domain identifies a product type and the target domain is a game application, the source domain may be selected from products of non-game types; when a domain identifies an individual product and the target domain is a shooting game, the source domain may be selected from other products, including games of other types.
S202: and performing model training on an initial detection model according to the training sample, wherein if the training sample is the labeled image sample, performing model parameter adjustment on the initial detection model according to the detection result of the target object and the sample label to obtain an object detection model.
The initial detection model comprises a feature extractor for extracting image features of the training sample, and the initial detection model can perform object detection based on the extracted image features to obtain a detection result.
Because the training sample set includes some training samples with sample labels, namely the labeled image samples from the source domain, when the initial detection model produces detection results for these samples, its model parameters can be adjusted based on the difference between the sample labels and the detection results, so that the initial detection model learns the ability to detect the target object during training.
The object detection model can thus be used to detect the target object in images of the target domain.
For example, suppose the target domain is a first-person shooting game application (FPS game) and the target object is a game character. With the model training approach provided by the embodiments of the present application, the dependence of the object detection model on labeled training samples from the target domain during training can be reduced. For a new FPS game, domain-adaptive training with only labeled real-domain data and unlabeled game data can greatly improve the transfer performance of target object detection on the game data, where the real-domain data are the labeled image samples of the source domain and the unlabeled game data are the unlabeled images of the target domain. At least the following two specific application scenarios are possible.
First, the object detection service provided by the object detection model can give the AI of an FPS game robust high-level semantic understanding across different scenes. The game character position information in the detection result not only helps the game AI perceive the positions of current opponents, but also provides an important basis for its next decision. For example, it can effectively improve the AI's responsiveness in human-versus-machine scenarios.
Secondly, the object detection service provided by the object detection model can also be applied to an image-based game scene automatic testing framework. In the field of automated testing, automation of game scenes is a difficult point, because games are highly personalized products and the playing methods of different types of games are very different. Each game is usually designed and developed independently, and a game developer does not expose a uniform interface to the outside, which also means that the traditional scheme for automation based on the API is not universal in the field of games.
The cross-domain adaptive FPS game character detection algorithm implemented by the object detection model determined in the embodiments of the present application does not depend on any API provided by a game developer. For different FPS games, domain-adaptive training with game images as input can greatly improve the performance of the object detection algorithm and effectively support the business scenario of automated game scene testing.
Therefore, when model training is performed with labeled image samples as training samples, the model parameters of the initial detection model can be adjusted according to the detection results and the sample labels, so that the trained object detection model detects the target object accurately, and a large amount of labeled data from the source domain is used effectively for object detection in the target domain.
The following describes how the labeled image samples of the source domain are made applicable to the target domain during model training. S203-S205 below may be performed before the training of the initial detection model via S202 is completed, regardless of whether the training samples have sample labels; that is, the adversarial training described below can be performed on both the training samples of the source domain and those of the target domain.
S203: and in the model training process, determining a first prediction domain corresponding to the first output feature through a first domain classifier according to the first output feature of the first intermediate layer of the feature extractor.
The feature extractor comprises a plurality of intermediate layers, each of which processes the output of the preceding intermediate layer. During image feature extraction the resolution of the features of the training sample is gradually reduced, and the image features that finally express the image are output. For example, if the i-th intermediate layer is one of these layers, its input feature may be the output feature of the (i-1)-th intermediate layer, and its output feature may be the input feature of the (i+1)-th intermediate layer. This application does not limit which of the intermediate layers the first intermediate layer is.
The first output feature of the first intermediate layer carries image information of the training sample in the resolution corresponding to the first intermediate layer, and the corresponding first prediction domain can be determined through the first domain classifier.
The resolution corresponding to the first intermediate layer may be relatively high (i.e., the layer is closer to the input of the feature extractor), as with a local feature extraction layer whose first output features are the bottom-level features of the training samples; or it may be relatively low (i.e., the layer is closer to the output of the feature extractor), as with a global feature extraction layer whose first output features are the high-level features of the training samples.
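As an illustrative aid only (not part of the patented method itself), such a layered feature extractor can be sketched in PyTorch as follows; the two-stage split, channel sizes, and layer names are assumptions made for the example:

    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        # Hypothetical backbone split into a local (high-resolution) stage
        # and a global (low-resolution) stage, as described above.
        def __init__(self):
            super().__init__()
            self.local_stage = nn.Sequential(   # local feature extraction layer
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.global_stage = nn.Sequential(  # global feature extraction layer
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )

        def forward(self, x):
            local_feat = self.local_stage(x)             # bottom-level output feature
            global_feat = self.global_stage(local_feat)  # high-level output feature
            image_feat = torch.flatten(global_feat, 1)   # final image feature
            return local_feat, global_feat, image_feat

Either intermediate output can play the role of the "first output feature" described above, depending on which layer is chosen as the first intermediate layer.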
S204: determining a first loss function from a difference between an actual domain to which the training sample belongs and the first prediction domain.
The first domain classifier may be a classification model for determining whether the input data belongs to a source domain or a target domain.
Since the domain a training sample originally comes from, source or target, is known regardless of whether the sample has a sample label, the difference from the first prediction domain can be determined on this basis, and the first loss function determined accordingly.
S205: and adjusting the model parameters of the first domain classifier based on the first loss function, and adjusting the model parameters of the first intermediate layer through the negative value of the first loss function.
The negative value of the first loss function may be obtained by multiplying the first loss function by a negative number (e.g., -1). The larger the difference identified by the first loss function, the smaller the difference identified by its negative value, so the first domain classifier and the first intermediate layer are optimized in exactly opposite directions.
Adjusting the model parameters of the first domain classifier based on the first loss function improves the classifier's ability to distinguish the source domain from the target domain. Adjusting the model parameters of the first intermediate layer based on the negative value of the first loss function guides the first intermediate layer to reduce the feature distance between source-domain and target-domain features during extraction, so as to confuse the source domain and the target domain.
That is to say, the first loss function optimizes the first domain classifier toward distinguishing more accurately, from the first output feature of the first intermediate layer, whether the corresponding domain is the source domain or the target domain; the negative value of the first loss function optimizes the first intermediate layer in the opposite direction, weakening, as far as possible, the domain-revealing characteristics when the first intermediate layer extracts the first output feature, so that the first domain classifier cannot tell which domain the image corresponding to the first output feature comes from. Adversarial training between the first intermediate layer and the first domain classifier is thereby achieved.
As shown in FIG. 3, a GRL is placed on the backward link (shown by a dotted line) between the first domain classifier and the first intermediate layer. As described above, the GRL is a gradient reversal layer that multiplies its input by a negative number (e.g., -1), so the first loss function acquires a negative sign after passing back through the first domain classifier, and this negative value of the first loss function is what reaches the first intermediate layer. The first domain classifier and the first intermediate layer are thus trained with diametrically opposite optimization directions.
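A minimal sketch of such a gradient reversal layer in PyTorch, assuming the usual formulation in which the forward pass is the identity and the backward pass multiplies the gradient by a negative coefficient; the class and function names are illustrative, not from the patent:

    import torch

    class GradientReversal(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, coeff=1.0):
            ctx.coeff = coeff
            return x.view_as(x)  # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            # Multiply the back-propagated gradient by a negative sign, so the
            # layers before the GRL are optimized in the direction opposite to
            # the domain classifier behind it.
            return -ctx.coeff * grad_output, None

    def grl(x, coeff=1.0):
        return GradientReversal.apply(x, coeff)

Placing grl(...) between the first intermediate layer and the first domain classifier then trains the two with the diametrically opposite optimization directions described above.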
It should be noted that the first domain classifier is only used in the training process of the initial detection model, and does not belong to a part of the initial detection model, and the first domain classifier is not used when the object detection model is obtained through training and the object detection service is provided for the target domain.
In this embodiment of the application, the first intermediate layer may be the local feature extraction layer or the global feature extraction layer.
As explained in S203 above, the local feature extraction layer is an intermediate layer with a higher resolution among the plurality of intermediate layers of the feature extractor, and is an intermediate layer closer to the input layer of the feature extractor in the model structure. Because the resolution corresponding to the local feature extraction layer is high, and the image features input into the local feature extraction layer have a large amount of image detail information (or local information) in the training sample, the local feature extraction layer pays more attention to the image details during feature extraction, so that the obtained output features carry rich image detail information.
The global feature extraction layer is an intermediate layer with lower resolution among a plurality of intermediate layers of the feature extractor, and is an intermediate layer closer to an output layer of the feature extractor on the model structure. Because the resolution corresponding to the global feature extraction layer is low, the image features input into the global feature extraction layer have a large amount of image global information in the training samples, and have less image detail information, the global feature extraction layer pays more attention to the image global during feature extraction, so that the obtained output features carry abundant image global information.
In a possible implementation manner, if the first intermediate layer is a local feature extraction layer, S203 includes:
determining, through the first domain classifier and according to the first output feature of the first intermediate layer of the feature extractor, the pixel prediction domains respectively corresponding to the pixels included in the first output feature;
determining the first prediction domain from the pixel prediction domains.
For example, suppose the first output feature is an image feature with H × W pixels; for each pixel, the first domain classifier determines a corresponding pixel prediction domain, i.e., a result identifying whether that pixel belongs to the source domain or the target domain. The first prediction domain is determined by aggregating the per-pixel prediction domains, so that differences across all pixels are taken into account and the accuracy of the first prediction domain is improved.
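A sketch of a pixel-level first domain classifier consistent with this description, assuming a fully convolutional design in PyTorch; the channel sizes are illustrative:

    import torch.nn as nn

    class LocalDomainClassifier(nn.Module):
        # Fully convolutional, so it outputs an H x W domain prediction map:
        # one source-vs-target prediction per pixel of the input feature.
        def __init__(self, in_channels=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 64, kernel_size=1), nn.ReLU(),
                nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid(),
            )

        def forward(self, local_feat):
            # local_feat: (B, C, H, W) first output feature (after the GRL)
            return self.net(local_feat)  # (B, 1, H, W), values in (0, 1)

The first prediction domain can then be obtained by aggregating the H × W map, for example by averaging over all pixel predictions.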
In this way, by optimizing in exactly opposite directions, the first domain classifier and the first intermediate layer are adjusted following the idea of adversarial training. The feature extractor is guided, when extracting the features of the training samples, to weaken the information unique to the source domain and to the target domain and to reduce the information in the features that could distinguish the two, so that domain-related information in the extracted image features is weakened and a domain confusion effect is achieved. The trained object detection model can then effectively extract image features from images of the target domain, so that a large amount of labeled data from the source domain is used effectively for object detection in the target domain, greatly improving the training efficiency and detection performance of the object detection model in the target domain.
In order to better serve the purpose of letting the feature extractor confuse the source domain and the target domain, in one possible implementation, during the model training, the method further includes:
s11: and determining a second prediction domain corresponding to the second output feature through a second domain classifier according to the second output feature of the second middle layer of the feature extractor.
The second intermediate layer and the first intermediate layer are two different layers among the plurality of intermediate layers of the feature extractor. The input feature of the second intermediate layer is the output of the intermediate layer preceding it, and the second intermediate layer extracts image features from this input at its corresponding resolution, producing the second output feature as its output.
S12: determining a second loss function based on a difference between an actual domain to which the training sample belongs and the second prediction domain.
The second domain classifier may be a classification model for determining whether the input data belongs to the source domain or the target domain.
S13: and adjusting the model parameters of the second domain classifier based on the second loss function, and adjusting the model parameters of the second middle layer through the negative value of the second loss function.
That is, in addition to the domain-classifier-based adversarial training on the first intermediate layer, similar adversarial training may be performed on a second intermediate layer of the feature extractor, different from the first, to strengthen the feature extractor's ability to confuse the features of the source domain and the target domain.
As shown in FIG. 4, a GRL is placed on the backward link (shown by a dotted line) between the second domain classifier and the second intermediate layer, so that the second loss function acquires a negative sign after passing back through the second domain classifier before reaching the second intermediate layer. The second domain classifier and the second intermediate layer are thus trained with diametrically opposite optimization directions.
This adversarial-training-based approach uses domain classifiers, such as the first and second domain classifiers in FIG. 4, to measure the difference between the source domain and the target domain, and trains the feature extractor against the domain classifiers through gradient reversal layers (GRLs) to encourage domain confusion between source-domain and target-domain features. As a result, when the feature extractor in the initial detection model extracts features from training samples of either domain, the components of the extracted image features that would distinguish the domains are weakened, and correspondingly the components describing the target object are strengthened. When the trained object detection model then detects the target object in an image of the target domain, it can focus more on the image features relevant to the target object, improving detection performance in the target domain.
The main role of a domain classifier is to discriminate whether an input feature comes from the source domain or the target domain. Features from an intermediate layer of the feature extractor are fed into the domain classifier, which performs binary classification on them to judge whether they come from the source domain or the target domain.
During training, back-propagated gradients pass through the Gradient Reversal Layer (GRL), whose function is to multiply the returning gradient by a negative sign. This reversed gradient makes the optimization direction of the feature extractor exactly opposite to that of the domain classifier: the domain classifier tries as hard as possible to judge whether an input training sample comes from the source domain or the target domain, while the feature extractor is optimized in the opposite direction, so that the features it extracts make this judgment as difficult as possible. The trained feature extractor can thus achieve domain confusion between source-domain and target-domain features and narrow the feature distance between the two domains, improving the detection performance of the target detector in the target domain.
It should be noted that the second domain classifier is only used in the training process of the initial detection model, and does not belong to a part of the initial detection model, and the second domain classifier is not used when the object detection model is obtained through training and the object detection service is provided for the target domain.
The images of the target domain may differ greatly from those of the source domain; for example, the target domain may consist of game images and the source domain of real images. Therefore, in the embodiments of the present application, adversarial training with corresponding domain classifiers is applied both to the bottom-level features (the first output features extracted by the local feature extraction layer, which contain more image detail information; the local feature extraction layer is closer to the input layer of the feature extractor, hence "bottom-level") and to the high-level features (the second output features extracted by the global feature extraction layer, which contain more global image information; the global feature extraction layer is closer to the output layer, hence "high-level"). At least two intermediate layers of the feature extractor thereby learn to weaken, during extraction, the image features that reveal the source domain or the target domain, achieving the goal of domain confusion.
In a possible implementation manner, the first intermediate layer is a local feature extraction layer and the second intermediate layer is a global feature extraction layer; alternatively, the first intermediate layer is a global feature extraction layer and the second intermediate layer is a local feature extraction layer.
For example, for the local feature extraction layer, the first output feature it produces carries more image detail information, such as texture features; the adversarial training described above weakens the texture features in the first output feature that relate to the source domain or the target domain, achieving alignment of the bottom-level texture features.
For the global feature extraction layer, the second output feature it produces carries more global image information, such as semantic features describing the image as a whole; the adversarial training weakens the semantic features in the second output feature that relate to the source domain or the target domain, achieving alignment of the high-level semantic features.
Therefore, in the image features output by the feature extractor, the domain confusion of different feature levels (such as local image detail information and image global information) is realized.
That is, when the first intermediate layer and the second intermediate layer are respectively the local feature extraction layer and the global feature extraction layer, then after the adversarial training with the first and second domain classifiers described above, both layers learn to confuse the source domain and the target domain during feature extraction. This improves the feature extractor's domain confusion capability at multiple feature levels, reduces the information in the first and second output features that could distinguish the source domain from the target domain, and weakens the domain-related information in the image features extracted from training samples of both domains, so that the trained object detection model can accurately detect the target object based on the image features output by the feature extractor.
In a scenario where the first intermediate layer and the second intermediate layer are respectively the local feature extraction layer and the global feature extraction layer, the first loss function and the second loss function may be as follows.
referring to fig. 4, the bottom-layer local features (e.g., the first output features) extracted by the local feature extraction layer in the feature extractor are fed into the first domain classifier to obtain a domain prediction map with a size of H × W, each pixel of the prediction map represents the prediction result of the position, and the bottom-layer local features are strongly aligned by using L2 loss. The calculation formula of the first loss function for the entire local alignment is as follows.
Figure 525326DEST_PATH_IMAGE001
Figure 708046DEST_PATH_IMAGE002
Figure 143575DEST_PATH_IMAGE003
Wherein the content of the first and second substances,
Figure 785122DEST_PATH_IMAGE004
for the training samples from the source domain,
Figure 130653DEST_PATH_IMAGE005
for the training samples from the target domain,
Figure 359640DEST_PATH_IMAGE006
for the number of samples from the source domain used in a training (batch),
Figure 157832DEST_PATH_IMAGE007
for the number of samples from the target domain used in this training,
Figure 513989DEST_PATH_IMAGE008
the parameters of the local feature extraction layer are taken,
Figure 714026DEST_PATH_IMAGE009
is a parameter of the first domain classifier,
Figure 910652DEST_PATH_IMAGE010
indicating the prediction result for each position on the output prediction graph,
Figure 320774DEST_PATH_IMAGE011
for the first loss function of the local alignment,
Figure 588944DEST_PATH_IMAGE012
a local alignment loss function for the training sample contribution from the source domain,
Figure 518854DEST_PATH_IMAGE013
a local alignment loss function contributed to training samples from the target domain.
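A hedged sketch of this local alignment loss in PyTorch, assuming source pixels are pushed toward label 0 and target pixels toward label 1 (the label convention given at the end of this section); averaging over the batch and the H × W map implements the 1/(n·H·W) normalization:

    import torch

    def local_alignment_loss(pred_map_src, pred_map_tgt):
        # pred_map_src / pred_map_tgt: (B, 1, H, W) domain prediction maps
        # produced by the first domain classifier for source / target batches.
        loss_src = (pred_map_src ** 2).mean()          # push source pixels toward 0
        loss_tgt = ((1.0 - pred_map_tgt) ** 2).mean()  # push target pixels toward 1
        return 0.5 * (loss_src + loss_tgt)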
The high-level global features (e.g., the second output features) obtained by the global feature extraction layer of the feature extractor are fed into the second domain classifier, which outputs label information indicating whether the features come from the source domain or the target domain. A focal loss is used to weakly align the high-level global features, so that the second domain classifier pays more attention to hard samples that are difficult to distinguish. The second loss function for the overall global alignment is calculated as follows:

\mathcal{L}_{glb}^{s} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \Bigl(D_g\bigl(F_g(x_i^s)\bigr)\Bigr)^{\gamma} \log\Bigl(1 - D_g\bigl(F_g(x_i^s)\bigr)\Bigr)

\mathcal{L}_{glb}^{t} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \Bigl(1 - D_g\bigl(F_g(x_i^t)\bigr)\Bigr)^{\gamma} \log D_g\bigl(F_g(x_i^t)\bigr)

\mathcal{L}_{glb} = \frac{1}{2}\bigl(\mathcal{L}_{glb}^{s} + \mathcal{L}_{glb}^{t}\bigr)

where F_g denotes the global feature extraction layer (with its parameters), D_g denotes the second domain classifier (with its parameters), \mathcal{L}_{glb} is the second loss function for the overall global alignment, \mathcal{L}_{glb}^{s} is the global alignment loss contributed by the source-domain data, \mathcal{L}_{glb}^{t} is the global alignment loss contributed by the target-domain data, and \gamma is the hyperparameter controlling the weight of hard samples in the focal loss, which may be set to 5, for example.

In the above formulas, the prediction domain labels for the source domain and the target domain in the first domain classifier and the second domain classifier may be 0 and 1, where 0 identifies the source domain and 1 identifies the target domain.
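A corresponding sketch of the global focal alignment loss under the same source = 0 / target = 1 convention; here p denotes the second domain classifier's predicted probability of the target domain, and the small eps term (an assumption of this example, not the patent) guards the logarithm:

    import torch

    def global_alignment_loss(p_src, p_tgt, gamma=5.0, eps=1e-6):
        # p_src / p_tgt: (B,) predicted probabilities of "target domain"
        # output by the second domain classifier for source / target batches.
        loss_src = -((p_src ** gamma) * torch.log(1.0 - p_src + eps)).mean()
        loss_tgt = -(((1.0 - p_tgt) ** gamma) * torch.log(p_tgt + eps)).mean()
        return 0.5 * (loss_src + loss_tgt)

The gamma exponent down-weights easy samples, so the classifier (and hence the alignment) concentrates on hard, ambiguous samples.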
The domain classifiers assist in adversarially training the intermediate layers of the feature extractor, helping it learn to confuse the source domain and the target domain, and they can further assist in improving object detection accuracy.
In one possible implementation, the method further includes:
s21: and acquiring a first intermediate layer characteristic of the first domain classifier.
S22: and determining input features of an object detection layer in the initial detection model according to the first intermediate layer features and the image features, and determining a detection result aiming at the target object through the object detection layer.
The first intermediate layer feature of the first domain classifier is a feature obtained by the first domain classifier before performing domain classification on the input data, the feature carries information for domain classification, and according to a resolution corresponding to the first intermediate layer in the feature extractor, the first intermediate layer feature of the first domain classifier may further include information at the resolution, such as local information or global information.
In the process of performing object detection on a training sample through the object detection layer of the initial detection model, using both the first intermediate layer feature and the image features output by the feature extractor of the initial detection model as the basis of object detection enriches the information at the resolution corresponding to the first intermediate layer and supplies part of the domain-related information, thereby increasing the information dimensions available to object detection and ensuring the quality of the detection result.
In the foregoing scenario of performing adversarial training on multiple intermediate layers (e.g., a first intermediate layer and a second intermediate layer) of the feature extractor, in addition to the first intermediate layer features of the first domain classifier being used as a basis for object detection, the second intermediate layer features of the second domain classifier may also be added to the object detection.
In one possible implementation, the method further includes:
s31: and acquiring a first intermediate layer characteristic of the first domain classifier and a second intermediate layer characteristic of the second domain classifier.
S32: determining input features of an object detection layer in the initial detection model according to the first intermediate layer features, the second intermediate layer features and the image features, and determining a detection result for the target object through the object detection layer.
Therefore, when the first intermediate layer and the second intermediate layer of the initial detection model are a local feature extraction layer and a global feature extraction layer, respectively, the intermediate layer features of the two domain classifiers can strengthen the local features and the global features on which object detection is based, comprehensively enhancing the data dimensions and the amount of information of the features used for object detection and improving the quality of the object detection result.
The first intermediate layer feature and the image feature can be combined to obtain a complete feature for object detection, and similarly, the first intermediate layer feature, the second intermediate layer feature and the image feature can be combined to obtain a complete feature for object detection.
The application does not limit the specific implementation of the merging; for example, the merging can be completed by concatenating the intermediate layer features with the image features.
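As a hedged illustration of one such merge, assuming PyTorch tensors and that the domain classifier's intermediate feature has already been pooled to a per-image vector (merge_with_context is a hypothetical helper name):

```python
import torch

def merge_with_context(image_features, context):
    """Concatenate a per-image context vector onto a feature map channel-wise.

    image_features: [B, C1, H, W] from the feature extractor.
    context:        [B, C2] intermediate-layer feature from a domain classifier.
    """
    b, _, h, w = image_features.shape
    context_map = context.view(b, -1, 1, 1).expand(b, context.shape[1], h, w)
    return torch.cat([image_features, context_map], dim=1)   # [B, C1 + C2, H, W]
```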
When the training sample is the marked image sample in the source field, the marked image sample has a known sample label, and after the initial detection model determines the detection result of the marked image sample, the sample label can be used as the basis for evaluating the detection result, so that model training of the initial detection model is realized.
When the image features are combined with the first intermediate layer features and the second intermediate layer features before the detection result is determined, the first domain classifier and the second domain classifier can be correspondingly adjusted during model training.
In one possible implementation, S202 includes:
s2021: determining an object detection loss function from the detection result for the target object and the sample label.
S2022: and adjusting model parameters of the initial detection model, the first domain classifier and the second domain classifier according to the object detection loss function.
As shown in fig. 5, the feature outputs of the intermediate layers of the two domain classifiers (e.g., the first intermediate layer feature and the second intermediate layer feature) are introduced into the detection head network (prediction head) of the initial detection model as context features and are combined with the image features fed into the detection head for predicting the category and position information of the target object. In this way, the domain classifiers can be trained more stably while better feature alignment is achieved.
The overall loss function of the whole algorithm is calculated as follows.
$$\mathcal{L}_{adv}(F, D) = \mathcal{L}_{loc} + \mathcal{L}_{glob}$$

$$\mathcal{L}_{det}(F, R) = \mathcal{L}_{cls}(F, R) + \mathcal{L}_{reg}(F, R)$$

$$\mathcal{L}_{all} = \max_{D}\,\min_{F, R}\,\bigl[\mathcal{L}_{det}(F, R) - \mathcal{L}_{adv}(F, D)\bigr]$$

wherein $F$ denotes the overall parameters of the feature extractor in the initial detection model; $R$ denotes the parameters of the other parts of the initial detection model; $D$ denotes the domain classifiers; $\mathcal{L}_{adv}$ is the total adversarial training loss, including the global alignment loss and the local alignment loss; $\mathcal{L}_{det}$ is the object detection loss function of the initial detection model, including the classification loss $\mathcal{L}_{cls}$ and the bounding-box regression loss $\mathcal{L}_{reg}$, computed as the total detection loss over the source domain data; and $\mathcal{L}_{all}$ is the final optimization goal.
Described with reference to fig. 5, the overall algorithm calculation process is as follows:
The training sample, as input data, first passes through the feature extractor of the initial detection model, and output features of different feature levels (such as the first output feature and the second output feature) are extracted from the first intermediate layer and the second intermediate layer of the feature extractor.

The first output feature and the second output feature are sent into the first domain classifier and the second domain classifier, respectively, to distinguish whether the corresponding output feature comes from the source domain or the target domain, and the corresponding first loss function and second loss function are calculated from the classification results. At the same time, the first intermediate layer feature of the first domain classifier and the second intermediate layer feature of the second domain classifier are combined, as context features, with the image features output by the feature extractor to obtain combined features.

The initial detection model comprises a detection head network used for determining the detection result; the combined features are input into this detection head network, which contributes to stable training.

The combined features are used for further target object classification and bounding-box regression, and the output detection result and the sample label are jointly used to calculate the object detection loss function.
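For illustration, the data flow just described could be wired as in the following sketch, which reuses the grad_reverse and merge_with_context helpers sketched earlier; all module names are hypothetical, and the two domain classifiers are assumed to return both domain logits and a pooled context vector, which is an assumption rather than a detail fixed by this application:

```python
import torch.nn as nn

class AdaptiveDetector(nn.Module):
    """Illustrative training-time wiring of fig. 5; module names are hypothetical."""

    def __init__(self, local_layers, global_layers, d1, d2, head):
        super().__init__()
        self.local_layers = local_layers    # first intermediate layer (local features)
        self.global_layers = global_layers  # second intermediate layer (global features)
        self.d1, self.d2 = d1, d2           # first and second domain classifiers
        self.head = head                    # detection head network

    def forward(self, x):
        f1 = self.local_layers(x)                     # first output feature
        f2 = self.global_layers(f1)                   # second output feature / image features
        d1_logits, ctx1 = self.d1(grad_reverse(f1))   # per-position domain logits + context
        d2_logits, ctx2 = self.d2(grad_reverse(f2))   # image-level domain logits + context
        fused = merge_with_context(merge_with_context(f2, ctx1), ctx2)
        return self.head(fused), d1_logits, d2_logits
```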
Data requirements are as follows: the prepared training sample set needs to comprise training samples from both the source domain and the target domain. The training samples of the source domain carry sample labels and generally come from realistic data with abundant annotations, such as VOC; the training samples of the target domain are free of any annotation information and are typically data from the target domain, such as game data.
Training process: when performing model training on the initial detection model, each batch may contain training samples from both the source domain and the target domain. A training sample from the source domain has a sample label and therefore produces two parts of loss: one part is the detection loss $\mathcal{L}_{det}$ calculated on the detection result generated by the initial detection model, and the other part is the domain classification losses $\mathcal{L}_{loc}^{s}$ and $\mathcal{L}_{glob}^{s}$ calculated from the domain labels after the data passes through the domain classifiers. Because the data of the target domain carries no object position annotation, the loss calculated for the target domain only comprises the domain classification losses $\mathcal{L}_{loc}^{t}$ and $\mathcal{L}_{glob}^{t}$, as sketched below.
in order to further enhance the confusion capability of the object detection model for the domain features in the source domain and the target domain, in a possible implementation, the initial detection model includes an object detection layer for determining the detection result according to the image features, and during the model training, the method further includes:
s41: acquiring detection frame characteristics output by a middle layer of the object detection layer, wherein the detection frame characteristics comprise a detection frame for object detection;
s42: determining a third prediction domain corresponding to the detection frame features according to a third domain classifier;
s43: determining a third loss function according to a difference between an actual domain to which the training sample belongs and the third prediction domain;
s44: and adjusting the model parameters of the third domain classifier based on the third loss function, and adjusting the model parameters of the object detection layer through the negative value of the third loss function.
When the initial detection model performs object detection for the target object, detection frames possibly containing the target object are first determined on the input features according to the detection purpose (generally, the middle layer of the object detection layer can be an ROI Align layer), and whether the target object is present, together with its possible position, is then determined based on the information included in the detection frames.
The detection frame features are domain-classified by the third domain classifier and, based on an adversarial training idea similar to that used for the first domain classifier and the second domain classifier, the model parameters of the object detection layer are adjusted so that, when the object detection layer labels the detection frames, the influence of the domain features of the source domain on the labeling is weakened. This further realizes alignment of the detection frame features and achieves feature alignment at more scales within the initial detection model.
It should be noted that the third domain classifier is only used in the training process of the initial detection model and does not form part of the model itself; once the object detection model has been obtained through training and provides the object detection service for the target domain, the third domain classifier is no longer used.
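Under the same caveats, a sketch of the detection-frame-level adversarial step might look like this, assuming pooled ROI features of size 256×7×7 (an illustrative choice) and reusing the grad_reverse helper sketched earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A hypothetical third domain classifier over pooled detection-frame features [N, 256, 7, 7].
third_domain_classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 7 * 7, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

def roi_domain_loss(roi_feats, is_target):
    """roi_feats: ROI Align outputs for the detection frames; is_target: 0 = source, 1 = target."""
    logits = third_domain_classifier(grad_reverse(roi_feats)).squeeze(1)
    labels = torch.full_like(logits, float(is_target))
    return F.binary_cross_entropy_with_logits(logits, labels)
```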
In addition to the above domain-adaptive target detection method based on multi-level adversarial training, the embodiment of the present application also provides a domain-adaptive target detection method based on image style conversion (transform-based DDAOD), which can likewise effectively apply the annotated image samples of the source domain to the training of an object detection model for target object detection scenes in the target domain. This mode can be implemented on the basis of the embodiments corresponding to fig. 1 to 5, or can be implemented independently.
Fig. 6 is a flowchart of a method for determining an object detection model, where the method may be executed by a server as the foregoing computer device, and the method further includes:
s601: and acquiring an image style conversion model aiming at the source field and the target field.
S602: and converting the image style of the marked image sample from a source field to the target field through the image style conversion model to obtain a style training sample.
And the sample label of the style training sample is the sample label of the labeled image sample before conversion.
S603: and performing model training on the initial detection model according to the style training sample to obtain an object detection model based on the image style.
The annotated image sample mentioned below itself has a sample label for identifying the target object in the annotated image sample, and the sample label of the style training sample after style conversion is the sample label of the annotated image sample before conversion. Therefore, after the style training sample or the annotated image sample is input into the initial detection model as a training sample, the corresponding loss function can be determined according to the difference between the detection result and the sample label of the input training sample, and model training can be performed on the initial detection model.
The initial detection model may be the same as the initial detection model mentioned in the embodiments corresponding to fig. 1 to fig. 5, except that the feature extractor in the initial detection model does not additionally need to be adversarially trained with the domain classifiers.
The initial detection model is used for detecting the target object in an input training sample to obtain a corresponding detection result, where the detection result identifies whether the training sample contains the target object and, if so, the position of the target object in the training sample.
The image style of a domain refers to the expression style of the images in that domain, and can embody the domain's characteristic color distribution, aesthetic form, and the like. Image style conversion converts the image style of a source domain image into the image style of the target domain. For example, if the source domain has a realistic style and the target domain has a cartoon style, the realistic-style image of the source domain is converted into the cartoon style through image style conversion, while its substantive content is retained.
In the image-conversion-based method, style conversion is first performed using the CycleGAN algorithm. The goal of CycleGAN is to learn mapping functions between two different data domains $X$ and $Y$. In the actual training process, CNNs jointly learn the mapping $G: X \rightarrow Y$ and the reverse mapping $F: Y \rightarrow X$. After the training of the CycleGAN model (namely, the image style conversion model) is finished, the source domain data is converted into the style of the target domain by using the CycleGAN model to obtain a generated sample data set, and training is then carried out by using the source domain labels as pseudo-labels of the generated samples, realizing pixel-level domain-adaptive target detection.
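As a minimal sketch of this conversion step, assuming a trained CycleGAN generator handle generator_s2t (source to target) operating on image tensors scaled to [0, 1] (real CycleGAN implementations often expect inputs normalized to [-1, 1], so the normalization may need adjusting), the source data can be stylized offline while its annotations are carried over as pseudo-labels:

```python
import torch
from pathlib import Path
from PIL import Image
from torchvision import transforms

to_tensor = transforms.ToTensor()
to_image = transforms.ToPILImage()

@torch.no_grad()
def stylize_source_dataset(generator_s2t, src_dir, out_dir):
    """Convert every source image to the target style; annotations are reused as pseudo-labels."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.jpg")):
        x = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)
        y = generator_s2t(x).squeeze(0).clamp(0.0, 1.0)
        to_image(y).save(out / path.name)   # the annotation file for `path` stays the pseudo-label
```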
The present application does not limit the model type of the image style conversion model, which may also be a model such as UNIT or MUNIT in addition to the above CycleGAN model.
Through the training, an object detection model based on the image style can be obtained, and the object detection model based on the image style is used for carrying out object detection on the image of the target field.
For example, when the target domain is a shooting game application (FPS game) and the target object is a game character, the model training mode based on image style conversion provided by this embodiment can reduce the dependence of the object detection model on annotated training samples of the target domain during training. For a new FPS game, only image style conversion needs to be performed on the annotated realistic-domain data to obtain style training samples, and domain adaptive training is performed on them; this can greatly improve the migration effect of target object detection on game data. Here, the realistic-domain data are the annotated image samples of the source domain. The specific application scenarios can be at least the following two:
First, the object detection service provided by the image-style-based object detection model can provide robust high-level semantic understanding for the AI of an FPS game in different scenes. The game character position information in the object detection result can not only help the game AI perceive the current enemy situation, but also provide an important basis for the next decision. For example, it can effectively improve the AI's response capability in man-machine scenarios.
Secondly, the object detection service provided by the image-style-based object detection model can also be applied to an image-based automated game scene testing framework. The cross-domain adaptive FPS game character detection algorithm implemented with the image-style-based object detection model determined by this embodiment does not depend on any API provided by a game developer; for different FPS games, domain adaptive training is performed with game images as input, which can greatly improve the performance of the object detection algorithm and effectively support the service scenario of automated game scene testing.
Therefore, by acquiring the annotated image samples of the source domain and converting them, through the style conversion model, into style training samples in the image style of the target domain, model parameters of the initial detection model can be adjusted according to the detection result and the sample label when the initial detection model is trained on the style training samples, so that the trained object detection model can accurately detect the target object in images of the target domain, and a large amount of labeled data of the source domain can be effectively used for object detection in the target domain.
For example, after style conversion, the obtained style training sample still contains the target object, now rendered in the style of the target domain, but the position, shape, and the like of the target object may change along with the style conversion. However, the sample label of the style training sample is still the sample label of the original annotated image sample, so the position information of the target object recorded in the sample label may not be consistent with the actual position of the target object in the style training sample, and the position recognition capability of the trained image-style-based object detection model for the target object may therefore be insufficiently learned.
To this end, in one possible implementation, S603 includes: and performing model training on the initial detection model according to the style training sample and the labeled image sample to obtain an image style-based object detection model.
In the training of the initial detection model, adding the annotated image samples of the source domain, whose image style is unconverted, enhances the initial detection model's learning of the position recognition capability for the target object, so that the positioning capability of the image-style-based object detection model for the target object is improved through model training.
When the initial detection model is trained, the style training samples and the annotated image samples from the source domain are mixed to form a mixed data set on which the initial detection model is trained, as sketched below. The image-style-based object detection model obtained by such training has better target object detection capability for the target domain and, by utilizing the annotated image samples of the source domain, also retains better positioning capability, thereby effectively improving the detection performance of the object detection model.
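A minimal sketch of forming such a mixed data set, assuming two PyTorch-style datasets source_ds and style_ds yielding (image, target) pairs and a detection-appropriate collate function detection_collate (all hypothetical names):

```python
from torch.utils.data import ConcatDataset, DataLoader

# source_ds: annotated source-domain samples; style_ds: stylized samples with pseudo-labels.
mixed_ds = ConcatDataset([source_ds, style_ds])
loader = DataLoader(mixed_ds, batch_size=16, shuffle=True, collate_fn=detection_collate)
```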
The calculation formula of the algorithm total loss function is as follows.
$$\mathcal{L}_{all}(F) = \frac{1}{n_s}\sum_{i=1}^{n_s}\mathcal{L}_{det}\bigl(x_i^{s}, y_i^{s}; F\bigr) + \frac{1}{n_f}\sum_{j=1}^{n_f}\mathcal{L}_{det}\bigl(x_j^{f}, y_j^{f}; F\bigr)$$

wherein $x_i^{s}$ denotes an annotated image sample from the source domain; $y_i^{s}$ denotes its sample label; $n_s$ is the number of samples from the source domain in one training batch; $x_j^{f}$ denotes a style training sample; $y_j^{f}$ denotes the sample label (also called the pseudo-label) of the style training sample; $n_f$ is the number of style training samples in this training; $F$ denotes the parameters of the initial detection model; $\mathcal{L}_{det}$ is the total loss of the initial detection model, including the classification loss and the bounding-box regression loss; and $\mathcal{L}_{all}$ is the overall loss function, which is also the final optimization objective.
The overall algorithm calculation process is as follows:
and converting the labeled image sample of the source field into a target field style by using a CycleGAN model to obtain a style training sample.
And mixing the style training sample and the labeled image sample in the source field to form a mixed data set.
An initial detection model is trained on the mixed data set.
Data requirements are as follows:
When image style conversion is carried out to obtain style training samples, the data set to be prepared comprises two parts, the annotated image samples of the source domain and the style training samples converted to the target domain style: the source-domain annotated image samples have sample labels and typically come from realistic data with rich annotations, such as VOC; the style training samples of the target domain are obtained through style conversion, and the sample labels of the corresponding source-domain annotated image samples are used as their pseudo-labels.
When the object detection model is trained, the style training samples obtained in the first step and the labeled image samples in the source field need to be prepared and mixed to form a mixed data set.
Training process: during training, a group of samples (a batch) is extracted from the mixed data set each time and sent to the initial detection model for training. The training loss generally comprises two parts: one part is the classification loss, and the other part is the bounding-box regression loss.
In order to test the cross-domain detection effect of the two algorithms of this application, the VOC-person data set is used as the source domain data, and three game data sets, CFM, CJZC, and CODM, are used as the target domains for training the domain adaptive target detection algorithms. The detection model used in the experiments is YOLOv5m. Baseline in the tables below represents the result of a detection model trained directly on VOC-person and tested directly on the game data, serving as the baseline result for the cross-domain test. It can be seen that, after training with either of the two domain adaptive algorithms, the obtained detection model brings a stable detection performance improvement on all three game data sets, CFM, CJZC, and CODM.
TABLE 1 Implementation effect on the CFM data set
TABLE 2 Implementation effect on the CJZC data set
TABLE 3 Implementation effect on the CODM-thre-0.005 data set
On the basis of the embodiments corresponding to fig. 1 to fig. 6, fig. 7 is a device structure diagram of an object detection model determining device according to an embodiment of the present application, where the object detection model determining device 700 includes an obtaining unit 701 and a training unit 702:
the acquiring unit 701 is configured to acquire a training sample set, where a training sample in the training sample set includes an annotated image sample in a source field and an unlabeled image in a target field, and a sample label of the annotated image sample is used to identify position information of a target object in the annotated image sample;
the training unit 702 is configured to perform model training on an initial detection model according to the training sample, wherein if the training sample is the labeled image sample, performing model parameter adjustment on the initial detection model according to a detection result of the target object and the sample label to obtain an object detection model, where the initial detection model includes a feature extractor for extracting image features of the training sample, and the object detection model is used to perform detection on the target object on an image in the target field;
the training unit 702 is further configured to, during the model training, determine, by using a first domain classifier, a first prediction domain corresponding to a first output feature of a first intermediate layer of the feature extractor;
the training unit 702 is further configured to determine a first loss function according to a difference between an actual belonging field of the training sample and the first prediction field;
the training unit 702 is further configured to adjust model parameters of the first domain classifier based on the first loss function, and adjust model parameters of the first intermediate layer according to a negative value of the first loss function.
In a possible implementation manner, the training unit is further configured to, during the model training:
determining a second prediction domain corresponding to a second output feature through a second domain classifier according to the second output feature of a second middle layer of the feature extractor;
determining a second loss function according to a difference between an actual domain to which the training sample belongs and the second prediction domain;
and adjusting the model parameters of the second domain classifier based on the second loss function, and adjusting the model parameters of the second middle layer through the negative value of the second loss function.
In a possible implementation manner, the first intermediate layer is a local feature extraction layer, and the second intermediate layer is a global feature extraction layer; alternatively,
the first intermediate layer is a global feature extraction layer, and the second intermediate layer is a local feature extraction layer.
In a possible implementation manner, if the first intermediate layer is a local feature extraction layer, the training unit is further configured to:
determining pixel prediction fields corresponding to pixels included in the first output features respectively through a first field classifier according to the first output features of the first middle layer of the feature extractor;
determining the first prediction domain from the pixel prediction domain.
In one possible implementation, the training unit is further configured to:
acquiring a first intermediate layer characteristic of the first domain classifier;
and determining input features of an object detection layer in the initial detection model according to the first intermediate layer features and the image features, and determining a detection result aiming at the target object through the object detection layer.
In one possible implementation, the training unit is further configured to:
acquiring a first intermediate layer characteristic of the first domain classifier and a second intermediate layer characteristic of the second domain classifier;
determining input features of an object detection layer in the initial detection model according to the first intermediate layer features, the second intermediate layer features and the image features, and determining a detection result for the target object through the object detection layer.
In one possible implementation, the training unit is further configured to:
determining an object detection loss function from the detection result for the target object and the sample label;
and adjusting model parameters of the initial detection model, the first domain classifier and the second domain classifier according to the object detection loss function.
In a possible implementation manner, the initial detection model includes an object detection layer for determining the detection result according to the image feature, and the training unit is further configured to, during the model training:
acquiring detection frame characteristics output by a middle layer of the object detection layer, wherein the detection frame characteristics comprise a detection frame for object detection;
determining a third prediction domain corresponding to the detection frame features according to a third domain classifier;
determining a third loss function according to a difference between an actual domain to which the training sample belongs and the third prediction domain;
and adjusting the model parameters of the third domain classifier based on the third loss function, and adjusting the model parameters of the object detection layer through the negative value of the third loss function.
In one possible implementation, the training unit is further configured to:
determining the target object to be detected according to the object detection requirement of the target field;
determining the source domain having the target object according to the target object.
In one possible implementation, the apparatus further includes a conversion unit:
the obtaining unit is further used for obtaining image style conversion models aiming at the source field and the target field;
the conversion unit is used for converting the image style of the marked image sample from a source field to the target field through the image style conversion model to obtain a style training sample, and a sample label of the style training sample is a sample label of the marked image sample before conversion;
the training unit is further used for performing model training on the initial detection model according to the style training samples to obtain an image style-based object detection model, and the image style-based object detection model is used for performing object detection on the images in the target field.
In a possible implementation manner, the training unit is further configured to perform model training on the initial detection model according to the style training sample and the labeled image sample, so as to obtain an image style-based object detection model.
Therefore, the labeled image samples with sample labels in the source domain and the unlabeled images in the target domain are both used as training samples for model training of the initial detection model. The initial detection model comprises a feature extractor for extracting the image features of the training samples. In the model training process, according to the first output feature of the first intermediate layer of the feature extractor, the first prediction domain corresponding to the first output feature is determined through the first domain classifier, and the first loss function is determined based on its difference from the actual domain of the training sample. Adjustment based on the first loss function can improve the ability of the first domain classifier to distinguish the source domain from the target domain, while adjustment based on the negative value of the first loss function reduces the feature distance between source-domain and target-domain features extracted by the first intermediate layer, thereby achieving the purpose of confusing the source domain and the target domain.

In this way, through completely opposite optimization directions, parameter adjustment is performed on the first domain classifier and the first intermediate layer based on the idea of adversarial training, guiding the feature extractor to weaken the information unique to the source domain and the target domain when extracting features of the training samples and to reduce the information in the features that can be used to distinguish the two domains, so that the domain-related information in the image features extracted by the feature extractor is weakened and a domain confusion effect is realized.

When the training sample is the labeled image sample, model parameter adjustment can be performed on the initial detection model according to the detection result and the sample label, so that the trained object detection model can effectively extract the image features of images in the target domain and accurately detect the target object. A large amount of labeled data in the source domain can thus be effectively used for object detection in the target domain, greatly improving the training efficiency and detection performance of the object detection model for the target domain.
An embodiment of the present application further provides a computer device, where the computer device is the computer device described above, and may include a terminal device or a server, and the determination device for the object detection model may be configured in the computer device. The computer apparatus is described below with reference to the drawings.
If the computer device is a terminal device, please refer to fig. 8, an embodiment of the present application provides a terminal device, taking the terminal device as a mobile phone as an example:
fig. 8 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 8, the handset includes: a Radio Frequency (RF) circuit 1410, a memory 1420, an input unit 1430, a display unit 1440, a sensor 1450, an audio circuit 1460, a Wireless Fidelity (WiFi) module 1470, a processor 1480, and a power supply 1490. Those skilled in the art will appreciate that the handset configuration shown in fig. 8 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 8:
RF circuit 1410 may be used for receiving and transmitting signals during a message transmission or call; in particular, it delivers received downlink information of a base station to the processor 1480 for processing and transmits uplink data to the base station. In general, RF circuit 1410 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 1410 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1420 may be used to store software programs and modules, and the processor 1480 executes various functional applications and data processing of the cellular phone by operating the software programs and modules stored in the memory 1420. The memory 1420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, memory 1420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 1430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. In particular, the input unit 1430 may include a touch panel 1431 and other input devices 1432. The touch panel 1431, also referred to as a touch screen, may collect touch operations performed by a user on or near the touch panel 1431 (for example, operations performed by the user on or near the touch panel 1431 by using any suitable object or accessory such as a finger or a stylus pen), and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 1431 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device and converts it to touch point coordinates, which are provided to the processor 1480 and can receive and execute commands from the processor 1480. In addition, the touch panel 1431 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. In addition to the touch panel 1431, the input unit 1430 may also include other input devices 1432. In particular, other input devices 1432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1440 may be used to display information input by or provided to the user and various menus of the mobile phone. The Display unit 1440 may include a Display panel 1441, and optionally, the Display panel 1441 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, touch panel 1431 can overlay display panel 1441, and when touch panel 1431 detects a touch operation on or near touch panel 1431, it can transmit to processor 1480 to determine the type of touch event, and then processor 1480 can provide a corresponding visual output on display panel 1441 according to the type of touch event. Although in fig. 8, the touch panel 1431 and the display panel 1441 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1431 and the display panel 1441 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1450, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1441 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 1441 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, can be used for applications of recognizing the gesture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and tapping) and the like, and can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor and the like, which are not described herein again.
Audio circuitry 1460, speaker 1461, and microphone 1462 may provide an audio interface between the user and the mobile phone. The audio circuit 1460 can transmit the electrical signal converted from received audio data to the loudspeaker 1461, which converts it into a sound signal for output; on the other hand, the microphone 1462 converts collected sound signals into electrical signals, which are received by the audio circuit 1460 and converted into audio data. The audio data is then processed by the processor 1480 and either transmitted through the RF circuit 1410 to, for example, another mobile phone, or output to the memory 1420 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through a WiFi module 1470, and provides wireless broadband internet access for the user. Although fig. 8 shows the WiFi module 1470, it is understood that it does not belong to the essential constitution of the handset and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1480, which is the control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1420 and calling data stored in the memory 1420, thereby integrally monitoring the mobile phone. Alternatively, the processor 1480 may include one or more processing units; preferably, the processor 1480 may integrate an application processor, which handles primarily operating systems, user interfaces, and applications, among others, with a modem processor, which handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1480.
The handset also includes a power supply 1490 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 1480 via a power management system to provide management of charging, discharging, and power consumption via the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 1480 included in the terminal device also has the following functions:
acquiring a training sample set, wherein training samples in the training sample set comprise an annotated image sample in a source field and an unlabeled image in a target field, and a sample label of the annotated image sample is used for identifying position information of a target object in the annotated image sample;
performing model training on an initial detection model according to the training sample, wherein if the training sample is the labeled image sample, performing model parameter adjustment on the initial detection model according to a detection result of the target object and the sample label to obtain an object detection model, wherein the initial detection model comprises a feature extractor for extracting image features of the training sample, and the object detection model is used for detecting the target object on the image in the target field;
in the model training process, according to a first output feature of a first intermediate layer of the feature extractor, determining a first prediction domain corresponding to the first output feature through a first domain classifier;
determining a first loss function according to a difference between an actual belonging field of the training sample and the first prediction field;
and adjusting the model parameters of the first domain classifier based on the first loss function, and adjusting the model parameters of the first intermediate layer through the negative value of the first loss function.
If the computer device is a server, the embodiment of the present application further provides a server, please refer to fig. 9, where fig. 9 is a structural diagram of the server 1500 provided in the embodiment of the present application, and the server 1500 may generate a relatively large difference due to different configurations or performances, and may include one or more Central Processing Units (CPUs) 1522 (e.g., one or more processors) and a memory 1532, and one or more storage media 1530 (e.g., one or more mass storage devices) for storing an application program 1542 or data 1544. Memory 1532 and storage media 1530 may be, among other things, transient or persistent storage. The program stored on the storage medium 1530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, a central processor 1522 may be provided in communication with the storage medium 1530, executing a series of instruction operations in the storage medium 1530 on the server 1500.
The server 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input-output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
In addition, a storage medium is provided in an embodiment of the present application, and the storage medium is used for storing a computer program, and the computer program is used for executing the method provided in the embodiment.
The embodiment of the present application also provides a computer program product including instructions, which when run on a computer, causes the computer to execute the method provided by the above embodiment.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium may be at least one of the following media: various media that can store program codes, such as Read-only Memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Moreover, the present application can be further combined to provide more implementations on the basis of the implementations provided by the above aspects. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for determining an object detection model, the method comprising:
acquiring a training sample set, wherein training samples in the training sample set comprise an annotated image sample in a source field and an unlabeled image in a target field, and a sample label of the annotated image sample is used for identifying position information of a target object in the annotated image sample;
performing model training on an initial detection model according to the training sample, wherein if the training sample is the labeled image sample, performing model parameter adjustment on the initial detection model according to a detection result of the target object and the sample label to obtain an object detection model, wherein the initial detection model comprises a feature extractor for extracting image features of the training sample, and the object detection model is used for detecting the target object on the image in the target field;
in the model training process, according to a first output feature of a first intermediate layer of the feature extractor, determining pixel prediction fields corresponding to pixels included in the first output feature respectively through a first field classifier; determining a first prediction domain according to the pixel prediction domain; wherein the first intermediate layer is a local feature extraction layer;
determining the actual field to which the training sample input to the first field classifier belongs, the actual field being the source field or the target field, and determining a first loss function based on a difference between the actual field and the first prediction field;
adjusting model parameters of the first domain classifier based on the first loss function, and adjusting model parameters of the first intermediate layer through a negative value of the first loss function;
the adjusting the model parameters of the first intermediate layer through the negative value of the first loss function comprises: inputting a negative value of the first loss function into the first intermediate layer, such that the first domain classifier and the first intermediate layer are trained with diametrically opposite optimization directions.
2. The method of claim 1, wherein during the model training, the method further comprises:
determining a second prediction domain corresponding to a second output feature through a second domain classifier according to the second output feature of a second middle layer of the feature extractor;
determining a second loss function according to a difference between an actual domain to which the training sample belongs and the second prediction domain;
and adjusting the model parameters of the second domain classifier based on the second loss function, and adjusting the model parameters of the second middle layer through the negative value of the second loss function.
3. The method of claim 2, wherein the first intermediate layer is a local feature extraction layer and the second intermediate layer is a global feature extraction layer.
4. The method of claim 1, further comprising:
acquiring a first intermediate layer characteristic of the first domain classifier;
and determining input features of an object detection layer in the initial detection model according to the first intermediate layer features and the image features, and determining a detection result aiming at the target object through the object detection layer.
5. The method of claim 2, further comprising:
acquiring a first intermediate layer characteristic of the first domain classifier and a second intermediate layer characteristic of the second domain classifier;
determining input features of an object detection layer in the initial detection model according to the first intermediate layer features, the second intermediate layer features and the image features, and determining a detection result for the target object through the object detection layer.
6. The method of claim 5, wherein the performing model parameter adjustment on the initial detection model according to the detection result for the target object and the sample label to obtain an object detection model comprises:
determining an object detection loss function from the detection result for the target object and the sample label;
and adjusting model parameters of the initial detection model, the first domain classifier and the second domain classifier according to the object detection loss function.
7. The method of claim 1, wherein the initial detection model comprises an object detection layer for determining the detection result according to the image feature, and during the model training, the method further comprises:
acquiring detection frame characteristics output by a middle layer of the object detection layer, wherein the detection frame characteristics comprise a detection frame for object detection;
determining a third prediction domain corresponding to the detection frame features according to a third domain classifier;
determining a third loss function according to a difference between an actual domain to which the training sample belongs and the third prediction domain;
and adjusting the model parameters of the third domain classifier based on the third loss function, and adjusting the model parameters of the object detection layer through the negative value of the third loss function.
8. The method according to any one of claims 1-7, further comprising:
determining the target object to be detected according to the object detection requirement of the target field;
determining the source domain having the target object according to the target object.
9. The method of claim 1, further comprising:
acquiring an image style conversion model aiming at the source field and the target field;
converting the image style of the marked image sample from a source field to the target field through the image style conversion model to obtain a style training sample, wherein a sample label of the style training sample is a sample label of the marked image sample before conversion;
and performing model training on the initial detection model according to the style training sample to obtain an image style-based object detection model, wherein the image style-based object detection model is used for performing object detection on the image in the target field.
10. The method of claim 9, wherein the model training the initial detection model according to the style training samples to obtain an image style-based object detection model comprises:
and performing model training on the initial detection model according to the style training sample and the labeled image sample to obtain the image style-based object detection model.
11. An apparatus for determining an object detection model, the apparatus comprising an acquisition unit and a training unit:
the acquiring unit is used for acquiring a training sample set, wherein training samples in the training sample set comprise an annotated image sample in a source field and an unlabeled image in a target field, and a sample label of the annotated image sample is used for identifying position information of a target object in the annotated image sample;
the training unit is configured to perform model training on an initial detection model according to the training sample, wherein if the training sample is the labeled image sample, performing model parameter adjustment on the initial detection model according to a detection result of the target object and the sample label to obtain an object detection model, the initial detection model includes a feature extractor for extracting image features of the training sample, and the object detection model is used for detecting the target object for the image in the target field;
the training unit is further used for determining pixel prediction fields corresponding to pixels included in the first output features respectively through a first field classifier according to the first output features of the first middle layer of the feature extractor in the model training process; determining a first prediction domain according to the pixel prediction domain; wherein the first intermediate layer is a local feature extraction layer;
the training unit is further configured to determine the actual field to which the training sample input to the first field classifier belongs, the actual field being the source field or the target field, and then determine a first loss function based on a difference between the actual field and the first prediction field;
the training unit is further used for adjusting model parameters of the first domain classifier based on the first loss function and adjusting the model parameters of the first intermediate layer through a negative value of the first loss function;
the adjusting the model parameters of the first intermediate layer through the negative value of the first loss function comprises: inputting a negative value of the first loss function into the first intermediate layer, such that the first domain classifier and the first intermediate layer are trained with diametrically opposite optimization directions.
12. A computer device, comprising a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to perform the method of any one of claims 1-10 according to instructions in the program code.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for performing the method of any one of claims 1-10.
14. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-10.
CN202111462134.1A 2021-12-03 2021-12-03 Determination method of object detection model and related device Active CN113887534B (en)

Priority Applications (1)

Application Number: CN202111462134.1A
Priority Date: 2021-12-03
Filing Date: 2021-12-03
Title: Determination method of object detection model and related device

Publications (2)

Publication Number Publication Date
CN113887534A (en) 2022-01-04
CN113887534B (en) 2022-03-18

Family

ID=79016347

Family Applications (1)

Application Number: CN202111462134.1A (Active)
Priority Date: 2021-12-03
Filing Date: 2021-12-03
Title: Determination method of object detection model and related device

Country Status (1)

Country Link
CN (1) CN113887534B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906393A * 2021-03-05 2021-06-04 Hangzhou Firestone Technology Co., Ltd. Meta-learning-based few-shot entity recognition method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02210584A (en) * 1989-02-10 1990-08-21 Fanuc Ltd System for setting teaching data for picture processing in visual sensor
CN110135579A * 2019-04-08 2019-08-16 Shanghai Jiao Tong University Unsupervised domain adaptation method, system and medium based on adversarial learning
CN112395879A * 2020-11-10 2021-02-23 Huazhong University of Science and Technology Named entity recognition method for scientific and technological texts

Also Published As

Publication number Publication date
CN113887534A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
JP7265003B2 (en) Target detection method, model training method, device, apparatus and computer program
CN110909630B (en) Abnormal game video detection method and device
CN110704661B (en) Image classification method and device
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN110766081B (en) Interface image detection method, model training method and related device
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN111368525A (en) Information searching method, device, equipment and storage medium
CN113821720A (en) Behavior prediction method and device and related product
CN112214605A (en) Text classification method and related device
CN113269279B (en) Multimedia content classification method and related device
CN115205883A Data auditing method, device, equipment and storage medium based on OCR (optical character recognition) and NLP (natural language processing)
CN112995757B (en) Video clipping method and device
CN112862021B (en) Content labeling method and related device
CN112270238A (en) Video content identification method and related device
CN113887534B (en) Determination method of object detection model and related device
CN113535055B (en) Method, equipment and storage medium for playing point-to-read based on virtual reality
CN110750193B (en) Scene topology determination method and device based on artificial intelligence
CN115130456A (en) Sentence parsing and matching model training method, device, equipment and storage medium
CN116453005A (en) Video cover extraction method and related device
CN113569043A (en) Text category determination method and related device
CN113536876A (en) Image recognition method and related device
CN114722937B (en) Abnormal data detection method and device, electronic equipment and storage medium
CN117011649B (en) Model training method and related device
CN114612531B (en) Image processing method and device, electronic equipment and storage medium
CN114862602A (en) Vehicle damage assessment method and device based on muscle line identification, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant