CN113901998A - Model training method, device, equipment, storage medium and detection method


Info

Publication number
CN113901998A
Authority
CN
China
Prior art keywords
data
model
unsupervised
training
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111153483.5A
Other languages
Chinese (zh)
Inventor
邹智康
叶晓青
孙昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111153483.5A priority Critical patent/CN113901998A/en
Publication of CN113901998A publication Critical patent/CN113901998A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a model training method, device, equipment, and storage medium, and a detection method. It relates to the field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to intelligent robot and automatic driving scenarios. The specific implementation scheme is as follows: performing first-stage training on an initial model to be trained by using first supervised data to obtain a preselected detection model; and performing second-stage training on the preselected detection model by using second supervised data and unsupervised data to obtain a target detection model, where the target detection model is used for outputting 3D object information in an input image to be detected. With the disclosed technology, a target detection model with high detection precision and generalization can be trained while the amount of first and second supervised data is reduced, lowering the labor and time costs of manual labeling and improving the training efficiency of the model.

Description

Model training method, device, equipment, storage medium and detection method
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the fields of computer vision and deep learning, and can be applied in intelligent robot and automatic driving scenarios.
Background
For 3D detection on a monocular image, the mainstream approach relies on prior information about the 3D bounding boxes of targets: the data set is traversed in advance to generate 3D candidate boxes, an image detection model then processes the input monocular image to output a 3D offset, and the real 3D bounding box of an object is obtained by combining a 3D candidate box with the predicted 3D offset, thereby accomplishing the 3D detection task for the monocular image.
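For illustration only, the decoding step of this related-art pipeline might be sketched as follows, assuming a purely additive offset parameterization (real detectors often use log-scale size offsets instead); this is not the patent's own method:

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """A 3D bounding box: center (x, y, z), size (h, w, l), and yaw angle."""
    x: float
    y: float
    z: float
    h: float
    w: float
    l: float
    yaw: float

def decode_box(candidate: Box3D, offset: Box3D) -> Box3D:
    # Combine a 3D candidate box (prior) with the model's predicted 3D offset
    # to recover the real 3D bounding box of the object.
    return Box3D(
        x=candidate.x + offset.x, y=candidate.y + offset.y, z=candidate.z + offset.z,
        h=candidate.h + offset.h, w=candidate.w + offset.w, l=candidate.l + offset.l,
        yaw=candidate.yaw + offset.yaw,
    )
```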
In the related art, the image detection model is usually trained by supervised learning on manually labeled data. However, because manual labeling involves a large workload and a long labeling cycle, the positioning accuracy of the image detection model is limited by the quantity and labeling accuracy of the manually labeled samples.
Disclosure of Invention
The present disclosure provides a model training method, apparatus, device, and storage medium, and an image detection method.
According to an aspect of the present disclosure, there is provided a training method of a model, including:
carrying out first-stage training on an initial model to be trained by utilizing first supervised data to obtain a preselected detection model;
performing second-stage training on the preselected detection model by using second supervised data and unsupervised data to obtain a target detection model;
the target detection model is used for outputting 3D object information in the image to be detected according to the input image to be detected.
According to another aspect of the present disclosure, there is provided a method of detecting an image, including:
inputting an image to be detected into a target detection model;
receiving 3D object information in an image to be detected from a target detection model;
the target detection model is obtained by adopting the model training method according to the embodiment of the disclosure.
According to another aspect of the present disclosure, there is provided a training apparatus of a model, including:
the first-stage training module is used for carrying out first-stage training on an initial model to be trained by utilizing first supervised data to obtain a preselected detection model;
the second-stage training module is used for carrying out second-stage training on the preselected detection model by utilizing second supervised data and unsupervised data to obtain a target detection model;
the target detection model is used for outputting 3D object information in the image to be detected according to the input image to be detected.
According to another aspect of the present disclosure, there is provided an image detection apparatus including:
the input module is used for inputting the image to be detected into the target detection model;
the receiving module is used for receiving the 3D object information in the image to be detected from the target detection model;
the target detection model is obtained according to the model training device of the above embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method in any of the embodiments of the present disclosure.
According to the model training method of the present disclosure, a target detection model with high detection precision and generalization can be obtained through training. Moreover, because unsupervised data is used to train the preselected detection model, the amount of first and second supervised data required is reduced, the labor and time costs of manual labeling are lowered, and the training efficiency of the model is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a method of training a model according to an embodiment of the present disclosure;
FIG. 2 is a detailed flow chart of the second-stage training in a method of training a model according to an embodiment of the present disclosure;
FIG. 3 is a detailed flow chart of the data preprocessing in a method of training a model according to an embodiment of the present disclosure;
FIG. 4 is a detailed flow chart of the self-supervised training in a method of training a model according to an embodiment of the present disclosure;
FIG. 5 is a further detailed flow chart of the self-supervised training in a method of training a model according to an embodiment of the present disclosure;
FIG. 6 is a further detailed flow chart of the second-stage training in a method of training a model according to an embodiment of the present disclosure;
FIG. 7 is a flow chart of a method of detecting an image according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a training apparatus for a model according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of an apparatus for detecting an image according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device for implementing a method of training a model and/or a method of detecting an image according to embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A training method of an object detection model according to an embodiment of the present disclosure is described below with reference to fig. 1 to 6.
As shown in fig. 1, the training method of the target detection model in the embodiment of the present disclosure specifically includes the following steps:
S101: performing first-stage training on an initial model to be trained by using first supervised data to obtain a preselected detection model;
S102: performing second-stage training on the preselected detection model by using second supervised data and unsupervised data to obtain a target detection model;
wherein the target detection model is used for outputting the 3D object information in an input image to be detected.
Illustratively, the first and second supervised data may be data that are manually or machine-labeled with respect to the sample image, and specifically may include the sample image and real 3D object information of the target object in the sample image. The real 3D object information may include classification information, position information, size information, angle information, and the like of the target object. Unsupervised data may be an image of a sample that has not been manually labeled.
More specifically, in an application scenario where the target detection model is 3D object detection for a monocular visual image, the sample image may be a monocular visual image acquired with a monocular visual sensor.
It should be noted that the first supervised data used in the first stage training and the second supervised data used in the second stage training may be the same data or different data.
Exemplarily, in step S101, the first stage training may be supervised training.
In a specific example, the supervised training of the initial model to be trained by using the first supervised data may specifically include the following steps:
and inputting the first supervised data into the initial model to be trained to obtain an initial detection result corresponding to the first supervised data. And determining the difference between the labeling information corresponding to the first supervised data and the initial detection result, adjusting the model parameters of the initial model to be trained according to the difference, and performing multiple iterations until a preselected detection model meeting the preset convergence condition is obtained. The difference between the initial detection result and the labeling information can be understood as an output error of the initial model, and can be specifically determined by a first loss function constructed in advance. Wherein the first loss function is a function with the model parameters as arguments.
And adjusting the model parameters of the initial model according to the difference between the initial detection result and the labeling information, wherein the error is propagated reversely in each layer of the initial model to be trained, and the parameters of each layer of the initial model to be trained are adjusted according to the error until the output result of the initial model to be trained is converged or the expected effect is achieved, so that the preselected detection model is obtained.
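A minimal PyTorch-style sketch of this first-stage supervised loop is given below; the model, data loader, and loss interfaces are placeholder assumptions, and a fixed epoch count stands in for the convergence check:

```python
import torch
from torch import nn

def train_first_stage(model: nn.Module,
                      loader,              # yields (image, label) pairs of first supervised data
                      loss_fn,             # the "first loss function": detection output vs. labels
                      epochs: int = 10,
                      lr: float = 1e-3) -> nn.Module:
    """First-stage supervised training of the initial model."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, label in loader:
            pred = model(image)          # initial detection result
            loss = loss_fn(pred, label)  # difference from the labeling information
            optimizer.zero_grad()
            loss.backward()              # error back-propagates through each layer
            optimizer.step()             # adjust the parameters of each layer
    return model                         # the preselected detection model
```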
It can be understood that the preselected detection model obtained after the first-stage training is only a rough model. Although it already has a certain ability to detect 3D object information in an input image to be detected, its detection precision and generalization cannot yet meet the requirements of the final target detection model, so second-stage training needs to be performed on the preselected detection model to improve its detection precision and generalization.
Illustratively, in step S102, the second-stage training of the preselected detection model may include both supervised training and self-supervised training. Specifically, the preselected detection model may undergo supervised training using the second supervised data and self-supervised training using the unsupervised data.
It is understood that, in the second-stage training process, the supervised training may be performed first and then the self-supervised training, or the self-supervised training may be performed first and then the supervised training, or the supervised training and the self-supervised training may be performed alternately.
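As a sketch of one of these orderings (the alternating one), the loop below interleaves one supervised step with one self-supervised step; supervised_step and self_supervised_step are hypothetical helpers standing in for the procedures detailed in the embodiments that follow:

```python
def train_second_stage(model, supervised_loader, unsupervised_loader,
                       supervised_step, self_supervised_step, epochs: int = 10):
    """Second-stage training alternating supervised and self-supervised steps."""
    for _ in range(epochs):
        for (image, label), raw_image in zip(supervised_loader, unsupervised_loader):
            supervised_step(model, image, label)    # supervised training on second supervised data
            self_supervised_step(model, raw_image)  # self-supervised training on unsupervised data
    return model                                    # the target detection model
```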
In a specific application example, the training method of the model according to the embodiment of the present disclosure may be applied to scenes involving 3D object detection, such as intelligent robots or automatic driving.
For example, in an automatic driving scenario, a monocular visual image of a target area is acquired by a monocular visual sensor of an autonomous vehicle and input into a target detection model trained with the model training method of the embodiments of the present disclosure. The 3D object information of target objects such as vehicles, pedestrians, and road signs in the monocular visual image can thus be obtained, and the automatic driving function of the vehicle is realized based on this 3D object information.
According to the model training method of the embodiments of the present disclosure, supervised training is performed on the initial model to be trained by using the first supervised data to obtain the preselected detection model, and the second supervised data and the unsupervised data are then used to perform supervised training and self-supervised training, respectively, on the preselected detection model. In this way, the self-supervision mechanism can mine the information in the unsupervised data to improve the detection precision and generalization of the preselected detection model, until a target detection model meeting the preset conditions is obtained. Moreover, because unsupervised data is used to train the preselected detection model, the amount of first and second supervised data is reduced, the labor and time costs of manual labeling are lowered, and the training efficiency of the model is improved.
As shown in fig. 2, in one embodiment, step S102 includes:
S201: preprocessing the unsupervised data to obtain first enhanced data and second enhanced data;
S202: inputting the first enhanced data and the second enhanced data into the preselected detection model respectively to obtain a first unsupervised detection result and a second unsupervised detection result, wherein the confidence of the first unsupervised detection result is greater than that of the second unsupervised detection result;
S203: performing self-supervised training on the preselected detection model according to the difference between the first unsupervised detection result and the second unsupervised detection result.
Exemplarily, in step S201, data enhancement processing of different degrees may be performed on the unsupervised data to obtain the first enhanced data and the second enhanced data, respectively. The enhancement degrees of the two differ: the first enhancement interferes with the original image more weakly than the second enhancement. Correspondingly, the confidences of the detection results that the preselected detection model obtains on the two kinds of enhanced data differ, that is, the confidence of the first unsupervised detection result corresponding to the (weakly enhanced) first enhanced data is greater than the confidence of the second unsupervised detection result corresponding to the (strongly enhanced) second enhanced data.
The preprocessing of the unsupervised data can be understood as applying interference to the unsupervised data, which increases the detection difficulty of the input data for the preselected detection model, so that its detection capability and generalization are continuously strengthened.
For example, in step S203, a difference between the first unsupervised detection result and the second unsupervised detection result may be calculated by using a second loss function constructed in advance, and the model parameters of the preselected detection model may be adjusted according to the difference.
It can be understood that, in the self-supervised training process, since the confidence of the first unsupervised detection result is greater than that of the second unsupervised detection result, the first unsupervised detection result can be used as pseudo-label information for the unsupervised data to supervise the detection of the preselected detection model on the second enhanced data.
According to this embodiment, preprocessing the unsupervised data with different enhancement degrees yields first and second unsupervised detection results with different confidences, and the difference between them is used to perform self-supervised training on the preselected detection model. The learnable part of the unsupervised data can thus be fully mined, further strengthening the generalization capability and detection capability of the preselected detection model, while greatly reducing the amount of supervised data needed and hence the labor and time costs of manual labeling.
As shown in fig. 3, in one embodiment, step S201 includes:
S301: performing first enhancement processing on the unsupervised data to obtain the first enhanced data, wherein the first enhancement processing comprises illumination change processing and/or color change processing; and
S302: performing second enhancement processing on the unsupervised data to obtain the second enhanced data, wherein the second enhancement processing comprises at least one of stretching processing, cropping processing, translation processing and random occlusion processing.
It is to be understood that, in step S301, the first enhancement processing may be processing that only weakly interferes with the original information of the sample image contained in the unsupervised data, that is, processing that does not affect the resolution or original scale of the sample image. For example, the illumination intensity of the sample image contained in the unsupervised data may be changed to obtain image data with a different illumination intensity; and/or the values of the components in the RGB (red, green, blue) color space of the unsupervised data may be changed to obtain image data with different color values.
In step S302, the second enhancement processing may be processing that strongly interferes with the original information of the sample image contained in the unsupervised data, that is, processing that affects the resolution and original scale of the sample image. For example, interference may be applied to the resolution, size, and position of the sample image, or parts of it may be occluded.
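A sketch of the two preprocessing branches using torchvision is shown below; the specific transform types and parameter values are illustrative assumptions consistent with the weak/strong split described above:

```python
from torchvision import transforms

# First enhancement: weak interference that preserves resolution and scale,
# perturbing only illumination (brightness/contrast) and color (saturation/hue).
first_enhancement = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
])

# Second enhancement: strong interference that affects resolution, scale,
# position, and local visibility (stretching, cropping, translation, occlusion).
second_enhancement = transforms.Compose([
    transforms.RandomResizedCrop(size=(384, 1280), scale=(0.6, 1.0)),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),  # random occlusion of a local region
])
```

Note that the geometric transforms in the second branch would also have to be applied to any box coordinates that are compared across the two branches; that bookkeeping is omitted from the sketch.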
Through this embodiment, first enhanced data and second enhanced data with different enhancement degrees can be obtained, so that the preselected detection model outputs first and second unsupervised detection results with different confidences for the same unsupervised data.
As shown in fig. 4, in an embodiment, the first unsupervised detection result includes a first prediction box and a prediction box confidence and a classification confidence corresponding to the first prediction box, and step S203 includes:
S401: calculating a comprehensive confidence of the first prediction box according to the prediction box confidence and the classification confidence;
S402: selecting, from the plurality of first prediction boxes, a first prediction box whose comprehensive confidence meets a preset condition as a pseudo label;
S403: inputting the second enhanced data into the preselected detection model according to the pseudo label, and performing self-supervised training on the preselected detection model.
The prediction box confidence characterizes the accuracy of the prediction box output by the preselected detection model, and the classification confidence characterizes the accuracy of the classification result that the preselected detection model outputs for the prediction box.
Illustratively, the prediction box confidence and the classification confidence may be obtained by a prediction box confidence branch network and a classification confidence branch network of the preselected detection model, respectively.
The prediction box confidence branch network can be trained during the first-stage training according to the difference between the initial prediction box that the initial network generates for the first supervised data and the real box of the target object contained in the first supervised data, so as to obtain a prediction box confidence branch network meeting the preset conditions.
More specifically, the prediction box confidence branch network may obtain the confidence of the initial prediction box output by the initial network by calculating the ratio of the intersection to the union (that is, the intersection over union, IoU) between the initial prediction box and the real box.
Likewise, the classification confidence branch network can be trained during the first-stage training according to the difference between the classification result that the preselected detection model outputs for the target object in the prediction box and the real classification information, so as to obtain a classification confidence branch network meeting the preset conditions.
Illustratively, in step S401, the comprehensive confidence may be obtained by calculating the product or the sum of the prediction box confidence and the classification confidence.
In step S402, a first prediction box whose comprehensive confidence is greater than or equal to a confidence threshold may be selected as a pseudo label.
It can be understood that step S402 filters the first prediction boxes: from the plurality of first prediction boxes, those meeting the preset condition are screened out as pseudo labels, that is, the first prediction boxes with relatively accurate detection results and relatively high confidence are selected, while those not meeting the confidence condition are filtered out.
In step S403, the preselected detection model may be self-supervised trained by using the pseudo label, that is, trained by using the difference between the pseudo label and the second unsupervised detection result that the preselected detection model outputs for the second enhanced data.
Through this implementation, the first prediction boxes are filtered, pseudo labels with a relatively high comprehensive confidence are selected from them, and self-supervised training is performed on the preselected detection model so that the second prediction boxes it outputs for the second enhanced data continuously approach the pseudo labels. The detection precision and generalization of the preselected detection model for image data with stronger interference are thereby continuously improved.
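The pseudo-label selection of steps S401 to S402 might look like the following sketch, where the comprehensive confidence is taken as the product of the two confidences (the sum mentioned above is equally possible) and the threshold value is an assumed hyperparameter:

```python
from typing import List, Tuple

Box = List[float]  # an assumed box encoding, e.g. (x1, y1, x2, y2)

def select_pseudo_labels(predictions: List[Tuple[Box, float, float]],
                         threshold: float = 0.7) -> List[Box]:
    """Filter first prediction boxes by comprehensive confidence.

    Each prediction is a (box, prediction box confidence, classification
    confidence) triple; boxes whose comprehensive confidence meets the
    preset condition are kept as pseudo labels, the rest are filtered out.
    """
    pseudo_labels = []
    for box, box_conf, cls_conf in predictions:
        comprehensive = box_conf * cls_conf  # product variant of S401
        if comprehensive >= threshold:       # preset condition of S402
            pseudo_labels.append(box)
    return pseudo_labels
```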
As shown in fig. 5, in one embodiment, step S403 includes:
S501: inputting the second enhanced data into the preselected detection model to obtain the second unsupervised detection result;
S502: adjusting the model parameters of the preselected detection model according to the difference between the pseudo label and the second unsupervised detection result.
Illustratively, the difference between the pseudo label and the second unsupervised detection result may be determined by calculating the intersection over union (IoU) between the two. The IoU is obtained by calculating the intersection and the union of the second unsupervised detection result and the pseudo label and then taking the ratio of the intersection to the union; it characterizes the degree of overlap between the two.
The larger the IoU, the higher the degree of overlap between the second unsupervised detection result and the pseudo label and the smaller the difference between them; the smaller the IoU, the lower the degree of overlap and the larger the difference.
Illustratively, the difference between the two can be calculated by a second loss function set in advance, and the model parameters of the preselected detection model may then be adjusted according to the second loss function.
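For illustration, the IoU between a pseudo label and a second unsupervised detection result can be computed as below for axis-aligned 2D boxes; 3D boxes follow the same intersection/union principle with volumes, and rotated boxes need additional geometry:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).
    An IoU near 1 means high overlap (small difference); near 0, the opposite."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1 / 7 ≈ 0.143
```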
Through this implementation, the self-supervised training of the preselected detection model by the pseudo label is realized: the parameters of the preselected detection model are adjusted according to the difference between the pseudo label and the second unsupervised detection result, so that the detection result output by the preselected detection model continuously approaches the pseudo label until the preselected detection model meets the preset convergence condition, thereby obtaining the target detection model.
As shown in fig. 6, in one embodiment, step S102 includes:
S601: inputting the second supervised data into the preselected detection model to obtain a supervised detection result;
S602: adjusting the model parameters of the preselected detection model according to the difference between the supervised detection result and the labeling information corresponding to the second supervised data.
In the supervised part of the second-stage training, for example, the difference between the supervised detection result corresponding to the second supervised data and the labeling information may be calculated by using a third loss function constructed in advance, and the model parameters of the preselected detection model may be adjusted according to the difference until the preselected detection model reaches the preset convergence condition.
The convergence condition of the supervised training in the second stage can be stricter than that in the first stage, so that the model reaches a higher detection accuracy after the second-stage supervised training.
It will be appreciated that the second supervised data used in the second stage training may be the same as or different from the first supervised data used in the first stage training.
Through this implementation, the second supervised data can be used to perform regularization training on the preselected detection model during the second stage, further improving its detection accuracy.
According to an embodiment of the present disclosure, the present disclosure also provides a method for detecting an image.
As shown in fig. 7, the image detection method according to the embodiment of the present disclosure includes:
S701: inputting an image to be detected into a target detection model;
S702: receiving the 3D object information in the image to be detected from the target detection model, where the target detection model is obtained by the model training method according to the embodiments of the present disclosure.
Illustratively, the image detection method of the embodiment of the disclosure can be applied to the technical field of intelligent robots or automatic driving, and is used for performing 3D object detection on an image to be detected of a target area.
More specifically, the image to be detected in the embodiment of the present disclosure may be a monocular visual image acquired by using a monocular visual sensor, and the target detection model may be used to detect a 3D object in the monocular visual image.
According to the image detection method of the embodiments of the present disclosure, the target detection model trained with the model training method of the present disclosure can detect the 3D object information contained in an image with high precision. Since the target detection model also has high generalization, it can be applied to various scenarios such as intelligent robots and automatic driving, giving the method a wide application range.
In one embodiment, the 3D object information includes at least one of classification information, position information, size information, and angle information.
In one particular example, the target detection model includes a feature extraction layer, a 2D detection head network, and a 3D detection head network. The 2D detection head network is used for outputting a prediction box of the target object in the image to be detected, together with the classification information and position information related to the prediction box, according to the high-level semantic features extracted by the feature extraction layer; the 3D detection head network is used for outputting the size information and angle information related to the prediction box according to the same high-level semantic features.
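A minimal sketch of this head arrangement follows; the backbone depth, channel widths, and output encodings are assumptions for illustration rather than the patent's actual configuration:

```python
import torch
from torch import nn

class MonocularDetector(nn.Module):
    """A shared feature extraction layer feeding a 2D head (classification +
    prediction box position) and a 3D head (size + angle), per the example above."""
    def __init__(self, num_classes: int = 3, channels: int = 64):
        super().__init__()
        self.features = nn.Sequential(  # feature extraction layer
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head_2d = nn.Conv2d(channels, num_classes + 4, 1)  # class scores + 2D box
        self.head_3d = nn.Conv2d(channels, 3 + 1, 1)            # 3D size (h, w, l) + yaw

    def forward(self, image: torch.Tensor):
        semantic = self.features(image)  # high-level semantic features
        return self.head_2d(semantic), self.head_3d(semantic)

# Example forward pass on one 384x1280 monocular image:
out_2d, out_3d = MonocularDetector()(torch.randn(1, 3, 384, 1280))
```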
Through this embodiment, multi-dimensional information about the target object in the image can be output, which effectively improves the positioning accuracy for 3D objects in the image.
According to an embodiment of the present disclosure, the present disclosure also provides a training apparatus of a model.
As shown in fig. 8, the training apparatus for the model includes:
a first-stage training module 801, configured to perform first-stage training on an initial model to be trained by using first supervised data to obtain a preselected detection model;
a second-stage training module 802, configured to perform second-stage training on the preselected detection model by using second supervised data and unsupervised data to obtain a target detection model;
the target detection model is used for outputting 3D object information in the image to be detected according to the input image to be detected.
In one embodiment, the second stage training module 802 includes:
the preprocessing module is used for preprocessing the unsupervised data to obtain first enhanced data and second enhanced data;
the detection result acquisition module is used for respectively inputting the first enhanced data and the second enhanced data into a preselected detection model to obtain a first unsupervised detection result and a second unsupervised detection result, wherein the confidence coefficient of the first unsupervised detection result is greater than that of the second unsupervised detection result;
and the training module is used for performing self-supervised training on the preselected detection model according to the difference between the first unsupervised detection result and the second unsupervised detection result.
In one embodiment, the pre-processing module comprises:
the first preprocessing unit is used for performing first enhancement processing on the unsupervised data to obtain the first enhanced data, the first enhancement processing comprising illumination change processing and/or color change processing; and
the second preprocessing unit is used for performing second enhancement processing on the unsupervised data to obtain the second enhanced data, the second enhancement processing comprising at least one of stretching processing, cropping processing, translation processing and random occlusion processing.
In one embodiment, the first unsupervised detection result includes a first prediction box and a prediction box confidence and a classification confidence corresponding to the first prediction box;
the training module comprises:
the comprehensive confidence calculation sub-module is used for calculating the comprehensive confidence of the first prediction box according to the prediction box confidence and the classification confidence;
the pseudo label selection sub-module is used for selecting, from the plurality of first prediction boxes, a first prediction box whose comprehensive confidence meets a preset condition as a pseudo label;
and the self-supervised training sub-module is used for inputting the second enhanced data into the preselected detection model according to the pseudo label and performing self-supervised training on the preselected detection model.
In one embodiment, the self-supervised training submodule comprises:
the second unsupervised detection result acquisition unit is used for inputting second enhanced data into the preselected detection model to obtain a second unsupervised detection result;
and the parameter adjusting unit is used for adjusting the model parameters of the preselected detection model according to the difference between the pseudo label and the second unsupervised detection result.
In one embodiment, the second stage training module 802 includes:
the supervised detection result acquisition submodule is used for inputting second supervised data into the preselected detection model to obtain a supervised detection result;
and the model parameter adjusting submodule is used for adjusting the model parameters of the preselected detection model according to the difference between the supervised detection result and the labeling information corresponding to the second supervised data.
According to the embodiment of the disclosure, the disclosure further provides an image detection device.
As shown in fig. 9, the apparatus for detecting an image according to an embodiment of the present disclosure includes:
an input module 901, configured to input an image to be detected into a target detection model;
a receiving module 902, configured to receive, from a target detection model, 3D object information in an image to be detected;
wherein, the target detection model is obtained according to the training device of the model of the above embodiment of the present disclosure.
In one embodiment, the 3D object information includes at least one of classification information, position information, size information, and angle information.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1001 performs the methods and processes described above, such as the model training method and/or the image detection method. For example, in some embodiments, the model training method and/or the image detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the model training method and/or the image detection method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured by any other suitable means (e.g., by means of firmware) to perform the model training method and/or the image detection method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of training a model, comprising:
carrying out first-stage training on an initial model to be trained by utilizing first supervised data to obtain a preselected detection model;
performing second-stage training on the preselected detection model by using second supervised data and unsupervised data to obtain a target detection model;
the target detection model is used for outputting 3D object information in the image to be detected according to the input image to be detected.
2. The method of claim 1, wherein the second stage training of the preselected detection model using second supervised and unsupervised data comprises:
preprocessing the unsupervised data to obtain first enhanced data and second enhanced data;
inputting the first enhanced data and the second enhanced data into the preselected detection model respectively to obtain a first unsupervised detection result and a second unsupervised detection result, wherein the confidence of the first unsupervised detection result is greater than that of the second unsupervised detection result;
performing self-supervised training on the preselected detection model according to a difference between the first unsupervised detection result and the second unsupervised detection result.
3. The method of claim 2, wherein preprocessing the unsupervised data to obtain first enhanced data and second enhanced data comprises:
performing first enhancement processing on the unsupervised data to obtain the first enhanced data, wherein the first enhancement processing comprises illumination change processing and/or color change processing; and
performing second enhancement processing on the unsupervised data to obtain the second enhanced data, wherein the second enhancement processing comprises at least one of stretching processing, cropping processing, translation processing and random occlusion processing.
4. The method of claim 2, wherein the first unsupervised detection result comprises a first prediction box and a prediction box confidence and a classification confidence corresponding to the first prediction box;
performing self-supervised training on the preselected detection model according to a difference between the first unsupervised detection result and the second unsupervised detection result comprises:
calculating a comprehensive confidence of the first prediction box according to the prediction box confidence and the classification confidence;
selecting, from the plurality of first prediction boxes, a first prediction box whose comprehensive confidence meets a preset condition as a pseudo label; and
inputting the second enhanced data into the preselected detection model according to the pseudo label and performing self-supervised training on the preselected detection model.
5. The method of claim 4, wherein inputting the second enhanced data into the preselected detection model according to the pseudo label and performing self-supervised training on the preselected detection model comprises:
inputting the second enhanced data into the preselected detection model to obtain the second unsupervised detection result;
adjusting the model parameters of the preselected detection model according to the difference between the pseudo label and the second unsupervised detection result.
6. The method of claim 1, wherein the second stage training of the preselected detection model using second supervised and unsupervised data comprises:
inputting the second supervised data into the preselected detection model to obtain a supervised detection result;
and adjusting the model parameters of the preselected detection model according to the difference between the supervised detection result and the labeling information corresponding to the second supervised data.
7. A method of detecting an image, comprising:
inputting an image to be detected into a target detection model;
receiving 3D object information in the image to be detected from the target detection model;
wherein the target detection model is obtained by using the model training method according to any one of claims 1 to 6.
8. The method of claim 7, wherein the 3D object information includes at least one of classification information, position information, size information, and angle information.
9. An apparatus for training a model, comprising:
the first-stage training module is used for carrying out first-stage training on an initial model to be trained by utilizing first supervised data to obtain a preselected detection model;
the second-stage training module is used for carrying out second-stage training on the preselected detection model by utilizing second supervised data and unsupervised data to obtain a target detection model;
the target detection model is used for outputting 3D object information in the image to be detected according to the input image to be detected.
10. The apparatus of claim 9, wherein the second stage training module comprises:
the preprocessing module is used for preprocessing the unsupervised data to obtain first enhanced data and second enhanced data;
a detection result obtaining module, configured to input the first enhanced data and the second enhanced data into the preselected detection model respectively to obtain a first unsupervised detection result and a second unsupervised detection result, where a confidence of the first unsupervised detection result is greater than a confidence of the second unsupervised detection result;
and the training module is used for performing self-supervised training on the preselected detection model according to the difference between the first unsupervised detection result and the second unsupervised detection result.
11. The apparatus of claim 10, wherein the preprocessing module comprises:
the first preprocessing unit is used for performing first enhancement processing on the unsupervised data to obtain the first enhanced data, the first enhancement processing comprising illumination change processing and/or color change processing; and
the second preprocessing unit is used for performing second enhancement processing on the unsupervised data to obtain the second enhanced data, the second enhancement processing comprising at least one of stretching processing, cropping processing, translation processing and random occlusion processing.
12. The apparatus of claim 10, wherein the first unsupervised detection result comprises a first prediction box and a prediction box confidence and a classification confidence corresponding to the first prediction box;
the training module comprises:
a comprehensive confidence calculation sub-module, configured to calculate a comprehensive confidence of the first prediction box according to the prediction box confidence and the classification confidence;
a pseudo label selection sub-module, configured to select, from the plurality of first prediction boxes, a first prediction box whose comprehensive confidence meets a preset condition as a pseudo label;
and the self-supervised training sub-module, configured to input the second enhanced data into the preselected detection model according to the pseudo label and perform self-supervised training on the preselected detection model.
13. The apparatus of claim 12, wherein the self-supervised training sub-module comprises:
the second unsupervised detection result acquisition unit is used for inputting the second enhanced data into the preselected detection model to obtain a second unsupervised detection result;
and the parameter adjusting unit is used for adjusting the model parameters of the preselected detection model according to the difference between the pseudo label and the second unsupervised detection result.
14. The apparatus of claim 9, wherein the second stage training module comprises:
the supervised detection result acquisition submodule is used for inputting the second supervised data into the preselected detection model to obtain a supervised detection result;
and the model parameter adjusting submodule is used for adjusting the model parameters of the preselected detection model according to the difference between the supervised detection result and the labeling information corresponding to the second supervised data.
15. An apparatus for detecting an image, comprising:
the input module is used for inputting the image to be detected into the target detection model;
the receiving module is used for receiving the 3D object information in the image to be detected from the target detection model;
wherein the object detection model is derived by a training apparatus of a model according to any one of claims 9 to 14.
16. The apparatus of claim 15, wherein the 3D object information includes at least one of classification information, position information, size information, and angle information.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN202111153483.5A 2021-09-29 2021-09-29 Model training method, device, equipment, storage medium and detection method Pending CN113901998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111153483.5A CN113901998A (en) 2021-09-29 2021-09-29 Model training method, device, equipment, storage medium and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111153483.5A CN113901998A (en) 2021-09-29 2021-09-29 Model training method, device, equipment, storage medium and detection method

Publications (1)

Publication Number Publication Date
CN113901998A true CN113901998A (en) 2022-01-07

Family

ID=79189388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111153483.5A Pending CN113901998A (en) 2021-09-29 2021-09-29 Model training method, device, equipment, storage medium and detection method

Country Status (1)

Country Link
CN (1) CN113901998A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565811A (en) * 2022-03-02 2022-05-31 腾讯科技(深圳)有限公司 Video processing method, device, equipment, storage medium and program product
CN115578797A (en) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 Model training method, image recognition device and electronic equipment
CN115578797B (en) * 2022-09-30 2023-08-29 北京百度网讯科技有限公司 Model training method, image recognition device and electronic equipment
CN116468985A (en) * 2023-03-22 2023-07-21 北京百度网讯科技有限公司 Model training method, quality detection device, electronic equipment and medium
CN116468985B (en) * 2023-03-22 2024-03-19 北京百度网讯科技有限公司 Model training method, quality detection device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination