CN116681123B - Perception model training method, device, computer equipment and storage medium - Google Patents

Perception model training method, device, computer equipment and storage medium

Info

Publication number
CN116681123B
Authority
CN
China
Prior art keywords
pseudo
data
model
training
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310950761.2A
Other languages
Chinese (zh)
Other versions
CN116681123A (en)
Inventor
洪伟
李帅君
朱子凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foss Hangzhou Intelligent Technology Co Ltd
Original Assignee
Foss Hangzhou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foss Hangzhou Intelligent Technology Co Ltd filed Critical Foss Hangzhou Intelligent Technology Co Ltd
Priority to CN202310950761.2A
Publication of CN116681123A
Application granted
Publication of CN116681123B
Legal status: Active


Classifications

    • B60W 60/001 Drive control systems specially adapted for autonomous road vehicles; planning or execution of driving tasks
    • B60W 50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • G06N 3/045 Neural network architectures; combinations of networks
    • G06N 3/096 Neural network learning methods; transfer learning
    • G06N 3/098 Neural network learning methods; distributed learning, e.g. federated learning
    • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 10/95 Hardware or software architectures specially adapted for image or video understanding, structured as a network, e.g. client-server architectures
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/588 Recognition of the road, e.g. of lane markings; recognition of the vehicle driving pattern in relation to the road
    • H04L 67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • B60W 2050/0028 Control system elements or transfer functions; mathematical models, e.g. for simulation
    • Y02T 10/40 Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a perception model training method and apparatus, a computer device, and a storage medium. The method comprises the following steps: obtaining unlabeled data determined based on a terminal model; inputting the unlabeled data into a plurality of cloud models to generate first pseudo-labeled data, and self-training the cloud models based on the first pseudo-labeled data; inputting the unlabeled data into the terminal model to generate second pseudo-labeled data; and updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain an updated terminal model. By adopting the method, the training efficiency of the perception model can be improved.

Description

Perception model training method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of intelligent perception technology, and in particular to a perception model training method, apparatus, computer device, and storage medium.
Background
Most current perception strategies in the autonomous driving field are developed and iterated on deep learning algorithms, whose effectiveness depends on the size of the model and on the quantity and quality of the labeled data. Mainly for cost reasons, the perception model actually deployed at the vehicle end is far smaller than the cloud perception model, so the cloud large model performs far better than the vehicle-end small model.
To improve the vehicle-end small model, the mainstream optimization direction is knowledge distillation, whose main idea is to transfer information from the cloud large model to the vehicle-end small model. For example, the intermediate features of the cloud large model, or its generated results, are used as pseudo labels to train the vehicle-end small model and improve its perception performance.
However, AI model training in such a separated data closed-loop system optimizes only the vehicle-end model, so the overall model training process is inefficient.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a perception model training method, apparatus, computer device, and storage medium that can improve the efficiency of perception model training.
In a first aspect, the present application provides a method for training a perception model. The method comprises the following steps:
obtaining unlabeled data determined based on a terminal model;
inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data;
inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
and updating the terminal model through distillation training according to the first pseudo labeling data and the second pseudo labeling data to obtain a terminal updating model.
In one embodiment, the unlabeled data comprises a plurality of consecutive frames of unlabeled data.
In one embodiment, the inputting the unlabeled data into the plurality of cloud models, generating the first pseudo-labeled data includes: identifying the unlabeled data through a plurality of cloud models to generate a plurality of first pseudo tags and a plurality of first intermediate features; tracking a plurality of first pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate first tracking pseudo tags; performing association processing on the plurality of first pseudo tags and the plurality of first intermediate features according to the time sequence of the first tracking pseudo tags and the continuous multi-frame unlabeled data to generate first continuous frame pseudo tags and first continuous frame intermediate features; and taking the first continuous frame pseudo tag and the first continuous frame intermediate feature as first pseudo labeling data.
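By way of non-limiting illustration, the following Python sketch outlines this first pipeline: each cloud model infers per-frame pseudo labels and intermediate features, a multi-target tracker assigns persistent IDs, and labels and features are then associated in frame order. All names here (FrameResult, infer, associate) are assumptions of the sketch rather than APIs defined by the embodiment.

```python
from dataclasses import dataclass, field

@dataclass
class FrameResult:
    pseudo_labels: list      # per-object pseudo labels predicted for one frame
    features: object         # intermediate (backbone) feature map for the frame
    track_ids: list = field(default_factory=list)

def generate_first_pseudo_annotations(cloud_models, frames, tracker):
    """Run every cloud model over the unlabeled continuous frames, track the
    per-frame pseudo labels, and associate labels and features in frame order."""
    first_pseudo_data = []
    for model in cloud_models:                       # the plurality of cloud models
        results = [FrameResult(*model.infer(f)) for f in frames]
        for prev, cur in zip(results, results[1:]):  # tracking pseudo labels
            cur.track_ids = tracker.associate(prev, cur)
        con_pseudo_labels = [(r.track_ids, r.pseudo_labels) for r in results]
        con_features = [r.features for r in results]
        first_pseudo_data.append((con_pseudo_labels, con_features))
    return first_pseudo_data  # continuous-frame pseudo labels + intermediate features
```

The same skeleton applies to the terminal model in the next embodiment, with a single model in place of the list of cloud models.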
In one embodiment, the inputting the unlabeled data into the terminal model, generating the second pseudo-labeled data includes: identifying the unlabeled data through the terminal model to generate a plurality of second pseudo tags and a plurality of second intermediate features; tracking a plurality of second pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate second tracking pseudo tags; performing association processing on the plurality of second pseudo tags and the plurality of second intermediate features according to the time sequence of the second tracking pseudo tags and the continuous multi-frame unlabeled data to generate second continuous frame pseudo tags and second continuous frame intermediate features; and taking the second continuous frame pseudo tag and the second continuous frame intermediate feature as second pseudo labeling data.
In one embodiment, updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain the updated terminal model includes: determining difficult-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data; and updating the terminal model through distillation training according to the difficult-case pseudo-label data to obtain the updated terminal model.
In one embodiment, determining the difficult-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data includes: calculating a Euclidean distance and/or a loss value between the first pseudo-labeled data and the second pseudo-labeled data; and taking the first pseudo-labeled data whose Euclidean distance and/or loss value exceeds a preset threshold as the difficult-case pseudo-label data.
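A minimal sketch of this threshold-based difficult-case selection, assuming the paired pseudo annotations have been vectorized to equal-shape arrays; the optional loss_fn stands in for a detection-head loss such as CenterHeadLoss, and both it and the data layout are assumptions of the sketch:

```python
import numpy as np

def mine_hard_examples(first_data, second_data, dist_thresh,
                       loss_fn=None, loss_thresh=None):
    """Select difficult-case pseudo-label data: entries where the cloud (first)
    and terminal (second) pseudo annotations disagree beyond a preset threshold."""
    hard = []
    for big, small in zip(first_data, second_data):
        dist = float(np.linalg.norm(np.asarray(big) - np.asarray(small)))
        over = dist > dist_thresh
        if loss_fn is not None and loss_thresh is not None:
            over = over or loss_fn(big, small) > loss_thresh
        if over:
            hard.append(big)  # the first pseudo-labeled entry becomes hard-case data
    return hard
```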
In one embodiment, updating the terminal model through distillation training according to the difficult-case pseudo-label data to obtain the updated terminal model includes: performing center-point training with the second continuous-frame pseudo label as the label of the terminal model's detection head, and training the Euclidean distance between the first continuous-frame intermediate features and the terminal model's backbone network features, so as to update the terminal model and obtain the updated terminal model.
In one embodiment, the distillation training comprises a single-frame distillation training and a multi-frame time sequence distillation training, wherein the single-frame distillation training is performed based on intermediate features or pseudo tags of a cloud perception model; and the multi-frame time sequence distillation training is performed based on multi-frame results or pseudo labels of the cloud perception model or the cloud pre-labeling model.
In one embodiment, self-training the cloud models based on the first pseudo-labeled data includes: determining difference values among the plurality of first pseudo labels in the first pseudo-labeled data; removing data from the first pseudo-labeled data based on the difference values to obtain filtered first pseudo-labeled data; obtaining pre-labeled data; and self-training the cloud models according to the filtered first pseudo-labeled data and the pre-labeled data.
In a second aspect, the application further provides a device for training the perception model. The device comprises:
the acquisition module is used for acquiring unlabeled data determined based on the terminal model;
the self-training module is used for inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data and self-training the cloud models based on the first pseudo-labeled data;
the calculation module is used for inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
and the distillation training module is used for updating the terminal model through distillation training according to the first pseudo-annotation data and the second pseudo-annotation data to obtain a terminal updating model.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
obtaining unlabeled data determined based on a terminal model;
inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data;
inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
and updating the terminal model through distillation training according to the first pseudo labeling data and the second pseudo labeling data to obtain a terminal updating model.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of:
obtaining unlabeled data determined based on a terminal model;
inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data;
inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
and updating the terminal model through distillation training according to the first pseudo labeling data and the second pseudo labeling data to obtain a terminal updating model.
Through the above perception model training method, apparatus, computer device, storage medium, and computer program product, unlabeled data determined based on a terminal model is obtained; the unlabeled data is input into a plurality of cloud models to generate first pseudo-labeled data, and the cloud models are self-trained based on the first pseudo-labeled data; the unlabeled data is input into the terminal model to generate second pseudo-labeled data; and the terminal model is updated through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain an updated terminal model. This solves the problem of inefficient perception model training and achieves the technical effect of improving perception model training efficiency.
Drawings
FIG. 1 is a diagram of an application environment for a perception model training method in one embodiment;
FIG. 2 is a flow chart of a method of training a perception model in one embodiment;
FIG. 3 is a schematic diagram of cloud end model self-training in one embodiment;
FIG. 4 is a schematic diagram of a method of training a perception model in a preferred embodiment;
FIG. 5 is a diagram of a vehicle end perception model optimization architecture in one embodiment;
FIG. 6 is a schematic diagram of a cloud multi-model optimization method for a 2D vehicle-end perception minimodel according to another embodiment;
FIG. 7 is a block diagram of a perception model training apparatus in one embodiment;
fig. 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Intelligent driving refers to technology in which a robot assists driving and, under special conditions, completely replaces human driving; it is essentially a cognitive engineering problem of attention attraction and distraction. Intelligent driving uses a computer system to reach a state in which automatic driving is possible with little manual intervention. As an important product of the industrial revolution and of informatization, intelligent driving is an important component of the strategic emerging industries and an important branch of today's artificial intelligence era. Target perception technology is an important component of intelligent driving and the premise and foundation of intelligent control. In the related art, perception models are mostly trained by combining intelligent algorithms with deep learning; however, limited by the data processing capability of the vehicle end, a mode separating cloud model training from vehicle-end model training is often adopted, with the vehicle-end model obtained from the cloud model through transfer learning. This separated training mode leads to slow model optimization iteration and low model training efficiency. Accordingly, the embodiments of the present application provide a perception model training method to improve the training efficiency of the perception model.
The perception model training method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process, and may be integrated on the server 104 or placed on a cloud or other network server. The terminal 102 may be a vehicle-mounted computer, an in-vehicle head unit, or an Internet of Things device such as a smart speaker, smart television, smart air conditioner, or smart vehicle-mounted device. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, a method for training a perception model is provided, and the method is applied to the terminal in fig. 1 for illustration, and includes the following steps:
step S201, unlabeled data determined based on a terminal model is obtained.
Specifically, the terminal model refers to a perception model deployed at the vehicle end, for example a perception model arranged on a vehicle. The model can be obtained by training a deep learning model on finely labeled data, that is, image data labeled manually or by high-precision machine labeling. The unlabeled data carries no labels and can be collected by vehicle-end sensors such as lidar and vehicle-mounted cameras.
Step S202, inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data.
Specifically, the cloud model is a perception model trained on a large training set of precisely labeled data, such as a target detection model or a lane line perception model; preferably, on this basis, the perception algorithm can be combined with other perception models such as a Siamese-RPN tracking network or a weather recognition network to obtain the cloud model. After the unlabeled data is input into a cloud model, the cloud model performs inference, recognition, and labeling on it to generate the first pseudo-labeled data. Since the labeling accuracy depends on the accuracy of the cloud model, mislabeling may occur, hence the name pseudo-labeled data. After the pseudo-labeled data is obtained, the cloud model can undergo semi-supervised training combining other pre-labeled and unlabeled data, which improves the cloud model's update-iteration efficiency and recognition accuracy. The related art mostly adopts a separated data closed-loop system in which only the vehicle-end model is optimized during AI model training, so iteration and evolution are very inefficient; in this embodiment, self-training the cloud model realizes a tightly coupled data closed-loop system. The cloud model continuously evolves and iterates on the data screened by the vehicle-end small model, continuously improving labeling accuracy and dynamically supporting the needs of the training service.
And step S203, inputting the unlabeled data into a terminal model to generate second pseudo-labeled data.
And step S204, updating the terminal model through distillation training according to the first pseudo-annotation data and the second pseudo-annotation data to obtain a terminal updating model.
Specifically, distillation training is also known as knowledge distillation. Because of the terminal's limited computing capacity, the terminal model often adopts a lightweight network; yet in intelligent driving scenarios the terminal model needs extremely high precision to ensure driving safety. Knowledge therefore needs to be extracted from the large model and transferred to the small model, so that the small model can approach the effect of the large model; this is the purpose of distillation training. After the first pseudo-labeled data and the second pseudo-labeled data of the terminal model are obtained, the evolution and updating of the terminal model are realized through distillation training.
In the above perception model training method, the terminal model is not trained on the data directly; instead, the cloud models are optimized first, and the performance of the terminal model is then jointly optimized through distillation training, which improves the quality and efficiency of terminal model training.
In one embodiment, the unlabeled data comprises a plurality of consecutive frames of unlabeled data.
Specifically, the conventional knowledge distillation method does not consider continuous-frame characteristics, so the labeling results suffer from low accuracy and poor stability. In this embodiment, the unlabeled data is taken as continuous multi-frame unlabeled data, so that continuous-frame context in perception information such as camera data and radar data can be extracted and abnormal labeling results identified by data analysis, further improving the accuracy and stability of the labeling results. Compared with conventional knowledge distillation based on single-frame comparison, this perception model training method, designed around continuous-frame features, can correct pseudo labels whose classes jump unstably across the cloud model's continuous-frame results, preventing inaccurate single-frame cloud detections from limiting the precision gains of the terminal model; and intermediate knowledge distillation training on continuous-frame features improves the stability of the feature distribution, so the performance of the terminal model can be improved more efficiently and in a more targeted way.
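By way of non-limiting illustration, the following Python sketch shows one way such continuous-frame class-jump correction could be implemented: the per-track classes produced by the tracking step are replaced by each track's majority class, suppressing single-frame jumps. The function and data layout are assumptions of the sketch, not a fixed part of the embodiment.

```python
from collections import Counter

def smooth_track_classes(track_labels):
    """track_labels: {track_id: [predicted class for each frame]}.
    Replace each track's per-frame classes with its majority class,
    so one-frame class jumps in the cloud pseudo labels are corrected."""
    smoothed = {}
    for tid, classes in track_labels.items():
        majority, _ = Counter(classes).most_common(1)[0]
        smoothed[tid] = [majority] * len(classes)
    return smoothed

# e.g. smooth_track_classes({7: ["car", "car", "truck", "car"]})
# -> {7: ["car", "car", "car", "car"]}
```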
In one embodiment, the inputting the unlabeled data into the plurality of cloud models, generating the first pseudo-labeled data includes: identifying the unlabeled data through a plurality of cloud models to generate a plurality of first pseudo tags and a plurality of first intermediate features; tracking a plurality of first pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate first tracking pseudo tags; performing association processing on the plurality of first pseudo tags and the plurality of first intermediate features according to the time sequence of the first tracking pseudo tags and the continuous multi-frame unlabeled data to generate first continuous frame pseudo tags and first continuous frame intermediate features; and taking the first continuous frame pseudo tag and the first continuous frame intermediate feature as first pseudo labeling data.
Specifically, the cloud models comprise cloud perception large models and cloud pre-labeling large models. The cloud perception large models mainly include a target detection model, a lane line detection model, and the like; the cloud pre-labeling large models add, on the basis of the cloud perception large models, classification and tracking networks such as a Siamese-RPN tracking network, weather recognition, and scene recognition. N cloud large models are trained using finely labeled continuous-frame data, denoted BigModels = {BM_1, BM_2, ..., BM_N}; a vehicle-end small model, i.e., the terminal model, is trained using the same finely labeled continuous-frame data, denoted SmallModel. The cloud large models BigModels run inference on unlabeled continuous-frame data to generate the first pseudo labels PseLabels = {PseLabel_1, PseLabel_2, ..., PseLabel_N} and the first intermediate features Bigfeatures = {Bigfeature_1, Bigfeature_2, ..., Bigfeature_N}. Using the SimpleTrack multi-target tracking method, first tracking pseudo labels TrackPseLabels = {TrackPseLabel_1, TrackPseLabel_2, ..., TrackPseLabel_N} are generated from PseLabels, and a second tracking pseudo label TrackSmallPseLabel is generated from the terminal model's second pseudo label SmallPseLabel. According to TrackPseLabels, PseLabels and Bigfeatures are associated by frame order and by the correspondence of each target between preceding and following frames, generating the first continuous-frame pseudo labels conPseLabels = {conPseLabel_1, conPseLabel_2, ..., conPseLabel_N} and the first continuous-frame intermediate features conBigfeatures = {conBigfeature_1, conBigfeature_2, ..., conBigfeature_N}. Each continuous-frame pseudo label conPseLabel is cut into combinations of n frames each, denoted conPseLabel = {{frame_1_conPseLabel, frame_2_conPseLabel, ..., frame_n_conPseLabel}, {frame_2_conPseLabel, frame_3_conPseLabel, ..., frame_{n+1}_conPseLabel}, ..., {frame_{T-n+1}_conPseLabel, ..., frame_T_conPseLabel}}; each first continuous-frame intermediate feature conBigfeature is likewise cut into combinations of n frames, denoted conBigfeature = {{frame_1_conBigfeature, frame_2_conBigfeature, ..., frame_n_conBigfeature}, {frame_2_conBigfeature, frame_3_conBigfeature, ..., frame_{n+1}_conBigfeature}, ..., {frame_{T-n+1}_conBigfeature, ..., frame_T_conBigfeature}}.
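The cutting of each continuous-frame sequence into overlapping n-frame combinations can be sketched as a plain sliding window; the helper name below is illustrative only:

```python
def cut_into_windows(con_items, n):
    """Cut a continuous-frame sequence of length T into overlapping windows of
    n frames: {f_1..f_n}, {f_2..f_{n+1}}, ..., {f_{T-n+1}..f_T}."""
    return [con_items[i:i + n] for i in range(len(con_items) - n + 1)]

# applied to both kinds of continuous-frame data, e.g.
# windows_labels   = cut_into_windows(con_pse_label, n)
# windows_features = cut_into_windows(con_big_feature, n)
```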
In one embodiment, the inputting the unlabeled data into the terminal model, generating the second pseudo-labeled data includes: identifying the unlabeled data through the terminal model to generate a plurality of second pseudo tags and a plurality of second intermediate features; tracking a plurality of second pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate second tracking pseudo tags; performing association processing on the plurality of second pseudo tags and the plurality of second intermediate features according to the time sequence of the second tracking pseudo tags and the continuous multi-frame unlabeled data to generate second continuous frame pseudo tags and second continuous frame intermediate features; and taking the second continuous frame pseudo tag and the second continuous frame intermediate feature as second pseudo labeling data.
In one embodiment, updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain the updated terminal model includes: determining difficult-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data; and updating the terminal model through distillation training according to the difficult-case pseudo-label data to obtain the updated terminal model.
Specifically, the vehicle-end small model runs inference on the unlabeled data to generate target detection pseudo labels, and the tracking algorithm SimpleTrack is used to generate continuous-frame pseudo labels with tracking information. The pseudo labels generated by the vehicle-end small model are compared with the groups of pseudo labels generated by the cloud large models, for example by Euclidean distance and by the loss function CenterHeadLoss; the top-N with the largest differences are extracted, the corresponding first pseudo-labeled data and second pseudo-labeled data are determined to be difficult-case pseudo-label data, and the terminal model is updated through distillation training based on the difficult-case pseudo-label data to obtain the updated terminal model.
In one embodiment, determining the difficult-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data includes: calculating a Euclidean distance and/or a loss value between the first pseudo-labeled data and the second pseudo-labeled data; and taking the first pseudo-labeled data whose Euclidean distance and/or loss value exceeds a preset threshold as the difficult-case pseudo-label data.
Specifically, n consecutive frames are randomly extracted from the unlabeled continuous-frame data, denoted frame_{t~t+n} = {frame_t, frame_{t+1}, ..., frame_{t+n}}. SmallModel runs inference on the unlabeled data frame_{t~t+n} to generate the pseudo label frame_{t~t+n}_SmallPseLabel and the intermediate feature frame_{t~t+n}_Smallfeature. Using the target tracking method, frame_{t~t+n}_conSmallPseLabel and frame_{t~t+n}_conSmallfeature are generated from frame_{t~t+n}_SmallPseLabel. The N cloud large model results corresponding to {frame_t, frame_{t+1}, ..., frame_{t+n}} are extracted from conPseLabels and conBigfeatures respectively, as frame_{t~t+n}_conPseLabels and frame_{t~t+n}_conBigfeatures. Using the mAP metric, frame_{t~t+n}_conSmallPseLabel is compared with each result in frame_{t~t+n}_conPseLabels, noted mAP_N = {mAP_1, mAP_2, ..., mAP_N}; the first pseudo-labeled data corresponding to the N largest differences are selected as frame_{t~t+n}_conPseLabel_top1~N = {frame_{t~t+n}_conPseLabel_top1, frame_{t~t+n}_conPseLabel_top2, ..., frame_{t~t+n}_conPseLabel_topN} and frame_{t~t+n}_conBigfeature_top1~N = {frame_{t~t+n}_conBigfeature_top1, frame_{t~t+n}_conBigfeature_top2, ..., frame_{t~t+n}_conBigfeature_topN}.
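A minimal sketch of this selection step, following the text's "top-N with the largest differences": map_fn is an assumed mAP implementation that scores agreement between the terminal model's window result and one cloud model's window result (higher mAP means smaller difference), and the data layout is likewise an assumption of the sketch.

```python
import numpy as np

def select_topn_by_disagreement(small_pred, cloud_preds, cloud_feats, map_fn, topn):
    """Score each cloud model's continuous-frame result against the terminal
    model's result with an mAP-style metric, then return the top-N pseudo
    labels and intermediate features with the largest difference (lowest mAP)."""
    scores = np.array([map_fn(small_pred, p) for p in cloud_preds])  # mAP_1..mAP_N
    picked = np.argsort(scores)[:topn]           # lowest agreement first
    return ([cloud_preds[i] for i in picked],    # conPseLabel_top1..topN
            [cloud_feats[i] for i in picked])    # conBigfeature_top1..topN
```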
In one embodiment, updating the terminal model through distillation training according to the difficult-case pseudo-label data to obtain the updated terminal model includes: performing center-point training with the second continuous-frame pseudo label as the label of the terminal model's detection head, and training the Euclidean distance between the first continuous-frame intermediate features and the terminal model's backbone network features, so as to update the terminal model and obtain the updated terminal model.
Specifically, the pseudo labels are extracted as labels for pseudo-label distillation training; the main methods used are Euclidean distance regression on intermediate features and pseudo-label supervision. The student network (SmallModel) runs inference on frame_{t~t+n} to generate the inference result frame_{t~t+n}_Output and the intermediate feature frame_{t~t+n}_Smallfeature. frame_{t~t+n}_conPseLabel_top1~N is used as the label of the SmallModel detection head and trained in the CenterPoint manner; frame_{t~t+n}_conBigfeature_top1~N is trained against Smallfeature, the backbone intermediate result of SmallModel, with the Euclidean distance (L2 loss).
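A minimal PyTorch sketch of one such distillation update, under the assumptions that the small model returns both its detection-head output and its backbone feature, and that head_loss_fn stands in for a CenterPoint-style head loss; the names and the alpha weighting are assumptions of the sketch, not fixed by the embodiment.

```python
import torch.nn.functional as F

def distillation_step(small_model, frames, pse_label_top, big_feature_top,
                      head_loss_fn, optimizer, alpha=1.0):
    """One distillation update: cloud continuous-frame pseudo labels supervise
    the detection head, and cloud intermediate features supervise the backbone
    feature with an L2 (MSE) loss."""
    preds, small_feature = small_model(frames)     # head output + backbone feature
    loss = head_loss_fn(preds, pse_label_top)      # pseudo-label supervision
    loss = loss + alpha * F.mse_loss(small_feature, big_feature_top)  # L2 distillation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```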
In one embodiment, the distillation training comprises a single-frame distillation training and a multi-frame time sequence distillation training, wherein the single-frame distillation training is performed based on intermediate features or pseudo tags of a cloud perception model; and the multi-frame time sequence distillation training is performed based on multi-frame results or pseudo labels of the cloud perception model or the cloud pre-labeling model.
Specifically, in the knowledge distillation process, the advantages and characteristics of each teacher network are exploited to improve the student network's performance in a targeted way. In the single-frame case, the intermediate features of a cloud large model or its pseudo labels are used; in the multi-frame case, the multi-frame results of the cloud large model or pre-labeling large model, or their pseudo labels, are used. The main methods are Euclidean distance regression on intermediate features and pseudo-label supervision. The training process is as follows: the student network runs inference on frame_{t~t+n} to generate the inference result frame_{t~t+n}_Output and the intermediate feature frame_{t~t+n}_Smallfeature. In the continuous-frame case, the continuous frames use frame_{t~t+n}_conBigfeature_top1~N and frame_{t~t+n}_conPseLabel_top1~N as the RNN_Loss labels, where RNN_Loss sums the multi-frame features and then trains with the L2 loss. In the single-frame case, the single frame uses frame_{t~t+n}_PseLabel_top1~N as the label of the SmallModel detection head, trained in the CenterPoint manner, and frame_{t~t+n}_Bigfeature_top1~N is trained against Smallfeature, the backbone intermediate result of SmallModel, with the Euclidean distance (L2 loss).
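The multi-frame RNN_Loss described above (sum the features over the frame window on both sides, then compare with the L2 loss) can be sketched as follows; equal per-frame feature shapes are an assumption of the sketch:

```python
import torch
import torch.nn.functional as F

def rnn_loss(small_feats, big_feats):
    """small_feats / big_feats: lists of per-frame feature tensors over one
    n-frame window. Sum each side over the window, then take the L2 loss."""
    small_sum = torch.stack(small_feats).sum(dim=0)
    big_sum = torch.stack(big_feats).sum(dim=0)
    return F.mse_loss(small_sum, big_sum)
```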
In one embodiment, the self-training the cloud models based on the first pseudo-annotation data includes: determining difference values among a plurality of first pseudo tags in the first pseudo tag data according to the first pseudo tag data; performing data rejection on the first pseudo labeling data based on the difference value to obtain rejected first pseudo labeling data; obtaining pre-labeling data; and performing self-training on the cloud models according to the first pseudo-annotation data and the pre-annotation data after being removed.
Specifically, as shown in FIG. 3, the main idea of pseudo-label large-model self-training is to clean the continuous-frame pseudo labels and then add them to the original data set to retrain the cloud large models, thereby improving their effectiveness. For example, the differences among the entries in conPseLabels are calculated, mainly by Euclidean distance (L2); the samples in conPseLabels with the greatest difference from the other labels are selected and removed, because samples with poor similarity to the other labels are, with high probability, mislabeled. The original data, i.e., the unlabeled data, is input; cloud large model training is carried out on it; pseudo-labeled data continues to be determined through model inference and tracking computation; continuous-frame pseudo labels are determined; and the pseudo labels with the largest distances are removed by measuring the distances between pseudo labels, forming a training closed loop.
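A minimal sketch of this cleaning step, assuming the N models' continuous-frame pseudo labels have been vectorized to equal-length arrays so that pairwise L2 distances are defined; the vectorization is an assumption of the sketch:

```python
import numpy as np

def reject_outlier_pseudo_labels(con_pse_labels, reject_k=1):
    """Compute pairwise L2 distances among the N models' pseudo labels and drop
    the reject_k labels farthest from all the others, i.e. those most likely
    to be mislabeled; the rest are kept for cloud self-training."""
    X = np.stack([np.asarray(l, dtype=float).ravel() for l in con_pse_labels])
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # N x N
    total = dists.sum(axis=1)              # each label's distance to the rest
    keep = sorted(np.argsort(total)[:len(X) - reject_k])
    return [con_pse_labels[i] for i in keep]
```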
A conventional knowledge distillation teacher-student network cannot continuously optimize the cloud large model, whereas the model training method provided by the embodiments of the present application optimizes the cloud large model continuously, improving labeling efficiency and accuracy, providing continuous-frame temporal information to the small model, and raising the optimization upper bound. Conventional teacher-student networks are mostly trained on single-frame data; the perception model training method of this embodiment corrects pseudo labels whose classes jump unstably in the cloud large model according to the continuous-frame results, preventing inaccurate single-frame cloud detections from limiting the vehicle-end small model's accuracy; and intermediate knowledge distillation training on continuous-frame features improves the stability of the feature distribution, so the vehicle-end small model's performance is improved more efficiently and in a more targeted way.
Conventional knowledge distillation teacher-student networks are mostly trained on a single cloud large model. The perception model training method of the embodiments of the present application uses multiple teacher networks to generate multiple pseudo labels, compares the intermediate features and results with those of the vehicle-end small model, and extracts the results with the greatest variability to train the vehicle-end small model; the aim is to mine the parts where the vehicle-end small model and the teacher networks differ most and to optimize them specifically, improving the model optimization effect more efficiently.
In a preferred embodiment, as shown in FIG. 4, a tightly coupled data closed-loop system is provided: the cloud perception large model and the cloud pre-labeling large model are iterated and evolved using finely labeled data and unlabeled data, and the vehicle-end small model is trained using knowledge transfer methods such as knowledge distillation, improving the vehicle-end small model's performance.
Specifically, high-value data screened by the vehicle-end perception small model can be labeled using the cloud pre-labeling large model to generate finely labeled data. High-value data refers to scene data rich in target results and to special-scene data. Scenes with many target results, for example with many pedestrians or vehicles, are scenes where the vehicle-end perception small model easily makes recognition errors; special scenes refer to rainy or night scenes and the like, where the probability of perception errors is high. This part of the data is therefore called high-value data. Pseudo-labeled data is generated by inference with the several old cloud perception large models and cloud pre-labeling large models. The cloud perception models and cloud pre-labeling models are optimized simultaneously using the pseudo-labeled data and the finely labeled data, yielding new, more accurate cloud perception and pre-labeling models. Then, in a teacher-student setup with the cloud models as the teacher network and the vehicle-end perception model as the student network, a new vehicle-end perception model is obtained from the old one by difficult-case pseudo-label mining and distillation training and is deployed at the vehicle end, realizing an efficient self-training closed loop and improving the perception performance of the cloud models.
The teacher-student network method based on difficult-case pseudo-label mining migrates the knowledge of the optimized cloud models to the vehicle-end small model, improving perception capability more efficiently exactly where the vehicle-end perception small model underperforms.
As shown in FIG. 5, the cloud large models and the knowledge distillation process jointly optimize the student vehicle-end perception small model through multiple teacher networks, specifically including: continuous-frame pseudo-label generation; pseudo-label large-model self-training; and difficult-case pseudo-label mining and distillation training as the core teacher-student network optimization algorithms.
In another preferred embodiment, as shown in FIG. 6, a cloud multi-model optimization method is provided for a 2D vehicle-end perception small model. Specifically, during vehicle-end operation the 2D vehicle-end perception small model produces detection inaccuracies caused by environmental changes, for example: lost detection of vehicles or pedestrians under strong light, inaccurate lane line detection, and false detections in heavy rain and fog. The data for these scenes therefore needs to be manually labeled to strengthen the vehicle-end small model's performance in them. Data from these environments is collected and uploaded to the cloud as finely labeled image data and route-mined unlabeled data, and pseudo labels are generated by cloud large model inference. Because the cloud large model performs far better than the vehicle-end small model, the generated pseudo labels can cover the targets in these scenes. The generated pseudo labels are used with the cloud large-model self-training strategy to optimize the cloud large model, further improving its effectiveness. The intermediate results and pseudo labels produced by the cloud large model are then taken out and optimization is carried out with the improved knowledge distillation method described above. Finally the optimized vehicle-end small model is deployed, forming a closed loop.
According to the above perception model training method, in the tightly coupled data closed-loop system the cloud large model and pre-labeling model can continuously evolve and iterate on the high-value data screened by the vehicle-end small model, continuously improving labeling accuracy and dynamically supporting the needs of the training service.
In AI model training, the vehicle-end small model is not trained directly on the data; rather, the cloud model and the cloud labeling large model are optimized first, and the vehicle-end small model's performance is then jointly optimized through the teacher-student network, improving the quality and efficiency of training.
The improvement of the student network is assisted by multiple teacher networks; the main approach is to compare the inference results of the several cloud large models with those of the vehicle-end small model and to take the results with the largest differences as pseudo labels for training the model.
A brand-new teacher-student distillation method is designed around continuous-frame features; its main idea is to pair the IDs of continuous-frame results with a tracking algorithm and extract the results within several frames as pseudo labels for training the vehicle-end small model, reducing the limitation on the vehicle-end small model's accuracy caused by inaccurate labeling results.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments; nor is the order of these sub-steps or stages necessarily sequential, as they may be performed in turn or alternately with at least part of the other steps or of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a perception model training device for realizing the above related perception model training method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the one or more perception model training devices provided below may be referred to the limitation of the perception model training method hereinabove, and will not be described herein.
In one embodiment, as shown in fig. 7, there is provided a perception model training apparatus, including: acquisition module 10, self-training module 20, calculation module 30, and distillation training module 40, wherein:
an obtaining module 10, configured to obtain unlabeled data determined based on a terminal model;
the self-training module 20 is configured to input the unlabeled data into a plurality of cloud models, generate first pseudo-labeled data, and perform self-training on the plurality of cloud models based on the first pseudo-labeled data;
the calculation module 30 is configured to input the unlabeled data into a terminal model, and generate second pseudo-labeled data;
and the distillation training module 40 is configured to update the terminal model through distillation training according to the first pseudo-annotation data and the second pseudo-annotation data, so as to obtain a terminal update model.
In one embodiment, the unlabeled data comprises a plurality of consecutive frames of unlabeled data.
The self-training module 20 is further configured to identify the unlabeled data through a plurality of cloud models, and generate a plurality of first pseudo tags and a plurality of first intermediate features; tracking a plurality of first pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate first tracking pseudo tags; performing association processing on the plurality of first pseudo tags and the plurality of first intermediate features according to the time sequence of the first tracking pseudo tags and the continuous multi-frame unlabeled data to generate first continuous frame pseudo tags and first continuous frame intermediate features; and taking the first continuous frame pseudo tag and the first continuous frame intermediate feature as first pseudo labeling data.
The computing module 30 is further configured to identify the unlabeled data through the terminal model, and generate a plurality of second pseudo tags and a plurality of second intermediate features; tracking a plurality of second pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate second tracking pseudo tags; performing association processing on the plurality of second pseudo tags and the plurality of second intermediate features according to the time sequence of the second tracking pseudo tags and the continuous multi-frame unlabeled data to generate second continuous frame pseudo tags and second continuous frame intermediate features; and taking the second continuous frame pseudo tag and the second continuous frame intermediate feature as second pseudo labeling data.
The distillation training module 40 is further configured to determine difficult-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data, and to update the terminal model through distillation training according to the difficult-case pseudo-label data to obtain the updated terminal model.
The distillation training module 40 is further configured to calculate a Euclidean distance and/or a loss value between the first pseudo-labeled data and the second pseudo-labeled data, and to take the first pseudo-labeled data whose Euclidean distance and/or loss value exceeds a preset threshold as the difficult-case pseudo-label data.
The distillation training module 40 is further configured to perform center-point training with the second continuous-frame pseudo label as the label of the terminal model's detection head, and to perform Euclidean distance training between the first continuous-frame intermediate features and the terminal model's backbone network features, so as to update the terminal model and obtain the updated terminal model.
In one embodiment, the distillation training comprises a single-frame distillation training and a multi-frame time sequence distillation training, wherein the single-frame distillation training is performed based on intermediate features or pseudo tags of a cloud perception model; and the multi-frame time sequence distillation training is performed based on multi-frame results or pseudo labels of the cloud perception model or the cloud pre-labeling model.
The self-training module 20 is further configured to determine difference values among the plurality of first pseudo labels in the first pseudo-labeled data; remove data from the first pseudo-labeled data based on the difference values to obtain filtered first pseudo-labeled data; obtain pre-labeled data; and self-train the cloud models according to the filtered first pseudo-labeled data and the pre-labeled data.
The modules in the above-described perception model training apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in FIG. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, while the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store finely labeled data, unlabeled data, and pseudo-labeled data. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by a processor, implements a perception model training method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of part of the structure related to the present solution and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
obtaining unlabeled data determined based on a terminal model;
inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data;
inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
and updating the terminal model through distillation training according to the first pseudo labeling data and the second pseudo labeling data to obtain a terminal updating model.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
obtaining unlabeled data determined based on a terminal model;
inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data;
inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
and updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain a terminal update model.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the following steps:
obtaining unlabeled data determined based on a terminal model;
inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data;
inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
and updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain a terminal update model.
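By way of illustration and not limitation, a single distillation update may be sketched in PyTorch as below, assuming a center-point-style terminal detector. The cloud pseudo labels supervise the detection head as a center heatmap, and an L2 (Euclidean) term pulls the terminal backbone features toward the cloud intermediate features; `terminal_model.backbone`, `terminal_model.head`, and the simplified heatmap rendering are hypothetical placeholders.

```python
# Hypothetical single distillation step for the terminal model.
import torch
import torch.nn.functional as F

def heatmap_from_labels(labels, shape):
    """Render object centers as a binary target heatmap; a stand-in for the
    Gaussian splatting commonly used by center-point detection heads."""
    target = torch.zeros(shape)
    _, _, height, width = shape
    for box in labels:  # box[:2]: normalized (x, y) center in [0, 1]
        cx = min(max(int(float(box[0]) * (width - 1)), 0), width - 1)
        cy = min(max(int(float(box[1]) * (height - 1)), 0), height - 1)
        target[:, :, cy, cx] = 1.0
    return target

def distillation_step(terminal_model, frames, cloud_labels, cloud_features,
                      optimizer, feat_weight=0.5):
    """Supervise the detection head with cloud pseudo labels and match the
    terminal backbone features to the cloud intermediate features."""
    backbone_feats = terminal_model.backbone(frames)
    heatmap_pred = terminal_model.head(backbone_feats)

    # Label loss: cloud pseudo labels rendered as a center heatmap target.
    heatmap_target = heatmap_from_labels(cloud_labels, heatmap_pred.shape)
    label_loss = F.binary_cross_entropy_with_logits(heatmap_pred, heatmap_target)

    # Feature loss: Euclidean (squared L2) distance between intermediate features.
    feat_loss = F.mse_loss(backbone_feats, cloud_features.detach())

    loss = label_loss + feat_weight * feat_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())
```

The feature term requires the cloud and terminal feature maps to share a spatial shape; in practice a projection layer would be inserted when they differ.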
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
The user information (including but not limited to user device information, user personal information, etc.) and the data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties.
Those skilled in the art will appreciate that all or part of the processes in the methods described above may be implemented by a computer program instructing related hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided in the present application may include at least one of a relational database and a non-relational database; the non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
The above embodiments merely express several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the application. It should be noted that several variations and improvements may be made by those skilled in the art without departing from the concept of the present application, all of which fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (7)

1. A method of training a perception model, the method comprising:
acquiring unlabeled data determined based on a terminal model, wherein the unlabeled data comprises continuous multi-frame unlabeled data;
inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the cloud models based on the first pseudo-labeled data;
wherein inputting the unlabeled data into the plurality of cloud models and generating the first pseudo-labeled data comprises:
identifying the unlabeled data through a plurality of cloud models to generate a plurality of first pseudo tags and a plurality of first intermediate features;
tracking a plurality of first pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate first tracking pseudo tags;
performing association processing on the plurality of first pseudo tags and the plurality of first intermediate features according to the time sequence of the first tracking pseudo tags and the continuous multi-frame unlabeled data to generate first continuous frame pseudo tags and first continuous frame intermediate features;
taking the first continuous frame pseudo tag and the first continuous frame intermediate feature as the first pseudo-labeled data;
wherein performing self-training on the plurality of cloud models based on the first pseudo-labeled data comprises:
determining difference values among the plurality of first pseudo tags in the first pseudo-labeled data according to the first pseudo-labeled data;
performing data rejection on the first pseudo-labeled data based on the difference values to obtain rejected first pseudo-labeled data;
obtaining pre-labeled data, and performing self-training on the plurality of cloud models according to the rejected first pseudo-labeled data and the pre-labeled data;
inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain a terminal update model;
wherein updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain the terminal update model comprises:
determining hard-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data;
updating the terminal model through distillation training according to the hard-case pseudo-label data to obtain the terminal update model;
wherein the hard-case pseudo-label data is determined in the following manner: performing inference on the unlabeled data through the terminal model to generate a target detection pseudo tag, and generating a continuous frame pseudo tag with tracking information through a tracking algorithm; determining a plurality of difference values between the target detection pseudo tag and the plurality of first pseudo tags, and between the continuous frame pseudo tag with tracking information and the plurality of first pseudo tags, based on a Euclidean distance and a loss function; and determining a maximum value among the plurality of difference values, and determining the first pseudo-labeled data and the second pseudo-labeled data corresponding to the maximum value as the hard-case pseudo-label data;
wherein updating the terminal model through distillation training according to the hard-case pseudo-label data to obtain the terminal update model comprises:
performing center-point training by taking the second continuous frame pseudo tag as the label of a detection head of the terminal model, and training on the Euclidean distance between the first continuous frame intermediate feature and the backbone network feature of the terminal model, so as to update the terminal model and obtain the terminal update model.
2. The method of claim 1, wherein inputting the unlabeled data into a terminal model to generate second pseudo-labeled data comprises:
identifying the unlabeled data through the terminal model to generate a plurality of second pseudo tags and a plurality of second intermediate features;
tracking a plurality of second pseudo tags in continuous multi-frame unlabeled data through a target tracking algorithm to generate second tracking pseudo tags;
performing association processing on the plurality of second pseudo tags and the plurality of second intermediate features according to the time sequence of the second tracking pseudo tags and the continuous multi-frame unlabeled data to generate second continuous frame pseudo tags and second continuous frame intermediate features;
and taking the second continuous frame pseudo tag and the second continuous frame intermediate feature as the second pseudo-labeled data.
3. The method of claim 1, wherein determining the hard-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data comprises:
calculating a Euclidean distance and/or a loss value between the first pseudo-labeled data and the second pseudo-labeled data;
and taking the pseudo-labeled data whose Euclidean distance and/or loss value exceeds a preset threshold as the hard-case pseudo-label data corresponding to the first pseudo-labeled data.
4. The perception model training method according to claim 1, wherein the cloud models comprise a cloud perception model and a cloud pre-labeling model, and the distillation training comprises single-frame distillation training and multi-frame time-sequence distillation training, wherein the single-frame distillation training is performed based on intermediate features or pseudo tags of the cloud perception model, and the multi-frame time-sequence distillation training is performed based on multi-frame results or pseudo tags of the cloud perception model or the cloud pre-labeling model.
5. A perception model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring unlabeled data determined based on the terminal model, wherein the unlabeled data comprises continuous multi-frame unlabeled data;
the self-training module is used for inputting the unlabeled data into a plurality of cloud models, generating first pseudo-labeled data, and performing self-training on the plurality of cloud models based on the first pseudo-labeled data;
the self-training module is further used for identifying the unlabeled data through the plurality of cloud models to generate a plurality of first pseudo tags and a plurality of first intermediate features; tracking the plurality of first pseudo tags in the continuous multi-frame unlabeled data through a target tracking algorithm to generate first tracking pseudo tags; performing association processing on the plurality of first pseudo tags and the plurality of first intermediate features according to the time sequence of the first tracking pseudo tags and the continuous multi-frame unlabeled data to generate first continuous frame pseudo tags and first continuous frame intermediate features; taking the first continuous frame pseudo tag and the first continuous frame intermediate feature as the first pseudo-labeled data; determining difference values among the plurality of first pseudo tags in the first pseudo-labeled data according to the first pseudo-labeled data; performing data rejection on the first pseudo-labeled data based on the difference values to obtain rejected first pseudo-labeled data; and obtaining pre-labeled data, and performing self-training on the plurality of cloud models according to the rejected first pseudo-labeled data and the pre-labeled data;
the calculation module is used for inputting the unlabeled data into a terminal model to generate second pseudo-labeled data;
the distillation training module is used for updating the terminal model through distillation training according to the first pseudo-labeled data and the second pseudo-labeled data to obtain a terminal update model;
the distillation training module is further used for determining hard-case pseudo-label data according to the first pseudo-labeled data and the second pseudo-labeled data, and updating the terminal model through distillation training according to the hard-case pseudo-label data to obtain the terminal update model, wherein inference is performed on the unlabeled data through the terminal model to generate a target detection pseudo tag, and a continuous frame pseudo tag with tracking information is generated through a tracking algorithm; a plurality of difference values between the target detection pseudo tag and the plurality of first pseudo tags, and between the continuous frame pseudo tag with tracking information and the plurality of first pseudo tags, are determined based on a Euclidean distance and a loss function; a maximum value among the plurality of difference values is determined, and the first pseudo-labeled data and the second pseudo-labeled data corresponding to the maximum value are determined as the hard-case pseudo-label data; and center-point training is performed by taking the second continuous frame pseudo tag as the label of a detection head of the terminal model, and training is performed on the Euclidean distance between the first continuous frame intermediate feature and the backbone network feature of the terminal model, so as to update the terminal model and obtain the terminal update model.
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 4.
CN202310950761.2A 2023-07-31 2023-07-31 Perception model training method, device, computer equipment and storage medium Active CN116681123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310950761.2A CN116681123B (en) 2023-07-31 2023-07-31 Perception model training method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116681123A (en) 2023-09-01
CN116681123B (en) 2023-11-14

Family

ID=87781305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310950761.2A Active CN116681123B (en) 2023-07-31 2023-07-31 Perception model training method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116681123B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230169389A1 (en) * 2021-11-30 2023-06-01 International Business Machines Corporation Domain adaptation

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127666A (en) * 2020-01-15 2021-07-16 初速度(苏州)科技有限公司 Continuous frame data labeling system, method and device
WO2021143230A1 (en) * 2020-01-15 2021-07-22 初速度(苏州)科技有限公司 Labeling system, method and apparatus for continuous frame data
CN112949786A (en) * 2021-05-17 2021-06-11 腾讯科技(深圳)有限公司 Data classification identification method, device, equipment and readable storage medium
CN113807399A (en) * 2021-08-16 2021-12-17 华为技术有限公司 Neural network training method, neural network detection method and neural network detection device
WO2023045935A1 (en) * 2021-09-22 2023-03-30 北京智行者科技股份有限公司 Automated iteration method for target detection model, device and storage medium
CN114120319A (en) * 2021-10-09 2022-03-01 苏州大学 Continuous image semantic segmentation method based on multi-level knowledge distillation
CN113947196A (en) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 Network model training method and device and computer readable storage medium
CN114445789A (en) * 2022-01-24 2022-05-06 上海宏景智驾信息科技有限公司 Automatic driving scene mining method based on semi-supervised transform detection
CN115311605A (en) * 2022-09-29 2022-11-08 山东大学 Semi-supervised video classification method and system based on neighbor consistency and contrast learning
CN115661615A (en) * 2022-12-13 2023-01-31 浙江莲荷科技有限公司 Training method and device of image recognition model and electronic equipment
CN115879535A (en) * 2023-02-10 2023-03-31 北京百度网讯科技有限公司 Training method, device, equipment and medium for automatic driving perception model
CN116402976A (en) * 2023-03-07 2023-07-07 嬴彻星创智能科技(上海)有限公司 Training method and device for three-dimensional target detection model
CN116453109A (en) * 2023-03-28 2023-07-18 上海高德威智能交通***有限公司 3D target detection method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Consistency regularization teacher-student semi-supervised learning method for target recognition in SAR images; Ye Tian et al.; The Visual Computer; 4179-4192 *
Research on domain-adaptive weakly supervised object detection algorithms based on deep learning; Ouyang Shengxiong; China Master's Theses Full-text Database; 1-77 *

Also Published As

Publication number Publication date
CN116681123A (en) 2023-09-01

Similar Documents

Publication Publication Date Title
Xu et al. Exploring categorical regularization for domain adaptive object detection
Mou et al. RiFCN: Recurrent network in fully convolutional network for semantic segmentation of high resolution remote sensing images
Khaliq et al. A holistic visual place recognition approach using lightweight cnns for significant viewpoint and appearance changes
Olid et al. Single-view place recognition under seasonal changes
Lopez-Antequera et al. Appearance-invariant place recognition by discriminatively training a convolutional neural network
Gomez-Ojeda et al. Training a convolutional neural network for appearance-invariant place recognition
Bai et al. Sequence searching with CNN features for robust and fast visual place recognition
CN103578119A (en) Target detection method in Codebook dynamic scene based on superpixels
CN110781262A (en) Semantic map construction method based on visual SLAM
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN113052184B (en) Target detection method based on two-stage local feature alignment
Zhou et al. Cross-weather image alignment via latent generative model with intensity consistency
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
CN115393745A (en) Automatic bridge image progress identification method based on unmanned aerial vehicle and deep learning
CN115546840A (en) Pedestrian re-recognition model training method and device based on semi-supervised knowledge distillation
CN114511627A (en) Target fruit positioning and dividing method and system
Yun et al. Target-style-aware unsupervised domain adaptation for object detection
CN116681123B (en) Perception model training method, device, computer equipment and storage medium
Ke et al. Dense small face detection based on regional cascade multi‐scale method
CN114067356B (en) Pedestrian re-recognition method based on combined local guidance and attribute clustering
CN112215205B (en) Target identification method and device, computer equipment and storage medium
Hou et al. Forest: A lightweight semantic image descriptor for robust visual place recognition
Feng et al. Incremental Learning-based Lane Detection for Automated Rubber-Tired Gantries in Container Terminal
Taha et al. Review of place recognition approaches: traditional and deep learning methods
CN115861997B (en) License plate detection and recognition method for key foreground feature guided knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant