CN116129224A - Training method, classifying method and device for detection model and electronic equipment


Info

Publication number
CN116129224A
Authority
CN
China
Prior art keywords
classification, detection model, feature, image, text
Legal status
Pending
Application number
CN202310107213.3A
Other languages
Chinese (zh)
Inventor
陈文俊
蒋宁
夏粉
肖冰
李宽
Current Assignee
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Application filed by Mashang Consumer Finance Co Ltd filed Critical Mashang Consumer Finance Co Ltd
Priority to CN202310107213.3A priority Critical patent/CN116129224A/en
Publication of CN116129224A publication Critical patent/CN116129224A/en

Classifications

    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/778 Active pattern-learning, e.g. online learning of image or video features
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • Y02T10/40 Engine management systems


Abstract

The embodiment of the application discloses a training method, a classification method, a device, and electronic equipment for a detection model. The training method of the detection model includes: acquiring an image sample set, where the image sample set includes image samples of N category labels and each category label corresponds to a prompt learning text; inputting the image sample into a first detection model, and performing image processing on the image sample through the first detection model to obtain first classification features corresponding to the N category labels; acquiring text features of the N prompt learning texts; calculating the similarity between each first classification feature and the N text features to obtain N similarity sets; determining a predicted category label of each first classification feature, where the predicted category label is the category label of the text feature corresponding to the maximum similarity in the similarity set corresponding to the first classification feature; and optimizing parameters of the first detection model according to the predicted category labels and the category labels corresponding to the first classification features.

Description

Training method, classifying method and device for detection model and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular to a training method and a classification method for a detection model, a corresponding device, and an electronic apparatus.
Background
Object detection aims to find objects of interest in an acquired picture or video. It involves two tasks, classification and localization, and ultimately determines the category and the position of the object to be detected.
In some scenarios, a large amount of labeled data is required to ensure object detection performance; with less labeled data, the performance is insufficient, so considerable manpower and material resources must be spent on data labeling. Moreover, object detection techniques in the related art can only detect labeled categories, and their recognition accuracy for unlabeled categories is low.
Disclosure of Invention
The application provides a classification method, a training method and a device for a detection model, and electronic equipment, aiming to solve the problem of low recognition accuracy of image categories.
In a first aspect, the present application provides a training method for a detection model, including: acquiring an image sample set, wherein the image sample set comprises image samples of N category labels, each category label corresponds to a prompt learning text, and the prompt learning text is used for describing the category label; inputting the image sample into a first detection model, and performing image processing on the image sample through the first detection model to obtain first classification features corresponding to the N class labels; acquiring text characteristics of N prompt learning texts; calculating the similarity between each first classification feature and N text features to obtain N similarity sets, wherein each similarity set corresponds to one first classification feature; determining a prediction category label of each first classification feature, wherein the prediction category label is a category label of a text feature corresponding to the maximum similarity in a similarity set corresponding to the first classification feature; and optimizing parameters of the first detection model according to the prediction category labels and the category labels corresponding to the first classification features.
In a second aspect, embodiments of the present application provide a classification method, including: acquiring an image to be detected, wherein the image to be detected carries M category labels; determining prompt learning texts corresponding to the M category labels; acquiring text features corresponding to M prompt learning texts; inputting the image to be detected into a first detection model for image processing to obtain classification features corresponding to M class labels of the image to be detected; respectively calculating the similarity between each classification feature and the M text features to obtain M similarity sets, wherein each similarity set corresponds to one classification feature; and for each classification feature, determining the class label of the text feature corresponding to the maximum similarity in the similarity set as the prediction class label of the image to be detected.
In a third aspect, the present application provides a training device for a detection model, including: the acquisition module is used for acquiring an image sample set, wherein the image sample set comprises image samples of N category labels, each category label corresponds to a prompt learning text, and the prompt learning text is used for describing the category label; the processing module is used for inputting the image sample into a first detection model, and performing image processing on the image sample through the first detection model to obtain first classification features corresponding to the N class labels; the acquisition module is also used for acquiring text characteristics of the N prompt learning texts; the computing module is used for computing the similarity between each first classification feature and N text features to obtain N similarity sets, and each similarity set corresponds to one first classification feature; the determining module is used for determining a prediction category label of each first classification feature, wherein the prediction category label is a category label of a text feature corresponding to the largest similarity in a similarity set corresponding to the first classification feature; and the optimization module is used for optimizing the parameters of the first detection model according to the prediction category labels and the category labels corresponding to the first classification features.
In a fourth aspect, embodiments of the present application provide a classification apparatus, including: the acquisition module is used for acquiring an image to be detected, wherein the image to be detected carries M category labels; the determining module is used for determining prompt learning texts corresponding to the M category labels; the acquisition module is also used for acquiring text features corresponding to the M prompt learning texts; the processing module is used for inputting the image to be detected into a first detection model for image processing to obtain classification characteristics corresponding to M class labels of the image to be detected; the computing module is used for computing the similarity between each classification feature and the M text features respectively to obtain M similarity sets, and each similarity set corresponds to one classification feature; and the determining module is further used for determining, for each classification feature, a category label of the text feature corresponding to the maximum similarity in the similarity set as a prediction category label of the image to be detected.
In a fifth aspect, the present application provides an electronic device, comprising: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method according to the first or second aspect.
In a sixth aspect, the present application provides a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of the first or second aspect.
It can be seen that, in the embodiment of the present application, the image samples input into the first detection model carry category labels, and each category label corresponds to a prompt learning text describing it. The first detection model performs image processing on the input image samples to obtain a first classification feature for each category label, and the text feature of each prompt learning text is obtained. For each first classification feature, its similarity to each text feature is calculated, and the category label of the text feature with the largest similarity is taken as the predicted category label of that first classification feature, so that the category of each classification feature can be predicted through the prompt learning texts. Finally, the parameters of the first detection model are optimized according to the predicted category labels and the category labels corresponding to the first classification features.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate exemplary embodiments of the present specification and, together with the description, serve to explain it; they are not intended to limit the specification unduly. In the drawings:
fig. 1 is a flow chart of a training method of a detection model according to an embodiment of the present application;
fig. 2 is a schematic diagram of a specific scenario for calculating similarity according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an overall training process of a first detection model according to an embodiment of the present application;
fig. 4 is a flow chart of a classification method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a training device for a detection model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a classification device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and the corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments obtained by one of ordinary skill in the art from the embodiments herein without undue burden fall within the scope of the present application.
The terms first, second and the like in the description and in the claims are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein. In addition, in the present specification and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
As noted above, a large amount of labeled data is required to ensure object detection performance; with less labeled data, the performance is insufficient, so considerable manpower and material resources must be spent on data labeling. Moreover, object detection techniques in the related art can only detect labeled categories and have low recognition accuracy for unlabeled categories.
In order to solve the above technical problems, an embodiment of the present application provides a classification method and a training method for a detection model, where the training method of the detection model includes: acquiring an image sample set, where the image sample set includes image samples of N category labels, each category label corresponds to a prompt learning text, and the prompt learning text is used to describe the category label; inputting the image sample into a first detection model, and performing image processing on the image sample through the first detection model to obtain first classification features corresponding to the N category labels; acquiring text features of the N prompt learning texts; calculating the similarity between each first classification feature and the N text features to obtain N similarity sets, where each similarity set corresponds to one first classification feature; determining a predicted category label of each first classification feature, where the predicted category label is the category label of the text feature corresponding to the maximum similarity in the similarity set corresponding to the first classification feature; and optimizing the parameters of the first detection model according to the predicted category labels and the category labels corresponding to the first classification features.
According to the technical scheme provided by the embodiment of the application, the image samples input into the first detection model carry category labels, and each category label corresponds to a prompt learning text describing it. The first detection model performs image processing on the input image samples to obtain a first classification feature for each category label, and the text feature of each prompt learning text is then obtained. For each first classification feature, its similarity to each text feature is calculated, and the category label of the text feature with the largest similarity is taken as the predicted category label of that first classification feature, so that the category of each classification feature can be predicted through the prompt learning texts. Finally, the parameters of the first detection model are optimized according to the predicted category labels and the category labels corresponding to the first classification features. Because the prompt learning texts guide the prediction of the classification features, the classification performance of the first detection model is improved, and the recognition accuracy for the categories of unlabeled samples is improved.
The classification method includes: acquiring an image to be detected, where the image to be detected carries M category labels; determining the prompt learning texts corresponding to the M category labels; acquiring the text features corresponding to the M prompt learning texts; inputting the image to be detected into the first detection model for image processing to obtain classification features corresponding to the M category labels of the image to be detected; calculating the similarity between each classification feature and the M text features to obtain M similarity sets, where each similarity set corresponds to one classification feature; and, for each classification feature, determining the category label of the text feature corresponding to the maximum similarity in the similarity set as the predicted category label of the image to be detected.
According to the technical scheme provided by the embodiment of the application, the image to be detected input into the first detection model carries category labels, and each category label corresponds to a prompt learning text describing it. The first detection model performs image processing on the input image to be detected to obtain a classification feature for each category label, and the text feature of each prompt learning text is then obtained. For each classification feature, its similarity to each text feature is calculated, and the category label of the text feature with the largest similarity is taken as the predicted category label of the image to be detected, so that the prompt learning texts guide the category prediction for the image to be detected. Because the prompt learning texts guide the first detection model in predicting categories, the classification performance of the first detection model is improved; and because they also guide the classification prediction for images carrying unlabeled categories, the recognition accuracy of the first detection model for unlabeled images to be detected is improved.
It should be understood that the training method or the classification method of the detection model provided in the embodiments of the present application may be executed by an electronic device or by software installed in the electronic device, specifically by a terminal device or a server device. The training method and the classification method may be executed by the same electronic device or by different electronic devices.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a training method of a detection model according to an embodiment of the present disclosure is provided and applied to an electronic device, where the method may include:
step S101, an image sample set is acquired.
The image sample set comprises image samples of N category labels, each category label corresponds to a prompt learning text, and the prompt learning text is used for describing the category label.
Specifically, a category label indicates a category to which an image sample belongs; different constituent objects in an image sample correspond to different category labels. For example, if the constituent objects in an image sample include dice, a stop sign and a chess piece, the category labels corresponding to the image sample include: a dice category, a stop sign category, and a chess piece category. That is, one image sample corresponds to at least one category label; the image sample set may include at least one image sample, and the total number of category labels over all image samples is N.
Further, the prompt learning text describes a category label in textual form. For example, if the category labels corresponding to the image sample include the dice category, the stop sign category, and the chess piece category, the prompt learning texts corresponding to the category labels may be: "a dice in a picture", "a stop sign in a picture", "a chess piece in a picture".
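As a minimal sketch of this label-to-text mapping, the prompt learning texts can be generated from the category labels with a simple template; the template string and label names below are illustrative assumptions, not taken from the application:

```python
# A minimal sketch of mapping category labels to prompt learning texts.
# The template and the label names are illustrative assumptions.
CATEGORY_LABELS = ["dice", "stop sign", "chess piece"]

def build_prompt_texts(labels, template="a {} in a picture"):
    """Return one prompt learning text per category label."""
    return [template.format(label) for label in labels]

print(build_prompt_texts(CATEGORY_LABELS))
# ['a dice in a picture', 'a stop sign in a picture', 'a chess piece in a picture']
```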
Step S103, inputting the image sample into a first detection model, and performing image processing on the image sample through the first detection model to obtain first classification features corresponding to the N class labels.
Specifically, the first detection model may be the small model YOLOX-s. Its image processing proceeds as follows: YOLOX-s identifies the different categories in the image sample to learn classification features corresponding to those categories, and outputs the classification feature corresponding to each category label of the image sample.
Step S105, obtaining text features of N prompt learning texts.
Specifically, the text features of a prompt learning text may be obtained as follows: first encode the prompt learning text to obtain a corresponding text encoding vector; then feed the text encoding vector into a pre-trained self-encoding language model (Bidirectional Encoder Representations from Transformers, BERT) for a second encoding to obtain the text features corresponding to the prompt learning text. The first encoding converts the text into numeric form according to a dictionary, where the dictionary is predefined on the training set and records the correspondence between text and numbers. For example, for the prompt learning text "this is an apple", the text encoding vector obtained by encoding it is "102 108356 674 829 172".
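A hedged sketch of this two-stage encoding using the HuggingFace transformers library follows; the checkpoint name and the use of the [CLS] hidden state as the text feature are assumptions, since the application does not specify them:

```python
# Sketch of the two-stage encoding: tokenization (first encoding) followed by
# BERT (second encoding). Checkpoint name and [CLS] pooling are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_prompt(text: str) -> torch.Tensor:
    # First encoding: text -> token ids via the tokenizer's dictionary.
    inputs = tokenizer(text, return_tensors="pt")
    # Second encoding: token ids -> contextual features via pre-trained BERT.
    hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden[:, 0, :]                      # [CLS] vector as the text feature

prompts = ["a dice in a picture", "a stop sign in a picture", "a chess piece in a picture"]
text_features = torch.cat([encode_prompt(t) for t in prompts])  # (N, 768)
```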
Step S107, calculating the similarity between each first classification feature and N text features to obtain N similarity sets.
Wherein each similarity set corresponds to a first classification feature.
Specifically, each first classification feature has a corresponding similarity set, and each similarity set contains N similarities. A similarity measures the degree of resemblance between the first classification feature and a text feature: the higher the similarity, the more consistent the two are. The similarity may be a cosine similarity.
In one possible implementation, the specific implementation of calculating the similarity is:
inputting the first classification feature into the region proposal network of the first detection model, and predicting, through the region proposal network, the position information of the foreground target corresponding to the first classification feature and the confidence that the foreground target is the target object; determining foreground targets whose confidence is larger than a threshold as positive samples; extracting, according to the position information corresponding to each positive sample, the third classification feature corresponding to the positive sample on the first classification feature, and converting the third classification feature into a positive sample feature vector consistent with the text feature dimension; and calculating the similarity from the positive sample feature vector and each text feature.
Specifically, the first detection model may be the small model YOLOX-s, and the region proposal network (Region Proposal Network, RPN) is a module in YOLOX-s. Its purpose is to predict the position information of the foreground target corresponding to the first classification feature and the confidence that the foreground target is the target object. The position information of the foreground target may be expressed in the form (xmin, ymin, xmax, ymax), and the confidence takes a value in (0, 1): the closer the confidence is to 1, the greater the probability that the foreground target is the target object; the closer it is to 0, the smaller that probability. According to the prediction results for the foreground targets, a threshold (for example, 0.5) is set to further screen them; a foreground target whose confidence is greater than 0.5 is considered a positive sample, yielding m positive samples.
After the m positive samples are obtained, the m positive sample features (third classification features) corresponding to them are extracted on the first classification feature through the ROIAlign branch. The third classification features are then converted, through a convolution layer, into positive sample feature vectors whose dimension is consistent with the text feature dimension, and the similarity is calculated from each positive sample feature vector and each text feature. The similarity may be a cosine similarity, calculated with the following formula:
$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
where cos θ is the cosine similarity, a is the text feature, and b is the positive sample feature vector. The value of cos θ lies in (-1, 1); the closer it is to 1, the more similar the text feature and the positive sample feature vector are, and the more likely they belong to the same category.
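The following sketch shows the confidence screening and the cosine similarity computation; the tensor shapes, the stand-in values, and the assumption that the ROIAlign branch and the 1x1 projection convolution have already produced the positive sample vectors are all illustrative, not the application's actual module interfaces:

```python
import torch
import torch.nn.functional as F

def cosine_similarity_sets(pos_feats: torch.Tensor, text_feats: torch.Tensor):
    """pos_feats: (m, d) positive sample feature vectors, already projected to the
    text feature dimension d; text_feats: (n, d) prompt text features. Returns the
    (m, n) similarity sets and, per positive sample, the most similar text index."""
    sims = F.cosine_similarity(pos_feats.unsqueeze(1), text_feats.unsqueeze(0), dim=-1)
    return sims, sims.argmax(dim=1)

# Confidence screening of RPN foreground predictions (threshold 0.5, per the text).
confidences = torch.tensor([0.9, 0.3, 0.7])           # illustrative RPN confidences
boxes = torch.tensor([[10., 10., 50., 50.],
                      [ 0.,  0.,  5.,  5.],
                      [20., 30., 80., 90.]])          # (xmin, ymin, xmax, ymax)
positive_boxes = boxes[confidences > 0.5]             # m = 2 positive samples

# Stand-ins for the ROIAlign branch + convolution projection to the text dimension.
pos_feats = torch.randn(positive_boxes.size(0), 768)
text_feats = torch.randn(3, 768)                      # n = 3 prompt text features
sims, pred_idx = cosine_similarity_sets(pos_feats, text_feats)
```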
The implementation of the similarity will be described in detail below with reference to fig. 2 by taking a certain first classification feature as an example:
As shown in fig. 2, a classification feature is input into the RPN network, and the foreground targets are predicted through the RPN network to obtain foreground target prediction results, each including the position information of a foreground target and the confidence that it is the target object. Foreground targets whose confidence is greater than the threshold are taken as positive samples, and the ROIAlign branch extracts the m positive sample features (third classification features) corresponding to the m positive samples on the first classification feature. A convolution layer converts the third classification features into vectors consistent with the text feature dimension (positive sample feature vectors), namely the positive sample 1 image feature through the positive sample m image feature. For the n text features of category 1 through category n, the similarity between each positive sample image feature and each of the n text features is calculated, so each positive sample image feature corresponds to n similarities; the category whose text feature has the largest of these n similarities is selected as the predicted category of that positive sample.
Step S109, determining a prediction category label of each first classification feature.
The prediction category label is a category label of a text feature corresponding to the maximum similarity in the similarity set corresponding to the first classification feature.
Specifically, each first classification feature has a corresponding similarity set containing several similarities, and the category label of the text feature corresponding to the maximum similarity is taken as the predicted category label of that first classification feature.
And step S111, optimizing parameters of the first detection model according to the predicted class labels and the class labels corresponding to the first classification features.
Specifically, the category label of a first classification feature can be annotated in advance after manual inspection of the image sample and serves as the true category label of the first classification feature, while the predicted category label is the label predicted for that feature. The predicted category label and the true category label corresponding to each first classification feature are fed into a cross entropy loss function, and the parameters of the first detection model are optimized accordingly.
The cross entropy loss function may be specifically represented by the following formula:
$$H(p, q) = -\sum_{i} p(x_i) \log q(x_i)$$
where H(p, q) is the cross entropy loss function, p(x_i) is the predicted category label, and q(x_i) is the true category label of the first classification feature.
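A minimal sketch of this classification loss in PyTorch follows; using the raw similarity scores directly as logits is an assumption, since the application does not state the exact logits construction:

```python
import torch
import torch.nn.functional as F

# sims: the (m, n) similarity sets for m positive samples; targets: their true
# category-label indices from the manual annotations. Values are stand-ins.
sims = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([0, 2, 1, 0])

# F.cross_entropy applies log-softmax internally, matching
# H(p, q) = -sum_i p(x_i) log q(x_i) with a one-hot true distribution.
# Feeding the similarities as logits is an assumption; temperature scaling is common.
loss = F.cross_entropy(sims, targets)
loss.backward()   # gradients would flow back into the first detection model
```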
According to the technical scheme provided by the embodiment of the application, the image samples input into the first detection model carry category labels, and each category label corresponds to a prompt learning text describing it. The first detection model performs image processing on the input image samples to obtain a first classification feature for each category label, and the text feature of each prompt learning text is then obtained. For each first classification feature, its similarity to each text feature is calculated, and the category label of the text feature with the largest similarity is taken as the predicted category label of that first classification feature, so that the category of each classification feature can be predicted through the prompt learning texts. Finally, the parameters of the first detection model are optimized according to the predicted category labels and the category labels corresponding to the first classification features. Because the prompt learning texts guide the prediction of the classification features, the classification performance of the first detection model is improved, and the recognition accuracy for the categories of unlabeled samples is improved.
In one possible implementation, the method further includes: inputting the image sample into a pre-trained second detection model, and performing image processing on the image sample through the second detection model to obtain second classification features corresponding to the N category labels; and performing classification knowledge distillation learning on a third detection model according to the second classification features until a first loss function of the third detection model converges, to obtain the first detection model, where the first loss function is determined according to the first classification features and the second classification features, and the parameter quantity of the second detection model is larger than that of the first detection model.
Specifically, the second detection model may be the large model YOLOX-l. Its image processing proceeds as follows: YOLOX-l identifies the different categories in the image sample to learn classification features corresponding to those categories, and outputs the classification feature corresponding to each category label of the image sample. Compared with the small model YOLOX-s, the large model YOLOX-l has more convolutional layers, a deeper network, more neurons, and more parameters, but it also depends more heavily on hardware.
After the large model YOLOX-l identifies the different categories in an image sample to learn the second classification features corresponding to those categories, classification knowledge distillation learning is performed: the parameters of the third detection model to be trained are optimized with the first loss function, computed from the second classification features produced by the large model YOLOX-l and the first classification features produced by the small model YOLOX-s. After the first loss function converges, the trained first detection model is obtained; here the third detection model to be trained is the small model YOLOX-s to be trained. In this way, the classification features learned by the large model YOLOX-l and the small model YOLOX-s are kept as consistent as possible, which improves the classification performance of the small model YOLOX-s.
The first loss function may be an L2 loss function, that is, a mean square error loss function, which is the most commonly used regression loss function, and specifically may be expressed by the following formula:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_i^{p} \right)^2$$
where MSE represents the first loss function, y_i represents the second classification feature, y_i^p represents the first classification feature, and n represents the number of category labels. Therefore, based on classification knowledge distillation learning, the pre-trained second detection model can guide the first detection model to learn, which further improves the classification performance of the first detection model under zero-shot learning and reduces the model's dependence on hardware.
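A sketch of this classification distillation step, assuming matching feature shapes between teacher and student (the shapes and names below are illustrative):

```python
import torch
import torch.nn.functional as F

def classification_distillation_loss(student_cls_feat, teacher_cls_feat):
    """L2 (MSE) loss pulling the student's (YOLOX-s) classification features toward
    the frozen teacher's (YOLOX-l); detach() blocks gradients into the teacher."""
    return F.mse_loss(student_cls_feat, teacher_cls_feat.detach())

student_feat = torch.randn(8, 256, requires_grad=True)  # first classification features
teacher_feat = torch.randn(8, 256)                      # second classification features
loss_cls_kd = classification_distillation_loss(student_feat, teacher_feat)
```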
In one possible implementation, after acquiring the image sample set, the method further comprises:
inputting the image sample into a first detection model, and predicting a first target position of a target object in the image sample through the first detection model; inputting the image sample into a second detection model, and predicting a second target position of a target object in the image sample through the second detection model, wherein the parameter quantity of the second detection model is larger than that of the first detection model; and carrying out positioning knowledge distillation learning on the first detection model according to the second target position until a second loss function of the first detection model is converged, wherein the second loss function is determined according to the first target position and the second target position.
Specifically, the target object refers to the target to be located in the image sample; for example, it may be a person, a tree, a car, or a house in the image sample. The first target position and the second target position refer to position distributions of the target object in the image sample, for example BBox distributions. A BBox distribution is generally represented by a box (x, y, w, h), where (x, y) is the center point of the box and w and h are its width and height; the first and second target positions may differ. The first detection model may be the small model YOLOX-s: the regression features learned when locating the target object in the image sample are passed through the regression head layer of YOLOX-s to obtain the first target position of the target object in the image sample. The second detection model may be the large model YOLOX-l: its regression features are passed through the regression head layer of YOLOX-l to obtain the second target position of the target object in the image sample.
The parameter quantity of the second detection model is larger than that of the first detection model, and based on positioning knowledge distillation learning, the second target position learned by the second detection model can be used for guiding the first detection model to learn until the second loss function of the first detection model converges.
The second loss function may be a KL loss function, i.e., a divergence loss function, which makes two probability distributions as close as possible, so that the first target position predicted by the first detection model and the second target position predicted by the second detection model tend to coincide. Through the second loss function, the first target position learned by the first detection model is driven toward the second target position learned by the second detection model; based on positioning knowledge distillation learning, the first detection model can thus be guided to learn, improving its ability to locate a target object in an image, so that its positioning capability is consistent with that of the large model while the model's dependence on hardware is reduced.
The second loss function may be expressed by the following expression:
$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
where D_KL(p‖q) denotes the second loss function, p(x) denotes the first target position predicted by the first detection model, q(x) denotes the second target position predicted by the second detection model, and x denotes the target object in the image sample.
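A sketch of the localization distillation loss follows; treating each box coordinate's prediction as a probability distribution over discretized bins is an assumption about how the BBox distributions are represented:

```python
import torch
import torch.nn.functional as F

def localization_distillation_loss(student_logits, teacher_logits):
    """KL divergence between the teacher's and the student's BBox distributions.
    F.kl_div expects log-probabilities as input and probabilities as target."""
    student_log_p = F.log_softmax(student_logits, dim=-1)
    teacher_p = F.softmax(teacher_logits.detach(), dim=-1)
    return F.kl_div(student_log_p, teacher_p, reduction="batchmean")

student_box = torch.randn(8, 4, 16, requires_grad=True)  # 4 coords x 16 bins (assumed)
teacher_box = torch.randn(8, 4, 16)
loss_loc_kd = localization_distillation_loss(student_box, teacher_box)
```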
The overall training process of the first detection model in the above embodiments is described below with reference to fig. 3, taking the small model YOLOX-s as the first detection model and the large model YOLOX-l as the second detection model:
as shown in FIG. 3, labeled images are respectively input into a small model Yolox-s and a pre-trained large model Yolox-L, the labeled images are processed by the pre-trained large model Yolox-L and the pre-trained small model Yolox-s to obtain classification features and regression features, and the classification features processed by the large model Yolox-L and the small model Yolox-s are optimized through L2 loss function to optimize parameters of the small model Yolox-s. After the regression characteristics of the large model YOLOX-l and the small model YOLOX-s after being processed are input into the corresponding regression head for processing, the large model YOLOX-l and the small model YOLOX-s output the BBox distribution of the target in the image respectively, and the large model YOLOX-l and the small model YOLOX-s output the BBox distribution of the target in the image respectively to optimize the KLloss. In this way, small models YOLOX-s can be guided to learn by large models YOLOX-l based on localization knowledge distillation learning, thereby improving the localization ability of small models YOLOX-s to targets in images and reducing the dependence of models on hardware.
Further, the categories in the image samples and the corresponding prompt learning texts are input into the pre-trained BERT, which extracts the text features of the prompt learning texts. Cosine similarities are computed between the text features and all classification features produced by the small model YOLOX-s, the label corresponding to the text feature with the largest similarity is selected as the predicted classification label, and the predicted classification labels together with the true labels corresponding to the classification features are fed into a cross entropy loss function to optimize the parameters of the small model, so that the optimized small model can predict the target classification labels of images. Therefore, based on classification knowledge distillation learning, the pre-trained second detection model can guide the first detection model to learn, further improving the classification performance of the first detection model under zero-shot learning and reducing the model's dependence on hardware.
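Putting the pieces together, one training step might look like the following sketch. It reuses the helper functions from the earlier sketches (cosine_similarity_sets, classification_distillation_loss, localization_distillation_loss); the loss weights and the assumption that each model returns (classification features, box logits, positive sample vectors) are illustrative, since the real YOLOX heads expose richer outputs:

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, text_feats, images, targets, optimizer,
               w_ce=1.0, w_cls=1.0, w_loc=1.0):
    """One combined optimization step over a labeled batch (interfaces assumed)."""
    with torch.no_grad():
        t_cls, t_box, _ = teacher(images)              # frozen YOLOX-l
    s_cls, s_box, pos_feats = student(images)          # trainable YOLOX-s

    sims, _ = cosine_similarity_sets(pos_feats, text_feats)
    loss = (w_ce * F.cross_entropy(sims, targets)                      # prompt-guided CE
            + w_cls * classification_distillation_loss(s_cls, t_cls)   # L2 on cls features
            + w_loc * localization_distillation_loss(s_box, t_box))    # KL on BBox dists

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```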
Referring to fig. 4, a flow chart of a classification method provided in an embodiment of the present disclosure is applied to an electronic device, and the method may include:
step S401, an image to be detected is acquired.
The image to be detected carries M category labels.
Specifically, the image to be detected refers to an image to be classified. The M category labels carried by the image to be detected include the category labels used in training the first detection model as well as new, untrained category labels. A category label indicates a category to which the image belongs, and different constituent objects in an image correspond to different category labels. For example, if the constituent objects in an image include dice, a stop sign and a chess piece, the corresponding category labels include: a dice category, a stop sign category, and a chess piece category.
Step S403, determining the prompt learning text corresponding to the M category labels.
Specifically, the prompt learning text describes a category label in textual form. For example, if the category labels corresponding to the image include the dice category, the stop sign category, and the chess piece category, the prompt learning texts corresponding to the category labels may be: "a dice in a picture", "a stop sign in a picture", "a chess piece in a picture".
Step S405, obtaining text features corresponding to M prompt learning texts.
Specifically, the text features of a prompt learning text may be obtained as follows: first encode the prompt learning text to obtain a corresponding text encoding vector; then feed the text encoding vector into a pre-trained self-encoding language model (Bidirectional Encoder Representations from Transformers, BERT) for a second encoding to obtain the text features corresponding to the prompt learning text. The first encoding converts the text into numeric form according to a dictionary, where the dictionary is predefined on the training set and records the correspondence between text and numbers. For example, for the prompt learning text "this is an apple", the text encoding vector obtained by encoding it is "102 108356 674 829 172".
Step S407, inputting the image to be detected into a first detection model for image processing to obtain classification features corresponding to M class labels of the image to be detected.
Specifically, the first detection model may be the small model YOLOX-s. Its image processing proceeds as follows: YOLOX-s identifies the different categories in the image to be detected to learn classification features corresponding to those categories, and outputs the classification feature corresponding to each category label of the image to be detected.
And S409, calculating the similarity between each classification feature and M text features respectively to obtain M similarity sets.
Wherein each similarity set corresponds to a classification feature.
Specifically, each classification feature has a corresponding similarity set, and each similarity set contains M similarities. A similarity measures the degree of resemblance between the classification feature and a text feature: the higher the similarity, the more consistent the two are. The similarity may be a cosine similarity.
In one possible implementation, the similarity is calculated as follows: inputting the classification feature into the region proposal network of the first detection model, and predicting, through the region proposal network, the position information of the foreground target corresponding to the classification feature and the confidence that the foreground target is the target object; determining foreground targets whose confidence is larger than a threshold as positive samples; extracting, according to the position information corresponding to each positive sample, the target classification feature corresponding to the positive sample on the classification feature, and converting the target classification feature into a positive sample feature vector consistent with the text feature dimension; and calculating the similarity from the positive sample feature vector and each text feature.
Specifically, the first detection model may be the small model YOLOX-s, and the region proposal network (Region Proposal Network, RPN) is a module in YOLOX-s. Its purpose is to predict the position information of the foreground target corresponding to the classification feature and the confidence that the foreground target is the target object. The position information of the foreground target may be expressed in the form (xmin, ymin, xmax, ymax), and the confidence takes a value in (0, 1): the closer the confidence is to 1, the greater the probability that the foreground target is the target object; the closer it is to 0, the smaller that probability. According to the prediction results for the foreground targets, a threshold (for example, 0.5) is set to further screen them; a foreground target whose confidence is greater than 0.5 is considered a positive sample, yielding m positive samples.
After the m positive samples are obtained, the m positive sample features (target classification features) corresponding to them are extracted on the classification feature through the ROIAlign branch. The target classification features are then converted, through a convolution layer, into positive sample feature vectors whose dimension is consistent with the text feature dimension, and the similarity is calculated from each positive sample feature vector and each text feature. The similarity may be a cosine similarity, calculated with the following formula:
$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
where cos θ is the cosine similarity, a is the text feature, and b is the positive sample feature vector. The value of cos θ lies in (-1, 1); the closer it is to 1, the more similar the text feature and the positive sample feature vector are, and the more likely they belong to the same category.
Step S411, for each classification feature, determining the category label of the text feature corresponding to the maximum similarity in the similarity set as the prediction category label of the image to be detected.
Specifically, each classification feature has a corresponding similarity set containing several similarities, and the category label of the text feature corresponding to the maximum similarity is taken as the predicted category label of that classification feature.
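An inference-time sketch follows, reusing the helpers from the training sketches above (encode_prompt and cosine_similarity_sets); the student model's return interface is an assumption:

```python
import torch

@torch.no_grad()
def classify_image(student, image, text_feats, labels):
    """Predict a category label for each classification feature of an image to be
    detected, by taking the label of the most similar prompt text feature."""
    _, _, pos_feats = student(image.unsqueeze(0))        # assumed student interface
    _, pred_idx = cosine_similarity_sets(pos_feats, text_feats)
    return [labels[i] for i in pred_idx.tolist()]

# Usage (assuming `student` is the trained first detection model):
# labels = ["dice", "stop sign", "chess piece"]          # the M carried labels
# text_feats = torch.cat([encode_prompt(f"a {l} in a picture") for l in labels])
# predictions = classify_image(student, torch.randn(3, 640, 640), text_feats, labels)
```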
According to the technical scheme provided by the embodiment of the application, the image to be detected input into the first detection model carries category labels, and each category label corresponds to a prompt learning text describing it. The first detection model performs image processing on the input image to be detected to obtain a classification feature for each category label, and the text feature of each prompt learning text is then obtained. For each classification feature, its similarity to each text feature is calculated, and the category label of the text feature with the largest similarity is taken as the predicted category label of the image to be detected, so that the prompt learning texts guide the category prediction for the image to be detected. Because the prompt learning texts guide the first detection model in predicting categories, the classification performance of the first detection model is improved; and because they also guide the classification prediction for images carrying unlabeled categories, the recognition accuracy of the first detection model for unlabeled images to be detected is improved.
In one possible implementation, after obtaining the text features corresponding to the M prompt learning texts, the method further includes: inputting the image to be detected into the first detection model, and predicting the position information of the target object in the image to be detected through the first detection model, where the first detection model is obtained through positioning knowledge distillation learning from a second detection model whose parameter quantity is larger than that of the first detection model.
Specifically, the target object refers to the target to be located in the image; for example, it may be a person, a tree, a car, or a house. The position information refers to the position distribution of the target object in the image, for example a BBox distribution, generally represented by a box (x, y, w, h), where (x, y) is the center point of the box and w and h are its width and height.
The first detection model may be the small model YOLOX-s: the regression features learned when locating the target object in the image are passed through the regression head layer of YOLOX-s to obtain the position information of the target object. The second detection model may be the large model YOLOX-l: when the second detection model guides the first detection model, the regression features learned when locating the target object are passed through the regression head layer of YOLOX-l to obtain the corresponding position information, and the small model YOLOX-s is then guided to learn based on the position information produced by the large model YOLOX-l.
Therefore, based on positioning knowledge distillation learning, the pre-trained second detection model can guide the first detection model to learn, improving the first detection model's ability to locate a target object in an image, so that its positioning capability is consistent with that of the large model while the model's dependence on hardware is reduced.
In addition, corresponding to the training method of the detection model shown in fig. 1, an embodiment of the present application further provides a training device for a detection model. Fig. 5 is a schematic structural diagram of a training device 500 for a detection model according to an embodiment of the present application, including: an acquiring module 501, configured to acquire an image sample set, where the image sample set includes image samples of N category labels, each category label corresponds to a prompt learning text, and the prompt learning text is used to describe the category label; a processing module 502, configured to input an image sample into a first detection model and perform image processing on the image sample through the first detection model to obtain first classification features corresponding to the N category labels; the acquiring module 501 is further configured to acquire text features of the N prompt learning texts; a calculating module 503, configured to calculate the similarity between each first classification feature and the N text features to obtain N similarity sets, where each similarity set corresponds to one first classification feature; a determining module 504, configured to determine a predicted category label of each first classification feature, where the predicted category label is the category label of the text feature corresponding to the maximum similarity in the similarity set corresponding to that first classification feature; and an optimization module 505, configured to optimize the parameters of the first detection model according to the predicted category labels and the category labels corresponding to the first classification features.
According to the technical scheme provided by the embodiment of the application, the image samples input into the first detection model carry category labels, and each category label corresponds to a prompt learning text describing it. The first detection model performs image processing on the input image samples to obtain a first classification feature for each category label, and the text feature of each prompt learning text is then obtained. For each first classification feature, its similarity to each text feature is calculated, and the category label of the text feature with the largest similarity is taken as the predicted category label of that first classification feature, so that the category of each classification feature can be predicted through the prompt learning texts. Finally, the parameters of the first detection model are optimized according to the predicted category labels and the category labels corresponding to the first classification features. Because the prompt learning texts guide the prediction of the classification features, the classification performance of the first detection model is improved, and the recognition accuracy for the categories of unlabeled samples is improved.
In a possible implementation manner, the processing module 502 is further configured to input the image sample into a pre-trained second detection model and perform image processing on the image sample through the second detection model to obtain second classification features corresponding to the N category labels; and to carry out classification knowledge distillation learning on a third detection model according to the second classification features until a first loss function of the third detection model converges, to obtain the first detection model, where the first loss function is determined according to the first classification features and the second classification features, and the parameter quantity of the second detection model is larger than that of the first detection model.
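As one hedged illustration of the first loss function, a soft-target distillation term between the student's and teacher's classification features is plausible; the KL-divergence form and the temperature tau are assumptions, since the application only states that the loss is determined from the first and second classification features.

```python
import torch.nn.functional as F

def first_loss(student_class_feats, teacher_class_feats, tau=2.0):
    # Soften both feature distributions with a temperature, then pull the
    # student (third model) toward the teacher (pre-trained second model).
    s = F.log_softmax(student_class_feats / tau, dim=-1)
    t = F.softmax(teacher_class_feats / tau, dim=-1)
    # Scale by tau^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * tau * tau
```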
In one possible implementation, the device further includes a prediction module configured to: input the image sample into the first detection model and predict a first target position of a target object in the image sample through the first detection model; input the image sample into a second detection model and predict a second target position of the target object in the image sample through the second detection model, where the parameter quantity of the second detection model is larger than that of the first detection model; and carry out positioning knowledge distillation learning on the first detection model according to the second target position until a second loss function of the first detection model converges, where the second loss function is determined according to the first target position and the second target position.
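Similarly, the second loss function could be sketched as a simple regression between the two models' predicted target positions; the smooth L1 choice and the (x1, y1, x2, y2) box encoding are assumptions, as the application only requires the loss to be determined from the first and second target positions.

```python
import torch.nn.functional as F

def second_loss(first_target_pos, second_target_pos):
    # Both arguments: predicted boxes of shape (K, 4), e.g. (x1, y1, x2, y2)
    # per target object. Penalize the first (small) model for deviating from
    # the second (large) model's localization.
    return F.smooth_l1_loss(first_target_pos, second_target_pos)
```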
In a possible implementation manner, the calculating module 503 is further configured to input the first classification feature into a region generation network of the first detection model and predict, through the region generation network, the position information of a foreground target corresponding to the first classification feature and the confidence that the foreground target is a target object; determine a foreground target whose confidence is greater than a threshold as a positive sample; extract, according to the position information corresponding to the positive sample, a third classification feature corresponding to the positive sample from the first classification feature, and convert the third classification feature into a positive-sample feature vector consistent with the text feature dimension; and calculate the similarity from the positive-sample feature vector and each text feature.
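The positive-sample similarity computation can be sketched as below; rpn, roi_extract, and to_text_dim are hypothetical callables standing in for the region generation network, the feature cropping at the positive samples' positions, and the dimension-matching conversion.

```python
import torch.nn.functional as F

def positive_sample_similarity(rpn, roi_extract, to_text_dim,
                               first_class_feat, text_feats, threshold=0.5):
    # The region generation network predicts foreground boxes and confidences
    boxes, confidences = rpn(first_class_feat)
    # Keep only confident foreground targets as positive samples
    keep = confidences > threshold
    # Third classification features, cropped at the positive samples' positions
    pos_feats = roi_extract(first_class_feat, boxes[keep])
    # Convert to positive-sample feature vectors in the text feature dimension
    pos_vecs = to_text_dim(pos_feats)
    # Similarity of every positive sample against every prompt text feature
    return F.normalize(pos_vecs, dim=-1) @ F.normalize(text_feats, dim=-1).T
```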
Obviously, the training device for the detection model disclosed in the embodiment of the present application may serve as the execution subject of the training method for the detection model shown in the foregoing embodiments, and can therefore implement the functions implemented by that training method. Since the principle is the same, the description is not repeated here.
Corresponding to the classification method shown in fig. 4, the embodiment of the application also provides a classification device. Fig. 6 is a schematic structural diagram of a classification device 600 according to an embodiment of the present application, including: the acquiring module 601 is configured to acquire an image to be detected, where the image to be detected carries M category labels; the determining module 602 is configured to determine the prompt learning texts corresponding to the M category labels; the acquiring module 601 is further configured to acquire text features corresponding to the M prompt learning texts; the processing module 603 is configured to input the image to be detected into the first detection model for image processing, to obtain classification features corresponding to the M category labels of the image to be detected; the calculating module 604 is configured to calculate the similarity between each classification feature and the M text features, respectively, to obtain M similarity sets, where each similarity set corresponds to one classification feature; and the determining module 602 is further configured to determine, for each classification feature, the category label of the text feature corresponding to the maximum similarity in that feature's similarity set as a predicted category label of the image to be detected.
According to the technical scheme disclosed in the embodiment of the application, the image to be detected input into the first detection model carries category labels, and each category label corresponds to a prompt learning text that describes it. The first detection model performs image processing on the input image to obtain a classification feature for each category label, and the text features of the prompt learning texts are then acquired. For each classification feature, the similarity between that feature and each text feature is calculated, and the category label of the text feature with the largest similarity is taken as a predicted category label of the image to be detected, so that the prompt learning texts guide the prediction of the image's categories. Because the prompt learning texts guide the first detection model's category prediction, the classification performance of the first detection model is improved, as is its recognition accuracy for unlabeled images to be detected.
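At inference time, the scheme reduces to a nearest-text lookup. The following hedged sketch assumes the same hypothetical model and encoder interfaces as the training sketch earlier; first_model and text_encoder are illustrative names, not components defined by this application.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_image(first_model, text_encoder, image, prompt_texts, category_labels):
    # Classification features for the M category labels carried by the image
    class_feats = first_model(image)             # (M, D)
    text_feats = text_encoder(prompt_texts)      # (M, D)
    # M similarity sets, one row per classification feature
    sims = F.normalize(class_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T
    # For each classification feature, take the label of the most similar text
    best = sims.argmax(dim=-1)
    return [category_labels[i] for i in best.tolist()]
```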
In one possible implementation, the device further includes a prediction module configured to input the image to be detected into the first detection model and predict the position information of the target object in the image to be detected through the first detection model, where the first detection model is obtained through positioning knowledge distillation learning from a second detection model, and the parameter quantity of the second detection model is larger than that of the first detection model.
In a possible implementation manner, the calculating module 604 is further configured to input the classification feature into a region generation network of the first detection model and predict, through the region generation network, the position information of a foreground target corresponding to the classification feature and the confidence that the foreground target is a target object; determine a foreground target whose confidence is greater than a threshold as a positive sample; extract, according to the position information corresponding to the positive sample, a target classification feature corresponding to the positive sample from the classification feature, and convert the target classification feature into a positive-sample feature vector consistent with the text feature dimension; and calculate the similarity from the positive-sample feature vector and each text feature.
Obviously, the classification device disclosed in the embodiment of the present application may serve as the execution subject of the classification method shown in the foregoing embodiment, and can therefore implement the functions implemented by that classification method. Since the principle is the same, the description is not repeated here.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. Referring to fig. 7, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include volatile memory, such as random-access memory (RAM), and may further include non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bi-directional arrow is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.
The memory is used for storing a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include volatile memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it, forming a training device or a classification device for the detection model at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the training method or the classification method of the detection model mentioned in any one of the foregoing method embodiments.
The training device or the classification device for the detection model disclosed in the embodiments shown in this specification, or the methods they execute, may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logical blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
It should be understood that the electronic device of the embodiments of the present application may implement the functions of the training device or the classification device for the detection model in the embodiments shown in this specification. Since the principle is the same, the details are not described herein again.
Of course, besides the software implementation, the electronic device in this specification does not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the foregoing processing flow is not limited to individual logical units, and may also be hardware or a logic device.
The present application also proposes a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the training method or the classification method of the detection model of any of the embodiments described above.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In summary, the foregoing is merely a preferred embodiment of the present specification and is not intended to limit the protection scope of the present specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present specification shall be included in the protection scope of the present specification.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.

Claims (10)

1. A method of training a detection model, comprising:
acquiring an image sample set, wherein the image sample set comprises image samples of N category labels, each category label corresponds to a prompt learning text, and the prompt learning text is used for describing the category label;
inputting the image sample into a first detection model, and performing image processing on the image sample through the first detection model to obtain first classification features corresponding to the N category labels;
acquiring text characteristics of N prompt learning texts;
calculating the similarity between each first classification feature and N text features to obtain N similarity sets, wherein each similarity set corresponds to one first classification feature;
determining a prediction category label of each first classification feature, wherein the prediction category label is a category label of a text feature corresponding to the maximum similarity in a similarity set corresponding to the first classification feature;
and optimizing parameters of the first detection model according to the prediction category labels and the category labels corresponding to the first classification features.
2. The method of claim 1, wherein after the acquiring of the image sample set, the method further comprises:
inputting the image sample into a pre-trained second detection model, and performing image processing on the image sample through the second detection model to obtain second classification features corresponding to the N category labels;
and carrying out classification knowledge distillation learning on a third detection model according to the second classification features until a first loss function of the third detection model converges, to obtain the first detection model, wherein the first loss function is determined according to the first classification features and the second classification features, and the parameter quantity of the second detection model is larger than the parameter quantity of the first detection model.
3. The method according to claim 1 or 2, wherein after the acquiring of the image sample set, the method further comprises:
inputting the image sample into a first detection model, and predicting a first target position of a target object in the image sample through the first detection model;
inputting the image sample into a second detection model, and predicting a second target position of a target object in the image sample through the second detection model, wherein the parameter quantity of the second detection model is larger than that of the first detection model;
and carrying out positioning knowledge distillation learning on the first detection model according to the second target position until a second loss function of the first detection model converges, wherein the second loss function is determined according to the first target position and the second target position.
4. The method of claim 1, wherein the similarity is calculated by:
inputting the first classification feature into a region generation network of the first detection model, and predicting the position information of a foreground target corresponding to the first classification feature and the confidence that the foreground target is a target object through the region generation network;
determining a foreground target whose confidence is greater than a threshold as a positive sample;
extracting a third classification feature corresponding to the positive sample on the first classification feature according to the position information corresponding to the positive sample, and converting the third classification feature into a positive sample feature vector consistent with the text feature dimension;
and calculating the similarity according to the positive sample feature vector and each text feature.
5. A method of classification, comprising:
acquiring an image to be detected, wherein the image to be detected carries M category labels;
determining prompt learning texts corresponding to the M category labels;
acquiring text features corresponding to M prompt learning texts;
inputting the image to be detected into a first detection model for image processing to obtain classification features corresponding to the M category labels of the image to be detected;
Respectively calculating the similarity between each classification feature and the M text features to obtain M similarity sets, wherein each similarity set corresponds to one classification feature;
and for each classification feature, determining the class label of the text feature corresponding to the maximum similarity in the similarity set as the prediction class label of the image to be detected.
6. The classification method according to claim 5, wherein after the acquiring of the text features corresponding to the M prompt learning texts, the method further comprises:
inputting the image to be detected into the first detection model, and predicting the position information of the target object in the image to be detected through the first detection model, wherein the first detection model is obtained through positioning knowledge distillation learning from a second detection model, and the parameter quantity of the second detection model is larger than that of the first detection model.
7. The classification method according to claim 5, wherein the similarity is calculated by:
inputting the classification features into a region generation network of the first detection model, and predicting the position information of a foreground target corresponding to the classification features and the confidence that the foreground target is a target object through the region generation network;
determining a foreground target whose confidence is greater than a threshold as a positive sample;
extracting target classification features corresponding to the positive samples on the classification features according to the position information corresponding to the positive samples, and converting the target classification features into positive sample feature vectors consistent with the text feature dimensions;
and calculating the similarity according to the positive sample feature vector and each text feature.
8. A training device for a detection model, comprising:
the acquisition module is used for acquiring an image sample set, wherein the image sample set comprises image samples of N category labels, each category label corresponds to a prompt learning text, and the prompt learning text is used for describing the category label;
the processing module is used for inputting the image sample into a first detection model, and performing image processing on the image sample through the first detection model to obtain first classification features corresponding to the N category labels;
the acquisition module is also used for acquiring text characteristics of the N prompt learning texts;
the computing module is used for computing the similarity between each first classification feature and N text features to obtain N similarity sets, and each similarity set corresponds to one first classification feature;
the determining module is used for determining a prediction category label of each first classification feature, wherein the prediction category label is the category label of the text feature corresponding to the largest similarity in the similarity set corresponding to that first classification feature;
and the optimization module is used for optimizing the parameters of the first detection model according to the prediction category labels and the category labels corresponding to the first classification features.
9. A classification device, comprising:
the acquisition module is used for acquiring an image to be detected, wherein the image to be detected carries M category labels;
the determining module is used for determining prompt learning texts corresponding to the M category labels;
the acquisition module is also used for acquiring text features corresponding to the M prompt learning texts;
the processing module is used for inputting the image to be detected into a first detection model for image processing to obtain classification features corresponding to the M category labels of the image to be detected;
the computing module is used for computing the similarity between each classification feature and the M text features respectively to obtain M similarity sets, and each similarity set corresponds to one classification feature;
and the determining module is further used for determining, for each classification feature, a category label of the text feature corresponding to the maximum similarity in the similarity set as a prediction category label of the image to be detected.
10. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the detection model of any one of claims 1 to 4 or the classification method of any one of claims 5 to 7.
CN202310107213.3A 2023-02-13 2023-02-13 Training method, classifying method and device for detection model and electronic equipment Pending CN116129224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310107213.3A CN116129224A (en) 2023-02-13 2023-02-13 Training method, classifying method and device for detection model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310107213.3A CN116129224A (en) 2023-02-13 2023-02-13 Training method, classifying method and device for detection model and electronic equipment

Publications (1)

Publication Number Publication Date
CN116129224A true CN116129224A (en) 2023-05-16

Family

ID=86311489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310107213.3A Pending CN116129224A (en) 2023-02-13 2023-02-13 Training method, classifying method and device for detection model and electronic equipment

Country Status (1)

Country Link
CN (1) CN116129224A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116844161A (en) * 2023-09-04 2023-10-03 深圳市大数据研究院 Cell detection classification method and system based on grouping prompt learning
CN116844161B (en) * 2023-09-04 2024-03-05 深圳市大数据研究院 Cell detection classification method and system based on grouping prompt learning
CN117057443A (en) * 2023-10-09 2023-11-14 杭州海康威视数字技术股份有限公司 Prompt learning method of visual language model and electronic equipment
CN117057443B (en) * 2023-10-09 2024-02-02 杭州海康威视数字技术股份有限公司 Prompt learning method of visual language model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination