CN116543261A - Model training method for image recognition, image recognition method, device and medium

Info

Publication number: CN116543261A
Application number: CN202310538955.1A
Authority: CN
Prior art keywords: image, feature, model, training, sample
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 史晓丽, 张震国, 吴剑平
Original and current assignee: Shanghai Lingshi Communication Technology Development Co Ltd; Suzhou Keda Technology Co Ltd
Application filed by Shanghai Lingshi Communication Technology Development Co Ltd and Suzhou Keda Technology Co Ltd; priority to CN202310538955.1A; publication of CN116543261A.

Classifications

    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 20/00 Machine learning
    • G06V 10/40 Extraction of image or video features
    • G06V 10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/778 Active pattern-learning, e.g. online learning of image or video features
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a model training method for image recognition, an image recognition method, a device, and a medium, and belongs to the field of computer technology. The method includes: training a preset first machine learning network based on a training data set to obtain a feature extraction model, where the feature extraction model extracts features of an input target image in the image recognition process to obtain image features, so that template features matching the image features can be determined in a preset template feature library; and, for the N sample images corresponding to each class label in the training data set, training a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality evaluation model, where the quality evaluation model determines a quality score of the target image in the image recognition process, so that the recognition result of the target image is determined by combining the quality score and the template features. The method ensures the accuracy of the output image recognition results and saves the human labor consumed in annotating quality scores.

Description

Model training method for image recognition, image recognition method, device and medium
Technical Field
The present application relates to a model training method for image recognition, an image recognition method, a device, and a medium, and belongs to the field of computer technology.
Background
Image recognition technology analyzes an image with a computer to extract useful identifying information from it. It is widely applied in scenes such as face recognition, vehicle recognition, and license plate recognition.
Conventional image recognition techniques are implemented by training neural network models. During model training, a large number of sample images containing the object to be identified are collected, and a neural network model is trained on these sample images and their corresponding target labels to obtain an image recognition network. During recognition, the target image is input into the image recognition network to obtain a recognition result.
However, images acquired during recognition are strongly affected by the environment: on the one hand, the image varies with illumination, background, and so on; on the other hand, the object to be identified may move freely, so the object in the image may exhibit large variations in illumination or pose. In extreme cases, blurred images may even be acquired because of the motion of the object. Inputting such low-quality images into the image recognition network reduces the accuracy of image recognition.
Disclosure of Invention
The model training method for image recognition, the image recognition method, the device, and the medium provided by the present application can ensure the accuracy of the output image recognition result, save the human labor consumed in annotating image quality scores, and improve the training efficiency of the quality evaluation model. The present application provides the following technical solutions:
in a first aspect, a model training method for image recognition is provided, the method comprising:
acquiring a training data set, wherein the training data set includes a plurality of sample images and a class label corresponding to each sample image;
training a preset first machine learning network based on the training data set to obtain a feature extraction model; the feature extraction model is used for extracting features of an input target image in the image recognition process to obtain image features so as to determine template features matched with the image features in a preset template feature library;
for the N sample images corresponding to each class label in the training data set, training a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality evaluation model; the quality evaluation model is used for determining a quality score of the target image in the image recognition process, so that the recognition result of the target image is determined by combining the quality score and the template features; wherein N is a positive integer.
Optionally, the feature extraction model includes a feature extraction layer and a blocking layer connected with the feature extraction layer, where the blocking layer is used to divide the feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule; correspondingly, the quality evaluation model outputs a quality score corresponding to each sub-feature map;
training a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality evaluation model includes the following steps:
inputting each sample image into the feature extraction model and the second machine learning network respectively, to obtain at least two sub-feature maps corresponding to the sample image and a predicted quality score corresponding to each sub-feature map;
weighting each sub-feature map based on its predicted quality score, and concatenating the differently weighted sub-features of the same sample image to obtain a weighted fusion feature of the sample image;
and training the second machine learning network based on the weighted fusion features to obtain the quality evaluation model.
Optionally, before each sample image is input into the feature extraction model and the second machine learning network, the method further includes:
generating a first sample image set, a second sample image set, and a third sample image set based on the N sample images, wherein the third sample image set covers sample images acquired in each image acquisition scene;
correspondingly, training the second machine learning network based on the weighted fusion features to obtain the quality evaluation model includes:
determining a first distance distribution between the first sample image set and the second sample image set based on the weighted fusion feature;
determining a second distance distribution between the first sample image set and the third sample image set based on the weighted fusion feature;
and training the second machine learning network based on the distance distribution difference between the first distance distribution and the second distance distribution to obtain the quality evaluation model.
Optionally, the blocking layer is further connected with at least two classification networks, the classification networks are in one-to-one correspondence with the sub-feature maps, and within each classification network the weights corresponding to different classes are different;
training the second machine learning network based on the distance distribution difference between the first distance distribution and the second distance distribution to obtain the quality evaluation model includes:
acquiring, for each sample image, the weights of the classification network corresponding to each sub-feature map;
determining a prediction category of the sample image based on the sub-feature map and the weights;
and training the second machine learning network based on the classification difference between the prediction category and the category label corresponding to the sample image and the distance distribution difference to obtain the quality evaluation model.
Optionally, after generating the first sample image set, the second sample image set, and the third sample image set based on the N sample images, the method further includes:
generating teacher positive pairs in a teacher distribution based on sample images in the first sample image set and sample images in the second sample image set, and generating teacher negative pairs in the teacher distribution based on sample images of other classes in the training data set;
generating student positive pairs in a student distribution based on sample images in the first sample image set and sample images in the third sample image set, and generating student negative pairs in the student distribution based on sample images of other classes in the training data set;
accordingly, the determining a first distance distribution between the first sample image set and the second sample image set based on the weighted fusion feature comprises:
determining the similarity between the weighted fusion features corresponding to the teacher positive pairs and the similarity between the weighted fusion features corresponding to the teacher negative pairs, to obtain the first distance distribution;
accordingly, the determining a second distance distribution between the first sample image set and the third sample image set based on the weighted fusion feature comprises:
and determining the similarity between the weighted fusion features corresponding to the student positive pairs and the similarity between the weighted fusion features corresponding to the student negative pairs, to obtain the second distance distribution.
Optionally, generating the first, second, and third sample image sets based on the N sample images includes:
dividing the N sample images into three sample image sets, wherein two sample image sets are the first sample image set and the second sample image set respectively;
and performing image expansion on the remaining sample image set according to different image acquisition scenes to obtain the third sample image set.
In a second aspect, there is provided an image recognition method, the method comprising:
acquiring a target image to be identified;
inputting the target image into a feature extraction model and a quality evaluation model respectively, to output the image features of the target image through the feature extraction model and the quality score of the target image through the quality evaluation model; the feature extraction model and the quality evaluation model are trained based on the model training method provided in the first aspect;
determining template features matched with the image features in a preset template feature library;
and determining a recognition result of the target image based on the quality score and the template feature.
Optionally, the feature extraction model includes a feature extraction layer and a blocking layer connected with the feature extraction layer, where the blocking layer is used to divide the feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule; correspondingly, the quality evaluation model outputs a quality score corresponding to each sub-feature map;
accordingly, the determining the recognition result of the target image based on the quality score and the template feature includes:
and if the similarity between each sub-feature map and its corresponding template feature is greater than a first similarity threshold and the quality score of at least one sub-feature map is greater than a first score threshold, outputting the classification result corresponding to the template feature.
Optionally, the method further comprises:
and if the similarity between each sub-feature map and its corresponding template feature is greater than a second similarity threshold and the quality score of each sub-feature map is smaller than a second score threshold, not outputting the classification result corresponding to the template feature.
In a third aspect, an electronic device is provided, the device comprising a processor and a memory; the memory stores a program that is loaded and executed by the processor to implement the model training method for image recognition provided in the first aspect; or the image recognition method provided in the second aspect is implemented.
In a fourth aspect, there is provided a computer readable storage medium having stored therein a program which, when executed by a processor, is adapted to carry out the model training method for image recognition provided in the first aspect; or the image recognition method provided in the second aspect is implemented.
By training a preset first machine learning network based on a training data set, a feature extraction model is obtained; the feature extraction model extracts features of an input target image in the image recognition process to obtain image features, so that template features matching the image features can be determined in a preset template feature library. For the N sample images corresponding to each class label in the training data set, a preset second machine learning network is trained based on the N sample images and the feature extraction model to obtain a quality evaluation model; the quality evaluation model determines a quality score of the target image in the image recognition process, so that the recognition result of the target image is determined by combining the quality score and the template features. This solves the problem that the accuracy of image recognition is reduced when low-quality images are input into an image recognition network. Because the quality of the input image can be evaluated by the quality evaluation model, the recognition result can be judged unreliable when the quality is low and therefore not output, which guarantees the accuracy of the recognition results that are output. Meanwhile, the quality evaluation model is trained without pre-annotated quality score labels, which saves the human labor consumed in annotating image quality scores and improves the training efficiency of the quality evaluation model.
In addition, the blocking layer divides the feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule; even if the region corresponding to one sub-feature map is occluded, image recognition can still be performed through the other sub-feature maps, which improves the success rate of image recognition.
In addition, because the quality evaluation model is trained based on the distance distribution difference, the trained model can evaluate images acquired in each image acquisition scene, which improves network performance.
In addition, by determining the distance distribution difference based on the distribution distillation loss function, the performance gap between simple and difficult samples can be narrowed, and the distance between the expectations of the negative-pair and positive-pair similarity distributions can be constrained to control their overlap, thereby improving the network performance of the quality evaluation model.
In addition, by combining the distance distribution difference and the classification difference to jointly train the quality evaluation model, the performance of the simple sample can be maintained, the performance gap between the simple sample and the difficult sample can be reduced, and the network performance of the quality evaluation model can be further improved.
In addition, by enabling the third sample image set in the student distribution to cover each image acquisition scene, the evaluation performance of the quality evaluation model on the images under each image acquisition scene can be ensured.
The foregoing description is only an overview of the technical solutions of the present application, and in order to make the technical means of the present application more clearly understood, it can be implemented according to the content of the specification, and the following detailed description of the preferred embodiments of the present application will be given with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a model training method for image recognition provided in one embodiment of the present application;
FIG. 2 is a schematic diagram of a training process for a feature extraction model provided in one embodiment of the present application;
FIG. 3 is a schematic diagram of a training process for a quality assessment model provided in one embodiment of the present application;
FIG. 4 is a flow chart of an image recognition method provided by one embodiment of the present application;
FIG. 5 is a block diagram of a model training apparatus for image recognition provided in one embodiment of the present application;
FIG. 6 is a block diagram of an image recognition device provided in one embodiment of the present application;
FIG. 7 is a block diagram of an electronic device provided in one embodiment of the present application.
Detailed Description
The implementation of the present application is described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present application but are not intended to limit its scope.
In the conventional image recognition method, an image recognition network is generally used to recognize an input target image, so as to obtain a recognition result. However, the target image is greatly affected by the environment, which may cause the image recognition network to output an erroneous recognition result, affecting the accuracy of image recognition.
To address this technical problem, in one possible implementation a quality evaluation model may be trained in advance; the target image is input into the quality evaluation model to obtain its quality score, and the quality of the input image is judged from this score. When the quality is low, the recognition result can be judged unreliable and therefore not output, which guarantees the accuracy of the recognition results that are output.
However, in the above implementation, the quality evaluation model must be trained with sample images and their corresponding quality score labels, and the image recognition network must be trained with sample images and their corresponding class labels. This requires a large amount of image annotation work, making model training inefficient.
To address this, the present embodiments provide a model training method for image recognition in which, after a feature extraction model is trained using sample images and their corresponding class labels, a quality evaluation model is trained based on the feature extraction model and the sample images, without annotating quality score labels for the sample images. This saves the human labor consumed in annotating image quality scores and improves the training efficiency of the quality evaluation model. Meanwhile, the trained quality evaluation model can select images suitable for the feature extraction model, improving the accuracy of image recognition.
The model training method for image recognition provided by the present application is described in detail below. Optionally, the method provided in each embodiment runs on an electronic device, which is a terminal or a server; the terminal may be a mobile phone, a computer, a tablet computer, a scanner, an electronic eye, a monitoring camera, or the like. This embodiment does not limit the type of the electronic device.
FIG. 1 is a flow chart of a model training method for image recognition according to one embodiment of the present application, the method comprising at least the following steps:
Step 101, acquiring a training data set, where the training data set includes a plurality of sample images and a class label corresponding to each sample image.
In this embodiment, a sample image is an image containing the object to be identified; it may be a single frame extracted from a video stream or a photograph taken of the object to be identified, and this embodiment does not limit the type of the sample image. The objects to be identified include, but are not limited to, faces, vehicles, and license plates; this embodiment does not limit the type of the object to be identified.
The class of a sample image is used to distinguish between different objects to be identified. A class label may be represented by an identifier of the object to be identified; for example, when the object to be identified is a face, the class label is the identity (ID) of the face.
Each class label in the training data set corresponds to at least one sample image of the same object to be identified, and the numbers of sample images corresponding to different class labels may be the same or different.
Optionally, the training data set includes sample images acquired in different image acquisition scenes, to improve the ability of the network model to adapt to different scenes. In different image acquisition scenes, the images obtained when shooting the same object to be identified differ, including but not limited to: different illumination intensity, different acquisition angles, different shutter speeds, different acquisition distances, and/or different exposures; this embodiment does not limit the implementation of the image acquisition scenes.
Optionally, the class labels are obtained by manually annotating the sample images and/or read from an annotated public dataset; this embodiment does not limit how the class labels are obtained.
Optionally, the sample images in the training data set have a uniform size. In this case, after an original image is obtained, target detection is performed on it to obtain a target detection frame; key points of the object to be identified are located based on the target detection frame; and the original image is normalized to the same size according to the key-point positions to obtain a sample image. For example, the normalized size is 112×112; other sizes may be used in actual implementations, and this embodiment does not limit the normalized size.
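As an illustration of this preprocessing, the following sketch aligns an original image to 112×112 with OpenCV. The detector and key-point locator (`detect_target`, `locate_keypoints`) are hypothetical stand-ins, and the five-point reference template is the widely used face-alignment layout, assumed here since the application does not specify one:

```python
import cv2
import numpy as np

# Reference key-point layout in the normalized 112x112 image (an assumed
# template; the application does not prescribe specific key-point positions).
TEMPLATE_112 = np.array([[38.2946, 51.6963], [73.5318, 51.5014],
                         [56.0252, 71.7366], [41.5493, 92.3655],
                         [70.7299, 92.2041]], dtype=np.float32)

def normalize_sample(original, detect_target, locate_keypoints):
    """Crop and align one original image into a uniform 112x112 sample image."""
    box = detect_target(original)              # target detection frame (x, y, w, h)
    src_pts = locate_keypoints(original, box)  # key points of the object, shape (5, 2)
    # Estimate a similarity transform from detected to template key points,
    # then warp the original image to the normalized size.
    M, _ = cv2.estimateAffinePartial2D(src_pts.astype(np.float32), TEMPLATE_112)
    return cv2.warpAffine(original, M, (112, 112))
```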
Step 102, training a preset first machine learning network based on the training data set to obtain a feature extraction model.
The feature extraction model is used for extracting features of an input target image in the image recognition process to obtain image features so as to determine template features matched with the image features in a preset template feature library. Specifically, the feature extraction model is used for extracting image features of an object to be identified in the object image.
Optionally, the first machine learning network is a neural network model, which may be built based on a residual network, and in other embodiments, the neural network model may be built based on other types of network structures, and the implementation of the first machine learning network is not limited in this embodiment.
In the model training stage, the first machine learning network is connected with a classification network, and the classification network is used for predicting the probability that the characteristic data output by the first machine learning network belong to each category. The feature data in the classification network is weighted differently for each class.
Correspondingly, training the preset first machine learning network based on the training data set to obtain the feature extraction model includes the following steps:
inputting the sample image into the first machine learning network to obtain feature data; inputting the feature data into the classification network to predict the probability that the feature data belongs to each class; acquiring the weights of the classification network used in the probability prediction; determining a prediction class of the sample image based on the weights and the feature data; and training the first machine learning network based on the classification difference between the prediction class and the class label corresponding to the sample image, to update the model parameters of the first machine learning network and obtain the feature extraction model. In other words, the feature extraction model is obtained by training to minimize the classification difference.
Optionally, the sample image is input to the first machine learning network in an image-sampled manner. The image sampling mode can be realized by randomly selecting sample images.
Optionally, the classification difference is determined by an additive angular margin loss function (ArcFace loss), which can be expressed by the following formula:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+t)}}{e^{s\cos(\theta_{y_i}+t)}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}$$

where N is the total number of sample images; i indexes the i-th sample image, 1 ≤ i ≤ N; y_i is the class label of the i-th sample image; n is the total number of classes; j indexes the j-th class; s is a scaling factor; θ_yi is the angle between the weight corresponding to y_i in the classification network and the feature data; θ_j is the angle between the weight corresponding to class j in the classification network and the feature data; and t is the angular margin.
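A minimal PyTorch sketch of this loss follows; the class name and the default hyper-parameter values (s=64, t=0.5) are illustrative assumptions rather than values taken from the application:

```python
import torch
import torch.nn.functional as F

class ArcFaceLoss(torch.nn.Module):
    """Additive angular margin loss; s is the scaling factor, t the angular margin."""
    def __init__(self, feat_dim, num_classes, s=64.0, t=0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.t = s, t

    def forward(self, features, labels):
        # Cosine of the angle between each feature vector and each class weight.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin t only to the angle of the ground-truth class.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.t * one_hot)
        return F.cross_entropy(logits, labels)
```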
The classification network may be a fully connected network or may be other types of network structures, and the implementation manner of the classification network is not limited in this embodiment.
In one example, since the object to be identified may be partially occluded in the sample image, a conventional feature extraction model may fail to extract feature data because the object in the sample image is incomplete. Based on this, the first machine learning network may specifically include a feature extraction layer and a blocking layer connected to the feature extraction layer. The feature extraction layer is used for extracting a feature map of the input image; the blocking layer is used for dividing the feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule. Correspondingly, the trained feature extraction model also includes the trained feature extraction layer and the trained blocking layer. In this way, even if the region corresponding to a certain sub-feature map is occluded, image recognition can be performed through the other sub-feature maps.
The preset rules include, but are not limited to: dividing the feature map into upper and lower sub-feature maps of the same size; or dividing the feature map into left and right sub-feature maps of the same size; or dividing the feature map into 4 sub-feature maps of the same size along the horizontal and vertical center lines. This embodiment does not limit the implementation of the preset rule; a sketch of the first rule is given below.
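For instance, under the upper/lower-halves rule, a blocking layer can be sketched in PyTorch as follows (the class name and shapes are illustrative assumptions):

```python
import torch

class BlockingLayer(torch.nn.Module):
    """Divides a feature map (N, C, H, W) into equal-size sub-feature maps."""
    def __init__(self, num_blocks=2, dim=2):  # dim=2 splits along the height
        super().__init__()
        self.num_blocks, self.dim = num_blocks, dim

    def forward(self, feature_map):
        # e.g. (N, C, 8, 8) -> two sub-feature maps of shape (N, C, 4, 4)
        return torch.chunk(feature_map, self.num_blocks, dim=self.dim)
```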
Correspondingly, the blocking layer is also connected with at least two classification networks, the classification networks being in one-to-one correspondence with the sub-feature maps, so as to predict the class of each sub-feature map. In the above loss function, θ_yi can be adapted to be the average of the angles between the weight corresponding to y_i in the classification network and the feature data of each sub-feature map, and θ_j the average of the angles between the weight corresponding to class j and the feature data of each sub-feature map. Alternatively, θ_yi can be adapted to be the angle between the weight corresponding to y_i and the feature data of a single sub-feature map, and θ_j the angle between the weight corresponding to class j and the feature data of that same sub-feature map; in this case the final loss value is the average, or a weighted sum, of the loss values corresponding to the different sub-feature maps. This embodiment does not limit how the loss value of a sample image is computed when the sample image corresponds to multiple sub-feature maps.
In other embodiments, the blocking layer may be omitted during the training stage of the feature extraction model and instead introduced during the training stage of the quality evaluation model described below and during the image recognition stage; this embodiment does not limit the training manner of the feature extraction model.
The training process of the feature extraction model is described below, taking as an example a first machine learning network that is built on a residual network and includes a blocking layer, with fully connected networks as the classification networks. Referring to FIG. 2, the model structure during training includes the first machine learning network 21 to be trained, fully connected networks 22 connected to the first machine learning network 21, a feature fusion layer connected to the fully connected networks 22, and a loss function 23 connected to the feature fusion layer.
The first machine learning network 21 includes: an input layer 201, a convolution (Conv) layer 202 followed by a plurality of repeated convolution layers 202, pooling layers 203, and residual units (blocks) 204, and a blocking layer 205. The roles of the various network layers are given in Table 1. During training, a sample image is processed in sequence by the input layer 201, the convolution layer 202, the repeated convolution layers 202, pooling layers 203, and residual units 204 to obtain a feature map, which is split by the blocking layer 205 into at least two sub-feature maps. Each sub-feature map is input into its corresponding fully connected network 22 for classification, yielding the weights corresponding to the different classes for each sub-feature map. After the sub-feature maps are concatenated by the feature fusion layer, the weights and the concatenated feature map are input into the loss function 23 to obtain the classification loss, which represents the classification difference. The model parameters of the first machine learning network 21 are then iteratively updated based on the classification loss to obtain the feature extraction model.
In FIG. 2, the number of sub-feature maps is 2 and, accordingly, the number of fully connected networks 22 is two; in actual implementations there may be more sub-feature maps and fully connected networks, which this embodiment does not limit.
Table 1:
Step 103, for the N sample images corresponding to each class label in the training data set, training a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality evaluation model.
The quality evaluation model is used for determining the quality score of the target image in the image recognition process, so that the recognition result of the target image is determined by combining the quality score and the template features; N is a positive integer. During the training of the quality evaluation model, the model parameters of the feature extraction model are fixed.
In one example, the feature extraction model includes a feature extraction layer and a blocking layer connected to the feature extraction layer; accordingly, the quality evaluation model is used to output a quality score corresponding to each sub-feature map.
In this case, training the preset second machine learning network based on the N sample images and the feature extraction model to obtain the quality evaluation model includes: inputting each sample image into the feature extraction model and the second machine learning network respectively, to obtain at least two sub-feature maps corresponding to the sample image and a predicted quality score corresponding to each sub-feature map; weighting each sub-feature map based on its predicted quality score, and concatenating the differently weighted sub-features of the same sample image to obtain a weighted fusion feature of the sample image; and training the second machine learning network based on the weighted fusion features to obtain the quality evaluation model, as sketched below.
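A sketch of the weighting and concatenation step (shapes are assumptions; each sub-feature map is taken as already flattened to a vector):

```python
import torch

def weighted_fusion(sub_features, quality_scores):
    """Weight each sub-feature by its predicted quality score, then concatenate.

    sub_features:   list of K tensors, each (N, D), one per sub-feature map
    quality_scores: tensor of shape (N, K), one predicted score per sub-feature map
    """
    weighted = [f * quality_scores[:, k:k + 1] for k, f in enumerate(sub_features)]
    return torch.cat(weighted, dim=1)  # weighted fusion feature, shape (N, K * D)
```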
Optionally, the second machine learning network is a neural network model, which may be built based on a residual network, and in other embodiments, the neural network model may also be built based on other types of network structures, and the implementation of the second machine learning network is not limited in this embodiment.
The second machine learning network may be trained based on the weighted fusion features to obtain the quality evaluation model in, but not limited to, the following manners:
First: a feature distribution distillation loss is determined based on the weighted fusion features to train the second machine learning network.
When calculating the feature distribution distillation loss, different sample image sets need to be created in advance so that the feature distributions of the different sample image sets can be obtained. Thus, before inputting each sample image into the feature extraction model and the second machine learning network, the method further includes: generating a first sample image set, a second sample image set, and a third sample image set based on the N sample images, where the third sample image set covers sample images acquired in each image acquisition scene.
Optionally, generating the first, second, and third sample image sets based on the N sample images includes: dividing the N sample images into three sample image sets, two of which are the first sample image set and the second sample image set; and performing image expansion on the remaining sample image set according to different image acquisition scenes to obtain the third sample image set. N is an integer greater than or equal to 3.
Image expansion is used to make the third sample image set cover more image acquisition scenes. Image expansion means include, but are not limited to: rotating, blurring, brightness adjustment, scaling, and the like applied to the sample images in the set; this embodiment does not limit the image expansion method.
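A sketch of such expansion with torchvision transforms; the specific transforms and their parameter values are illustrative assumptions:

```python
import torchvision.transforms as T

# Each transform imitates a different acquisition condition: rotation,
# blur, brightness change, and loss of detail through down/up-scaling.
EXPANSIONS = [
    T.RandomRotation(degrees=15),
    T.GaussianBlur(kernel_size=5),
    T.ColorJitter(brightness=0.5),
    T.Compose([T.Resize(56), T.Resize(112)]),
]

def expand_set(sample_images):
    """Apply every expansion to every sample image to build the third set."""
    return [t(img) for img in sample_images for t in EXPANSIONS]
```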
Correspondingly, training the second machine learning network based on the weighted fusion features to obtain the quality evaluation model includes: determining a first distance distribution between the first sample image set and the second sample image set based on the weighted fusion features; determining a second distance distribution between the first sample image set and the third sample image set based on the weighted fusion features; and training the second machine learning network based on the distance distribution difference between the first distance distribution and the second distance distribution to obtain the quality evaluation model.
The distance distributions (the first distance distribution and the second distance distribution) characterize the similarity between the weighted fusion features corresponding to different sample image sets. In this embodiment, the second distance distribution is constrained to approximate the first distance distribution, so that the trained quality evaluation model can evaluate images acquired in each image acquisition scene, improving network performance.
In one example, the distance distribution difference is determined based on a Distribution Distillation Loss (DDL) function. In this case, after generating the first, second, and third sample image sets based on the N sample images, the method further includes: generating teacher positive pairs in the teacher distribution based on sample images in the first sample image set and sample images in the second sample image set, and generating teacher negative pairs in the teacher distribution based on sample images of other classes in the training data set; and generating student positive pairs in the student distribution based on sample images in the first sample image set and sample images in the third sample image set, and generating student negative pairs in the student distribution based on sample images of other classes in the training data set.
Accordingly, determining the first distance distribution between the first sample image set and the second sample image set based on the weighted fusion features includes: determining the similarity between the weighted fusion features corresponding to the teacher positive pairs and the similarity between the weighted fusion features corresponding to the teacher negative pairs, to obtain the first distance distribution.
Accordingly, determining the second distance distribution between the first sample image set and the third sample image set based on the weighted fusion features includes: determining the similarity between the weighted fusion features corresponding to the student positive pairs and the similarity between the weighted fusion features corresponding to the student negative pairs, to obtain the second distance distribution.
Accordingly, training the second machine learning network based on the distance distribution difference between the first distance distribution and the second distance distribution to obtain the quality evaluation model includes: determining a positive-pair relative entropy loss value based on the similarities corresponding to the teacher positive pairs and the similarities corresponding to the student positive pairs; determining a negative-pair relative entropy loss value based on the similarities corresponding to the teacher negative pairs and the similarities corresponding to the student negative pairs; determining a positive-negative distribution distance based on the expectations of the similarity distributions of the teacher and student positive pairs and the expectations of the similarity distributions of the teacher and student negative pairs; and training the second machine learning network based on the positive-pair relative entropy loss value, the negative-pair relative entropy loss value, and the positive-negative distribution distance to obtain the quality evaluation model.
To narrow the performance gap between simple and difficult samples, the similarity distribution of the difficult samples (i.e., the student distribution) is constrained to approximate that of the simple samples (i.e., the teacher distribution). The teacher distribution consists of two similarity distributions, over positive pairs and negative pairs, denoted P+ and P- respectively: P+ is the similarity distribution of the teacher positive pairs and P- that of the teacher negative pairs. The student distribution likewise consists of two similarity distributions, denoted Q+ and Q-: Q+ is the similarity distribution of the student positive pairs and Q- that of the student negative pairs. DDL adopts a KL-divergence term L_KL to constrain the student distribution toward the teacher distribution, defined as follows:

$$L_{KL} = \alpha_1\,\mathbb{E}_{s\sim P^+}\!\left[\log\frac{P^+(s)}{Q^+(s)}\right] + \alpha_2\,\mathbb{E}_{s\sim P^-}\!\left[\log\frac{P^-(s)}{Q^-(s)}\right] = \alpha_1 D_{KL}(P^+\|Q^+) + \alpha_2 D_{KL}(P^-\|Q^-)$$

where s represents the similarity of an image pair, α_1 and α_2 are preset weight parameters, D_KL(P+||Q+) is the positive-pair relative entropy loss value, and D_KL(P-||Q-) is the negative-pair relative entropy loss value.
Since only the KL divergence is used, the teacher distribution could instead drift toward the student distribution, contrary to the approximation objective described above. Based on this, an order loss is also introduced in DDL that constrains the expectations of the positive-pair and negative-pair similarity distributions, so as to control their overlap. The order loss L_order is represented by the following formula:

$$L_{order} = -\alpha_3\left(\mathbb{E}_{s\sim S^+}[s] - \mathbb{E}_{s\sim S^-}[s]\right)$$

where S+ is the positive-pair similarity distribution, S- is the negative-pair similarity distribution, α_3 is a preset weight parameter, and E denotes expectation.
In summary, the DDL loss function is L_DDL = L_KL + L_order; summed over the K student distributions, it can be represented by the following formula:

$$L_{DDL} = \sum_{k=1}^{K}\left[\alpha_1 D_{KL}\!\left(P^+\|Q_k^+\right) + \alpha_2 D_{KL}\!\left(P^-\|Q_k^-\right)\right] + L_{order}$$

where K is the number of student distributions and D_KL is the KL-divergence loss.
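A minimal sketch of this loss; the similarity distributions are approximated here with differentiable soft histograms, an implementation choice assumed for illustration rather than prescribed by the application:

```python
import torch

def soft_histogram(sims, bins=32, bandwidth=0.1):
    """Differentiable histogram of pair similarities over [-1, 1]."""
    centers = torch.linspace(-1.0, 1.0, bins, device=sims.device)
    weights = torch.exp(-(sims.unsqueeze(1) - centers) ** 2 / (2 * bandwidth ** 2))
    hist = weights.sum(dim=0)
    return hist / hist.sum().clamp_min(1e-12)

def ddl_loss(teacher_pos, teacher_neg, student_pos, student_neg,
             a1=1.0, a2=1.0, a3=0.5):
    """Inputs are 1-D tensors of pair similarities for the four pair types."""
    p_pos, p_neg = soft_histogram(teacher_pos), soft_histogram(teacher_neg)
    q_pos, q_neg = soft_histogram(student_pos), soft_histogram(student_neg)
    eps = 1e-12
    # KL terms: constrain the student distributions to match the teacher's.
    l_kl = a1 * (p_pos * ((p_pos + eps) / (q_pos + eps)).log()).sum() \
         + a2 * (p_neg * ((p_neg + eps) / (q_neg + eps)).log()).sum()
    # Order loss: keep positive-pair expectations above negative-pair ones.
    l_order = -a3 * ((teacher_pos.mean() + student_pos.mean())
                     - (teacher_neg.mean() + student_neg.mean()))
    return l_kl + l_order
```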
In actual implementations, other loss functions that make the second distance distribution approximate the first distance distribution may also be used, such as minimizing the variance between the second distance distribution and the first distance distribution; this embodiment does not limit the implementation of that loss function.
Second: a feature distribution distillation loss is determined based on the weighted fusion features, and the second machine learning network is trained jointly with the classification loss of the classes corresponding to the weighted fusion features.
The feature distribution distillation loss is as described in the first training manner. When the classification loss is also computed, the blocking layer of the feature extraction model is further connected with at least two classification networks. Correspondingly, training the second machine learning network based on the distance distribution difference between the first distance distribution and the second distance distribution to obtain the quality evaluation model includes: acquiring, for each sample image, the weights of the classification network corresponding to each sub-feature map; determining a prediction class of the sample image based on the sub-feature maps and the weights; and training the second machine learning network based on both the classification difference between the prediction class and the class label corresponding to the sample image and the distance distribution difference, to obtain the quality evaluation model.
Optionally, training the second machine learning network based on the classification difference and the distribution difference includes: calculating the sum of the classification loss and the feature distribution distillation loss to obtain a total loss, and iteratively updating the model parameters of the second machine learning network based on the total loss; or calculating a weighted sum of the classification loss and the feature distribution distillation loss to obtain a weighted loss, and iteratively updating the model parameters of the second machine learning network based on the weighted loss. This embodiment does not limit the manner in which the loss value is determined from the classification difference and the distribution difference; a sketch of one training step is given below.
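Putting the two terms together, one training step might look like the sketch below; `feature_model`, `quality_net`, `cls_loss_fn`, and `compute_ddl` are hypothetical names tying together the earlier sketches, and the weights and optimizer settings are illustrative:

```python
import torch

# Assumed to exist from the earlier sketches: feature_model (frozen, returns
# flattened sub-features), quality_net (the second machine learning network),
# weighted_fusion (above), cls_loss_fn (a classification loss), and
# compute_ddl (builds the four pair-similarity sets and calls ddl_loss).
optimizer = torch.optim.SGD(quality_net.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels, w_cls=1.0, w_ddl=1.0):
    with torch.no_grad():                        # feature extraction model is fixed
        sub_feats = feature_model(images)
    scores = quality_net(images)                 # predicted quality scores, (N, K)
    fused = weighted_fusion(sub_feats, scores)   # weighted fusion features
    loss = w_cls * cls_loss_fn(fused, labels) + w_ddl * compute_ddl(fused, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```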
A description of the classification difference is given above in the training process of the feature extraction model and is not repeated here.
The following describes training the quality evaluation model in the second manner, taking as an example a second machine learning network built on a residual network. Referring to FIG. 3, the model structure during training includes the trained feature extraction model 31 and the second machine learning network 32 to be trained. The blocking layer of the feature extraction model 31 is connected to at least two fully connected networks 34 (FIG. 3 takes two fully connected networks 34 as an example; accordingly, the number of sub-feature maps is also two); here the classification networks are implemented as fully connected networks, which classify the data output by the feature extraction model 31. The second machine learning network 32 includes an input layer 321, a plurality of repeated convolution layers 322, pooling layers 323, and residual units 324, and a fully connected network 325 connected to the last residual unit 324. The fully connected networks 34 and the fully connected network 325 are each connected to the feature fusion layer 35, and the feature fusion layer 35 is connected to the loss function 36. The roles of the various network layers are given in Table 2. The output dimension of the fully connected network 325 is preset to two; its output therefore consists of two one-dimensional vectors, corresponding to the predicted quality scores of the two sub-feature maps.
During training, each sample image is input into the feature extraction model 31 and the second machine learning network 32 respectively. The two sub-feature maps output by the blocking layer of the feature extraction model 31 are input into their corresponding fully connected networks 34, and the weights of the fully connected networks 34 corresponding to the respective classes are acquired. The two sub-feature maps output by the blocking layer and the predicted quality scores output by the second machine learning network 32 are input into the feature fusion layer 35; the sub-feature maps are weighted by their corresponding predicted quality scores, and the differently weighted sub-features of the same sample image are concatenated to obtain the weighted fusion feature of the sample image. The weighted fusion features are then fed into the classification loss and the feature distribution distillation loss in the loss function 36 to obtain the loss values. Finally, the model parameters of the second machine learning network 32 are iteratively updated based on the loss values to obtain the quality evaluation model.
Table 2:
in other examples, if the feature extraction model does not include a blocking layer, training a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality assessment model, including: inputting each sample image into a feature extraction model and a classification network to obtain feature data output by the feature extraction model and weights of the feature data output by the classification network corresponding to various categories; determining a target quality score based on a similarity between the weight and the feature data; inputting the sample image into a second machine learning network to obtain a prediction result of the quality score of the sample image; determining a quality loss based on the target quality score and the predicted result of the quality score; and iteratively updating model parameters of the second machine learning network based on the quality loss to obtain a quality assessment model. The present embodiment does not limit the training manner of the quality evaluation model.
In summary, in the model training method for image recognition provided by this embodiment, a feature extraction model is obtained by training a preset first machine learning network based on a training data set; the feature extraction model extracts features of an input target image in the image recognition process to obtain image features, so that template features matching the image features can be determined in a preset template feature library. For the N sample images corresponding to each class label in the training data set, a preset second machine learning network is trained based on the N sample images and the feature extraction model to obtain a quality evaluation model; the quality evaluation model determines a quality score of the target image in the image recognition process, so that the recognition result of the target image is determined by combining the quality score and the template features. This solves the problem that the accuracy of image recognition is reduced when low-quality images are input into an image recognition network. Because the quality of the input image can be evaluated by the quality evaluation model, the recognition result can be judged unreliable when the quality is low and therefore not output, which guarantees the accuracy of the recognition results that are output. Meanwhile, the quality evaluation model is trained without pre-annotated quality score labels, which saves the human labor consumed in annotating image quality scores and improves the training efficiency of the quality evaluation model.
In addition, the feature images output by the feature extraction layer are divided into at least two sub-feature images according to a preset rule by the partitioning layer, and even if the area corresponding to one sub-feature image is blocked, the image recognition can be performed through other sub-feature images, so that the success rate of the image recognition can be improved.
In addition, because the quality evaluation model is trained based on the distance distribution difference, the trained quality evaluation model has the ability to evaluate images acquired in each image acquisition scene, which improves the network performance.
In addition, by determining the distance distribution difference based on a distribution distillation function, the performance gap between simple and difficult samples can be reduced, and the overlap between the negative-pair and positive-pair similarity distributions can be controlled through the distance between their expectations, thereby improving the network performance of the quality assessment model. One plausible form of such a loss is sketched below.
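The following sketch shows one plausible distribution-distillation-style term, assuming the two distance distributions are represented as soft histograms of pairwise similarities; the binning, the KL-divergence formulation, and the margin-based order term are illustrative assumptions, not the exact function of this application.

import torch

def soft_histogram(sims, bins=32):
    # Differentiable histogram of similarity values in [-1, 1].
    centers = torch.linspace(-1.0, 1.0, bins, device=sims.device)
    width = 2.0 / (bins - 1)
    weights = torch.clamp(1.0 - (sims.unsqueeze(1) - centers).abs() / width, min=0.0)
    hist = weights.sum(dim=0)
    return hist / hist.sum().clamp(min=1e-8)

def distribution_distill_loss(easy_sims, hard_sims, pos_sims, neg_sims, margin=0.1):
    p_easy, p_hard = soft_histogram(easy_sims), soft_histogram(hard_sims)
    # Pull the hard (student) distribution toward the easy (teacher) one.
    kl = (p_easy * ((p_easy + 1e-8).log() - (p_hard + 1e-8).log())).sum()
    # Order term: keep the expected negative-pair similarity below the
    # expected positive-pair similarity to control their overlap.
    order = torch.clamp(neg_sims.mean() - pos_sims.mean() + margin, min=0.0)
    return kl + order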
In addition, by jointly training the quality evaluation model with both the distance distribution difference and the classification difference, the performance on simple samples is maintained while the performance gap between simple and difficult samples is reduced, further improving the network performance of the quality assessment model.
In addition, by making the third sample image set in the student distribution cover each image acquisition scene, the evaluation performance of the quality evaluation model on images from each image acquisition scene is ensured; a construction sketch is given below.
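For illustration, the sample-set construction could look like the following sketch, assuming the N sample images of a class are split evenly into three parts and the third part is expanded with one scene-specific augmentation (blur, low light, noise, and so on) per image acquisition scene; the split ratio and the augmentation callables are assumptions.

import random

def build_sample_sets(images, scene_augs):
    # images:     list of the N sample images of one class
    # scene_augs: list of callables, one augmentation per acquisition scene
    images = list(images)
    random.shuffle(images)
    third = len(images) // 3
    first_set, second_set = images[:third], images[third:2 * third]
    # Expand the remaining images with one variant per acquisition scene
    # so that the third set covers every scene.
    third_set = [aug(img) for img in images[2 * third:] for aug in scene_augs]
    return first_set, second_set, third_set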
Based on the feature extraction model and the quality evaluation model obtained in the foregoing embodiments, the present application further provides an image recognition method, and fig. 4 is a flowchart of the image recognition method provided in one embodiment of the present application, where the method at least includes the following steps:
Step 401, acquiring a target image to be identified.
In one example, an original image is acquired through an image acquisition component, and target detection is performed on the original image to obtain the position and size of the target to be identified; key point positioning is then performed on the detected target, and the target is normalized to a preset size so that its feature data and quality score can be extracted. A preprocessing sketch is given below.
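The following sketch illustrates this preprocessing step; the detector and aligner interfaces, the reference key points, and the 112×112 preset size are placeholders assumed for illustration only.

import cv2
import numpy as np

PRESET_SIZE = (112, 112)  # assumed normalization size

def prepare_target(image, detector, aligner, reference_points):
    # Detect the target to obtain its position and size (assumed bbox format).
    x, y, w, h = detector(image)
    crop = image[y:y + h, x:x + w]
    # Locate key points and estimate a similarity transform to the
    # canonical reference positions, then normalize to the preset size.
    src = np.asarray(aligner(crop), dtype=np.float32)
    dst = np.asarray(reference_points, dtype=np.float32)
    matrix, _ = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(crop, matrix, PRESET_SIZE)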
Step 402, inputting the target image into the feature extraction model and the quality evaluation model respectively, so as to output the image features of the target image through the feature extraction model and output the quality scores of the target image through the quality evaluation model.
The feature extraction model and the quality assessment model are obtained through training based on the model training method provided in the foregoing embodiments.
Step 403, determining template features matched with the image features in a preset template feature library.
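For illustration, the matching in step 403 could be a cosine-similarity search over the template feature library, as in the sketch below; the library layout (one L2-normalized feature vector per template) is an assumption.

import numpy as np

def match_template(image_feature, template_library, template_ids):
    # template_library: (num_templates, dim) matrix, one feature per template
    feat = image_feature / np.linalg.norm(image_feature)
    lib = template_library / np.linalg.norm(template_library, axis=1, keepdims=True)
    sims = lib @ feat                       # cosine similarity to every template
    best = int(np.argmax(sims))
    return template_ids[best], float(sims[best])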
Step 404, determining a recognition result of the target image based on the quality score and the template feature.
In one example, the feature extraction model includes a feature extraction layer and a blocking layer connected to the feature extraction layer, where the blocking layer is used to divide the feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule; correspondingly, the quality evaluation model outputs a quality score for each sub-feature map. In this case, determining the recognition result of the target image based on the quality scores and the template features includes: if the similarity between each sub-feature map and its corresponding template feature is greater than a first similarity threshold T1 and the quality score of at least one sub-feature map is greater than a first score threshold T3, outputting the classification result corresponding to the template feature; if the similarity between each sub-feature map and its corresponding template feature is greater than a second similarity threshold T2 and the quality score of every sub-feature map is less than a second score threshold T4, not outputting the classification result corresponding to the template feature.
The second score threshold T4 is less than or equal to the first score threshold T3, and the first similarity threshold T1 is greater than or equal to the second similarity threshold T2. A sketch of this decision rule follows.
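The accept/reject logic can be summarized in the following hedged sketch; the threshold values are illustrative, and the behaviour when neither condition holds is not specified here, so the sketch returns 'undecided' in that case.

def decide(similarities, quality_scores, T1=0.6, T2=0.4, T3=0.5, T4=0.3):
    # similarities:   per-sub-feature-map similarity to its template feature
    # quality_scores: per-sub-feature-map quality score
    assert T4 <= T3 and T2 <= T1                 # constraints from the text
    if all(s > T1 for s in similarities) and any(q > T3 for q in quality_scores):
        return "accept"      # output the classification result
    if all(s > T2 for s in similarities) and all(q < T4 for q in quality_scores):
        return "reject"      # suppress the result (low-quality input)
    return "undecided"       # not specified by the two rules above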
The output modes of the classification result include, but are not limited to: outputting the category information and raising an alarm, or displaying only the category information. The present embodiment does not limit the output mode.
In summary, in the image recognition method provided in this embodiment, the target image to be recognized is acquired; the target image is input into the feature extraction model and the quality evaluation model respectively, so that the image features of the target image are output by the feature extraction model and the quality score of the target image is output by the quality evaluation model; template features matching the image features are determined in a preset template feature library; and the recognition result of the target image is determined based on the quality score and the template features. This solves the problem of reduced image recognition accuracy when a low-quality image is input into an image recognition network: because the quality of the input image can be evaluated by the quality evaluation model, a low-quality input can be judged as likely to yield an inaccurate recognition result, so that no recognition result is output and the accuracy of the output results is ensured. Meanwhile, since the quality evaluation model is obtained without training on pre-annotated quality score labels, the human resources consumed in annotating image quality scores are saved and the training efficiency of the quality evaluation model is improved.
In addition, since the blocking layer divides the feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule, even if the area corresponding to one sub-feature map is occluded, image recognition can still be performed through the other sub-feature maps, thereby improving the success rate of image recognition.
FIG. 5 is a block diagram of a model training apparatus for image recognition provided in one embodiment of the present application. The device at least comprises the following modules: a data acquisition module 510, a first training module 520, and a second training module 530.
A data acquisition module 510, configured to acquire a training data set, where the training data set includes a plurality of sample images and category labels corresponding to each sample image;
the first training module 520 is configured to train a preset first machine learning network based on the training data set to obtain a feature extraction model; the feature extraction model is used for extracting features of an input target image in the image recognition process to obtain image features so as to determine template features matched with the image features in a preset template feature library;
The second training module 530 is configured to, for the N sample images corresponding to each type of label in the training data set, train a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality evaluation model; the quality evaluation model is used for determining the quality score of the target image in the image recognition process, so as to determine the recognition result of the target image by combining the quality score and the template features; wherein N is a positive integer.
For relevant details reference is made to the method embodiments described above.
It should be noted that: the model training device for image recognition provided in the above embodiment is illustrated only by the division of the above functional modules when performing model training for image recognition; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the model training device for image recognition may be divided into different functional modules to complete all or part of the functions described above. In addition, the model training device for image recognition provided in the above embodiment and the embodiments of the model training method for image recognition belong to the same concept; the detailed implementation process of the device is described in the method embodiments and is not repeated here.
Fig. 6 is a block diagram of an image recognition apparatus provided in one embodiment of the present application. The device at least comprises the following modules: an image acquisition module 610, a data extraction module 620, a feature matching module 630, and an image recognition module 640.
An image acquisition module 610, configured to acquire a target image to be identified;
a data extraction module 620, configured to input the target image into a feature extraction model and a quality assessment model, respectively, so as to output image features of the target image through the feature extraction model, and output quality scores of the target image through the quality assessment model; the feature extraction model and the quality evaluation model are trained based on the model training method for image recognition in the above embodiment;
The feature matching module 630 is configured to determine a template feature that matches the image feature in a preset template feature library;
an image recognition module 640, configured to determine a recognition result of the target image based on the quality score and the template feature.
For relevant details reference is made to the method embodiments described above.
It should be noted that: the image recognition device provided in the above embodiment is illustrated only by the division of the above functional modules; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the image recognition device may be divided into different functional modules to complete all or part of the functions described above. In addition, the image recognition device and the image recognition method provided in the foregoing embodiments belong to the same concept; the specific implementation process of the device is detailed in the method embodiments and is not repeated here.
Fig. 7 is a block diagram of an electronic device provided in one embodiment of the present application. The device comprises at least a processor 701 and a memory 702.
The processor 701 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 701 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 702 is used to store at least one instruction, which is executed by the processor 701 to implement the model training method for image recognition or the image recognition method provided by the method embodiments herein.
In some embodiments, the electronic device may further optionally include: a peripheral interface and at least one peripheral. The processor 701, the memory 702, and the peripheral interfaces may be connected by buses or signal lines. The individual peripheral devices may be connected to the peripheral device interface via buses, signal lines or circuit boards. Illustratively, peripheral devices include, but are not limited to: radio frequency circuitry, touch display screens, audio circuitry, and power supplies, among others.
Of course, the electronic device may also include fewer or more components, as the present embodiment is not limited in this regard.
Optionally, the application further provides a computer-readable storage medium, in which a program is stored, the program being loaded and executed by a processor to implement the model training method for image recognition or the image recognition method of the above method embodiments.
Optionally, the application further provides a computer program product, which includes a computer-readable storage medium storing a program, the program being loaded and executed by a processor to implement the model training method for image recognition or the image recognition method of the above method embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this description.
The above embodiments merely represent a few implementations of the present application; their description is relatively specific and detailed, but is not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, and these all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A model training method for image recognition, the method comprising:
acquiring a training data set, wherein the training data set comprises a plurality of sample images and category labels corresponding to each sample image;
training a preset first machine learning network based on the training data set to obtain a feature extraction model; the feature extraction model is used for extracting features of an input target image in the image recognition process to obtain image features so as to determine template features matched with the image features in a preset template feature library;
for the N sample images corresponding to each type of label in the training data set, training a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality evaluation model; the quality evaluation model is used for determining the quality score of the target image in the image recognition process so as to determine the recognition result of the target image by combining the quality score and the template feature; wherein N is a positive integer.
2. The method according to claim 1, wherein the feature extraction model comprises a feature extraction layer and a blocking layer connected with the feature extraction layer, the blocking layer is used for dividing a feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule; correspondingly, the quality evaluation model outputs the quality score corresponding to each sub-feature map;
Training a preset second machine learning network based on the N sample images and the feature extraction model to obtain a quality evaluation model, wherein the training comprises the following steps:
inputting each sample image into the feature extraction model and the second machine learning network respectively to obtain at least two sub-feature images corresponding to the sample image and a prediction result of quality scores corresponding to each sub-feature image;
weighting the sub-feature images based on the prediction result of the quality score corresponding to each sub-feature image, and connecting the sub-features of the same sample image after different weighting to obtain the weighted fusion feature of the sample image;
and training the second machine learning network based on the weighted fusion characteristics to obtain the quality assessment model.
3. The method of claim 2, wherein before inputting each sample image into the feature extraction model and the second machine learning network, respectively, further comprises:
generating a first sample image set, a second sample image set, and a third sample image set based on the N sample images; the third sample image set covers sample images acquired in each image acquisition scene;
Correspondingly, the training the second machine learning network based on the weighted fusion feature to obtain the quality assessment model includes:
determining a first distance distribution between the first sample image set and the second sample image set based on the weighted fusion feature;
determining a second distance distribution between the first sample image set and the third sample image set based on the weighted fusion feature;
and training the second machine learning network based on the distance distribution difference between the first distance distribution and the second distance distribution to obtain the quality evaluation model.
4. A method according to claim 3, wherein the blocking layer is further connected with at least two classification networks, the classification networks are in one-to-one correspondence with the sub-feature maps, and weights of the sub-feature maps in the classification networks corresponding to the classifications are different;
the training the second machine learning network based on the distance distribution difference between the first distance distribution and the second distance distribution to obtain the quality evaluation model includes:
acquiring the weight of a classification network corresponding to each sub-feature map in each sample image;
Determining a prediction category of the sample image based on the sub-feature map and the weights;
and training the second machine learning network based on the classification difference between the prediction category and the category label corresponding to the sample image and the distance distribution difference to obtain the quality evaluation model.
5. The method of claim 3, wherein the generating a first sample image set, a second sample image set, and a third sample image set based on the N sample images comprises:
dividing the N sample images into three sample image sets, wherein two sample image sets are the first sample image set and the second sample image set respectively;
and performing image expansion on the sample image sets except the two sample image sets according to different image acquisition scenes to obtain the third sample image set.
6. An image recognition method, the method comprising:
acquiring a target image to be identified;
respectively inputting the target image into a feature extraction model and a quality evaluation model to output image features of the target image through the feature extraction model and output quality scores of the target image through the quality evaluation model; the feature extraction model and the quality assessment model are trained based on the model training method of any one of claims 1 to 5;
Determining template features matched with the image features in a preset template feature library;
and determining a recognition result of the target image based on the quality score and the template feature.
7. The method according to claim 6, wherein the feature extraction model comprises a feature extraction layer and a blocking layer connected with the feature extraction layer, the blocking layer is used for dividing a feature map output by the feature extraction layer into at least two sub-feature maps according to a preset rule; correspondingly, the quality evaluation model outputs the quality score corresponding to each sub-feature map;
accordingly, the determining the recognition result of the target image based on the quality score and the template feature includes:
and if the similarity between each sub-feature image and the corresponding template feature is greater than a first similarity threshold and the quality score of at least one sub-feature image is greater than a first score threshold, outputting a classification result corresponding to the template feature.
8. The method of claim 7, wherein the method further comprises:
and if the similarity between each sub-feature image and the corresponding template feature is greater than a second similarity threshold and the quality score of each sub-feature image is smaller than a second score threshold, not outputting a classification result corresponding to the template feature.
9. An electronic device comprising a processor and a memory; the memory having stored therein a program loaded and executed by the processor to implement the model training method for image recognition as set forth in any one of claims 1 to 5; alternatively, the image recognition method according to any one of claims 6 to 8 is implemented.
10. A computer-readable storage medium, characterized in that the storage medium has stored therein a program which, when executed by a processor, is adapted to carry out the model training method for image recognition according to any one of claims 1 to 5; alternatively, the image recognition method according to any one of claims 6 to 8 is implemented.
CN202310538955.1A 2023-05-12 2023-05-12 Model training method for image recognition, image recognition method device and medium Pending CN116543261A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310538955.1A CN116543261A (en) 2023-05-12 2023-05-12 Model training method for image recognition, image recognition method device and medium

Publications (1)

Publication Number Publication Date
CN116543261A true CN116543261A (en) 2023-08-04

Family

ID=87443257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310538955.1A Pending CN116543261A (en) 2023-05-12 2023-05-12 Model training method for image recognition, image recognition method device and medium

Country Status (1)

Country Link
CN (1) CN116543261A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912633A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Training method and device for target tracking model
CN116912633B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Training method and device for target tracking model
CN117152567A (en) * 2023-10-31 2023-12-01 腾讯科技(深圳)有限公司 Training method, classifying method and device of feature extraction network and electronic equipment
CN117152567B (en) * 2023-10-31 2024-02-23 腾讯科技(深圳)有限公司 Training method, classifying method and device of feature extraction network and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination