CN117935029B - Image processing method, device, equipment and storage medium - Google Patents

Image processing method, device, equipment and storage medium

Info

Publication number
CN117935029B
CN117935029B CN202410326496.5A
Authority
CN
China
Prior art keywords
image
model
initial
student model
loss
Prior art date
Legal status
Active
Application number
CN202410326496.5A
Other languages
Chinese (zh)
Other versions
CN117935029A (en)
Inventor
张钟毓
黄余格
丁守鸿
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410326496.5A
Publication of CN117935029A
Application granted
Publication of CN117935029B


Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the application disclose an image processing method, apparatus, device, and storage medium, applied in artificial intelligence technology. The method includes the following steps: performing object recognition on a sample object image through an initial student model to obtain a first predicted object attribute; performing recognition processing on a first sampled image through an initial teacher model to obtain a second predicted object attribute and a first image feature; performing recognition processing on a second sampled image through the initial student model to obtain a third predicted object attribute and a second image feature; updating the model parameters of the initial student model; updating the model parameters of the initial teacher model according to the model parameters of the updated initial student model over a historical time period; and repeating the above steps until training ends, whereupon the updated initial student model is determined as the target student model. The application can improve the efficiency and accuracy of model training and reduce the resources consumed in the model training process.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
In the field of image recognition, a neural network recognition model can obtain the object attributes of objects in a search image according to the feature similarity between the search image and candidate images in a database. In general, such a model has high recognition accuracy for high-resolution search images, but as the resolution of the search image decreases, recognition accuracy drops significantly. To improve recognition accuracy on low-resolution search images, a model dedicated to high-resolution image recognition needs to be trained in advance as an offline teacher model; during formal training, the output of the offline teacher model for the high-resolution image is used as label information to guide a student model in recognizing the low-resolution image. In practice, this training method requires additional time and resources to train the teacher model in advance, and because it uses the output for a high-resolution image as the label for a low-resolution image, it ignores the information difference between the low-resolution and high-resolution images, so the accuracy of model training is low.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, image processing equipment and a storage medium, which can improve the efficiency and accuracy of model training and reduce the resources consumed in the model training process.
An aspect of an embodiment of the present application provides an image processing method, including:
Acquiring a first sampled image and a second sampled image corresponding to a sample object image, and a labeling object attribute of a sample object in the sample object image; the resolution of the first sampled image is greater than the resolution of the second sampled image;
performing object recognition on the sample object image through an initial student model to obtain a first predicted object attribute, performing recognition processing on the first sampled image through an initial teacher model to obtain a second predicted object attribute and a first image characteristic, and performing recognition processing on the second sampled image through the initial student model to obtain a third predicted object attribute and a second image characteristic;
Updating model parameters of the initial student model according to the first image feature, the second image feature, the first predicted object attribute, the second predicted object attribute, the third predicted object attribute and the labeling object attribute;
And updating the model parameters of the initial teacher model according to the model parameters of the updated initial student model in the historical time period, repeatedly executing the above steps for the updated initial teacher model and the updated initial student model until training ends, and determining the updated initial student model at the end of training as the target student model.
An aspect of an embodiment of the present application provides an image processing apparatus, including:
the acquisition module is used for acquiring a first sampled image and a second sampled image corresponding to the sample object image, and a labeling object attribute reflecting the sample object in the sample object image; the resolution of the first sampled image is greater than the resolution of the second sampled image;
The identification module is used for carrying out object identification on the sample object image through an initial student model to obtain a first predicted object attribute, carrying out identification processing on the first sampling image through an initial teacher model to obtain a second predicted object attribute and a first image characteristic, and carrying out identification processing on the second sampling image through the initial student model to obtain a third predicted object attribute and a second image characteristic;
A first updating module, configured to update model parameters of the initial student model according to the first image feature, the second image feature, the first predicted object attribute, the second predicted object attribute, the third predicted object attribute, and the labeling object attribute;
And the second updating module is used for updating the model parameters of the initial teacher model according to the model parameters of the updated initial student model in the historical time period, repeatedly executing the above steps for the updated initial teacher model and the updated initial student model until training ends, and determining the updated initial student model at the end of training as the target student model.
Optionally, the first updating module is specifically configured to determine a recognition loss of the initial student model according to the first predicted object attribute and the labeling object attribute;
determining a self-distillation loss of the initial student model based on the first image feature, the second image feature, the second predicted object attribute, and the third predicted object attribute;
And updating model parameters of the initial student model according to the identification loss and the self-distillation loss.
Optionally, the first updating module is specifically configured to determine a spatial attention loss of the initial student model according to the first image feature and the second image feature;
determining a channel attention loss of the initial student model based on the first image feature and the second image feature;
Determining an attribute prediction loss of the initial student model according to the second predicted object attribute and the third predicted object attribute;
determining the spatial attention loss, the channel attention loss and the attribute prediction loss as self-distillation losses of the initial student model.
Optionally, the first updating module is specifically configured to perform weighted summation processing on the spatial attention loss, the channel attention loss, and the attribute prediction loss included in the self-distillation loss, so as to obtain a self-distillation total loss of the initial student model;
Summing the self-distillation total loss and the identification loss to obtain the total loss of the initial student model;
and updating model parameters of the initial student model according to the total loss.
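The combination described above — a weighted sum of the three self-distillation terms, then a plain sum with the recognition loss — can be sketched in minimal Python. The weight names and default values are illustrative assumptions, as the text does not fix them:

```python
def total_loss(l_spatial, l_channel, l_attr, l_recog,
               w_spatial=1.0, w_channel=1.0, w_attr=1.0):
    """Weighted sum of the three self-distillation terms, then summed with
    the recognition loss (weights are illustrative, not from the text)."""
    l_self_distill = (w_spatial * l_spatial
                      + w_channel * l_channel
                      + w_attr * l_attr)
    return l_self_distill + l_recog
```

For example, with unit weights, component losses of 1, 2, 3 and a recognition loss of 4 give a total loss of 10.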
Optionally, the first image feature includes feature values corresponding to M pixel points in the first sampled image under C feature channels, and the second image feature includes feature values corresponding to M pixel points in the second sampled image under C feature channels, where M, C are integers greater than 1;
Optionally, the first updating module is specifically configured to perform an averaging process on feature values corresponding to each pixel point in the first sampled image under the C feature channels, so as to obtain an importance degree of the corresponding pixel point in the first sampled image;
Averaging the feature values corresponding to each pixel point in the second sampled image under the C feature channels to obtain the importance degree of the corresponding pixel point in the second sampled image;
And determining the spatial attention loss of the initial student model according to the importance degrees respectively corresponding to the M pixel points in the first sampling image and the importance degrees respectively corresponding to the M pixel points in the second sampling image.
Optionally, the first updating module is specifically configured to perform a difference processing on the importance level of the f-th pixel in the first sampled image and the importance level of the f-th pixel in the second sampled image, so as to obtain an importance level deviation corresponding to the f-th pixel in the second sampled image; f is a positive integer less than or equal to M;
and squaring and summing the importance degree deviations of the M pixel points in the second sampling image to obtain the spatial attention loss of the initial student model.
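The two steps above — per-pixel channel averaging, then squared deviations summed over the M pixels — can be sketched in plain Python, representing each M×C feature map as a nested list (an illustrative simplification of the feature tensors):

```python
def spatial_attention_loss(teacher_feat, student_feat):
    """teacher_feat, student_feat: M pixels x C channels (nested lists).
    Pixel importance = mean over the C channel values; the loss is the
    sum of squared per-pixel importance deviations."""
    imp_teacher = [sum(px) / len(px) for px in teacher_feat]
    imp_student = [sum(px) / len(px) for px in student_feat]
    return sum((t - s) ** 2 for t, s in zip(imp_teacher, imp_student))
```

For example, with M=2 pixels and C=2 channels, teacher features [[1, 1], [2, 2]] and student features [[0, 0], [2, 2]] give importances [1, 2] and [0, 2], hence a loss of 1.0.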
Optionally, the first image feature includes feature values corresponding to M pixel points in the first sampled image under C feature channels, and the second image feature includes feature values corresponding to M pixel points in the second sampled image under C feature channels, where M, C are integers greater than 1;
optionally, the first updating module is specifically configured to perform an averaging process on the feature values of the M pixel points in the first sampled image under each feature channel, so as to obtain importance degrees of corresponding feature channels in the first sampled image;
averaging the characteristic values of the M pixel points in the second sampling image under each characteristic channel to obtain the importance degree of the corresponding characteristic channel in the second sampling image;
And determining the channel attention loss of the initial student model according to the importance degrees respectively corresponding to the C characteristic channels in the first sampling image and the importance degrees respectively corresponding to the C characteristic channels in the second sampling image.
Optionally, the first updating module is specifically configured to perform a difference processing on the importance level of the kth feature channel in the first sampled image and the importance level of the kth feature channel in the second sampled image, so as to obtain an importance level deviation of the kth feature channel of the second sampled image; k is a positive integer less than or equal to C;
and squaring and summing the importance degree deviations corresponding to the C characteristic channels in the second sampling image to obtain the channel attention loss of the initial student model.
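The channel counterpart averages over the M pixels instead of the C channels; a plain-Python sketch under the same nested-list representation as above:

```python
def channel_attention_loss(teacher_feat, student_feat):
    """teacher_feat, student_feat: M pixels x C channels (nested lists).
    Channel importance = mean of that channel's value over the M pixels;
    the loss is the sum of squared per-channel importance deviations."""
    M, C = len(teacher_feat), len(teacher_feat[0])
    imp_teacher = [sum(teacher_feat[m][k] for m in range(M)) / M for k in range(C)]
    imp_student = [sum(student_feat[m][k] for m in range(M)) / M for k in range(C)]
    return sum((t - s) ** 2 for t, s in zip(imp_teacher, imp_student))
```

With the same shapes as the spatial example, teacher features [[1, 3], [1, 3]] and student features [[0, 3], [0, 3]] give channel importances [1, 3] and [0, 3], hence a loss of 1.0.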
Optionally, the second updating module is specifically configured to perform smoothing processing on model parameters of the updated initial student model in a historical time period to obtain processed model parameters;
and updating the model parameters of the initial teacher model according to the processed model parameters.
Optionally, the second updating module is specifically configured to obtain, after performing t steps of iterative updating on the initial student model, the model parameters of the updated initial student model after the t-th iterative update, and the exponential moving average of the updated initial student model after the (t-1)-th iterative update; t is an integer greater than 1;
smoothing the model parameters after the t-th iterative update and the exponential moving average after the (t-1)-th iterative update according to a smoothing factor, to obtain the exponential moving average of the initial student model after the t-th update;
and determining the exponential moving average after the t-th iterative update as the processed model parameters.
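In equation form this smoothing step is ema_t = α·ema_(t-1) + (1-α)·θ_t, where α is the smoothing factor and θ_t the student parameters after the t-th update. A minimal sketch over flat parameter lists (the α default is illustrative, not from the text):

```python
def ema_step(student_params, ema_prev, alpha=0.99):
    """One exponential-moving-average step over flat parameter lists:
    ema_t = alpha * ema_(t-1) + (1 - alpha) * theta_t."""
    return [alpha * e + (1 - alpha) * s
            for e, s in zip(ema_prev, student_params)]
```

The resulting averaged parameters would then be taken as the teacher model's updated parameters.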
Optionally, the acquiring module is specifically configured to perform downsampling processing on the sample object image according to a first downsampling multiple to obtain a first sampled image;
Performing downsampling processing on the sample object image according to a second downsampling multiple to obtain a second sampled image; the first downsampling multiple is smaller than the second downsampling multiple.
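The text does not fix a particular downsampling operation; average pooling by an integer factor is one plausible reading, sketched here on a 2-D grid of pixel values:

```python
def downsample(image, factor):
    """Average-pool a 2-D grid of pixel values by an integer factor
    (one plausible downsampling; the text does not fix the method)."""
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(0, w, factor)]
            for i in range(0, h, factor)]
```

Calling downsample(img, 2) and downsample(img, 4) would then yield the first and second sampled images respectively; since 2 < 4, the first sampled image retains more detail, matching the resolution ordering above.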
Optionally, the acquiring module is specifically configured to acquire an image of a target object to be identified;
And carrying out object recognition on the target object image through a target student model to obtain the object attribute of the object in the target object image.
In one aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
In one aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, the computer program implementing the steps of the method described above when executed by a processor.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.
The application provides at least the following advantages: (1) The model parameters of the initial teacher model are updated from the model parameters of the updated initial student model over a historical time period, so the initial teacher model is more stable during training; noise and fluctuation in the training process are reduced, and the amount of knowledge the student model can learn is increased. (2) Because the update of the initial teacher model takes into account the updated initial student model's parameters over a historical time period, the teacher model can better capture the knowledge the student model has learned during training; this gives the teacher model better generalization ability and, in turn, improves the generalization ability of the updated initial student model, i.e., the training accuracy of the model. (3) The initial teacher model is trained in the course of training the initial student model, so no offline teacher model needs to be trained separately; this saves substantial time and computing resources, improving model training efficiency and reducing the resources consumed in the training process. (4) The initial student model is trained using the first image feature, the second image feature, and related signals; since these features come from sampled images of different resolutions, the information differences between images of different resolutions are taken into account, improving training accuracy. The resulting target student model can recognize images of different resolutions, which improves its applicability.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an image processing system provided by the present application;
FIG. 2 is a schematic diagram of an application scenario of an image processing method provided by the present application;
FIG. 3 is a schematic flow chart of an image processing method provided by the present application;
FIG. 4 is a flow chart of another image processing method provided by the present application;
FIG. 5 is a schematic view of a scenario in which the self-distillation loss of an initial student model is obtained;
FIG. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Embodiments of the present application may relate to artificial intelligence, autonomous driving, intelligent transportation, and similar fields. Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
The application mainly relates to the machine learning branch of artificial intelligence, and discloses a teacher-free self-distillation model training method: a self-distillation scheme guides the initial student model to acquire effective information from a high-resolution image while recognizing medium- and low-resolution images, thereby improving the recognition accuracy of the initial student model. At the same time, no offline teacher model needs to be trained separately, which saves a great deal of time and resources.
Wherein, distillation means that in deep learning, knowledge with better quality is used as supervision information to guide a specific model to obtain better performance; such as training a small model (student model) with the output of a larger model (also called a teacher model) with better performance as additional supervision in an effort to improve the accuracy of the small model.
Wherein, self-distillation means that the teacher model and the student model in distillation have the same model structure, and a larger model is not additionally constructed.
In order to facilitate a clearer understanding of the present application, an image processing system implementing the present application will first be described, and as shown in fig. 1, the image processing system includes a server 10 and a terminal cluster, where the terminal cluster may include one or more terminals, and the number of terminals will not be limited. As shown in fig. 1, taking an example of a terminal cluster including 4 terminals as an illustration, the terminal cluster may specifically include a terminal 1a, a terminal 2a, a terminal 3a, and a terminal 4a; it will be appreciated that terminals 1a, 2a, 3a, 4a may each be in network connection with the server 10, so that each terminal may interact with the server 10 via a network connection.
It can be understood that the server may be an independent physical server, a server cluster or distributed system formed by at least two physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, big data, and artificial intelligence platforms. The terminal may be a vehicle-mounted terminal, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a screen speaker, a smart television, a smart watch, or the like, but is not limited thereto. Terminals and servers may be directly or indirectly connected through wired or wireless communication, and the number of terminals and servers may each be one or at least two; this is not limited here.
One or more target applications are installed on the terminal, where a target application may be any application with image processing capabilities (e.g., capturing images, generating images); target applications may include standalone applications, web applications, applets within a host application, and the like. The server 10 is a device that provides back-end services for the target application in a terminal. In one embodiment, the server 10 may smooth the model parameters of the updated initial student model over a historical time period to obtain processed model parameters, update the model parameters of the initial teacher model according to the processed model parameters, and train the updated initial student model with the updated teacher model to obtain the target student model; the target student model in the server 10 can then be invoked by the target application in the terminal to recognize images.
It should be noted that, in the present application, the initial student model and the initial teacher model refer to the models to be trained; each may be, for example, a deep convolutional neural network model or a recurrent neural network model. The initial student model and the initial teacher model have the same model structure: the number of layers of the initial student model is the same as that of the initial teacher model, and the type of each layer of the initial student model is the same as that of the corresponding layer of the initial teacher model. Taking a recurrent neural network model as an example, the first layer of each model is an input layer, the second layer is a hidden layer, and the third layer is an output layer.
It should be noted that, in the present application, the image used for training the model is referred to as a sample object image, the image to be identified may be referred to as a target object image, and the target object image and the sample object image may each include an object, where the object may refer to a person, an animal, a building, or the like. The object attribute may refer to an attribute for reflecting an object in an image, and when the object in the image is a person, the object attribute may refer to a facial pose, an expression, a positional relationship between organs of the face, a size of the organs, and the like of the person. When the object in the image is an animal, the object attribute may be a category of the animal; when the object in the image is a building, the object attribute may refer to the name, purpose, location, etc. of the building.
The application can be applied to face recognition scenarios, animal recognition scenarios, article recognition scenarios, and the like. Taking the face recognition scenario shown in fig. 2 as an example, it may be a face payment scenario, a face-based gate control scenario, a face-based illegal-user recognition scenario, and so on. As shown in fig. 2, the server 10 may include an initial teacher model 21a and an initial student model 22a. When a target student model for recognizing a human face needs to be trained, the server 10 may acquire a sample object image containing a human face, together with a labeling object attribute corresponding to the sample object image, where the labeling object attribute reflects the facial attributes of the sample object in the sample object image. The labeling object attribute may be obtained by manually labeling the sample object image, and there may be multiple sample object images containing human faces. Taking the sample object image 23a in fig. 2 as an example, its labeling object attributes include: 63mm (interocular distance, the distance between the two eyes), 65mm (the distance between the two pupils), and so on; that is, the probability that the sample object's interocular distance is 63mm and that its interpupillary distance is 65mm is labeled as 1.
As shown in fig. 2, the server 10 may input the sample object image 23a into the initial student model and perform object recognition on it through the initial student model, obtaining a first predicted object attribute 26a of the sample object in the sample object image 23a. The first predicted object attribute 26a reflects the facial attributes of the sample object in the sample object image 23a and may be the output of the last layer of the initial student model, i.e., normalized probabilities. For example, the first predicted object attribute 26a may include (0.8, 0.1, 0.1) and (0.5, 0.2, 0.3): 0.8, 0.1, and 0.1 are the probabilities that the sample object's interocular distance is 63mm, 64mm, and 65mm, respectively, and 0.5, 0.2, and 0.3 are the probabilities that its interpupillary distance is 65mm, 66mm, and 67mm, respectively. The sum of 0.8, 0.1, and 0.1 is 1, i.e., these are normalized probabilities; likewise the sum of 0.5, 0.2, and 0.3 is 1.
When the difference between the first predicted object attribute 26a and the labeling object attribute is small, the object recognition accuracy of the initial student model is high; conversely, when the difference is large, the accuracy is low. Accordingly, the server 10 may determine a recognition loss 27a of the initial student model based on the first predicted object attribute 26a and the labeling object attribute; that is, the recognition loss 27a reflects the object recognition accuracy of the initial student model on the sample object image. The sample object image is typically a high-resolution image, so the recognition loss 27a can be said to reflect the object recognition accuracy of the initial student model on high-resolution images.
Further, the server 10 may downsample the sample object image 23a by a first downsampling multiple to obtain a first sampled image 24a, and downsample it by a second downsampling multiple to obtain a second sampled image 25a, where the second downsampling multiple is greater than the first. Comparing the first sampled image 24a with the second sampled image 25a in fig. 2, the resolution of the first sampled image 24a is greater than that of the second sampled image 25a: the facial detail features of the sample object in the first sampled image 24a are clearer, while those in the second sampled image 25a are more blurred. The server 10 may input the first sampled image 24a into the initial teacher model 21a, which performs recognition processing on it to obtain the first image feature and the second predicted object attribute. The first image feature reflects the color features, texture features, and so on of each pixel in the first sampled image 24a, and the second predicted object attribute may refer to the logits output by the initial teacher model 21a for the first sampled image 24a, reflecting the facial attributes of the sample object. The logits are non-normalized probabilities, typically the output of the penultimate layer of the initial teacher model 21a; the last layer of the initial teacher model 21a is typically a normalization layer that normalizes the output of the penultimate layer.
For example, the second predicted object attribute may include (0.9, 0.5, 0.6) and (0.5, 0.7, 0.8): 0.9, 0.5 and 0.6 are the probabilities that the interocular distance of the sample object is 63 mm, 64 mm and 65 mm, respectively, and 0.5, 0.7 and 0.8 are the probabilities that the interpupillary distance of the sample object is 65 mm, 66 mm and 67 mm, respectively. The sum of 0.5, 0.7 and 0.8 is not equal to 1, and the sum of 0.9, 0.5 and 0.6 is not equal to 1; that is, these probabilities are not normalized.
Similarly, the server 10 may input the second sampled image 25a into the initial student model 22a, which performs recognition processing on it to obtain the second image feature and the third predicted object attribute. The second image feature reflects the color feature, texture feature, and the like of each pixel in the second sampled image 25a. The third predicted object attribute may refer to the logits output by the initial student model 22a for the second sampled image 25a; the logits reflect the facial attributes of the sample object and are un-normalized probabilities, typically the output of the penultimate layer of the initial student model 22a, whose last layer is typically a normalization layer that normalizes the penultimate layer's output. For example, the third predicted object attribute may include (0.8, 0.3) and (0.6, 0.7, 0.2): 0.8 and 0.3 are probabilities for the interocular distance of the sample object (63 mm, 64 mm, 65 mm), and 0.6, 0.7 and 0.2 are the probabilities that the interpupillary distance of the sample object is 65 mm, 66 mm and 67 mm, respectively. The sum of 0.8 and 0.3 is not equal to 1, and the sum of 0.6, 0.7 and 0.2 is not equal to 1; that is, these probabilities are not normalized.
The server 10 may then determine a self-distillation loss 28a of the initial student model 22a based on the first image feature, the second predicted object attribute, and the third predicted object attribute, and update the model parameters of the initial student model 22a based on the self-distillation loss 28a and the recognition loss to obtain an updated initial student model 22a. After the server 10 has iteratively updated the initial student model 22a for t steps based on sample object images, it may smooth the model parameters updated at the t-th step together with the model parameters updated before the t-th step to obtain processed model parameters, update the model parameters of the initial teacher model 21a according to the processed model parameters, and then continue training the updated initial student model 22a with the updated initial teacher model 21a to obtain the target student model.
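The smoothing of the student's updated parameters into the teacher described above is commonly realized as an exponential moving average; the sketch below is a hypothetical minimal version (the decay value and plain-list parameters are illustrative assumptions, not from the source):

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.9):
    # Smooth the teacher toward the student's latest parameters:
    # theta_teacher <- decay * theta_teacher + (1 - decay) * theta_student
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [np.zeros(3)]          # initial teacher parameters
student = [np.ones(3)]           # student parameters after its updates
for _ in range(10):              # smoothing applied over 10 iteration steps
    teacher = ema_update(teacher, student)

# After 10 steps the teacher has moved 1 - 0.9**10 of the way to the student.
print(teacher[0][0])
```

Because the smoothing mixes in parameters from the whole historical period, the teacher changes more slowly than the student, which is what makes it a stable guide during training.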
Since the recognition loss is obtained from the high-resolution sample object image 23a, updating the model parameters of the initial student model 22a with the recognition loss helps the initial student model 22a acquire recognition capability for high-resolution images. Meanwhile, the self-distillation loss is obtained from the first sampled image 24a and the second sampled image 25a, and the resolution of the second sampled image 25a is smaller than that of the first sampled image 24a; that is, the first sampled image 24a may be regarded as a high-resolution image, and the second sampled image 25a as a low- or medium-resolution image. In other words, updating the model parameters of the initial student model 22a with the self-distillation loss helps the initial student model 22a acquire recognition capability for low- and medium-resolution images. In short, the training yields a multi-scale target student model, where multi-scale refers to multiple resolutions: the target student model has face recognition capability for images of multiple resolutions.
In addition, because the first image feature and the second predicted object attribute are produced by the initial teacher model, they guide the initial student model to acquire effective information from the high-resolution image when recognizing medium- and low-resolution images, improving recognition accuracy. Specifically, the model parameters of the initial teacher model are updated from the updated model parameters of the initial student model, realizing self-distillation of the initial student model. This fully exploits the fact that the accuracy of the updated initial student model improves quickly in the early stage of training, so that the initial teacher model can obtain information such as the middle-layer features (i.e., the first image feature) and logits of the high-quality first sampled image (i.e., the high-resolution image) early in training. The middle-layer features and logits of the high-resolution image serve as additional labels that guide the initial student model to attend to the more important spatial regions of the medium- and low-resolution images and to extract as much effective information as possible from the relatively blurred image (i.e., the second sampled image). Meanwhile, by using different resolution interval matching strategies, the problem that an excessive resolution difference blurs away a large amount of effective information is avoided, improving the recognition accuracy of the target student model for images of multiple resolutions.
The above-mentioned different resolution interval matching strategies instruct that the sample object image 23a be downsampled by a first downsampling multiple to obtain the first sampled image 24a and by a second downsampling multiple to obtain the second sampled image 25a, where the difference between the first downsampling multiple and the second downsampling multiple is smaller than a difference threshold, so that the resolution difference between the first sampled image 24a and the second sampled image 25a is not excessive.
After training yields the target student model, the server 10 may store it locally; when any terminal (e.g., terminal 1a) sends a request to recognize a target object image, the server 10 may recognize the target object image through the target student model to obtain the object attribute of the object in the target object image and return the recognized object attribute to terminal 1a. Alternatively, the server 10 may send the target student model to any terminal (e.g., terminal 1a), and terminal 1a may perform object recognition on the target object image through the target student model to obtain the object attribute of the object in the target object image. Alternatively, the server 10 may deploy the target student model on a cloud server; any terminal (e.g., terminal 1a) may send the target object image to be recognized to the cloud server, which performs object recognition on it through the target student model to obtain the object attribute of the object in the target object image and returns the recognized object attribute to terminal 1a.
Further, please refer to fig. 3, which is a flowchart illustrating an image processing method according to an embodiment of the present application. As shown in fig. 3, the method may be performed by any terminal in the terminal cluster in fig. 1, may be performed by a server in fig. 1, or may be performed cooperatively by a terminal and a server in the terminal cluster in fig. 1, and the apparatus for performing the image processing method in the present application may be collectively referred to as a computer apparatus. Wherein, the method can comprise the following steps:
S101, acquiring a first sampled image and a second sampled image corresponding to a sample object image, and a labeling object attribute for reflecting a sample object in the sample object image; the resolution of the first sampled image is greater than the resolution of the second sampled image.
In the present application, the computer device may obtain the sample object image by downloading it from a network or from local storage, and sample the sample object image to obtain the first sampled image and the second sampled image; the sampling may be an upsampling or a downsampling process. The sample object image is then labeled manually to obtain the labeling object attribute of the sample object image.
For example, in a face recognition scenario, the sample object image may refer to an image including a face, and the annotation object attribute may include at least one of a pose, an expression, a positional relationship between organs of the face, a size of the organs, and the like of the face of the sample object in the sample object image. In an animal identification scenario, the sample object image may refer to an image that includes an animal, and the annotation object attribute may refer to a category of the animal in the sample object image. In the building identification scene, the sample object image may refer to an image including a building, and the labeling object attribute may refer to at least one of an address, a name, and the like of the building in the sample object image.
It should be noted that the difference between the resolution of the first sampled image and the resolution of the second sampled image is smaller than a difference threshold. This avoids the problem that an excessive resolution difference between the two images would blur away a large amount of the effective information in the low-resolution image (i.e., the second sampled image), and helps improve the recognition accuracy of the initial student model for images of various resolutions.
In one embodiment, the acquiring the first sampled image and the second sampled image corresponding to the sample object image includes: performing downsampling processing on the sample object image according to the first downsampling multiple to obtain a first sampled image; performing downsampling processing on the sample object image according to a second downsampling multiple to obtain a second sampled image; the first downsampling multiple is smaller than the second downsampling multiple.
Specifically, in order to obtain images with different resolutions, the computer device may acquire a first downsampling multiple and a second downsampling multiple, and the computer device may perform downsampling processing on the sample object image according to the first downsampling multiple to obtain a first sampled image, and perform downsampling processing on the sample object image according to the second downsampling multiple to obtain a second sampled image; the method is beneficial to training the initial student model by adopting images with different resolutions.
The first downsampling multiple is smaller than the second downsampling multiple, and the difference between the first downsampling multiple and the second downsampling multiple is smaller than a difference threshold, so that the problem that effective information in a second sampled image is blurred in a large amount due to the fact that the difference between the first downsampling multiple and the second downsampling multiple is overlarge can be avoided.
For example, the downsampling multiple may be 2, 4, 8, or 10, with a larger value meaning a higher downsampling multiple; i.e., the first downsampling multiple and the second downsampling multiple may each be 2, 4, 8, or 10. For instance, for a sample object image of resolution H x W, a first downsampling multiple of 2 gives a first sampled image of resolution (H/2) x (W/2), and a second downsampling multiple of 4 gives a second sampled image of resolution (H/4) x (W/4).
It should be noted that the first downsampling multiple and the second downsampling multiple may be fixed values set in advance. For example, the first downsampling multiple may indicate that the resolution of the sample object image is reduced by 1/2, i.e., the resolution of the first sampled image is 1/2 of that of the sample object image; the second downsampling multiple may indicate that the resolution is reduced by 3/4 or 7/8, i.e., the resolution of the second sampled image is 1/4 or 1/8 of that of the sample object image. In practical applications, the degree of information loss after downsampling differs across sample object images. For example, if downsampling a particular sample object image by 1/2 already loses a large amount of information, the result will lack sufficient information when used as the first sampled image to guide the other sampled image. Thus, the computer device may adaptively generate the first and second downsampling multiples according to the amount of information in the sample object image: the greater the amount of information in the sample object image, the larger the generated first and second downsampling multiples can be; the smaller the amount of information, the smaller the generated multiples.
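The two fixed downsampling multiples above can be sketched concretely; the 2x/4x factors and average pooling below are illustrative assumptions, not the patent's prescribed method:

```python
import numpy as np

def downsample(image, factor):
    # Reduce an H x W image by an integer downsampling multiple via average pooling.
    h, w = image.shape
    image = image[:h - h % factor, :w - w % factor]  # crop to a multiple of factor
    return image.reshape(h // factor, factor,
                         w // factor, factor).mean(axis=(1, 3))

sample_image = np.arange(64, dtype=float).reshape(8, 8)  # stand-in sample object image
first_sampled = downsample(sample_image, 2)    # first downsampling multiple: 2
second_sampled = downsample(sample_image, 4)   # second downsampling multiple: 4

# The first sampled image keeps a higher resolution than the second.
print(first_sampled.shape, second_sampled.shape)  # (4, 4) (2, 2)
```

Note that the difference between the two factors here (4 - 2 = 2) stays small, in the spirit of the difference-threshold constraint described above.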
In one embodiment, the acquiring the first sampled image and the second sampled image corresponding to the sample object image includes: in order to obtain images with different resolutions, the computer device may perform upsampling processing on the sample object image according to a first upsampling multiple to obtain a first sampled image; performing up-sampling processing on the sample object image according to a second up-sampling multiple to obtain a second sampling image; the first upsampling multiple is greater than the second upsampling multiple.
S102, performing object recognition on the sample object image through an initial student model to obtain a first predicted object attribute, performing recognition processing on the first sampled image through an initial teacher model to obtain a second predicted object attribute and a first image feature, and performing recognition processing on the second sampled image through the initial student model to obtain a third predicted object attribute and a second image feature.
In the application, the computer equipment can input the sample object image into an initial student model, and the initial student model is used for carrying out object recognition on the sample object image to obtain the object attribute of the sample object in the sample object image, and the object attribute is recorded as a first predicted object attribute. Further, the first sampling image is input into an initial teacher model, and the first sampling image is identified through the initial teacher model, so that the first image feature and the second predicted object attribute are obtained. And inputting the second sampling image into an initial student model, and identifying the second sampling image through the initial student model to obtain a third predicted object attribute and a second image characteristic.
The first image features are used for reflecting at least one of color features, texture features and the like of pixel points in the first sampling image; the second image feature is used to reflect at least one of a color feature, a texture feature, and the like of the pixel point in the second sample image.
For example, in a face recognition scenario, the first predicted object attribute may reflect at least one of the pose, expression, positional relationship between facial organs, organ size, and the like of the face of the sample object in the sample object image; the second predicted object attribute may reflect the same kinds of attributes for the face of the sample object in the first sampled image; and the third predicted object attribute may reflect them for the face of the sample object in the second sampled image. In an animal recognition scenario, the first predicted object attribute may reflect the category of the animal in the sample object image, the second predicted object attribute the category of the animal in the first sampled image, and the third predicted object attribute the category of the animal in the second sampled image. In a building recognition scenario, the first predicted object attribute may reflect at least one of the address, name, and the like of the building in the sample object image, the second predicted object attribute the same for the building in the first sampled image, and the third predicted object attribute the same for the building in the second sampled image.
It should be noted that the first predicted object attribute may refer to a normalized probability output by the initial student model for the sample object image; this probability reflects the likelihood that the object attribute of the sample object in the sample object image is each candidate object attribute. The second predicted object attribute may refer to an un-normalized probability output by the initial teacher model for the first sampled image, reflecting the likelihood that the object attribute of the sample object in the first sampled image is each candidate object attribute. The third predicted object attribute may refer to an un-normalized probability output by the initial student model for the second sampled image, reflecting the likelihood that the object attribute of the sample object in the second sampled image is each candidate object attribute.
For example, taking head pose recognition in a face recognition scenario, the candidate object attributes may be candidate poses such as head up, head down, head shaking, and head turning. The first predicted object attribute may include the probabilities that the head pose of the sample object in the sample object image is head up, head down, head shaking, and head turning, respectively, and the sum of all probabilities within the first predicted object attribute is 1. The second predicted object attribute may include the corresponding probabilities for the first sampled image, and the sum of all probabilities within the second predicted object attribute is not 1. The third predicted object attribute may include the corresponding probabilities for the second sampled image, and the sum of all probabilities within the third predicted object attribute is not 1.
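The distinction between un-normalized logits and a normalized predicted object attribute can be illustrated with a softmax over made-up head-pose scores (the values are illustrative, not from the source):

```python
import numpy as np

def softmax(logits):
    # Normalize logits so the resulting probabilities sum to 1.
    shifted = np.exp(logits - logits.max())  # subtract max for numerical stability
    return shifted / shifted.sum()

# Un-normalized scores for head up, head down, head shaking, head turning.
logits = np.array([0.9, 0.5, 0.6, 0.3])
probs = softmax(logits)

print(round(float(logits.sum()), 2))  # 2.3 -> not a probability distribution
print(round(float(probs.sum()), 2))   # 1.0 -> normalized, like the first predicted attribute
```

This matches the description above: the penultimate layer produces the logits, and a final normalization layer (here the softmax) turns them into a distribution that sums to 1.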
S103, updating model parameters of the initial student model according to the first image feature, the second image feature, the first predicted object attribute, the second predicted object attribute, the third predicted object attribute, and the labeling object attribute.
In the present application, the computer device may update the model parameters of the initial student model according to the first image feature, the second image feature, the first predicted object attribute, the second predicted object attribute, the third predicted object attribute, and the labeling object attribute; therefore, the method is beneficial to guiding the initial student model to pay attention to more important space areas on the middle-resolution image and the low-resolution image, extracting effective information from the relatively blurred image (namely the second sampling image) as much as possible, and improving the recognition accuracy of the initial student model for various resolution images.
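A minimal sketch of how the two losses described in this flow might drive one update of the student's parameters (the equal weighting, loss values, and gradient values are assumptions for illustration only):

```python
import numpy as np

def student_update(params, grads, lr=0.01):
    # One gradient-descent step on the initial student model's parameters.
    return [p - lr * g for p, g in zip(params, grads)]

recognition_loss = 0.7        # from first predicted attribute vs. labeling attribute
self_distillation_loss = 0.3  # from the image features and teacher/student logits
total_loss = recognition_loss + self_distillation_loss  # assumed equal weighting

params = [np.ones(2)]
grads = [np.full(2, 0.5)]     # gradient of total_loss w.r.t. params (illustrative)
params = student_update(params, grads)
print(total_loss, params[0])
```

In a real training loop the gradients would come from backpropagating the combined loss, but the update rule itself has exactly this shape.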
S104, updating the model parameters of the initial teacher model according to the updated model parameters of the initial student model in the historical time period.
S105, repeating the steps until training is finished for the updated initial teacher model and the updated initial student model, and determining the updated initial student model after training is finished as a target student model.
In steps S104 to S105, in an example, the number of sample object images may be M, and the computer device may update the model parameters of the initial teacher model according to the updated model parameters of the initial student model after the current round of updating and the model parameters before the current round of updating after each round of updating of the model parameters of the initial student model based on the M sample object images is completed.
The number of sample object images may be M, the M sample object images may be divided into m batches, and the number of sample object images in each batch may be M/m. Updating the model parameters of the initial student model based on one batch of sample object images may be referred to as a one-step iteration update. When the model parameters of the initial student model have been updated based on all m batches of sample object images, this may be referred to as one complete round (epoch) of updating.
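The relationship between batches, iteration steps, and epochs above reduces to simple arithmetic (M, m, and t below are assumed example values, not from the source):

```python
M = 10000   # total number of sample object images (assumed)
m = 100     # number of batches (assumed)
t = 20      # iteration steps between teacher updates (assumed)

batch_size = M // m            # sample object images per batch
iterations_per_epoch = m       # one iteration step per batch
teacher_updates_per_epoch = iterations_per_epoch // t

print(batch_size, iterations_per_epoch, teacher_updates_per_epoch)  # 100 100 5
```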
The computer device may cyclically perform the above steps S101 to S103 on the m batches of sample object images, performing multi-step iterative updates on the initial student model. After performing t iteration steps, the computer device may acquire the model parameters of the initial student model over a historical time period, where the historical time period may refer to the period during which the t iteration steps were performed on the initial student model, update the model parameters of the initial teacher model according to the updated model parameters of the initial student model over that period, and repeatedly perform the above steps S101 to S104 on the updated initial teacher model and the updated initial student model until training ends, determining the updated initial student model at the end of training as the target student model.
It should be noted that training may end when the number of update rounds of the initial student model reaches a round-number threshold, or when the total loss of the updated initial student model is smaller than a preset loss value; the preset loss value may be a minimum value or may be set manually.
The present application provides at least the following advantages. (1) The model parameters of the initial teacher model are updated from the updated model parameters of the initial student model over a historical time period, which makes the initial teacher model more stable during training, reduces noise and fluctuation in the training process, and increases the amount of knowledge the student model learns. (2) Because the update of the initial teacher model takes into account the model parameters of the updated initial student model over the historical time period, the initial teacher model can better capture the knowledge the updated initial student model has learned during training, giving it better generalization capability and, in turn, improving the generalization capability of the updated initial student model, i.e., the training accuracy of the model. (3) The initial teacher model is trained during the training of the initial student model, so no separate offline teacher model needs to be trained, saving a great amount of time and computational resources; that is, model training efficiency is improved and the resources consumed in model training are reduced. (4) The initial student model is trained according to the first image feature, the second image feature, and so on, which are obtained from sampled images of different resolutions; that is, the information differences between images of different resolutions are taken into account, improving training accuracy, so that the trained target student model can recognize images of multiple resolutions, improving its applicability.
Further, please refer to fig. 4, which is a flowchart illustrating an image processing method according to an embodiment of the present application. As shown in fig. 4, the method may be performed by any terminal in the terminal cluster in fig. 1, may be performed by a server in fig. 1, or may be performed cooperatively by a terminal and a server in the terminal cluster in fig. 1, and the apparatus for performing the image processing method in the present application may be collectively referred to as a computer apparatus. Wherein, the method can comprise the following steps:
S201, acquiring a first sampled image and a second sampled image corresponding to a sample object image, and a labeling object attribute for reflecting a sample object in the sample object image; the resolution of the first sampled image is greater than the resolution of the second sampled image.
S202, performing object recognition on the sample object image through an initial student model to obtain a first predicted object attribute, performing recognition processing on the first sample image through an initial teacher model to obtain a second predicted object attribute and a first image feature, and performing recognition processing on the second sample image through the initial student model to obtain a third predicted object attribute and a second image feature.
S203, determining the recognition loss of the initial student model according to the first predicted object attribute and the labeling object attribute.
In the present application, the computer device may determine the recognition loss of the initial student model according to the difference between the first predicted object attribute and the labeling object attribute. Since the first predicted object attribute is derived from the sample object image, which is typically a high-resolution image, the recognition loss can be said to train the initial student model to have recognition capability for high-resolution images. The recognition loss, which may also be called the basic loss of the initial student model, also guides the object recognition result of the initial student model for the sample object image to be as close as possible to the labeling object attribute, while enlarging the difference between the object recognition results of different sample object images.
For example, the computer device may measure the recognition loss of the initial student model using a normalized exponential loss (Softmax loss), an additive cosine margin loss (CosFace), an additive angular margin loss (ArcFace), a multiplicative angular margin loss (SphereFace), and the like. Taking ArcFace as an example, the computer device may calculate the recognition loss of the initial student model using equation (1) below:
\(L_{arc} = -\log \frac{e^{s\cos(\theta_{y}+m)}}{e^{s\cos(\theta_{y}+m)} + \sum_{i=1, i \neq y}^{n} e^{s\cos\theta_{i}}}\)    (1)

Wherein, in equation (1), \(L_{arc}\) is the recognition loss of the initial student model; m is a hyper-parameter set to further narrow the differences between the first predicted object attributes of the same sample object, and s is a hyper-parameter set to further enlarge the differences between the first predicted object attributes of different sample objects. \(\cos\theta_{i}\) reflects the cosine similarity corresponding to the probability of a candidate object attribute other than the labeling object attribute, and \(\cos(\theta_{y}+m)\) reflects the cosine similarity corresponding to the probability of the labeling object attribute, with the margin m added. n is the number of categories of candidate object attributes of the sample object, i indexes the i-th candidate object attribute, and y is the index of the candidate object attribute to which the labeling object attribute belongs. The arrangement order of the candidate object attributes is preset, and the first predicted object attribute output by the initial student model follows this order; the label probability is 1 for the candidate object attribute to which the labeling object attribute belongs and 0 otherwise.
For example, for head pose recognition in a face recognition scenario, the candidate object attributes may include head up, head down, head shaking, and head turning, so n in equation (1) is 4. The first predicted object attribute may include the probabilities that the head pose of the sample object in the sample object image is head up, head down, head shaking, and head turning, respectively; the arrangement order of the head poses (i.e., candidate object attributes) is: head up, head down, head shaking, head turning. The labeling object attribute of the sample image may indicate that the head pose of the sample object in the sample object image is head up, i.e., the label probability is 1 when i = 1 and 0 when i ≠ 1.
S204, determining self-distillation loss of the initial student model according to the first image feature, the second predicted object attribute and the third predicted object attribute.
In the present application, the computer device may determine the self-distillation loss of the initial student model based on the first image feature, the second predicted object attribute, and the third predicted object attribute. Since the first image feature and the second predicted object attribute are derived from the first sampled image, and the second image feature and the third predicted object attribute are derived from the second sampled image, whose resolution is smaller, the first sampled image may be regarded as a high-resolution image and the second sampled image as a medium- or low-resolution image. In other words, the self-distillation loss trains the initial student model to recognize low- and medium-resolution images; at the same time, information such as the middle-layer features (the first image feature) and logits (the second predicted object attribute) of the high-resolution image serves as additional labels that guide the initial student model to attend to the more important spatial regions of medium- and low-resolution images and to extract as much effective information as possible from the relatively blurred image (i.e., the second sampled image), improving the accuracy of model training.
In one embodiment, the step S204 may include the following steps S11 to S14:
s11, determining the spatial attention loss of the initial student model according to the first image feature and the second image feature.
Specifically, the computer device may determine, according to the first image feature and the second image feature, a spatial attention loss of the initial student model, where the spatial attention loss is used to train the initial student model to pay attention to a more important spatial region in the second sampled image, where the more important spatial region may be a region where effective information in the second sampled image is located, and where the effective information is located may be a region where a sample object is located.
Optionally, the first image feature includes feature values corresponding to M pixel points in the first sampled image under C feature channels, and the second image feature includes feature values corresponding to M pixel points in the second sampled image under C feature channels, where M, C are integers greater than 1.
The step S11 may include the following steps S111 to S113:
And S111, carrying out averaging processing on the characteristic values corresponding to each pixel point in the first sampling image under the C characteristic channels, and obtaining the importance degree of the corresponding pixel point in the first sampling image.
Specifically, according to an average algorithm, the computer device may perform an averaging process on the feature values corresponding to each pixel point in the first sampled image under the C feature channels, to obtain an average feature value of each pixel point in the first sampled image, and determine the average feature value of each pixel point in the first sampled image as the importance degree of the corresponding pixel point in the first sampled image. The average algorithm herein may refer to a statistical average algorithm, an arithmetic average algorithm, a weighted average algorithm, etc.; the importance degree of the pixel point reflects the importance of the recognition result of the pixel point for the object attribute of the sample object, namely, the higher the importance degree of the pixel point is, the higher the importance of the recognition result of the pixel point for the object attribute of the sample object is; the lower the importance of a pixel point, the lower the importance of the recognition result reflecting the object attribute of the pixel point for the sample object.
For example, the first sampled image includes a pixel P1, C may be 32, and the computer device may perform an averaging process on the feature values of the pixel P1 under the 32 feature channels to obtain an average feature value of the pixel P1, and determine the average feature value of the pixel P1 as the importance degree of the pixel P1. And repeating the steps until the importance degree of all the pixel points in the first sampling image is obtained.
And S112, carrying out averaging processing on the characteristic values corresponding to each pixel point in the second sampling image under the C characteristic channels to obtain the importance degree of the corresponding pixel point in the second sampling image.
Specifically, the computer device may perform an averaging process on the feature values corresponding to each pixel point in the second sampled image under the C feature channels according to an averaging algorithm, to obtain an average feature value of each pixel point in the second sampled image, and determine the average feature value of each pixel point in the second sampled image as the importance degree of the corresponding pixel point in the second sampled image.
For example, the second sampled image includes a pixel P2, C may be 32, and the computer device may perform an averaging process on the feature values of the pixel P2 under the 32 feature channels to obtain an average feature value of the pixel P2, and determine the average feature value of the pixel P2 as the importance degree of the pixel P2. And repeating the steps until the importance degree of all the pixel points in the second sampling image is obtained.
S113, determining the spatial attention loss of the initial student model according to the importance degrees corresponding to the M pixel points in the first sampling image and the importance degrees corresponding to the M pixel points in the second sampling image.
The computer device may determine the spatial attention loss of the initial student model according to the importance degrees respectively corresponding to the M pixel points in the first sampled image and the importance degrees respectively corresponding to the M pixel points in the second sampled image. That is, the principle of the spatial attention loss is to use the importance degree of each spatial pixel point in the first sampled image output by the initial teacher model to guide the initial student model to strengthen its learning of important areas, i.e. the areas where the pixel points with importance degrees greater than a degree threshold are located. Specifically, for the same original image (i.e. the sample object image), the initial teacher model gives a first importance degree matrix of each spatial pixel point for the weakly downsampled image (i.e. the first sampled image), and the first importance degree matrix guides the initial student model to obtain, for the strongly downsampled image (i.e. the second sampled image), a second importance degree matrix of each pixel point that is as consistent with it as possible. The first importance degree matrix includes the importance degrees respectively corresponding to the M pixel points in the first sampled image, and the second importance degree matrix includes the importance degrees respectively corresponding to the M pixel points in the second sampled image.
Optionally, the step S113 may include the following steps S1131 to S1132:
s1131, performing a difference processing on the importance degree of the f-th pixel point in the first sampling image and the importance degree of the f-th pixel point in the second sampling image to obtain an importance degree deviation corresponding to the f-th pixel point in the second sampling image; f is a positive integer less than or equal to M.
Specifically, the computer device may perform a difference processing on the importance degrees corresponding to the pixel points at the same position in the first sampled image and the second sampled image, to obtain the importance degree deviation corresponding to each pixel point in the second sampled image. Specifically, the importance degree of the f-th pixel point in the first sampled image and the importance degree of the f-th pixel point in the second sampled image are subjected to difference processing, so as to obtain the importance degree deviation corresponding to the f-th pixel point in the second sampled image. The importance degree deviation corresponding to the f-th pixel point reflects the difference between the importance degree of the f-th pixel point in the second sampled image and the importance degree of the f-th pixel point in the first sampled image; the f-th pixel point occupies the same position in the first sampled image as in the second sampled image.
S1132, squaring and summing the importance degree deviations corresponding to the M pixel points in the second sampling image to obtain the spatial attention loss of the initial student model.
Specifically, the computer device may repeat the step S1131 until the importance degree deviations corresponding to the M pixel points in the second sampled image are obtained, and the computer device may then square and sum the importance degree deviations corresponding to the M pixel points in the second sampled image, so as to obtain the spatial attention loss of the initial student model.
For example, the computer device may calculate the spatial attention loss of the initial student model using equation (2) as follows:

\[\mathcal{L}_{SA}=\sum_{i=1}^{H}\sum_{j=1}^{W}\left\|\frac{1}{C}\sum_{c=1}^{C}F^{T}_{(i,j,c)}-\frac{1}{C}\sum_{c=1}^{C}F^{S}_{(i,j,c)}\right\|_{2}^{2} \tag{2}\]

Wherein in formula (2), \(\mathcal{L}_{SA}\) represents the spatial attention loss of the initial student model, and \(\|\cdot\|_{2}^{2}\) is the squared error loss (L2 loss); \(F^{T}_{(i,j,c)}\) is the feature value on the c-th feature channel of the pixel point at position (i, j) in the first sampled image output by the initial teacher model, and \(F^{S}_{(i,j,c)}\) is the feature value on the c-th feature channel of the pixel point at position (i, j) in the second sampled image output by the initial student model. C is the number of feature channels, and the mean over the C channels gives the importance degree of each pixel point.
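The per-pixel importance computation (steps S111 to S113) and the spatial attention loss of equation (2) can be sketched in pure Python (a minimal illustration in which nested lists of shape [H][W][C] stand in for the feature tensors; the function names are illustrative, not from the patent):

```python
def channel_mean(feat, i, j):
    """Importance degree of pixel (i, j): average of its C channel values."""
    c = len(feat[i][j])
    return sum(feat[i][j]) / c

def spatial_attention_loss(teacher_feat, student_feat):
    """Equation (2): sum over all pixel positions of the squared deviation
    between the teacher's (first sampled image) and the student's
    (second sampled image) per-pixel importance degrees."""
    h, w = len(teacher_feat), len(teacher_feat[0])
    loss = 0.0
    for i in range(h):
        for j in range(w):
            d = channel_mean(teacher_feat, i, j) - channel_mean(student_feat, i, j)
            loss += d * d  # squared importance-degree deviation, summed
    return loss
```

When the two importance maps agree everywhere, the loss is zero, which is what drives the student toward the teacher's spatial attention.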
S12, determining the channel attention loss of the initial student model according to the first image feature and the second image feature.
Specifically, the computer device may determine, according to the first image feature and the second image feature, a channel attention loss of the initial student model, where the channel attention loss is used to train the initial student model to pay attention to the more important feature channels in the second sampled image, and a more important feature channel may be a feature channel that has a greater influence on the prediction result of the object attribute. Such a feature channel may be a color feature channel, a texture feature channel, or the like.
In an embodiment, the first image feature includes feature values corresponding to M pixel points in the first sampled image under C feature channels, and the second image feature includes feature values corresponding to M pixel points in the second sampled image under C feature channels, where M, C are integers greater than 1.
The step S12 may include the following steps S121 to S123:
S121, carrying out averaging processing on the characteristic values of the M pixel points in the first sampling image under each characteristic channel to obtain the importance degree of the corresponding characteristic channel in the first sampling image.
Specifically, the computer device may perform an averaging process on the feature values of the M pixel points in the first sampled image under the kth feature channel according to an averaging algorithm, to obtain an average feature value of the kth feature channel in the first sampled image, and use the average feature value of the kth feature channel in the first sampled image as an importance degree of the kth feature channel in the first sampled image, that is, an importance degree of the corresponding kth feature channel in the first sampled image. Repeating the steps until the importance degrees of the C characteristic channels in the first sampling image are obtained.
S122, carrying out averaging processing on the characteristic values of the M pixel points in the second sampling image under each characteristic channel to obtain the importance degree of the corresponding characteristic channel in the second sampling image.
Specifically, the computer device may perform an averaging process on the feature values of the M pixel points in the second sampled image under the kth feature channel according to an averaging algorithm, to obtain an average feature value of the kth feature channel in the second sampled image, and use the average feature value of the kth feature channel in the second sampled image as an importance degree of the kth feature channel in the second sampled image, that is, an importance degree of the corresponding kth feature channel in the second sampled image. Repeating the steps until the importance degrees of the C characteristic channels in the second sampling image are obtained.
S123, determining the channel attention loss of the initial student model according to the importance degrees corresponding to the C characteristic channels in the first sampling image and the importance degrees corresponding to the C characteristic channels in the second sampling image.
Specifically, when the importance degrees corresponding to the C feature channels in the first sampled image and the importance degrees corresponding to the C feature channels in the second sampled image are obtained, the computer device may determine the channel attention loss of the initial student model according to the importance degrees corresponding to the C feature channels in the first sampled image and the importance degrees corresponding to the C feature channels in the second sampled image.
Among other things, the purpose of the channel attention loss is to use the importance degrees of the feature channels in the first sampled image output by the initial teacher model to guide the initial student model to pay more attention to the feature channels with high importance degrees. Specifically, for the same original image (i.e. the sample object image), the initial teacher model outputs a third importance degree matrix of each feature channel for the weakly downsampled image (the first sampled image), and the initial student model is guided to output, for the strongly downsampled image (i.e. the second sampled image), a fourth importance degree matrix of each feature channel that is as consistent as possible with the third importance degree matrix. The third importance degree matrix includes the importance degrees of the respective feature channels in the first sampled image, and the fourth importance degree matrix includes the importance degrees of the respective feature channels in the second sampled image.
Optionally, the step S123 may include the following steps S1231 to S1232:
S1231, performing a difference processing on the importance degree of the kth characteristic channel in the first sampled image and the importance degree of the kth characteristic channel in the second sampled image to obtain an importance degree deviation of the kth characteristic channel in the second sampled image. k is a positive integer less than or equal to C;
Specifically, the computer device may perform a difference processing on the importance level of the kth feature channel in the first sampled image and the importance level of the kth feature channel in the second sampled image, to obtain an importance level deviation of the kth feature channel in the second sampled image, where the importance level deviation is used to reflect a difference between the importance level of the kth feature channel in the second sampled image and the importance level of the kth feature channel in the first sampled image.
S1232, squaring and summing the importance degree deviations corresponding to the C characteristic channels in the second sampling image to obtain the channel attention loss of the initial student model.
Specifically, the computer device may perform squaring and sum processing on the importance degree deviations corresponding to the C feature channels in the second sampled image, so as to obtain the channel attention loss of the initial student model, so that the initial student model focuses on the feature channel with high importance degree in the second sampled image, and improves the recognition accuracy.
For example, the computer device may calculate the channel attention loss of the initial student model using equation (3) as follows:

\[\mathcal{L}_{CA}=\sum_{k=1}^{C}\left\|\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}F^{T}_{(i,j,k)}-\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}F^{S}_{(i,j,k)}\right\|_{2}^{2} \tag{3}\]

Wherein in formula (3), \(\mathcal{L}_{CA}\) is the channel attention loss; H and W refer to the height and width of the first image feature (or the second image feature), respectively, i.e. the number of pixel points in the longitudinal and lateral directions, with M = H × W; \(F^{T}_{(i,j,k)}\) refers to the feature values of the M pixel points in the first sampled image, output by the initial teacher model, in the k-th feature channel, and \(F^{S}_{(i,j,k)}\) refers to the feature values of the M pixel points in the second sampled image, output by the initial student model, in the k-th feature channel. The mean over the H × W positions gives the importance degree of each feature channel.
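The per-channel importance computation (steps S121 to S123) and the channel attention loss of equation (3) can be sketched the same way (pure Python over [H][W][C] nested lists; the names are illustrative, not from the patent):

```python
def channel_importance(feat, k):
    """Importance degree of feature channel k: average of its feature
    values over all H * W pixel points."""
    h, w = len(feat), len(feat[0])
    return sum(feat[i][j][k] for i in range(h) for j in range(w)) / (h * w)

def channel_attention_loss(teacher_feat, student_feat):
    """Equation (3): sum over the C channels of the squared deviation
    between the teacher's (first sampled image) and the student's
    (second sampled image) channel importance degrees."""
    c = len(teacher_feat[0][0])
    return sum((channel_importance(teacher_feat, k)
                - channel_importance(student_feat, k)) ** 2
               for k in range(c))
```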
S13, determining the attribute prediction loss of the initial student model according to the second prediction object attribute and the third prediction object attribute.
Specifically, the computer device may determine, according to the second predicted object attribute and the third predicted object attribute, an attribute prediction loss of the initial student model, where the attribute prediction loss may also be referred to as a logits loss, used for reducing the difference between the predicted object attribute output by the initial student model and the predicted object attribute output by the initial teacher model for the same sample object.
For example, the computer device may calculate the attribute prediction loss of the initial student model using equation (4) as follows:

\[\mathcal{L}_{logits}=\sum_{i=1}^{N}\left\|p^{S}_{i}-p^{T}_{i}\right\|_{2}^{2} \tag{4}\]

Wherein N in equation (4) may refer to the number of categories of candidate object attributes, \(p^{S}_{i}\) represents the probability that the initial student model recognizes that the sample object in the second sampled image has the i-th candidate object attribute, and \(p^{T}_{i}\) represents the probability that the initial teacher model recognizes that the sample object in the first sampled image has the i-th candidate object attribute. \(\|\cdot\|_{2}\) is the two-norm operation (here squared, consistent with equations (2) and (3)).
S14, determining the spatial attention loss, the channel attention loss and the attribute prediction loss as self-distillation losses of the initial student model.
Specifically, after the spatial attention loss, the channel attention loss and the attribute prediction loss are obtained, the computer device may determine the spatial attention loss, the channel attention loss and the attribute prediction loss as self-distillation losses of the initial student model, which is beneficial to improving training accuracy of the initial student model by measuring the self-distillation losses from multiple angles.
For example, as shown in fig. 5, the computer device may perform downsampling processing on the sample object image 51a according to a first downsampling multiple to obtain a first sampled image, input the first sampled image into the initial teacher model, and perform recognition processing on the first sampled image through the initial teacher model to obtain a first image feature (i.e. an intermediate feature) and a second predicted object attribute of the first sampled image. The computer device may also perform downsampling processing on the sample object image 51a according to a second downsampling multiple to obtain a second sampled image, input the second sampled image into the initial student model, and perform recognition processing on the second sampled image through the initial student model to obtain a second image feature (i.e. an intermediate feature) and a third predicted object attribute of the second sampled image. Further, the computer device may calculate the importance degree 52a of each pixel point in the first sampled image according to the first image feature, calculate the importance degree 54a of each pixel point in the second sampled image according to the second image feature, and substitute the importance degree 52a and the importance degree 54a into the above formula (2) to obtain the spatial attention loss of the initial student model. Next, the importance degree 53a of each feature channel in the first sampled image is calculated according to the first image feature, the importance degree 55a of each feature channel in the second sampled image is calculated according to the second image feature, and the importance degree 53a and the importance degree 55a are substituted into the above formula (3) to obtain the channel attention loss of the initial student model.
Similarly, the second predicted object attribute and the third predicted object attribute may be substituted into the above formula (4), the attribute prediction loss of the initial student model may be calculated, and the spatial attention loss, the channel attention loss, and the attribute prediction loss may be determined as the self-distillation loss of the initial student model.
And S205, updating model parameters of the initial student model according to the identification loss and the self-distillation loss.
According to the application, the model parameters of the initial student model are updated according to the recognition loss and the self-distillation loss, which facilitates training the initial student model to have recognition capability for images of various resolutions and improves recognition precision.
The step S205 includes: performing weighted summation processing on the spatial attention loss, the channel attention loss and the attribute prediction loss contained in the self-distillation loss to obtain a self-distillation total loss of the initial student model; summing the self-distillation total loss and the identification loss to obtain the total loss of the initial student model; and updating model parameters of the initial student model according to the total loss.
Specifically, the computer device may obtain a weight control superparameter corresponding to the spatial attention loss, the channel attention loss, and the attribute prediction loss, and perform weighted summation processing on the spatial attention loss, the channel attention loss, and the attribute prediction loss included in the self-distillation loss according to the weight control superparameter, to obtain a self-distillation total loss of the initial student model. Further, summing the self-distillation total loss and the identification loss to obtain the total loss of the initial student model; and updating model parameters of the initial student model according to the total loss and a gradient descent algorithm so as to enable the self-distillation total loss and the recognition loss of the initial student model to reach minimum values and improve the training accuracy of the initial student model.
For example, the computer device may calculate the total loss of the initial student model using equation (5) as follows:

\[\mathcal{L}=\mathcal{L}_{rec}+\alpha\,\mathcal{L}_{SA}+\beta\,\mathcal{L}_{CA}+\gamma\,\mathcal{L}_{logits} \tag{5}\]

Wherein in formula (5), \(\mathcal{L}\) is the total loss of the initial student model, \(\mathcal{L}_{rec}\) is the recognition loss, and \(\alpha\), \(\beta\), \(\gamma\) are the weight control superparameters corresponding to the spatial attention loss, the channel attention loss and the attribute prediction loss respectively, which may be preset.
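The weighted combination of equation (5) can be sketched as follows (illustrative Python; the function and parameter names are assumptions, and the default weight values are placeholders, not values from the patent):

```python
def total_loss(recognition_loss, l_sa, l_ca, l_logits,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Equation (5): the recognition loss plus the weighted sum of the
    three self-distillation terms (spatial attention, channel
    attention, attribute prediction), with preset weight control
    superparameters alpha, beta, gamma."""
    return recognition_loss + alpha * l_sa + beta * l_ca + gamma * l_logits
```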
S206, updating the model parameters of the initial teacher model according to the model parameters of the updated initial student model in the historical time period, repeatedly executing the steps until training is finished for the updated initial teacher model and the updated initial student model, and determining the updated initial student model after training is finished as a target student model.
When the initial teacher model has multiple layers, the model parameters of the corresponding layers of the initial teacher model can be updated by adopting the processed model parameters corresponding to the layers of the updated initial student model; for example, the model parameters of the first layer of the initial teacher model may be updated using the processed model parameters corresponding to the first layer of the updated initial student model; and updating the model parameters of the second layer of the initial teacher model by adopting the processed model parameters corresponding to the second layer of the updated initial student model, and the like.
It should be noted that, the target student model may refer to an updated initial student model when the total loss is the minimum value; or the target student model may refer to an updated initial student model when the number of steps of the iterative update is greater than a step threshold, which may be preset.
Optionally, the step S206 includes: smoothing the model parameters of the updated initial student model in the historical time period to obtain the processed model parameters; and updating the model parameters of the initial teacher model according to the processed model parameters.
Specifically, the computer device may perform smoothing on the model parameters of the updated initial student model in the historical time period according to a smoothing algorithm to obtain processed model parameters, and adopt the processed model parameters to replace the model parameters of the initial teacher model, obtaining an updated initial teacher model. This reduces the change speed of the model parameters of the updated initial student model, so that the initial teacher model is not easily affected by noise in the training data, reducing the risk of overfitting of the initial teacher model.
It should be noted that the smoothing algorithm herein may include an exponential moving average (EMA, Exponential Moving Average) algorithm, Laplace smoothing, Good-Turing smoothing, and the like.
Optionally, after performing t-step iterative updating on the initial student model, acquiring model parameters of the updated initial student model after the t-step iterative updating and an index sliding average value of the updated initial student model after the t-1-step iterative updating; t is an integer greater than 1. And according to the smoothing factor, carrying out smoothing treatment on the model parameters after the iteration update in the t step and the index sliding average value after the iteration update in the t-1 step to obtain the index sliding average value of the initial student model after the update in the t step. And (3) determining the index sliding average value which is iteratively updated in the step t as the processed model parameter.
Specifically, after the t-step iterative update of the initial student model is completed, the computer device may obtain the model parameters of the updated initial student model after the t-th iteration update, and the exponential sliding average of the updated initial student model after the (t-1)-th iteration update; the model parameters after the t-th iteration update are the current model parameters of the updated initial student model, and the exponential sliding average after the (t-1)-th iteration update can be determined according to the model parameters of the initial student model after the previous t-1 iteration updates. Further, the model parameters after the t-th iteration update and the exponential sliding average after the (t-1)-th iteration update are smoothed according to a smoothing factor and a smoothing algorithm, to obtain the exponential sliding average of the updated initial student model after the t-th iteration update. The exponential sliding average after the t-th iteration update is determined as the processed model parameters, and the processed model parameters are adopted to replace the model parameters of the initial teacher model, obtaining an updated initial teacher model.
For example, the computer device may use an exponential moving average (EMA, Exponential Moving Average) algorithm to smooth the model parameters iteratively updated at the above t-th step and the exponential sliding average iteratively updated at the above (t-1)-th step. The exponential moving average algorithm is a commonly used time-series smoothing technique for calculating a weighted average of data (here, model parameters) over a period of time. In deep learning, EMA is typically used in optimization algorithms to update model parameters, smoothing them to reduce noise effects during training. Specifically, when the model parameters of the initial teacher model need to be updated, EMA may be used to calculate the new model parameters (i.e. the processed model parameters), see equation (6) below:

\[\theta'_{t}=\alpha\,\theta_{t}+(1-\alpha)\,\theta'_{t-1} \tag{6}\]

Wherein in formula (6), \(\theta'_{t}\) is the processed model parameter, i.e. the exponential sliding average after the t-th iteration update; \(\theta'_{t-1}\) is the exponential sliding average after the (t-1)-th step (i.e. the past model parameters), which can be derived iteratively according to equation (6); \(\theta_{t}\) represents the model parameters after the t-th iteration update (i.e. the current model parameters); and \(\alpha\) is a smoothing factor (0 < \(\alpha\) < 1) for controlling the relative importance of the current model parameters and the past model parameters. When \(\alpha\) is close to 1, the influence of the current model parameters is larger and the influence of the past model parameters is smaller; when \(\alpha\) is close to 0, the influence of the current model parameters is smaller and the influence of the past model parameters is larger. \(\alpha\) may increase as the number of steps of iterative updating of the initial student model increases.
For example, when t is 3, that is, the model parameters of the initial teacher model are updated once after every 3 iterative updates of the initial student model, the update proceeds as follows: (1) record the initial model parameters of the initial student model as \(\theta'_{0}\); (2) obtain the model parameters \(\theta_{1}\) of the initial student model after the 1st iteration update, and substitute \(\theta_{1}\) and \(\theta'_{0}\) into equation (6) to obtain the exponential sliding average after the 1st update iteration, denoted \(\theta'_{1}\); (3) obtain the model parameters \(\theta_{2}\) after the 2nd iteration update, and substitute \(\theta_{2}\) and \(\theta'_{1}\) into equation (6) to obtain the exponential sliding average after the 2nd update iteration, denoted \(\theta'_{2}\); (4) obtain the model parameters \(\theta_{3}\) of the updated initial student model after the 3rd iteration update, and substitute \(\theta_{3}\) and \(\theta'_{2}\) into equation (6) to obtain the exponential sliding average after the 3rd update iteration, denoted \(\theta'_{3}\). \(\theta'_{3}\) is taken as the updated model parameters and replaces the model parameters of the initial teacher model.
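The t = 3 EMA schedule above can be sketched as follows (a minimal Python illustration using a list of scalar parameters and an assumed smoothing factor of 0.5; the names and values are illustrative, not from the patent):

```python
def ema_update(current_params, past_ema, alpha):
    """Equation (6): theta'_t = alpha * theta_t + (1 - alpha) * theta'_{t-1},
    applied element-wise to each model parameter."""
    return [alpha * cur + (1.0 - alpha) * past
            for cur, past in zip(current_params, past_ema)]

# Three student iteration updates, then the teacher takes the final EMA.
theta0 = [1.0]          # initial student parameter, serves as theta'_0
ema = theta0
for theta_t in ([2.0], [3.0], [4.0]):  # parameters after steps 1..3
    ema = ema_update(theta_t, ema, alpha=0.5)
teacher_params = ema    # replaces the initial teacher model's parameters
```

Because past EMA values are folded in at each step, the teacher's parameters trail the student's smoothly instead of jumping to the latest weights.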
It should be noted that, when an exponential sliding average (EMA, exponential Moving Average) algorithm is used to smooth the model parameters updated in the above-mentioned t-th step iteration and the exponential sliding average updated in the above-mentioned t-1 st step iteration, the updated initial teacher model is generally referred to as an EMA model of the initial student model, and the EMA model has the following advantages:
(1) Compared with additionally training an offline teacher model, a great deal of time and computing resources are saved. Specifically, knowledge-distillation-based methods generally use a model with a more complex structure and a larger size as the teacher model, and even if a model whose structure is completely consistent with that of the student model is used as the teacher model, additionally training that teacher model consumes nearly twice the training time and resources. Compared with the method of using a model completely consistent with the initial student model structure as the initial teacher model, the application does not need to train any model in advance and thus saves about 50% of resources and time consumption; compared with methods using larger-scale teacher models, the resource and time savings are much greater than 50%.
(2) Compared to directly using the student model itself as a teacher model: the EMA model is more stable, and the updated initial teacher model is more stable in the training process by carrying out smoothing treatment on model parameters, so that noise and fluctuation in the training process can be reduced, and the knowledge quality learned by the initial student model can be improved.
(3) The EMA model has better generalization capability, and because the weight of the updated initial student model in a past period is considered in the calculation process of the EMA model, the EMA model can better capture the knowledge learned by the initial student model in the training process, so that the updated initial teacher model has better generalization capability, and the generalization capability of the student model (namely the target student model) is improved; the EMA model can prevent over fitting, and the EMA model is used as an updated initial teacher model to effectively prevent the over fitting phenomenon. Because the EMA model reduces the change speed of model parameters of the updated initial student model through smoothing, the updated initial teacher model is not easily affected by noise in training data, and therefore the risk of over-fitting is reduced.
The present application can be effectively applied to scenarios that require high-quality recognition of images at various resolutions. It improves the recognition accuracy of the target student model across resolutions while keeping the storage and time cost of deployment unchanged; it is highly general, requires no changes to current products or business processes at deployment time, and therefore has good application potential.
In one embodiment, what is ultimately deployed is the target student model; the user-side interaction is unchanged, and the inference speed and storage cost are no different from those of the current service, that is, the application does not affect the user-side experience. Specifically, after training the target student model, the computer device may acquire a target object image to be recognized and perform object recognition on it through the target student model to obtain the object attribute of the object in the target object image.
In addition, the application improves the recognition accuracy of the target student model for multi-resolution images within a single training run, without adding any resource or time consumption when the target student model is deployed. The multi-resolution recognition capability also relaxes the requirements on the user and the usage scenario: the user can obtain high-accuracy recognition at a range of distances from the image acquisition device, near or far, and no longer needs to stand close to the device.
It should be noted that, to demonstrate that the target student model of the present application achieves good recognition accuracy on multi-resolution images, it can be compared with models obtained by other training methods, including ArcFace trained with multi-scale (multi-resolution) augmentation, QualNet-LM, which is dedicated to improving recognition accuracy on low-resolution images, and F-SKD, a multi-scale image recognition training algorithm based on offline distillation. Broad tests can be performed on the AgeDB, CFP-FP, LFW, and TinyFace datasets; the test results are shown in Table 1 below:
TABLE 1
In Table 1, the ACC in ACC_x refers to recognition accuracy, and x refers to the resolution of the image to be recognized. The resolution can be a large (i.e., high) resolution; the large resolution is downsampled by factors of 2, 4, and 8, and the four resulting groups of resolutions are used as test sets for the accuracy tests, with the mean accuracy (ACC_MEAN) serving as a key indicator of each method's recognition accuracy across resolutions. As can be seen from Table 1, the mean accuracy of the target student model of the present application on the AgeDB, CFP-FP, LFW, and TinyFace datasets is higher than the mean accuracy of the other algorithms.
The present application has at least the following advantages: (1) The model parameters of the initial teacher model are updated from the updated model parameters of the initial student model over a historical period, which makes the initial teacher model more stable during training, reduces noise and fluctuation in the training process, and increases the amount of knowledge learned by the student model. (2) Because the model parameters of the updated initial student model over the historical period are taken into account when updating the initial teacher model, the initial teacher model can better capture the knowledge the updated initial student model has learned during training; the initial teacher model therefore generalizes better, which improves the generalization of the updated initial student model, that is, the training accuracy of the model. (3) The initial teacher model is trained during the training of the initial student model, so no offline teacher model needs to be trained separately; this saves a great deal of time and computing resources, improving model training efficiency and reducing the resources consumed by training. (4) The initial student model is trained using the first image feature, the second image feature, and so on. Since these features are obtained from sampled images of different resolutions, the information differences between images of different resolutions are taken into account, which improves training accuracy; the resulting target student model thus gains the ability to recognize images of different resolutions, improving its applicability.
Fig. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in fig. 6, the image processing apparatus may include:
An obtaining module 611, configured to obtain a first sample image and a second sample image corresponding to the sample object image, and an annotation object attribute for reflecting a sample object in the sample object image; the resolution of the first sampled image is greater than the resolution of the second sampled image;
The identifying module 612 is configured to identify the sample object image by using an initial student model to obtain a first predicted object attribute, identify the first sample image by using an initial teacher model to obtain a second predicted object attribute and a first image feature, and identify the second sample image by using the initial student model to obtain a third predicted object attribute and a second image feature;
A first updating module 613, configured to update model parameters of the initial student model according to the first image feature, the second image feature, the first predicted object attribute, the second predicted object attribute, the third predicted object attribute, and the labeling object attribute;
and a second updating module 614, configured to update the model parameters of the initial teacher model according to the model parameters of the updated initial student model in the historical time period, repeatedly execute the steps above for the updated initial teacher model and the updated initial student model until the training is completed, and determine the updated initial student model after the training is completed as the target student model.
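The pipeline described by the four modules above (recognize, update the student from the losses, then refresh the teacher from the student's parameter history) can be sketched as a toy iteration. This is an assumption-laden illustration: numpy arrays stand in for model parameters, a random vector stands in for the gradient produced by the losses, and the learning rate and smoothing factor are illustrative placeholders, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_step(student_w, ema_w, lr=0.01, smoothing=0.99):
    """One illustrative iteration: a pretend-gradient update of the student,
    then an exponential-moving-average refresh used as the teacher's parameters."""
    grad = rng.normal(size=student_w.shape)   # placeholder for the gradient of the total loss
    student_w = student_w - lr * grad         # update the initial student model
    ema_w = smoothing * ema_w + (1 - smoothing) * student_w
    teacher_w = ema_w                         # teacher parameters <- EMA of student history
    return student_w, ema_w, teacher_w

student_w = rng.normal(size=(4, 4))
ema_w = student_w.copy()                      # teacher starts from the student's weights
for _ in range(5):
    student_w, ema_w, teacher_w = train_step(student_w, ema_w)
```

Because the smoothing factor is close to 1, the teacher drifts far more slowly than the student, which is what makes it a stable distillation target.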
Optionally, the first updating module 613 is specifically configured to determine an identification loss of the initial student model according to the first predicted object attribute and the labeling object attribute;
determining a self-distillation loss of the initial student model based on the first image feature, the second predicted object attribute, and the third predicted object attribute;
And updating model parameters of the initial student model according to the identification loss and the self-distillation loss.
Optionally, the first updating module 613 is specifically configured to determine a spatial attention loss of the initial student model according to the first image feature and the second image feature;
determining a channel attention loss of the initial student model based on the first image feature and the second image feature;
Determining an attribute prediction loss of the initial student model according to the second predicted object attribute and the third predicted object attribute;
determining the spatial attention loss, the channel attention loss and the attribute prediction loss as self-distillation losses of the initial student model.
Optionally, the first updating module 613 is specifically configured to perform weighted summation processing on the spatial attention loss, the channel attention loss, and the attribute prediction loss included in the self-distillation loss, so as to obtain a self-distillation total loss of the initial student model;
Summing the self-distillation total loss and the identification loss to obtain the total loss of the initial student model;
and updating model parameters of the initial student model according to the total loss.
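The loss combination performed by the module above can be sketched as a plain weighted sum. The text states that a weighted summation is used but gives no coefficients, so the weights below are illustrative placeholders.

```python
def student_total_loss(recognition_loss, spatial_loss, channel_loss, attr_loss,
                       w_spatial=1.0, w_channel=1.0, w_attr=1.0):
    # Self-distillation total: weighted sum of its three component losses.
    self_distill_total = (w_spatial * spatial_loss
                          + w_channel * channel_loss
                          + w_attr * attr_loss)
    # Total loss: self-distillation total plus the recognition loss.
    return self_distill_total + recognition_loss

loss = student_total_loss(0.5, 0.2, 0.1, 0.3)  # ~1.1 with unit weights
```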
Optionally, the first image feature includes feature values corresponding to M pixel points in the first sampled image under C feature channels, and the second image feature includes feature values corresponding to M pixel points in the second sampled image under C feature channels, where M, C are integers greater than 1;
Optionally, the first updating module 613 is specifically configured to perform an averaging process on the feature values corresponding to each pixel point in the first sampled image under the C feature channels, so as to obtain an importance degree of the corresponding pixel point in the first sampled image;
Averaging the feature values corresponding to each pixel point in the second sampled image under the C feature channels to obtain the importance degree of the corresponding pixel point in the second sampled image;
And determining the spatial attention loss of the initial student model according to the importance degrees respectively corresponding to the M pixel points in the first sampling image and the importance degrees respectively corresponding to the M pixel points in the second sampling image.
Optionally, the first updating module 613 is specifically configured to perform a difference processing on the importance level of the f-th pixel in the first sampled image and the importance level of the f-th pixel in the second sampled image, so as to obtain an importance level deviation corresponding to the f-th pixel in the second sampled image; f is a positive integer less than or equal to M;
and squaring and summing the importance degree deviations of the M pixel points in the second sampling image to obtain the spatial attention loss of the initial student model.
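A minimal numpy sketch of the spatial-attention loss just described, assuming features are laid out as a (C, M) array of C channel values for each of M pixels; the layout and function name are assumptions, not the patent's exact formulation.

```python
import numpy as np

def spatial_attention_loss(teacher_feat, student_feat):
    """teacher_feat, student_feat: arrays of shape (C, M).
    Pixel importance = mean over the C channels; the loss is the sum of
    squared importance deviations over the M pixels."""
    imp_t = teacher_feat.mean(axis=0)   # (M,) importance per pixel, first sampled image
    imp_s = student_feat.mean(axis=0)   # (M,) importance per pixel, second sampled image
    return float(((imp_t - imp_s) ** 2).sum())

t = np.array([[1.0, 2.0], [3.0, 4.0]])  # C=2 channels, M=2 pixels -> importances [2, 3]
s = np.zeros((2, 2))                    # importances [0, 0]
print(spatial_attention_loss(t, s))     # 2^2 + 3^2 = 13.0
```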
Optionally, the first image feature includes feature values corresponding to M pixel points in the first sampled image under C feature channels, and the second image feature includes feature values corresponding to M pixel points in the second sampled image under C feature channels, where M, C are integers greater than 1;
optionally, the first updating module 613 is specifically configured to perform an averaging process on the feature values of the M pixel points in the first sampled image under each feature channel, so as to obtain importance degrees of corresponding feature channels in the first sampled image;
averaging the characteristic values of the M pixel points in the second sampling image under each characteristic channel to obtain the importance degree of the corresponding characteristic channel in the second sampling image;
And determining the channel attention loss of the initial student model according to the importance degrees respectively corresponding to the C characteristic channels in the first sampling image and the importance degrees respectively corresponding to the C characteristic channels in the second sampling image.
Optionally, the first updating module 613 is specifically configured to perform a difference processing on the importance level of the kth feature channel in the first sampled image and the importance level of the kth feature channel in the second sampled image, so as to obtain an importance level deviation of the kth feature channel in the second sampled image; k is a positive integer less than or equal to C;
and squaring and summing the importance degree deviations corresponding to the C characteristic channels in the second sampling image to obtain the channel attention loss of the initial student model.
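The channel-attention loss mirrors the spatial one with the averaging axes swapped: importance is computed per feature channel by averaging over the M pixels. Again, the (C, M) layout is an assumption for illustration.

```python
import numpy as np

def channel_attention_loss(teacher_feat, student_feat):
    """teacher_feat, student_feat: arrays of shape (C, M).
    Channel importance = mean over the M pixels; the loss is the sum of
    squared importance deviations over the C channels."""
    imp_t = teacher_feat.mean(axis=1)   # (C,) importance per channel, first sampled image
    imp_s = student_feat.mean(axis=1)   # (C,) importance per channel, second sampled image
    return float(((imp_t - imp_s) ** 2).sum())

t = np.array([[1.0, 3.0], [2.0, 4.0]])  # channel means [2, 3]
s = np.zeros((2, 2))
print(channel_attention_loss(t, s))     # 2^2 + 3^2 = 13.0
```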
Optionally, the second updating module 614 is specifically configured to perform smoothing processing on the model parameters of the updated initial student model in the historical time period to obtain processed model parameters;
and updating the model parameters of the initial teacher model according to the processed model parameters.
Optionally, the second updating module 614 is specifically configured to obtain, after performing t steps of iterative updating on the initial student model, the model parameters of the updated initial student model after the t-th iterative update, and the exponential moving average of the updated initial student model after the (t-1)-th iterative update; t is an integer greater than 1;
smoothing the model parameters after the t-th iterative update and the exponential moving average after the (t-1)-th iterative update according to a smoothing factor, to obtain the exponential moving average of the updated initial student model after the t-th update;
and determining the exponential moving average after the t-th iterative update as the processed model parameters.
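The smoothing step above corresponds to the standard exponential-moving-average recurrence. A sketch, assuming the conventional form ema_t = a * ema_{t-1} + (1 - a) * theta_t; the smoothing factor value is illustrative, as the text does not give one.

```python
def ema_step(ema_prev, params_t, alpha=0.999):
    """One EMA update: combine the (t-1)-step EMA with the t-step student
    parameters; the result becomes the processed (teacher) parameters."""
    return {name: alpha * ema_prev[name] + (1.0 - alpha) * params_t[name]
            for name in params_t}

ema = {"w": 0.0}
ema = ema_step(ema, {"w": 1.0}, alpha=0.9)  # "w" is approximately 0.1
```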
Optionally, the obtaining module 611 is specifically configured to perform downsampling processing on the sample object image according to a first downsampling multiple to obtain a first sampled image;
Performing downsampling processing on the sample object image according to a second downsampling multiple to obtain a second sampled image; the first downsampling multiple is smaller than the second downsampling multiple.
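The two-branch sampling above can be sketched with a naive stride-based downsample. This is illustrative only: the patent does not specify the downsampling method, and a real system would more likely use area or bicubic interpolation.

```python
import numpy as np

def downsample(image, factor):
    """Keep every `factor`-th pixel in each dimension (illustrative only)."""
    return image[::factor, ::factor]

sample = np.arange(64, dtype=float).reshape(8, 8)  # stand-in sample object image
first_sampled = downsample(sample, 2)   # smaller multiple -> higher resolution, (4, 4)
second_sampled = downsample(sample, 4)  # larger multiple  -> lower resolution,  (2, 2)
```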
Optionally, the acquiring module 611 is specifically configured to acquire an image of the target object to be identified;
And carrying out object recognition on the target object image through a target student model to obtain the object attribute of the object in the target object image.
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 7, the above-mentioned computer device 1000 may refer to a terminal or a server, and includes: a processor 1001, a network interface 1004, and a memory 1005; in addition, the above-described computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. In some embodiments, the user interface 1003 may include a display (Display) and a keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may also optionally be at least one storage device remote from the processor 1001. As shown in fig. 7, the memory 1005, which is a type of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a computer program.
In the computer device 1000 shown in FIG. 7, the network interface 1004 may provide network communication functions; while user interface 1003 is primarily used as an interface to provide input; and the processor 1001 may be used to invoke computer programs stored in the memory 1005 to implement the steps of the method embodiments of the present application.
It should be understood that the computer device described in the embodiments of the present application may perform the description of the image processing method in the foregoing corresponding embodiments, or may perform the description of the image processing apparatus in the foregoing corresponding embodiments, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
When the embodiments of the present application are applied, the relevant data collection and processing obtain the informed or separate consent of the personal-information subject in accordance with applicable laws and regulations, and subsequent data use and processing are carried out within the scope authorized by those laws and regulations and by the personal-information subject.
Furthermore, it should be noted here that: the embodiment of the present application further provides a computer readable storage medium, in which a computer program executed by the aforementioned image processing apparatus is stored, and the computer program includes program instructions, when executed by the aforementioned processor, can execute the description of the aforementioned image processing method in the corresponding embodiment, and therefore, a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
As an example, the above-described program instructions may be executed on one computer device or at least two computer devices disposed at one site, or at least two computer devices distributed at least two sites and interconnected by a communication network, which may constitute a blockchain network.
The computer readable storage medium may be the image processing apparatus provided in any one of the foregoing embodiments or an internal storage unit of the foregoing computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
The terms first, second and the like in the description and in the claims and drawings of embodiments of the application, are used for distinguishing between different media and not necessarily for describing a particular sequential or chronological order. Furthermore, the term "include" and any variations thereof is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
The embodiment of the present application further provides a computer program product, which includes a computer program; when executed by a processor, the computer program implements the description of the image processing method in the foregoing corresponding embodiments, and therefore a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer program product according to the present application, reference is made to the description of the method embodiments of the present application.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowchart and/or schematic structural diagrams of the method provided in the embodiments of the present application, and each flow and/or block of the flowchart and/or schematic structural diagrams of the method may be implemented by computer program instructions, and combinations of flows and/or blocks in the flowchart and/or block diagrams. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable network connection device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable network connection device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable network connection device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable network connection device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer implemented process such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in the flowchart flow or flows and/or structures.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (15)

1. An image processing method, comprising:
Acquiring a first sampling image and a second sampling image corresponding to a sample object image, and marking object attributes for reflecting sample objects in the sample object image; the resolution of the first sampled image is greater than the resolution of the second sampled image; the resolution of the sample object image is greater than the resolution of the first sampled image;
Performing object recognition on the sample object image through an initial student model to obtain a first predicted object attribute, performing recognition processing on the first sampled image through an initial teacher model to obtain a second predicted object attribute and a first image characteristic, and performing recognition processing on the second sampled image through the initial student model to obtain a third predicted object attribute and a second image characteristic;
determining the recognition loss of the initial student model according to the first predicted object attribute and the labeling object attribute;
Determining a self-distillation loss of the initial student model according to the first image feature, the second predicted object attribute and the third predicted object attribute;
updating model parameters of the initial student model according to the identification loss and the self-distillation loss;
After the initial student model is subjected to t-step iterative updating, updating the model parameters of the initial teacher model according to the updated model parameters of the initial student model in a historical time period; the updated model parameters of the initial student model in the history period comprise the model parameters of the initial student model after each step of iterative updating in the t steps of iterative updating; t is an integer greater than 1;
And repeatedly executing the steps until training is finished for the updated initial teacher model and the updated initial student model, and determining the updated initial student model after training is finished as a target student model.
2. The method of claim 1, wherein the determining the self-distillation loss of the initial student model based on the first image feature, the second predicted object property, and the third predicted object property comprises:
Determining a spatial attention loss of the initial student model from the first image feature and the second image feature;
Determining a channel attention loss of the initial student model from the first image feature and the second image feature;
Determining an attribute prediction loss of the initial student model according to the second predicted object attribute and the third predicted object attribute;
Determining the spatial attention loss, the channel attention loss, and the attribute prediction loss as self-distillation losses of the initial student model.
3. The method of claim 2, wherein updating model parameters of the initial student model based on the identification loss and the self-distillation loss comprises:
carrying out weighted summation processing on the spatial attention loss, the channel attention loss and the attribute prediction loss contained in the self-distillation loss to obtain a self-distillation total loss of the initial student model;
summing the self-distillation total loss and the identification loss to obtain the total loss of the initial student model;
And updating model parameters of the initial student model according to the total loss.
4. The method of claim 2, wherein the first image feature comprises feature values corresponding to M pixels in the first sampled image under C feature channels, respectively, and the second image feature comprises feature values corresponding to M pixels in the second sampled image under C feature channels, respectively, M, C being integers greater than 1;
The determining the spatial attention loss of the initial student model according to the first image feature and the second image feature comprises:
Averaging the characteristic values corresponding to each pixel point in the first sampling image under the C characteristic channels to obtain the importance degree of the corresponding pixel point in the first sampling image;
averaging the characteristic values corresponding to each pixel point in the second sampling image under the C characteristic channels to obtain the importance degree of the corresponding pixel point in the second sampling image;
and determining the spatial attention loss of the initial student model according to the importance degrees respectively corresponding to the M pixel points in the first sampling image and the importance degrees respectively corresponding to the M pixel points in the second sampling image.
5. The method of claim 4, wherein determining the spatial attention loss of the initial student model based on the respective degrees of importance for the M pixels in the first sampled image and the respective degrees of importance for the M pixels in the second sampled image comprises:
performing difference processing on the importance degree of the f-th pixel point in the first sampled image and the importance degree of the f-th pixel point in the second sampled image to obtain the importance degree deviation corresponding to the f-th pixel point in the second sampled image; f is a positive integer less than or equal to M;
and squaring and summing the importance degree deviations of the M pixel points in the second sampled image to obtain the spatial attention loss of the initial student model.
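The spatial attention loss of claims 4 and 5 reduces to: average over channels to get a per-pixel importance map for each model, then sum the squared per-pixel deviations. A minimal sketch, where the (C, M) array layout and the function name are illustrative assumptions:

```python
import numpy as np

def spatial_attention_loss(feat_teacher, feat_student):
    """Sum of squared differences between per-pixel importance maps.

    feat_teacher, feat_student: arrays of shape (C, M) -- C feature
    channels by M pixel positions (illustrative layout).
    """
    # Per-pixel importance: mean of the C channel values at each pixel
    imp_t = feat_teacher.mean(axis=0)   # shape (M,)
    imp_s = feat_student.mean(axis=0)   # shape (M,)
    # Square each per-pixel deviation and sum over the M pixels
    return float(((imp_t - imp_s) ** 2).sum())
```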
6. The method of claim 2, wherein the first image feature comprises feature values corresponding to M pixels in the first sampled image under C feature channels, respectively, and the second image feature comprises feature values corresponding to M pixels in the second sampled image under C feature channels, respectively, M, C being integers greater than 1;
the determining the channel attention loss of the initial student model according to the first image feature and the second image feature comprises the following steps:
averaging the feature values of the M pixel points in the first sampled image under each feature channel to obtain the importance degree of the corresponding feature channel in the first sampled image;
averaging the feature values of the M pixel points in the second sampled image under each feature channel to obtain the importance degree of the corresponding feature channel in the second sampled image;
and determining the channel attention loss of the initial student model according to the importance degrees respectively corresponding to the C feature channels in the first sampled image and the importance degrees respectively corresponding to the C feature channels in the second sampled image.
7. The method of claim 6, wherein determining the channel attention loss of the initial student model based on the respective degrees of importance for the C feature channels in the first sampled image and the respective degrees of importance for the C feature channels in the second sampled image comprises:
performing difference processing on the importance degree of the k-th feature channel in the first sampled image and the importance degree of the k-th feature channel in the second sampled image to obtain the importance degree deviation of the k-th feature channel of the second sampled image; k is a positive integer less than or equal to C;
and squaring and summing the importance degree deviations corresponding to the C feature channels in the second sampled image to obtain the channel attention loss of the initial student model.
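The channel attention loss of claims 6 and 7 mirrors the spatial one with the axes swapped: average over pixels to get a per-channel importance vector, then sum the squared per-channel deviations. A sketch under the same illustrative (C, M) layout:

```python
import numpy as np

def channel_attention_loss(feat_teacher, feat_student):
    """Sum of squared differences between per-channel importance vectors.

    feat_teacher, feat_student: arrays of shape (C, M) -- C feature
    channels by M pixel positions (illustrative layout).
    """
    # Per-channel importance: mean of the M pixel values in each channel
    imp_t = feat_teacher.mean(axis=1)   # shape (C,)
    imp_s = feat_student.mean(axis=1)   # shape (C,)
    # Square each per-channel deviation and sum over the C channels
    return float(((imp_t - imp_s) ** 2).sum())
```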
8. The method of claim 1, wherein updating the model parameters of the initial teacher model based on the updated model parameters of the initial student model over the historical period of time comprises:
smoothing the model parameters of the updated initial student model in the historical time period to obtain the processed model parameters;
And updating the model parameters of the initial teacher model according to the processed model parameters.
9. The method of claim 8, wherein smoothing model parameters of the updated initial student model over a historical period of time to obtain processed model parameters, comprises:
obtaining the model parameters of the updated initial student model after the t-th iterative update, and the exponential moving average of the updated initial student model after the (t-1)-th iterative update;
smoothing, according to a smoothing factor, the model parameters after the t-th iterative update and the exponential moving average after the (t-1)-th iterative update, to obtain the exponential moving average of the updated initial student model after the t-th iterative update;
and determining the exponential moving average after the t-th iterative update as the processed model parameters.
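Claim 9's smoothing step is a standard exponential moving average (EMA) of the student parameters, which the teacher then adopts. A minimal sketch; the function name and the smoothing-factor value 0.999 are illustrative choices, not fixed by the patent:

```python
def ema_update(student_params, ema_prev, alpha=0.999):
    """One EMA step: ema_t = alpha * ema_{t-1} + (1 - alpha) * theta_t.

    student_params: student parameters after the t-th iterative update.
    ema_prev:       EMA after the (t-1)-th iterative update.
    alpha:          smoothing factor (0.999 is an illustrative value).
    """
    return [alpha * e + (1.0 - alpha) * p
            for e, p in zip(ema_prev, student_params)]
```

In practice the teacher's parameters would simply be set to the returned EMA values after each student update, so the teacher tracks a smoothed trajectory of the student.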
10. The method of claim 1, wherein acquiring the first and second sampled images corresponding to the sample object image comprises:
performing downsampling processing on the sample object image according to a first downsampling factor to obtain the first sampled image;
and performing downsampling processing on the sample object image according to a second downsampling factor to obtain the second sampled image; the first downsampling factor is less than the second downsampling factor.
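Claim 10 derives the two sampled images by downsampling the same source image at two factors, the larger factor yielding the lower-resolution image. A sketch using naive strided subsampling; the patent does not specify the resampling method, so NumPy striding is purely an illustrative stand-in:

```python
import numpy as np

def downsample(image, factor):
    """Strided subsampling by an integer factor (illustrative stand-in
    for whatever resampling the method actually uses)."""
    return image[::factor, ::factor]

img = np.arange(64, dtype=float).reshape(8, 8)  # stand-in sample object image
first_sampled = downsample(img, 2)    # smaller factor -> higher resolution
second_sampled = downsample(img, 4)   # larger factor -> lower resolution
```

This preserves the ordering required by claim 1: source resolution > first sampled image > second sampled image.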
11. The method of claim 1, wherein the method further comprises:
Acquiring a target object image to be identified;
And carrying out object recognition on the target object image through a target student model to obtain object attributes of objects in the target object image.
12. An image processing apparatus, comprising:
The acquisition module is used for acquiring a first sampled image and a second sampled image corresponding to a sample object image, and a labeling object attribute of a sample object in the sample object image; the resolution of the first sampled image is greater than the resolution of the second sampled image; the resolution of the sample object image is greater than the resolution of the first sampled image;
The identification module is used for performing object recognition on the sample object image through an initial student model to obtain a first predicted object attribute, performing recognition processing on the first sampled image through an initial teacher model to obtain a second predicted object attribute and a first image feature, and performing recognition processing on the second sampled image through the initial student model to obtain a third predicted object attribute and a second image feature;
The first updating module is used for determining the recognition loss of the initial student model according to the first predicted object attribute and the labeling object attribute; determining a self-distillation loss of the initial student model according to the first image feature, the second predicted object attribute and the third predicted object attribute; updating model parameters of the initial student model according to the identification loss and the self-distillation loss;
The second updating module is used for updating the model parameters of the initial teacher model, after t steps of iterative updating are performed on the initial student model, according to the model parameters of the updated initial student model in a historical time period; repeatedly executing the above steps on the updated initial teacher model and the updated initial student model until training is finished; and determining the updated initial student model after training is finished as a target student model; the model parameters of the updated initial student model in the historical time period comprise the model parameters of the initial student model after each of the t steps of iterative updating; t is an integer greater than 1.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202410326496.5A 2024-03-21 2024-03-21 Image processing method, device, equipment and storage medium Active CN117935029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410326496.5A CN117935029B (en) 2024-03-21 2024-03-21 Image processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117935029A CN117935029A (en) 2024-04-26
CN117935029B (en) 2024-06-25

Family

ID=90759538

Country Status (1)

Country Link
CN (1) CN117935029B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731422A (en) * 2022-11-29 2023-03-03 阳光保险集团股份有限公司 Training method, classification method and device of multi-label classification model
CN117726884A (en) * 2024-02-09 2024-03-19 腾讯科技(深圳)有限公司 Training method of object class identification model, object class identification method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160533B (en) * 2019-12-31 2023-04-18 中山大学 Neural network acceleration method based on cross-resolution knowledge distillation
US20230153943A1 (en) * 2021-11-16 2023-05-18 Adobe Inc. Multi-scale distillation for low-resolution detection
CN116597260A (en) * 2023-03-24 2023-08-15 北京迈格威科技有限公司 Image processing method, electronic device, storage medium, and computer program product
CN116152573A (en) * 2023-03-29 2023-05-23 京东方科技集团股份有限公司 Image recognition method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant