CN115601629A - Model training method, image recognition method, medium, device and computing equipment

Info

Publication number
CN115601629A
CN115601629A
Authority
CN
China
Prior art keywords
target
image
model
image recognition
feature map
Prior art date
Legal status
Pending
Application number
CN202211347748.XA
Other languages
Chinese (zh)
Inventor
李雨珂
李唐薇
胡宜峰
刘稳军
杨卫强
朱浩齐
Current Assignee
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Netease Zhiqi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Netease Zhiqi Technology Co Ltd
Priority to CN202211347748.XA
Publication of CN115601629A

Classifications

    • G06V 20/00: Scenes; scene-specific elements
    • G06N 3/084: Learning methods using backpropagation, e.g. using gradient descent
    • G06V 10/40: Extraction of image or video features
    • G06V 10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a model training method, an image recognition method, a medium, an apparatus and a computing device, relating to the technical field of artificial intelligence. The model training method includes the following steps: obtaining a plurality of target feature maps corresponding to a sample image; inputting the target feature maps into a first submodel for convolution processing to obtain a classification feature map, a regression feature map and an object feature map corresponding to each target feature map; inputting the classification feature maps into a second submodel for feature decorrelation processing to obtain target sample weights; determining a target loss value based on the target sample weights, the classification feature maps, the regression feature maps, the object feature maps and the labeling information of the sample image; and adjusting parameters of the image recognition model according to the target loss value to obtain a trained image recognition model. The generalization capability of the image recognition model can thereby be greatly improved.

Description

Model training method, image recognition method, medium, device and computing equipment
Technical Field
Embodiments of the present disclosure relate to the technical field of artificial intelligence, and in particular to a model training method, an image recognition method, a medium, an apparatus and a computing device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In the face of the massive images on the internet, images containing information such as illicit or illegal behaviors, i.e., forbidden images, need to be selected and identified in order to purify the network environment, so that users can enjoy the convenience brought by the network while the information they receive remains safe.
Currently, a trained image recognition model is usually used to recognize whether an input image is a forbidden image. Such training assumes that the training images and the test images are independently and identically distributed. In a real scene, however, the distribution of the test images usually differs from that of the training images, so training must iterate continuously through a cycle of mining and screening data, labeling data, training the model and testing the model, so that the distribution of the training images keeps approaching the distribution of the test images, and the trained image recognition model is thus obtained. Even so, the generalization capability of the resulting image recognition model is poor.
Disclosure of Invention
The present disclosure provides a model training method, an image recognition method, a medium, an apparatus, and a computing device to solve the problem of poor generalization capability of image recognition models obtained by the current technology.
In a first aspect of embodiments of the present disclosure, there is provided a model training method for training an image recognition model, the image recognition model including a first sub-model and a second sub-model, the model training method including:
obtaining a plurality of target feature maps corresponding to a sample image, wherein the sizes of the plurality of target feature maps are different;
inputting the target feature maps into a first sub-model for convolution processing to obtain a classification feature map, a regression feature map and an object feature map which are output by the first sub-model and correspond to each target feature map;
inputting the classification feature maps into a second submodel for feature decorrelation processing to obtain target sample weights output by the second submodel, wherein the target sample weights are used to represent the weight of each classification feature in the classification feature maps;
determining a target loss value based on the target sample weight, the classification feature map, the regression feature map, the object feature map and the labeling information of the sample image;
and adjusting parameters of the image recognition model according to the target loss value to obtain the trained image recognition model.
In a possible implementation, inputting the classification feature maps into the second submodel for feature decorrelation processing to obtain the target sample weights output by the second submodel includes: inputting the plurality of classification feature maps into the second submodel to obtain a target classification feature map, wherein the target classification feature map is obtained by the second submodel splicing the plurality of classification feature maps; extracting target classification features based on the target classification feature map and the target classification feature map obtained in the previous training of the image recognition model; performing random Fourier transform processing on the target classification features to obtain random Fourier features; and performing iterative training of the feature decorrelation processing on the second submodel based on the random Fourier features to obtain the target sample weights.
In a possible implementation, performing iterative training of the feature decorrelation processing on the second submodel based on the random Fourier features to obtain the target sample weights includes: performing the iterative training of the feature decorrelation processing on the second submodel based on the random Fourier features, the first sample weights obtained in the previous training of the second submodel, and the second sample weights obtained in the previous training of the image recognition model, to obtain the target sample weights, wherein in the first training of the second submodel a preset initial sample weight is used as the first sample weight.
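To make the decorrelation step concrete, the following is a minimal PyTorch sketch of one way to implement it: pooled classification features are mapped to random Fourier features, and one weight per sample is then optimized so that the weighted features become pairwise uncorrelated, in the style of StableNet-type sample reweighting. All names, dimensions and optimizer settings are illustrative assumptions; the source does not specify them.

```python
import torch

def random_fourier_features(x, num_features=128, sigma=1.0):
    # x: (B, D) pooled classification features. Cosine projections
    # approximate a Gaussian kernel, so linear decorrelation in this
    # space also suppresses non-linear dependencies in the original one.
    d = x.shape[1]
    w = torch.randn(d, num_features, device=x.device) / sigma
    b = 2.0 * torch.pi * torch.rand(num_features, device=x.device)
    return (2.0 / num_features) ** 0.5 * torch.cos(x @ w + b)

def learn_sample_weights(z, num_iters=20, lr=0.1):
    # z: (B, F) random Fourier features. Learns one non-negative
    # weight per sample (summing to B) that minimizes the weighted
    # off-diagonal covariance, i.e. a feature decorrelation loss.
    B = z.shape[0]
    z = z.detach()  # weights are trained separately from the network
    alpha = torch.zeros(B, device=z.device, requires_grad=True)
    opt = torch.optim.Adam([alpha], lr=lr)
    for _ in range(num_iters):
        w = torch.softmax(alpha, dim=0) * B
        zw = z * w.unsqueeze(1)
        zc = zw - zw.mean(dim=0, keepdim=True)
        cov = zc.T @ zc / B
        off_diag = cov - torch.diag(torch.diag(cov))
        loss = (off_diag ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.softmax(alpha, dim=0) * B).detach()
```

The initialization with a preset weight in the first training round, and the blending with the first and second sample weights from previous rounds described above, would wrap around a routine of this kind.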
In one possible implementation, determining the target loss value based on the target sample weights, the classification feature maps, the regression feature maps, the object feature maps and the labeling information of the sample image includes: performing class detection on the plurality of classification feature maps through the first submodel to obtain a first detection result; determining an initial classification loss value according to the first detection result and the labeling information; determining a first loss value according to the initial classification loss value and the target sample weights; performing position detection on the plurality of regression feature maps through the first submodel to obtain a second detection result; determining a second loss value according to the second detection result and the labeling information; performing object detection on the plurality of object feature maps through the first submodel to obtain a third detection result; determining a third loss value according to the third detection result and the labeling information; and determining the target loss value according to the first loss value, the second loss value and the third loss value.
In one possible embodiment, determining the first loss value according to the initial classification loss value and the target sample weight includes: performing weighted summation processing on the initial classification loss value and the target sample weight to determine the first loss value.
In one possible implementation, performing weighted summation processing on the initial classification loss value and the target sample weight to determine the first loss value includes determining the first loss value according to the following equation:

$$\mathcal{L}_{1} = \sum_{i=1}^{B} w(z_i)\,\ell\big(f(x_i),\, y_i\big)$$

where $\ell(\cdot)$ represents the initial classification loss value; $w(z_i)$ represents the target sample weight; $x_i$ represents a sample image; $y_i$ represents the labeling information of the sample image; $z_i$ represents the random Fourier feature corresponding to $x_i$; $f(x_i)$ represents the first detection result output by the first submodel; and $B$ represents the number of sample images.
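As a concrete reading of this equation, the sketch below computes a per-sample classification loss, re-weights it by the target sample weights and sums over the batch. Cross-entropy stands in for the unspecified initial classification loss; that choice is an assumption, not the source's definition.

```python
import torch
import torch.nn.functional as F

def first_loss(logits, targets, sample_weights):
    # logits: (B, C) first detection results f(x_i); targets: (B,)
    # labels y_i from the labeling information; sample_weights: (B,)
    # target sample weights w(z_i) produced by the second submodel.
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (sample_weights * per_sample).sum()
```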
In a possible implementation, the first submodel includes a first convolution layer, and a second convolution layer and a third convolution layer respectively connected to the first convolution layer, and the second submodel is connected to the second convolution layer. Inputting the plurality of target feature maps into the first submodel for convolution processing to obtain the classification feature map, regression feature map and object feature map corresponding to each target feature map output by the first submodel includes: inputting the target feature map into the first convolution layer for convolution processing to obtain a convolution processing result output by the first convolution layer, wherein the first convolution layer comprises one convolutional layer; and inputting the convolution processing result into the second convolution layer and the third convolution layer respectively for convolution processing to obtain the classification feature map output by the second convolution layer, and the regression feature map and the object feature map output by the third convolution layer, wherein the second convolution layer comprises a plurality of cascaded convolutional layers and the third convolution layer comprises a plurality of cascaded convolutional layers.
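A minimal PyTorch sketch of a head structured as just described: a single shared convolution (the "first convolution layer"), followed by two cascaded-convolution branches (the "second" and "third convolution layers") that yield the classification feature map and the regression plus object feature maps. Channel counts, depths and output dimensions are assumed for illustration.

```python
import torch
from torch import nn

def conv_block(ch):
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

class FirstSubmodel(nn.Module):
    def __init__(self, in_ch=256, num_classes=2, depth=2):
        super().__init__()
        self.stem = conv_block(in_ch)  # "first convolution layer": one conv
        self.cls_branch = nn.Sequential(*[conv_block(in_ch) for _ in range(depth)])  # "second": cascaded convs
        self.reg_branch = nn.Sequential(*[conv_block(in_ch) for _ in range(depth)])  # "third": cascaded convs
        self.cls_out = nn.Conv2d(in_ch, num_classes, 1)  # classification feature map
        self.reg_out = nn.Conv2d(in_ch, 4, 1)            # regression feature map (box offsets)
        self.obj_out = nn.Conv2d(in_ch, 1, 1)            # object feature map (objectness)

    def forward(self, target_feature_map):
        x = self.stem(target_feature_map)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_out(c), self.reg_out(r), self.obj_out(r)
```

The same head would be applied to each target feature map (e.g. P3, P4 and P5 in the example given later).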
In one possible implementation, adjusting the parameters of the image recognition model according to the target loss value to obtain the trained image recognition model includes: adjusting the parameters of the image recognition model according to the target loss value and iteratively training the image recognition model until a preset number of iterations is reached.
In a possible implementation, after the preset number of iterations is reached, the model training method further includes: performing moving average processing on the initial parameters corresponding to the current iteration count, based on the preset iteration count, the current iteration count, the initial parameters corresponding to the current iteration count and the parameters obtained in the previous training of the image recognition model, to obtain target parameters corresponding to the current iteration count;
and iteratively training the image recognition model according to the target parameters corresponding to the current iteration count to obtain the trained image recognition model.
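One common realization of such parameter averaging is an exponential moving average whose decay depends on the iteration count. The sketch below is an assumed form; the source does not give the exact update rule.

```python
import torch

@torch.no_grad()
def moving_average_update(target_params, current_params, step, base_decay=0.999):
    # target_params: parameters carried over from the previous training
    # round; current_params: the "initial" parameters of the current
    # iteration. The ramped decay (an assumption) trusts the fresh
    # parameters more early on and averages harder later.
    decay = min(base_decay, (1.0 + step) / (10.0 + step))
    for tgt, cur in zip(target_params, current_params):
        tgt.mul_(decay).add_(cur, alpha=1.0 - decay)
    return target_params
```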
In a possible embodiment, the sample image includes a first image and a second image, the first image is an image acquired in a real scene, and the second image is an image generated by using a preset generation manner, where the preset generation manner includes at least one of adding a background, adding noise, and combining image elements.
In a possible implementation, before obtaining the plurality of target feature maps corresponding to the sample image, the model training method further includes: performing enhancement processing on the sample image, wherein the enhancement processing includes at least one of flipping, resizing, cropping, brightness adjustment, contrast adjustment and noise addition.
In a possible implementation manner, the image recognition model further includes a backbone network model and a feature pyramid network model, the feature pyramid network model is connected to the backbone network model, the first sub-model is connected to the feature pyramid network model, and the obtaining of the plurality of target feature maps corresponding to the sample image includes: inputting a sample image into a backbone network model for feature extraction to obtain a plurality of initial feature maps corresponding to the sample image, wherein the plurality of initial feature maps are different in size; and inputting the plurality of initial feature maps into the feature pyramid network model for feature fusion processing to obtain a plurality of target feature maps corresponding to the sample image.
In a second aspect, an embodiment of the present disclosure provides an image recognition method, including:
acquiring a plurality of target characteristic graphs corresponding to an image to be identified, wherein the sizes of the plurality of target characteristic graphs are different;
the method comprises the steps of inputting a plurality of target characteristic graphs into a first sub-model of an image recognition model for recognition processing, and obtaining an image recognition result output by the image recognition model, wherein the image recognition result is used for indicating whether an image to be recognized is a forbidden image, and the image recognition model is obtained by training by using a model training method according to the first aspect of the disclosure.
In one possible implementation, acquiring a plurality of target feature maps corresponding to an image to be recognized includes: acquiring an image to be identified; preprocessing an image to be identified to obtain a preprocessed image, wherein the preprocessing comprises image normalization processing and/or image scaling processing; and acquiring a plurality of target characteristic graphs according to the preprocessed images.
In one possible implementation, obtaining a plurality of target feature maps from the preprocessed image includes: inputting the preprocessed image into a backbone network model of an image recognition model for feature extraction to obtain a plurality of initial feature maps corresponding to the preprocessed image, wherein the plurality of initial feature maps are different in size; and inputting the plurality of initial feature maps into a feature pyramid network model of the image recognition model to perform feature fusion processing to obtain a plurality of target feature maps.
In one possible implementation, inputting the plurality of target feature maps into the first submodel of the image recognition model for recognition processing to obtain the image recognition result output by the image recognition model includes: inputting the plurality of target feature maps into the first submodel of the image recognition model for recognition processing to obtain scores of the target objects contained in the image to be recognized; if a score is larger than a threshold value, the obtained image recognition result is that the image to be recognized is a forbidden image; and if the score is smaller than or equal to the threshold value, the obtained image recognition result is that the image to be recognized is not a forbidden image.
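A sketch of this decision rule follows; the 0.5 threshold and the reduction over per-object scores are illustrative assumptions.

```python
def is_forbidden(object_scores, threshold=0.5):
    # object_scores: confidence scores of the target objects detected
    # in the image to be recognized. The image is judged forbidden
    # when the best score exceeds the threshold.
    best = max(object_scores, default=0.0)
    return best > threshold
```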
In a possible implementation, the image recognition result further includes position information of the target object in the image to be recognized, and the position information is used to determine the position of the target forbidden object in the forbidden image.
In a third aspect, an embodiment of the present disclosure provides a model training apparatus, configured to train an image recognition model, where the image recognition model includes a first submodel and a second submodel, and the model training apparatus includes:
the acquisition module is used for acquiring a plurality of target feature maps corresponding to the sample image, wherein the sizes of the plurality of target feature maps are different;
the first processing module is used for inputting the plurality of target feature maps into the first submodel for convolution processing to obtain the classification feature map, regression feature map and object feature map corresponding to each target feature map output by the first submodel;
the second processing module is used for inputting the classification feature maps into the second submodel for feature decorrelation processing to obtain the target sample weights output by the second submodel, wherein the target sample weights are used to represent the weight of each classification feature in the classification feature maps;
the determining module is used for determining a target loss value based on the target sample weights, the classification feature maps, the regression feature maps, the object feature maps and the labeling information of the sample image;
and the third processing module is used for adjusting the parameters of the image recognition model according to the target loss value so as to obtain the trained image recognition model.
In a possible implementation, the second processing module is specifically configured to: input the plurality of classification feature maps into the second submodel to obtain a target classification feature map, wherein the target classification feature map is obtained by the second submodel splicing the plurality of classification feature maps; extract target classification features based on the target classification feature map and the target classification feature map obtained in the previous training of the image recognition model; perform random Fourier transform processing on the target classification features to obtain random Fourier features; and perform iterative training of the feature decorrelation processing on the second submodel based on the random Fourier features to obtain the target sample weights.
In a possible implementation, the second processing module, when configured to perform iterative training of the feature decorrelation processing on the second submodel based on the random Fourier features to obtain the target sample weights, is specifically configured to: perform the iterative training of the feature decorrelation processing on the second submodel based on the random Fourier features, the first sample weights obtained in the previous training of the second submodel and the second sample weights obtained in the previous training of the image recognition model, to obtain the target sample weights, wherein in the first training of the second submodel a preset initial sample weight is used as the first sample weight.
In a possible implementation, the determining module is specifically configured to: perform class detection on the plurality of classification feature maps through the first submodel to obtain a first detection result; determine an initial classification loss value according to the first detection result and the labeling information; determine a first loss value according to the initial classification loss value and the target sample weights; perform position detection on the plurality of regression feature maps through the first submodel to obtain a second detection result; determine a second loss value according to the second detection result and the labeling information; perform object detection on the plurality of object feature maps through the first submodel to obtain a third detection result; determine a third loss value according to the third detection result and the labeling information; and determine the target loss value according to the first loss value, the second loss value and the third loss value.
In a possible implementation, the determining module, when configured to determine the first loss value according to the initial classification loss value and the target sample weight, is specifically configured to: and performing weighted summation processing on the initial classification loss value and the target sample weight to determine a first loss value.
In a possible implementation, the determining module, when configured to perform weighted summation processing on the initial classification loss value and the target sample weight to determine the first loss value, is specifically configured to determine the first loss value according to the following equation:

$$\mathcal{L}_{1} = \sum_{i=1}^{B} w(z_i)\,\ell\big(f(x_i),\, y_i\big)$$

where $\ell(\cdot)$ represents the initial classification loss value; $w(z_i)$ represents the target sample weight; $x_i$ represents a sample image; $y_i$ represents the labeling information of the sample image; $z_i$ represents the random Fourier feature corresponding to $x_i$; $f(x_i)$ represents the first detection result output by the first submodel; and $B$ represents the number of sample images.
In a possible implementation, the first submodel includes a first convolution layer, and a second convolution layer and a third convolution layer respectively connected to the first convolution layer, the second submodel is connected to the second convolution layer, and the first processing module is specifically configured to: input the target feature map into the first convolution layer for convolution processing to obtain a convolution processing result output by the first convolution layer, wherein the first convolution layer comprises one convolutional layer; and input the convolution processing result into the second convolution layer and the third convolution layer respectively for convolution processing to obtain the classification feature map output by the second convolution layer, and the regression feature map and the object feature map output by the third convolution layer, wherein the second convolution layer comprises a plurality of cascaded convolutional layers and the third convolution layer comprises a plurality of cascaded convolutional layers.
In a possible implementation manner, the third processing module is specifically configured to: and adjusting parameters of the image recognition model according to the target loss value, and performing iterative training on the image recognition model until the preset iteration times are reached.
In one possible implementation, the third processing module is further configured to: after the preset iteration number is reached, based on the preset iteration number, the current iteration number, the initial parameter corresponding to the current iteration number and the parameter obtained by training the image recognition model last time, performing moving average processing on the initial parameter corresponding to the current iteration number to obtain a target parameter corresponding to the current iteration number; and performing iterative training on the image recognition model according to the target parameters corresponding to the current iteration times to obtain the trained image recognition model.
In a possible embodiment, the sample image includes a first image and a second image, the first image is an image acquired in a real scene, and the second image is an image generated by using a preset generation manner, where the preset generation manner includes at least one of adding a background, adding noise, and combining image elements.
In a possible implementation, the obtaining module is further configured to: before obtaining the plurality of target feature maps corresponding to the sample image, perform enhancement processing on the sample image, wherein the enhancement processing includes at least one of flipping, resizing, cropping, brightness adjustment, contrast adjustment and noise addition.
In a possible implementation manner, the image recognition model further includes a backbone network model and a feature pyramid network model, the feature pyramid network model is connected to the backbone network model, the first submodel is connected to the feature pyramid network model, and the obtaining module is specifically configured to: inputting the sample image into a backbone network model for feature extraction to obtain a plurality of initial feature maps corresponding to the sample image, wherein the sizes of the plurality of initial feature maps are different; and inputting the plurality of initial feature maps into the feature pyramid network model for feature fusion processing to obtain a plurality of target feature maps corresponding to the sample image.
In a fourth aspect, an embodiment of the present disclosure provides an image recognition apparatus, including:
the device comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring a plurality of target characteristic graphs corresponding to an image to be recognized, and the sizes of the plurality of target characteristic graphs are different;
the processing module is configured to input the plurality of target feature maps into a first sub-model of the image recognition model for recognition processing, so as to obtain an image recognition result output by the image recognition model, where the image recognition result is used to indicate whether an image to be recognized is an illegal image, and the image recognition model is obtained by training using the model training method according to the first aspect of the disclosure.
In a possible implementation manner, the obtaining module is specifically configured to: acquiring an image to be identified; preprocessing an image to be identified to obtain a preprocessed image, wherein the preprocessing comprises image normalization processing and/or image scaling processing; and acquiring a plurality of target characteristic graphs according to the preprocessed images.
In a possible implementation manner, the obtaining module, when configured to obtain a plurality of target feature maps according to the preprocessed image, is specifically configured to: inputting the preprocessed image into a backbone network model of an image recognition model for feature extraction to obtain a plurality of initial feature maps corresponding to the preprocessed image, wherein the plurality of initial feature maps are different in size; and inputting the plurality of initial feature maps into a feature pyramid network model of the image recognition model to perform feature fusion processing to obtain a plurality of target feature maps.
In a possible implementation, the processing module is specifically configured to: input the plurality of target feature maps into the first submodel of the image recognition model for recognition processing to obtain scores of the target objects contained in the image to be recognized; if a score is larger than a threshold value, obtain an image recognition result that the image to be recognized is a forbidden image; and if the score is smaller than or equal to the threshold value, obtain an image recognition result that the image to be recognized is not a forbidden image.
In a possible implementation, the image recognition result further includes position information of the target object in the image to be recognized, and the position information is used to determine the position of the target forbidden object in the forbidden image.
In a fifth aspect, an embodiment of the present disclosure provides a computing device, including: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the model training method according to the first aspect of the present disclosure or the image recognition method according to the second aspect.
In a sixth aspect, the present disclosure provides a storage medium, in which computer program instructions are stored, and when executed, implement the model training method according to the first aspect or the image recognition method according to the second aspect of the present disclosure.
In a seventh aspect, the embodiments of the present disclosure provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the model training method according to the first aspect or the image recognition method according to the second aspect of the present disclosure.
The model training method, image recognition method, medium, apparatus and computing device provided by the embodiments of the present disclosure are used for training an image recognition model that includes a first submodel and a second submodel: a plurality of target feature maps of different sizes corresponding to a sample image are obtained; the target feature maps are input into the first submodel for convolution processing to obtain the classification feature map, regression feature map and object feature map corresponding to each target feature map output by the first submodel; the classification feature maps are input into the second submodel for feature decorrelation processing to obtain the target sample weights output by the second submodel, wherein the target sample weights represent the weight of each classification feature in the classification feature maps; a target loss value is determined based on the target sample weights, the classification feature maps, the regression feature maps, the object feature maps and the labeling information of the sample image; and the parameters of the image recognition model are adjusted according to the target loss value to obtain the trained image recognition model. By inputting the classification feature maps output by the first submodel into the second submodel for feature decorrelation processing to obtain the target sample weights, different classification features are made mutually independent and spurious correlations among them are removed, so that the image recognition model can better focus on the essential features related to the recognition result; the generalization capability of the image recognition model can therefore be greatly improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a schematic diagram of the distribution of training samples and test samples provided by a related art;
FIG. 2 is a schematic diagram illustrating a distribution of training samples and testing samples provided by another related art;
FIG. 3 is a diagram illustrating a target detection result provided by a related art;
fig. 4 is a schematic view of an application scenario of a model training method according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of a model training method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an image recognition model according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of a model training method provided by another embodiment of the present disclosure;
fig. 8 is a flowchart of an image recognition method according to an embodiment of the disclosure;
fig. 9 is a schematic diagram of image recognition by an image recognition model according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a storage medium provided by an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the present disclosure, a model training method, an image recognition method, a medium, an apparatus, and a computing device are provided.
In this context, the following terms are to be understood:
Forbidden image recognition: a technique that uses a computer to process, analyze and understand images in order to select and identify, from massive internet images, images containing information such as illicit or illegal behaviors. Forbidden image recognition can purify the network, so that the public can enjoy the convenience of the network while the information they receive remains safe.
Supervised Learning: one of the machine learning tasks. Let X denote a training sample and Y denote the label of the training sample; labeled training data (X, Y) means that each training sample includes an input and an expected output. Supervised learning derives an optimal prediction function from the labeled training data.
independent and identifiable Distributed (i.i.d.), in the probability statistics theory, it means that the values at any time in the random process are random variables, and if these random variables obey the same distribution and are Independent of each other, these random variables are Independent and Identically Distributed; in conventional algorithms, the independent co-distribution is usually assumed to be that the training sample X and the test sample Z both obey the same distribution and are both independently co-distributed, i.e. P tr (X,Z)=P te (X, Z) wherein P tr (X, Z) represents the distribution of training samples, P te (X, Z) represents the distribution of the test samples. Fig. 1 is a schematic diagram of a distribution of training samples and test samples provided by a related art, and as shown in fig. 1, shows a distribution of training samples and test samples obtained by an independent co-distribution algorithm, where the training samples and the test samples both obey the same distribution and are both independently co-distributed.
Out-of-Distribution Generalization (OOD Generalization): the task of generalizing an image recognition model under distribution shift, i.e., when $P_{tr}(X,Z) \neq P_{te}(X,Z)$. Under supervised learning, this means there is a deviation between the distribution of the test samples and the distribution of the training samples, and the deviation is unknown during training. Exemplarily, Fig. 2 is a schematic diagram of the distribution of training samples and test samples provided by another related art; as shown in Fig. 2, the training samples and test samples obtained by an out-of-distribution generalization algorithm do not obey the same distribution.
Object Detection: finding objects of interest in an image or video and detecting their position and size at the same time. Unlike the image classification task, object detection solves not only the classification problem but also the localization problem, and is therefore a multi-task problem. For example, Fig. 3 is a schematic diagram of an object detection result provided by the related art; as shown in Fig. 3, object detection yields each object contained in the image (e.g., a dog) and the position information of each object (e.g., the coordinates of the rectangular box around the dog).
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
In addition, the data related to the present disclosure may be data authorized by a user or fully authorized by each party, and the acquisition, transmission, use, and the like of the data all meet the requirements of relevant national laws and regulations, and the embodiments of the present disclosure may be combined with each other.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments thereof.
Summary of The Invention
The inventor finds that using a trained image recognition model to recognize whether an input image is a forbidden image is an important technique for purifying the network: it keeps the received content information safe while the public enjoys the convenience brought by the network. In the related art, such recognition assumes that the training images and the test images are independently and identically distributed. In a real scene, however, the distribution of the test images often differs from that of the training images, so training must iterate continuously through the cycle of mining and screening data, labeling data, training the model and testing the model, so that the distribution of the training images keeps approximating the distribution of the test images, improving the recognition accuracy and generalization capability of the image recognition model. This related art has the following problems:
(1) High cost. In a real scene, data distribution usually changes over time; constantly tracking the distribution change of the training data is very expensive, and the scale of training data required for supervised training is large, so the labeling cost is high, further increasing the total cost.
(2) Forbidden images are scarce, hard to acquire, and cover a narrow range of domains (different domains represent different distributions of data). In the massive image data of the internet, the proportion of forbidden images is extremely low, so it is relatively difficult to acquire a large number of images suitable for forbidden-image recognition tasks, and even more difficult to acquire forbidden images from different domains; the known range of domains in the training data is therefore narrow.
(3) Distribution shift is unavoidable and highly random. In a real scene, the data distribution usually changes over time, which breaks the assumption of independent and identical distribution. Classical supervised learning approaches are usually optimized by minimizing the training error and greedily absorb all correlations found in the data to make predictions; although this is proven effective in the independent-and-identically-distributed setting, the performance of supervised learning approaches is compromised when the data distribution changes, because the correlations between data do not remain constant in unseen test distributions. Under strong distribution shift, image recognition models that only minimize the training error can fail significantly, sometimes performing even worse than random guessing. Meanwhile, the distribution of the labeled data changes constantly over time, and this change is unknown during training, so the randomness is high and the recognition effect of the image recognition model degrades to varying degrees over time.
Therefore, the image recognition model obtained by the above related art has poor generalization capability.
Based on the above problems, the present disclosure provides a model training method, an image recognition method, a medium, an apparatus and a computing device. Sample images are enhanced and transformed, and sample images with unseen patterns outside the original distribution are generated, so as to improve the generalization capability of the image recognition model. During training, the first submodel and the second submodel included in the image recognition model strip the target features from the background information, extract the essential features of different categories, and remove irrelevant features and spurious correlations, improving the generalization capability of the image recognition model on unknown domains. In addition, a corresponding training strategy is adopted during training to further improve the generalization capability. An image recognition model with better generalization capability can thus be obtained, and when it is used for image recognition, the image recognition result can be obtained more accurately.
Application scene overview
An application scenario of the scheme provided by the present disclosure is first illustrated with reference to Fig. 4. Fig. 4 is a schematic view of an application scenario of the model training method according to an embodiment of the present disclosure. As shown in Fig. 4, the application scenario may include a server cluster 41 and a terminal 42. The server cluster 41 includes a plurality of servers 411 and a storage 412, and the terminal 42 may be a tablet computer, a notebook computer, a desktop computer or an intelligent appliance. The servers 411 are used for training the image recognition model, acquiring data from the storage 412 during training and storing the generated data in the storage 412. In addition, the terminal 42 communicates with the server cluster 41 during the training process through a wireless or wired network.
In addition, the embodiments of the present disclosure may be applied in image recognition scenarios. For example, at the time of content review, it is determined whether the images contained therein are illicit images.
It should be noted that fig. 4 is only a schematic diagram of an application scenario provided by the embodiment of the present disclosure, and the embodiment of the present disclosure does not limit the devices included in fig. 4, and does not limit the positional relationship between the devices in fig. 4. The model training method provided by the embodiment of the disclosure can be applied to a server, and the server can be an independent server, or can also be a service cluster and the like.
Exemplary method
A method for model training according to an exemplary embodiment of the present disclosure is described below with reference to fig. 5 in conjunction with the application scenario of fig. 4. It should be noted that the above application scenarios are only illustrated for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in any way in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
First, a model training method is described by way of a specific embodiment.
Fig. 5 is a flowchart of a model training method provided in an embodiment of the present disclosure, and is used for training an image recognition model, where the image recognition model includes a first sub-model and a second sub-model. As shown in fig. 5, the method of the embodiment of the present disclosure includes:
s501, obtaining a plurality of target characteristic graphs corresponding to the sample image, wherein the sizes of the plurality of target characteristic graphs are different.
In the embodiments of the present disclosure, the sample image is, for example, a forbidden image. Before the sample image is used to train the image recognition model, in order to balance speed and accuracy, the sample image may be preprocessed to obtain a preprocessed sample image, where the preprocessing includes, for example, sample image normalization processing and/or sample image scaling processing; the plurality of target feature maps corresponding to the sample image can then be obtained from the preprocessed sample image. When the sample image is scaled, the specific scale may be determined according to the actual task requirements; for example, the default scale is generally 320 pixels × 320 pixels.
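A minimal preprocessing sketch matching this description: normalization followed by scaling to the default 320 × 320 size. The ImageNet normalization statistics are assumed, not specified by the source.

```python
import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, size: int = 320) -> torch.Tensor:
    # image: (3, H, W) float tensor with values in [0, 1].
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)  # assumed ImageNet stats
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    x = (image - mean) / std                                  # image normalization
    x = F.interpolate(x.unsqueeze(0), size=(size, size),
                      mode="bilinear", align_corners=False)   # image scaling
    return x  # (1, 3, size, size), ready for the backbone
```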
Optionally, the sample image includes a first image and a second image, the first image is an image acquired in a real scene, the second image is an image generated in a preset generation manner, and the preset generation manner includes at least one of adding a background, adding noise, and combining image elements.
The first image is, for example, a forbidden image acquired in a real scene. Illustratively, as can be seen from the above related art, the image recognition model performs well when the training image and the test image are both independently and simultaneously distributed, but the recognition effect of the image recognition model is reduced when the image recognition model encounters some images of the unseen domain (i.e., not included in the training image). Considering that the recognition effect of the image recognition model is reduced on the unseen images, it is necessary to acquire images of various different domains as much as possible in training the image recognition model to improve the generalization capability of the image recognition model. However, in practical applications, especially in forbidden scenes, it is difficult to acquire images of more domains. Therefore, from the perspective of image generation, images in different domains can be generated to expand the diversity of sample images and improve the generalization capability of the image recognition model. Specifically, for example, the second image is generated by generating sample images of various different domains in a random background, random noise, random image element combination and other ways, so that the image recognition model pays more attention to the essential characteristic information of the target, and avoids noise information, thereby improving the generalization capability of the image recognition model.
As an embodiment, the specific generation process of the second image is as follows. A background gallery $B_g$, a target object library $O_{obj}$ and an irrelevant object library $N_{obj}$ are preset, where $O_{obj}$ contains $M$ target objects and $N_{obj}$ contains $Q$ irrelevant objects. Let the number of target composite images (i.e., second images) be denoted by $num$. An image $B_{gk}$ is randomly selected from the background gallery $B_g$; $m$ target objects are randomly selected from the $M$ target objects in $O_{obj}$, and $q$ irrelevant objects are randomly selected from the $Q$ irrelevant objects in $N_{obj}$; the selected $m$ target objects and $q$ irrelevant objects are randomly placed in $B_{gk}$, and random noise is added to generate a target image $I_m^k$, together with various types of information about the target objects, including their position information (i.e., coordinate information) and category information. By analogy, a set $I_m$ of $num$ second images can be obtained, where any second image contained in $I_m$ can be denoted by $I_m^k$, with $k$ ranging from 1 to $num$.
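The sketch below mirrors this composition loop. `paste_random` and `add_noise` are hypothetical helpers (random placement onto the background and random-noise injection), and object entries are assumed to carry image data and a `label` attribute.

```python
import random

def compose_second_images(backgrounds, target_lib, irrelevant_lib, num,
                          paste_random, add_noise):
    # backgrounds: background gallery B_g; target_lib: target object
    # library O_obj (M objects); irrelevant_lib: irrelevant object
    # library N_obj (Q objects); num: number of second images.
    dataset = []
    for _ in range(num):
        bg = random.choice(backgrounds).copy()
        targets = random.sample(target_lib, random.randint(1, len(target_lib)))
        distractors = random.sample(irrelevant_lib, random.randint(0, len(irrelevant_lib)))
        annotations = []
        for obj in targets:
            box = paste_random(bg, obj)           # random placement, returns coordinates
            annotations.append((box, obj.label))  # keep position + category of targets only
        for obj in distractors:
            paste_random(bg, obj)                 # distractors get no annotation
        dataset.append((add_noise(bg), annotations))
    return dataset
```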
Optionally, before obtaining the plurality of target feature maps corresponding to the sample image, the model training method provided in the embodiments of the present disclosure may further include: performing enhancement processing on the sample image, where the enhancement processing includes at least one of flipping, resizing, cropping, brightness adjustment, contrast adjustment and noise addition.
It is understood that after the sample image (including the first image and the second image) is obtained, the sample image may be enhanced to generate sample images with unseen patterns outside the original distribution (i.e., patterns not contained in the obtained sample images), so as to improve the generalization capability of the image recognition model. Specific enhancement functions include random flipping, random resizing and random cropping of the sample image, and specific enhancement operations include random brightness, random contrast and random Gaussian noise.
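A possible torchvision pipeline covering the listed operations (flipping, resizing and cropping, brightness and contrast adjustment, Gaussian noise); the parameter values are illustrative, and tensor inputs in [0, 1] are assumed.

```python
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # random flipping
    T.RandomResizedCrop(320, scale=(0.6, 1.0)),  # random resizing + cropping
    T.ColorJitter(brightness=0.4, contrast=0.4), # random brightness/contrast
    T.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0.0, 1.0)),  # Gaussian noise
])
```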
Optionally, the image recognition model further includes a backbone network model and a feature pyramid network model, the feature pyramid network model is connected to the backbone network model, the first sub-model is connected to the feature pyramid network model, and the obtaining of the plurality of target feature maps corresponding to the sample image may include: inputting a sample image into a backbone network model for feature extraction to obtain a plurality of initial feature maps corresponding to the sample image, wherein the plurality of initial feature maps are different in size; and inputting the plurality of initial feature maps into the feature pyramid network model for feature fusion processing to obtain a plurality of target feature maps corresponding to the sample image.
Exemplarily, fig. 6 is a schematic structural diagram of an image recognition model provided in an embodiment of the present disclosure. As shown in fig. 6, the image recognition model includes a backbone network model, a feature pyramid network model, a first sub-model, and a second sub-model. The backbone network model comprises a plurality of cascaded convolutional layers and is mainly used for feature extraction; the feature map output by each of its convolutional layers can be obtained. The feature pyramid network model also comprises a plurality of cascaded convolutional layers and mainly fuses the feature maps output by different convolutional layers of the backbone network model, enhancing the expressive capacity of the network: feature maps of different sizes are assigned to different convolutional layers of the feature pyramid network model for fusion processing, and fusing high-level features into low-level features strengthens the low-level feature representation. Accordingly, a sample image with a size of 320 pixels × 320 pixels can be input into the backbone network model for feature extraction to obtain a plurality of initial feature maps of different sizes, and these initial feature maps can then be input into the feature pyramid network model for feature fusion processing to obtain the plurality of target feature maps corresponding to the sample image. In one example, assume the backbone network model includes five cascaded convolutional layers, from bottom to top a first, second, third, fourth, and fifth convolutional layer, and the feature pyramid network model includes three cascaded convolutional layers, from top to bottom a first, second, and third convolutional layer. The 320 pixel × 320 pixel sample image is input into the backbone network model for feature extraction, yielding a feature map from each of its convolutional layers. The feature map output by the fifth convolutional layer of the backbone network model is then input into the first convolutional layer of the feature pyramid network model to obtain a corresponding target feature map (denoted, for example, P5); the second convolutional layer of the feature pyramid network model fuses P5 with the feature map output by the fourth convolutional layer of the backbone network model to obtain a corresponding target feature map (for example, P4); and the third convolutional layer of the feature pyramid network model fuses P4 with the feature map output by the third convolutional layer of the backbone network model to obtain a corresponding target feature map (for example, P3). The size of P5 is, for example, height × width × 1024, that of P4 height × width × 512, and that of P3 height × width × 256, where 1024, 512, and 256 denote the numbers of channels. For the first sub-model and the second sub-model in fig. 6, refer to the subsequent embodiments.
Extracting features through the backbone network model and then fusing them through the feature pyramid network model can effectively improve the quality of the plurality of target feature maps corresponding to the sample image while reducing resource occupation.
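As a point of reference, the top-down fusion described above can be sketched as a standard feature pyramid. This sketch unifies all levels to 256 channels through 1 × 1 lateral convolutions, whereas the example above keeps per-level channel counts (1024/512/256); the exact wiring is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Fuse three backbone stages (C3, C4, C5) into target maps P3, P4, P5."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)                                          # top level
        p4 = self.lateral[1](c4) + F.interpolate(p5, size=c4.shape[-2:])  # fuse downwards
        p3 = self.lateral[0](c3) + F.interpolate(p4, size=c3.shape[-2:])
        return p3, p4, p5

# Usage with a 320 x 320 input whose backbone halves resolution per stage:
fpn = SimpleFPN()
c3, c4, c5 = (torch.randn(1, 256, 40, 40),
              torch.randn(1, 512, 20, 20),
              torch.randn(1, 1024, 10, 10))
p3, p4, p5 = fpn(c3, c4, c5)   # 40x40, 20x20, 10x10 target feature maps
```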
S502, inputting the target feature maps into a first sub-model for convolution processing to obtain a classification feature map, a regression feature map and an object feature map which are output by the first sub-model and correspond to each target feature map.
In this step, after the plurality of target feature maps corresponding to the sample image are obtained, they may be input into the first sub-model for convolution processing to obtain the classification feature map, regression feature map, and object feature map output by the first sub-model for each target feature map. It can be understood that the classification feature map is used for category detection of an object contained in the sample image, such as the forbidden category to which the object belongs; the regression feature map is used for position detection of the object, such as its coordinates; and the object feature map is used for objectness detection, i.e., determining whether a candidate region actually contains an object. For how to obtain the classification feature map, regression feature map, and object feature map corresponding to each target feature map, reference may be made to the subsequent embodiments, which are not repeated here.
S503, inputting the classification feature map into a second sub-model for feature decorrelation processing to obtain a target sample weight output by the second sub-model, wherein the target sample weight is used to characterize the weight of each classification feature in the classification feature map.
Based on the causal association judgment method, the image recognition model can learn a set of sample weights under which any chosen variable is independent of the other variables; that is, selecting any variable as the target variable, its distribution does not change with the values of the other variables. This helps the image recognition model learn, during training, the association between each variable and the final result, removes spurious associations between variables, and lets the model focus better on the essential features related to the result. Therefore, in this step, after the classification feature map corresponding to each target feature map is output by the first sub-model, the classification feature map may be input into the second sub-model for feature decorrelation processing to obtain the target sample weight output by the second sub-model. The target sample weight characterizes the weight of each classification feature in the classification feature map, so that different classification features can be made independent of each other. For how to obtain the target sample weight output by the second sub-model, refer to the subsequent embodiments, which are not repeated here.
S504, determining a target loss value based on the target sample weight, the classification feature map, the regression feature map, the object feature map, and the labeling information of the sample image.
In this step, after the target sample weight output by the second sub-model is obtained, the target loss value may be determined based on the target sample weight, the classification feature map, the regression feature map, the object feature map, and the labeling information of the sample image. For how to determine the target loss value from these quantities, reference may be made to the subsequent embodiments, which are not repeated here.
And S505, adjusting parameters of the image recognition model according to the target loss value to obtain the trained image recognition model.
In this step, after the target loss value is determined, the parameters of the image recognition model (which can be understood as its global weight parameters) may be adjusted according to the target loss value, so as to obtain the trained image recognition model. Illustratively, in the process of iteratively training the image recognition model, in order to further improve its generalization capability, the training may be completed under a training strategy such as the Simple Moving Average (SMA) training mode. For how to adjust the parameters of the image recognition model according to the target loss value to obtain the trained image recognition model, reference may be made to the subsequent embodiments, which are not repeated here.
The model training method provided by the embodiment of the disclosure trains an image recognition model comprising a first sub-model and a second sub-model: a plurality of target feature maps of different sizes corresponding to the sample image are obtained; the target feature maps are input into the first sub-model for convolution processing to obtain the classification feature map, regression feature map, and object feature map output by the first sub-model for each target feature map; the classification feature map is input into the second sub-model for feature decorrelation processing to obtain the target sample weight output by the second sub-model, which characterizes the weight of each classification feature in the classification feature map; the target loss value is determined based on the target sample weight, the classification feature map, the regression feature map, the object feature map, and the labeling information of the sample image; and the parameters of the image recognition model are adjusted according to the target loss value to obtain the trained image recognition model. In the embodiment of the disclosure, inputting the classification feature map output by the first sub-model into the second sub-model for feature decorrelation processing yields the target sample weight, through which different classification features are made independent of each other and spurious associations between them are removed, so that the image recognition model focuses better on the essential features related to the recognition result and its generalization capability can be greatly improved.
Fig. 7 is a flowchart of a model training method according to another embodiment of the present disclosure. On the basis of the above embodiments, the embodiments of the present disclosure further illustrate the model training method. As shown in fig. 7, a method of an embodiment of the present disclosure may include:
S701, obtaining a plurality of target feature maps corresponding to the sample image, wherein the sizes of the plurality of target feature maps are different.
For a detailed description of this step, reference may be made to the description of S501 in the embodiment shown in fig. 5, which is not described herein again.
The first sub-model includes a first convolution layer and a second convolution layer and a third convolution layer each connected to the first convolution layer, and the second sub-model is connected after the second convolution layer. In this embodiment of the disclosure, step S502 in fig. 5 may further include the following two steps S702 and S703:
S702, inputting the target feature map into the first convolution layer for convolution processing to obtain a convolution processing result output by the first convolution layer.
Wherein the first convolution layer comprises a single convolutional layer.
And S703, inputting the convolution processing results into the second convolution layer and the third convolution layer respectively for convolution processing to obtain a classification feature map output by the second convolution layer, and a regression feature map and an object feature map output by the third convolution layer.
Wherein the second convolutional layer comprises a plurality of cascaded convolutional layers, and the third convolutional layer comprises a plurality of cascaded convolutional layers.
Exemplarily, referring to fig. 6, the first sub-model includes a first convolution layer 601 and a second convolution layer 602 and a third convolution layer 603 each connected to the first convolution layer; the first convolution layer 601 includes one 1 × 1 convolutional layer for dimension reduction; the second convolution layer 602 contains 3 cascaded convolutional layers (two 3 × 3 convolutional layers and one 1 × 1 convolutional layer); and the third convolution layer 603 contains 4 cascaded convolutional layers (two 3 × 3 convolutional layers and two 1 × 1 convolutional layers). Based on the example under step S501, P5, P4, and P3 may be input in turn into the first convolution layer 601 for convolution processing, i.e., dimension reduction, to obtain the convolution processing results output by the first convolution layer 601, for example reducing P5, P4, and P3 to height × width × 256. Each convolution processing result output by the first convolution layer 601 is then input into the second convolution layer 602 and the third convolution layer 603 respectively for convolution processing, yielding the classification feature map output by the second convolution layer 602 and the regression feature map and object feature map output by the third convolution layer 603, i.e., a classification feature map, a regression feature map, and an object feature map for each of P5, P4, and P3. Compared with the prior art, in which these three feature maps are obtained through only a single 1 × 1 convolutional layer, the embodiment of the disclosure decouples the prediction branches of the image recognition model via the second convolution layer and the third convolution layer.
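A minimal sketch of this decoupled head for a single pyramid level follows; the activation functions and the split of the third convolution layer's four layers into a shared trunk feeding two parallel 1 × 1 outputs are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """One level: 1x1 reducing conv (601), classification branch (602: two
    3x3 convs + one 1x1), regression/objectness branch (603: two 3x3 convs
    feeding two parallel 1x1 convs)."""
    def __init__(self, in_ch, num_classes, width=256):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, 1)                      # layer 601
        self.cls_branch = nn.Sequential(                            # layer 602
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, num_classes, 1))
        self.reg_trunk = nn.Sequential(                             # layer 603, shared part
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True))
        self.reg_out = nn.Conv2d(width, 4, 1)                       # regression feature map
        self.obj_out = nn.Conv2d(width, 1, 1)                       # object feature map

    def forward(self, x):
        x = self.stem(x)
        r = self.reg_trunk(x)
        return self.cls_branch(x), self.reg_out(r), self.obj_out(r)
```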
In the embodiment of the present disclosure, the step S503 in fig. 5 may further include the following four steps S704 to S707:
S704, inputting the plurality of classification feature maps into the second sub-model to obtain a target classification feature map.
The target classification feature map is obtained by the second sub-model concatenating (splicing) the plurality of classification feature maps.
It can be understood that the second sub-model can separate the target features from the background information, extract the essential features of different categories, and remove irrelevant features and spurious correlations, so that the image recognition model focuses more on the feature information of the object to be recognized and the interference of background information on the object is weakened. More accurate essential target features are thereby obtained, other irrelevant interference information is stripped away, and predictions are made only from the essential features (the features causally related to the recognition result), improving the generalization capability of the image recognition model in unknown domains. Exemplarily, referring to fig. 6, after the classification feature maps corresponding to P5, P4, and P3 are obtained, the 3 classification feature maps may be input into the second sub-model, which concatenates them to obtain the target classification feature map.
S705, extracting target classification features based on the target classification feature map and the target classification feature map obtained the last time the image recognition model was trained.
For example, a first preset number of iterations of training may be performed on the image recognition model in advance so that it acquires a certain learning capability; during each of these training passes, the target sample weight output by the second sub-model is the preset initial sample weight. On this basis, assuming the previous training of the image recognition model was the (N-1)-th pass, the target classification feature map obtained in the N-th (i.e., current) pass is combined with the target classification feature map obtained in the (N-1)-th (i.e., previous) pass to extract the target classification features.
And S706, carrying out random Fourier transform processing on the target classification features to obtain random Fourier features.
It can be understood that complex dependencies exist between the feature dimensions of a deep network. Removing only the linear correlations between the original features is insufficient to completely eliminate the spurious associations between irrelevant features and labels when the dimensionality is low. Kernel methods (such as linear, polynomial, or Gaussian kernel functions) can map the original features into a high-dimensional space, but the dimensionality of the mapped feature map can grow to infinity, making the correlations between features difficult to compute. The Random Fourier Feature (RFF) method, by contrast, performs well at approximating kernel functions and measuring feature independence, so random Fourier features are introduced into the image recognition model. Illustratively, the target classification features are subjected to random Fourier transform processing through a random Fourier feature extractor to obtain the random Fourier features.
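For reference, the standard cosine construction of random Fourier features, which approximates a Gaussian kernel, can be sketched as follows; the feature dimension D and the freshly drawn projection are assumptions (in practice the projection would typically be sampled once and reused):

```python
import math
import torch

def random_fourier_features(feats, num_features=128, generator=None):
    """Map (B, d) features to (B, D) random Fourier features:
    z(x) = sqrt(2 / D) * cos(x W + b), W ~ N(0, I), b ~ U(0, 2*pi)."""
    _, d = feats.shape
    W = torch.randn(d, num_features, generator=generator).to(feats)
    b = (2.0 * math.pi) * torch.rand(num_features, generator=generator).to(feats)
    return math.sqrt(2.0 / num_features) * torch.cos(feats @ W + b)
```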
And S707, performing iterative training of feature decorrelation processing on the second sub-model based on the random Fourier features to obtain the target sample weight.
It is understood that during each iterative training of the image recognition model, the second sub-model is iteratively trained a second predetermined number of times. In this step, after the random fourier features are obtained, iterative training of feature decorrelation processing may be performed on the second submodel based on the random fourier features, so as to obtain a target sample weight.
Further, optionally, performing iterative training of feature decorrelation processing on the second submodel based on the random fourier features to obtain the target sample weight, which may include: and performing iterative training of feature decorrelation processing on the second submodel based on the random Fourier features, the first sample weight obtained by training the second submodel last time and the second sample weight obtained by training the image recognition model last time to obtain a target sample weight, wherein a preset initial sample weight is used as the first sample weight in the first training of the second submodel.
Exemplarily, following the example of step S705, the second sample weight obtained by the second sub-model when the image recognition model was trained for the (N-1)-th time can be obtained. During the second preset number of iterations of training the second sub-model, the first iteration combines the second sample weight with the preset initial sample weight to perform the iterative training of feature decorrelation processing and obtain the sample weight output by the first-trained second sub-model; each subsequent iteration performs the feature decorrelation training based on the random Fourier features and the first sample weight obtained from the previous iteration of the second sub-model, until the number of iterations reaches the second preset number, yielding the target sample weight.
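A highly schematic sketch of this inner loop follows. It assumes the sample weights are parameterized through a softmax and optimized with Adam against the weighted off-diagonal covariance of the random Fourier features; the patent does not specify these details, so this illustrates feature decorrelation in general rather than the exact procedure:

```python
import torch

def learn_sample_weights(rff, init_w, inner_steps=20, lr=0.1):
    """Optimize per-sample weights so that the weighted off-diagonal
    covariance of the random Fourier features (rff: B x D) is minimized."""
    logits = init_w.clamp_min(1e-6).log().clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(inner_steps):
        w = torch.softmax(logits, dim=0) * rff.size(0)   # positive weights, mean 1
        mean = (w[:, None] * rff).sum(0) / w.sum()
        centered = rff - mean
        cov = (w[:, None] * centered).t() @ centered / w.sum()
        off_diag = cov - torch.diag(torch.diag(cov))
        loss = (off_diag ** 2).sum()                     # decorrelation objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.softmax(logits, dim=0) * rff.size(0)).detach()
```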
In the embodiment of the present disclosure, the step S504 in fig. 5 may further include the following four steps S708 to S711:
S708, performing class detection on the plurality of classification feature maps through the first sub-model to obtain a first detection result; determining an initial classification loss value according to the first detection result and the labeling information; and determining a first loss value based on the initial classification loss value and the target sample weight.
Exemplarily, referring to fig. 6, after the classification feature maps corresponding to P5, P4, and P3 are obtained, the 3 classification feature maps may be combined by the first sub-model to perform class detection on the 3 classification feature maps to obtain a first detection result (i.e., a prediction result), an initial classification loss value is determined according to the first detection result and the labeling information, and then the first loss value is determined according to the initial classification loss value and the obtained target sample weight.
Further, optionally, determining the first loss value according to the initial classification loss value and the target sample weight includes: and performing weighted summation processing on the initial classification loss value and the target sample weight to determine a first loss value.
Illustratively, the target sample weight characterizes the weight of each classification feature in the classification feature map, so the target sample weight can be understood as a vector; the initial classification loss value can likewise be understood as a vector. Multiplying the corresponding elements of the two vectors one by one and summing, i.e., performing a weighted summation, yields the first loss value, which is a weighted cross-entropy loss.
Optionally, performing weighted summation processing on the initial classification loss value and the target sample weight to determine a first loss value, including: determining a first loss value according to the following equation:
L_1 = Σ_{i=1}^{B} w(φ(x_i)) · ℓ(f(x_i), y_i)

wherein ℓ(·, ·) represents the initial classification loss value; w(φ(x_i)) represents the target sample weight; x_i represents a sample image; y_i represents the annotation information of the sample image; φ(x_i) represents the random Fourier features corresponding to x_i; f(x_i) represents the first detection result output by the first sub-model; and B denotes the number of sample images.
It can be understood that, when a batch of sample images is input into the model, the first loss value corresponding to that batch can be obtained based on the above formula.
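Translated directly into code, the formula is a per-sample classification loss scaled by the learned weights and summed over the batch; the cross-entropy form of ℓ and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def first_loss(cls_logits, targets, sample_weights):
    """Weighted classification loss: sum_i w_i * CE(f(x_i), y_i) over a batch
    of B samples; optionally divide by B to average instead of summing."""
    per_sample = F.cross_entropy(cls_logits, targets, reduction="none")  # (B,)
    return (sample_weights * per_sample).sum()
```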
S709, carrying out position detection on the multiple regression feature maps through the first sub-model to obtain a second detection result; and determining a second loss value according to the second detection result and the labeling information.
Exemplarily, referring to fig. 6, after obtaining regression feature maps corresponding to P5, P4, and P3, the 3 regression feature maps may be combined by the first submodel to perform position detection on the regression feature maps to obtain a second detection result, and then a second loss value may be determined according to the second detection result and the label information.
S710, carrying out object detection on the plurality of object characteristic graphs through the first submodel to obtain a third detection result; and determining a third loss value according to the third detection result and the labeling information.
Exemplarily, referring to fig. 6, after the object feature maps corresponding to P5, P4, and P3 are obtained, the 3 object feature maps may be combined by the first sub-model to perform object detection on the 3 object feature maps to obtain a third detection result, and then a third loss value may be determined according to the third detection result and the label information.
It should be noted that, the present disclosure does not limit the order of execution of S708, S709, and S710.
And S711, determining a target loss value according to the first loss value, the second loss value and the third loss value.
In this step, after the first loss value, the second loss value, and the third loss value are obtained, they may be added to obtain the target loss value, and the model parameters can then be updated by back propagation and gradient descent to reduce the target loss value.
In the embodiment of the present disclosure, the step S505 in fig. 5 may further include the following three steps S712 to S714:
and S712, adjusting parameters of the image recognition model according to the target loss value, and performing iterative training on the image recognition model until the preset iteration times are reached.
S713, based on the preset iteration number, the current iteration number, the initial parameter corresponding to the current iteration number and the parameter obtained by training the image recognition model last time, performing moving average processing on the initial parameter corresponding to the current iteration number to obtain a target parameter corresponding to the current iteration number.
And S714, performing iterative training on the image recognition model according to the target parameters corresponding to the current iteration times to obtain the trained image recognition model.
It can be understood that, in the process of iteratively training the image recognition model, in order to further improve its generalization capability, the training may be completed under a training strategy such as the SMA training mode. Specifically, the parameters of the image recognition model are adjusted according to the target loss value and the model is iteratively trained until the preset number of iterations is reached; thereafter the image recognition model is trained in the SMA mode, and the moving average is maintained until training ends. For the t-th iteration after the preset number of iterations is reached, the parameters of the image recognition model follow the formula below:
θ̂_t = θ_t, for t ≤ t_0; θ̂_t = (θ̂_{t-1} · (t - t_0) + θ_t) / (t - t_0 + 1), for t > t_0

wherein t represents the current iteration number; t_0 represents the iteration at which the weight parameters need to start being averaged, which can also be understood as the last iteration of the preset number of iterations; θ̂_t represents the weight parameters of the image recognition model trained with the SMA method at iteration t, i.e., the target parameters corresponding to the current iteration; θ_t represents the ordinary weight parameters of the current image recognition model at iteration t, i.e., the initial parameters corresponding to the current iteration; and θ̂_{t-1} represents the parameters obtained the last time the image recognition model was trained.
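In code, the recursion above amounts to keeping a running average of the weight tensors once iteration t_0 has passed; this sketch assumes the averaged weights live in a second model instance of the same architecture (normalization-layer buffers, if any, would need the same treatment):

```python
import torch

@torch.no_grad()
def sma_update(sma_model, model, t, t0):
    """theta_hat_t = (theta_hat_{t-1} * (t - t0) + theta_t) / (t - t0 + 1)
    for t >= t0; before t0 the averaged copy simply tracks the live model."""
    if t < t0:
        sma_model.load_state_dict(model.state_dict())
        return
    n = float(t - t0)
    for p_avg, p in zip(sma_model.parameters(), model.parameters()):
        p_avg.mul_(n).add_(p).div_(n + 1.0)
```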
On the basis of the above embodiment, in the process of training the image recognition model, multiple versions of the image recognition model can be retained at different checkpoints (selected, for example, by validation-set accuracy). By testing these retained models on multiple domains, the image recognition model that performs better across the known domains is selected as the final prediction model; that is, both the validation-set accuracy and the variance of the validation accuracy across domains are taken into account.
According to the model training method provided by the embodiment of the disclosure, a plurality of target feature maps of different sizes corresponding to the sample image are obtained; each target feature map is input into the first convolution layer for convolution processing to obtain the convolution processing result output by the first convolution layer; and the convolution processing results are input into the second convolution layer and the third convolution layer respectively for convolution processing to obtain the classification feature map output by the second convolution layer and the regression feature map and object feature map output by the third convolution layer. Compared with the prior art, in which the classification feature map, regression feature map, and object feature map are obtained through only a single convolutional layer, the second and third convolution layers decouple the prediction branches of the image recognition model. The plurality of classification feature maps are input into the second sub-model to obtain the target classification feature map; the target classification features are extracted based on this target classification feature map and the one obtained the last time the image recognition model was trained; and random Fourier transform processing is applied to the target classification features to obtain the random Fourier features, mapping the target classification features into a high-dimensional space where the correlations between different target classification features can be computed. Iterative training of feature decorrelation processing is then performed on the second sub-model based on the random Fourier features to obtain the target sample weight. Class detection is performed on the plurality of classification feature maps through the first sub-model to obtain the first detection result, an initial classification loss value is determined from the first detection result and the labeling information, and the first loss value is determined from the initial classification loss value and the target sample weight; position detection on the plurality of regression feature maps yields the second detection result and, with the labeling information, the second loss value; object detection on the plurality of object feature maps yields the third detection result and, with the labeling information, the third loss value; and the target loss value is determined from the first, second, and third loss values. Because the target loss value incorporates the target sample weight, different classification features are made independent of each other, spurious associations between them are removed, and the image recognition model focuses better on the essential features related to the recognition result, so the generalization capability of the image recognition model can be greatly improved.
Adjusting parameters of the image recognition model according to the target loss value, and performing iterative training on the image recognition model until a preset iteration number is reached; based on the preset iteration times, the current iteration times, the initial parameters corresponding to the current iteration times and the parameters obtained by training the image recognition model last time, performing moving average processing on the initial parameters corresponding to the current iteration times to obtain target parameters corresponding to the current iteration times; and performing iterative training on the image recognition model according to the target parameters corresponding to the current iteration times to obtain the trained image recognition model. Due to the adoption of the training strategy of moving average processing, the generalization capability of the image recognition model can be further improved.
Fig. 8 is a flowchart of an image recognition method according to an embodiment of the present disclosure. The method of the disclosed embodiments may be applied in a computing device, which may be a server or a cluster of servers, etc. As shown in fig. 8, the method of the embodiment of the present disclosure includes:
S801, acquiring a plurality of target feature maps corresponding to the image to be recognized, wherein the sizes of the plurality of target feature maps are different.
In the embodiment of the present disclosure, the image to be identified is, for example, a forbidden image. Optionally, obtaining a plurality of target feature maps corresponding to the image to be recognized includes: acquiring an image to be identified; preprocessing an image to be identified to obtain a preprocessed image, wherein the preprocessing comprises image normalization processing and/or image scaling processing; and acquiring a plurality of target characteristic graphs according to the preprocessed image.
Before the image to be recognized is input into the image recognition model, in order to balance speed and accuracy, the image may be preprocessed, for example by image normalization and/or image scaling. When scaling the image to be recognized, the specific target size may be determined by the actual task requirements; the default is generally 320 pixels × 320 pixels. Normalizing the image to be recognized keeps its data distribution consistent with that used during training of the image recognition model.
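A sketch of such preprocessing follows; the use of OpenCV and the ImageNet normalization statistics are assumptions, the only requirement stated above being that the normalization match the training-time distribution:

```python
import numpy as np
import cv2

def preprocess(image_bgr, size=320,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Scale to size x size and normalize. The mean/std here are the common
    ImageNet statistics, used as an assumption; what matters is that they
    match the statistics used when the model was trained."""
    img = cv2.resize(image_bgr, (size, size))               # image scaling
    img = img[:, :, ::-1].astype(np.float32) / 255.0        # BGR -> RGB, [0, 1]
    img = (img - np.asarray(mean, np.float32)) / np.asarray(std, np.float32)
    return np.transpose(img, (2, 0, 1))[None]               # 1 x 3 x H x W
```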
Optionally, obtaining a plurality of target feature maps according to the preprocessed image, including: inputting the preprocessed image into a backbone network model of an image recognition model for feature extraction to obtain a plurality of initial feature maps corresponding to the preprocessed image, wherein the plurality of initial feature maps are different in size; and inputting the plurality of initial feature maps into a feature pyramid network model of the image recognition model for feature fusion processing to obtain a plurality of target feature maps.
For example, referring to fig. 6, the preprocessed image may be input into the backbone network model of the image recognition model for feature extraction to obtain a plurality of initial feature maps corresponding to the preprocessed image, and the plurality of initial feature maps may then be input into the feature pyramid network model of the image recognition model for feature fusion processing to obtain the plurality of target feature maps.
S802, inputting the target feature maps into a first sub-model of the image recognition model for recognition processing to obtain an image recognition result output by the image recognition model.
The image identification result is used for indicating whether the image to be identified is a forbidden image or not.
The image recognition model is obtained by training by adopting a model training method in any one of the above method embodiments.
Exemplarily, fig. 9 is a schematic diagram of image recognition by an image recognition model according to an embodiment of the present disclosure. As shown in fig. 9 (compare fig. 6), the second sub-model of the image recognition model does not participate in model prediction; that is, compared with the original image recognition model, the whole model only incurs extra cost during training and adds no computational cost at the prediction stage. Thus, the image recognition model shown in fig. 9 includes the backbone network model, the feature pyramid network model, and the first sub-model, but not the second sub-model shown in fig. 6. The plurality of target feature maps are input into the first sub-model of the image recognition model of fig. 9 for recognition processing to obtain the image recognition result output by the image recognition model, which indicates whether the image to be recognized is a forbidden image.
Further, optionally, inputting the plurality of target feature maps into the first sub-model of the image recognition model for recognition processing to obtain the image recognition result output by the image recognition model includes: inputting the plurality of target feature maps into the first sub-model of the image recognition model for recognition processing to obtain the score of the target object contained in the image to be recognized; if the score is greater than the threshold, obtaining the image recognition result that the image to be recognized is a forbidden image; and if the score is less than or equal to the threshold, obtaining the image recognition result that the image to be recognized is not a forbidden image.
The threshold may be preset, and the present disclosure does not limit its value. After the score of the target object contained in the image to be recognized is obtained through the first sub-model, the score is compared with the threshold: if the score is greater than the threshold, the image recognition result is that the image to be recognized is a forbidden image; if the score is less than or equal to the threshold, the image recognition result is that the image to be recognized is not a forbidden image. After the image recognition result is obtained, it can be output and fed back to the user.
Optionally, the image recognition result further includes position information of the target object in the image to be recognized, and the position information is used to determine the position of the target prohibited object in the prohibited image.
For example, referring to fig. 9, since the first sub-model can perform position detection on the regression feature map, position information of the target object in the image to be recognized, such as coordinates of a rectangular frame in which the target object is located, can be output through the first sub-model.
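Combining the score threshold with the position output, the post-processing can be sketched as follows; the detection dictionary layout and the example threshold of 0.5 are assumptions:

```python
def interpret_detections(detections, threshold=0.5):
    """detections: list of {"score": float, "bbox": (x, y, w, h)} for target
    objects. The image is a forbidden image if any score exceeds the
    threshold; the bboxes of the hits locate the target objects."""
    hits = [d for d in detections if d["score"] > threshold]
    return {
        "is_forbidden": len(hits) > 0,
        "positions": [d["bbox"] for d in hits],
    }
```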
According to the image identification method provided by the embodiment of the disclosure, a plurality of target feature maps corresponding to an image to be identified are obtained, and the sizes of the plurality of target feature maps are different; and inputting the target characteristic graphs into a first sub-model of the image recognition model for recognition processing to obtain an image recognition result output by the image recognition model, wherein the image recognition result is used for indicating whether the image to be recognized is a forbidden image. Because the image recognition model of the embodiment of the disclosure has better generalization capability, the image recognition result can be more accurately obtained through the image recognition model.
Exemplary devices
Having described the methods of the exemplary embodiments of the present disclosure, the apparatuses of the exemplary embodiments of the present disclosure are described next with reference to fig. 10 and fig. 11. The apparatuses according to the exemplary embodiments of the present disclosure can implement each process in the foregoing method embodiments and achieve the same functions and effects.
Fig. 10 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present disclosure, configured to train an image recognition model, where the image recognition model includes a first sub-model and a second sub-model. As shown in fig. 10, the model training apparatus 1000 according to the embodiment of the present disclosure includes: an obtaining module 1001, a first processing module 1002, a second processing module 1003, a determining module 1004, and a third processing module 1005. Wherein:
the obtaining module 1001 is configured to obtain a plurality of target feature maps corresponding to the sample image, where the plurality of target feature maps are different in size.
The first processing module 1002 is configured to input the multiple target feature maps into a first sub-model for convolution processing, so as to obtain a classification feature map, a regression feature map, and an object feature map, which are output by the first sub-model and correspond to each target feature map.
The second processing module 1003 is configured to input the classification feature map into the second sub-model to perform feature decorrelation processing, so as to obtain a target sample weight output by the second sub-model, where the target sample weight is used to characterize a weight of each classification feature in the classification feature map.
And the determining module 1004 is configured to determine a target loss value based on the target sample weight, the classification feature map, the regression feature map, the object feature map, and the labeling information of the sample image.
A third processing module 1005, configured to adjust parameters of the image recognition model according to the target loss value, so as to obtain a trained image recognition model.
In a possible implementation, the second processing module 1003 may specifically be configured to: inputting the plurality of classification characteristic graphs into a second submodel to obtain a target classification characteristic graph, wherein the target classification characteristic graph is obtained by splicing the plurality of classification characteristic graphs by the second submodel; extracting target classification features based on the target classification feature map and a target classification feature map obtained by training an image recognition model last time; carrying out random Fourier transform processing on the target classification features to obtain random Fourier features; and performing iterative training of feature decorrelation processing on the second submodel based on the random Fourier features to obtain the weight of the target sample.
In a possible implementation, the second processing module 1003, when configured to perform iterative training of feature decorrelation processing on the second sub-model based on random fourier features, to obtain a target sample weight, may specifically be configured to: and performing iterative training of feature decorrelation processing on the second submodel based on the random Fourier feature, the first sample weight obtained by training the second submodel last time and the second sample weight obtained by training the image recognition model last time to obtain the target sample weight, wherein in the first training of the second submodel, the preset initial sample weight is used as the first sample weight.
In a possible implementation, the determining module 1004 may specifically be configured to: carrying out class detection on the plurality of classification characteristic graphs through the first sub-model to obtain a first detection result; determining an initial classification loss value according to the first detection result and the labeling information; determining a first loss value according to the initial classification loss value and the target sample weight; carrying out position detection on the multiple regression feature maps through the first sub-model to obtain a second detection result; determining a second loss value according to the second detection result and the labeling information; carrying out object detection on the plurality of object characteristic graphs through the first submodel to obtain a third detection result; determining a third loss value according to the third detection result and the labeling information; and determining a target loss value according to the first loss value, the second loss value and the third loss value.
In a possible implementation, the determining module 1004, when configured to determine the first loss value according to the initial classification loss value and the target sample weight, may specifically be configured to: and performing weighted summation processing on the initial classification loss value and the target sample weight to determine a first loss value.
In a possible implementation manner, the determining module 1004, when configured to perform weighted summation processing on the initial classification loss value and the target sample weight to determine the first loss value, may specifically be configured to: determining a first loss value according to the following equation:
L_1 = Σ_{i=1}^{B} w(φ(x_i)) · ℓ(f(x_i), y_i)

wherein ℓ(·, ·) represents the initial classification loss value; w(φ(x_i)) represents the target sample weight; x_i represents a sample image; y_i represents the annotation information of the sample image; φ(x_i) represents the random Fourier features corresponding to x_i; f(x_i) represents the first detection result output by the first sub-model; and B denotes the number of sample images.
In a possible implementation manner, the first sub-model includes a first convolutional layer and a second convolutional layer and a third convolutional layer respectively connected to the first convolutional layer, and a second sub-model is connected behind the second convolutional layer, and the first processing module 1002 may be specifically configured to: inputting the target feature map into a first convolution layer for convolution processing to obtain a convolution processing result output by the first convolution layer, wherein the first convolution layer comprises a convolution layer; and inputting convolution processing results into a second convolution layer and a third convolution layer respectively for convolution processing to obtain a classification feature map output by the second convolution layer, a regression feature map output by the third convolution layer and an object feature map, wherein the second convolution layer comprises a plurality of cascaded convolution layers, and the third convolution layer comprises a plurality of cascaded convolution layers.
In a possible implementation, the third processing module 1005 may be specifically configured to: and adjusting parameters of the image recognition model according to the target loss value, and performing iterative training on the image recognition model until the preset iteration times are reached.
In a possible implementation, the third processing module 1005 may be further configured to: after the preset iteration number is reached, based on the preset iteration number, the current iteration number, the initial parameter corresponding to the current iteration number and the parameter obtained by training the image recognition model last time, performing moving average processing on the initial parameter corresponding to the current iteration number to obtain a target parameter corresponding to the current iteration number; and performing iterative training on the image recognition model according to the target parameters corresponding to the current iteration times to obtain the trained image recognition model.
In a possible implementation, the sample image includes a first image and a second image, the first image is an image acquired in a real scene, and the second image is an image generated by adopting a preset generation manner, where the preset generation manner includes at least one of adding a background, adding noise, and combining image elements.
In one possible implementation, the obtaining module 1001 may further be configured to: before a plurality of target characteristic graphs corresponding to the sample image are obtained, enhancement processing is carried out on the sample image, wherein the enhancement processing comprises at least one of turning, size adjustment, cutting, brightness adjustment, contrast adjustment and noise addition.
In a possible implementation manner, the image recognition model further includes a backbone network model and a feature pyramid network model, the feature pyramid network model is connected to the backbone network model, the first sub-model is connected to the feature pyramid network model, and the obtaining module 1001 may be specifically configured to: inputting a sample image into a backbone network model for feature extraction to obtain a plurality of initial feature maps corresponding to the sample image, wherein the plurality of initial feature maps are different in size; and inputting the plurality of initial characteristic graphs into the characteristic pyramid network model for characteristic fusion processing to obtain a plurality of target characteristic graphs corresponding to the sample image.
The apparatus in the embodiment of the present disclosure may be configured to execute the scheme of the model training method in any one of the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 11 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present disclosure, and as shown in fig. 11, an image recognition apparatus 1100 according to an embodiment of the present disclosure includes: an acquisition module 1101 and a processing module 1102. Wherein:
the obtaining module 1101 is configured to obtain a plurality of target feature maps corresponding to an image to be recognized, where the target feature maps are different in size.
The processing module 1102 is configured to input the plurality of target feature maps into the first sub-model of the image recognition model for recognition processing to obtain the image recognition result output by the image recognition model, where the image recognition result is used to indicate whether the image to be recognized is a forbidden image, and the image recognition model is trained using the model training method of any one of the above method embodiments.
In a possible implementation, the obtaining module 1101 may be specifically configured to: acquiring an image to be identified; preprocessing an image to be identified to obtain a preprocessed image, wherein the preprocessing comprises image normalization processing and/or image scaling processing; and acquiring a plurality of target characteristic graphs according to the preprocessed images.
In a possible implementation, the obtaining module 1101, when configured to obtain a plurality of target feature maps according to the preprocessed image, may specifically be configured to: inputting the preprocessed image into a backbone network model of an image recognition model for feature extraction to obtain a plurality of initial feature maps corresponding to the preprocessed image, wherein the plurality of initial feature maps are different in size; and inputting the plurality of initial feature maps into a feature pyramid network model of the image recognition model for feature fusion processing to obtain a plurality of target feature maps.
In a possible implementation, the processing module 1102 may be specifically configured to: inputting a plurality of target characteristic graphs into a first sub-model of an image recognition model for recognition processing to obtain the scores of target objects contained in an image to be recognized; if the fraction is larger than the threshold value, obtaining an image identification result that the image to be identified is a forbidden image; and if the score is less than or equal to the threshold value, obtaining an image identification result that the image to be identified is not a forbidden image.
In a possible implementation manner, the image recognition result further includes position information of the target object in the image to be recognized, and the position information is used for determining the position of the target contraband object in the contraband image.
The apparatus in the embodiment of the present disclosure may be configured to implement the scheme of the image recognition method in any one of the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Exemplary Medium
Having described the method of the exemplary embodiment of the present disclosure, next, a storage medium of the exemplary embodiment of the present disclosure will be described with reference to fig. 12.
Fig. 12 is a schematic diagram of a storage medium according to an embodiment of the disclosure. Referring to fig. 12, a storage medium 1200 stores therein a program product for implementing the above method according to an embodiment of the present disclosure, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. The readable signal medium may also be any readable medium other than a readable storage medium.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN).
Exemplary computing device
Having described the methods, media, and apparatus of the exemplary embodiments of the present disclosure, a computing device of the exemplary embodiments of the present disclosure is described next with reference to fig. 13.
The computing device 1300 shown in fig. 13 is only one example and should not place any limitation on the scope of use and functionality of embodiments of the present disclosure.
Fig. 13 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure. As shown in fig. 13, the computing device 1300 is represented in the form of a general-purpose computing device. Components of the computing device 1300 may include, but are not limited to: at least one processing unit 1301, at least one storage unit 1302, and a bus 1303 connecting the different system components (including the processing unit 1301 and the storage unit 1302). Illustratively, the processing unit 1301 may be embodied as a processor; the storage unit 1302 stores computer-executable instructions, and the processing unit 1301 executes the computer-executable instructions stored in the storage unit 1302 to implement the image recognition method described above.
The bus 1303 includes a data bus, a control bus, and an address bus.
The storage unit 1302 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 13021 and/or cache memory 13022, and may further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 13023.
The storage unit 1302 may also include a program/utility 13025 having a set (at least one) of program modules 13024, such program modules 13024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Computing device 1300 may also communicate with one or more external devices 1304 (e.g., keyboard, pointing device, etc.). Such communication may occur via an input/output (I/O) interface 1305. Moreover, the computing device 1300 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) through the network adapter 1306. As shown in fig. 13, the network adapter 1306 communicates with the other modules of the computing device 1300 over the bus 1303. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 1300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several units/modules or sub-units/modules of the image recognition apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided into and embodied by a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, and that the division into aspects is for convenience of description only: features presented in different aspects may still be combined to advantage. The disclosure is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A model training method for training an image recognition model, the image recognition model comprising a first sub-model and a second sub-model, the model training method comprising:
obtaining a plurality of target feature maps corresponding to a sample image, wherein the sizes of the plurality of target feature maps are different;
inputting the plurality of target feature maps into the first submodel for convolution processing to obtain a classification feature map, a regression feature map and an object feature map which are output by the first submodel and correspond to each target feature map;
inputting the classification feature map into the second submodel to perform feature decorrelation processing, so as to obtain a target sample weight output by the second submodel, wherein the target sample weight is used for representing the weight of each classification feature in the classification feature map;
determining a target loss value based on the target sample weight, the classification feature map, the regression feature map, the object feature map and the labeling information of the sample image;
and adjusting parameters of the image recognition model according to the target loss value to obtain the trained image recognition model.
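For orientation only, the following is a minimal sketch (PyTorch) of the first-submodel convolution step in claim 1, which turns each multi-scale target feature map into a classification, a regression and an object feature map. The channel count, kernel size and two-class head are illustrative assumptions, not values taken from the patent.

    # Hypothetical sketch of the first submodel in claim 1 (PyTorch).
    # Channel count, kernel size and class count are assumptions.
    import torch
    import torch.nn as nn

    class FirstSubModel(nn.Module):
        """Convolves each multi-scale target feature map into a
        classification, a regression and an object feature map."""
        def __init__(self, channels=256, num_classes=2):
            super().__init__()
            self.cls_head = nn.Conv2d(channels, num_classes, 3, padding=1)
            self.reg_head = nn.Conv2d(channels, 4, 3, padding=1)  # box offsets
            self.obj_head = nn.Conv2d(channels, 1, 3, padding=1)  # objectness

        def forward(self, target_feature_maps):
            cls_maps, reg_maps, obj_maps = [], [], []
            for fmap in target_feature_maps:  # one map per scale
                cls_maps.append(self.cls_head(fmap))
                reg_maps.append(self.reg_head(fmap))
                obj_maps.append(self.obj_head(fmap))
            return cls_maps, reg_maps, obj_maps

    # Usage with three differently sized target feature maps:
    model = FirstSubModel()
    maps = [torch.randn(1, 256, s, s) for s in (80, 40, 20)]
    cls_maps, reg_maps, obj_maps = model(maps)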
2. The model training method of claim 1, wherein the inputting the classification feature map into the second sub-model for feature decorrelation processing to obtain a target sample weight output by the second sub-model comprises:
inputting the plurality of classification feature maps into the second submodel to obtain a target classification feature map, wherein the target classification feature map is obtained by splicing the plurality of classification feature maps by the second submodel;
extracting target classification features based on the target classification feature map and a target classification feature map obtained by training the image recognition model last time;
carrying out random Fourier transform processing on the target classification features to obtain random Fourier features;
and performing iterative training of the feature decorrelation processing on the second submodel based on the random Fourier features to obtain the target sample weight.
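As a hedged illustration of the random Fourier transform step in claim 2, the sketch below uses the standard cos(xW + b) construction that approximates a shift-invariant (RBF) kernel; the feature dimension and bandwidth are assumptions, not values fixed by the patent.

    # Illustrative random Fourier feature mapping; dimensions are assumed.
    import math
    import torch

    def random_fourier_features(x, num_features=128, sigma=1.0):
        """Map (N, D) target classification features to (N, num_features)
        random Fourier features."""
        _, d = x.shape
        w = torch.randn(d, num_features) / sigma    # random projection
        b = 2 * math.pi * torch.rand(num_features)  # random phase
        return math.sqrt(2.0 / num_features) * torch.cos(x @ w + b)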
3. The model training method of claim 2, wherein the iteratively training the feature decorrelation process on the second submodel based on the random fourier features to obtain the target sample weights comprises:
and performing iterative training of feature decorrelation processing on the second sub-model based on the random Fourier feature, the first sample weight obtained by training the second sub-model last time and the second sample weight obtained by training the image recognition model last time to obtain the target sample weight, wherein in the first training of the second sub-model, a preset initial sample weight is used as the first sample weight.
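One way to read the iterative training in claim 3 is as a small optimization over the sample weights themselves. The sketch below is an assumption-laden illustration: the decorrelation objective (penalizing off-diagonal covariance of the weighted random Fourier features) and the equal blend of the two previous weights are choices made here to keep the sketch concrete, not details fixed by the claim.

    # Illustrative iterative reweighting; objective and blending are assumed.
    import torch

    def update_sample_weights(rff, first_weights=None, second_weights=None,
                              steps=10, lr=0.1):
        n = rff.shape[0]
        if first_weights is None:            # first round: preset initial weight
            first_weights = torch.ones(n) / n
        w = first_weights.clone().requires_grad_(True)
        opt = torch.optim.SGD([w], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            weighted = rff * w.unsqueeze(1)  # apply per-sample weights
            cov = weighted.T @ weighted / n
            off_diag = cov - torch.diag(torch.diagonal(cov))
            off_diag.pow(2).sum().backward() # decorrelation objective
            opt.step()
        with torch.no_grad():
            w.clamp_(min=0)                  # keep weights non-negative
            if second_weights is not None:   # blend in the weight from the
                w = 0.5 * w + 0.5 * second_weights  # last model training round
        return w.detach()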
4. The model training method according to any one of claims 1 to 3, wherein the determining a target loss value based on the target sample weight, the classification feature map, the regression feature map, the object feature map, and annotation information of the sample image comprises:
carrying out class detection on the plurality of classification feature maps through the first submodel to obtain a first detection result; determining an initial classification loss value according to the first detection result and the labeling information; determining a first loss value according to the initial classification loss value and the target sample weight;
carrying out position detection on the plurality of regression feature maps through the first submodel to obtain a second detection result; determining a second loss value according to the second detection result and the labeling information;
carrying out object detection on the plurality of object feature maps through the first submodel to obtain a third detection result; determining a third loss value according to the third detection result and the labeling information;
and determining the target loss value according to the first loss value, the second loss value and the third loss value.
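For concreteness, a minimal sketch of the loss composition in claim 4 follows. The specific loss functions (binary cross-entropy for classification and objectness, L1 for regression) are assumptions; the claim only fixes which detection result each loss compares against the annotation information, and that the classification loss is reweighted by the target sample weight.

    # Illustrative target-loss composition; loss functions are assumed.
    import torch
    import torch.nn.functional as F

    def compute_target_loss(sample_weights, cls_logits, cls_targets,
                            reg_preds, reg_targets, obj_logits, obj_targets):
        # First loss value: per-sample classification loss, reweighted by
        # the target sample weights produced by the second submodel.
        per_sample = F.binary_cross_entropy_with_logits(
            cls_logits, cls_targets, reduction="none").mean(dim=1)
        first_loss = (sample_weights * per_sample).sum()
        # Second loss value: position regression against the annotations.
        second_loss = F.l1_loss(reg_preds, reg_targets)
        # Third loss value: objectness against the annotations.
        third_loss = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)
        # Target loss value: combination of the three branch losses.
        return first_loss + second_loss + third_loss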
5. An image recognition method, comprising:
acquiring a plurality of target feature maps corresponding to an image to be recognized, wherein the sizes of the plurality of target feature maps are different;
and inputting the target feature maps into a first submodel of an image recognition model for recognition processing to obtain an image recognition result output by the image recognition model, wherein the image recognition result is used for indicating whether the image to be recognized is a forbidden image or not, and the image recognition model is obtained by adopting the model training method as claimed in any one of claims 1 to 4.
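A brief sketch of the inference path in claim 5, reusing the FirstSubModel sketch above: only the first submodel of the trained image recognition model is exercised at test time. The sigmoid scoring and the 0.5 threshold are assumptions for illustration, not values stated in the claim.

    # Illustrative inference path; scoring rule and threshold are assumed.
    import torch

    @torch.no_grad()
    def recognize(first_submodel, target_feature_maps, threshold=0.5):
        cls_maps, _, _ = first_submodel(target_feature_maps)
        # highest class confidence over all scales and spatial positions
        score = max(torch.sigmoid(m).max().item() for m in cls_maps)
        return {"forbidden": score >= threshold, "score": score}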
6. A model training apparatus for training an image recognition model, the image recognition model comprising a first sub-model and a second sub-model, the model training apparatus comprising:
the acquisition module is used for acquiring a plurality of target feature maps corresponding to the sample image, wherein the sizes of the plurality of target feature maps are different;
the first processing module is used for inputting the plurality of target feature maps into the first submodel for convolution processing to obtain a classification feature map, a regression feature map and an object feature map which are output by the first submodel and correspond to each target feature map;
the second processing module is used for inputting the classification feature map into the second submodel to perform feature decorrelation processing, so as to obtain a target sample weight output by the second submodel, wherein the target sample weight is used for representing the weight of each classification feature in the classification feature map;
a determining module, configured to determine a target loss value based on the target sample weight, the classification feature map, the regression feature map, the object feature map, and labeling information of the sample image;
and the third processing module is used for adjusting parameters of the image recognition model according to the target loss value so as to obtain the trained image recognition model.
7. An image recognition apparatus comprising:
the device comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a plurality of target characteristic maps corresponding to an image to be recognized, and the sizes of the plurality of target characteristic maps are different;
and the processing module is used for inputting the plurality of target feature maps into a first submodel of an image recognition model for recognition processing, so as to obtain an image recognition result output by the image recognition model, wherein the image recognition result is used for indicating whether the image to be recognized is a forbidden image, and the image recognition model is obtained by training according to the model training method as claimed in any one of claims 1 to 4.
8. A computing device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 5.
9. A storage medium having stored therein computer program instructions which, when executed, implement the method of any one of claims 1 to 5.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 5.
CN202211347748.XA 2022-10-31 2022-10-31 Model training method, image recognition method, medium, device and computing equipment Pending CN115601629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211347748.XA CN115601629A (en) 2022-10-31 2022-10-31 Model training method, image recognition method, medium, device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211347748.XA CN115601629A (en) 2022-10-31 2022-10-31 Model training method, image recognition method, medium, device and computing equipment

Publications (1)

Publication Number Publication Date
CN115601629A true CN115601629A (en) 2023-01-13

Family

ID=84851130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211347748.XA Pending CN115601629A (en) 2022-10-31 2022-10-31 Model training method, image recognition method, medium, device and computing equipment

Country Status (1)

Country Link
CN (1) CN115601629A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117932497A (en) * 2024-03-19 2024-04-26 腾讯科技(深圳)有限公司 Model determination method and related device

Similar Documents

Publication Publication Date Title
US10650042B2 (en) Image retrieval with deep local feature descriptors and attention-based keypoint descriptors
CN111898696B (en) Pseudo tag and tag prediction model generation method, device, medium and equipment
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
US10242289B2 (en) Method for analysing media content
JP7286013B2 (en) Video content recognition method, apparatus, program and computer device
WO2010043954A1 (en) Method, apparatus and computer program product for providing pattern detection with unknown noise levels
CN113821657A (en) Artificial intelligence-based image processing model training method and image processing method
CN115761599A (en) Video anomaly detection method and system
CN115601629A (en) Model training method, image recognition method, medium, device and computing equipment
CN109522451B (en) Repeated video detection method and device
CN116958267B (en) Pose processing method and device, electronic equipment and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN117636426A (en) Attention mechanism-based facial and scene emotion recognition method
CN117134958A (en) Information processing method and system for network technology service
CN116051118B (en) Analysis method and device of behavior time sequence model
Dong et al. Scene-oriented hierarchical classification of blurry and noisy images
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN116777814A (en) Image processing method, apparatus, computer device, storage medium, and program product
CN114332716A (en) Method and device for clustering scenes in video, electronic equipment and storage medium
Zhang et al. MTSCANet: Multi temporal resolution temporal semantic context aggregation network
CN115146092A (en) Feature model acquisition method, image processing method and device and computer equipment
CN110263196B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN114510592A (en) Image classification method and device, electronic equipment and storage medium
CN113536859A (en) Behavior recognition model training method, recognition method, device and storage medium
CN118152812B (en) Training method, device, equipment and storage medium for false information identification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination