CN114863229A - Image classification method and training method and device of image classification model

Image classification method and training method and device of image classification model

Info

Publication number
CN114863229A
CN114863229A
Authority
CN
China
Prior art keywords
feature map
image
sequence
local
network
Prior art date
Legal status
Pending
Application number
CN202210315149.3A
Other languages
Chinese (zh)
Inventor
袁小童
谭资昌
郭国栋
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210315149.3A
Publication of CN114863229A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides an image classification method and a training method for an image classification model, and relates to the field of artificial intelligence, in particular to computer vision and deep learning technologies. The image classification model comprises a self-attention encoder, a global coding network, a first local coding network, and a prediction network. The image classification method is implemented as follows: dividing an image to be classified into a plurality of image blocks to obtain an image block sequence; performing self-attention encoding on the image block sequence with the self-attention encoder to obtain a first feature map sequence, where the first feature map sequence comprises a plurality of feature maps corresponding respectively to the plurality of image blocks; extracting a global feature of the first feature map sequence with the global coding network to obtain a global feature map; extracting a first local feature of the first feature map sequence with the first local coding network to obtain a first local feature map; and inputting the global feature map and the first local feature map into the prediction network to obtain classification information of the image to be classified.

Description

Image classification method and training method and device of image classification model
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to computer vision and deep learning technologies, and more particularly to an image classification method based on an image classification model, a training method and apparatus for the image classification model, an electronic device, and a storage medium.
Background
With the development of computer and network technologies, deep learning has been widely applied in many fields. For example, deep learning techniques may be employed to extract image features and predict the class of an image from those features. In practical scenarios, the predicted image class may be, for example, the age category of a person in the image, which offers potential value for application scenarios such as human-computer interaction.
Disclosure of Invention
The present disclosure provides an image classification method based on an image classification model, and a training method, apparatus, electronic device, and storage medium for the image classification model, which improve classification accuracy.
According to an aspect of the present disclosure, there is provided an image classification method based on an image classification model, wherein the image classification model includes a self-attention encoder, a global coding network, a first local coding network, and a prediction network; the image classification method comprises: dividing an image to be classified into a plurality of image blocks to obtain an image block sequence; performing self-attention encoding on the image block sequence with the self-attention encoder to obtain a first feature map sequence, the first feature map sequence comprising a plurality of feature maps corresponding respectively to the plurality of image blocks; extracting a global feature of the first feature map sequence with the global coding network to obtain a global feature map; extracting a first local feature of the first feature map sequence with the first local coding network to obtain a first local feature map; and inputting the global feature map and the first local feature map into the prediction network to obtain classification information of the image to be classified.
According to an aspect of the present disclosure, there is provided a training method for an image classification model, wherein the image classification model includes a self-attention encoder, a global coding network, and a first local coding network; the training method comprises: dividing a sample image into a plurality of image blocks to obtain an image block sequence; performing self-attention encoding on the image block sequence with the self-attention encoder to obtain a first feature map sequence, the first feature map sequence comprising a plurality of feature maps corresponding respectively to the plurality of image blocks; extracting a global feature of the first feature map sequence with the global coding network to obtain a global feature map; extracting a first local feature of the first feature map sequence with the first local coding network to obtain a first local feature map; determining first classification information according to the global feature map and second classification information according to the first local feature map; and training the image classification model according to a first difference between the first classification information and the second classification information.
According to an aspect of the present disclosure, there is provided an image classification apparatus based on an image classification model, wherein the image classification model includes a self-attention encoder, a global coding network, a first local coding network, and a prediction network; the image classification apparatus includes: an image segmentation module for dividing an image to be classified into a plurality of image blocks to obtain an image block sequence; a self-attention encoding module for performing self-attention encoding on the image block sequence with the self-attention encoder to obtain a first feature map sequence, the first feature map sequence comprising a plurality of feature maps corresponding respectively to the plurality of image blocks; a global encoding module for extracting a global feature of the first feature map sequence with the global coding network to obtain a global feature map; a local encoding module for extracting a first local feature of the first feature map sequence with the first local coding network to obtain a first local feature map; and a classification module for inputting the global feature map and the first local feature map into the prediction network to obtain classification information of the image to be classified.
According to an aspect of the present disclosure, there is provided a training apparatus for an image classification model, wherein the image classification model includes a self-attention encoder, a global coding network, and a first local coding network; the training apparatus comprises: an image segmentation module for dividing a sample image into a plurality of image blocks to obtain an image block sequence; a self-attention encoding module for performing self-attention encoding on the image block sequence with the self-attention encoder to obtain a first feature map sequence, the first feature map sequence comprising a plurality of feature maps corresponding respectively to the plurality of image blocks; a global encoding module for extracting a global feature of the first feature map sequence with the global coding network to obtain a global feature map; a local encoding module for extracting a first local feature of the first feature map sequence with the first local coding network to obtain a first local feature map; a classification module for determining first classification information according to the global feature map and second classification information according to the first local feature map; and a model training module for training the image classification model according to a first difference between the first classification information and the second classification information.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image classification model based image classification method and/or the training method of the image classification model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the image classification model-based image classification method and/or the training method of the image classification model provided by the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement the image classification model-based image classification method and/or the training method of the image classification model provided by the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of an image classification method based on an image classification model and a training method and device of the image classification model according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of an image classification method based on an image classification model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an image classification model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an image classification model according to another embodiment of the present disclosure;
FIG. 5 is a flow chart diagram of a method of training an image classification model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a method for training an image classification model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a method of training an image classification model according to another embodiment of the present disclosure;
FIG. 8 is a block diagram of an image classification device based on an image classification model according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of an image classification model training apparatus according to an embodiment of the present disclosure; and
FIG. 10 is a block diagram of an electronic device for implementing an image classification model-based image classification method and/or an image classification model training method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides an image classification method based on an image classification model, wherein the image classification model comprises a self-attention encoder, a global coding network, a first local coding network, and a prediction network. The method comprises an image segmentation stage, a self-attention encoding stage, a global coding stage, a local coding stage, and a classification stage. In the image segmentation stage, an image to be classified is divided into a plurality of image blocks to obtain an image block sequence. In the self-attention encoding stage, the self-attention encoder performs self-attention encoding on the image block sequence to obtain a first feature map sequence; the first feature map sequence includes a plurality of feature maps corresponding respectively to the plurality of image blocks. In the global coding stage, the global coding network extracts a global feature of the first feature map sequence to obtain a global feature map. In the local coding stage, the first local coding network extracts a first local feature of the first feature map sequence to obtain a first local feature map. In the classification stage, the global feature map and the first local feature map are input into the prediction network to obtain classification information of the image to be classified.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 is a schematic view of an application scenario of the image classification method based on an image classification model and the training method and device of the image classification model according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110. The electronic device 110 may be any of various electronic devices with processing functionality, including but not limited to smartphones, tablets, laptops, desktop computers, and servers.
The electronic device 110 may, for example, process the input image 120 to obtain classification information for the image 120. For example, in facial age estimation applications, the image 120 is an image containing a face; the classification information may include the probability that the face in the image belongs to each of a plurality of age groups, and the age class 130 of the face can be obtained from the classification information. The plurality of age groups may include, for example, 101 age groups from 0 to 100, each spanning one year.
According to embodiments of the present disclosure, facial age estimation is used in scenarios such as public facility management, user profiling and product recommendation, and information security management. For example, in a public facility management scenario, crowd flow in each area of a public place can be reasonably adjusted by determining crowd gathering and age distribution in that place; more attention may be given to older or younger groups by providing additional service tips and help. In a user profiling and product recommendation scenario, suitable push content, advertisements, and the like can be delivered to a target audience according to its age category, providing personalized services. In an information security management scenario, the determined age category can be used to remind target audiences and automatically take them offline after a specified time, reducing, for example, the risk of internet addiction among the target audience.
In one embodiment, the electronic device 110 may process the image 120 using the image classification model 140 to obtain classification information for the image 120. The image classification model 140 may, for example, combine a convolutional neural network, which extracts image features as the basis for determining a category, with a classifier that processes those features to produce the classification information. Alternatively, the image classification model 140 may be constructed based on a label distribution method, for example the Deep Label Distribution Learning (DLDL-v2) model, or based on an adaptive variance distribution learning method, or based on a Vision Transformer (ViT) network, which the present disclosure does not limit.
In an embodiment, as shown in fig. 1, a server 150 may be further included in the application scenario, and the electronic device 110 may be communicatively connected to the server 150 through a network. The network may be a wired or wireless communication link. For example, the electronic device 110 may send a model acquisition request to the server 150 over the network, and the server 150 may send the trained image classification model 140 to the electronic device 110 in response to the model acquisition request.
In an embodiment, the electronic device 110 may further send the image 120 to the server 150, and the server 150 may process the image 120 using the trained image classification model 140 to obtain the classification information.
It should be noted that the image classification method based on the image classification model provided in the present disclosure may be executed by the electronic device 110, and may also be executed by the server 150. Accordingly, the image classification apparatus based on the image classification model provided by the present disclosure may be disposed in the electronic device 110, and may also be disposed in the server 150. The training method of the image classification model provided by the present disclosure may be performed by the server 150. Accordingly, the training device of the image classification model provided by the present disclosure may be disposed in the server 150.
It should be understood that the number and type of electronic devices 110 and servers 150 in fig. 1 are merely illustrative. There may be any number and type of electronic devices 110 and servers 150, as desired for an implementation.
The image classification method based on the image classification model provided by the present disclosure will be described in detail below with reference to fig. 2 to 4.
Fig. 2 is a flowchart illustrating an image classification method based on an image classification model according to an embodiment of the present disclosure.
As shown in fig. 2, the image classification method 200 based on the image classification model according to this embodiment may include operations S210 to S250. The image classification model comprises a self-attention encoder, a global coding network, a first local coding network, and a prediction network. The self-attention encoder may be an encoder in a Transformer network.
In operation S210, the image to be classified is divided into a plurality of image blocks to obtain an image block sequence.
According to an embodiment of the present disclosure, the image to be classified may be divided into a plurality of image blocks, and the image blocks may then be flattened into a sequence. For example, let the size of the image to be classified be C × H × W, where C is the number of channels and H and W are the height and width of the image; if each image block has size 2 × 2, a total of (H/2) × (W/2) image blocks is obtained. Flattening these (H/2) × (W/2) image blocks into a one-dimensional sequence yields the image block sequence.
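As an illustration only, the patch splitting and flattening described above can be sketched with tensor operations in PyTorch. A minimal sketch, assuming a 2 × 2 patch size and an unfold-based layout (these choices are ours, not prescribed by the patent):

    import torch

    def patchify(image: torch.Tensor, patch: int = 2) -> torch.Tensor:
        # image: (C, H, W) -> (num_patches, C * patch * patch)
        c, h, w = image.shape
        # carve the image into non-overlapping patch x patch blocks
        blocks = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
        # flatten the 2-D grid of blocks into a 1-D sequence of patch vectors
        return blocks.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

    sequence = patchify(torch.randn(3, 224, 224))  # shape: (12544, 12)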
In operation S220, a self-attention encoder is used to perform self-attention encoding on the image block sequence to obtain a first feature map sequence.
According to an embodiment of the present disclosure, the first feature map sequence includes a plurality of feature maps corresponding respectively to the plurality of image blocks. The self-attention encoder may include multiple encoder layers; this embodiment may input the embedded representation of the image block sequence into the multi-layer encoder, with the first feature map sequence output by the last encoder layer. Each encoder layer may be one block of the encoding part of the Transformer structure. The number of encoder layers in the self-attention encoder may be set according to actual requirements; for example, if the image classification model is constructed based on the ViT network, the number may be any value greater than 1 and no greater than 12, such as 6, 8, or 9, which the present disclosure does not limit.
The embedded representation of each image block in the image block sequence may be taken as a token. Moreover, when processing the image to be classified, this embodiment need not input a learnable class token to the self-attention encoder; only the tokens obtained from the plurality of image blocks are input. It is to be understood that position information of each image block may also be added to its embedded representation; the position information may be obtained by sinusoidal encoding or learned during training, which the present disclosure does not limit.
It is understood that one block of the encoding part of the Transformer structure may include a self-attention layer, a first residual-and-normalization (Add & Norm) layer, a feed-forward layer, and a second residual-and-normalization layer, connected in sequence.
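A minimal sketch of one such block in PyTorch; the embedding dimension, head count, and MLP ratio below are illustrative ViT-style defaults, not values prescribed by the patent:

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        # One encoder block: self-attention -> Add & Norm -> feed-forward -> Add & Norm.
        def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
            )
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, N, dim) sequence of image block tokens
            x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm2(x + self.ffn(x))
            return x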
In operation S230, a global feature of the first feature map sequence is extracted by using a global coding network, so as to obtain a global feature map.
According to an embodiment of the present disclosure, the global coding network may be composed of the structures remaining in the encoding part of the ViT network besides the self-attention encoder. For example, if the encoding part of the ViT network comprises 12 sequentially connected blocks in total and the self-attention encoder comprises the first 9 of them, the global coding network may comprise the 3 blocks arranged after the first 9. This works because the self-attention-based Transformer network has global modeling capability and can capture long-range relationships. The feature maps in the feature map sequence output by those 3 blocks can then be rearranged to obtain the global feature map; the rearrangement may be based on the positions, within the image to be classified, of the image blocks to which the feature maps correspond.
In operation S240, a first local feature of the first feature map sequence is extracted using a first local coding network, resulting in a first local feature map.
According to an embodiment of the present disclosure, the first local coding network may, for example, attend to only a subset of the feature maps in the first feature map sequence and encode that subset, so as to focus only on local information of the image to be classified and extract the first local feature map. Alternatively, the first local coding network may process each feature map in the first feature map sequence with a network structure having a fixed, limited receptive field, such as a convolutional neural network, and rearrange the processed feature maps to obtain the first local feature map; the rearrangement may be based on the positions, within the image to be classified, of the image blocks to which the processed feature maps correspond.
In operation S250, the global feature map and the first local feature map are input to the prediction network, so as to obtain classification information of the image to be classified.
According to embodiments of the present disclosure, the prediction network may include a fusion layer and a Softmax classifier. The fusion layer fuses the global feature map and the first local feature map into a fused feature map; the Softmax classifier processes the fused feature map and outputs the classification information. The classification information may include probability distribution information (specifically, a probability vector) of the image to be classified over a plurality of predetermined classes, containing the predicted probability that the image belongs to each predetermined class.
According to an embodiment of the present disclosure, the prediction network may further include a fully connected layer disposed between the fusion layer and the Softmax classifier; after being processed by the fully connected layer, the fused feature map is projected into the class space. The features output by the fully connected layer are input into the Softmax classifier, which outputs the classification information.
The fusion layer may fuse the two feature maps by adding the first local feature map and the global feature map, or by computing a weighted sum of the two feature maps, where the weights are learned during training of the image classification model; the present disclosure does not limit this.
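A minimal sketch of such a prediction network; fusion by element-wise addition is one of the two options named above, while the spatial average pooling before the fully connected layer and the 101-class output are our assumptions, since the patent does not spell out how the fused map is flattened:

    import torch
    import torch.nn as nn

    class PredictionNetwork(nn.Module):
        # Fuses the global and local feature maps, projects to class space, applies Softmax.
        def __init__(self, dim: int, num_classes: int = 101):
            super().__init__()
            self.fc = nn.Linear(dim, num_classes)

        def forward(self, global_map: torch.Tensor, local_map: torch.Tensor) -> torch.Tensor:
            fused = global_map + local_map            # fusion by element-wise addition
            pooled = fused.flatten(2).mean(dim=2)     # (B, dim): average over spatial positions
            return self.fc(pooled).softmax(dim=-1)    # probability vector over the classes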
Compared with schemes that introduce a class token into the self-attention-based image classification model and use the output corresponding to that token as the classification information, predicting the classification information from the extracted feature maps makes full use of the information in the feature maps and improves classification accuracy. Moreover, global features are extracted by the global coding network, local features are extracted by the local coding branch, and both are considered jointly when predicting the classification information, so that the rich local detail in the feature maps is fully exploited, further improving classification precision.
It will be appreciated that in age prediction applications, the image to be classified may be an image containing a human face, and the plurality of predetermined categories include the aforementioned age groups. Classification is often influenced by facial skin texture, skin tone, wrinkles, and the like: attending only to global information may miss key facial details, while attending only to local information loses the overall structural information. By jointly considering global and local features when predicting age, embodiments of the present disclosure can effectively improve prediction accuracy.
Fig. 3 is a schematic structural diagram of an image classification model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the aforementioned global coding network may include, for example, a first coding subnetwork and a first conversion subnetwork. The first coding subnetwork extracts features based on the first feature map sequence, and the first conversion subnetwork converts the extracted feature sequence into a global feature map. This is needed because when the input of a coding subnetwork is a feature map sequence, its output is also a feature map sequence; to facilitate fusion with the first local feature map and prediction of classification information from the fusion result, the feature map sequence must be converted into an overall feature map. In an embodiment, the first conversion subnetwork may further adjust the size of the feature maps in the sequence so that the resulting global feature map is equal in size to the first local feature map, facilitating fusion of the two.
Illustratively, the first coding subnetwork may comprise the 3 blocks arranged after the first 9 blocks described previously. The data obtained by processing with the first coding subnetwork may be a second feature map sequence based on the first feature map sequence: this embodiment may input the first feature map sequence into the first coding subnetwork, which outputs the second feature map sequence. The first conversion subnetwork may be configured to convert the second feature map sequence into a feature map matrix, which serves as the global feature map. The conversion of the second feature map sequence into the feature map matrix may be the reverse of the process, described above, of flattening the image blocks of the image to be classified into the image block sequence. For example, the second feature map sequence comprises a plurality of feature maps for the plurality of image blocks; this embodiment may place the feature map for an image block at that image block's position in the image to be classified, so that the feature maps together constitute an overall global feature map.
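A minimal sketch of such a sequence-to-matrix conversion, assuming the feature maps are token vectors laid out in image-block order on a known grid (grid_h and grid_w are illustrative names, not terms from the patent):

    import torch

    def sequence_to_map(seq: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        # seq: (B, N, C) with N == grid_h * grid_w feature maps in image-block order;
        # place each feature map at its image block's position, forming a (B, C, grid_h, grid_w) map
        b, n, c = seq.shape
        assert n == grid_h * grid_w
        return seq.transpose(1, 2).reshape(b, c, grid_h, grid_w)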
As shown in fig. 3, in embodiment 300 the self-attention encoder 310 is constructed from 9 blocks of the encoding part of a Transformer network. After the image block sequence 301 is processed by the self-attention encoder 310, the first feature map sequence 302 is obtained. The global coding network 320 may comprise a first fusion subnetwork 321 in addition to the first coding subnetwork 322 and the first conversion subnetwork 323. The first fusion subnetwork 321 globally fuses the feature maps in the first feature map sequence 302 to obtain a third feature map sequence 303. The third feature map sequence 303 is processed by the first coding subnetwork 322 to obtain the second feature map sequence, which is converted by the first conversion subnetwork 323 into the global feature map 304. Through the first fusion subnetwork 321, information fusion between the feature maps of the image blocks is realized and the correlation among image blocks is taken into account to a certain extent, which helps improve the expressive power of the resulting global feature map and the final classification accuracy.
Illustratively, the first coding subnetwork 322 may employ the 3 blocks arranged after the first 9 blocks described previously.
Illustratively, the first fusion subnetwork 321 may, for example, randomly divide the feature maps in the first feature map sequence 302 into several groups, each group including at least two feature maps. A weighted sum is then computed over the at least two feature maps in each group to obtain a weighted feature map, which serves as a feature map in the third feature map sequence 303. In this way, each feature map in the third feature map sequence is obtained by fusing the feature maps of at least two image blocks and can reflect, to a certain extent, the global features of the image to be classified.
Illustratively, the first fusion subnetwork 321 may also derive the third feature map sequence in a manner similar to an inverse pixel-shuffle operation. In this way, the feature maps in the first feature map sequence can be fully fused over the global range, improving the expressive power of the obtained global features and thus the classification precision.
For example, the first fusion subnetwork 321 may include a first conversion layer, a first down-sampling layer, and a second conversion layer. The first conversion layer converts the first feature map sequence 302 into a first feature map matrix 321_1: it arranges the feature maps in the first feature map sequence according to the positions, in the image to be classified, of the image blocks to which they correspond. For example, if the number of image blocks is 6 × 6, the resulting first feature map sequence 302 includes 36 feature maps, which form a 6 × 6 first feature map matrix 321_1.

The first down-sampling layer samples the first feature map matrix 321_1 at intervals to obtain a plurality of fusion sub-matrices. The sampling interval may be any integer greater than 1, such as 2 or 3. Taking an interval of 2 and a 6 × 6 first feature map matrix 321_1 as an example, one sampling pass takes the first, third, and fifth feature maps of the odd-numbered rows of the matrix, and the sampled feature maps form one fusion sub-matrix. Similarly, the second, fourth, and sixth feature maps of the odd-numbered rows can be sampled to form another fusion sub-matrix. Sampling several times in this way, four non-overlapping fusion sub-matrices 321_2 are obtained.

The second conversion layer converts the plurality of fusion sub-matrices into a plurality of fused feature maps, yielding a third feature map sequence composed of those fused feature maps. Specifically, the second conversion layer may first stack the four fusion sub-matrices 321_2 along a direction perpendicular to width and height, obtaining a tensor 321_3 of size 3 × 3 × 4. The four feature maps at the same height-and-width position in the tensor 321_3 are then spliced into one spliced feature map, giving 3 × 3 = 9 spliced feature maps 321_4 in total. This embodiment may directly assemble the 9 spliced feature maps into the third feature map sequence 303. Alternatively, the 9 spliced feature maps 321_4 may be passed through a fully connected layer that adjusts the channel dimension of each spliced feature map, with the 9 outputs of the fully connected layer constituting the third feature map sequence 303; for example, the channel dimension of these 9 feature maps may equal the channel dimension of the feature maps in the first feature map sequence, which the present disclosure does not limit.
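A minimal sketch of the interval sampling and channel-wise splicing described above, assuming the feature map matrix is held as a (B, C, H, W) tensor; the stride of 2 mirrors the example:

    import torch

    def interval_sample_and_splice(x: torch.Tensor, stride: int = 2) -> torch.Tensor:
        # x: (B, C, H, W) first feature map matrix.
        # Interval-sample into stride*stride non-overlapping sub-matrices, then splice
        # them along the channel axis: (B, C * stride * stride, H/stride, W/stride).
        subs = [x[:, :, i::stride, j::stride] for i in range(stride) for j in range(stride)]
        return torch.cat(subs, dim=1)

For a 6 × 6 grid and stride 2 this yields four 3 × 3 sub-matrices spliced into one tensor, matching the 3 × 3 × 4 arrangement above; torch.nn.PixelUnshuffle(2) performs an equivalent space-to-depth rearrangement.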
It is to be understood that the 6 × 6 number of image blocks is only an example to facilitate understanding of the present disclosure; the number of image blocks may also be, for example, 14 × 14, and the number of blocks obtained by splitting along the width may differ from the number obtained along the height, which the present disclosure does not limit. Likewise, the sampling interval above is only an example. By providing the first down-sampling layer, this embodiment replaces the original feature maps with fewer feature maps for subsequent computation, reducing the computation and GPU memory footprint of the image classification model and improving classification efficiency to a certain extent while maintaining classification precision.
In an embodiment, similar to the global coding network, as shown in fig. 3, the first local coding network 330 in embodiment 300 may include a second fusion subnetwork 331, a second coding subnetwork 332, and a second conversion subnetwork 333. Like the first coding subnetwork 322, the second coding subnetwork 332 may employ the 3 blocks arranged after the first 9 blocks described above.
The second fusion subnetwork 331 differs from the first fusion subnetwork 321 in that it fuses only part of the feature maps in the first feature map sequence 302, focusing on a local area of the image to be classified so as to extract the first local feature map.
Specifically, in this embodiment the second fusion subnetwork 331 may fuse the feature maps of a first portion of the first feature map sequence 302 to obtain a fourth feature map sequence 305. For example, the number of image blocks may be 14 × 14, and the size of the first portion may be any size such as 12 × 12 or 10 × 10. The first portion may comprise feature maps that are contiguous in the first feature map sequence 302 or non-contiguous ones, which the present disclosure does not limit. The fusion of the first portion may be implemented with a pooling operation, such as average pooling, adaptive average pooling, or max pooling, which the present disclosure likewise does not limit. The pooling operation, while attending to the local features of the image to be classified, again replaces the original feature maps with fewer feature maps for subsequent computation, reducing the computation and GPU memory footprint of the image classification model and improving classification efficiency while maintaining precision. After the fourth feature map sequence 305 is obtained, it can be input into the second coding subnetwork 332, which outputs a fifth feature map sequence. Finally, the second conversion subnetwork 333 converts the fifth feature map sequence into a feature map matrix, yielding the first local feature map 306. The second conversion subnetwork 333 is similar in structure and function to the first conversion subnetwork 323 and is not described again here.
In an embodiment, the selected first portion may be the feature maps of the image blocks corresponding to the central area of the image to be classified, improving the richness of the content expressed by the resulting local feature map, since the central region of an image tends to contain richer information. Specifically, in this embodiment the second fusion subnetwork may include a third conversion layer, a first pooling layer, and a fourth conversion layer. The third conversion layer, like the first conversion layer, converts the first feature map sequence into the first feature map matrix and is not described again here. The first pooling layer may crop the feature maps of a first predetermined area from the first feature map matrix and pool the cropped feature maps, obtaining a sixth feature map matrix; the first predetermined area may be centered on the center point of the first feature map matrix. The fourth conversion layer converts the sixth feature map matrix into the fourth feature map sequence 305; its operation is the reverse of the third conversion layer's. For example, if the sixth feature map matrix has size 3 × 3, expanding the matrix gives 3 × 3 = 9 feature maps, which constitute the fourth feature map sequence 305. Note that the number of feature maps in the fourth feature map sequence 305 should equal the number in the third feature map sequence 303, so that the first local feature map and the global feature map have the same size.
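A minimal sketch of the central cropping and pooling performed by the first pooling layer, under the illustrative sizes above (a 12 × 12 central crop pooled down to 3 × 3; these numbers are the example values, not fixed by the patent):

    import torch
    import torch.nn.functional as F

    def central_crop_and_pool(x: torch.Tensor, crop: int = 12, out: int = 3) -> torch.Tensor:
        # x: (B, C, H, W) first feature map matrix; crop the central crop x crop area
        b, c, h, w = x.shape
        top, left = (h - crop) // 2, (w - crop) // 2
        center = x[:, :, top:top + crop, left:left + crop]
        # adaptive average pooling down to out x out, so the local branch's output
        # matches the global branch's size (here 3 x 3 = 9 feature maps)
        return F.adaptive_avg_pool2d(center, out)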
After obtaining the global feature map 304 and the first local feature map 306, the global feature map 304 and the first local feature map 306 may be input to the prediction network 340, and the classification information 307 may be output by the prediction network 340, according to an embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of an image classification model according to another embodiment of the present disclosure.
In an embodiment, as shown in fig. 4, the image classification model may include a second local coding network 450 in addition to the self-attention encoder 410, the global coding network 420, the first local coding network 430, and the prediction network 440. The second local coding network 450 functions similarly to the first local coding network 430: both extract local feature maps, except that the local feature maps extracted by the two networks may have receptive fields of different sizes.
Specifically, this embodiment may employ the second local coding network 450 to extract a second local feature of the first feature map sequence 402, obtaining a second local feature map 409. The second local feature map 409, the global feature map 404, and the first local feature map 406 may then all be input into the prediction network 440, which processes them to obtain the classification information 407 of the image to be classified. In this way, the classification information is obtained by jointly considering the global features and the local features of two different receptive fields, which can improve the accuracy of the classification information to a certain extent, that is, the classification precision of the image classification model. For example, the second local feature map may focus on local information of a smaller size than the first local feature map; that is, the second local feature covers a smaller area, and the receptive field of the second local feature map is correspondingly smaller than that of the first local feature map.
In an embodiment, similar to the first local coding network 430, the second local coding network 450 may include a third fusion subnetwork 451, a third coding subnetwork 452, and a third conversion subnetwork 453, where the third coding subnetwork 452 is similar to the second coding subnetwork 432 and the third conversion subnetwork 453 is similar to the second conversion subnetwork 433. Specifically, the third fusion subnetwork 451 may fuse the feature maps of a second portion of the first feature map sequence to obtain a seventh feature map sequence 408. The third fusion subnetwork 451 operates similarly to the second fusion subnetwork 431, except that the number of feature maps in the second portion fused by the third fusion subnetwork 451 is smaller than the number in the first portion fused by the second fusion subnetwork 431. The numbers of feature maps in the seventh feature map sequence 408, the fourth feature map sequence 405, and the third feature map sequence 403 may all be equal. The seventh feature map sequence 408 is input into the third coding subnetwork 452, which outputs a sixth feature map sequence. Finally, the third conversion subnetwork converts the sixth feature map sequence into a feature map matrix, obtaining the second local feature map 409.
In an embodiment, the third fusion subnetwork may include a fifth conversion layer, a second pooling layer, and a sixth conversion layer, where the fifth conversion layer is the same as the third conversion layer described previously and the sixth conversion layer is similar to the fourth conversion layer. The second pooling layer differs from the first pooling layer only in the size of the predetermined area from which feature maps are cropped. For example, this embodiment may employ the fifth conversion layer to convert the first feature map sequence 402 into the first feature map matrix, then employ the second pooling layer to pool the feature maps cropped from a second predetermined area of the first feature map matrix, obtaining an eighth feature map matrix. If the size of the first predetermined area is 12 × 12, the size of the second predetermined area may be, for example, 8 × 8, and the second predetermined area may also be centered on the center point of the first feature map matrix. The sixth conversion layer then converts the eighth feature map matrix into the seventh feature map sequence.
In order to facilitate the implementation of the image classification method based on the image classification model provided by the present disclosure, the present disclosure also provides a training method of the image classification model, which will be described in detail below with reference to fig. 5 to 7.
Fig. 5 is a flowchart illustrating a training method of an image classification model according to an embodiment of the present disclosure.
As shown in fig. 5, the training method 500 of this embodiment may include operations S510 to S560. The image classification model may be the image classification model used in the image classification method 200 described above; for example, it includes at least a self-attention encoder, a global coding network, and a first local coding network.
In operation S510, the sample image is divided into a plurality of image blocks, resulting in an image block sequence.
In operation S520, a self-attention encoder is used to perform self-attention encoding on the image block sequence to obtain a first feature map sequence. The first feature map sequence comprises a plurality of feature maps respectively aiming at the plurality of image blocks.
In operation S530, a global feature of the first feature map sequence is extracted by using a global coding network, so as to obtain a global feature map.
In operation S540, a first local feature of the first feature map sequence is extracted using a first local coding network, resulting in a first local feature map.
Operations S510 to S540 are similar to operations S210 to S240 described above, and are not described again.
In operation S550, first classification information is determined according to the global feature map, and second classification information is determined according to the first local feature map.
According to an embodiment of the present disclosure, the global feature map and the first local feature map may each be processed by a prediction model composed of a fully connected layer and a Softmax classifier, obtaining the first classification information and the second classification information, respectively. The first and second classification information are similar to the classification information of the image to be classified described above and are not described again here.
In operation S560, the image classification model is trained according to a first difference between the first classification information and the second classification information.
According to an embodiment of the present disclosure, the first difference may be represented by the Euclidean distance, Manhattan distance, or cosine distance between the first classification information and the second classification information, or by the relative entropy (i.e., Kullback-Leibler divergence, KL divergence) between them, which the present disclosure does not limit. This embodiment may employ a back-propagation algorithm to train the image classification model by minimizing the first difference.
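As an illustration, a symmetric KL-divergence difference between two branch outputs can be sketched as follows (a minimal sketch assuming the classification information is given as probability vectors; the function name is ours, not the patent's):

    import torch
    import torch.nn.functional as F

    def symmetric_kl(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
        # p1, p2: (N, M) probability distributions over M predetermined categories.
        # F.kl_div(input, target) computes KL(target || input), with input in log space.
        kl_12 = F.kl_div(p2.log(), p1, reduction="batchmean")  # D_KL(p1 || p2)
        kl_21 = F.kl_div(p1.log(), p2, reduction="batchmean")  # D_KL(p2 || p1)
        return kl_12 + kl_21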
This embodiment essentially trains the image classification model in a mutual-learning manner, so that the features output by the global network and the first local network in the image classification model tend to be consistent, drawing the global feature and the first local feature closer together. In this way, unsupervised training of the image classification model can be realized, improving the robustness of the model.
FIG. 6 is a schematic diagram illustrating a training method of an image classification model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, similar to embodiment 400 described above, in this embodiment 600 the image classification model may include a second local coding network in addition to the self-attention encoder, the global coding network, and the first local coding network.
Accordingly, in this embodiment 600, in addition to obtaining the global feature map 604 in operation S530 and the first local feature map 606 in operation S540, the second local feature of the first feature map sequence may be extracted with the second local coding network to obtain a second local feature map 609. The second local feature map 609 is obtained on the same principle as the second local feature map 409 in embodiment 400 described above and is not described again here.
After the global feature map 604, the first local feature map 606, and the second local feature map 609 are obtained, this embodiment may input each of the three feature maps into a prediction model 660 comprising a fully connected layer and a Softmax classifier, obtaining first classification information 671, second classification information 672, and third classification information 673, respectively. With the three pieces of classification information obtained, the image classification model can be trained according to the pairwise differences among them. For example, the KL divergence between the first classification information 671 and the second classification information 672 may be taken as the first difference 681, the KL divergence between the first classification information 671 and the third classification information 673 as the second difference 682, and the KL divergence between the second classification information 672 and the third classification information 673 as the third difference 683.
Illustratively, the KL divergence between the first classification information 671 and the second classification information 672 may include, for example, a KL divergence of the first classification information 671 with respect to the second classification information 672, and a KL divergence of the second classification information 672 with respect to the first classification information 671. For example, the first classification information is set to p 1 The second classification information is p 2 Then the first difference D 1 681 can be calculated using the following equation (1):
D 1 =D KL (p 1 ||p 2 )+D KL (p 2 ||p 1 ). Formula (1)
Suppose the sample images include $N$ images in total; one piece of first classification information, one piece of second classification information, and one piece of third classification information can then be obtained for each of the $N$ images. Let the number of predetermined categories be $M$. The first classification information $p_1$ includes the probability $p_1^m(x_i)$ of the $i$-th sample $x_i$ for the $m$-th of the $M$ predetermined categories, and the second classification information $p_2$ likewise includes the probability $p_2^m(x_i)$ of the $i$-th sample $x_i$ for the $m$-th predetermined category. $D_{KL}(p_2 \,\|\, p_1)$ in equation (1) can then, for example, be calculated by the following equation (2):

$$D_{KL}(p_2 \,\|\, p_1) = \frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} p_2^m(x_i) \log \frac{p_2^m(x_i)}{p_1^m(x_i)} \tag{2}$$
Similarly, let the third classification information be $p_3$. The second difference 682 can then be obtained by adding $D_{KL}(p_1 \,\|\, p_3)$ and $D_{KL}(p_3 \,\|\, p_1)$, and the third difference 683 by adding $D_{KL}(p_2 \,\|\, p_3)$ and $D_{KL}(p_3 \,\|\, p_2)$.
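The three pairwise differences might be computed as in the sketch below, which implements equations (1) and (2) under the assumption that each piece of classification information is an (N, M) matrix of per-sample probability distributions; the epsilon clamp is an added numerical-stability detail.

```python
import torch

def symmetric_kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """D_KL(p||q) + D_KL(q||p), equation (1), with each KL term averaged
    over the N samples as in equation (2). p, q: (N, M) distributions."""
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    kl_pq = (p * (p / q).log()).sum(dim=1).mean()
    kl_qp = (q * (q / p).log()).sum(dim=1).mean()
    return kl_pq + kl_qp

# Stand-in classification information for N=4 samples and M=101 categories.
p1 = torch.softmax(torch.randn(4, 101), dim=1)
p2 = torch.softmax(torch.randn(4, 101), dim=1)
p3 = torch.softmax(torch.randn(4, 101), dim=1)

d1 = symmetric_kl(p1, p2)  # first difference 681
d2 = symmetric_kl(p1, p3)  # second difference 682
d3 = symmetric_kl(p2, p3)  # third difference 683
```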
By the method of this embodiment, the distances among the outputs of the global coding network, the first local coding network, and the second local coding network can be shortened, so that the three networks learn from one another in pairwise combination and their precision improves synchronously.
Fig. 7 is a schematic diagram illustrating a method for training an image classification model according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, the obtained first classification information and second classification information may be compared with the true value category information of the sample image, and the image classification model trained so that the first classification information and the second classification information become closer to the true value category information. In this way, supervised training of the image classification model can be realized, which helps improve the precision of the image classification model. Accordingly, the sample image should include true value category information. The true value category information may include the true value category of the sample image, the true value category belonging to the plurality of predetermined categories described above.
In an embodiment, when the image classification model further includes the second local coding network, the embodiment may also compare the third classification information with the true value category information of the sample image, and train the image classification model so that the third classification information becomes closer to the true value category information.
For example, the embodiment may determine, for any one of the first classification information, the second classification information, and the third classification information, a fourth difference between that classification information and the true value category information, and then train the image classification model according to the three fourth differences obtained for the three pieces of classification information. The fourth difference may be calculated using cross entropy or the like. Alternatively, the embodiment may determine the probability value for the true value category in the classification information and determine the fourth difference according to the difference between that probability value and 1.
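Both options for the fourth difference can be sketched as follows; the epsilon clamp and the specific tensors are illustrative, not prescribed by this disclosure.

```python
import torch

def fourth_difference_ce(probs: torch.Tensor, true_idx: torch.Tensor) -> torch.Tensor:
    """Cross-entropy variant: -log of the probability of the true category."""
    p_true = probs[torch.arange(probs.size(0)), true_idx].clamp_min(1e-8)
    return -p_true.log().mean()

def fourth_difference_gap(probs: torch.Tensor, true_idx: torch.Tensor) -> torch.Tensor:
    """Alternative variant: difference between the true-category probability and 1."""
    p_true = probs[torch.arange(probs.size(0)), true_idx]
    return (1.0 - p_true).mean()

probs = torch.softmax(torch.randn(4, 101), dim=1)  # any classification information
true_idx = torch.tensor([23, 35, 61, 18])          # true value categories
loss_ce = fourth_difference_ce(probs, true_idx)
```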
In an embodiment, as shown in fig. 7, when determining the fourth difference between any classification information 701 and the true value category information 702, the embodiment 700 may first determine, according to the true value category indicated by the true value category information 702, the predetermined distribution information 703 of the sample image for the plurality of predetermined categories. The predetermined distribution information may, for example, satisfy a normal distribution. Let the predetermined distribution information, which includes a probability value of the sample image for each of the $M$ predetermined categories, be $p_t$. Each predetermined category may have a corresponding numerical value; let the value corresponding to the $i$-th category be $k_i$. The probability value $p_t(k_i)$ of the sample image for the $i$-th category in the predetermined distribution information can then be calculated by the following equation (3):

$$p_t(k_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(k_i - \mu)^2}{2\sigma^2}\right) \tag{3}$$
where $\mu$ is the numerical value corresponding to the true value category of the sample image, and $\sigma$ takes the value 1. It is understood that, when the plurality of predetermined categories are a plurality of age groups, the numerical value corresponding to each predetermined category may be an age value, and $k_i$ takes integer values in the range $[0, 100]$.
In this way, for each predetermined category, a probability value can be obtained, and a plurality of probability values can constitute the predetermined distribution information 703.
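A sketch of equation (3) evaluated over M categories follows; the final renormalization, which makes the discrete probabilities sum to 1, and the example age value are added assumptions.

```python
import math
import torch

def gaussian_label_distribution(mu: float, M: int = 101, sigma: float = 1.0) -> torch.Tensor:
    """Predetermined distribution information 703 via equation (3), evaluated
    at the value k_i of each of the M categories; mu is the value of the
    true category and sigma = 1."""
    k = torch.arange(M, dtype=torch.float32)  # k_i = 0, 1, ..., M-1 (e.g. age values)
    p = torch.exp(-(k - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return p / p.sum()  # assumed renormalization over the discrete categories

pt = gaussian_label_distribution(mu=35.0)  # e.g. a true age of 35
```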
After obtaining the predetermined distribution information 703, the embodiment may determine the fourth difference according to the distribution difference 704 between the predetermined distribution information 703 and the classification information 701. For example, the fourth difference 705 may be the distribution difference 704 itself. The distribution difference 704 is similar to the first difference, the second difference, and the third difference described above; for example, the KL divergence between the predetermined distribution information 703 and the classification information 701 may be used to represent it, which is not limited in this disclosure.
In this embodiment, the difference between the predicted classification information and the true value category information is represented by a distribution difference, which can improve the accuracy of the determined difference and thereby the precision of the image classification model.
In an embodiment, for any classification information 701, a weighted value of the sample image may also be determined by combining the numerical values corresponding to the plurality of predetermined categories. The difference between the numerical value corresponding to the true value category and this weighted value is then used to represent the fourth difference between the classification information 701 and the true value category information 702.
Specifically, in the probability distribution information of the sample image for the plurality of predetermined categories included in the classification information 701, the probability value for each predetermined category is used as the weight of the numerical value corresponding to that category, and a weighted sum is computed over the numerical values 706 of the plurality of predetermined categories to obtain the weighted value 707. Then, a value difference 709 between the weighted value 707 and the numerical value 708 corresponding to the true value category is determined, and the fourth difference 705 is determined according to the value difference 709. The value difference 709 may be represented by the Euclidean distance or the Manhattan distance between the two values, which is not limited in this disclosure.
It will be appreciated that the weighted value 707 is essentially a predicted value obtained using a deep-expectation approach. Determining the fourth difference 705 by the method of this embodiment and training the image classification model accordingly can further improve the precision of the image classification model and reduce the inconsistency between its training stage and prediction stage.
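The weighted value 707 and the value difference 709 might be computed as below, here using the Manhattan (L1) distance; the age values 0..100 and the stand-in distributions are illustrative assumptions.

```python
import torch

def weighted_value(probs: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Weighted value 707: probability-weighted sum of the category values,
    i.e. a deep-expectation style predicted value."""
    return (probs * values).sum(dim=1)

values = torch.arange(101, dtype=torch.float32)     # numerical values 706 (ages 0..100)
probs = torch.softmax(torch.randn(4, 101), dim=1)   # stand-in classification info 701
true_vals = torch.tensor([23.0, 35.0, 61.0, 18.0])  # numerical values 708

pred = weighted_value(probs, values)
value_diff = (pred - true_vals).abs().mean()  # value difference 709, L1 style
```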
In an embodiment, when determining the fourth difference 705, both the distribution difference 704 and the value difference 709 may be considered; for example, a weighted sum of the values of the two differences may be used as the value of the fourth difference 705.
It is to be understood that one fourth difference may be obtained for each of the first classification information, the second classification information, and the third classification information. The embodiment may use the sum of the obtained three fourth differences as a prediction loss of the image classification model, and train the image classification model according to the prediction loss.
In an embodiment, the image classification model may further include a prediction network in addition to the self-attention encoder, the global coding network, and the first local coding network; the input of the prediction network includes the global feature map and the first local feature map obtained from the sample image. The prediction network may fuse the global feature map and the first local feature map and then perform prediction to obtain fourth classification information. The embodiment may also train the image classification model according to a fifth difference between the fourth classification information and the true value category information. The fifth difference is similar to the fourth difference described above and is not described again here. By this method, the precision of the prediction network can be improved to a certain extent, and thus the accuracy of the category predicted by the image classification model.
It will be appreciated that when the image classification model further comprises a second local encoding network, the input to the prediction network may comprise a second local feature map in addition to the global feature map and the first local feature map. The fourth classification information is predicted after the global feature map, the first local feature map and the second local feature map are fused.
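A sketch of such a prediction network follows; concatenation as the fusion step, a single fully-connected layer, and the dimensions are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionPredictionNetwork(nn.Module):
    """Sketch of the prediction network: the input feature maps are fused
    by concatenation and a fully-connected layer predicts the fourth
    classification information."""

    def __init__(self, feat_dim: int = 768, num_maps: int = 3, num_classes: int = 101):
        super().__init__()
        self.fc = nn.Linear(feat_dim * num_maps, num_classes)

    def forward(self, feats) -> torch.Tensor:
        fused = torch.cat([f.flatten(start_dim=1) for f in feats], dim=1)
        return torch.softmax(self.fc(fused), dim=-1)

net = FusionPredictionNetwork()
maps = [torch.randn(4, 768) for _ in range(3)]  # global + two local feature maps
p4 = net(maps)  # fourth classification information
```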
In an embodiment, when the image classification model is trained, the aforementioned first difference, second difference, third difference, three fourth differences, and fifth difference may be considered comprehensively, and a weighted sum of the values of these differences used as the loss value of the image classification model. The loss value is minimized by adjusting the network parameters in the image classification model, thereby realizing the training of the image classification model.
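The overall loss might then be assembled as in the sketch below; the weights are hypothetical placeholders, since this disclosure does not fix their values.

```python
def total_loss(d1, d2, d3, fourth_diffs, fifth_diff,
               weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of all the differences as the model's loss value;
    the weights here are hypothetical placeholders."""
    w1, w2, w3, w4, w5 = weights
    return (w1 * d1 + w2 * d2 + w3 * d3
            + w4 * sum(fourth_diffs) + w5 * fifth_diff)
```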
When applied to tasks such as face age recognition and facial expression recognition, the image classification model trained according to the embodiments of the present disclosure remarkably improves the task completion effect. It can therefore better support the development of services in scenarios such as public facility management and control, product recommendation, and information security management described above.
Based on the image classification model-based image classification method provided by the disclosure, the disclosure also provides an image classification device based on the image classification model. The apparatus will be described in detail below with reference to fig. 8.
Fig. 8 is a block diagram of an image classification apparatus based on an image classification model according to an embodiment of the present disclosure.
As shown in fig. 8, the image classification apparatus 800 based on the image classification model of this embodiment may include an image segmentation module 810, a self-attention coding module 820, a global coding module 830, a first local coding module 840, and a classification module 850. The image classification model comprises a self-attention encoder, a global coding network, a first local coding network and a prediction network.
The image segmentation module 810 is configured to segment an image to be classified into a plurality of image blocks to obtain an image block sequence. In an embodiment, the image segmentation module 810 may be configured to perform the operation S210 described above, which is not described herein again.
The self-attention coding module 820 is configured to perform self-attention coding on the image block sequence by using a self-attention coder to obtain a first feature map sequence. The first feature map sequence comprises a plurality of feature maps respectively aiming at a plurality of image blocks. In an embodiment, the self-attention encoding module 820 may be configured to perform the operation S220 described above, which is not described herein again.
The global encoding module 830 is configured to extract global features of the first feature map sequence by using a global encoding network, so as to obtain a global feature map. In an embodiment, the global encoding module 830 may be configured to perform the operation S230 described above, which is not described herein again.
The first local encoding module 840 is configured to extract a first local feature of the first feature map sequence by using a first local encoding network, so as to obtain a first local feature map. In an embodiment, the first local encoding module 840 may be configured to perform the operation S240 described above, which is not described herein again.
The classification module 850 is configured to input the global feature map and the first local feature map into the prediction network, so as to obtain classification information of the image to be classified. In an embodiment, the classification module 850 may be configured to perform the operation S250 described above, which is not described herein again.
According to an embodiment of the present disclosure, the global coding network includes a first coding sub-network and a first conversion sub-network. The global encoding module 830 may include a first coding submodule and a first conversion submodule. The first coding submodule is used for obtaining a second feature map sequence by adopting the first coding sub-network based on the first feature map sequence. The first conversion submodule is used for converting the second feature map sequence into a feature map matrix by adopting the first conversion sub-network, so as to obtain the global feature map.
According to an embodiment of the present disclosure, the global coding network further includes a first fusion sub-network. The first coding submodule may include a global fusion unit and a first coding unit. The global fusion unit is used for performing global fusion on the feature maps in the first feature map sequence by adopting the first fusion sub-network, so as to obtain a third feature map sequence. The first coding unit is used for inputting the third feature map sequence into the first coding sub-network to obtain the second feature map sequence.
According to an embodiment of the present disclosure, the first fusion sub-network includes a first conversion layer, a first down-sampling layer, and a second conversion layer. The global fusion unit includes a first conversion subunit, a sampling subunit, and a second conversion subunit. The first conversion subunit is used for converting the first feature map sequence into a first feature map matrix by adopting the first conversion layer. The sampling subunit is used for sampling the feature maps in the first feature map matrix at intervals by adopting the first down-sampling layer, so as to obtain a plurality of fusion sub-matrices. The second conversion subunit is used for converting the plurality of fusion sub-matrices into a plurality of fusion feature maps by adopting the second conversion layer, so as to obtain a third feature map sequence composed of the plurality of fusion feature maps.
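One possible reading of the first fusion sub-network is sketched below: the sequence is arranged into a grid, sampled at an interval of 2 along both grid axes, and each resulting sub-matrix is merged into one fusion feature map. The grid size, the sampling interval, and channel stacking as the merge step are all assumptions.

```python
import torch

def global_fusion(seq: torch.Tensor) -> torch.Tensor:
    """Sketch of the first fusion sub-network. seq: (L, C, h, w), a sequence
    of L feature maps assumed to form a G x G grid."""
    L, C, h, w = seq.shape
    G = int(L ** 0.5)                  # assume L is a perfect square
    grid = seq.reshape(G, G, C, h, w)  # first conversion layer
    # First down-sampling layer: sample the grid at an interval of 2 along
    # both axes, producing four fusion sub-matrices.
    subs = [grid[i::2, j::2] for i in (0, 1) for j in (0, 1)]
    # Second conversion layer: merge each sub-matrix into one fusion feature
    # map by stacking its maps along the channel axis.
    return torch.stack([s.reshape(-1, h, w) for s in subs])

third_seq = global_fusion(torch.randn(16, 8, 4, 4))  # shape (4, 32, 4, 4)
```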
According to an embodiment of the present disclosure, the first local coding network includes a second fusion sub-network, a second coding sub-network, and a second conversion sub-network. The first local coding module includes a first fusion submodule, a second coding submodule, and a second conversion submodule. The first fusion submodule is used for fusing the feature maps of the first part in the first feature map sequence by adopting the second fusion sub-network, so as to obtain a fourth feature map sequence. The second coding submodule is used for inputting the fourth feature map sequence into the second coding sub-network to obtain a fifth feature map sequence. The second conversion submodule is used for converting the fifth feature map sequence into a feature map matrix by adopting the second conversion sub-network, so as to obtain the first local feature map.
According to an embodiment of the present disclosure, the second fusion sub-network includes a third conversion layer, a first pooling layer, and a fourth conversion layer. The first fusion submodule includes a first conversion unit, a first pooling unit, and a second conversion unit. The first conversion unit is used for converting the first feature map sequence into a first feature map matrix by adopting the third conversion layer. The first pooling unit is used for performing a pooling operation, by adopting the first pooling layer, on the feature map of the first predetermined region intercepted from the first feature map matrix, so as to obtain a sixth feature map matrix. The second conversion unit is used for converting the sixth feature map matrix into a fourth feature map sequence by adopting the fourth conversion layer.
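Analogously, the second fusion sub-network might be sketched as follows; the region coordinates and mean pooling over the intercepted maps are assumptions for illustration.

```python
import torch

def local_fusion(seq: torch.Tensor, region=(0, 2, 0, 2)) -> torch.Tensor:
    """Sketch of the second fusion sub-network. seq: (L, C, h, w)."""
    L, C, h, w = seq.shape
    G = int(L ** 0.5)
    grid = seq.reshape(G, G, C, h, w)                # third conversion layer
    r0, r1, c0, c1 = region                          # first predetermined region
    patch = grid[r0:r1, c0:c1].reshape(-1, C, h, w)  # intercepted feature maps
    pooled = patch.mean(dim=0, keepdim=True)         # first pooling layer (mean)
    return pooled                                    # back to a short map sequence

fourth_seq = local_fusion(torch.randn(16, 8, 4, 4))  # shape (1, 8, 4, 4)
```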
According to an embodiment of the disclosure, the image classification model further comprises a second local coding network. The apparatus 800 may further include a second local encoding module, configured to extract a second local feature of the first feature map sequence by using a second local encoding network, so as to obtain a second local feature map. The classification module 850 may further be configured to input the global feature map, the first local feature map, and the second local feature map into a prediction network, so as to obtain classification information of the image to be classified. Wherein the first local feature has a dimension greater than a dimension of the second local feature.
According to an embodiment of the present disclosure, the second local coding network includes a third fusion sub-network, a third coding sub-network, and a third conversion sub-network. The second local coding module may include a second fusion submodule, a third coding submodule, and a third conversion submodule. The second fusion submodule is used for fusing the feature maps of the second part in the first feature map sequence by adopting the third fusion sub-network, so as to obtain a seventh feature map sequence. The third coding submodule is used for inputting the seventh feature map sequence into the third coding sub-network to obtain a sixth feature map sequence. The third conversion submodule is used for converting the sixth feature map sequence into a feature map matrix by adopting the third conversion sub-network, so as to obtain the second local feature map.
According to an embodiment of the present disclosure, the third fusion sub-network includes a fifth conversion layer, a second pooling layer, and a sixth conversion layer. The second fusion submodule may include a third conversion unit, a second pooling unit, and a fourth conversion unit. The third conversion unit is used for converting the first feature map sequence into a first feature map matrix by adopting the fifth conversion layer. The second pooling unit is used for performing a pooling operation, by adopting the second pooling layer, on the feature map of the second predetermined region intercepted from the first feature map matrix, so as to obtain an eighth feature map matrix. The fourth conversion unit is used for converting the eighth feature map matrix into a seventh feature map sequence by adopting the sixth conversion layer.
Based on the training method of the image classification model provided by the present disclosure, the present disclosure also provides a training device of the image classification model, which will be described in detail below with reference to fig. 9.
Fig. 9 is a block diagram of a structure of a training apparatus for an image classification model according to an embodiment of the present disclosure.
As shown in fig. 9, the training apparatus 900 for an image classification model of this embodiment may include an image segmentation module 910, a self-attention coding module 920, a global coding module 930, a first local coding module 940, a first classification module 950, and a model training module 960. The image classification model comprises a self-attention encoder, a global coding network and a first local coding network.
The image segmentation module 910 may be configured to segment the sample image into a plurality of image blocks, resulting in an image block sequence. In an embodiment, the image segmentation module 910 is configured to perform the operation S510 described above, which is not described herein again.
The self-attention coding module 920 is configured to perform self-attention coding on the image block sequence by using a self-attention coder to obtain a first feature map sequence; the first feature map sequence includes a plurality of feature maps for a plurality of image blocks, respectively. In an embodiment, the self-attention coding module 920 may perform the operation S520 described above, which is not described herein again.
The global coding module 930 is configured to extract global features of the first feature map sequence by using a global coding network, so as to obtain a global feature map. In an embodiment, the global encoding module 930 may perform the operation S530 described above, which is not described herein again.
The first local encoding module 940 is configured to extract a first local feature of the first feature map sequence by using a first local encoding network to obtain a first local feature map. In an embodiment, the first partial encoding module 940 may perform the operation S540 described above, and is not described herein again.
The first classification module 950 is configured to determine first classification information according to the global feature map and determine second classification information according to the first local feature map. In an embodiment, the first classification module 950 may perform the operation S550 described above, which is not described herein again.
The model training module 960 is configured to train the image classification model according to a first difference between the first classification information and the second classification information. In an embodiment, the model training module 960 may perform the operation S560 described above, which is not described herein again.
According to an embodiment of the disclosure, the image classification model further comprises a second local coding network. The apparatus 900 may further include a second local encoding module, configured to extract a second local feature of the first feature map sequence by using a second local encoding network, so as to obtain a second local feature map. The first classification module 950 is further configured to determine third classification information according to the second local feature map. The model training module 960 may be further configured to train the image classification model according to a second difference between the first classification information and the third classification information and a third difference between the second classification information and the third classification information.
According to an embodiment of the present disclosure, the sample image includes true value category information. The apparatus 900 may further include a first difference determination module configured to determine, for any one of the first classification information, the second classification information, and the third classification information, a fourth difference between that classification information and the true value category information. The model training module 960 may be further configured to train the image classification model according to the fourth difference between each of the first classification information, the second classification information, and the third classification information and the true value category information.
According to an embodiment of the present disclosure, any classification information includes probability distribution information of the sample image for a plurality of predetermined categories, and the true value category information includes the true value category to which the sample image belongs. The first difference determination module may include a distribution determination submodule and a first difference determination submodule. The distribution determination submodule is used for determining the predetermined distribution information of the sample image for the plurality of predetermined categories according to the true value category. The first difference determination submodule is used for determining the fourth difference between any classification information and the true value category information according to the distribution difference between the predetermined distribution information and the probability distribution information.
According to an embodiment of the present disclosure, any classification information includes probability distribution information of the sample image for a plurality of predetermined classes. Each of the plurality of predetermined categories has a corresponding numerical value, and the true value category information includes a true value category to which the sample image belongs. The first difference determining module may include a value determining submodule and a second difference determining submodule. And the value determination submodule is used for determining the weighted sum of a plurality of numerical values corresponding to a plurality of preset categories according to the probability distribution information, and the weighted sum is used as a weighted value. The second difference determination submodule is used for determining a fourth difference between any classification information and the truth value category information according to a value difference value between a numerical value corresponding to the truth value category and the weighted value.
According to an embodiment of the present disclosure, the sample image includes true value category information, and the image classification model further includes a prediction network. The apparatus 900 may further include a second classification module and a second difference determination module. The second classification module is used for inputting the global feature map and the first local feature map into the prediction network to obtain fourth classification information. The second difference determination module is used for determining a fifth difference between the fourth classification information and the true value category information. The model training module 960 is further configured to train the image classification model according to the fifth difference.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the personal information involved all comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated. In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is acquired or collected.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that may be used to implement the image classification model-based image classification method and/or the training method of the image classification model of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to one another by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so on. The computing unit 1001 performs the respective methods and processes described above, such as the image classification method based on the image classification model and/or the training method of the image classification model. For example, in some embodiments, the image classification method and/or the training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the image classification method and/or the training method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured by any other suitable means (e.g., by means of firmware) to perform the image classification method based on the image classification model and/or the training method of the image classification model.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special- or general-purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that remedies the defects of high management difficulty and weak service extensibility of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (20)

1. An image classification method based on an image classification model, wherein the image classification model comprises a self-attention encoder, a global coding network, a first local coding network and a prediction network; the method comprises the following steps:
dividing an image to be classified into a plurality of image blocks to obtain an image block sequence;
performing self-attention coding on the image block sequence by adopting the self-attention coder to obtain a first characteristic diagram sequence; the first feature map sequence comprises a plurality of feature maps for the plurality of image blocks, respectively;
extracting global features of the first feature map sequence by adopting the global coding network to obtain a global feature map;
extracting a first local feature of the first feature map sequence by adopting the first local coding network to obtain a first local feature map; and
and inputting the global feature map and the first local feature map into the prediction network to obtain the classification information of the image to be classified.
2. The method of claim 1, wherein the global coding network comprises a first coding subnetwork and a first conversion subnetwork; the extracting the global features of the first feature map sequence by using the global coding network to obtain the global feature map comprises:
based on the first feature map sequence, adopting the first coding subnetwork to obtain a second feature map sequence; and
and converting the second characteristic diagram sequence into a characteristic diagram matrix by adopting the first conversion sub-network to obtain the global characteristic diagram.
3. The method of claim 2, wherein the global coding network further comprises a first fusion sub-network; the obtaining a second feature map sequence by using the first coding subnetwork based on the first feature map sequence comprises:
globally fusing the feature maps in the first feature map sequence by using the first fusion sub-network to obtain a third feature map sequence; and
and inputting the third feature map sequence into the first coding subnetwork to obtain the second feature map sequence.
4. The method of claim 3, wherein the first fusion subnetwork comprises a first conversion layer, a first down-sampling layer, and a second conversion layer; the performing global fusion on the feature maps in the first feature map sequence by using the first fusion sub-network to obtain a third feature map sequence includes:
converting the first characteristic diagram sequence into a first characteristic diagram matrix by adopting the first conversion layer;
sampling the characteristic diagram in the first characteristic diagram matrix at intervals by adopting the first down-sampling layer to obtain a plurality of fusion sub-matrixes; and
and converting the plurality of fusion sub-matrixes into a plurality of fusion characteristic graphs by using the second conversion layer to obtain a third characteristic graph sequence consisting of the plurality of fusion characteristic graphs.
5. The method of claim 1, wherein the first local coding network comprises a second fusion subnetwork, a second coding subnetwork, and a second conversion subnetwork; the extracting, by using the first local coding network, the first local feature of the first feature map sequence to obtain a first local feature map includes:
fusing the feature maps of the first part in the first feature map sequence by using the second fusion sub-network to obtain a fourth feature map sequence;
inputting the fourth feature map sequence into the second coding subnetwork to obtain a fifth feature map sequence; and
and converting the fifth feature map sequence into a feature map matrix by using the second conversion sub-network to obtain the first local feature map.
6. The method of claim 5, wherein the second fusion subnetwork comprises a third conversion layer, a first pooling layer, and a fourth conversion layer; the fusing the feature maps of the first part in the first feature map sequence by using the second fusion sub-network to obtain a fourth feature map sequence includes:
converting the first characteristic diagram sequence into a first characteristic diagram matrix by using the third conversion layer;
performing pooling operation on the intercepted feature map of the first preset area in the first feature map matrix by using the first pooling layer to obtain a sixth feature map matrix; and
and converting the sixth characteristic diagram matrix into the fourth characteristic diagram sequence by adopting the fourth conversion layer.
7. The method of claim 5, wherein the image classification model further comprises a second local coding network; the method further comprises the following steps:
extracting a second local feature of the first feature map sequence by using the second local coding network to obtain a second local feature map; and
inputting the global feature map, the first local feature map and the second local feature map into the prediction network to obtain the classification information of the image to be classified,
wherein the first local feature has a dimension that is greater than a dimension of the second local feature.
8. The method of claim 7, wherein the second local coding network comprises a third fusion sub-network, a third coding sub-network, and a third conversion sub-network; the extracting, by using the second local coding network, the second local feature of the first feature map sequence to obtain a second local feature map includes:
fusing the feature maps of the second part in the first feature map sequence by using the third fusion sub-network to obtain a seventh feature map sequence;
inputting the seventh feature map sequence into the third coding subnetwork to obtain a sixth feature map sequence; and
and converting the sixth feature map sequence into a feature map matrix by using the third conversion sub-network to obtain the second local feature map.
9. The method of claim 8, wherein the third fusion subnetwork comprises a fifth conversion layer, a second pooling layer, and a sixth conversion layer; the fusing the feature maps of the second part in the first feature map sequence by using the third fusion sub-network to obtain a seventh feature map sequence includes:
converting the first characteristic diagram sequence into a first characteristic diagram matrix by using the fifth conversion layer;
performing pooling operation on the intercepted feature map of the second preset area in the first feature map matrix by using the second pooling layer to obtain an eighth feature map matrix; and
and converting the eighth characteristic diagram matrix into the seventh characteristic diagram sequence by adopting the sixth conversion layer.
10. A training method of an image classification model, wherein the image classification model comprises a self-attention encoder, a global coding network and a first local coding network; the method comprises the following steps:
dividing a sample image into a plurality of image blocks to obtain an image block sequence;
performing self-attention coding on the image block sequence by adopting the self-attention coder to obtain a first characteristic diagram sequence; the first feature map sequence comprises a plurality of feature maps for the plurality of image blocks, respectively;
extracting global features of the first feature map sequence by adopting the global coding network to obtain a global feature map;
extracting a first local feature of the first feature map sequence by adopting the first local coding network to obtain a first local feature map;
determining first classification information according to the global feature map, and determining second classification information according to the first local feature map; and
training the image classification model according to a first difference between the first classification information and the second classification information.
11. The method of claim 10, wherein the image classification model further comprises a second local coding network; the method further comprises the following steps:
extracting a second local feature of the first feature map sequence by using the second local coding network to obtain a second local feature map;
determining third classification information according to the second local feature map; and
and training the image classification model according to a second difference between the first classification information and the third classification information and a third difference between the second classification information and the third classification information.
12. The method of claim 11, wherein the sample image includes true value category information; the method further comprises the following steps:
determining, for any one of the first classification information, the second classification information, and the third classification information, a fourth difference between the any one of the classification information and the true value classification information; and
training the image classification model according to a fourth difference between each of the first classification information, the second classification information and the third classification information and the true value classification information.
13. The method of claim 12, wherein the any classification information includes probability distribution information of the sample image for a plurality of predetermined classes; the truth category information comprises a truth category to which the sample image belongs; determining a fourth difference between the any classification information and the truth category information comprises:
determining predetermined distribution information of the sample image for the plurality of predetermined classes according to the truth class; and
determining a fourth difference between the any one of the classification information and the truth-value category information according to a distribution difference between the predetermined distribution information and the probability distribution information.
14. The method of claim 12, wherein the any classification information includes probability distribution information of the sample image for a plurality of predetermined classes; each of the plurality of predetermined categories has a corresponding numerical value; the truth category information comprises a truth category to which the sample image belongs; determining a fourth difference between the any classification information and the truth class information comprises:
determining a weighted sum of a plurality of numerical values corresponding to the plurality of predetermined categories according to the probability distribution information, and taking the weighted sum as a weighted value; and
and determining a fourth difference between any classification information and the truth value category information according to a value difference value between a numerical value corresponding to the truth value category and the weighted value.
15. The method of any of claims 10-14, wherein the sample image includes truth category information; the image classification model further comprises a prediction network; the method further comprises the following steps:
inputting the global feature map and the first local feature map into the prediction network to obtain fourth classification information;
determining a fifth difference between the fourth classification information and the true classification information; and
and training the image classification model according to the fifth difference.
16. An image classification device based on an image classification model, wherein the image classification model comprises a self-attention encoder, a global coding network, a first local coding network and a prediction network; the device comprises:
the image segmentation module is used for segmenting an image to be classified into a plurality of image blocks to obtain an image block sequence;
the self-attention coding module is used for carrying out self-attention coding on the image block sequence by adopting the self-attention coder to obtain a first characteristic diagram sequence; the first feature map sequence comprises a plurality of feature maps for the plurality of image blocks, respectively;
the global coding module is used for extracting global features of the first feature map sequence by adopting the global coding network to obtain a global feature map;
the local coding module is used for extracting a first local feature of the first feature map sequence by adopting the first local coding network to obtain a first local feature map; and
and the classification module is used for inputting the global feature map and the first local feature map into the prediction network to obtain the classification information of the image to be classified.
17. An apparatus for training an image classification model, wherein the image classification model comprises a self-attention encoder, a global coding network, a first local coding network; the device comprises:
the image segmentation module is used for segmenting the sample image into a plurality of image blocks to obtain an image block sequence;
the self-attention coding module is used for carrying out self-attention coding on the image block sequence by adopting the self-attention coder to obtain a first characteristic diagram sequence; the first feature map sequence comprises a plurality of feature maps for the plurality of image blocks, respectively;
the global coding module is used for extracting global features of the first feature map sequence by adopting the global coding network to obtain a global feature map;
the local coding module is used for extracting a first local feature of the first feature map sequence by adopting the first local coding network to obtain a first local feature map;
the classification module is used for determining first classification information according to the global feature map and determining second classification information according to the first local feature map; and
and the model training module is used for training the image classification model according to a first difference between the first classification information and the second classification information.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-15.
20. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 15.
CN202210315149.3A 2022-03-28 2022-03-28 Image classification method and training method and device of image classification model Pending CN114863229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210315149.3A CN114863229A (en) 2022-03-28 2022-03-28 Image classification method and training method and device of image classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210315149.3A CN114863229A (en) 2022-03-28 2022-03-28 Image classification method and training method and device of image classification model

Publications (1)

Publication Number Publication Date
CN114863229A true CN114863229A (en) 2022-08-05

Family

ID=82630214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210315149.3A Pending CN114863229A (en) 2022-03-28 2022-03-28 Image classification method and training method and device of image classification model

Country Status (1)

Country Link
CN (1) CN114863229A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170893A (en) * 2022-08-29 2022-10-11 荣耀终端有限公司 Training method of common-view gear classification network, image sorting method and related equipment
CN115661486A (en) * 2022-12-29 2023-01-31 有米科技股份有限公司 Intelligent image feature extraction method and device
CN115661486B (en) * 2022-12-29 2023-04-07 有米科技股份有限公司 Intelligent image feature extraction method and device
CN116228897A (en) * 2023-03-10 2023-06-06 北京百度网讯科技有限公司 Image processing method, image processing model and training method
CN116228897B (en) * 2023-03-10 2024-04-23 北京百度网讯科技有限公司 Image processing method, image processing model and training method
CN117036788A (en) * 2023-07-21 2023-11-10 阿里巴巴达摩院(杭州)科技有限公司 Image classification method, method and device for training image classification model
CN117036788B (en) * 2023-07-21 2024-04-02 阿里巴巴达摩院(杭州)科技有限公司 Image classification method, method and device for training image classification model

Similar Documents

Publication Publication Date Title
CN114863229A (en) Image classification method and training method and device of image classification model
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
CN111695415A (en) Construction method and identification method of image identification model and related equipment
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
KR20220122566A (en) Text recognition model training method, text recognition method, and apparatus
WO2022105117A1 (en) Method and device for image quality assessment, computer device, and storage medium
CN111598979B (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN115578735B (en) Text detection method and training method and device of text detection model
JP2023040100A (en) Multitask identification method and device, training method and device, electronic apparatus, storage medium and computer program
CN113379627A (en) Training method of image enhancement model and method for enhancing image
CN113869205A (en) Object detection method and device, electronic equipment and storage medium
KR20220116395A (en) Method and apparatus for determining pre-training model, electronic device and storage medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
JP2023062150A (en) Character recognition model training, character recognition method, apparatus, equipment, and medium
CN114022887B (en) Text recognition model training and text recognition method and device, and electronic equipment
CN113435208B (en) Training method and device for student model and electronic equipment
CN114120413A (en) Model training method, image synthesis method, device, equipment and program product
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN114943995A (en) Training method of face recognition model, face recognition method and device
CN115641643A (en) Gait recognition model training method, gait recognition device and gait recognition equipment
CN114926322A (en) Image generation method and device, electronic equipment and storage medium
CN114117037A (en) Intention recognition method, device, equipment and storage medium
CN113989152A (en) Image enhancement method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination