CN116843830A - Mask image modeling algorithm based on self-supervision learning - Google Patents

Mask image modeling algorithm based on self-supervision learning

Info

Publication number
CN116843830A
CN116843830A (application CN202310691621.8A)
Authority
CN
China
Prior art keywords
mask
patches
image
visible
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310691621.8A
Other languages
Chinese (zh)
Inventor
张正卿
胡超
朱力强
黄家耀
赖盛鑫
邬伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Shanghai Industrial Internet Co Ltd
Original Assignee
China Unicom Shanghai Industrial Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom Shanghai Industrial Internet Co Ltd filed Critical China Unicom Shanghai Industrial Internet Co Ltd
Priority to CN202310691621.8A priority Critical patent/CN116843830A/en
Publication of CN116843830A publication Critical patent/CN116843830A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of masked image modeling (Masked Image Modeling, MIM), in particular to a mask image modeling algorithm based on self-supervised learning. An image is first divided into patches, which are randomly partitioned into 4 equal parts; each part in turn serves as the visible patches while the remaining patches serve as mask patches, yielding 4 mask images. The visible patches are fed to an encoder to obtain a latent feature representation, and the encoded visible patches together with the mask patches are fed to a decoder to reconstruct the image. The mean absolute error between the predictions of the overlapping mask patches of different mask images obtained from the same image is minimized to enhance the certainty of the model's reconstruction results. By fully utilizing the data, the invention greatly saves training time and hardware resources, and ranks among the leading mask image modeling methods on open-source datasets.

Description

Mask image modeling algorithm based on self-supervision learning
Technical Field
The invention relates to the technical field of mask image modeling, in particular to a mask image modeling algorithm based on self-supervision learning.
Background
Self-supervised learning can train on large amounts of unlabeled data, improves the generalization ability and efficiency of models, and is widely applied in the image, speech, and text fields. Masked image modeling (Masked Image Modeling, MIM) has achieved excellent success in self-supervised visual representation learning, following the success of masked language modeling (Masked Language Modeling, MLM) in natural language processing and the development of vision Transformers. MIM methods learn semantic representations by first masking portions of the input and then predicting their signals, e.g., normalized pixels, discrete tokens, HOG features, deep features, or frequencies, based on the unmasked portions.
MAE (Masked Autoencoders) is a self-supervised MIM method with strong scalability and a simple design. MAE randomly masks portions of the input image and reconstructs the missing pixels. It adopts an asymmetric encoder-decoder structure: the encoder encodes only the visible patches and does not process the mask tokens, while the decoder takes the encoder output and the mask tokens as input to reconstruct the image. Because image data has a lower information density than language data, MAE uses a high mask ratio. However, this results in a heavy computational burden and a slow learning process. Moreover, since random masking selects different patches each time, the model produces different predictions for the same image, introducing high uncertainty. These problems are common to mask image modeling in general.
In summary, a mask image modeling deep learning paradigm based on self-supervised learning is designed to achieve high-precision, high-efficiency mask image modeling.
Disclosure of Invention
Aiming at the shortcomings of current self-supervised mask image modeling algorithms, the invention provides a novel mask image modeling algorithm based on self-supervised learning. An image is first divided into patches, which are randomly partitioned into 4 equal parts; each part in turn serves as the visible patches while the remaining patches serve as mask patches, yielding 4 mask images. The visible patches are fed to the encoder to obtain a latent feature representation, and the encoded visible patches together with the mask patches are fed to the decoder for image reconstruction. The mean absolute error between the predictions of the overlapping mask patches of different mask images obtained from the same image is minimized to enhance the certainty of the model's reconstruction results. The method ranks among the leading mask image modeling methods on open-source datasets.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a mask image modeling algorithm based on self-supervised learning, comprising the steps of:
step S1: the open source data sets mainly used by MIM are ImageNet data set, COCO data set, places365 data set and the like, if training is needed on the data sets, the data format is needed to be prepared to be consistent with the data sets, firstly, the data is converted, the image is scaled to the same size, and normalization processing is carried out;
s2, dividing an image into patches with the same size, randomly adding a mask (high mask ratio) to part of the patches, coding visible patches without the mask as input of a coder (ViT), performing linear projection on the visible patches, adding position embedding, and then sending the visible patches into a transform block to obtain potential feature representation;
step S3, combining the mask patterns and the output of the encoder according to the sequence in the original image, taking the combined mask patterns and the output of the encoder as the input of a decoder (ViT), wherein the last layer of the decoder is linear projection, mapping potential features back to a pixel space, and completing the reconstruction prediction of the whole image;
step S4, for the same image, a plurality of mask images with non-overlapping visible patches can be obtained by adding masks in the step S2, partial same mask patches exist in reconstruction prediction of any two mask images, and the average absolute error of reconstruction results of the same mask patches in different mask images is minimized so as to enhance the certainty of model prediction results;
and S5, calculating a Mean Square Error (MSE) between the reconstructed mask patterns and the original image patterns according to the reconstructed mask patterns, minimizing to optimize a model, wherein the model can directly execute an image reconstruction task, or can use different modules to replace a decoder, and execute a corresponding downstream task after fine adjustment.
In the mask image modeling algorithm based on self-supervised learning, the image data used in S1 is uniformly scaled to 224×224.
The image patches divided in S2 are of size 16×16, so a 224×224 image is divided into 14×14 image patches. These 14×14 patches are randomly partitioned into 4 parts; each part in turn serves as the visible patches while masks are added to the remaining patches. In this way, one original image yields 4 mask images whose visible patches do not overlap and whose mask ratio is 75%. Each mask image represents its position information with a vector t of length 14×14 = 196, in which every element follows a {0,1} binary distribution: 0 denotes a mask patch, 1 denotes a visible patch, and the element index is the patch position. Each mask image thus contains a pair of complementary combinations of the image x: visible patches x_v = x ⊙ t and mask patches x_m = x ⊙ (1 − t), where ⊙ denotes element-wise selection by the binary vector. The random partition of the image patches uses the following strategy:
(1) Initialize a vector d = [0, 1, …, 195] of length 14×14 = 196;
(2) Randomly shuffle the order of the elements in vector d;
(3) Initialize 4 zero vectors t_0, t_1, t_2, t_3 of length 196 for storing the mask image information;
(4) For i = {0, 1, 2, 3}, update t_i as shown in formula (1):
t_i[d[49*i : 49*(i+1)]] = 1 (1)
Thus, one original image yields 4 mask images with a 75% mask ratio whose visible patches do not overlap. Here 75% is the optimal mask ratio: lowering the mask ratio increases redundant image information, while too high a mask ratio leaves too little image information and degrades reconstruction.
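The partition strategy of steps (1) to (4) can be sketched in a few lines of NumPy. This is an illustrative reconstruction rather than the patented implementation; it assumes each of the 4 parts holds 196/4 = 49 shuffled patch indices, so each mask image keeps 25% of the patches visible (a 75% mask ratio):

```python
import numpy as np

def partition_masks(num_patches=196, num_parts=4, seed=0):
    """Randomly split patch indices into 4 complementary visible sets.

    Returns 4 binary vectors t_i of length num_patches: 1 marks a
    visible patch, 0 marks a mask patch, following steps (1)-(4).
    """
    rng = np.random.default_rng(seed)
    d = np.arange(num_patches)          # step (1): d = [0, 1, ..., 195]
    rng.shuffle(d)                      # step (2): random order
    part = num_patches // num_parts     # 196 / 4 = 49 visible patches each
    ts = np.zeros((num_parts, num_patches), dtype=int)  # step (3)
    for i in range(num_parts):          # step (4), formula (1)
        ts[i, d[part * i : part * (i + 1)]] = 1
    return ts

ts = partition_masks()
```

Because the 4 visible sets partition the shuffled index vector d, every patch is visible in exactly one mask image, which is what allows the method to make full use of the image data.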
The encoder uses ViT and takes as input the visible patches whose position elements in vector t equal 1. As in standard ViT, the encoder embeds the patches by linear projection (generating one token per input patch), adds position embeddings, and then processes the embedded sequence with a series of Transformer blocks. Since the encoder processes only the 25% of patches that are visible, a larger encoder can be trained with fewer computational resources and lower hardware cost, and the full use of the image data also greatly reduces the training difficulty of the model.
The decoder in S3 is lightweight and consists of a series of Transformer blocks. Its input is the latent feature representation obtained by passing the visible patches through the encoder, together with the mask tokens (the patches whose position elements in vector t equal 0). The mask token is a shared learnable vector that indicates a missing patch to be predicted, and position embeddings are added. The last layer of the decoder is a linear projection; to facilitate reconstructing the mask patches, its number of output channels equals the number of pixel values in one patch, so each output element is a vector of pixel values representing a patch, which is then reshaped. Using the normalized pixel values of each mask patch as the reconstruction target effectively improves the quality of the feature representation. The decoder and encoder adopt an asymmetric design: the decoder's computation per token is less than 10% of the encoder's, greatly reducing model training time. The encoding and decoding process is shown in formula (2):
y = g(f(x_v) ∪ x_m) (2)
where x_v denotes the visible patches, x_m denotes the mask patches, f(·) denotes the encoder, and g(·) denotes the decoder; the encoded visible patches and the mask patches are aligned according to the original image patch order, with the position information recorded in vector t.
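As a toy illustration of how the decoder input in formula (2) is assembled, the sketch below scatters encoder outputs back to their original positions recorded in t and fills the masked positions with a shared mask token; the function name and shapes are hypothetical, not from the patent:

```python
import numpy as np

def assemble_decoder_input(encoded_visible, mask_token, t):
    """Scatter encoded visible tokens back to their original patch
    positions (t == 1) and fill masked positions (t == 0) with the
    shared mask token."""
    num_patches = len(t)
    seq = np.tile(mask_token.reshape(1, -1), (num_patches, 1))
    seq[t == 1] = encoded_visible   # visible tokens keep original order
    return seq

# toy example: 4 patches, 2 visible, 2-dim tokens
t = np.array([1, 0, 0, 1])
enc = np.array([[1.0, 1.0], [2.0, 2.0]])   # encoder output, visible order
mask_token = np.zeros(2)                   # stands in for the learnable vector
seq = assemble_decoder_input(enc, mask_token, t)
```

In a real model the mask token would be a trained parameter and position embeddings would be added before the Transformer blocks; only the ordering logic is shown here.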
Among the reconstruction results of the 4 different mask images obtained from the same original image in S4, any two of them share 50% of their mask patches. The overlap o_ij is defined as follows:
o_ij = t_i ∩ t_j (3)
where t_i and t_j are the vector representations of mask images i and j; o_ij comprises the positions where t_i and t_j are both 0, which correspond to the overlapping mask patches.
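A small NumPy check of the overlap claim, under the assumed 49-index partition: any two of the 4 mask images each mask 147 patches, and 98 of those positions (50% of the 196 total patches) are masked in both, matching formula (3):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.permutation(196)                   # shuffled patch indices
ts = np.zeros((4, 196), dtype=int)
for i in range(4):
    ts[i, d[49 * i : 49 * (i + 1)]] = 1    # 49 visible patches per image

# overlap o_01 (formula (3)): positions masked (0) in both t_0 and t_1
o_01 = (ts[0] == 0) & (ts[1] == 0)
overlap_ratio = o_01.sum() / 196           # fraction of all patches
```

The count 98 follows directly from the construction: 196 total patches minus the 49 visible in image 0 and the 49 visible in image 1, since the visible sets are disjoint.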
Because the visible patches differ, the reconstruction results of the same mask patches also differ to some extent. The mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to guide the reconstruction of the mask patches and enhance the certainty of the model's predictions. The calculation is shown in formula (4):
L_c = (1/6) Σ_{0≤i<j≤3} (1/|o_ij|) Σ_{k∈o_ij} |p_i[k] − p_j[k]| (4)
where p_i and p_j denote the reconstruction results of mask images i and j, and |o_ij| is the number of overlapping mask patches in o_ij.
The mean square error between the reconstruction results of the mask patches in S5 and the original image patches in pixel space is computed as shown in formula (5):
L_w = (1/|x_m|) Σ_{k∈x_m} ||y_m[k] − x_m[k]||² (5)
where x_m represents the mask patches, y_m represents the reconstruction result of x_m, and |x_m| is the number of mask patches. The loss used to train the final model can be expressed by formula (6):
L_total = L_c + L_w (6)
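The two loss terms and their sum in formulas (4) to (6) can be sketched with toy NumPy tensors; the names L_c and L_w, the per-pair averaging, and the random stand-in data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, pixels = 196, 16 * 16 * 3     # 768 pixel values per patch

# binary vectors t_i for 4 mask images (1 = visible), as in formula (1)
d = rng.permutation(num_patches)
t = np.zeros((4, num_patches), dtype=int)
for i in range(4):
    t[i, d[49 * i : 49 * (i + 1)]] = 1

# toy reconstructions p_i of all patches from the 4 mask images
p = rng.normal(size=(4, num_patches, pixels))

# consistency loss L_c (formula (4)): mean absolute error between the
# reconstructions of mask patches shared by each pair of mask images
pair_losses = []
for i in range(4):
    for j in range(i + 1, 4):
        o_ij = (t[i] == 0) & (t[j] == 0)          # overlap, formula (3)
        pair_losses.append(float(np.abs(p[i][o_ij] - p[j][o_ij]).mean()))
L_c = float(np.mean(pair_losses))

# reconstruction loss L_w (formula (5)): MSE between reconstructed mask
# patches and the original patches, shown here for mask image 0 only
x = rng.normal(size=(num_patches, pixels))        # original patches
masked0 = t[0] == 0
L_w = float(((p[0][masked0] - x[masked0]) ** 2).mean())

L_total = L_c + L_w                               # formula (6)
```

In training, L_w would of course be averaged over all 4 mask images and both terms backpropagated jointly.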
The trained model can directly perform the image reconstruction task, or it can serve as a pre-trained model in which different modules replace the decoder; after fine-tuning, downstream tasks such as classification, object detection, and instance segmentation can be performed.
Compared with the prior art, the invention has the beneficial effects that:
aiming at the defects of the self-supervision mask image modeling algorithm at the present stage, the invention provides a novel mask image modeling algorithm based on self-supervision learning, which is characterized in that an image is firstly divided into patches and randomly divided into 4 equal parts, each patch is used as a visible patch, the rest is used as a mask patch, the 4 mask images are obtained by taking the visible patches as encoder input, potential characteristic representation is obtained by taking the encoded visible patches and the mask patches as encoder input, image reconstruction is carried out, and the average absolute error of the prediction results of the overlapped parts of the mask patches in different mask images obtained by the same image is minimized, so that the certainty of the model reconstruction result is enhanced. By fully utilizing the data, training time and hardware resources are greatly saved, and the mask image modeling method is positioned at the front position on the open source data set.
Detailed Description
The technical solutions in the embodiments of the present invention will now be described clearly and completely in conjunction with the embodiments. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art from them without inventive effort fall within the scope of protection of the present invention.
A mask image modeling algorithm based on self-supervised learning, comprising the steps of:
Step S1: the open-source datasets mainly used for MIM are the ImageNet, COCO, and Places365 datasets, among others; to train on these datasets, the data must be prepared in a format consistent with them: first convert the data, scale the images to the same size, and normalize them;
Step S2: divide an image into patches of equal size and randomly mask part of the patches (with a high mask ratio); the unmasked visible patches serve as the input of the encoder (ViT), which applies a linear projection to the visible patches, adds position embeddings, and then feeds them into Transformer blocks to obtain a latent feature representation;
Step S3: combine the mask tokens and the encoder output in their order in the original image and use them as the input of the decoder (ViT); the last layer of the decoder is a linear projection that maps the latent features back to pixel space, completing the reconstruction prediction of the whole image;
Step S4: for the same image, several mask images with non-overlapping visible patches can be obtained through the masking in step S2; any two of these mask images share some of the same mask patches in their reconstruction predictions, and the mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to enhance the certainty of the model's predictions;
Step S5: compute the mean square error (MSE) between the reconstructed mask patches and the original image patches and minimize it to optimize the model; the trained model can directly perform the image reconstruction task, or different modules can replace the decoder and, after fine-tuning, the model can perform the corresponding downstream tasks.
The image data used in S1 is uniformly scaled to 224×224 size.
The image patches divided in S2 are of size 16×16, so a 224×224 image is divided into 14×14 image patches. These 14×14 patches are randomly partitioned into 4 parts; each part in turn serves as the visible patches while masks are added to the remaining patches. In this way, one original image yields 4 mask images whose visible patches do not overlap and whose mask ratio is 75%. Each mask image represents its position information with a vector t of length 14×14 = 196, in which every element follows a {0,1} binary distribution: 0 denotes a mask patch, 1 denotes a visible patch, and the element index is the patch position. Each mask image thus contains a pair of complementary combinations of the image x: visible patches x_v = x ⊙ t and mask patches x_m = x ⊙ (1 − t), where ⊙ denotes element-wise selection by the binary vector. The random partition of the image patches uses the following strategy:
(1) Initialize a vector d = [0, 1, …, 195] of length 14×14 = 196;
(2) Randomly shuffle the order of the elements in vector d;
(3) Initialize 4 zero vectors t_0, t_1, t_2, t_3 of length 196 for storing the mask image information;
(4) For i = {0, 1, 2, 3}, update t_i as shown in formula (1):
t_i[d[49*i : 49*(i+1)]] = 1 (1)
Thus, one original image yields 4 mask images with a 75% mask ratio whose visible patches do not overlap. Here 75% is the optimal mask ratio: lowering the mask ratio increases redundant image information, while too high a mask ratio leaves too little image information and degrades reconstruction.
The encoder uses ViT and takes as input the visible patches whose position elements in vector t equal 1. As in standard ViT, the encoder embeds the patches by linear projection (generating one token per input patch), adds position embeddings, and then processes the embedded sequence with a series of Transformer blocks. Since the encoder processes only the 25% of patches that are visible, a larger encoder can be trained with fewer computational resources and lower hardware cost, and the full use of the image data also greatly reduces the training difficulty of the model.
The decoder in S3 is lightweight and consists of a series of Transformer blocks. Its input is the latent feature representation obtained by passing the visible patches through the encoder, together with the mask tokens (the patches whose position elements in vector t equal 0). The mask token is a shared learnable vector that indicates a missing patch to be predicted, and position embeddings are added. The last layer of the decoder is a linear projection; to facilitate reconstructing the mask patches, its number of output channels equals the number of pixel values in one patch, so each output element is a vector of pixel values representing a patch, which is then reshaped. Using the normalized pixel values of each mask patch as the reconstruction target effectively improves the quality of the feature representation. The decoder and encoder adopt an asymmetric design: the decoder's computation per token is less than 10% of the encoder's, greatly reducing model training time. The encoding and decoding process is shown in formula (2):
y = g(f(x_v) ∪ x_m) (2)
where x_v denotes the visible patches, x_m denotes the mask patches, f(·) denotes the encoder, and g(·) denotes the decoder; the encoded visible patches and the mask patches are aligned according to the original image patch order, with the position information recorded in vector t.
Among the reconstruction results of the 4 different mask images obtained from the same original image in S4, any two of them share 50% of their mask patches. The overlap o_ij is defined as follows:
o_ij = t_i ∩ t_j (3)
where t_i and t_j are the vector representations of mask images i and j; o_ij comprises the positions where t_i and t_j are both 0, which correspond to the overlapping mask patches.
Because the visible patches differ, the reconstruction results of the same mask patches also differ to some extent. The mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to guide the reconstruction of the mask patches and enhance the certainty of the model's predictions. The calculation is shown in formula (4):
L_c = (1/6) Σ_{0≤i<j≤3} (1/|o_ij|) Σ_{k∈o_ij} |p_i[k] − p_j[k]| (4)
where p_i and p_j denote the reconstruction results of mask images i and j, and |o_ij| is the number of overlapping mask patches in o_ij.
The mean square error between the reconstruction results of the mask patches in S5 and the original image patches in pixel space is computed as shown in formula (5):
L_w = (1/|x_m|) Σ_{k∈x_m} ||y_m[k] − x_m[k]||² (5)
where x_m represents the mask patches, y_m represents the reconstruction result of x_m, and |x_m| is the number of mask patches. The loss used to train the final model can be expressed by formula (6):
L_total = L_c + L_w (6)
The trained model can directly perform the image reconstruction task, or it can serve as a pre-trained model in which different modules replace the decoder; after fine-tuning, downstream tasks such as classification, object detection, and instance segmentation can be performed.
Examples:
the algorithm comprises the following steps:
(1) Image reconstruction task
A 224×224 defect image is input and divided into 14×14 patches of size 16×16. The intact patches serve as visible patches and the defective patches serve as mask patches. The visible patches are input to the encoder, linearly projected, given position embeddings, and then fed into Transformer blocks to obtain a latent feature representation. The mask tokens and the encoded visible patches are combined in their order in the original image and used as the input of the decoder (ViT), whose last layer is a linear projection that maps the latent features back to pixel space; the output is the image with the defective patches reconstructed.
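The patch splitting and reassembly used in this reconstruction example can be sketched as follows; `patchify` and `unpatchify` are hypothetical helper names, assuming 16×16 patches of a 224×224×3 image:

```python
import numpy as np

def patchify(img, p=16):
    """Split an HxWxC image into (H/p * W/p) flattened pxp patches,
    in row-major patch order."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return x

def unpatchify(patches, h=224, w=224, p=16, c=3):
    """Inverse of patchify: reassemble flattened patches into the image."""
    x = patches.reshape(h // p, w // p, p, p, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
    return x

img = np.arange(224 * 224 * 3, dtype=float).reshape(224, 224, 3)
patches = patchify(img)          # 196 patches of 16*16*3 = 768 values
restored = unpatchify(patches)
```

In the defect-repair flow above, the decoder's per-patch pixel predictions would replace the masked rows of `patches` before `unpatchify` produces the output image.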
(2) Image classification task
Among downstream tasks, the image classification task is described as an example: a multi-layer perceptron (MLP) head or a linear layer replaces the decoder, and the classification task is performed after fine-tuning. A 224×224 natural image is input and divided into 14×14 patches of size 16×16; all patches are input to the encoder, which outputs a 196×1024 feature matrix in which each row is the feature vector of one patch. The resulting latent feature representation is fed into the MLP head or linear layer to obtain the classification result.
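A minimal sketch of the classification head described above, with a single linear layer standing in for the MLP head; the mean-pooling choice, the 1000-class output, and the untrained random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy encoder output for one image: 196 patch tokens, 1024-dim each,
# matching the 196x1024 feature matrix described in the text
features = rng.normal(size=(196, 1024))

# hypothetical linear classification head replacing the decoder
num_classes = 1000
W = rng.normal(size=(1024, num_classes)) * 0.01   # toy, untrained weights
b = np.zeros(num_classes)

pooled = features.mean(axis=0)   # average-pool the patch tokens
logits = pooled @ W + b          # linear layer -> class scores
pred = int(np.argmax(logits))    # predicted class index
```

During fine-tuning, W and b (or the MLP head's weights) would be trained with a cross-entropy loss while the pre-trained encoder is updated with a smaller learning rate or kept frozen.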
Downstream tasks such as object detection and instance segmentation are handled similarly; only the module that replaces the decoder differs.
The foregoing is only a preferred embodiment of the present invention, but the scope of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the present invention shall be covered by the scope of protection of the present invention.

Claims (6)

1. A mask image modeling algorithm based on self-supervised learning, characterized by comprising the following steps:
Step S1: the open-source datasets mainly used for MIM are the ImageNet, COCO, and Places365 datasets, among others; to train on these datasets, the data must be prepared in a format consistent with them: first convert the data, scale the images to the same size, and normalize them;
Step S2: divide an image into patches of equal size and randomly mask part of the patches (with a high mask ratio); the unmasked visible patches serve as the input of the encoder (ViT), which applies a linear projection to the visible patches, adds position embeddings, and then feeds them into Transformer blocks to obtain a latent feature representation;
Step S3: combine the mask tokens and the encoder output in their order in the original image and use them as the input of the decoder (ViT); the last layer of the decoder is a linear projection that maps the latent features back to pixel space, completing the reconstruction prediction of the whole image;
Step S4: for the same image, several mask images with non-overlapping visible patches can be obtained through the masking in step S2; any two of these mask images share some of the same mask patches in their reconstruction predictions, and the mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to enhance the certainty of the model's predictions;
Step S5: compute the mean square error (MSE) between the reconstructed mask patches and the original image patches and minimize it to optimize the model; the trained model can directly perform the image reconstruction task, or different modules can replace the decoder and, after fine-tuning, the model can perform the corresponding downstream tasks.
2. A mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein the image data used in S1 is uniformly scaled to a 224 x 224 size.
3. The mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein the image patches divided in S2 are of size 16×16, so a 224×224 image is divided into 14×14 image patches; the 14×14 patches are randomly partitioned into 4 parts, each part in turn serving as the visible patches while masks are added to the remaining patches; in this way, one original image yields 4 mask images whose visible patches do not overlap and whose mask ratio is 75%; each mask image represents its position information with a vector t of length 14×14 = 196, in which every element follows a {0,1} binary distribution: 0 denotes a mask patch, 1 denotes a visible patch, and the element index is the patch position; each mask image contains a pair of complementary combinations of the image x: visible patches x_v = x ⊙ t and mask patches x_m = x ⊙ (1 − t), where ⊙ denotes element-wise selection by the binary vector; the random partition of the image patches uses the following strategy:
(1) Initialize a vector d = [0, 1, …, 195] of length 14×14 = 196;
(2) Randomly shuffle the order of the elements in vector d;
(3) Initialize 4 zero vectors t_0, t_1, t_2, t_3 of length 196 for storing the mask image information;
(4) For i = {0, 1, 2, 3}, update t_i as shown in formula (1):
t_i[d[49*i : 49*(i+1)]] = 1 (1)
Thus, one original image yields 4 mask images with a 75% mask ratio whose visible patches do not overlap. Here 75% is the optimal mask ratio: lowering the mask ratio increases redundant image information, while too high a mask ratio leaves too little image information and degrades reconstruction.
The encoder uses ViT and takes as input the visible patches whose position elements in vector t equal 1. As in standard ViT, the encoder embeds the patches by linear projection (generating one token per input patch), adds position embeddings, and then processes the embedded sequence with a series of Transformer blocks. Since the encoder processes only the 25% of patches that are visible, a larger encoder can be trained with fewer computational resources and lower hardware cost, and the full use of the image data also greatly reduces the training difficulty of the model.
4. The mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein the decoder in S3 is lightweight and consists of a series of Transformer blocks; its input is the latent feature representation obtained by passing the visible patches through the encoder, together with the mask tokens (the patches whose position elements in vector t equal 0); the mask token is a shared learnable vector that indicates a missing patch to be predicted, and position embeddings are added; the last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in one patch, so that each output element is a vector of pixel values representing a patch, which is then reshaped; using the normalized pixel values of each mask patch as the reconstruction target effectively improves the quality of the feature representation; the decoder and encoder adopt an asymmetric design, the decoder's computation per token being less than 10% of the encoder's, which greatly reduces model training time; the encoding and decoding process is shown in formula (2):
y = g(f(x_v) ∪ x_m) (2)
where x_v denotes the visible patches, x_m denotes the mask patches, f(·) denotes the encoder, and g(·) denotes the decoder; the encoded visible patches and the mask patches are aligned according to the original image patch order, with the position information recorded in vector t.
5. The mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein among the reconstruction results of the 4 different mask images obtained from the same original image in S4, any two of them share 50% of their mask patches, with the overlap o_ij defined as follows:
o_ij = t_i ∩ t_j (3)
where t_i and t_j are the vector representations of mask images i and j; o_ij comprises the positions where t_i and t_j are both 0, which correspond to the overlapping mask patches.
Because the visible patches differ, the reconstruction results of the same mask patches also differ to some extent. The mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to guide the reconstruction of the mask patches and enhance the certainty of the model's predictions. The calculation is shown in formula (4):
L_c = (1/6) Σ_{0≤i<j≤3} (1/|o_ij|) Σ_{k∈o_ij} |p_i[k] − p_j[k]| (4)
where p_i and p_j denote the reconstruction results of mask images i and j, and |o_ij| is the number of overlapping mask patches in o_ij.
6. The mask image modeling algorithm based on self-supervised learning as set forth in claim 1, wherein the process of calculating the mean square error between the reconstruction result of the mask patches in S5 and the original image patches in the pixel space is as shown in formula (5):
where x_m represents the mask patches and y_m represents the reconstruction of x_m; the loss used to train the final model can be expressed by formula (6):
L_total = L_c + L_w (6)
The trained model can directly perform the image reconstruction task, or it can be used as a pre-trained model: the decoder is replaced by a different module and the model is fine-tuned to perform downstream tasks such as classification, object detection, and instance segmentation.
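The pixel-space MSE of formula (5), restricted to the mask patches as is standard in masked autoencoding (an assumption about the exact averaging), can be sketched as:

```python
import numpy as np

def reconstruction_loss(y, x, t):
    """Formula (5) sketched as pixel-space MSE over the mask patches
    only (t == 0); the exact averaging is an assumption."""
    m = t == 0
    return ((y[m] - x[m]) ** 2).mean()

# Formula (6): the total training loss adds the pairwise consistency
# term to this reconstruction term, L_total = L_c + L_w.
```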
CN202310691621.8A 2023-06-12 2023-06-12 Mask image modeling algorithm based on self-supervision learning Pending CN116843830A (en)

Publications (1)

Publication Number Publication Date
CN116843830A true CN116843830A (en) 2023-10-03

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173182A (en) * 2023-11-03 2023-12-05 厦门微亚智能科技股份有限公司 Defect detection method, system, equipment and medium based on coding and decoding network
CN117173182B (en) * 2023-11-03 2024-03-19 厦门微亚智能科技股份有限公司 Defect detection method, system, equipment and medium based on coding and decoding network
CN118279151A (en) * 2024-06-03 2024-07-02 贵州大学 Self-supervision DW image super-resolution reconstruction method for any scale angle


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination