CN116843830A - Mask image modeling algorithm based on self-supervision learning - Google Patents

Mask image modeling algorithm based on self-supervision learning

Info

Publication number
CN116843830A
CN116843830A (application CN202310691621.8A)
Authority
CN
China
Prior art keywords
mask
patches
image
visible
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310691621.8A
Other languages
Chinese (zh)
Inventor
张正卿
胡超
朱力强
黄家耀
赖盛鑫
邬伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Shanghai Industrial Internet Co Ltd
Original Assignee
China Unicom Shanghai Industrial Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom Shanghai Industrial Internet Co Ltd filed Critical China Unicom Shanghai Industrial Internet Co Ltd
Priority to CN202310691621.8A priority Critical patent/CN116843830A/en
Publication of CN116843830A publication Critical patent/CN116843830A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of masked image modeling (Masked Image Modeling, MIM), in particular to a mask image modeling algorithm based on self-supervised learning. An image is first divided into patches, which are randomly partitioned into 4 equal parts; each part in turn serves as the visible patches while the remaining patches serve as mask patches, yielding 4 mask images. The visible patches are fed to an encoder to obtain a latent feature representation, and the encoded visible patches together with the mask patches are fed to a decoder to reconstruct the image. The mean absolute error between the predictions of the overlapping mask patches of different mask images obtained from the same image is minimized to enhance the certainty of the model's reconstruction results. By fully utilizing the data, the invention greatly saves training time and hardware resources, and ranks among the leading mask image modeling methods on open-source datasets.

Description

Mask image modeling algorithm based on self-supervision learning
Technical Field
The invention relates to the technical field of mask image modeling, in particular to a mask image modeling algorithm based on self-supervision learning.
Background
Self-supervised learning can train on large amounts of unlabeled data, improves the generalization ability and efficiency of models, and is widely applied in the image, speech, and text fields. Masked image modeling (Masked Image Modeling, MIM) has achieved excellent success in self-supervised visual representation learning, following the success of masked language modeling (Masked Language Modeling, MLM) in natural language processing and the development of vision Transformers. MIM methods learn semantic representations by first masking portions of the input and then predicting their signals, e.g., normalized pixels, discrete tokens, HOG features, deep features, or frequencies, based on the unmasked portions.
MAE (Masked Autoencoders) is a self-supervised MIM method with strong scalability and a simple design. MAE randomly masks portions of the input image and reconstructs the missing pixels. It adopts an asymmetric encoder-decoder structure: the encoder encodes only the visible patches and does not process the mask tokens, while the decoder takes the encoder output and the mask tokens as input to reconstruct the image. Because image data has a lower information density than language data, MAE uses a high mask ratio. However, this results in a heavy computational burden and a slow learning process. Moreover, since random masking selects different patches each time, the model produces different predictions for the same image, introducing high uncertainty. These problems are common to mask image modeling in general.
In summary, a mask image modeling deep learning paradigm based on self-supervised learning is designed to achieve high-precision, high-efficiency mask image modeling.
Disclosure of Invention
Aiming at the shortcomings of current self-supervised mask image modeling algorithms, the invention provides a novel mask image modeling algorithm based on self-supervised learning. An image is first divided into patches, which are randomly partitioned into 4 equal parts; each part in turn serves as the visible patches while the remaining patches serve as mask patches, yielding 4 mask images. The visible patches are fed to the encoder to obtain a latent feature representation, and the encoded visible patches together with the mask patches are fed to the decoder for image reconstruction. The mean absolute error between the predictions of the overlapping mask patches of different mask images obtained from the same image is minimized to enhance the certainty of the model's reconstruction results. The method ranks among the leading mask image modeling methods on open-source datasets.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a mask image modeling algorithm based on self-supervised learning, comprising the steps of:
step S1: the open source data sets mainly used by MIM are ImageNet data set, COCO data set, places365 data set and the like, if training is needed on the data sets, the data format is needed to be prepared to be consistent with the data sets, firstly, the data is converted, the image is scaled to the same size, and normalization processing is carried out;
s2, dividing an image into patches with the same size, randomly adding a mask (high mask ratio) to part of the patches, coding visible patches without the mask as input of a coder (ViT), performing linear projection on the visible patches, adding position embedding, and then sending the visible patches into a transform block to obtain potential feature representation;
step S3, combining the mask patterns and the output of the encoder according to the sequence in the original image, taking the combined mask patterns and the output of the encoder as the input of a decoder (ViT), wherein the last layer of the decoder is linear projection, mapping potential features back to a pixel space, and completing the reconstruction prediction of the whole image;
step S4, for the same image, a plurality of mask images with non-overlapping visible patches can be obtained by adding masks in the step S2, partial same mask patches exist in reconstruction prediction of any two mask images, and the average absolute error of reconstruction results of the same mask patches in different mask images is minimized so as to enhance the certainty of model prediction results;
and S5, calculating a Mean Square Error (MSE) between the reconstructed mask patterns and the original image patterns according to the reconstructed mask patterns, minimizing to optimize a model, wherein the model can directly execute an image reconstruction task, or can use different modules to replace a decoder, and execute a corresponding downstream task after fine adjustment.
In the mask image modeling algorithm based on self-supervised learning, the image data used in S1 is uniformly scaled to 224×224.
The image patches divided in S2 are of size 16×16, so a 224×224 image is divided into 14×14 image patches. These 14×14 patches are randomly partitioned into 4 parts; each part in turn serves as the visible patches while masks are added to the remaining patches. In this way, one original image yields 4 mask images whose visible patches do not overlap and whose mask ratio is 75%. Each mask image represents its position information with a vector t of length 14×14 = 196, in which every element follows a {0,1} binary distribution: 0 denotes a mask patch, 1 denotes a visible patch, and the element index is the patch position. Each mask image thus contains a pair of complementary combinations of the image x: visible patches x_v = x ⊙ t and mask patches x_m = x ⊙ (1 − t), where ⊙ denotes element-wise selection by the binary vector. The random partition of the image patches uses the following strategy:
(1) Initialize a vector d = [0, 1, …, 195] of length 14×14 = 196;
(2) Randomly shuffle the order of the elements in vector d;
(3) Initialize 4 zero vectors t_0, t_1, t_2, t_3 of length 196 for storing the mask image information;
(4) For i = {0, 1, 2, 3}, update t_i as shown in formula (1):
t_i[d[49*i : 49*(i+1)]] = 1 (1)
Thus, one original image yields 4 mask images with a 75% mask ratio whose visible patches do not overlap. Here 75% is the optimal mask ratio: lowering the mask ratio increases redundant image information, while too high a mask ratio leaves too little image information and degrades reconstruction.
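The partition strategy of steps (1) to (4) can be sketched in a few lines of NumPy. This is an illustrative reconstruction rather than the patented implementation; it assumes each of the 4 parts holds 196/4 = 49 shuffled patch indices, so each mask image keeps 25% of the patches visible (a 75% mask ratio):

```python
import numpy as np

def partition_masks(num_patches=196, num_parts=4, seed=0):
    """Randomly split patch indices into 4 complementary visible sets.

    Returns 4 binary vectors t_i of length num_patches: 1 marks a
    visible patch, 0 marks a mask patch, following steps (1)-(4).
    """
    rng = np.random.default_rng(seed)
    d = np.arange(num_patches)          # step (1): d = [0, 1, ..., 195]
    rng.shuffle(d)                      # step (2): random order
    part = num_patches // num_parts     # 196 / 4 = 49 visible patches each
    ts = np.zeros((num_parts, num_patches), dtype=int)  # step (3)
    for i in range(num_parts):          # step (4), formula (1)
        ts[i, d[part * i : part * (i + 1)]] = 1
    return ts

ts = partition_masks()
```

Because the 4 visible sets partition the shuffled index vector d, every patch is visible in exactly one mask image, which is what allows the method to make full use of the image data.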
The encoder uses ViT and takes as input the visible patches whose position elements in vector t equal 1. As in standard ViT, the encoder embeds the patches by linear projection (generating one token per input patch), adds position embeddings, and then processes the embedded sequence with a series of Transformer blocks. Since the encoder processes only the 25% of patches that are visible, a larger encoder can be trained with fewer computational resources and lower hardware cost, and the full use of the image data also greatly reduces the training difficulty of the model.
The decoder in S3 is lightweight and consists of a series of Transformer blocks. Its input is the latent feature representation obtained by passing the visible patches through the encoder, together with the mask tokens (the patches whose position elements in vector t equal 0). The mask token is a shared learnable vector that indicates a missing patch to be predicted, and position embeddings are added. The last layer of the decoder is a linear projection; to facilitate reconstructing the mask patches, its number of output channels equals the number of pixel values in one patch, so each output element is a vector of pixel values representing a patch, which is then reshaped. Using the normalized pixel values of each mask patch as the reconstruction target effectively improves the quality of the feature representation. The decoder and encoder adopt an asymmetric design: the decoder's computation per token is less than 10% of the encoder's, greatly reducing model training time. The encoding and decoding process is shown in formula (2):
y = g(f(x_v) ∪ x_m) (2)
where x_v denotes the visible patches, x_m denotes the mask patches, f(·) denotes the encoder, and g(·) denotes the decoder; the encoded visible patches and the mask patches are aligned according to the original image patch order, with the position information recorded in vector t.
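As a toy illustration of how the decoder input in formula (2) is assembled, the sketch below scatters encoder outputs back to their original positions recorded in t and fills the masked positions with a shared mask token; the function name and shapes are hypothetical, not from the patent:

```python
import numpy as np

def assemble_decoder_input(encoded_visible, mask_token, t):
    """Scatter encoded visible tokens back to their original patch
    positions (t == 1) and fill masked positions (t == 0) with the
    shared mask token."""
    num_patches = len(t)
    seq = np.tile(mask_token.reshape(1, -1), (num_patches, 1))
    seq[t == 1] = encoded_visible   # visible tokens keep original order
    return seq

# toy example: 4 patches, 2 visible, 2-dim tokens
t = np.array([1, 0, 0, 1])
enc = np.array([[1.0, 1.0], [2.0, 2.0]])   # encoder output, visible order
mask_token = np.zeros(2)                   # stands in for the learnable vector
seq = assemble_decoder_input(enc, mask_token, t)
```

In a real model the mask token would be a trained parameter and position embeddings would be added before the Transformer blocks; only the ordering logic is shown here.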
Among the reconstruction results of the 4 different mask images obtained from the same original image in S4, any two of them share 50% of their mask patches. The overlap o_ij is defined as follows:
o_ij = t_i ∩ t_j (3)
where t_i and t_j are the vector representations of mask images i and j; o_ij comprises the positions where t_i and t_j are both 0, which correspond to the overlapping mask patches.
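A small NumPy check of the overlap claim, under the assumed 49-index partition: any two of the 4 mask images each mask 147 patches, and 98 of those positions (50% of the 196 total patches) are masked in both, matching formula (3):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.permutation(196)                   # shuffled patch indices
ts = np.zeros((4, 196), dtype=int)
for i in range(4):
    ts[i, d[49 * i : 49 * (i + 1)]] = 1    # 49 visible patches per image

# overlap o_01 (formula (3)): positions masked (0) in both t_0 and t_1
o_01 = (ts[0] == 0) & (ts[1] == 0)
overlap_ratio = o_01.sum() / 196           # fraction of all patches
```

The count 98 follows directly from the construction: 196 total patches minus the 49 visible in image 0 and the 49 visible in image 1, since the visible sets are disjoint.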
Because the visible patches differ, the reconstruction results of the same mask patches also differ to some extent. The mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to guide the reconstruction of the mask patches and enhance the certainty of the model's predictions. The calculation is shown in formula (4):
L_c = (1/6) Σ_{0≤i<j≤3} (1/|o_ij|) Σ_{k∈o_ij} |p_i[k] − p_j[k]| (4)
where p_i and p_j denote the reconstruction results of mask images i and j, and |o_ij| is the number of overlapping mask patches in o_ij.
The mean square error between the reconstruction results of the mask patches in S5 and the original image patches in pixel space is computed as shown in formula (5):
L_w = (1/|x_m|) Σ_{k∈x_m} ||y_m[k] − x_m[k]||² (5)
where x_m represents the mask patches, y_m represents the reconstruction result of x_m, and |x_m| is the number of mask patches. The loss used to train the final model can be expressed by formula (6):
L_total = L_c + L_w (6)
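The two loss terms and their sum in formulas (4) to (6) can be sketched with toy NumPy tensors; the names L_c and L_w, the per-pair averaging, and the random stand-in data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, pixels = 196, 16 * 16 * 3     # 768 pixel values per patch

# binary vectors t_i for 4 mask images (1 = visible), as in formula (1)
d = rng.permutation(num_patches)
t = np.zeros((4, num_patches), dtype=int)
for i in range(4):
    t[i, d[49 * i : 49 * (i + 1)]] = 1

# toy reconstructions p_i of all patches from the 4 mask images
p = rng.normal(size=(4, num_patches, pixels))

# consistency loss L_c (formula (4)): mean absolute error between the
# reconstructions of mask patches shared by each pair of mask images
pair_losses = []
for i in range(4):
    for j in range(i + 1, 4):
        o_ij = (t[i] == 0) & (t[j] == 0)          # overlap, formula (3)
        pair_losses.append(float(np.abs(p[i][o_ij] - p[j][o_ij]).mean()))
L_c = float(np.mean(pair_losses))

# reconstruction loss L_w (formula (5)): MSE between reconstructed mask
# patches and the original patches, shown here for mask image 0 only
x = rng.normal(size=(num_patches, pixels))        # original patches
masked0 = t[0] == 0
L_w = float(((p[0][masked0] - x[masked0]) ** 2).mean())

L_total = L_c + L_w                               # formula (6)
```

In training, L_w would of course be averaged over all 4 mask images and both terms backpropagated jointly.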
The trained model can directly perform the image reconstruction task, or it can serve as a pre-trained model in which different modules replace the decoder; after fine-tuning, downstream tasks such as classification, object detection, and instance segmentation can be performed.
Compared with the prior art, the invention has the beneficial effects that:
aiming at the defects of the self-supervision mask image modeling algorithm at the present stage, the invention provides a novel mask image modeling algorithm based on self-supervision learning, which is characterized in that an image is firstly divided into patches and randomly divided into 4 equal parts, each patch is used as a visible patch, the rest is used as a mask patch, the 4 mask images are obtained by taking the visible patches as encoder input, potential characteristic representation is obtained by taking the encoded visible patches and the mask patches as encoder input, image reconstruction is carried out, and the average absolute error of the prediction results of the overlapped parts of the mask patches in different mask images obtained by the same image is minimized, so that the certainty of the model reconstruction result is enhanced. By fully utilizing the data, training time and hardware resources are greatly saved, and the mask image modeling method is positioned at the front position on the open source data set.
Detailed Description
The technical solutions in the embodiments of the present invention will now be described clearly and completely in conjunction with the embodiments. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art from them without inventive effort fall within the scope of protection of the present invention.
A mask image modeling algorithm based on self-supervised learning, comprising the steps of:
Step S1: the open-source datasets mainly used for MIM are the ImageNet, COCO, and Places365 datasets, among others; to train on these datasets, the data must be prepared in a format consistent with them: first convert the data, scale the images to the same size, and normalize them;
Step S2: divide an image into patches of equal size and randomly mask part of the patches (with a high mask ratio); the unmasked visible patches serve as the input of the encoder (ViT), which applies a linear projection to the visible patches, adds position embeddings, and then feeds them into Transformer blocks to obtain a latent feature representation;
Step S3: combine the mask tokens and the encoder output in their order in the original image and use them as the input of the decoder (ViT); the last layer of the decoder is a linear projection that maps the latent features back to pixel space, completing the reconstruction prediction of the whole image;
Step S4: for the same image, several mask images with non-overlapping visible patches can be obtained through the masking in step S2; any two of these mask images share some of the same mask patches in their reconstruction predictions, and the mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to enhance the certainty of the model's predictions;
Step S5: compute the mean square error (MSE) between the reconstructed mask patches and the original image patches and minimize it to optimize the model; the trained model can directly perform the image reconstruction task, or different modules can replace the decoder and, after fine-tuning, the model can perform the corresponding downstream tasks.
The image data used in S1 is uniformly scaled to 224×224 size.
The image patches divided in S2 are of size 16×16, so a 224×224 image is divided into 14×14 image patches. These 14×14 patches are randomly partitioned into 4 parts; each part in turn serves as the visible patches while masks are added to the remaining patches. In this way, one original image yields 4 mask images whose visible patches do not overlap and whose mask ratio is 75%. Each mask image represents its position information with a vector t of length 14×14 = 196, in which every element follows a {0,1} binary distribution: 0 denotes a mask patch, 1 denotes a visible patch, and the element index is the patch position. Each mask image thus contains a pair of complementary combinations of the image x: visible patches x_v = x ⊙ t and mask patches x_m = x ⊙ (1 − t), where ⊙ denotes element-wise selection by the binary vector. The random partition of the image patches uses the following strategy:
(1) Initialize a vector d = [0, 1, …, 195] of length 14×14 = 196;
(2) Randomly shuffle the order of the elements in vector d;
(3) Initialize 4 zero vectors t_0, t_1, t_2, t_3 of length 196 for storing the mask image information;
(4) For i = {0, 1, 2, 3}, update t_i as shown in formula (1):
t_i[d[49*i : 49*(i+1)]] = 1 (1)
Thus, one original image yields 4 mask images with a 75% mask ratio whose visible patches do not overlap. Here 75% is the optimal mask ratio: lowering the mask ratio increases redundant image information, while too high a mask ratio leaves too little image information and degrades reconstruction.
The encoder uses ViT and takes as input the visible patches whose position elements in vector t equal 1. As in standard ViT, the encoder embeds the patches by linear projection (generating one token per input patch), adds position embeddings, and then processes the embedded sequence with a series of Transformer blocks. Since the encoder processes only the 25% of patches that are visible, a larger encoder can be trained with fewer computational resources and lower hardware cost, and the full use of the image data also greatly reduces the training difficulty of the model.
The decoder in S3 is lightweight and consists of a series of Transformer blocks. Its input is the latent feature representation obtained by passing the visible patches through the encoder, together with the mask tokens (the patches whose position elements in vector t equal 0). The mask token is a shared learnable vector that indicates a missing patch to be predicted, and position embeddings are added. The last layer of the decoder is a linear projection; to facilitate reconstructing the mask patches, its number of output channels equals the number of pixel values in one patch, so each output element is a vector of pixel values representing a patch, which is then reshaped. Using the normalized pixel values of each mask patch as the reconstruction target effectively improves the quality of the feature representation. The decoder and encoder adopt an asymmetric design: the decoder's computation per token is less than 10% of the encoder's, greatly reducing model training time. The encoding and decoding process is shown in formula (2):
y = g(f(x_v) ∪ x_m) (2)
where x_v denotes the visible patches, x_m denotes the mask patches, f(·) denotes the encoder, and g(·) denotes the decoder; the encoded visible patches and the mask patches are aligned according to the original image patch order, with the position information recorded in vector t.
Among the reconstruction results of the 4 different mask images obtained from the same original image in S4, any two of them share 50% of their mask patches. The overlap o_ij is defined as follows:
o_ij = t_i ∩ t_j (3)
where t_i and t_j are the vector representations of mask images i and j; o_ij comprises the positions where t_i and t_j are both 0, which correspond to the overlapping mask patches.
Because the visible patches differ, the reconstruction results of the same mask patches also differ to some extent. The mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to guide the reconstruction of the mask patches and enhance the certainty of the model's predictions. The calculation is shown in formula (4):
L_c = (1/6) Σ_{0≤i<j≤3} (1/|o_ij|) Σ_{k∈o_ij} |p_i[k] − p_j[k]| (4)
where p_i and p_j denote the reconstruction results of mask images i and j, and |o_ij| is the number of overlapping mask patches in o_ij.
The mean square error between the reconstruction results of the mask patches in S5 and the original image patches in pixel space is computed as shown in formula (5):
L_w = (1/|x_m|) Σ_{k∈x_m} ||y_m[k] − x_m[k]||² (5)
where x_m represents the mask patches, y_m represents the reconstruction result of x_m, and |x_m| is the number of mask patches. The loss used to train the final model can be expressed by formula (6):
L_total = L_c + L_w (6)
The trained model can directly perform the image reconstruction task, or it can serve as a pre-trained model in which different modules replace the decoder; after fine-tuning, downstream tasks such as classification, object detection, and instance segmentation can be performed.
Examples:
the algorithm comprises the following steps:
(1) Image reconstruction task
A 224×224 defect image is input and divided into 14×14 patches of size 16×16. The intact patches serve as visible patches and the defective patches serve as mask patches. The visible patches are input to the encoder, linearly projected, given position embeddings, and then fed into Transformer blocks to obtain a latent feature representation. The mask tokens and the encoded visible patches are combined in their order in the original image and used as the input of the decoder (ViT), whose last layer is a linear projection that maps the latent features back to pixel space; the output is the image with the defective patches reconstructed.
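The patch splitting and reassembly used in this reconstruction example can be sketched as follows; `patchify` and `unpatchify` are hypothetical helper names, assuming 16×16 patches of a 224×224×3 image:

```python
import numpy as np

def patchify(img, p=16):
    """Split an HxWxC image into (H/p * W/p) flattened pxp patches,
    in row-major patch order."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return x

def unpatchify(patches, h=224, w=224, p=16, c=3):
    """Inverse of patchify: reassemble flattened patches into the image."""
    x = patches.reshape(h // p, w // p, p, p, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
    return x

img = np.arange(224 * 224 * 3, dtype=float).reshape(224, 224, 3)
patches = patchify(img)          # 196 patches of 16*16*3 = 768 values
restored = unpatchify(patches)
```

In the defect-repair flow above, the decoder's per-patch pixel predictions would replace the masked rows of `patches` before `unpatchify` produces the output image.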
(2) Image classification task
Among downstream tasks, the image classification task is described as an example: a multi-layer perceptron (MLP) head or a linear layer replaces the decoder, and the classification task is performed after fine-tuning. A 224×224 natural image is input and divided into 14×14 patches of size 16×16; all patches are input to the encoder, which outputs a 196×1024 feature matrix in which each row is the feature vector of one patch. The resulting latent feature representation is fed into the MLP head or linear layer to obtain the classification result.
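A minimal sketch of the classification head described above, with a single linear layer standing in for the MLP head; the mean-pooling choice, the 1000-class output, and the untrained random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy encoder output for one image: 196 patch tokens, 1024-dim each,
# matching the 196x1024 feature matrix described in the text
features = rng.normal(size=(196, 1024))

# hypothetical linear classification head replacing the decoder
num_classes = 1000
W = rng.normal(size=(1024, num_classes)) * 0.01   # toy, untrained weights
b = np.zeros(num_classes)

pooled = features.mean(axis=0)   # average-pool the patch tokens
logits = pooled @ W + b          # linear layer -> class scores
pred = int(np.argmax(logits))    # predicted class index
```

During fine-tuning, W and b (or the MLP head's weights) would be trained with a cross-entropy loss while the pre-trained encoder is updated with a smaller learning rate or kept frozen.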
Downstream tasks such as object detection and instance segmentation are handled similarly; only the module that replaces the decoder differs.
The foregoing is only a preferred embodiment of the present invention, but the scope of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme and inventive concept of the present invention shall be covered by the scope of protection of the present invention.

Claims (6)

1. A mask image modeling algorithm based on self-supervised learning, characterized by comprising the following steps:
Step S1: the open-source datasets mainly used for MIM are the ImageNet, COCO, and Places365 datasets, among others; to train on these datasets, the data must be prepared in a format consistent with them: first convert the data, scale the images to the same size, and normalize them;
Step S2: divide an image into patches of equal size and randomly mask part of the patches (with a high mask ratio); the unmasked visible patches serve as the input of the encoder (ViT), which applies a linear projection to the visible patches, adds position embeddings, and then feeds them into Transformer blocks to obtain a latent feature representation;
Step S3: combine the mask tokens and the encoder output in their order in the original image and use them as the input of the decoder (ViT); the last layer of the decoder is a linear projection that maps the latent features back to pixel space, completing the reconstruction prediction of the whole image;
Step S4: for the same image, several mask images with non-overlapping visible patches can be obtained through the masking in step S2; any two of these mask images share some of the same mask patches in their reconstruction predictions, and the mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to enhance the certainty of the model's predictions;
Step S5: compute the mean square error (MSE) between the reconstructed mask patches and the original image patches and minimize it to optimize the model; the trained model can directly perform the image reconstruction task, or different modules can replace the decoder and, after fine-tuning, the model can perform the corresponding downstream tasks.
2. A mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein the image data used in S1 is uniformly scaled to a 224 x 224 size.
3. The mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein the image patches divided in S2 are of size 16×16, so a 224×224 image is divided into 14×14 image patches; the 14×14 patches are randomly partitioned into 4 parts, each part in turn serving as the visible patches while masks are added to the remaining patches; in this way, one original image yields 4 mask images whose visible patches do not overlap and whose mask ratio is 75%; each mask image represents its position information with a vector t of length 14×14 = 196, in which every element follows a {0,1} binary distribution: 0 denotes a mask patch, 1 denotes a visible patch, and the element index is the patch position; each mask image contains a pair of complementary combinations of the image x: visible patches x_v = x ⊙ t and mask patches x_m = x ⊙ (1 − t), where ⊙ denotes element-wise selection by the binary vector; the random partition of the image patches uses the following strategy:
(1) Initialize a vector d = [0, 1, …, 195] of length 14×14 = 196;
(2) Randomly shuffle the order of the elements in vector d;
(3) Initialize 4 zero vectors t_0, t_1, t_2, t_3 of length 196 for storing the mask image information;
(4) For i = {0, 1, 2, 3}, update t_i as shown in formula (1):
t_i[d[49*i : 49*(i+1)]] = 1 (1)
Thus, one original image yields 4 mask images with a 75% mask ratio whose visible patches do not overlap. Here 75% is the optimal mask ratio: lowering the mask ratio increases redundant image information, while too high a mask ratio leaves too little image information and degrades reconstruction.
The encoder uses ViT and takes as input the visible patches whose position elements in vector t equal 1. As in standard ViT, the encoder embeds the patches by linear projection (generating one token per input patch), adds position embeddings, and then processes the embedded sequence with a series of Transformer blocks. Since the encoder processes only the 25% of patches that are visible, a larger encoder can be trained with fewer computational resources and lower hardware cost, and the full use of the image data also greatly reduces the training difficulty of the model.
4. The mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein the decoder in S3 is lightweight and consists of a series of Transformer blocks; its input is the latent feature representation obtained by passing the visible patches through the encoder, together with the mask tokens (the patches whose position elements in vector t equal 0); the mask token is a shared learnable vector that indicates a missing patch to be predicted, and position embeddings are added; the last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in one patch, so that each output element is a vector of pixel values representing a patch, which is then reshaped; using the normalized pixel values of each mask patch as the reconstruction target effectively improves the quality of the feature representation; the decoder and encoder adopt an asymmetric design, the decoder's computation per token being less than 10% of the encoder's, which greatly reduces model training time; the encoding and decoding process is shown in formula (2):
y = g(f(x_v) ∪ x_m) (2)
where x_v denotes the visible patches, x_m denotes the mask patches, f(·) denotes the encoder, and g(·) denotes the decoder; the encoded visible patches and the mask patches are aligned according to the original image patch order, with the position information recorded in vector t.
5. The mask image modeling algorithm based on self-supervised learning as claimed in claim 1, wherein among the reconstruction results of the 4 different mask images obtained from the same original image in S4, any two of them share 50% of their mask patches, with the overlap o_ij defined as follows:
o_ij = t_i ∩ t_j (3)
where t_i and t_j are the vector representations of mask images i and j; o_ij comprises the positions where t_i and t_j are both 0, which correspond to the overlapping mask patches.
Because the visible patches differ, the reconstruction results of the same mask patches also differ to some extent. The mean absolute error between the reconstruction results of the same mask patches in different mask images is minimized to guide the reconstruction of the mask patches and enhance the certainty of the model's predictions. The calculation is shown in formula (4):
L_c = (1/6) Σ_{0≤i<j≤3} (1/|o_ij|) Σ_{k∈o_ij} |p_i[k] − p_j[k]| (4)
where p_i and p_j denote the reconstruction results of mask images i and j, and |o_ij| is the number of overlapping mask patches in o_ij.
6. The mask image modeling algorithm based on self-supervised learning as set forth in claim 1, wherein the process of calculating the mean square error between the reconstruction result of the mask patches in S5 and the original image patches in the pixel space is as shown in formula (5):
where x_m represents the mask patches and y_m represents the reconstruction of x_m; the loss used to train the final model can be expressed by formula (6):
L_total = L_c + L_w (6)
The trained model can directly perform the image reconstruction task, or it can be used as a pre-trained model: the decoder is replaced by a different module and the model is fine-tuned to perform downstream tasks such as classification, object detection, and instance segmentation.
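The pixel-space MSE of formula (5), restricted to the mask patches as is standard in masked autoencoding (an assumption about the exact averaging), can be sketched as:

```python
import numpy as np

def reconstruction_loss(y, x, t):
    """Formula (5) sketched as pixel-space MSE over the mask patches
    only (t == 0); the exact averaging is an assumption."""
    m = t == 0
    return ((y[m] - x[m]) ** 2).mean()

# Formula (6): the total training loss adds the pairwise consistency
# term to this reconstruction term, L_total = L_c + L_w.
```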
CN202310691621.8A 2023-06-12 2023-06-12 Mask image modeling algorithm based on self-supervision learning Pending CN116843830A (en)

Publications (1)

Publication Number Publication Date
CN116843830A true CN116843830A (en) 2023-10-03

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173182A (en) * 2023-11-03 2023-12-05 厦门微亚智能科技股份有限公司 Defect detection method, system, equipment and medium based on coding and decoding network
CN117173182B (en) * 2023-11-03 2024-03-19 厦门微亚智能科技股份有限公司 Defect detection method, system, equipment and medium based on coding and decoding network
CN118279151A (en) * 2024-06-03 2024-07-02 贵州大学 Self-supervision DW image super-resolution reconstruction method for any scale angle


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination