CN117671072B - Cell fluorescence image generation method based on conditional diffusion model, model and application - Google Patents

Cell fluorescence image generation method based on conditional diffusion model, model and application

Info

Publication number
CN117671072B
CN117671072B CN202410129759.3A
Authority
CN
China
Prior art keywords
image
protein
encoder
cell structure
false
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410129759.3A
Other languages
Chinese (zh)
Other versions
CN117671072A (en)
Inventor
徐莹莹
李玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern Medical University
Original Assignee
Southern Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern Medical University filed Critical Southern Medical University
Priority to CN202410129759.3A priority Critical patent/CN117671072B/en
Publication of CN117671072A publication Critical patent/CN117671072A/en
Application granted granted Critical
Publication of CN117671072B publication Critical patent/CN117671072B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical



Landscapes

  • Image Processing (AREA)

Abstract

The invention relates to a cell fluorescence image generation method based on a conditional diffusion model, to the model, and to its application, in the technical field of bioinformatics. The model comprises a size optimization module, an image generation module, a learning module and a weakening module; the image generation module includes a diffusion model and a generative adversarial network. Using the model, immunofluorescence images of the quantitative expression of proteins at different subcellular locations can be generated, and fidelity and diversity indices can be evaluated simultaneously.

Description

Cell fluorescence image generation method based on conditional diffusion model, model and application
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a cell fluorescence image generation method based on a conditional diffusion model, to the model, and to its application.
Background
The locations of proteins within cells are called subcellular locations, which provide specific physicochemical environments for the proteins' biological functions. The subcellular locations of proteins are not all static: proteins need to move between different subcellular locations to perform their functions, and some proteins have multiple functions that require different physicochemical environments. Proteins present in two or more subcellular structures are known as multi-labeled proteins; statistically, more than half of human proteins are multi-labeled. The quantitative distribution ratio of multi-labeled proteins across different subcellular structures is the basis for studying the dynamic behavior and abnormal expression of proteins.
Immunofluorescence microscopy images are a widely used data source for revealing the spatial distribution of proteins at the subcellular level. Many studies automatically resolve subcellular localization patterns using machine learning methods; however, most perform only qualitative classification, and only a few attempt to identify the quantitative distribution of proteins. The main reason is the lack of quantitatively labeled data, since manual quantitative labeling is costly and inefficient. Several deep-learning-based generative models have been used to model cell structure and protein distribution and to conditionally generate cell images, such as:
(1) U-Net (C. Ounkomol, S. Seshamani, M. M. Maleckar, F. Collman, and G. R. Johnson, "Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy," Nature Methods, vol. 15, no. 11, pp. 917-920, Sep. 2018). Based on the U-Net deep network framework, this model predicts immunofluorescence images of multiple subcellular structures from bright-field images.
(2) cAAE (G. R. Johnson, R. M. Donovan-Maiye, and M. M. Maleckar, "Generative modeling with conditional autoencoders: building an integrated cell," 2017, arXiv:1705.00092). This model builds two concatenated adversarial autoencoders: it first learns representations of the cell nucleus and cell membrane, and then learns the relationship between those representations and the subcellular structures.
(3) β-VAE (R. M. Donovan-Maiye et al., "A deep generative model of 3D single-cell organization," PLoS Comput. Biol., vol. 18, no. 1, p. e1009155, 2022). Building on the former model, this is a three-dimensional subcellular location model based on variational autoencoders; it proposes using the hyperparameter β to balance the representation and the reconstructed image.
(4) BioGAN (A. Osokin, A. Chessel, R. E. Carazo Salas, and F. Vaggi, "GANs for biological image synthesis," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, 2017, pp. 2233-2242). This model is a generative adversarial network that maps latent codes into images.
(5) cGAN (H. Yuan et al., "Computational modeling of cellular structures using conditional deep generative networks," Bioinformatics, vol. 35, no. 12, pp. 2141-2149, Jun. 2019). Referencing the cAAE model, it builds the relationship between the cell nucleus and cell membrane and the subcellular structures based on conditional generative adversarial networks.
However, in the cell images generated by these models the fluorescent spots are focused on single subcellular locations, or only the spatial relationship between subcellular locations and cell structure images is modeled. No existing method can generate immunofluorescence images with quantitative expression at different subcellular locations.
Disclosure of Invention
In view of the above problems, the present invention provides a model based on a conditional diffusion model, with which immunofluorescence images of the quantitative expression of proteins at different subcellular locations can be generated, and fidelity and diversity indices can be evaluated simultaneously.
In order to achieve the above object, the present invention provides a model based on a conditional diffusion model, including a size optimization module, an image generation module, a learning module, and a weakening module;
The image generation module comprises a diffusion model and a generative adversarial network. The generative adversarial network comprises a generator and a discriminator. The generator receives a forward protein image, which is obtained by processing a protein image through the forward noise-adding process of the diffusion model and a convolution layer. The discriminator receives a real image and a false image and discriminates between them; the false image is obtained by processing the false 0-step protein image output by the generator through the reverse sampling process of the diffusion model;
The generator comprises an encoder, a decoder, a self-attention module and a wavelet transform embedding module. The encoder comprises several conventional residual blocks, and the decoder comprises several conventional residual blocks; the self-attention module is arranged between the encoder and the decoder. The wavelet transform embedding module comprises a wavelet transform downsampling network and several frequency bottleneck blocks; the frequency bottleneck blocks are arranged at the rear end of the encoder and the front end of the decoder, respectively. The wavelet transform downsampling network comprises several wavelet transform downsampling layers, which are arranged in correspondence with the conventional residual blocks of the encoder.
In one embodiment, the encoder is configured to input a forward protein image, the decoder is configured to output a false 0-step protein image, the wavelet transform downsampling network is configured to input a forward noise-added protein image, and the forward noise-added protein image is obtained by processing the protein image through a forward noise adding process of the diffusion model.
In one embodiment, the protein image is obtained by optimizing the protein image to be processed through the size optimizing module, and the optimization is realized through a wavelet decomposition method;
The learning module comprises a cell structure coding network, wherein the cell structure coding network comprises a frequency bottleneck block and a plurality of conventional residual blocks, the conventional residual blocks of the cell structure coding network are correspondingly arranged with the conventional residual blocks of the encoder, and the frequency bottleneck block of the cell structure coding network is correspondingly arranged with the frequency bottleneck block of the encoder;
The weakening module comprises a biased marking framework and is used for outputting weakening marks.
The invention also provides a construction method of the model, which comprises the following steps:
Optimizing an image: optimizing a protein image to be processed of a cell picture through the optimizing module to obtain the protein image; optimizing a cell structure image to be processed of the cell picture through the optimizing module to obtain a cell structure image;
Image generation: inputting the protein image into the diffusion model, obtaining a forward noise-added protein image through the forward noise-adding process of the diffusion model, and inputting the forward noise-added protein image into the generator for generation processing to obtain a false 0-step protein image; the false 0-step protein image is processed through the reverse sampling process of the diffusion model to obtain a false image;
Image discrimination: inputting the real image, the false image and the forward noise-added protein image into the discriminator for discrimination, so that the adversarial loss, content loss and mark loss of the generator are minimized and the adversarial loss and mark loss of the discriminator are minimized, thereby obtaining the model.
When the losses are minimized, the model obtained at that point is considered satisfactory, and a false image generated by the model can be regarded as equivalent to a real image.
In one embodiment, the real image is obtained by processing the protein image through a forward noise adding process of the diffusion model.
In one embodiment, the real image is obtained by performing sampling step length t-1 times of noise addition on the protein image through a forward noise adding process of the diffusion model.
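The forward noise-adding process itself is not spelled out in the text; assuming the standard DDPM closed-form corruption, a single jump from a clean image to noise level t can be sketched as follows (function and parameter names are illustrative):

```python
import numpy as np

def forward_noise(x0, t, betas, rng=np.random.default_rng(0)):
    # Hypothetical DDPM-style forward step (assumption, not the patent's text):
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, abar_t = prod_i (1 - beta_i)
    abar_t = np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# toy usage: a 4x4 "protein image" noised to step t = 3
x0 = np.ones((4, 4))
x3 = forward_noise(x0, t=3, betas=np.full(10, 0.02))
```

With all betas equal to zero the image passes through unchanged, which is a quick sanity check of the closed form.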
In one embodiment, the adversarial loss of the generator and the adversarial loss of the discriminator are as follows:

$$L_{adv}^{G}=\mathbb{E}_{x'_{t-1}\sim p_{g}(x'_{t-1})}\left[\left(D(x'_{t-1},x_{t})-1\right)^{2}\right]$$

$$L_{adv}^{D}=\mathbb{E}_{x_{t-1}\sim q(x_{t-1})}\left[\left(D(x_{t-1},x_{t})-1\right)^{2}\right]+\mathbb{E}_{x'_{t-1}\sim p_{g}(x'_{t-1})}\left[D(x'_{t-1},x_{t})^{2}\right]$$

wherein $x_{t-1}$ is the real image, $x'_{t-1}$ is the false image, and $x_{t}$ is the real t-step noisy protein image, i.e. the forward noise-added protein image; $q(x_{t-1})$ denotes the distribution of the real image $x_{t-1}$, and $p_{g}(x'_{t-1})$ denotes the distribution of the false image $x'_{t-1}$; $D(x_{t-1},x_{t})$ and $D(x'_{t-1},x_{t})$ denote the discriminator's predicted true-false scores of the real image and the false image, respectively; the target score of the real image $x_{t-1}$ is 1; the generator part drives the score of the false image $x'_{t-1}$ toward 1; the discriminator part drives the score of the false image $x'_{t-1}$ toward 0.
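The score targets described here (real image toward 1; the generator drives the false image's score toward 1; the discriminator drives it toward 0) admit a least-squares sketch; this is an illustrative reading, not necessarily the patent's exact loss formulation:

```python
import numpy as np

def adv_loss_discriminator(d_real, d_fake):
    # least-squares form (assumption): real scores -> 1, false scores -> 0
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def adv_loss_generator(d_fake):
    # the generator drives the discriminator's score of the false image toward 1
    return np.mean((d_fake - 1.0) ** 2)
```

At the targets themselves both losses vanish, so perfectly scored batches incur zero penalty.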
In one embodiment, the content loss of the generator is as follows:

$$L_{con}^{G}=\mathbb{E}_{x_{0}\sim q(x_{0}),\,x'_{0}\sim p_{g}(x'_{0})}\left[\left\lVert x_{0}-x'_{0}\right\rVert^{2}\right]$$

wherein $x_{0}$ is the true 0-step protein image and $x'_{0}$ is the false 0-step protein image; $q(x_{0})$ denotes the distribution of the true 0-step protein image $x_{0}$, and $p_{g}(x'_{0})$ denotes the distribution of the false 0-step protein image $x'_{0}$.
In one embodiment, the mark loss of the generator and the mark loss of the discriminator are as follows:

$$L_{mark}^{G}=\mathbb{E}_{x'_{t-1}\sim p_{g}(x'_{t-1})}\left[\left\lVert C(x'_{t-1},x_{t})-y_{p}\right\rVert^{2}\right]$$

$$L_{mark}^{D}=\mathbb{E}_{x_{t-1}\sim q(x_{t-1})}\left[\left\lVert C(x_{t-1},x_{t})-y_{p}\right\rVert^{2}\right]+\mathbb{E}_{x'_{t-1}\sim p_{g}(x'_{t-1})}\left[\left\lVert C(x'_{t-1},x_{t})-\mathbf{0}\right\rVert^{2}\right]$$

wherein $x_{t-1}$ is the real image, $x'_{t-1}$ is the false image, and $x_{t}$ is the real t-step noisy protein image, i.e. the forward noise-added protein image; $q(x_{t-1})$ and $p_{g}(x'_{t-1})$ denote the distributions of the real image and the false image; $C(x_{t-1},x_{t})$ and $C(x'_{t-1},x_{t})$ denote the discriminator's predicted marks of the real image and the false image, respectively; the target mark of the real image $x_{t-1}$ is $y_{p}$; the generator part drives the mark of the false image $x'_{t-1}$ toward $y_{p}$; the discriminator part drives the mark of the false image $x'_{t-1}$ toward $\mathbf{0}$, an all-zero vector of the same size as $y_{p}$.
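Similarly, the mark targets stated here (the real image's predicted mark toward y_p; the generator drives the false image's mark toward y_p; the discriminator drives it toward the all-zero vector) can be sketched in mean-squared-error form, again as an illustrative assumption:

```python
import numpy as np

def mark_loss_generator(c_fake, y_p):
    # drive the predicted mark of the false image toward the weakening mark y_p
    return np.mean((c_fake - y_p) ** 2)

def mark_loss_discriminator(c_real, c_fake, y_p):
    # real image's mark -> y_p; false image's mark -> all-zero vector of the same size
    return np.mean((c_real - y_p) ** 2) + np.mean(c_fake ** 2)
```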
The invention also provides a cell fluorescence image generation method based on the conditional diffusion model, which is realized by the model and comprises the following steps of:
Optimizing an image: optimizing a protein image to be processed of a cell picture through the optimizing module to obtain the protein image; optimizing a cell structure image to be processed of the cell picture through the optimizing module to obtain a cell structure image;
Image generation: inputting the protein image into the diffusion model, obtaining a forward noise-added protein image through the forward noise-adding process of the diffusion model, and inputting the forward noise-added protein image into the generator for generation processing to obtain a false 0-step protein image; and processing the false 0-step protein image through the reverse sampling process of the diffusion model to obtain a false image.
In one embodiment, the optimizing comprises the steps of: decomposing the protein image to be processed or the cell structure image to be processed through wavelet transformation to obtain a decomposed image, and combining the decomposed images according to the channel dimension to obtain a protein image or a cell structure image.
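A minimal sketch of this optimization step, assuming a one-level Haar wavelet (the text does not name the wavelet basis): a 256×256 single-channel image becomes four 128×128 sub-bands stacked along the channel dimension.

```python
import numpy as np

def haar_decompose(img):
    # One-level Haar DWT: an HxW image -> four (H/2)x(W/2) sub-bands
    # (LL, LH, HL, HH) stacked along a new channel dimension.
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-frequency approximation
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return np.stack([ll, lh, hl, hh], axis=0)

protein = np.random.default_rng(0).random((256, 256))
x0 = haar_decompose(protein)   # 256x256 with 1 channel -> 4 channels of 128x128
```

On a constant image the three detail bands are exactly zero, a quick check that only the LL band carries the mean.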
In one embodiment, the generating process includes the steps of:
encoding: the forward noise-added protein image is processed by a convolution layer to obtain the forward protein image, and the forward protein image is input into a first conventional residual block of the encoder to obtain a conventional residual result of the encoder; inputting the forward noise-added protein graph into a wavelet transform downsampling layer corresponding to a first conventional residual block to obtain a wavelet transform result; adding the encoder conventional residual result and the wavelet transform result, feeding the encoder conventional residual result into a conventional residual block of a next encoder, and feeding the wavelet transform result into a next wavelet transform downsampling layer; repeating the encoding process step to a frequency bottleneck block feeding the encoder;
Wavelet transformation: the frequency bottleneck block of the encoder carries out wavelet transform decomposition and wavelet inverse transform reduction on feed-in data to obtain reduction characteristics, the reduction characteristics are input into a full-connection layer to obtain protein subcellular position characteristics, the protein subcellular position characteristics, cell structure characteristics and weakening marks are combined and then sequentially input into the full-connection layer, the self-attention module and the frequency bottleneck block of the decoder to carry out wavelet transform decomposition and wavelet inverse transform reduction, and the frequency bottleneck block is input into the decoder to obtain false 0-step protein images.
In one embodiment, inputting into the decoder comprises: inputting the output of the frequency bottleneck block of the decoder into the first conventional residual block of the decoder to obtain a decoder conventional residual result, and inputting that result into the next conventional residual block of the decoder; this is repeated until the last conventional residual block of the decoder is fed, and the decoder conventional residual result of the last conventional residual block is processed by a convolution layer to obtain the false 0-step protein image.
In one embodiment, the preparation method of the cell structure feature comprises the following steps: inputting the cell structure image into a conventional residual block of the cell structure coding network to obtain a cell structure coding network conventional residual result, and inputting that result into the next conventional residual block of the cell structure coding network; this is repeated until the result is input into the frequency bottleneck block of the cell structure coding network to obtain the cell structure feature;
The preparation method of the weakening mark comprises the following steps: the biased marking framework sets a candidate mark set for the protein image, the candidate mark set being S i = { [0.25, 0.75], [0.5, 0.5], [0.75, 0.25] }, and the marks of the protein image are screened through the biased marking framework to obtain the weakening mark. The screening comprises the following steps: when the protein image is a single-mark image, the single mark of the image is used as the weakening mark; when the protein image is a double-mark image, any candidate mark is selected from the candidate mark set as the weakening mark.
In one embodiment, the decoder conventional residual result is data adjusted by the latent code feature map and the sampling feature map of the decoder.
In one embodiment, the preparation method of the latent code feature map and the sampling feature map includes the following steps: extracting a latent code from normal distribution, extracting a sampling step length from a sampling step number value range of the diffusion model, respectively inputting the latent code and the sampling step length into an encoder, a cell structure coding network and a decoder, and obtaining a latent code characteristic diagram and a sampling characteristic diagram of the encoder, a latent code characteristic diagram and a sampling characteristic diagram of the cell structure coding network and a latent code characteristic diagram and a sampling characteristic diagram of the decoder through full-connection layer coding.
In one embodiment, the adjusting comprises: adding the output of the conventional residual block of the encoder with the latent code feature map and the sampling feature map of the encoder to obtain a conventional residual result of the encoder;
Adding the output of the conventional residual block of the cell structure coding network with the latent code feature map and the sampling feature map of the cell structure coding network to obtain a conventional residual result of the cell structure coding network;
And adding the output of the conventional residual block of the decoder with the latent code feature map and the sampling feature map of the decoder to obtain the conventional residual result of the decoder.
In one embodiment, a skip connection is employed between the encoder and the decoder, the skip connection comprising: and carrying out feature fusion on the conventional residual result of the cell structure coding network and the conventional residual result of the corresponding encoder.
The invention also provides application of the model in generating a fluorescence image of cells.
Compared with the prior art, the invention has the following beneficial effects:
According to the cell fluorescence image generation method based on the conditional diffusion model, the model and the application, provided by the invention, immunofluorescence images of quantitative expression of proteins at different subcellular positions can be generated by using the model, and the indexes of fidelity and diversity can be simultaneously evaluated.
Drawings
FIG. 1 is a flow chart of a method for generating a cell fluorescence image based on a conditional diffusion model in an embodiment;
FIG. 2 is an immunofluorescence image generated using each depth generation model in the example.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The source is as follows:
The reagents, materials and equipment used in the examples are all commercially available sources unless otherwise specified; the experimental methods are all routine experimental methods in the field unless specified.
Examples
1. A dataset is constructed.
Immunofluorescence images containing only qualitative markers were taken from a human protein atlas database (in this example, the online database The Human Protein Atlas). The single-cell dataset obtained after single-cell segmentation was randomly divided into three parts, training, validation and testing, at a ratio of 7:1:2. The single-cell images of the training part serve as the training set; for the validation and testing parts, a pixel-based image fusion algorithm (in this embodiment, that of M.-Q. Xue, X.-L. Zhu, G. Wang, and Y.-Y. Xu, "DULoc: quantitatively unmixing protein subcellular location patterns in immunofluorescence images based on deep learning features," Bioinformatics, vol. 38, no. 3, pp. 827-833, 2022) is used to obtain a validation set and a test set with quantitative marks. The training set, validation set and test set together form the dataset, whose detailed information is shown in the table below.
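The 7:1:2 random split described above can be sketched as:

```python
import random

def split_dataset(cell_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    # randomly divide single-cell image ids into train/validation/test at 7:1:2
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_dataset(range(1000))
```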
In the present invention, each single-cell picture in the dataset contains three images: a protein image, a nucleus image, and an endoplasmic reticulum image. The markers describe the protein image, not subcellular structures such as the nucleus or endoplasmic reticulum. Qualitative markers refer to the position of a protein in an immunofluorescence image at one or several subcellular structures (in this example, at most 2 subcellular structures), while quantitative markers refer to the expression levels of a protein at those subcellular structures (in this example, expression levels are described as proportions); quantitative markers therefore subsume qualitative markers.
Table 1 detailed information of immunofluorescence image dataset
2. And constructing a cell fluorescence image generation method based on a conditional diffusion model.
In this embodiment, a conditional diffusion model is constructed, and training, validation and testing are performed on the dataset with a picture size of 256×256. The conditional diffusion model contains 4 parts: a wavelet decomposition strategy that reduces picture size; a conditional diffusion generative adversarial network that guides image generation; a cell structure encoding network that assists cell morphology learning; and a biased marking framework that attenuates the effect of missing marks (as shown in FIG. 1).
In this embodiment, the immunofluorescence images in the training set have only qualitative marks, while the images in the validation set and the test set have both qualitative and quantitative marks, the latter constructed by the inventors with a pixel-based image fusion algorithm. In the training stage, all marks input to the model are weakening marks y p, obtained by processing the qualitative marks of the training set (which are taken directly from the database) through the biased marking framework; in this embodiment, the mark processed by the biased marking framework is denoted y p. In the validation and test stages, the marks of the validation set and test set are not processed by the biased marking framework.
1. The wavelet decomposition strategy reduces the picture size.
Picture size affects network depth, and therefore GPU memory usage and training speed: the larger the picture, the deeper the network, and the more GPU memory is occupied. To reduce memory usage, the pictures input to the conditional diffusion generative adversarial network are decomposed by wavelet transform to obtain pictures of smaller size.
In this embodiment, the protein image of each single-cell picture in the dataset has size 256×256 and 1 channel (the channel dimension is one of the dimensions describing image size; in this embodiment each picture in the dataset is an RGB color image whose 3 channels are the protein image, the nucleus image and the endoplasmic reticulum image, so the protein image itself has 1 channel). The image is decomposed by wavelet transform into four 128×128 images according to a base-2 downsampling method, and the four images are combined in the channel dimension into a protein image x 0 of size 128×128 with 4 channels, which is input into the conditional diffusion generative adversarial network.
The nucleus image and the endoplasmic reticulum image of each single-cell picture in the dataset each have size 256×256 and 1 channel. Wavelet transform is applied to each, and the decomposed pictures are combined in the channel dimension to obtain a nucleus image and an endoplasmic reticulum image each of size 128×128 with 4 channels; the two are then combined along the channel dimension into a cell structure image X r of size 128×128 with 8 channels, which is input into the cell structure encoding network.
2. Conditional diffusion generation counter-network guidance image generation.
Because the conditional diffusion model suffers from overly long test times caused by its many sampling steps, we propose using a conditional diffusion generative adversarial network as the basic framework to help reduce the number of sampling steps.
In this embodiment, the conditional diffusion generative adversarial network is divided into two parts: a diffusion model and a generative adversarial network (GAN). The diffusion model comprises a forward noise-adding process and a reverse sampling process, which do not involve the network structure; the GAN comprises a generator G and a discriminator D. The protein image x 0 passes through the forward noise-adding process to obtain a forward noise-added image x t, which is processed by a convolution layer to obtain the forward protein image; the forward protein image is input into the generator G, and the output of the generator G is input into the discriminator D after the reverse sampling process.
The generator G is a U-Net architecture derived from the network design of the denoising diffusion model DDPM. Unlike a conventional U-Net, whose encoder and decoder consist of several convolution layers, the encoder and decoder of the generator's U-Net use conventional residual blocks, adopting the residual block of the ResNet design. The conventional residual blocks of the encoder and decoder correspond one-to-one, in reverse order of the processing of the input data (e.g., the last conventional residual block of the encoder corresponds to the first conventional residual block of the decoder). Furthermore, the U-Net of the present application adds a channel-based self-attention module between the last conventional residual block of the encoder and the corresponding conventional residual block of the decoder (i.e., the self-attention module is disposed between the encoder and the decoder); it can be appreciated that the self-attention module follows a conventional prior-art design.
Meanwhile, a wavelet transform embedding module is added to help the network learn the high-frequency components of the image. The module has two parts: frequency bottleneck blocks and a wavelet transform downsampling network. A frequency bottleneck block contains residual blocks based on the wavelet transform. In this embodiment the wavelet transform embedding module contains two frequency bottleneck blocks, one for the encoder and one for the decoder: the encoder frequency bottleneck block is appended at the very end of the encoder (after its last conventional residual block), and the decoder frequency bottleneck block is placed at the very front of the decoder (before its first conventional residual block, which corresponds to the last conventional residual block of the encoder).
Taking the encoder frequency bottleneck block as an example: the output of the last conventional residual block of the encoder (after adjustment by the latent-code feature map and the sampling feature map) is first summed with the output of the last wavelet transform downsampling layer; the sum is decomposed by the wavelet transform into a low-frequency component and a high-frequency component; the low-frequency component passes through the wavelet-based residual block; its output is then recombined with the high-frequency component by the inverse wavelet transform into a feature (the decomposition, wavelet-based residual processing, and inverse-transform reconstruction are all performed inside the encoder frequency bottleneck block). The reconstructed feature is fed into a fully connected layer to obtain the protein subcellular location feature z_s. This feature is merged with the cell structure feature z_r (the output of the frequency bottleneck block of the cell structure coding network after a fully connected layer; the cell structure coding network has the same design as the encoder, consisting of several conventional residual blocks and one frequency bottleneck block) and the label y_p; the merged vector passes through a fully connected layer and the channel-based self-attention module and is then fed into the decoder frequency bottleneck block. The frequency bottleneck block at the front of the decoder is designed identically to the encoder's: after wavelet decomposition, wavelet-based residual processing, and inverse-transform reconstruction, its output is fed into the first conventional residual block of the decoder.
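A minimal NumPy sketch of the frequency bottleneck idea follows, assuming a single-level Haar wavelet (the patent does not name the wavelet basis) and a placeholder residual mapping `residual_fn`: the input is decomposed, only the low-frequency band is processed, and the inverse transform restores the feature with the high-frequency bands untouched:

```python
import numpy as np

def haar_decompose(x):
    """Single-level 2-D Haar transform: (H, W) -> low-frequency LL and
    high-frequency (LH, HL, HH) sub-bands, each of shape (H/2, W/2)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

def haar_reconstruct(ll, highs):
    """Inverse of haar_decompose (exact reconstruction)."""
    lh, hl, hh = highs
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def frequency_bottleneck(x, residual_fn):
    """Decompose, process only the low-frequency band with a residual
    mapping, then invert; high frequencies pass through unchanged."""
    ll, highs = haar_decompose(x)
    return haar_reconstruct(residual_fn(ll), highs)

x = np.arange(64, dtype=float).reshape(8, 8)
y = frequency_bottleneck(x, lambda f: f)  # identity residual mapping
```

With an identity residual mapping the block reduces to the identity, which checks that the transform pair is lossless.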
The wavelet transform downsampling network consists of several wavelet transform downsampling layers, one for each conventional residual block of the encoder. The input to the first downsampling layer is the real t-step noisy protein image x_t, i.e., the protein image x_0 after the forward noising process of the diffusion model (x_0 is the protein image taken directly from the database and processed by the wavelet decomposition strategy).
Each wavelet transform downsampling layer works as follows. The input (for the first layer, the forward noisy image x_t obtained from x_0 by the forward noising process) is decomposed by the wavelet transform into a low-frequency component and a high-frequency component; the two components are concatenated along the channel dimension and passed through a convolution layer (this convolution layer is internal to the wavelet transform downsampling layer; it is unrelated to the replacement of the U-Net encoder/decoder convolution layers by conventional residual blocks and is an additional structure on top of them). The layer's output is fed to the next wavelet transform downsampling layer and, at the same time, added to the output of the corresponding encoder residual block (adjusted by the latent-code feature map and the sampling feature map) before entering the next encoder residual block, and so on up to the last conventional residual block of the encoder. The output of the last downsampling layer is added to the output of the last conventional residual block of the encoder and fed into the encoder frequency bottleneck block.
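The per-layer computation described above can be sketched in NumPy as follows, again assuming a Haar wavelet and modeling the internal convolution as a 1x1 channel-mixing matrix `conv_w` (a hypothetical stand-in for a learned convolution):

```python
import numpy as np

def wavelet_downsample_layer(x, conv_w):
    """One wavelet-transform downsampling layer, per the text: Haar-decompose
    each input channel, stack LL/LH/HL/HH along the channel axis (4x the
    channels, half the spatial size), then mix channels with a 1x1
    convolution given by conv_w of shape (C_out, 4*C_in)."""
    def haar(ch):  # single-level 2-D Haar per channel
        a, b = ch[0::2, 0::2], ch[0::2, 1::2]
        c, d = ch[1::2, 0::2], ch[1::2, 1::2]
        return np.stack([(a + b + c + d) / 2, (a - b + c - d) / 2,
                         (a + b - c - d) / 2, (a - b - c + d) / 2])
    bands = np.concatenate([haar(ch) for ch in x])   # (4*C_in, H/2, W/2)
    return np.einsum('oc,chw->ohw', conv_w, bands)   # 1x1 conv channel mix

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 16, 16))    # stand-in for the noisy protein image x_t
conv_w = rng.standard_normal((6, 12)) * 0.1
y = wavelet_downsample_layer(x, conv_w)
```

Each layer halves the spatial resolution, matching the stride of the corresponding encoder stage so the two outputs can be added.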
In the skip connections between encoder and decoder, the output features of each conventional residual block of the cell structure coding network (adjusted by that network's latent-code and sampling feature maps) are first fused with the output features of the corresponding encoder residual block (adjusted by the encoder's latent-code and sampling feature maps). Fusion proceeds as follows: the cell-structure features are fed into a convolution layer with a 1x1 kernel (this layer serves only to fuse the two streams and belongs to neither the cell structure coding network nor the encoder); the convolution output is multiplied with its own input to obtain adjusted features; these are added to the output features of the corresponding encoder residual block; the sum is then combined with the output features of the corresponding decoder residual block (adjusted by the decoder's latent-code and sampling feature maps) through a conditionally gated skip connection.
The gated skip connection works as follows: the weakened label y_p corresponding to the protein image (a label weakened by the partial-label framework) is input to a fully connected layer, which encodes it into a label feature map. The summed output features of the encoder and cell structure coding network are adjusted by this label feature map, combined with the output features of the corresponding decoder residual block, and fed into the next conventional residual block of the decoder.
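The fusion and gating steps above can be sketched in NumPy. All weights (w_gate for the 1x1 convolution, w_label for the label encoder) and the exact form of the label adjustment (a channel-wise scaling here) are illustrative assumptions, since the patent specifies the data flow but not the precise operations:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gated_skip(enc_feat, cell_feat, dec_feat, w_gate, w_label, y_p):
    """Conditionally gated skip connection, sketched from the text:
    1) a 1x1 conv (w_gate) on the cell-structure features yields a weight
       map that multiplies its own input;
    2) the result is added to the encoder features;
    3) the sum is adjusted by a label feature map encoded from the weakened
       label y_p by a fully connected layer (w_label);
    4) the adjusted sum is concatenated with the decoder features.
    Feature shapes are (C, H, W)."""
    weight_map = sigmoid(np.einsum('oc,chw->ohw', w_gate, cell_feat))
    fused = weight_map * cell_feat + enc_feat
    label_map = sigmoid(w_label @ y_p)            # (C,) label feature map
    adjusted = fused * label_map[:, None, None]
    return np.concatenate([adjusted, dec_feat], axis=0)

rng = np.random.default_rng(2)
C, H, W = 4, 8, 8
enc, cell, dec = (rng.standard_normal((C, H, W)) for _ in range(3))
w_gate = rng.standard_normal((C, C)) * 0.1
w_label = rng.standard_normal((C, 2)) * 0.1
y_p = np.array([0.25, 0.75])                      # a weakened quantitative label
out = gated_skip(enc, cell, dec, w_gate, w_label, y_p)
```

The output doubles the channel count, as in a standard U-Net concatenation skip.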
In this embodiment, the specific image-generation procedure of the conditional diffusion generative adversarial network is as follows:
First, consider the protein image x_0 (the protein image taken directly from the database and processed by the wavelet decomposition strategy). The diffusion model requires a maximum number of sampling steps N to be set in advance. For each input image x_0, an integer is drawn uniformly from the range [0, N] as the sampling step t, and Gaussian noise is added to x_0 for t rounds to obtain the real t-step noisy protein image x_t (i.e., the forward noisy protein image). When t = 0 no Gaussian noise is added, so x_0 is also called the true 0-step protein image. The i-th added Gaussian noise has variance β_i and mean sqrt(1 - β_i) * x_{i-1}, where x_{i-1} is the real (i-1)-step noisy protein image and β_i is a manually set value, i = 1, ..., t; see formula (1).
Since the true 0-step protein image x_0 and the variance and mean of each added Gaussian noise are known, the real t-step noisy protein image x_t (i.e., the forward noisy protein image) can be obtained directly from formula (1).
In formula (1), x_0 is the protein image before noising, also called the true 0-step protein image, and x_t is the real t-step noisy protein image; Noise is noise randomly drawn from a standard normal distribution, and t is the sampling step; c = 0.1, b = 20.
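The forward noising process can be sketched in NumPy as below. The patent's formula (1) image is not reproduced in the text, so the β schedule here (a linear schedule built from the stated constants c = 0.1 and b = 20, scaled by the step count) is an assumption; the stated per-step mean and variance are followed exactly, and the standard closed form for sampling x_t in one shot is included:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # maximum number of sampling steps
# Hypothetical linear schedule built from the stated constants c=0.1, b=20
betas = (0.1 + np.linspace(0.0, 1.0, N) * (20.0 - 0.1)) / N

def forward_noise(x0, t):
    """Iterative forward noising: x_i = sqrt(1-beta_i)*x_{i-1} + sqrt(beta_i)*noise,
    i.e. the i-th added Gaussian noise has mean sqrt(1-beta_i)*x_{i-1}
    and variance beta_i."""
    x = np.asarray(x0, dtype=float).copy()
    for i in range(t):
        x = np.sqrt(1.0 - betas[i]) * x + np.sqrt(betas[i]) * rng.standard_normal(x.shape)
    return x

def forward_noise_closed_form(x0, t):
    """Equivalent one-shot sampling of the real t-step noisy image:
    x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*Noise, with a_bar = prod(1-beta_i)."""
    a_bar = np.prod(1.0 - betas[:t])
    return np.sqrt(a_bar) * np.asarray(x0, float) \
        + np.sqrt(1.0 - a_bar) * rng.standard_normal(np.shape(x0))

x0 = np.ones((4, 4))                       # stand-in for the true 0-step protein image
x_t = forward_noise(x0, N)
```

At t = N the cumulative product of (1 - β_i) is nearly zero, so x_t is close to pure standard Gaussian noise, which is the intended endpoint of the forward process.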
Next, a 100-dimensional vector is randomly drawn from a standard normal distribution as the latent code z, which, together with the sampling step t, is input simultaneously to the cell structure coding network, the encoder, and the decoder. The cell structure image X_r (the nucleus and endoplasmic reticulum images taken directly from the database and processed by the wavelet decomposition strategy) passes through a convolution layer (which preprocesses X_r before the cell structure coding network) and is then fed, together with z and t, into the cell structure coding network. Separate fully connected layers encode z and t into the latent-code feature map and the sampling feature map of the cell structure coding network, one pair per conventional residual block of that network. The output features of each conventional residual block of the cell structure coding network are added to the corresponding latent-code and sampling feature maps and fed to the next conventional residual block; the final features, output by the frequency bottleneck block of the cell structure coding network, pass through a fully connected layer to give the cell structure feature z_r. At the same time, the real t-step noisy protein image x_t is input to the wavelet transform downsampling network; in addition, x_t passes through a convolution layer (which preprocesses x_t before the encoder) whose output is input to the encoder together with z and t.
For each conventional residual block of the encoder, z and t are encoded, as in the cell structure coding network, into the encoder's latent-code feature map and sampling feature map, which adjust the output features of the block. The adjusted features are added to the output of the corresponding wavelet transform downsampling layer before entering the next conventional residual block; the downsampling layer's output is also passed on to the next wavelet transform downsampling layer. The final features, output by the encoder frequency bottleneck block, pass through a fully connected layer to give the protein subcellular location feature z_s.
Finally, the cell structure feature z_r, the protein subcellular location feature z_s, and the weakened label y_p are merged, passed through a fully connected layer and the channel-based self-attention module, input to the decoder corresponding to the encoder, and decoded through a convolution layer into the false 0-step protein image x'_0.
In addition, in the skip-connection part, the output features of each conventional residual block of the decoder are added to the decoder's encoded latent-code and sampling feature maps and then combined with the encoder-side adjusted feature map before entering the next conventional residual block of the decoder. (The encoder-side adjusted feature map is obtained by adding the output features of each layer of the cell structure coding network, adjusted by that network's latent-code and sampling feature maps, to the output features of the corresponding encoder residual block, adjusted by the encoder's latent-code and sampling feature maps, and then adjusting the sum with the label feature map.)
The false 0-step protein image x'_0 output by generator G then undergoes one step of the reverse sampling process to give the false (t-1)-step noisy protein image, i.e., the false image x'_{t-1} that is input to discriminator D; this process is expressed by formula (2).
In formula (2), x'_{t-1} is the false image, x_0 is the true 0-step protein image, and x_t is the real t-step noisy protein image; Noise is noise randomly drawn from a standard normal distribution, and t is the sampling step; c = 0.1, b = 20.
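Since the formula (2) image is not reproduced in the text, the reverse step can be sketched with the standard DDPM posterior q(x_{t-1} | x_t, x_0), substituting the generator's false 0-step image for x_0; this is a stand-in under that assumption, not necessarily the patent's exact expression:

```python
import numpy as np

def reverse_sample(x0_hat, x_t, t, betas, rng):
    """One reverse-sampling step: sample x_{t-1} from the standard DDPM
    posterior q(x_{t-1} | x_t, x_0), with the generator's false 0-step
    image x0_hat in place of x_0. t is 1-based; betas is the forward
    noise schedule."""
    alphas = 1.0 - betas
    a_bar_t = np.prod(alphas[:t])
    a_bar_prev = np.prod(alphas[:t - 1])       # equals 1 when t == 1
    beta_t = betas[t - 1]
    mean = (np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)) * x0_hat \
         + (np.sqrt(alphas[t - 1]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)) * x_t
    var = beta_t * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    noise = rng.standard_normal(x_t.shape) if t > 1 else 0.0
    return mean + np.sqrt(var) * noise         # the false image x'_{t-1}

rng = np.random.default_rng(3)
betas = np.linspace(1e-4, 0.02, 10)            # hypothetical schedule
x0_hat = np.zeros((4, 4))                      # generator's false 0-step image
x_t = rng.standard_normal((4, 4))
x_prev = reverse_sample(x0_hat, x_t, 5, betas, rng)
```

At t = 1 the posterior variance is zero and the step returns x0_hat itself, which is the expected degenerate case.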
Discriminator D is a fully convolutional network that distinguishes real pictures from false ones and predicts picture labels. Its inputs are the real image x_{t-1} (the real (t-1)-step noisy protein image obtained by adding noise t-1 times to the protein image x_0, which comes directly from the database and is processed by the wavelet decomposition strategy) and the false image x'_{t-1} (obtained by reverse sampling from the generator's output). The real image x_{t-1} and the false image x'_{t-1} are each input to the discriminator together with the real t-step noisy protein image x_t (i.e., the forward noisy protein image) and the sampling step t.
The real image x_{t-1} (or the false image x'_{t-1}) is concatenated with x_t along the channel dimension and fed into the first convolution layer of the discriminator; separate fully connected layers encode the sampling step t into a sampling feature map for each convolution layer of the discriminator. The output features of each convolution layer are added to the corresponding sampling feature map before entering the next convolution layer. The output of the last convolution layer passes through a fully connected layer to give the feature map f for the real image and f' for the false image. Two further fully connected layers encode f and f' into the predicted real/fake scores and predicted labels of the real and false images; the predicted scores are used to compute the adversarial loss, and the predicted labels to compute the label loss.
The conditional diffusion generative adversarial network uses two loss functions: an adversarial loss L_adv (3) and a content loss L_content (4). The adversarial loss L_adv evaluates the distance between the distribution of the false image x'_{t-1} and that of the real image x_{t-1}; the content loss L_content measures the pixel-level similarity between the false 0-step protein image x'_0 and the true 0-step protein image x_0.
The adversarial loss L_adv is split into a generator part L_adv(G) and a discriminator part L_adv(D). L_adv(G) computes the distance between the predicted real/fake score of the false image x'_{t-1} and its target score. L_adv(D) computes the sum of the distance between the predicted score of the real image x_{t-1} and its target score (equal to 1) and the distance between the predicted score of the false image x'_{t-1} and its target score. Because generator G and discriminator D play an adversarial game against each other, the target score of the false image differs between L_adv(G) and L_adv(D): in L_adv(G) it equals 1, while in L_adv(D) it equals 0. Both the generator part and the discriminator part of the adversarial loss aim to minimize their loss value.
In formulas (3): x_{t-1} is the real image, x'_{t-1} is the false image, and x_t is the real t-step noisy protein image; the distribution of x_{t-1} (i.e., the real-image distribution) and the distribution of x'_{t-1} appear as the expectation measures; the discriminator's predicted real/fake scores are computed for the real and false images respectively; the target score of the real image equals 1, the generator-part target score of the false image equals 1, and the discriminator-part target score of the false image equals 0.
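The target-score scheme above can be made concrete with a short sketch. The patent says only "distance", so the squared distance (LSGAN-style) used here is an assumption:

```python
import numpy as np

def l_adv_g(fake_scores):
    """Generator part: distance between the predicted scores of false
    images and target score 1 (squared distance is an assumption)."""
    return np.mean((fake_scores - 1.0) ** 2)

def l_adv_d(real_scores, fake_scores):
    """Discriminator part: real images target 1, false images target 0."""
    return np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2)

real = np.array([0.9, 0.8])   # discriminator scores on real images
fake = np.array([0.2, 0.1])   # discriminator scores on false images
g_loss = l_adv_g(fake)        # the generator wants fake scores near 1
d_loss = l_adv_d(real, fake)  # the discriminator wants them near 0
```

A perfect discriminator (real scores 1, fake scores 0) drives its own loss to zero while maximally penalizing the generator, illustrating the opposing targets.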
In formula (4), x_0 is the true 0-step protein image and x'_0 is the false 0-step protein image; the expectations are taken over the distributions of x_0 and x'_0 respectively.
3. The cell structure encoding network assists in cell morphology learning.
To help generator G learn the cell structure information, a cell structure coding network with the same structure as the encoder is designed to learn from the cell structure image X_r; its conventional residual blocks and frequency bottleneck block correspond one-to-one with those of the encoder. X_r is preprocessed by a convolution layer and input to the cell structure coding network. The output features of each conventional residual block of the cell structure coding network, adjusted by that network's latent-code and sampling feature maps, are fed to the next conventional residual block and, at the same time, fused with the output features of the corresponding encoder residual block (adjusted by the encoder's latent-code and sampling feature maps). The output of the last conventional residual block of the cell structure coding network, after the same adjustment, is input to the frequency bottleneck block of the cell structure coding network, and is also fused with the output features of the last conventional residual block of the encoder (adjusted by the encoder's latent-code and sampling feature maps).
The specific feature fusion method is as follows: the adjusted output features of a conventional residual block of the cell structure coding network are fed into a convolution layer with a 1x1 kernel, which learns a weight feature map of the same size as its input; the weight map output by the convolution layer is multiplied with the convolution layer's input to give the adjusted residual-block features; these are added to the output features of the corresponding encoder residual block (adjusted by the encoder's latent-code and sampling feature maps); finally, the summed features are combined, via the skip connection, with the output features of the corresponding decoder residual block (adjusted by the decoder's latent-code and sampling feature maps).
In addition, after the last conventional residual block feeds the frequency bottleneck block of the cell structure coding network, the outputs of the frequency bottleneck blocks of the cell structure coding network and of the encoder each pass through a fully connected layer to give the cell structure feature z_r and the protein subcellular location feature z_s; these are merged with the weakened label y_p produced by the partial-label framework, passed through a fully connected layer and the channel-based self-attention module, and input to the decoder.
4. The partial-label framework mitigates the effect of missing labels.
The quantitative distribution labels for the protein immunofluorescence images in the training set are missing; this embodiment adopts the partial-label framework to alleviate the problem.
For a single-labeled protein image, which has only one subcellular-structure location label, the label y_p (the label input into the conditional diffusion generative adversarial network, i.e., an assignable, quantitative label) is determinate, and its qualitative and quantitative labels coincide: [0, 1] or [1, 0]. For a double-labeled protein image, which carries location labels for two subcellular structures simultaneously, the quantitative label is unknown (the qualitative labels of the dataset are Cytosol and Nucleoplasm, but the amounts of protein expressed at the two subcellular structures are ambiguous, so the quantitative label of a double-labeled protein image is unknown).
To handle this, the partial-label framework predefines a candidate label set S_i = { [0.25, 0.75], [0.5, 0.5], [0.75, 0.25] } for double-labeled protein images and assumes that each protein image has exactly one true label, contained in S_i. Because the model is expected to learn quantitative distribution labels rather than merely qualitative labels, each element of S_i is a two-component vector of quantitative fractions.
During training, in each iteration, each single-labeled protein image uses its own label directly as the weakened label y_p input to the conditional diffusion generative adversarial network; for each double-labeled protein image, one element is randomly drawn from S_i as its current assumed true weakened label y_p and input to the network.
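The label-weakening rule is simple enough to state in a few lines. The encoding of an image's raw annotation as a list of structure indices is an illustrative assumption; the candidate set S_i is taken from the text:

```python
import random

# Candidate quantitative labels for double-labeled images, per the text
S_i = [[0.25, 0.75], [0.5, 0.5], [0.75, 0.25]]

def weakened_label(structure_indices, rng=random):
    """Return the weakened label y_p fed into the network each iteration.
    structure_indices (an assumed encoding) lists the annotated subcellular
    structures: a single-labeled image keeps its determinate label
    ([1,0] or [0,1]); a double-labeled image gets a random element of S_i."""
    if len(structure_indices) == 1:          # single-labeled: label is known
        return [1.0, 0.0] if structure_indices[0] == 0 else [0.0, 1.0]
    return rng.choice(S_i)                   # double-labeled: label unknown

y_single = weakened_label([0])               # e.g. protein only in Cytosol
y_double = weakened_label([0, 1])            # e.g. Cytosol and Nucleoplasm
```

Because every candidate sums to 1, the sampled y_p is always a valid quantitative distribution over the two structures.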
To alleviate the semantic ambiguity introduced by the partial-label framework, a label loss L_label is set as a regularization term (5). The label loss is split into a generator part L_label(G) and a discriminator part L_label(D). L_label(G) computes the distance between the predicted label of the false image x'_{t-1} and its target label. L_label(D) computes the sum of the distance between the predicted label of the real image x_{t-1} and its target label (equal to y_p) and the distance between the predicted label of the false image x'_{t-1} and its target label.
Because generator G and discriminator D are adversaries, the target label of the false image differs between L_label(G) and L_label(D): in L_label(G) it equals y_p, while in L_label(D) it is an all-zero vector of the same size as y_p. Both parts of the label loss aim to minimize their loss value.
In formulas (5): x_{t-1} is the real image, x'_{t-1} is the false image, and x_t is the real t-step noisy protein image; the expectations are taken over the distributions of x_{t-1} (i.e., the real-image distribution) and x'_{t-1}; the discriminator predicts labels for the real and false images respectively; the target label of the real image equals y_p, the generator-part target label of the false image equals y_p, and the discriminator-part target label of the false image is an all-zero vector of the same size as y_p.
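Mirroring the adversarial loss, the label loss targets can be sketched as follows; the squared distance is again an assumption, since the patent only says "distance":

```python
import numpy as np

def l_label_g(fake_pred, y_p):
    """Generator part: distance between the false image's predicted label
    and the weakened label y_p (squared distance is an assumption)."""
    return np.mean((fake_pred - y_p) ** 2)

def l_label_d(real_pred, fake_pred, y_p):
    """Discriminator part: the real image's target is y_p; the false
    image's target is the all-zero vector of the same size as y_p."""
    zeros = np.zeros_like(y_p)
    return np.mean((real_pred - y_p) ** 2) + np.mean((fake_pred - zeros) ** 2)

y_p = np.array([0.25, 0.75])         # weakened quantitative label
real_pred = np.array([0.3, 0.7])     # predicted label of the real image
fake_pred = np.array([0.2, 0.8])     # predicted label of the false image
g = l_label_g(fake_pred, y_p)
d = l_label_d(real_pred, fake_pred, y_p)
```

The zero target in the discriminator part pushes D to withhold label confidence on generated images, the same game structure as the adversarial loss.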
5. Training the target.
The training objective of the model has two parts, for the generator (6) and the discriminator (7). The generator's objective comprises three terms: minimization of the adversarial loss L_adv(G), the content loss L_content, and the label loss L_label(G). The discriminator's objective comprises minimization of the adversarial loss L_adv(D) and the label loss L_label(D).
The model is trained on the data in the dataset until the values of the five losses no longer change, at which point the generator's adversarial loss L_adv(G), content loss L_content, and label loss L_label(G) are deemed minimized, and the discriminator's adversarial loss L_adv(D) and label loss L_label(D) are deemed minimized.
3. The model performance was evaluated.
The model was evaluated from both fidelity and diversity.
1. Fidelity technical index: the fidelity index considers the similarity of the generated samples and the true samples. Structural Similarity (SSIM) is used herein.
SSIM is an index for measuring the similarity of structural information of a false protein image and a true protein image, and the larger the SSIM is, the more similar the false protein image and the true protein image are.
The specific formula is as follows:
SSIM(x, y) = [(2 μ_x μ_y + c_1)(2 σ_xy + c_2)] / [(μ_x^2 + μ_y^2 + c_1)(σ_x^2 + σ_y^2 + c_2)]
where μ_x, μ_y, σ_x^2, σ_y^2, and σ_xy denote the means, variances, and covariance of the false protein image x and the true protein image y, respectively; c_1 and c_2 are small positive constants that avoid calculation errors caused by a zero denominator.
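A direct NumPy implementation of the global form of this formula is below (the SSIM used in practice averages the same expression over sliding local windows); the constants follow the common choice c_1 = (0.01 L)^2, c_2 = (0.03 L)^2 with dynamic range L = 1:

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Global SSIM: means, variances, and covariance are computed over
    the whole image, per the formula above."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) \
         / ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(4)
img = rng.random((16, 16))
```

An image compared with itself scores exactly 1, the maximum; any structural difference lowers the covariance term and hence the score.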
2. Diversity technical index: model diversity is evaluated with the Learned Perceptual Image Patch Similarity (LPIPS).
LPIPS computes the distance between the features of two false protein images encoded by a pre-trained VGG network; the higher the LPIPS, the higher the diversity. The specific formula is as follows:
LPIPS(X_1, X_2) = Σ_l [1 / (H_l W_l)] Σ_{h,w} || w_l ⊙ (f^1_{l,h,w} - f^2_{l,h,w}) ||_2^2
where X_1, X_2 are images; l indexes the layers of the VGG network; H_l, W_l are the height and width of the layer-l feature map and h, w the row and column indices; w_l is a weight vector that scales the activation channels; f^1, f^2 are the channel-normalized feature maps extracted at layer l; ⊙ denotes the Hadamard product.
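The formula can be sketched in NumPy with precomputed per-layer feature maps standing in for the VGG activations (the feature maps and unit channel weights here are assumed toy inputs, not a trained network):

```python
import numpy as np

def lpips(feats1, feats2, weights):
    """LPIPS per the formula above. feats1/feats2: lists of (C, H, W)
    feature maps (stand-ins for VGG activations); weights: list of (C,)
    channel weights w_l. Features are channel-normalized (unit norm over
    C) before the weighted squared difference is averaged spatially."""
    total = 0.0
    for f1, f2, w in zip(feats1, feats2, weights):
        n1 = f1 / (np.linalg.norm(f1, axis=0, keepdims=True) + 1e-10)
        n2 = f2 / (np.linalg.norm(f2, axis=0, keepdims=True) + 1e-10)
        diff = (w[:, None, None] * (n1 - n2)) ** 2
        total += diff.sum(axis=0).mean()   # 1/(H_l W_l) * Σ_{h,w} ||.||^2
    return total

rng = np.random.default_rng(5)
f_a = [rng.standard_normal((8, 4, 4)), rng.standard_normal((16, 2, 2))]
f_b = [rng.standard_normal((8, 4, 4)), rng.standard_normal((16, 2, 2))]
w = [np.ones(8), np.ones(16)]
d = lpips(f_a, f_b, w)
```

Identical feature stacks yield a distance of zero, and the measure is symmetric in its two inputs.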
3. Comprehensive index: fidelity and diversity are evaluated jointly with the Fréchet Inception Distance (FID). FID measures the distance between real and false protein images in the feature space obtained by feeding the data into an Inception-v3 network. FID assumes the feature distributions of the real and false protein images are multivariate Gaussian; the statistics of the two distributions are computed and substituted into the following formula to obtain the FID value. The smaller the FID, the better the generation performance.
FID = ||μ_1 - μ_2||^2 + Tr(Σ_1 + Σ_2 - 2 (Σ_1 Σ_2)^{1/2})
where μ_1, μ_2 denote the feature means of all real and all false protein images, Σ_1, Σ_2 the covariance matrices of all real and all false protein image features, and Tr the trace of a matrix.
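For one-dimensional features the matrix formula collapses to a closed form with no matrix square root, which makes the metric easy to verify; the sketch below computes FID between two sets of scalar features (the feature sets are synthetic stand-ins for Inception features):

```python
import numpy as np

def fid_1d(mu1, var1, mu2, var2):
    """FID between two univariate Gaussians. For 1-D features,
    ||mu1-mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}) reduces to
    (mu1-mu2)^2 + var1 + var2 - 2*sqrt(var1*var2),
    i.e. (mu1-mu2)^2 + (sigma1-sigma2)^2."""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2.0 * np.sqrt(var1 * var2)

rng = np.random.default_rng(6)
real_feats = rng.normal(0.0, 1.0, 100000)   # synthetic "real" features
fake_feats = rng.normal(0.5, 1.0, 100000)   # synthetic "false" features
fid = fid_1d(real_feats.mean(), real_feats.var(),
             fake_feats.mean(), fake_feats.var())
```

Two identical Gaussians give FID = 0; here the 0.5 mean shift with equal variances gives a value near 0.25, matching the closed form.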
4. And verifying the result.
1. And (5) selecting a basic model.
The generated images are false images produced by the trained deep generative models; since all losses of a trained model are minimized, the generated false images can be regarded as comparable to real images.
Table 2 compares the three base models on the dataset: the conditional diffusion model cDDPM, the conditional diffusion generative adversarial network cDDPM-GAN, and SLocDMGAN, the conditional diffusion generative adversarial network with the wavelet decomposition strategy. Comparing cDDPM with cDDPM-GAN shows that using a GAN as the network backbone helps reduce the number of sampling steps of the diffusion model, but the model's memory footprint grows, which prevents training under limited computing resources. With SLocDMGAN, which uses the wavelet decomposition strategy, the number of sampling steps and the test time are clearly lower than for cDDPM, and the fidelity metric also improves.
Table 2 comparison of three basic models on datasets
2. Validity of the module.
To demonstrate the effectiveness of the four key modules (the wavelet transform embedding module, the cell structure coding network, the partial-label framework, and the conditionally gated skip connection), Table 3 lists ablation experiments for the four modules. The unbiased label framework means the label of a double-labeled image is fixed at [0.5, 0.5] rather than drawn at random from the candidate label set; the unconditional gated skip connection means the summed outputs of each encoder residual block and the corresponding cell-structure residual block are combined directly with the output of the corresponding decoder block, without adjustment by the label feature map. The FID indicator shows that the model performs best with all four key modules added.
Table 3 ablation experiments for four key modules
3. Relation of sampling steps to model performance.
Table 4 lists the model's performance metrics at different numbers of sampling steps. As the number of sampling steps increases, both training time and test time grow; the fidelity metric SSIM does not decrease monotonically and is highest at 2 sampling steps; the FID likewise does not improve monotonically with more sampling steps, so more sampling steps do not necessarily mean better overall model performance.
TABLE 4 Performance index of models at different sampling steps
4. Comparing other works.
Here, the model of the present embodiment is compared with existing deep generative models: the U-Net-based model, -VAE, cAAE, cGAN, and BioGAN. The U-Net-based model uses the generator part of cGAN; the other four models are introduced in the background. To ensure a fair comparison, all models adopt the partial-label framework to alleviate the missing-label problem.
Fig. 2 shows the images generated by each deep generative model, and Table 5 the quantitative evaluation results. In terms of the fidelity metric SSIM, the proposed model of this embodiment scores best. On the diversity metric, the model of this embodiment is inferior to cAAE and BioGAN, because that metric evaluates only the generated images, regardless of their relationship to the real images. Finally, the FID score shows the model of this embodiment is second best on the comprehensive metric, behind the cAAE model. However, the fidelity score SSIM of the cAAE model is very poor, meaning that although cAAE excels on FID, there is a very large gap between its false and real images, as can be seen in the example diagram of Fig. 2. Overall, the model of this embodiment has optimal fidelity and better overall performance than the other models.
Table 5 results comparison with other existing depth generation models
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (8)

1. The model based on the conditional diffusion model is characterized by comprising a size optimization module, an image generation module, a learning module and a weakening module;
The image generation module comprises a diffusion model and a generation countermeasure network, the generation countermeasure network comprises a generator and a discriminator, the generator is used for inputting a forward protein image, the forward protein image is obtained by processing a protein image through a forward noise adding process and a convolution layer of the diffusion model, the discriminator is used for inputting a real image and a false image, discriminating the real image and the false image, and the false image is obtained by processing a false 0-step protein image output by the generator through a reverse sampling process of the diffusion model;
The generator comprises an encoder, a decoder, a self-attention module and a wavelet transformation embedding module; the encoder comprises a plurality of conventional residual blocks, the decoder comprises a plurality of conventional residual blocks, the self-attention module is arranged between the encoder and the decoder, the wavelet transform embedding module comprises a wavelet transform downsampling network and a plurality of frequency bottleneck blocks, the frequency bottleneck blocks are respectively arranged at the rear end of the encoder and the front end of the decoder, the wavelet transform downsampling network comprises a plurality of wavelet transform downsampling layers, and the wavelet transform downsampling layers are correspondingly arranged with the conventional residual blocks of the encoder;
The protein image is obtained by optimizing a protein image to be processed through the size optimizing module, and the optimization is realized through a wavelet decomposition method;
The learning module comprises a cell structure coding network; the cell structure coding network comprises a frequency bottleneck block and a plurality of conventional residual blocks, the conventional residual blocks of the cell structure coding network being arranged in correspondence with the conventional residual blocks of the encoder, and the frequency bottleneck block of the cell structure coding network being arranged in correspondence with the frequency bottleneck block of the encoder;
The weakening module comprises a partial-label framework and is used for outputting weakening marks.
2. The model of claim 1, wherein the encoder is used for inputting a forward protein image, the decoder is used for outputting a false 0-step protein image, and the wavelet transform downsampling network is used for inputting a forward noise-added protein image, the forward noise-added protein image being obtained from a protein image after processing by the forward noising process of the diffusion model.
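By way of illustration only (not part of the claim language), the forward noising process recited above corresponds to the standard closed-form diffusion forward step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε. The sketch below assumes a linear beta schedule and a toy image shape, neither of which is specified by the claims.

```python
import numpy as np

def forward_noising(x0, t, betas):
    """Closed-form DDPM forward step: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)               # abar_t = prod_{s<=t} (1 - beta_s)
    eps = np.random.default_rng(0).standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

betas = np.linspace(1e-4, 0.02, 1000)            # linear schedule (assumed)
x0 = np.zeros((1, 64, 64))                       # toy "protein image"
xt, eps = forward_noising(x0, 999, betas)
```

For t near the end of the schedule ᾱ_t is close to zero, so x_t is approximately pure Gaussian noise; the generator then learns to recover the 0-step image from such inputs.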
3. A method of constructing the model according to any one of claims 1-2, characterized by comprising the following steps:
Image optimization: optimizing a protein image to be processed of a cell picture through the size optimization module to obtain the protein image; and optimizing a cell structure image to be processed of the cell picture through the size optimization module to obtain a cell structure image;
Image generation: inputting the protein image into the diffusion model and processing it through the forward noising process of the diffusion model to obtain a forward noise-added protein image; inputting the forward noise-added protein image into the generator for generation processing to obtain a false 0-step protein image; and processing the false 0-step protein image through the reverse sampling process of the diffusion model to obtain a false image;
Image discrimination: inputting the real image, the false image and the forward noise-added protein image into the discriminator for discrimination, such that the adversarial loss, content loss and marking loss of the generator are minimized and the adversarial loss and marking loss of the discriminator are minimized, thereby obtaining the model.
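A minimal sketch, outside the claim language, of how the three generator loss terms recited in this claim might be combined. The specific loss forms (non-saturating adversarial loss, L1 content loss, cross-entropy marking loss) and the weights are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def generator_loss(d_fake, fake, real, pred_mark, weak_mark,
                   w_content=1.0, w_mark=0.1):
    """Weighted sum of the three generator terms recited in the claim.
    Loss forms and weights are illustrative assumptions."""
    eps = 1e-8
    l_adv = -np.mean(np.log(d_fake + eps))                  # fool the discriminator
    l_content = np.mean(np.abs(fake - real))                # L1 between fake and real image
    l_mark = -np.mean(weak_mark * np.log(pred_mark + eps))  # weakened-label cross-entropy
    return l_adv + w_content * l_content + w_mark * l_mark

loss = generator_loss(d_fake=np.array([0.9]),
                      fake=np.ones((4, 4)), real=np.ones((4, 4)),
                      pred_mark=np.array([0.5, 0.5]),
                      weak_mark=np.array([0.5, 0.5]))
```

With a perfect content match (fake equals real) the content term vanishes and the loss is dominated by the adversarial and marking terms.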
4. A method for generating a cell fluorescence image based on a conditional diffusion model, characterized in that the method is implemented by the model according to any one of claims 1-2 and comprises the following steps:
Image optimization: optimizing a protein image to be processed of a cell picture through the size optimization module to obtain the protein image; and optimizing a cell structure image to be processed of the cell picture through the size optimization module to obtain a cell structure image;
Image generation: inputting the protein image into the diffusion model and processing it through the forward noising process of the diffusion model to obtain a forward noise-added protein image; inputting the forward noise-added protein image into the generator for generation processing to obtain a false 0-step protein image; and processing the false 0-step protein image through the reverse sampling process of the diffusion model to obtain a false image.
5. The method of generating a cell fluorescence image according to claim 4, wherein the optimization comprises the following steps: decomposing the protein image to be processed or the cell structure image to be processed by wavelet transform to obtain decomposed images, and combining the decomposed images along the channel dimension to obtain the protein image or the cell structure image.
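The decompose-then-concatenate optimization recited in claim 5 can be sketched with a single-level Haar transform (the wavelet basis is an assumption; the claim does not fix one): the four subbands halve the spatial size and are stacked along the channel dimension, trading resolution for channels without discarding information.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2-D Haar decomposition of an (H, W) image into the
    LL, LH, HL, HH subbands, each (H/2, W/2). A minimal stand-in for the
    wavelet decomposition recited in the claim."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-frequency approximation
    lh = (a + b - c - d) / 2.0   # horizontal detail
    hl = (a - b + c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh

img = np.arange(64.0).reshape(8, 8)      # toy single-channel image
subbands = haar_dwt2(img)
stacked = np.stack(subbands, axis=0)     # combine along the channel dimension
```

Because this Haar normalization is orthonormal, the stacked representation preserves the total energy of the input image.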
6. The method of generating a cell fluorescence image according to claim 4, wherein the generation processing comprises the following steps:
encoding: the forward noise-added protein image is processed by a convolution layer to obtain the forward protein image, and the forward protein image is input into a first conventional residual block of the encoder to obtain a conventional residual result of the encoder; inputting the forward noise-added protein graph into a wavelet transform downsampling layer corresponding to a first conventional residual block to obtain a wavelet transform result; adding the encoder conventional residual result and the wavelet transform result, feeding the encoder conventional residual result into a conventional residual block of a next encoder, and feeding the wavelet transform result into a next wavelet transform downsampling layer; repeating the encoding process step to a frequency bottleneck block feeding the encoder;
Wavelet transform: the frequency bottleneck block of the encoder performs wavelet transform decomposition and inverse wavelet transform reconstruction on the fed-in data to obtain reconstruction features; the reconstruction features are input into a fully connected layer to obtain protein subcellular location features; the protein subcellular location features, cell structure features and the weakening mark are combined and then sequentially input into a fully connected layer, the self-attention module and the frequency bottleneck block of the decoder for wavelet transform decomposition and inverse wavelet transform reconstruction, and then input into the decoder to obtain the false 0-step protein image.
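Outside the claim language, the per-stage fusion of the residual branch and the wavelet downsampling branch described in the encoding step can be sketched as follows. Both layer stand-ins (a scaled identity for the conventional residual block, average pooling for the wavelet transform downsampling layer) are placeholders, not the patented blocks; only the dataflow matches the claim.

```python
import numpy as np

def avgpool2(x):
    """Stand-in downsampling layer: halves H and W by 2x2 averaging."""
    return 0.25 * (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2])

def encode(x, stages=2):
    """Per-stage fusion per the claim: the residual result and the wavelet
    result are added and fed to the next residual block, while the wavelet
    result alone feeds the next wavelet downsampling layer."""
    h, w = x, x
    for _ in range(stages):
        r = avgpool2(h + 0.1 * h)   # "conventional residual block" stand-in
        wt = avgpool2(w)            # "wavelet transform downsampling layer" stand-in
        h = r + wt                  # sum feeds the next residual block
        w = wt                      # wavelet result feeds the next wavelet layer
    return h                        # fed into the frequency bottleneck block

feat = encode(np.ones((8, 8)))
```

Keeping the wavelet branch separate means each stage receives a downsampled view of the original noisy input in addition to the learned features, a skip-like design choice.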
7. The method of generating a cell fluorescence image according to claim 6, wherein the method of preparing the cell structure features comprises the following steps: inputting the cell structure image into a conventional residual block of the cell structure coding network to obtain a cell structure coding network conventional residual result, and inputting that result into the next conventional residual block of the cell structure coding network, until the result is input into the frequency bottleneck block of the cell structure coding network to obtain the cell structure features;
The method of preparing the weakening mark comprises the following steps: setting a candidate mark set for the protein image by means of the partial-label framework, the candidate mark set being S i = { [0.25, 0.75], [0.5, 0.5], [0.75, 0.25] }; and screening the marks of the protein image through the partial-label framework to obtain the weakening mark; the screening comprising: when the protein image is a single-mark image, taking the single mark of the image as the weakening mark; and when the protein image is a double-mark image, selecting any candidate mark from the candidate mark set as the weakening mark.
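The screening rule for the weakening mark in this claim can be sketched directly; the weight-vector encoding of the single-mark case is an assumption, as the claim only says the single mark itself is kept.

```python
import random

# Candidate set S_i from claim 7: for a double-mark image one candidate
# weight pair is drawn; a single-mark image keeps its own mark.
CANDIDATES = [[0.25, 0.75], [0.5, 0.5], [0.75, 0.25]]

def weakened_mark(marks, rng=random):
    if len(marks) == 1:          # single-mark image: keep its mark
        return [1.0]             # full weight on the single mark (assumed encoding)
    if len(marks) == 2:          # double-mark image: draw any candidate
        return rng.choice(CANDIDATES)
    raise ValueError("the claim covers single- and double-mark images only")

m1 = weakened_mark(["nucleus"])
m2 = weakened_mark(["nucleus", "cytoplasm"])
```

Each candidate pair sums to 1, so the weakening mark behaves like a soft label distribution over the two annotated locations.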
8. The method of generating a cell fluorescence image according to claim 7, wherein the encoder conventional residual result is data adjusted by a latent code feature map and a sampling feature map of the encoder, and the cell structure coding network conventional residual result is data adjusted by a latent code feature map and a sampling feature map of the cell structure coding network.
CN202410129759.3A 2024-01-31 2024-01-31 Cell fluorescence image generation method based on conditional diffusion model, model and application Active CN117671072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410129759.3A CN117671072B (en) 2024-01-31 2024-01-31 Cell fluorescence image generation method based on conditional diffusion model, model and application

Publications (2)

Publication Number Publication Date
CN117671072A CN117671072A (en) 2024-03-08
CN117671072B true CN117671072B (en) 2024-05-10

Family

ID=90071635

Country Status (1)

Country Link
CN (1) CN117671072B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409466A (en) * 2021-07-06 2021-09-17 中国科学院自动化研究所 Fluorescence excitation tomography method based on GCN residual error connection network
CN113724195A (en) * 2021-07-15 2021-11-30 南方医科大学 Protein quantitative analysis model based on immunofluorescence image and establishment method
CN114252423A (en) * 2021-12-24 2022-03-29 汉姆德(宁波)智能医疗科技有限公司 Method and device for generating fully sampled image of super-resolution microscope
CN115984631A (en) * 2023-02-15 2023-04-18 西安交通大学 Markless cell transmission light microscopic image virtual staining method based on deep learning automatic optimization
CN116309310A (en) * 2023-02-02 2023-06-23 中国科学技术大学 Pathological image cell nucleus detection method combining global regularization and local countermeasure learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11403735B2 (en) * 2018-01-25 2022-08-02 King Abdullah University Of Science And Technology Deep-learning based structure reconstruction method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Jiao et al., "Digitally predicting protein localization and manipulating protein activity in fluorescence images using 4D reslicing GAN", Bioimage Informatics, 31 Jan 2023, vol. 39, no. 1, pp. 1-10 *


Similar Documents

Publication Publication Date Title
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
CN111368662A (en) Method, device, storage medium and equipment for editing attribute of face image
CN111626300A (en) Image semantic segmentation model and modeling method based on context perception
Li et al. Detection-friendly dehazing: Object detection in real-world hazy scenes
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
Wang et al. BANet: Small and multi-object detection with a bidirectional attention network for traffic scenes
CN115147598B (en) Target detection segmentation method and device, intelligent terminal and storage medium
Huang et al. Channelized axial attention–considering channel relation within spatial attention for semantic segmentation
CN111861945A (en) Text-guided image restoration method and system
Zhou et al. Mining joint intraimage and interimage context for remote sensing change detection
CN115565056A (en) Underwater image enhancement method and system based on condition generation countermeasure network
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
CN113920379B (en) Zero sample image classification method based on knowledge assistance
CN112949628A (en) Track data enhancement and track identification method based on embedding-mixing
CN117671072B (en) Cell fluorescence image generation method based on conditional diffusion model, model and application
CN116824138A (en) Interactive image segmentation method and device based on click point influence enhancement
Lai et al. Accelerated guided sampling for multistructure model fitting
CN115984296A (en) Medical image segmentation method and system applying multi-attention mechanism
CN116311504A (en) Small sample behavior recognition method, system and equipment
Liu et al. CCQ: cross-class query network for partially labeled organ segmentation
Su et al. Face image completion method based on parsing features maps
Wang et al. Unsupervised anomaly detection with local-sensitive VQVAE and global-sensitive transformers
Jin et al. Restoring latent vectors from generative adversarial networks using genetic algorithms
CN118097325A (en) Depth generation model, immunofluorescence image generation method and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant