CN117952993B - Semi-supervised medical image segmentation method based on image text cooperative constraint - Google Patents

Semi-supervised medical image segmentation method based on image text cooperative constraint

Info

Publication number
CN117952993B
CN117952993B
Authority
CN
China
Prior art keywords
model
image
text
sam
clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410353448.5A
Other languages
Chinese (zh)
Other versions
CN117952993A (en)
Inventor
蔡青
曹子彦
鄢柯
张帆
刘治
徐勇
王珊珊
董军宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202410353448.5A priority Critical patent/CN117952993B/en
Publication of CN117952993A publication Critical patent/CN117952993A/en
Application granted granted Critical
Publication of CN117952993B publication Critical patent/CN117952993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a semi-supervised medical image segmentation method based on image-text collaborative constraint, belonging to the technical field of medical image processing. The network comprises an encoder shared by a SAM model and a CLIP model, a CLIP model branch, and a SAM model branch. The SAM model excels at image segmentation and object localization and has good spatial understanding; the CLIP model has strong cross-modal understanding and can combine text and image information for task processing. By combining the two, the model acquires good spatial localization capability in image understanding tasks and can make full use of text information, so that images are understood and processed more comprehensively. The invention combines the SAM and CLIP models so that they complement each other to improve model performance, and the structure and parameters of the model can be flexibly adjusted according to task demands, thereby better adapting to different image segmentation tasks and scenes.

Description

Semi-supervised medical image segmentation method based on image text cooperative constraint
Technical Field
The invention provides a semi-supervised medical image segmentation method based on image-text collaborative constraint, and belongs to the technical field of medical image processing.
Background
Medical image segmentation is a process of precisely marking and locating structures or regions in a medical image, dividing the medical image into different regions, each region corresponding to a particular structure, organ or lesion in the image. Medical image segmentation plays an important role in the field of medical imaging, provides accurate and important information for medical diagnosis, treatment and operation, and has profound effects on improving the medical care and treatment effects of patients.
In recent years, deep learning techniques have made remarkable progress in medical image segmentation. For example, the U-Net model, a classical deep learning architecture widely used for medical image segmentation, passes the input image through an encoder-decoder network structure that helps capture features at different levels and is particularly suitable for small-sample settings. Generative adversarial networks (GANs) are used to generate more realistic medical images and can also be used to improve segmentation performance by training with images that have more realistic medical characteristics. These advances have given deep learning greater accuracy and robustness in medical image segmentation, providing a more promising solution for medical image analysis. Although deep learning has made significant progress in medical image segmentation, its performance in many medical tasks is still suboptimal, mainly for the following reasons:
First, labeling is difficult and expensive: labeling medical images typically requires specialized doctors or other medical professionals, which makes the labeling process expensive and time-consuming. At the same time, the labeling of medical images requires high accuracy and reliability for every pixel.
Second, data imbalance: medical image datasets may suffer from class imbalance, i.e. the number of samples of some classes far exceeds that of others. This can cause models to favor the higher-frequency categories during training while performing poorly on the remaining categories.
Third, data diversity: medical images cover many diseases, organs and scanning devices and are therefore highly diverse. This makes it more complex to design a model that works well across different scenarios and modalities, because the model needs sufficient generalization capability.
Disclosure of Invention
The invention aims to provide a semi-supervised medical image segmentation method based on image text cooperative constraint, so as to make up for the defects of the prior art.
The invention combines CLIP and SAM into a dual-branch network model, drawing on the advantages of both models to establish a semi-supervised medical image segmentation model that addresses the problems of insufficient data and network overfitting in semi-supervised medical image segmentation tasks.
In order to achieve the aim of the invention, the invention adopts the following specific technical scheme:
A semi-supervised medical image segmentation method based on image text collaborative constraint comprises the following steps:
S1: collecting a medical image dataset and preprocessing the image data; dividing the data set into a training set and a testing set, wherein the training set comprises supervised image data and unsupervised image data;
S2: encoding the text description of the selected dataset using a text encoder of the large visual language model CLIP to obtain text features;
S3: constructing a network model, wherein the network model comprises two branches, namely a CLIP model branch and a SAM model branch; the two branches adopt a shared encoder, the segmentation network structure of each branch is a UNet, and their parameters are initialized differently;
S4: after the training set is input into the network model, for the SAM model branch, the extracted image features and the prompt embedding are concatenated and the result is added as a guiding parameter in subsequent operations to obtain a segmentation result; for the CLIP model branch, the extracted image features and the obtained text features are concatenated and likewise added as a guiding parameter in subsequent operations to obtain a segmentation result;
S5: constructing the loss functions of the network model, comprising a supervised loss for labeled data, a consistency loss for unlabeled data, and a total loss function combining the two;
S6: performing supervised training of the CLIP model branch and the SAM model branch respectively with the labeled data in the training set;
S7: performing unsupervised training of the CLIP model branch and the SAM model branch through the consistency loss with the unlabeled data in the training set;
S8: outputting the final image segmentation result on the test data of the test set through the SAM model branch.
Further, in step S1: the preprocessing comprises converting the image format, cropping the image, and normalizing the image; a training-phase dataset D_tr is then constructed, comprising a supervised part D_sup and an unsupervised part D_unsup, i.e. D_tr = D_sup ∪ D_unsup, where D_sup = {(X_1, Y_1), (X_2, Y_2), ......, (X_L, Y_L)}, with X_L representing a labeled medical image, Y_L its corresponding real label, and L the number of annotated medical images; D_unsup = {X_(L+1), X_(L+2), ......, X_M}, where the M−L images indexed from L+1 to M are the medical images without annotations.
Further, in step S2: the text encoder of the large visual-language model CLIP is used to extract features from the text description and is not fine-tuned during the whole training process, which greatly reduces the training overhead without an excessive loss of accuracy; the text description is a prompt containing the name of the organ to be segmented in the dataset. The text prompt is then passed through the text encoder of the CLIP model to obtain the text embedding, as shown in formula (1):

T_e = F_t(t)    (1);

where T_e represents the extracted feature vector, F_t represents the text encoder, and t represents the textual description of the organ; for each dataset, the textual description corresponding to every image is the same.
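As an illustration of formula (1), the sketch below queries a frozen CLIP text encoder with an organ-name prompt; the Hugging Face checkpoint name, the prompt template, and the use of the pooled output are assumptions made for this sketch only.

```python
# Minimal sketch of formula (1), T_e = F_t(t), with the CLIP text encoder kept frozen.
# Checkpoint name and prompt template are illustrative assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.eval()                          # the text encoder is never fine-tuned
for p in text_encoder.parameters():
    p.requires_grad = False

def encode_organ_prompt(organ_name: str) -> torch.Tensor:
    """Return the text embedding T_e for the organ description t."""
    t = f"A photo of a {organ_name}"         # same description for every image of a dataset
    tokens = tokenizer(t, return_tensors="pt")
    with torch.no_grad():
        T_e = text_encoder(**tokens).pooler_output   # shape: (1, hidden_dim)
    return T_e

T_e = encode_organ_prompt("Left Atrium")
```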
Further, in step S3: an overall segmentation network model is constructed, comprising a shared encoder, a CLIP model branch, and a SAM model branch, wherein the CLIP model branch comprises a text encoder and a CLIP image decoder, and the SAM model branch comprises an nnU-Net network, a prompt encoder, and a SAM image decoder.
Further, in S4: in order to fully utilize the segmentation capability of the SAM model and the text-prompt capability of the CLIP model, while overcoming the lack of annotated data in medical image segmentation, two segmentation networks are adopted; both are UNet in structure and their parameters are initialized differently. The SAM model requires prompts for the input picture (points, boxes, text, etc.), and manual prompting is too costly and time-consuming. To overcome this drawback, the invention first uses an nnU-Net network to coarsely segment the input picture and generate the corresponding box prompt, and then feeds the picture with the box prompt into the prompt encoder of the SAM model to generate the prompt embedding, as shown in formulas (2) and (3):

X_i^box = nnUNet(X_i)    (2);
Q_t = P_Q(X_i^box)    (3);

where X_i represents the i-th image data input to the nnU-Net network for coarse segmentation, X_i^box represents the image data with the box prompt, P_Q represents the prompt encoder of the SAM model, and Q_t represents the corresponding prompt embedding. While the box-prompt embedding is obtained, the same image data is also fed into the encoder shared by the SAM model and the CLIP model to obtain the image embedding, as shown in formula (4):

F_i = E_s(X_i)    (4);

where X_i represents the i-th image input to the network, E_s is the encoder section shared by the SAM model and the CLIP model, and F_i is the extracted image feature. Since the size of the feature map F_i is inconsistent with that of the previously obtained prompt embedding Q_t, F_i is first passed through global average pooling before it can be concatenated with Q_t to obtain an intermediate parameter, as shown in formula (5):

θ_i = Concat(GAP(F_i), Q_t)    (5);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (6):

p_i^sam = Conv(D_sam(F_i) + Expand(Conv(θ_i)))    (6);

where D_sam represents the decoder part of the SAM segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
The above steps describe how the SAM model branch processes the image data and obtains its segmentation result. For the CLIP model branch, the image feature F_i from the shared encoder must likewise be concatenated with the text feature T_e obtained in S2. Similarly, since the size of the feature map F_i is inconsistent with that of the previously obtained text feature T_e, F_i is first passed through global average pooling before it can be concatenated with T_e to obtain an intermediate parameter, as shown in formula (7):

θ_i = Concat(GAP(F_i), T_e)    (7);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (8):

p_i^clip = Conv(D_clip(F_i) + Expand(Conv(θ_i)))    (8);

where D_clip represents the decoder part of the CLIP segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
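For concreteness, the following is a minimal PyTorch sketch of the fusion described by formulas (5)-(8): the encoder feature map is global-average-pooled, concatenated with the guidance embedding (the prompt embedding Q_t in the SAM branch or the text feature T_e in the CLIP branch), projected by a convolution, expanded to the decoder output size, and added before the final prediction convolution. Channel counts, tensor shapes, and the class name are assumptions for illustration, not the exact configuration of the patented network.

```python
# Illustrative sketch of the GAP -> Concat -> Conv -> Expand -> add fusion of formulas (5)-(8).
# Channel sizes and shapes are assumed for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedFusionHead(nn.Module):
    """Fuse a guidance embedding (prompt embedding Q_t or text feature T_e)
    with the upsampled decoder output, then predict the segmentation mask."""
    def __init__(self, enc_channels=256, emb_dim=256, dec_channels=64, num_classes=2):
        super().__init__()
        # Inner Conv of formulas (6)/(8): align theta_i with the decoder channel count
        self.theta_proj = nn.Conv2d(enc_channels + emb_dim, dec_channels, kernel_size=1)
        # Outer Conv: the convolution layer that yields the final segmentation result
        self.out_conv = nn.Conv2d(dec_channels, num_classes, kernel_size=1)

    def forward(self, feat, guidance, dec_out):
        # feat:     encoder feature map F_i,          (B, enc_channels, h, w)
        # guidance: Q_t or T_e,                       (B, emb_dim)
        # dec_out:  decoder output at original size,  (B, dec_channels, H, W)
        gap = F.adaptive_avg_pool2d(feat, 1).flatten(1)                 # GAP(F_i)
        theta = torch.cat([gap, guidance], dim=1)                       # formulas (5)/(7)
        theta = self.theta_proj(theta[:, :, None, None])                # inner Conv
        theta = theta.expand(-1, -1, dec_out.size(2), dec_out.size(3))  # Expand
        return self.out_conv(dec_out + theta)                           # formulas (6)/(8)
```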
Further, in S5: the loss functions used to train the segmentation model comprise a supervised loss on labeled data, a consistency loss that further exploits unlabeled data, and a total loss function combining the two. The supervised loss L_sup measures the gap between the prediction mask computed by the segmentation network and the real label; in the present invention it comprises the Dice loss and the cross-entropy loss, as shown in formulas (9), (10) and (11):

L_ce = −(1/(H×W)) Σ_i [ŷ_i log(p_i) + (1 − ŷ_i) log(1 − p_i)]    (9);
L_dice = 1 − 2Σ_i p_i ŷ_i / (Σ_i p_i + Σ_i ŷ_i)    (10);
L_sup = L_ce + L_dice    (11);

where p_i represents the prediction label output by the network, H×W represents the number of pixels of the image, and ŷ_i represents the real label in the dataset; L_sup is the supervised loss, L_ce represents the cross-entropy loss, and L_dice represents the Dice loss. For the unsupervised loss L_semi, i.e. the consistency loss on unlabeled data between the SAM model branch and the CLIP model branch, the specific formula is shown in (12):

L_semi = L_mse(p^clip, p^sam)    (12);

where L_mse represents the MSE loss, p^clip represents the prediction result of the CLIP model, and p^sam represents the prediction result of the SAM model.
Combining the supervised and unsupervised loss functions, the total loss function of the model is defined as shown in formula (13):

L_all = L_sup + λ L_semi    (13);

where λ is a weighting coefficient that increases with the number of iteration cycles I. The purpose of introducing the dynamic parameter λ is that in the early stage of network training the parameter optimization is driven mainly by the annotated labels, but many errors accumulate at the same time; as the network iterates, more weight should be given to the unsupervised loss in the later stage to correct the previously accumulated errors.
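A minimal sketch of these loss terms is given below for illustration; the binary form of the losses and the Gaussian ramp-up used for λ are assumptions of this sketch, since the text only states that λ increases with the iteration count.

```python
# Sketch of L_sup, L_semi and L_all (formulas (9)-(13)); exact forms are assumed.
import math
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def supervised_loss(pred, target):
    # L_sup = L_ce + L_dice on labeled data
    return F.binary_cross_entropy(pred, target) + dice_loss(pred, target)

def consistency_loss(p_clip, p_sam):
    # L_semi: MSE between the two branches' predictions on unlabeled data
    return F.mse_loss(p_clip, p_sam)

def ramp_up_weight(iteration, max_iteration, w_max=1.0):
    # Assumed Gaussian ramp-up so that lambda grows with the iteration count I
    return w_max * math.exp(-5.0 * (1.0 - iteration / max_iteration) ** 2)

def total_loss(pred_l, target_l, p_clip_u, p_sam_u, iteration, max_iteration):
    lam = ramp_up_weight(iteration, max_iteration)
    return supervised_loss(pred_l, target_l) + lam * consistency_loss(p_clip_u, p_sam_u)
```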
Further, in step S6, the two branches are supervised-trained with the labeled data and the supervised-loss objective, which improves the accuracy of the model. The supervised loss uses the labeled data to improve the segmentation performance of the network; with the supervised loss function, the real label information in the labeled data can be exploited effectively to guide the model to learn correct feature representations, helping it better distinguish different types of tissue structures or lesions and segment them more accurately.
Further, in step S7, the unlabeled data are fed into the CLIP model and the SAM model respectively to generate segmentation results, and the consistency loss is computed from the two segmentation results, so that the unlabeled data are further exploited and the segmentation accuracy of the model is improved. The consistency loss function is generally used to ensure that the outputs of the two models remain consistent under different conditions; it also constrains the range of features and knowledge the models learn during training, thereby improving their generalization to unseen data. Moreover, by maintaining consistency between the two models, instability of the segmentation results caused by data changes or changes in model parameters can be reduced.
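To show how steps S6 and S7 could interleave in practice, a schematic training-loop sketch follows; the dataloader handling, the optimizer choice, and the reuse of the loss helpers sketched after S5 are illustrative assumptions.

```python
# Schematic semi-supervised training step alternating labeled (S6) and unlabeled (S7) batches.
# Loader and optimizer settings are assumptions; supervised_loss, consistency_loss and
# ramp_up_weight refer to the loss sketch given after step S5.
import itertools
import torch

def train(model_sam, model_clip, labeled_loader, unlabeled_loader, max_iters=10000):
    params = itertools.chain(model_sam.parameters(), model_clip.parameters())
    optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)
    labeled_iter = itertools.cycle(labeled_loader)
    unlabeled_iter = itertools.cycle(unlabeled_loader)

    for it in range(max_iters):
        x_l, y_l = next(labeled_iter)              # labeled batch  -> supervised loss
        x_u = next(unlabeled_iter)                 # unlabeled batch -> consistency loss

        l_sup = supervised_loss(model_sam(x_l), y_l) + supervised_loss(model_clip(x_l), y_l)
        p_sam_u, p_clip_u = model_sam(x_u), model_clip(x_u)
        lam = ramp_up_weight(it, max_iters)
        loss = l_sup + lam * consistency_loss(p_clip_u, p_sam_u)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```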
Further, step S8 specifically comprises: selecting data from the test set, testing with the SAM model branch, observing the final segmentation result, and measuring the segmentation accuracy with corresponding evaluation metrics.
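One possible realisation of the test-time measurement in S8 is sketched below; the choice of the Dice coefficient and of the medpy package for the HD95 distance (both used later in Example 2) is an assumption, as is the binarisation of the predicted mask.

```python
# Sketch of test-time evaluation of the SAM-branch output; library choice is an assumption.
import numpy as np
from medpy.metric.binary import hd95   # assumed dependency for the HD95 metric

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (2.0 * np.logical_and(pred, gt).sum() + eps) / (pred.sum() + gt.sum() + eps)

def evaluate(pred_mask: np.ndarray, gt_mask: np.ndarray) -> dict:
    """pred_mask: binarised SAM-branch prediction; gt_mask: ground-truth label."""
    return {"Dice": dice_coefficient(pred_mask, gt_mask),
            "HD95": hd95(pred_mask.astype(bool), gt_mask.astype(bool))}
```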
The invention has the advantages and beneficial effects that:
The invention designs a semi-supervised medical image segmentation method based on image-text collaborative constraint. The SAM model excels at image segmentation and object localization and has good spatial understanding; the CLIP model has strong cross-modal understanding and can combine text and image information for task processing. The invention combines the two, so that the model has good spatial localization capability in image understanding tasks and can make full use of text information, thereby understanding and processing images more comprehensively. The prompts required by the SAM model are generated automatically: nnU-Net performs a coarse segmentation to produce box prompts, which reduces the cost of manual prompting. Meanwhile, the CLIP model can perform zero-shot learning, i.e. execute tasks without task-specific data, and the SAM model has learned rich image-understanding knowledge through multi-task pre-training. The invention combines the two to perform finer-grained image understanding by exploiting the pre-training knowledge of the SAM model on the basis of zero-shot learning, thereby improving the performance and generalization capability of the model.
The invention combines the SAM and the CLIP model to improve the model performance in a mutually complementary mode, and can flexibly adjust the structure and parameters of the model according to task demands, thereby better adapting to different image segmentation tasks and scenes and achieving good medical image segmentation effect.
Drawings
Fig. 1 is an overall flow chart of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
Example 1:
A semi-supervised medical image segmentation method based on image-text cooperative constraint, the overall flow of which is shown in Figure 1, comprises the following steps:
S1: first, the dataset used for training needs to be partitioned; a publicly available dataset, such as the ACDC dataset, is collected and preprocessed. The preprocessing operations comprise: converting images with the .nii.gz suffix into the .h5 format for convenient subsequent processing, cropping larger images containing much redundant information to their central region to obtain images with less redundant information, and normalizing the images. A training-phase dataset is then constructed, comprising a supervised part D_sup and an unsupervised part D_unsup, i.e. D_tr = D_sup ∪ D_unsup, where D_sup = {(X_1, Y_1), (X_2, Y_2), ......, (X_L, Y_L)}, with X_L representing an image, Y_L its corresponding real label, and L the number of annotated images; D_unsup = {X_(L+1), X_(L+2), ......, X_M}, where the M−L images indexed from L+1 to M are the pictures without annotations.
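A minimal preprocessing sketch matching the step above (conversion from .nii.gz to .h5, centre cropping, and normalisation) might look as follows; the crop size, file layout, and dataset keys are assumptions for illustration.

```python
# Sketch of the S1 preprocessing: .nii.gz -> .h5 conversion, centre crop, normalisation.
# Crop size and dataset keys are illustrative assumptions.
import h5py
import nibabel as nib
import numpy as np

def preprocess_case(image_path: str, label_path: str, out_path: str, crop=(256, 256)):
    image = nib.load(image_path).get_fdata().astype(np.float32)
    label = nib.load(label_path).get_fdata().astype(np.uint8)

    # Centre crop to discard redundant border regions
    h, w = image.shape[:2]
    top, left = (h - crop[0]) // 2, (w - crop[1]) // 2
    image = image[top:top + crop[0], left:left + crop[1]]
    label = label[top:top + crop[0], left:left + crop[1]]

    # Min-max normalisation to [0, 1]
    image = (image - image.min()) / (image.max() - image.min() + 1e-8)

    with h5py.File(out_path, "w") as f:
        f.create_dataset("image", data=image)
        f.create_dataset("label", data=label)
```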
S2: for the collected dataset, an extremely simple sentence is used as the text description, owing to the lack of diagnostic descriptions provided by professional doctors. In the experiments, the text description of the segmented organ serves only as an auxiliary supervisory signal, so the text encoder of the large visual-language model CLIP is used directly to extract features from the text description and is not fine-tuned during the whole training process, which greatly reduces the training overhead without an excessive loss of accuracy. The text description is a very simple sentence, namely "A Photo of a ____ (name of the organ)"; for example, the text description for the LA dataset is "A Photo of a Left Atrium", and the original CLIP model can fully extract its features, as shown in formula (1):

T_e = F_t(t)    (1);

where T_e represents the extracted feature vector, F_t represents the text encoder, and t represents the textual description of the organ; for each dataset, the textual description corresponding to every image is the same.
S3: after the required dataset and text-prompt embedding have been obtained, the overall architecture of the segmentation model is constructed. The model comprises two branches in total, a CLIP model branch and a SAM model branch. Both branches employ a shared encoder E_s, which is initialized with the encoder of the SAM model and gradually absorbs knowledge of the CLIP model during training. In addition to the shared encoder, the CLIP model branch comprises a text encoder and a CLIP decoder; the text encoder encodes the prompt text into text vectors that guide the CLIP segmentation process, and the CLIP model decoder is responsible for generating the final segmentation prediction mask. In the SAM model branch, the image data are first fed into the nnU-Net network to automatically generate box prompts, eliminating manual annotation and saving time and labor. The input image with the box prompt is then fed into the prompt encoder to generate the prompt embedding, which further guides the SAM model during segmentation. Likewise, the SAM model decoder outputs the final prediction segmentation result.
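To make the branch layout concrete, a schematic skeleton of the dual-branch network is sketched below; the shared encoder, decoders, nnU-Net coarse segmenter, prompt encoder, and fusion heads are passed in as placeholder modules and do not correspond to any specific published implementation.

```python
# Schematic skeleton of the dual-branch segmentation network; every sub-module is a placeholder.
import torch.nn as nn

class DualBranchSegmenter(nn.Module):
    def __init__(self, shared_encoder, sam_decoder, clip_decoder,
                 coarse_nnunet, prompt_encoder, fusion_sam, fusion_clip):
        super().__init__()
        self.encoder = shared_encoder        # E_s, initialised from the SAM image encoder
        self.sam_decoder = sam_decoder       # D_sam (UNet-style)
        self.clip_decoder = clip_decoder     # D_clip (UNet-style)
        self.coarse_nnunet = coarse_nnunet   # produces box prompts automatically
        self.prompt_encoder = prompt_encoder # P_Q of the SAM model
        self.fusion_sam = fusion_sam         # fusion head realising formula (6)
        self.fusion_clip = fusion_clip       # fusion head realising formula (8)

    def forward(self, x, text_embedding):
        feat = self.encoder(x)                                    # F_i
        box_prompt = self.coarse_nnunet(x)                        # coarse segmentation -> box prompt
        q_t = self.prompt_encoder(box_prompt)                     # prompt embedding Q_t
        p_sam = self.fusion_sam(feat, q_t, self.sam_decoder(feat))
        p_clip = self.fusion_clip(feat, text_embedding, self.clip_decoder(feat))
        return p_sam, p_clip
```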
S4: after the whole network model has been constructed, the prediction mask of every picture in each mini-batch is computed. Specifically, the input image first passes through the encoder E_s shared by the SAM model and the CLIP model to obtain the image features, as shown in formula (2):

F_i = E_s(X_i)    (2);

where X_i represents the i-th image input to the network, E_s is the encoder section shared by the SAM model and the CLIP model, and F_i is the extracted image feature. Meanwhile, the image with the box prompt generated by nnU-Net is fed into the prompt encoder to obtain the prompt embedding, as shown in formulas (3) and (4):

X_i^box = nnUNet(X_i)    (3);
Q_t = P_Q(X_i^box)    (4);

where X_i represents the i-th image data input to the nnU-Net network for coarse segmentation, X_i^box represents the image data with the box prompt, P_Q represents the prompt encoder of the SAM model, and Q_t represents the corresponding prompt embedding. Since the size of the feature map F_i is inconsistent with that of the previously obtained prompt embedding Q_t, F_i is first passed through global average pooling before it can be concatenated with Q_t to obtain the intermediate parameter, as shown in formula (5):

θ_i = Concat(GAP(F_i), Q_t)    (5);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (6):

p_i^sam = Conv(D_sam(F_i) + Expand(Conv(θ_i)))    (6);

where D_sam represents the decoder part of the SAM segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
The above steps describe how the SAM model branch processes the image data and the prompt data. The CLIP model decoder part follows the same procedure as the SAM model: the feature map F_i is concatenated with the text embedding T_e to obtain the intermediate parameter θ_i to be processed, the image features are then restored by the up-sampling portion of the decoder and added to θ_i, and a convolution is applied to obtain the final prediction segmentation mask. The specific steps are as follows:
For the CLIP model branch, the image feature F_i from the shared encoder must likewise be concatenated with the text feature T_e obtained in S2. Similarly, since the size of the feature map F_i is inconsistent with that of the previously obtained text feature T_e, F_i is first passed through global average pooling before it can be concatenated with T_e to obtain the intermediate parameter, as shown in formula (7):

θ_i = Concat(GAP(F_i), T_e)    (7);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (8):

p_i^clip = Conv(D_clip(F_i) + Expand(Conv(θ_i)))    (8);

where D_clip represents the decoder part of the CLIP segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
S5: the foregoing has explained the components of each part of the model network; the objective function and the optimization target of the whole process must also be defined. The prediction masks computed by the segmentation network have been obtained through the previous steps, and the loss must now be constructed to train the model and optimize the model parameters.
(1) Supervised loss L_sup:
The supervised loss L_sup measures the gap between the prediction mask computed by the segmentation network and the real label; in the present invention it comprises the Dice loss and the cross-entropy loss, as shown in formulas (9), (10) and (11):

L_ce = −(1/(H×W)) Σ_i [ŷ_i log(p_i) + (1 − ŷ_i) log(1 − p_i)]    (9);
L_dice = 1 − 2Σ_i p_i ŷ_i / (Σ_i p_i + Σ_i ŷ_i)    (10);
L_sup = L_ce + L_dice    (11);

where p_i represents the prediction label output by the network, H×W represents the number of voxels of the image, and ŷ_i represents the real label in the dataset; L_sup is the supervised loss, L_ce represents the cross-entropy loss, and L_dice represents the Dice loss.
(2) Unsupervised loss L_semi:
For the unsupervised loss L_semi, i.e. the consistency loss on unlabeled data between the SAM model branch and the CLIP model branch, the specific formula is shown in (12):

L_semi = L_mse(p^clip, p^sam)    (12);

where L_mse represents the MSE loss, p^clip represents the prediction result of the CLIP model, and p^sam represents the prediction result of the SAM model.
Combining the supervised and unsupervised loss functions, the total loss function of the model is defined as shown in formula (13):

L_all = L_sup + λ L_semi    (13);

where λ is a weighting coefficient that increases with the number of iteration cycles I. The purpose of introducing the dynamic parameter λ is that in the early stage of network training the parameter optimization is driven mainly by the annotated labels, but many errors accumulate at the same time; as the network iterates, more weight should be given to the unsupervised loss in the later stage to correct the previously accumulated errors.
Based on the previous steps, the result of each branch has been obtained, and the loss function and the main learning task are clear, so the whole model can be trained; once trained, the model can be used for subsequent inference. S6: first, to further improve the segmentation performance of the model, the supervised loss function is used to effectively exploit the real label information in the labeled data and guide the model to learn correct feature representations, which helps the model better distinguish different types of tissue structures or lesions and segment them more accurately.
S7: after the model has learned good feature representations from the data through the supervised loss, the unlabeled data are fed into the CLIP model and the SAM model respectively to generate segmentation results, and the consistency loss is computed from the segmentation results of the two models, so that the unlabeled data are further exploited and the accuracy of model segmentation is improved.
S8: through the preceding training steps the model has fully learned the features in the data; since the SAM model segments better, it is selected as the model for final inference, and the corresponding medical picture is input into it to obtain the prediction mask.
Example 2
The present example performs actual verification based on the method provided in example 1.
To verify the accuracy of the proposed image segmentation, experiments were performed on the cardiac cine magnetic resonance imaging dataset of the Automatic Cardiac Diagnosis Challenge (ACDC), using Dice and HD95 as evaluation metrics; the Dice result is 85.89 and the HD95 result is 1.86. On the real ACDC dataset, the Dice and HD95 obtained by the proposed semi-supervised medical image segmentation method are clearly superior to other methods under the same settings, including the Dual-task Consistency (DTC) model and the Mutual Consistency method, which shows that the model constructed by the invention outperforms other existing models, makes better use of unlabeled data, and achieves higher image segmentation accuracy.
The above scheme is merely one implementation of the present invention, but the scope of the present invention is not limited thereto; any substitution or alteration conceivable to those skilled in the art falls within the scope of the present invention.

Claims (6)

1. A semi-supervised medical image segmentation method based on image text cooperative constraint, characterized by comprising the following steps:
S1: collecting a medical image dataset and preprocessing the image data; dividing the data set into a training set and a testing set, wherein the training set comprises supervised image data and unsupervised image data;
S2: encoding the text description of the selected dataset using a text encoder of the large visual language model CLIP to obtain text features;
S3: constructing a network model, wherein the network model comprises two branches, namely a CLIP model branch and a SAM model branch; the two branches adopt a shared encoder, the segmentation network structure of each branch is a UNet, and their parameters are initialized differently;
S4: after the training set is input into the network model, for the SAM model branch, the extracted image features and the prompt embedding are concatenated and the result is added as a guiding parameter in subsequent operations to obtain a segmentation result; for the CLIP model branch, the extracted image features and the obtained text features are concatenated and likewise added as a guiding parameter in subsequent operations to obtain a segmentation result; the prompt embedding is obtained as follows: in the SAM model branch, the image data are first input into the nnU-Net network to automatically generate a box prompt, eliminating manual annotation, and the input image with the box prompt is then fed into the prompt encoder to generate the prompt embedding;
S5: constructing the loss functions of the network model, comprising a supervised loss for labeled data, a consistency loss for unlabeled data, and a total loss function combining the two;
S6: performing supervised training of the CLIP model branch and the SAM model branch respectively with the labeled data in the training set;
S7: performing unsupervised training of the CLIP model branch and the SAM model branch through the consistency loss with the unlabeled data in the training set;
S8: outputting the final image segmentation result on the test data of the test set through the SAM model branch.
2. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S1: the preprocessing comprises: converting the image format, cropping the image, and normalizing the image; a dataset D_tr of the training phase is then constructed, comprising a supervised part D_sup and an unsupervised part D_unsup, i.e. D_tr = D_sup ∪ D_unsup, wherein D_sup = {(X_1, Y_1), (X_2, Y_2), ......, (X_L, Y_L)}, wherein X_L represents a labeled medical image, Y_L is its corresponding real label, and L represents the number of annotated medical images; D_unsup = {X_(L+1), X_(L+2), ......, X_M}, with a total of M−L images indexed from L+1 to M representing the medical images without annotations.
3. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S2: feature extraction of the text description is performed using the text encoder of the large visual language model CLIP, as shown in formula (1):

T_e = F_t(t)    (1);

where T_e represents the extracted feature vector, F_t represents the text encoder, and t represents the textual description of the organ.
4. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S3: two branch segmentation network models are constructed, namely a CLIP model branch and a SAM model branch; both branches employ a shared encoder E_s, which is initialized by the encoder of the SAM model during initialization and gradually absorbs the knowledge of the CLIP model during training; in addition to the shared encoder, the CLIP model branch comprises a text encoder and a CLIP decoder, wherein the text encoder encodes the prompt text into text vectors for prompting the segmentation process of the CLIP model, and the CLIP model decoder is responsible for generating the final segmentation prediction mask; in the SAM model branch, the image data are first input into the nnU-Net network to automatically generate a box prompt, eliminating manual annotation; the input image with the box prompt is then fed into the prompt encoder to generate the prompt embedding, which guides the SAM model during segmentation; likewise, the SAM model decoder outputs the final prediction segmentation result.
5. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S4: the input image first passes through the encoder E_s shared by the SAM model and the CLIP model to obtain the image features, as shown in formula (2):

F_i = E_s(X_i)    (2);

where X_i represents the i-th image input to the network, E_s is the encoder section shared by the SAM model and the CLIP model, and F_i is the extracted image feature; meanwhile, the image with the box prompt generated by nnU-Net is input into the prompt encoder to obtain the prompt embedding, as shown in formulas (3) and (4):

X_i^box = nnUNet(X_i)    (3);
Q_t = P_Q(X_i^box)    (4);

where X_i represents the i-th image data input to the nnU-Net network for coarse segmentation, X_i^box represents the image data with the box prompt, P_Q represents the prompt encoder of the SAM model, and Q_t represents the corresponding prompt embedding; since the size of the feature map F_i is inconsistent with the size of the previously obtained prompt embedding Q_t, F_i is passed through global average pooling before being concatenated with Q_t to obtain the intermediate parameter, as shown in formula (5):

θ_i = Concat(GAP(F_i), Q_t)    (5);

where GAP represents a global average pooling operation, Concat(·,·) represents a vector concatenation operation, and θ_i is the intermediate parameter variable to be processed; whereas the image feature F_i obtained by the encoder continues through the up-sampling part of the decoder to restore the feature map to the size of the original picture, and is added to the processed θ_i, the addition result being subjected to a convolution operation to obtain the final prediction mask, as shown in formula (6):

p_i^sam = Conv(D_sam(F_i) + Expand(Conv(θ_i)))    (6);

where D_sam represents the decoder part of the SAM segmentation network, Conv(·) represents a convolution operation whose aim is to make the number of channels of θ_i consistent with the channel number of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with the size of the feature map obtained after the decoder; after these two processing steps the addition operation can be performed; the outer Conv is the convolution layer that yields the final segmentation result;
for the CLIP model decoder section, the feature map F_i is likewise concatenated with the text embedding T_e to obtain the intermediate parameter θ_i to be processed, the image features are then restored by the up-sampling part of the decoder and added to θ_i, and a convolution operation is performed to obtain the final prediction segmentation mask.
6. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein S5 is specifically as follows:
S5-1: supervised loss L_sup:
the supervised loss L_sup is used to measure the difference between the prediction mask calculated by the segmentation network and the real label; the supervised loss comprises a Dice loss and a cross-entropy loss, as shown in formulas (7), (8) and (9):

L_ce = −(1/(H×W)) Σ_i [ŷ_i log(p_i) + (1 − ŷ_i) log(1 − p_i)]    (7);
L_dice = 1 − 2Σ_i p_i ŷ_i / (Σ_i p_i + Σ_i ŷ_i)    (8);
L_sup = L_ce + L_dice    (9);

where p_i denotes the prediction label of the network output, H×W denotes the number of voxels of the image, and ŷ_i represents the real label in the dataset; L_sup is the supervised loss, L_ce represents the cross-entropy loss, and L_dice represents the Dice loss;
S5-2: unsupervised loss L_semi:
for the unsupervised loss L_semi, i.e. the consistency loss between the SAM model branch and the CLIP model branch for unlabeled data, the specific formula is shown as (10):

L_semi = L_mse(p^clip, p^sam)    (10);

where L_mse represents the MSE loss, p^clip represents the prediction result of the CLIP model, and p^sam represents the prediction result of the SAM model;
combining the supervised loss function and the unsupervised loss function, the total loss function of the network model is defined as shown in formula (11):

L_all = L_sup + λ L_semi    (11);

where λ is a weighting coefficient that increases with the number of iteration cycles I.
CN202410353448.5A 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint Active CN117952993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410353448.5A CN117952993B (en) 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410353448.5A CN117952993B (en) 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint

Publications (2)

Publication Number Publication Date
CN117952993A CN117952993A (en) 2024-04-30
CN117952993B true CN117952993B (en) 2024-06-18

Family

ID=90800282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410353448.5A Active CN117952993B (en) 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint

Country Status (1)

Country Link
CN (1) CN117952993B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212490A (en) * 2024-05-15 2024-06-18 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for image segmentation model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112204666A (en) * 2018-04-13 2021-01-08 格里尔公司 Multiple assay predictive model for cancer detection

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430946B1 (en) * 2019-03-14 2019-10-01 Inception Institute of Artificial Intelligence, Ltd. Medical image segmentation and severity grading using neural network architectures with semi-supervised learning techniques
US11488309B2 (en) * 2020-08-27 2022-11-01 The Chinese University Of Hong Kong Robust machine learning for imperfect labeled image segmentation
CN113129309B (en) * 2021-03-04 2023-04-07 同济大学 Medical image semi-supervised segmentation system based on object context consistency constraint
CN113077471B (en) * 2021-03-26 2022-10-14 南京邮电大学 Medical image segmentation method based on U-shaped network
CN115294038A (en) * 2022-07-25 2022-11-04 河北工业大学 Defect detection method based on joint optimization and mixed attention feature fusion
CN115187783B (en) * 2022-09-09 2022-12-27 之江实验室 Multi-task hybrid supervision medical image segmentation method and system based on federal learning
CN116051574A (en) * 2022-12-28 2023-05-02 河南大学 Semi-supervised segmentation model construction and image analysis method, device and system
CN116030044A (en) * 2023-03-01 2023-04-28 北京工业大学 Boundary-aware semi-supervised medical image segmentation method
CN117437423A (en) * 2023-11-29 2024-01-23 南京理工大学 Weak supervision medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement
CN117611601B (en) * 2024-01-24 2024-04-23 中国海洋大学 Text-assisted semi-supervised 3D medical image segmentation method
CN117690031B (en) * 2024-02-04 2024-04-26 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112204666A (en) * 2018-04-13 2021-01-08 格里尔公司 Multiple assay predictive model for cancer detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Convolutional neural network image semantic segmentation technology; Tian Qichuan; Meng Ying; Journal of Chinese Computer Systems; 2020-05-29 (Issue 06); full text *

Also Published As

Publication number Publication date
CN117952993A (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN117952993B (en) Semi-supervised medical image segmentation method based on image text cooperative constraint
CN111091589B (en) Ultrasonic and nuclear magnetic image registration method and device based on multi-scale supervised learning
CN113314205B (en) Efficient medical image labeling and learning system
CN113393469A (en) Medical image segmentation method and device based on cyclic residual convolutional neural network
CN114782384B (en) Cardiac chamber image segmentation method and device based on semi-supervision method
CN111079901A (en) Acute stroke lesion segmentation method based on small sample learning
CN117611601B (en) Text-assisted semi-supervised 3D medical image segmentation method
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN116596949A (en) Medical image segmentation method based on conditional diffusion model
CN115578427A (en) Unsupervised single-mode medical image registration method based on deep learning
CN115526829A (en) Honeycomb lung focus segmentation method and network based on ViT and context feature fusion
CN114972266A (en) Lymphoma ultrasonic image semantic segmentation method based on self-attention mechanism and stable learning
CN112200810B (en) Multi-modal automated ventricle segmentation system and method of use thereof
CN117710671A (en) Medical image segmentation method based on segmentation large model fine adjustment
CN117808834A (en) SAM-based cross-modal domain generalization medical image segmentation method
CN115496732B (en) Semi-supervised heart semantic segmentation algorithm
CN115565671A (en) Atrial fibrillation auxiliary analysis method based on cross-model mutual teaching semi-supervision
CN115205215A (en) Corneal nerve image segmentation method and system based on Transformer
CN114974522A (en) Medical image processing method and device, electronic equipment and storage medium
CN115409812A (en) CT image automatic classification method based on fusion time attention mechanism
Mazher et al. Multi-disease, multi-view and multi-center right ventricular segmentation in cardiac MRI using efficient late-ensemble deep learning approach
CN114820636A (en) Three-dimensional medical image segmentation model and training method and application thereof
CN114419015A (en) Brain function fusion analysis method based on multi-modal registration
CN113269815A (en) Deep learning-based medical image registration method and terminal
Dhiman et al. Brain Tumor Segmentation in MRI Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant