CN117952993B - Semi-supervised medical image segmentation method based on image text cooperative constraint - Google Patents

Semi-supervised medical image segmentation method based on image text cooperative constraint

Info

Publication number
CN117952993B
CN117952993B
Authority
CN
China
Prior art keywords
model
image
text
sam
clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410353448.5A
Other languages
Chinese (zh)
Other versions
CN117952993A (en)
Inventor
蔡青
曹子彦
鄢柯
张帆
刘治
徐勇
王珊珊
董军宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202410353448.5A priority Critical patent/CN117952993B/en
Publication of CN117952993A publication Critical patent/CN117952993A/en
Application granted granted Critical
Publication of CN117952993B publication Critical patent/CN117952993B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a semi-supervised medical image segmentation method based on image-text collaborative constraint, belonging to the technical field of medical image processing. The network comprises an encoder shared by a SAM model and a CLIP model, a CLIP model branch, and a SAM model branch. The SAM model excels at image segmentation and object localization and has good spatial understanding; the CLIP model has strong cross-modal understanding and can combine text and image information for task processing. By combining the two, the model acquires good spatial localization capability in image understanding tasks and can make full use of text information, so that images are understood and processed more comprehensively. The invention combines the SAM and CLIP models so that they complement each other to improve model performance, and the structure and parameters of the model can be flexibly adjusted according to task demands, thereby better adapting to different image segmentation tasks and scenes.

Description

Semi-supervised medical image segmentation method based on image text cooperative constraint
Technical Field
The invention provides a semi-supervised medical image segmentation method based on image-text collaborative constraint, and belongs to the technical field of medical image processing.
Background
Medical image segmentation is a process of precisely marking and locating structures or regions in a medical image, dividing the medical image into different regions, each region corresponding to a particular structure, organ or lesion in the image. Medical image segmentation plays an important role in the field of medical imaging, provides accurate and important information for medical diagnosis, treatment and operation, and has profound effects on improving the medical care and treatment effects of patients.
In recent years, deep learning techniques have made remarkable progress in medical image segmentation. For example, the U-Net model, a classical deep learning architecture widely used for medical image segmentation, passes the input image through an encoder-decoder network structure that helps capture features at different levels and is particularly suitable for small-sample settings. Generative adversarial networks (GANs) are used to generate more realistic medical images and can also be used to improve segmentation performance by training with images that have more realistic medical characteristics. These advances have given deep learning greater accuracy and robustness in medical image segmentation, providing a more promising solution for medical image analysis. Although deep learning has made significant progress in medical image segmentation, its performance in many medical tasks is still suboptimal, mainly for the following reasons:
First, labeling is difficult and expensive: labeling medical images typically requires specialized doctors or other medical professionals, which makes the labeling process expensive and time-consuming. At the same time, the labeling of medical images requires high accuracy and reliability for every pixel.
Second, data imbalance: medical image datasets may suffer from class imbalance, i.e. the number of samples of some classes far exceeds that of others. This can cause models to favor the higher-frequency categories during training while performing poorly on the remaining categories.
Third, data diversity: medical images cover many diseases, organs and scanning devices and are therefore highly diverse. This makes it more complex to design a model that works well across different scenarios and modalities, because the model needs sufficient generalization capability.
Disclosure of Invention
The invention aims to provide a semi-supervised medical image segmentation method based on image text cooperative constraint, so as to make up for the defects of the prior art.
The invention combines CLIP and SAM into a dual-branch network model, drawing on the advantages of both models to establish a semi-supervised medical image segmentation model that addresses the problems of insufficient data and network overfitting in semi-supervised medical image segmentation tasks.
In order to achieve the aim of the invention, the invention adopts the following specific technical scheme:
A semi-supervised medical image segmentation method based on image text collaborative constraint comprises the following steps:
S1: collecting a medical image dataset and preprocessing the image data; dividing the data set into a training set and a testing set, wherein the training set comprises supervised image data and unsupervised image data;
S2: encoding the text description of the selected dataset using a text encoder of the large visual language model CLIP to obtain text features;
S3: constructing a network model, wherein the network model comprises two branches, namely a CLIP model branch and a SAM model branch; the two branches adopt a shared encoder, the segmentation network structure of each branch is a UNet, and their parameters are initialized differently;
S4: after the training set is input into the network model, for the SAM model branch, the extracted image features and the prompt embedding are concatenated and the result is added as a guiding parameter in subsequent operations to obtain a segmentation result; for the CLIP model branch, the extracted image features and the obtained text features are concatenated and likewise added as a guiding parameter in subsequent operations to obtain a segmentation result;
S5: constructing the loss functions of the network model, comprising a supervised loss for labeled data, a consistency loss for unlabeled data, and a total loss function combining the two;
S6: performing supervised training of the CLIP model branch and the SAM model branch respectively with the labeled data in the training set;
S7: performing unsupervised training of the CLIP model branch and the SAM model branch through the consistency loss with the unlabeled data in the training set;
S8: outputting the final image segmentation result on the test data of the test set through the SAM model branch.
Further, in step S1: the preprocessing comprises converting the image format, cropping the image, and normalizing the image; a training-phase dataset D_tr is then constructed, comprising a supervised part D_sup and an unsupervised part D_unsup, i.e. D_tr = D_sup ∪ D_unsup, where D_sup = {(X_1, Y_1), (X_2, Y_2), ......, (X_L, Y_L)}, with X_L representing a labeled medical image, Y_L its corresponding real label, and L the number of annotated medical images; D_unsup = {X_(L+1), X_(L+2), ......, X_M}, where the M−L images indexed from L+1 to M are the medical images without annotations.
Further, in step S2: the text encoder of the large visual-language model CLIP is used to extract features from the text description and is not fine-tuned during the whole training process, which greatly reduces the training overhead without an excessive loss of accuracy; the text description is a prompt containing the name of the organ to be segmented in the dataset. The text prompt is then passed through the text encoder of the CLIP model to obtain the text embedding, as shown in formula (1):

T_e = F_t(t)    (1);

where T_e represents the extracted feature vector, F_t represents the text encoder, and t represents the textual description of the organ; for each dataset, the textual description corresponding to every image is the same.
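As an illustration of formula (1), the sketch below queries a frozen CLIP text encoder with an organ-name prompt; the Hugging Face checkpoint name, the prompt template, and the use of the pooled output are assumptions made for this sketch only.

```python
# Minimal sketch of formula (1), T_e = F_t(t), with the CLIP text encoder kept frozen.
# Checkpoint name and prompt template are illustrative assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
text_encoder.eval()                          # the text encoder is never fine-tuned
for p in text_encoder.parameters():
    p.requires_grad = False

def encode_organ_prompt(organ_name: str) -> torch.Tensor:
    """Return the text embedding T_e for the organ description t."""
    t = f"A photo of a {organ_name}"         # same description for every image of a dataset
    tokens = tokenizer(t, return_tensors="pt")
    with torch.no_grad():
        T_e = text_encoder(**tokens).pooler_output   # shape: (1, hidden_dim)
    return T_e

T_e = encode_organ_prompt("Left Atrium")
```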
Further, in step S3: an overall segmentation network model is constructed, comprising a shared encoder, a CLIP model branch, and a SAM model branch, wherein the CLIP model branch comprises a text encoder and a CLIP image decoder, and the SAM model branch comprises an nnU-Net network, a prompt encoder, and a SAM image decoder.
Further, in S4: in order to fully utilize the segmentation capability of the SAM model and the text-prompt capability of the CLIP model, while overcoming the lack of annotated data in medical image segmentation, two segmentation networks are adopted; both are UNet in structure and their parameters are initialized differently. The SAM model requires prompts for the input picture (points, boxes, text, etc.), and manual prompting is too costly and time-consuming. To overcome this drawback, the invention first uses an nnU-Net network to coarsely segment the input picture and generate the corresponding box prompt, and then feeds the picture with the box prompt into the prompt encoder of the SAM model to generate the prompt embedding, as shown in formulas (2) and (3):

X_i^box = nnUNet(X_i)    (2);
Q_t = P_Q(X_i^box)    (3);

where X_i represents the i-th image data input to the nnU-Net network for coarse segmentation, X_i^box represents the image data with the box prompt, P_Q represents the prompt encoder of the SAM model, and Q_t represents the corresponding prompt embedding. While the box-prompt embedding is obtained, the same image data is also fed into the encoder shared by the SAM model and the CLIP model to obtain the image embedding, as shown in formula (4):

F_i = E_s(X_i)    (4);

where X_i represents the i-th image input to the network, E_s is the encoder section shared by the SAM model and the CLIP model, and F_i is the extracted image feature. Since the size of the feature map F_i is inconsistent with that of the previously obtained prompt embedding Q_t, F_i is first passed through global average pooling before it can be concatenated with Q_t to obtain an intermediate parameter, as shown in formula (5):

θ_i = Concat(GAP(F_i), Q_t)    (5);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (6):

p_i^sam = Conv(D_sam(F_i) + Expand(Conv(θ_i)))    (6);

where D_sam represents the decoder part of the SAM segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
The above steps describe how the SAM model branch processes the image data and obtains its segmentation result. For the CLIP model branch, the image feature F_i from the shared encoder must likewise be concatenated with the text feature T_e obtained in S2. Similarly, since the size of the feature map F_i is inconsistent with that of the previously obtained text feature T_e, F_i is first passed through global average pooling before it can be concatenated with T_e to obtain an intermediate parameter, as shown in formula (7):

θ_i = Concat(GAP(F_i), T_e)    (7);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (8):

p_i^clip = Conv(D_clip(F_i) + Expand(Conv(θ_i)))    (8);

where D_clip represents the decoder part of the CLIP segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
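For concreteness, the following is a minimal PyTorch sketch of the fusion described by formulas (5)-(8): the encoder feature map is global-average-pooled, concatenated with the guidance embedding (the prompt embedding Q_t in the SAM branch or the text feature T_e in the CLIP branch), projected by a convolution, expanded to the decoder output size, and added before the final prediction convolution. Channel counts, tensor shapes, and the class name are assumptions for illustration, not the exact configuration of the patented network.

```python
# Illustrative sketch of the GAP -> Concat -> Conv -> Expand -> add fusion of formulas (5)-(8).
# Channel sizes and shapes are assumed for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedFusionHead(nn.Module):
    """Fuse a guidance embedding (prompt embedding Q_t or text feature T_e)
    with the upsampled decoder output, then predict the segmentation mask."""
    def __init__(self, enc_channels=256, emb_dim=256, dec_channels=64, num_classes=2):
        super().__init__()
        # Inner Conv of formulas (6)/(8): align theta_i with the decoder channel count
        self.theta_proj = nn.Conv2d(enc_channels + emb_dim, dec_channels, kernel_size=1)
        # Outer Conv: the convolution layer that yields the final segmentation result
        self.out_conv = nn.Conv2d(dec_channels, num_classes, kernel_size=1)

    def forward(self, feat, guidance, dec_out):
        # feat:     encoder feature map F_i,          (B, enc_channels, h, w)
        # guidance: Q_t or T_e,                       (B, emb_dim)
        # dec_out:  decoder output at original size,  (B, dec_channels, H, W)
        gap = F.adaptive_avg_pool2d(feat, 1).flatten(1)                 # GAP(F_i)
        theta = torch.cat([gap, guidance], dim=1)                       # formulas (5)/(7)
        theta = self.theta_proj(theta[:, :, None, None])                # inner Conv
        theta = theta.expand(-1, -1, dec_out.size(2), dec_out.size(3))  # Expand
        return self.out_conv(dec_out + theta)                           # formulas (6)/(8)
```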
Further, in S5: the loss functions used to train the segmentation model comprise a supervised loss on labeled data, a consistency loss that further exploits unlabeled data, and a total loss function combining the two. The supervised loss L_sup measures the gap between the prediction mask computed by the segmentation network and the real label; in the present invention it comprises the Dice loss and the cross-entropy loss, as shown in formulas (9), (10) and (11):

L_ce = −(1/(H×W)) Σ_i [ŷ_i log(p_i) + (1 − ŷ_i) log(1 − p_i)]    (9);
L_dice = 1 − 2Σ_i p_i ŷ_i / (Σ_i p_i + Σ_i ŷ_i)    (10);
L_sup = L_ce + L_dice    (11);

where p_i represents the prediction label output by the network, H×W represents the number of pixels of the image, and ŷ_i represents the real label in the dataset; L_sup is the supervised loss, L_ce represents the cross-entropy loss, and L_dice represents the Dice loss. For the unsupervised loss L_semi, i.e. the consistency loss on unlabeled data between the SAM model branch and the CLIP model branch, the specific formula is shown in (12):

L_semi = L_mse(p^clip, p^sam)    (12);

where L_mse represents the MSE loss, p^clip represents the prediction result of the CLIP model, and p^sam represents the prediction result of the SAM model.
Combining the supervised and unsupervised loss functions, the total loss function of the model is defined as shown in formula (13):

L_all = L_sup + λ L_semi    (13);

where λ is a weighting coefficient that increases with the number of iteration cycles I. The purpose of introducing the dynamic parameter λ is that in the early stage of network training the parameter optimization is driven mainly by the annotated labels, but many errors accumulate at the same time; as the network iterates, more weight should be given to the unsupervised loss in the later stage to correct the previously accumulated errors.
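A minimal sketch of these loss terms is given below for illustration; the binary form of the losses and the Gaussian ramp-up used for λ are assumptions of this sketch, since the text only states that λ increases with the iteration count.

```python
# Sketch of L_sup, L_semi and L_all (formulas (9)-(13)); exact forms are assumed.
import math
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def supervised_loss(pred, target):
    # L_sup = L_ce + L_dice on labeled data
    return F.binary_cross_entropy(pred, target) + dice_loss(pred, target)

def consistency_loss(p_clip, p_sam):
    # L_semi: MSE between the two branches' predictions on unlabeled data
    return F.mse_loss(p_clip, p_sam)

def ramp_up_weight(iteration, max_iteration, w_max=1.0):
    # Assumed Gaussian ramp-up so that lambda grows with the iteration count I
    return w_max * math.exp(-5.0 * (1.0 - iteration / max_iteration) ** 2)

def total_loss(pred_l, target_l, p_clip_u, p_sam_u, iteration, max_iteration):
    lam = ramp_up_weight(iteration, max_iteration)
    return supervised_loss(pred_l, target_l) + lam * consistency_loss(p_clip_u, p_sam_u)
```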
Further, in step S6, the two branches are supervised-trained with the labeled data and the supervised-loss objective, which improves the accuracy of the model. The supervised loss uses the labeled data to improve the segmentation performance of the network; with the supervised loss function, the real label information in the labeled data can be exploited effectively to guide the model to learn correct feature representations, helping it better distinguish different types of tissue structures or lesions and segment them more accurately.
Further, in step S7, the unlabeled data are fed into the CLIP model and the SAM model respectively to generate segmentation results, and the consistency loss is computed from the two segmentation results, so that the unlabeled data are further exploited and the segmentation accuracy of the model is improved. The consistency loss function is generally used to ensure that the outputs of the two models remain consistent under different conditions; it also constrains the range of features and knowledge the models learn during training, thereby improving their generalization to unseen data. Moreover, by maintaining consistency between the two models, instability of the segmentation results caused by data changes or changes in model parameters can be reduced.
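To show how steps S6 and S7 could interleave in practice, a schematic training-loop sketch follows; the dataloader handling, the optimizer choice, and the reuse of the loss helpers sketched after S5 are illustrative assumptions.

```python
# Schematic semi-supervised training step alternating labeled (S6) and unlabeled (S7) batches.
# Loader and optimizer settings are assumptions; supervised_loss, consistency_loss and
# ramp_up_weight refer to the loss sketch given after step S5.
import itertools
import torch

def train(model_sam, model_clip, labeled_loader, unlabeled_loader, max_iters=10000):
    params = itertools.chain(model_sam.parameters(), model_clip.parameters())
    optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)
    labeled_iter = itertools.cycle(labeled_loader)
    unlabeled_iter = itertools.cycle(unlabeled_loader)

    for it in range(max_iters):
        x_l, y_l = next(labeled_iter)              # labeled batch  -> supervised loss
        x_u = next(unlabeled_iter)                 # unlabeled batch -> consistency loss

        l_sup = supervised_loss(model_sam(x_l), y_l) + supervised_loss(model_clip(x_l), y_l)
        p_sam_u, p_clip_u = model_sam(x_u), model_clip(x_u)
        lam = ramp_up_weight(it, max_iters)
        loss = l_sup + lam * consistency_loss(p_clip_u, p_sam_u)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```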
Further, step S8 specifically comprises: selecting data from the test set, testing with the SAM model branch, observing the final segmentation result, and measuring the segmentation accuracy with corresponding evaluation metrics.
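One possible realisation of the test-time measurement in S8 is sketched below; the choice of the Dice coefficient and of the medpy package for the HD95 distance (both used later in Example 2) is an assumption, as is the binarisation of the predicted mask.

```python
# Sketch of test-time evaluation of the SAM-branch output; library choice is an assumption.
import numpy as np
from medpy.metric.binary import hd95   # assumed dependency for the HD95 metric

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (2.0 * np.logical_and(pred, gt).sum() + eps) / (pred.sum() + gt.sum() + eps)

def evaluate(pred_mask: np.ndarray, gt_mask: np.ndarray) -> dict:
    """pred_mask: binarised SAM-branch prediction; gt_mask: ground-truth label."""
    return {"Dice": dice_coefficient(pred_mask, gt_mask),
            "HD95": hd95(pred_mask.astype(bool), gt_mask.astype(bool))}
```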
The invention has the advantages and beneficial effects that:
The invention designs a semi-supervised medical image segmentation method based on image-text collaborative constraint. The SAM model excels at image segmentation and object localization and has good spatial understanding; the CLIP model has strong cross-modal understanding and can combine text and image information for task processing. The invention combines the two, so that the model has good spatial localization capability in image understanding tasks and can make full use of text information, thereby understanding and processing images more comprehensively. The prompts required by the SAM model are generated automatically: nnU-Net performs a coarse segmentation to produce box prompts, which reduces the cost of manual prompting. Meanwhile, the CLIP model can perform zero-shot learning, i.e. execute tasks without task-specific data, and the SAM model has learned rich image-understanding knowledge through multi-task pre-training. The invention combines the two to perform finer-grained image understanding by exploiting the pre-training knowledge of the SAM model on the basis of zero-shot learning, thereby improving the performance and generalization capability of the model.
The invention combines the SAM and the CLIP model to improve the model performance in a mutually complementary mode, and can flexibly adjust the structure and parameters of the model according to task demands, thereby better adapting to different image segmentation tasks and scenes and achieving good medical image segmentation effect.
Drawings
Fig. 1 is an overall flow chart of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
Example 1:
A semi-supervised medical image segmentation method based on image-text cooperative constraint, the overall flow of which is shown in Figure 1, comprises the following steps:
S1: first, the dataset used for training needs to be partitioned; a publicly available dataset, such as the ACDC dataset, is collected and preprocessed. The preprocessing operations comprise: converting images with the .nii.gz suffix into the .h5 format for convenient subsequent processing, cropping larger images containing much redundant information to their central region to obtain images with less redundant information, and normalizing the images. A training-phase dataset is then constructed, comprising a supervised part D_sup and an unsupervised part D_unsup, i.e. D_tr = D_sup ∪ D_unsup, where D_sup = {(X_1, Y_1), (X_2, Y_2), ......, (X_L, Y_L)}, with X_L representing an image, Y_L its corresponding real label, and L the number of annotated images; D_unsup = {X_(L+1), X_(L+2), ......, X_M}, where the M−L images indexed from L+1 to M are the pictures without annotations.
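A minimal preprocessing sketch matching the step above (conversion from .nii.gz to .h5, centre cropping, and normalisation) might look as follows; the crop size, file layout, and dataset keys are assumptions for illustration.

```python
# Sketch of the S1 preprocessing: .nii.gz -> .h5 conversion, centre crop, normalisation.
# Crop size and dataset keys are illustrative assumptions.
import h5py
import nibabel as nib
import numpy as np

def preprocess_case(image_path: str, label_path: str, out_path: str, crop=(256, 256)):
    image = nib.load(image_path).get_fdata().astype(np.float32)
    label = nib.load(label_path).get_fdata().astype(np.uint8)

    # Centre crop to discard redundant border regions
    h, w = image.shape[:2]
    top, left = (h - crop[0]) // 2, (w - crop[1]) // 2
    image = image[top:top + crop[0], left:left + crop[1]]
    label = label[top:top + crop[0], left:left + crop[1]]

    # Min-max normalisation to [0, 1]
    image = (image - image.min()) / (image.max() - image.min() + 1e-8)

    with h5py.File(out_path, "w") as f:
        f.create_dataset("image", data=image)
        f.create_dataset("label", data=label)
```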
S2: for the collected dataset, an extremely simple sentence is used as the text description, owing to the lack of diagnostic descriptions provided by professional doctors. In the experiments, the text description of the segmented organ serves only as an auxiliary supervisory signal, so the text encoder of the large visual-language model CLIP is used directly to extract features from the text description and is not fine-tuned during the whole training process, which greatly reduces the training overhead without an excessive loss of accuracy. The text description is a very simple sentence, namely "A Photo of a ____ (name of the organ)"; for example, the text description for the LA dataset is "A Photo of a Left Atrium", and the original CLIP model can fully extract its features, as shown in formula (1):

T_e = F_t(t)    (1);

where T_e represents the extracted feature vector, F_t represents the text encoder, and t represents the textual description of the organ; for each dataset, the textual description corresponding to every image is the same.
S3: after the required dataset and text-prompt embedding have been obtained, the overall architecture of the segmentation model is constructed. The model comprises two branches in total, a CLIP model branch and a SAM model branch. Both branches employ a shared encoder E_s, which is initialized with the encoder of the SAM model and gradually absorbs knowledge of the CLIP model during training. In addition to the shared encoder, the CLIP model branch comprises a text encoder and a CLIP decoder; the text encoder encodes the prompt text into text vectors that guide the CLIP segmentation process, and the CLIP model decoder is responsible for generating the final segmentation prediction mask. In the SAM model branch, the image data are first fed into the nnU-Net network to automatically generate box prompts, eliminating manual annotation and saving time and labor. The input image with the box prompt is then fed into the prompt encoder to generate the prompt embedding, which further guides the SAM model during segmentation. Likewise, the SAM model decoder outputs the final prediction segmentation result.
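To make the branch layout concrete, a schematic skeleton of the dual-branch network is sketched below; the shared encoder, decoders, nnU-Net coarse segmenter, prompt encoder, and fusion heads are passed in as placeholder modules and do not correspond to any specific published implementation.

```python
# Schematic skeleton of the dual-branch segmentation network; every sub-module is a placeholder.
import torch.nn as nn

class DualBranchSegmenter(nn.Module):
    def __init__(self, shared_encoder, sam_decoder, clip_decoder,
                 coarse_nnunet, prompt_encoder, fusion_sam, fusion_clip):
        super().__init__()
        self.encoder = shared_encoder        # E_s, initialised from the SAM image encoder
        self.sam_decoder = sam_decoder       # D_sam (UNet-style)
        self.clip_decoder = clip_decoder     # D_clip (UNet-style)
        self.coarse_nnunet = coarse_nnunet   # produces box prompts automatically
        self.prompt_encoder = prompt_encoder # P_Q of the SAM model
        self.fusion_sam = fusion_sam         # fusion head realising formula (6)
        self.fusion_clip = fusion_clip       # fusion head realising formula (8)

    def forward(self, x, text_embedding):
        feat = self.encoder(x)                                    # F_i
        box_prompt = self.coarse_nnunet(x)                        # coarse segmentation -> box prompt
        q_t = self.prompt_encoder(box_prompt)                     # prompt embedding Q_t
        p_sam = self.fusion_sam(feat, q_t, self.sam_decoder(feat))
        p_clip = self.fusion_clip(feat, text_embedding, self.clip_decoder(feat))
        return p_sam, p_clip
```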
S4: after the whole network model has been constructed, the prediction mask of every picture in each mini-batch is computed. Specifically, the input image first passes through the encoder E_s shared by the SAM model and the CLIP model to obtain the image features, as shown in formula (2):

F_i = E_s(X_i)    (2);

where X_i represents the i-th image input to the network, E_s is the encoder section shared by the SAM model and the CLIP model, and F_i is the extracted image feature. Meanwhile, the image with the box prompt generated by nnU-Net is fed into the prompt encoder to obtain the prompt embedding, as shown in formulas (3) and (4):

X_i^box = nnUNet(X_i)    (3);
Q_t = P_Q(X_i^box)    (4);

where X_i represents the i-th image data input to the nnU-Net network for coarse segmentation, X_i^box represents the image data with the box prompt, P_Q represents the prompt encoder of the SAM model, and Q_t represents the corresponding prompt embedding. Since the size of the feature map F_i is inconsistent with that of the previously obtained prompt embedding Q_t, F_i is first passed through global average pooling before it can be concatenated with Q_t to obtain the intermediate parameter, as shown in formula (5):

θ_i = Concat(GAP(F_i), Q_t)    (5);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (6):

p_i^sam = Conv(D_sam(F_i) + Expand(Conv(θ_i)))    (6);

where D_sam represents the decoder part of the SAM segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
The above steps describe how the SAM model branch processes the image data and the prompt data. The CLIP model decoder part follows the same procedure as the SAM model: the feature map F_i is concatenated with the text embedding T_e to obtain the intermediate parameter θ_i to be processed, the image features are then restored by the up-sampling portion of the decoder and added to θ_i, and a convolution is applied to obtain the final prediction segmentation mask. The specific steps are as follows:
For the CLIP model branch, the image feature F_i from the shared encoder must likewise be concatenated with the text feature T_e obtained in S2. Similarly, since the size of the feature map F_i is inconsistent with that of the previously obtained text feature T_e, F_i is first passed through global average pooling before it can be concatenated with T_e to obtain the intermediate parameter, as shown in formula (7):

θ_i = Concat(GAP(F_i), T_e)    (7);

where GAP represents the global average pooling operation, Concat(·,·) represents the vector concatenation operation, and θ_i is the intermediate parameter variable to be processed. Meanwhile, the image feature F_i obtained by the encoder continues through the up-sampling portion of the decoder, which restores the feature map to the size of the original picture; the result is then added to the processed θ_i, and the sum is passed through a convolution to obtain the final prediction mask, as shown in formula (8):

p_i^clip = Conv(D_clip(F_i) + Expand(Conv(θ_i)))    (8);

where D_clip represents the decoder part of the CLIP segmentation network, Conv(·) represents a convolution operation whose purpose is to make the number of channels of θ_i consistent with that of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with that of the feature map obtained after the decoder; after these two processing steps the addition can be performed. The outer Conv is the convolution layer that yields the final segmentation result.
S5: the foregoing has explained the components of each part of the model network; the objective function and the optimization target of the whole process must also be defined. The prediction masks computed by the segmentation network have been obtained through the previous steps, and the loss must now be constructed to train the model and optimize the model parameters.
(1) Supervised loss L_sup:
The supervised loss L_sup measures the gap between the prediction mask computed by the segmentation network and the real label; in the present invention it comprises the Dice loss and the cross-entropy loss, as shown in formulas (9), (10) and (11):

L_ce = −(1/(H×W)) Σ_i [ŷ_i log(p_i) + (1 − ŷ_i) log(1 − p_i)]    (9);
L_dice = 1 − 2Σ_i p_i ŷ_i / (Σ_i p_i + Σ_i ŷ_i)    (10);
L_sup = L_ce + L_dice    (11);

where p_i represents the prediction label output by the network, H×W represents the number of voxels of the image, and ŷ_i represents the real label in the dataset; L_sup is the supervised loss, L_ce represents the cross-entropy loss, and L_dice represents the Dice loss.
(2) Unsupervised loss L_semi:
For the unsupervised loss L_semi, i.e. the consistency loss on unlabeled data between the SAM model branch and the CLIP model branch, the specific formula is shown in (12):

L_semi = L_mse(p^clip, p^sam)    (12);

where L_mse represents the MSE loss, p^clip represents the prediction result of the CLIP model, and p^sam represents the prediction result of the SAM model.
Combining the supervised and unsupervised loss functions, the total loss function of the model is defined as shown in formula (13):

L_all = L_sup + λ L_semi    (13);

where λ is a weighting coefficient that increases with the number of iteration cycles I. The purpose of introducing the dynamic parameter λ is that in the early stage of network training the parameter optimization is driven mainly by the annotated labels, but many errors accumulate at the same time; as the network iterates, more weight should be given to the unsupervised loss in the later stage to correct the previously accumulated errors.
Based on the previous steps, the result of each branch has been obtained, and the loss function and the main learning task are clear, so the whole model can be trained; once trained, the model can be used for subsequent inference. S6: first, to further improve the segmentation performance of the model, the supervised loss function is used to effectively exploit the real label information in the labeled data and guide the model to learn correct feature representations, which helps the model better distinguish different types of tissue structures or lesions and segment them more accurately.
S7: after the model has learned good feature representations from the data through the supervised loss, the unlabeled data are fed into the CLIP model and the SAM model respectively to generate segmentation results, and the consistency loss is computed from the segmentation results of the two models, so that the unlabeled data are further exploited and the accuracy of model segmentation is improved.
S8: through the preceding training steps the model has fully learned the features in the data; since the SAM model segments better, it is selected as the model for final inference, and the corresponding medical picture is input into it to obtain the prediction mask.
Example 2
The present example performs actual verification based on the method provided in example 1.
To verify the accuracy of the proposed image segmentation, experiments were performed on the cardiac cine magnetic resonance imaging dataset of the Automatic Cardiac Diagnosis Challenge (ACDC), using Dice and HD95 as evaluation metrics; the Dice result is 85.89 and the HD95 result is 1.86. On the real ACDC dataset, the Dice and HD95 obtained by the proposed semi-supervised medical image segmentation method are clearly superior to other methods under the same settings, including the Dual-task Consistency (DTC) model and the Mutual Consistency method, which shows that the model constructed by the invention outperforms other existing models, makes better use of unlabeled data, and achieves higher image segmentation accuracy.
The above scheme is merely one implementation of the present invention, but the scope of the present invention is not limited thereto; any substitution or alteration conceivable to those skilled in the art falls within the scope of the present invention.

Claims (6)

1. A semi-supervised medical image segmentation method based on image text cooperative constraint, characterized by comprising the following steps:
S1: collecting a medical image dataset and preprocessing the image data; dividing the data set into a training set and a testing set, wherein the training set comprises supervised image data and unsupervised image data;
S2: encoding the text description of the selected dataset using a text encoder of the large visual language model CLIP to obtain text features;
S3: constructing a network model, wherein the network model comprises two branches, namely a CLIP model branch and a SAM model branch; the two branches adopt a shared encoder, the segmentation network structure of each branch is a UNet, and their parameters are initialized differently;
S4: after the training set is input into the network model, for the SAM model branch, the extracted image features and the prompt embedding are concatenated and the result is added as a guiding parameter in subsequent operations to obtain a segmentation result; for the CLIP model branch, the extracted image features and the obtained text features are concatenated and likewise added as a guiding parameter in subsequent operations to obtain a segmentation result; the prompt embedding is obtained as follows: in the SAM model branch, the image data are first input into the nnU-Net network to automatically generate a box prompt, eliminating manual annotation, and the input image with the box prompt is then fed into the prompt encoder to generate the prompt embedding;
S5: constructing the loss functions of the network model, comprising a supervised loss for labeled data, a consistency loss for unlabeled data, and a total loss function combining the two;
S6: performing supervised training of the CLIP model branch and the SAM model branch respectively with the labeled data in the training set;
S7: performing unsupervised training of the CLIP model branch and the SAM model branch through the consistency loss with the unlabeled data in the training set;
S8: outputting the final image segmentation result on the test data of the test set through the SAM model branch.
2. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S1: the preprocessing comprises: converting the image format, cropping the image, and normalizing the image; a dataset D_tr of the training phase is then constructed, comprising a supervised part D_sup and an unsupervised part D_unsup, i.e. D_tr = D_sup ∪ D_unsup, wherein D_sup = {(X_1, Y_1), (X_2, Y_2), ......, (X_L, Y_L)}, wherein X_L represents a labeled medical image, Y_L is its corresponding real label, and L represents the number of annotated medical images; D_unsup = {X_(L+1), X_(L+2), ......, X_M}, with a total of M−L images indexed from L+1 to M representing the medical images without annotations.
3. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S2: feature extraction of the text description is performed using the text encoder of the large visual language model CLIP, as shown in formula (1):

T_e = F_t(t)    (1);

where T_e represents the extracted feature vector, F_t represents the text encoder, and t represents the textual description of the organ.
4. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S3: two branch segmentation network models are constructed, namely a CLIP model branch and a SAM model branch; both branches employ a shared encoder E_s, which is initialized by the encoder of the SAM model during initialization and gradually absorbs the knowledge of the CLIP model during training; in addition to the shared encoder, the CLIP model branch comprises a text encoder and a CLIP decoder, wherein the text encoder encodes the prompt text into text vectors for prompting the segmentation process of the CLIP model, and the CLIP model decoder is responsible for generating the final segmentation prediction mask; in the SAM model branch, the image data are first input into the nnU-Net network to automatically generate a box prompt, eliminating manual annotation; the input image with the box prompt is then fed into the prompt encoder to generate the prompt embedding, which guides the SAM model during segmentation; likewise, the SAM model decoder outputs the final prediction segmentation result.
5. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein in S4: the input image first passes through the encoder E_s shared by the SAM model and the CLIP model to obtain the image features, as shown in formula (2):

F_i = E_s(X_i)    (2);

where X_i represents the i-th image input to the network, E_s is the encoder section shared by the SAM model and the CLIP model, and F_i is the extracted image feature; meanwhile, the image with the box prompt generated by nnU-Net is input into the prompt encoder to obtain the prompt embedding, as shown in formulas (3) and (4):

X_i^box = nnUNet(X_i)    (3);
Q_t = P_Q(X_i^box)    (4);

where X_i represents the i-th image data input to the nnU-Net network for coarse segmentation, X_i^box represents the image data with the box prompt, P_Q represents the prompt encoder of the SAM model, and Q_t represents the corresponding prompt embedding; since the size of the feature map F_i is inconsistent with the size of the previously obtained prompt embedding Q_t, F_i is passed through global average pooling before being concatenated with Q_t to obtain the intermediate parameter, as shown in formula (5):

θ_i = Concat(GAP(F_i), Q_t)    (5);

where GAP represents a global average pooling operation, Concat(·,·) represents a vector concatenation operation, and θ_i is the intermediate parameter variable to be processed; whereas the image feature F_i obtained by the encoder continues through the up-sampling part of the decoder to restore the feature map to the size of the original picture, and is added to the processed θ_i, the addition result being subjected to a convolution operation to obtain the final prediction mask, as shown in formula (6):

p_i^sam = Conv(D_sam(F_i) + Expand(Conv(θ_i)))    (6);

where D_sam represents the decoder part of the SAM segmentation network, Conv(·) represents a convolution operation whose aim is to make the number of channels of θ_i consistent with the channel number of the feature map obtained after the decoder, and Expand(·) is an expansion operation that makes the size of θ_i consistent with the size of the feature map obtained after the decoder; after these two processing steps the addition operation can be performed; the outer Conv is the convolution layer that yields the final segmentation result;
for the CLIP model decoder section, the feature map F_i is likewise concatenated with the text embedding T_e to obtain the intermediate parameter θ_i to be processed, the image features are then restored by the up-sampling part of the decoder and added to θ_i, and a convolution operation is performed to obtain the final prediction segmentation mask.
6. The semi-supervised medical image segmentation method based on image text collaborative constraints of claim 1, wherein S5 is specifically as follows:
S5-1: supervised loss L_sup:
the supervised loss L_sup is used to measure the difference between the prediction mask calculated by the segmentation network and the real label; the supervised loss comprises a Dice loss and a cross-entropy loss, as shown in formulas (7), (8) and (9):

L_ce = −(1/(H×W)) Σ_i [ŷ_i log(p_i) + (1 − ŷ_i) log(1 − p_i)]    (7);
L_dice = 1 − 2Σ_i p_i ŷ_i / (Σ_i p_i + Σ_i ŷ_i)    (8);
L_sup = L_ce + L_dice    (9);

where p_i denotes the prediction label of the network output, H×W denotes the number of voxels of the image, and ŷ_i represents the real label in the dataset; L_sup is the supervised loss, L_ce represents the cross-entropy loss, and L_dice represents the Dice loss;
S5-2: unsupervised loss L_semi:
for the unsupervised loss L_semi, i.e. the consistency loss between the SAM model branch and the CLIP model branch for unlabeled data, the specific formula is shown as (10):

L_semi = L_mse(p^clip, p^sam)    (10);

where L_mse represents the MSE loss, p^clip represents the prediction result of the CLIP model, and p^sam represents the prediction result of the SAM model;
combining the supervised loss function and the unsupervised loss function, the total loss function of the network model is defined as shown in formula (11):

L_all = L_sup + λ L_semi    (11);

where λ is a weighting coefficient that increases with the number of iteration cycles I.
CN202410353448.5A 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint Active CN117952993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410353448.5A CN117952993B (en) 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410353448.5A CN117952993B (en) 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint

Publications (2)

Publication Number Publication Date
CN117952993A CN117952993A (en) 2024-04-30
CN117952993B true CN117952993B (en) 2024-06-18

Family

ID=90800282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410353448.5A Active CN117952993B (en) 2024-03-27 2024-03-27 Semi-supervised medical image segmentation method based on image text cooperative constraint

Country Status (1)

Country Link
CN (1) CN117952993B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212490A (en) * 2024-05-15 2024-06-18 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for image segmentation model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112204666A (en) * 2018-04-13 2021-01-08 格里尔公司 Multiple assay predictive model for cancer detection

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430946B1 (en) * 2019-03-14 2019-10-01 Inception Institute of Artificial Intelligence, Ltd. Medical image segmentation and severity grading using neural network architectures with semi-supervised learning techniques
US11488309B2 (en) * 2020-08-27 2022-11-01 The Chinese University Of Hong Kong Robust machine learning for imperfect labeled image segmentation
CN113129309B (en) * 2021-03-04 2023-04-07 同济大学 Medical image semi-supervised segmentation system based on object context consistency constraint
CN113077471B (en) * 2021-03-26 2022-10-14 南京邮电大学 Medical image segmentation method based on U-shaped network
CN115294038A (en) * 2022-07-25 2022-11-04 河北工业大学 Defect detection method based on joint optimization and mixed attention feature fusion
CN115187783B (en) * 2022-09-09 2022-12-27 之江实验室 Multi-task hybrid supervision medical image segmentation method and system based on federal learning
CN116051574A (en) * 2022-12-28 2023-05-02 河南大学 Semi-supervised segmentation model construction and image analysis method, device and system
CN116030044A (en) * 2023-03-01 2023-04-28 北京工业大学 Boundary-aware semi-supervised medical image segmentation method
CN117437423A (en) * 2023-11-29 2024-01-23 南京理工大学 Weak supervision medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement
CN117611601B (en) * 2024-01-24 2024-04-23 中国海洋大学 Text-assisted semi-supervised 3D medical image segmentation method
CN117690031B (en) * 2024-02-04 2024-04-26 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112204666A (en) * 2018-04-13 2021-01-08 格里尔公司 Multiple assay predictive model for cancer detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Convolutional neural network image semantic segmentation technology; Tian Qichuan; Meng Ying; Journal of Chinese Computer Systems; 2020-05-29 (Issue 06); full text *

Also Published As

Publication number Publication date
CN117952993A (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN117952993B (en) Semi-supervised medical image segmentation method based on image text cooperative constraint
CN111091589B (en) Ultrasonic and nuclear magnetic image registration method and device based on multi-scale supervised learning
CN113314205B (en) Efficient medical image labeling and learning system
CN113393469A (en) Medical image segmentation method and device based on cyclic residual convolutional neural network
CN114782384B (en) Cardiac chamber image segmentation method and device based on semi-supervision method
CN111079901A (en) Acute stroke lesion segmentation method based on small sample learning
CN117611601B (en) Text-assisted semi-supervised 3D medical image segmentation method
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN116596949A (en) Medical image segmentation method based on conditional diffusion model
CN115578427A (en) Unsupervised single-mode medical image registration method based on deep learning
CN115526829A (en) Honeycomb lung focus segmentation method and network based on ViT and context feature fusion
CN114972266A (en) Lymphoma ultrasonic image semantic segmentation method based on self-attention mechanism and stable learning
CN112200810B (en) Multi-modal automated ventricle segmentation system and method of use thereof
CN117710671A (en) Medical image segmentation method based on segmentation large model fine adjustment
CN117808834A (en) SAM-based cross-modal domain generalization medical image segmentation method
CN115496732B (en) Semi-supervised heart semantic segmentation algorithm
CN115565671A (en) Atrial fibrillation auxiliary analysis method based on cross-model mutual teaching semi-supervision
CN115205215A (en) Corneal nerve image segmentation method and system based on Transformer
CN114974522A (en) Medical image processing method and device, electronic equipment and storage medium
CN115409812A (en) CT image automatic classification method based on fusion time attention mechanism
Mazher et al. Multi-disease, multi-view and multi-center right ventricular segmentation in cardiac MRI using efficient late-ensemble deep learning approach
CN114820636A (en) Three-dimensional medical image segmentation model and training method and application thereof
CN114419015A (en) Brain function fusion analysis method based on multi-modal registration
CN113269815A (en) Deep learning-based medical image registration method and terminal
Dhiman et al. Brain Tumor Segmentation in MRI Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant