CN117409206B - Small sample image segmentation method based on self-adaptive prototype aggregation network - Google Patents


Info

Publication number
CN117409206B
CN117409206B (application CN202311715217.6A)
Authority
CN
China
Prior art keywords
prototype
query
support
adaptive
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311715217.6A
Other languages
Chinese (zh)
Other versions
CN117409206A (en)
Inventor
李群
孙宝泉
肖甫
盛碧云
沙乐天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311715217.6A
Publication of CN117409206A
Application granted
Publication of CN117409206B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and discloses a small sample image segmentation method based on an adaptive prototype aggregation network. Three modules are integrated and combined with a general decoder to build the model: the prior mask generated by the prior mask generation module provides a rough localization of the query object; the adaptive prototype aggregation module aggregates support features, query features and class semantic information; and the visual correlation modeling module obtains visual correlation features that compensate for the spatial details lost in the one-dimensional prototype vector. The invention can selectively exploit support-set information, query-set information and semantic information to generate a more class-specific prototype for the final feature matching process and expands the representational content of the prototype vector, thereby alleviating, to a certain extent, the model performance degradation caused by intra-class differences between support and query objects.

Description

Small sample image segmentation method based on self-adaptive prototype aggregation network
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to a small sample image segmentation method based on a self-adaptive prototype aggregation network.
Background
Semantic segmentation is a fundamental task in computer vision that aims to assign a specific semantic class to each pixel in an image, thereby achieving pixel-level parsing of the image. The development of deep learning has promoted the practical application of semantic segmentation in tasks such as scene understanding and object recognition. However, conventional segmentation methods usually require large amounts of annotated data for training, which is challenging when data are scarce or expensive. To overcome the data dependence, weak generalization and other shortcomings of conventional semantic segmentation models, small sample image segmentation has recently been proposed and has become an emerging field of great interest. The goal of small sample image segmentation is to segment objects of new classes in an image with only a very small number of labeled samples, e.g., 1 or 5. The biggest difference from conventional semantic segmentation methods is that the training classes and test classes of the small sample image segmentation task do not overlap. Under this paradigm, a segmentation model can, like a human, learn and generalize from limited examples. Specifically, existing small sample image segmentation methods follow an episode-based meta-learning strategy, each episode comprising two parts, a support set and a query set. The model is trained and tested on independent episodes so that it learns to predict the segmentation mask of the query image using the auxiliary information of the support set. The model thus acquires general knowledge of a target class from a small number of samples and can therefore generalize to classes unseen during the test phase.
Currently, prototype-based small sample image segmentation methods are the dominant paradigm. These methods represent the target-class object by mining richer support information from a limited number of support samples, for example extracting one or more prototypes from the support features, which are then matched against the query features to predict the mask of the query object. Most of these approaches ignore a critical issue, namely the intra-class differences that are ubiquitous in small sample tasks: objects of the same class differ to varying degrees in color, texture, pose, illumination and so on. Even though the support set and the query set belong to the same class, intra-class differences make it difficult for the model to predict the mask from the commonalities between the two instances. No matter how sufficient the information extracted from the support set is, the feature bias between support and query instances caused by intra-class differences cannot be resolved; when the correlation between the support and the query instance is weak, the model naturally fails to produce ideal predictions. In addition, although the prototype-matching-based method is simple and effective, the original spatial structure is lost in the process of prototype compression, so the prototype lacks critical detail features when matching all query feature pixels.
Disclosure of Invention
In order to solve the above problems of prototype-based small sample image segmentation algorithms, the invention provides a small sample image segmentation method based on an adaptive prototype aggregation network. It introduces an adaptive prototype aggregation module that integrates the support prototype, the query prototype and the word vector of the class name into one enhanced prototype representation, using semantic information to supplement class information that is missing because of intra-class differences; it further designs a prior mask generation module and a visual correlation modeling module; finally, a complete and efficient small sample image segmentation model is constructed together with a general decoder.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
the invention relates to a small sample image segmentation method based on a self-adaptive prototype aggregation network, which mainly comprises the following steps:
step S1: a data set of small sample image segmentations is acquired and divided into a training set and a test set. The training or testing process is performed in terms of independent epodes, each comprising a support set and a query set and word embedding vectors t e R for that class 1×1×d . One support set contains K image-mask sample pairs and one query set contains 1 homogeneous image to be segmented and its truth mask. All images and masks are subjected to unified data preprocessing;
step S2: extracting middle layer feature sets, i.e., intermediate support feature sets, of support images and query images, respectively, using a pre-trained backbone networkAnd query feature set and->
Step S3: a prior mask generation module is constructed. The high-level support feature F_s^h and the high-level query feature F_q^h are selected from the feature sets of S2 and input to the prior mask generation module, which ultimately outputs the prior mask M_pri ∈ R^{h×w×1} of the query image;
Step S4: an adaptive prototype aggregation module is constructed, consisting of three parts: prototype extraction, visual-text aggregation and prototype fusion. The adaptive prototype aggregation module takes the mid-level support feature F_s, the mid-level query feature F_q and the word embedding vector t ∈ R^{1×1×d} of the class as input, and outputs the enhanced prototype representation p_aug ∈ R^{1×1×c}. The enhanced prototype is then expanded to the same spatial size as the prior mask M_pri, denoted P_aug ∈ R^{h×w×c}.
The prototype extraction part mainly extracts the support and query prototypes. It uses the support mask M_s to extract the support prototype vector p_s ∈ R^{1×1×c} from the mid-level support feature F_s through a masked average pooling operation, and uses the prior mask M_pri of the query image generated in step S3 to filter the foreground pixels of the mid-level query feature F_q and aggregate them into the query prototype p_q ∈ R^{1×1×c}.
The visual-text aggregation part comprises two branches that realize the adaptive fusion of the support prototype and the query prototype, respectively, with the semantic word vector. Taking the query branch as an example, the weighted query feature generates the Key (K) and Value (V) through two independent 1×1 convolutions, while the query prototype p_q is spliced with the semantic embedding t into the vector p_{q,t} ∈ R^{1×1×(d+c)}, which yields the Query (Q) after a linear projection. Standard cross-attention is performed on Q, K, V, and a skip connection is applied at the output, i.e., the result is added to the original query prototype p_q, outputting the enhanced query prototype p̂_q. The enhanced support prototype p̂_s is obtained similarly in the support branch.
The prototype fusion part performs a weighted fusion of the enhanced support prototype p̂_s and the enhanced query prototype p̂_q to obtain the enhanced prototype representation p_aug ∈ R^{1×1×c}, which contains far richer information than either the original support prototype p_s or the original query prototype p_q.
Step S5: a visual correlation generation module is constructed. It takes the intermediate support and query feature sets {F_s^i} and {F_q^i} extracted in step S2 and computes cosine similarity between every pair. Since these intermediate features are extracted from three different stages of the backbone network, the computed correlation maps implicitly contain hierarchical relationships. A multi-scale structure fuses the hierarchical correlation maps to obtain the visual correlation feature F_hybrid ∈ R^{h×w×δ}.
Further, the multi-scale structure includes three branches, each with a 1×1 convolution for fusing the correlation maps generated at the corresponding stage. The output of the third branch is added element-wise to the output of the second branch and passed through a cross-branch fusion module composed of a 3×3 convolution with stride 1, a group normalization layer and a ReLU activation layer; the output of the cross-branch fusion module is then added element-wise to the output of the first branch and passed through the cross-branch fusion module once more, so that the output naturally fuses the context information of the three stages.
Step S6: a decoder is constructed for predicting the query mask. The decoder adopts the feature enrichment module proposed by PFENet. M_pri, P_aug and F_hybrid generated in steps S3 to S5 serve as guide information for the mid-level query feature F_q; they are spliced with F_q along the channel dimension and input to the decoder, which generates intermediate prediction results at 4 scales and the finally predicted query mask M̂_q.
Step S7: the adaptive prototype aggregation network constructed in steps S2-S6 is trained with the training data constructed in step S1. The total loss L of the model is calculated from the joint triplet loss L_tri, the intermediate classification loss L_mid and the final classification loss L_final; back-propagation is performed and a stochastic gradient descent algorithm is adopted to optimize the model parameters. During training, the parameters of the entire backbone network are frozen and not updated.
Further, the joint triplet loss L_tri is used to optimize the adaptive prototype aggregation module and is jointly composed of the triplet loss L_tri^s of the support branch and the triplet loss L_tri^q of the query branch.
Further, the triplet loss L_tri^s of the support branch takes the enhanced support prototype p̂_s as the anchor. With the help of the support mask, the negative sample p_s^− is the average of the background pixels of the intermediate support feature F_s; for the positive sample p_s^+, the invention selects the hardest positive sample, i.e., the foreground pixel at the largest l2 distance from the anchor p̂_s.
The triplet loss L_tri^s of the support branch is calculated as:

L_tri^s = max( ‖p̂_s − p_s^+‖_2 − ‖p̂_s − p_s^−‖_2 + m, 0 ),

where m is the margin.
further, the triplet loss of query branchesLikewise in the enhanced query prototype +.>As anchor point, but unlike supporting branches, here is used a priori mask M of query image pri Selectively aggregating query features F by setting different thresholds q Obtain negative sample->And difficult positive sample->
The triplet loss L_tri^q of the query branch is calculated analogously:

L_tri^q = max( ‖p̂_q − p_q^+‖_2 − ‖p̂_q − p_q^−‖_2 + m, 0 ).
Finally, the joint triplet loss L_tri is obtained by adding L_tri^s and L_tri^q.
Further, L_mid and L_final use the cross-entropy function to calculate the losses between the query mask ground truth and, respectively, the intermediate prediction results and the final prediction result generated by the decoder.
Further, the total loss L is calculated as:

L = L_tri + L_mid + L_final.
step S8: and (3) loading the weight file trained in the step S7, and evaluating the performance of the model on the test data set constructed in the step S1.
The beneficial effects of the invention are as follows:
(1) Compared with previous methods, the prior mask generation module designed by the invention localizes the query object more accurately; at the same time, the module suppresses false matches between query background pixels and support foreground pixels.
(2) The adaptive prototype aggregation module designed by the invention integrates the support features, the query features and the category semantics into one enhanced prototype to relieve the deviation between the query set and the support set, and uses semantic information to compensate for the feature bias between the two sets, so that the information contained in the enhanced prototype is far richer than that of the original support prototype. Using the enhanced prototype to match the query features, the method can robustly detect the query pixels associated with the prototype even when the support image and the query image deviate significantly in appearance.
(3) The visual correlation modeling module designed by the invention supplements the spatial detail information lost during prototype aggregation, further improving the accuracy of classification.
(4) Compared with other traditional methods, the method introduces information of the two modalities of vision and text into small sample image segmentation and aligns the two modalities, so that given only a very small number of labeled samples, the model can accurately and efficiently predict the mask of the image to be segmented.
Drawings
Fig. 1 is a flow chart of the adaptive prototype aggregation network of the present invention.
Fig. 2 is a schematic structural diagram of the adaptive prototype aggregation network of the present invention.
Fig. 3 is a schematic diagram of the structure of the prior mask generation module of the present invention.
Fig. 4 is a schematic structural diagram of the visual-text aggregation module of the present invention, taking the query branch as an example.
Detailed Description
Embodiments of the invention are disclosed in the drawings, and for purposes of explanation, numerous practical details are set forth in the following description. However, it should be understood that these practical details are not to be taken as limiting the invention. That is, in some embodiments of the invention, these practical details are unnecessary.
As shown in fig. 1, the invention discloses a small sample image segmentation method based on an adaptive prototype aggregation network, which comprises the following steps:
step S1: a data set of small sample image segmentations is acquired and divided into a training set and a test set. The training or testing process is performed in terms of independent epodes, each comprising a support set of K image-mask sample pairs, a query set of 1 homogeneous image to be segmented and its truth mask, and the word embedding vector t e R of the class 1 ×1×d . All images and masks are subjected to unified data preprocessing; the image preprocessing comprises the following steps: [0.9,1.1]Random scaling of the times, [ -10 °,10 °]Is randomly turned horizontally.
Step S2: a pre-trained ResNet50 is used as the feature extractor to extract the middle-layer feature sets {F_s^i} and {F_q^i} of the support image and the query image respectively. A middle-layer feature is the output of each residual block of blocks 2-4 of the feature extractor, where blocks 2, 3 and 4 contain 3, 4 and 6 residual blocks respectively.
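A minimal sketch of this per-block feature collection, assuming torchvision's ResNet naming (layer1-layer3 correspond to blocks 2-4):

```python
# Frozen ResNet50 backbone that returns the output of every residual block
# in layer1-layer3 (3 + 4 + 6 = 13 mid-level feature maps), as in step S2.
import torch
import torchvision

class MidLevelExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = torch.nn.ModuleList([net.layer1, net.layer2, net.layer3])
        for p in self.parameters():      # backbone parameters stay frozen
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            for block in stage:          # keep each residual block's output
                x = block(x)
                feats.append(x)
        return feats
```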
Step S3: a prior mask generation module is constructed. As shown in FIG. 2, the high-level support feature F_s^h and the high-level query feature F_q^h are selected from the feature sets of S2 and input to the prior mask generation module, which outputs the prior mask M_pri ∈ R^{h×w×1} of the query image.
Step S3-1: as shown in FIG. 3, the high-level support feature is first masked by the support mask M_s to shield the interference of irrelevant regions such as the background. To obtain local feature representations while avoiding overfitting to the base classes that learnable parameters would introduce, 2D average pooling windows of different sizes are adopted to pool the region pixels of the support and the query respectively. When performing 2D average pooling, all features are suitably padded to ensure the outputs have the same size. The whole procedure is described as follows:
R_s = AvgPool_{d_h×d_w}( F_s^h ⊙ I(M_s) ),  R_q = AvgPool_{d_h×d_w}( F_q^h ),

where h, w, c_h denote the height, width and channel dimension of the high-level features respectively, I(·) is a bilinear interpolation operation whose purpose is to transform M_s to the same spatial size as F_s^h, ⊙ is the Hadamard product, and AvgPool_{d_h×d_w} denotes average pooling with a d_h×d_w window.
The invention adopts pooling windows of three shapes, namely d_h×d_w ∈ {5×5, 7×1, 1×7}. According to the three different pooling windows, three kinds of region features {R_s^i} and {R_q^i}, i = 1, 2, 3, are obtained for the support and the query.
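A minimal sketch of this multi-window region pooling; the "same" padding with stride 1 is an assumption so that all three outputs keep the h×w size.

```python
# Multi-window region pooling of step S3-1. mask is the support mask M_s
# (None for the query branch); windows follow the patent: 5x5, 7x1, 1x7.
import torch
import torch.nn.functional as F

WINDOWS = [(5, 5), (7, 1), (1, 7)]

def region_pool(feat, mask=None):
    """feat: (B, C, H, W); mask: (B, 1, Hm, Wm) or None."""
    if mask is not None:
        mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear",
                             align_corners=True)   # bilinear resize of M_s
        feat = feat * mask                          # Hadamard product
    regions = []
    for kh, kw in WINDOWS:
        regions.append(F.avg_pool2d(feat, kernel_size=(kh, kw), stride=1,
                                    padding=(kh // 2, kw // 2)))
    return regions                                  # three (B, C, H, W) maps
```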
Step S3-2: the similarity between the support and query region features is calculated with an attention-based cosine similarity, in which the attention score, specifically the l2 norm of a region feature point, acts as a weighting score on the cosine similarity. The attention-based cosine similarity can be expressed as:

S_r(j, k) = ‖r_s^k‖_2 · ( r_q^j · r_s^k ) / ( ‖r_q^j‖_2 · ‖r_s^k‖_2 ),
where r_q^j and r_s^k denote pixels of the region features R_q and R_s in space. Position-by-position calculation yields the region-based similarity map S_r ∈ R^{hw×hw}, in which the first dimension represents the query dimension and the second dimension the support dimension. The mean over the second dimension of S_r is then taken and min-max normalization applied, and the resulting tensor of size hw×1 is reshaped to the spatial size h×w. Calculating the three region feature pairs gives three similarity maps {S_r^i}. Finally, the prior mask M_pri of the query image is obtained by averaging all the similarity maps.
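A sketch of the whole prior-mask computation follows; which feature point's norm supplies the attention weight is not fully specified by the text, so weighting by the support pixel's norm is an assumption here.

```python
# Attention-weighted cosine similarity and prior-mask aggregation (step S3-2).
import torch
import torch.nn.functional as F

def prior_mask(regions_q, regions_s):
    """regions_q / regions_s: lists of (B, C, H, W) region features."""
    sims = []
    for rq, rs in zip(regions_q, regions_s):
        B, C, H, W = rq.shape
        q = rq.flatten(2).transpose(1, 2)          # (B, HW, C) query pixels
        s = rs.flatten(2).transpose(1, 2)          # (B, HW, C) support pixels
        cos = torch.einsum("bqc,bsc->bqs",
                           F.normalize(q, dim=-1), F.normalize(s, dim=-1))
        attn = s.norm(dim=-1).unsqueeze(1)         # l2-norm attention weights
        sim = (cos * attn).mean(dim=2)             # mean over support dim
        mn = sim.min(dim=1, keepdim=True).values   # min-max normalization
        mx = sim.max(dim=1, keepdim=True).values
        sim = (sim - mn) / (mx - mn + 1e-7)
        sims.append(sim.view(B, 1, H, W))
    return torch.stack(sims).mean(dim=0)           # M_pri: (B, 1, H, W)
```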
Step S4: an adaptive prototype aggregation module is constructed. As shown in FIG. 2, the adaptive prototype aggregation module consists of three parts: prototype extraction, visual-text aggregation and prototype fusion. It takes the mid-level support feature F_s, the mid-level query feature F_q and the word embedding vector t of the class as input, and outputs the enhanced prototype representation p_aug ∈ R^{1×1×c}. The enhanced prototype representation is then expanded to the same spatial size as the prior mask M_pri, denoted P_aug ∈ R^{h×w×c}.
Step S4-1: the prototype extraction part extracts the support and query prototypes. It uses the support mask M_s to extract the support prototype vector p_s ∈ R^{1×1×c} from the mid-level support feature F_s through a masked average pooling operation, and uses the prior mask M_pri generated in step S3 to filter the foreground pixels of the mid-level query feature F_q and aggregate them into the query prototype p_q ∈ R^{1×1×c}. The extraction of p_s and p_q can be formulated as follows:

p_s = Σ_{x,y} F_s(x,y) · M_s(x,y) / Σ_{x,y} M_s(x,y),
p_q = Σ_{x,y} F_q(x,y) · 1[M_pri(x,y) > 0.7] / Σ_{x,y} 1[M_pri(x,y) > 0.7],
where 0.7 is the threshold on M_pri that controls the screening range of the query foreground. Setting the threshold relatively high in this way allows the pixels that truly belong to the query object to be selected accurately.
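A minimal sketch of this prototype extraction:

```python
# Masked average pooling for the support prototype and threshold-filtered
# pooling over the prior mask for the query prototype (step S4-1).
import torch

def masked_avg_pool(feat, mask, eps=1e-7):
    """feat: (B, C, H, W); mask: (B, 1, H, W) with values in [0, 1]."""
    return (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + eps)

def extract_prototypes(F_s, M_s, F_q, M_pri, thresh=0.7):
    p_s = masked_avg_pool(F_s, M_s)          # support prototype, (B, C)
    fg = (M_pri > thresh).float()            # high-confidence query foreground
    p_q = masked_avg_pool(F_q, fg)           # query prototype, (B, C)
    return p_s, p_q
```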
Step S4-2: the visual-text aggregation part comprises two branches that realize the adaptive fusion of the support prototype and the query prototype, respectively, with the semantic word vector. FIG. 4 illustrates the specific operations in the query branch. In the query branch, p_q is spliced with t to obtain the spliced vector p_{q,t} ∈ R^{1×1×(c+d)}. To correlate the visual features with the text embedding, p_{q,t} generates the Query (Q) of the cross-attention, while the Key (K) and Value (V) are generated from the weighted mid-level query feature. Standard cross-attention is performed on Q, K, V, and a skip connection is applied at the output, i.e., the result is added to the original query prototype p_q, outputting the enhanced query prototype p̂_q. The process is formulated as follows:

Q = p_{q,t} W_Q,  K = W_K(F_q'),  V = W_V(F_q'),
p̂_q = p_q + FFN( softmax( Q K^T / √c ) V ),
where W_Q ∈ R^{(c+d)×c} is a linear projection layer, W_K and W_V are 1×1 convolutions with output channel c, and the hidden dimension in the FFN is c. The scaling factor of the attention equals the hidden dimension c. The whole process uses a single attention head. Similarly, the enhanced support prototype p̂_s is obtained in the support branch.
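A minimal sketch of one such branch, assuming the FFN is a two-layer MLP (the patent does not spell out its layout):

```python
# Single-head visual-text cross-attention of step S4-2: Q comes from the
# prototype concatenated with the word vector; K, V from the feature map.
import math
import torch
import torch.nn as nn

class VisualTextAggregation(nn.Module):
    def __init__(self, c, d):
        super().__init__()
        self.w_q = nn.Linear(c + d, c)               # W_Q: (c+d) -> c
        self.w_k = nn.Conv2d(c, c, kernel_size=1)    # W_K: 1x1 conv
        self.w_v = nn.Conv2d(c, c, kernel_size=1)    # W_V: 1x1 conv
        self.ffn = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, c))

    def forward(self, p, t, feat):
        """p: (B, c) prototype; t: (B, d) word vector; feat: (B, c, H, W)."""
        q = self.w_q(torch.cat([p, t], dim=-1)).unsqueeze(1)   # (B, 1, c)
        k = self.w_k(feat).flatten(2).transpose(1, 2)          # (B, HW, c)
        v = self.w_v(feat).flatten(2).transpose(1, 2)          # (B, HW, c)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), -1)
        return p + self.ffn((attn @ v).squeeze(1))             # skip connection
```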
Step S4-3: the prototype fusion part performs a weighted fusion of the enhanced support prototype p̂_s and the enhanced query prototype p̂_q to obtain the enhanced prototype representation p_aug ∈ R^{1×1×c}, which contains far richer information than either the original support prototype p_s or the original query prototype p_q. The specific fusion is:

p_aug = 0.5 p̂_q + 0.5 p̂_s.
step S5: and constructing a visual association generating module. As shown in FIG. 2, it supports and queries the intermediate feature set extracted in step S2And->And calculating cosine similarity every two. Since these intermediate features are extracted from three different phases of the backbone network, the computed associative graph potentially contains hierarchical relationships. Fusing the association graphs by using a multi-scale structure to obtain visual association features F hybrid ∈R h×w×δ
The multi-scale structure comprises three branches, each with a 1×1 convolution used to fuse the correlation maps generated at the corresponding stage. The outputs of the third and second branches are then added element-wise and passed through a cross-branch fusion module consisting of a 3×3 convolution with stride 1, a group normalization layer and a ReLU activation layer. The output of the cross-branch fusion module is then added element-wise to the output of the first branch and output after passing through the cross-branch fusion module once more. The output visual correlation feature naturally fuses the context information of the three stages.
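A sketch of this fusion follows; the channel width δ, the group count of the normalization, and whether the two fusion passes share weights are assumptions not fixed by the text (a shared module is used here).

```python
# Multi-scale correlation fusion of step S5. n_maps[i] is the number of
# correlation maps produced by backbone stage i.
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(groups, channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class VisualCorrelationFusion(nn.Module):
    def __init__(self, n_maps, delta):
        super().__init__()
        # one 1x1 convolution per branch to fuse that stage's maps
        self.reduce = nn.ModuleList(nn.Conv2d(n, delta, 1) for n in n_maps)
        self.fuse = CrossBranchFusion(delta)

    def forward(self, corr1, corr2, corr3):
        b1, b2, b3 = (r(c) for r, c in zip(self.reduce, (corr1, corr2, corr3)))
        x = self.fuse(b3 + b2)     # element-wise add, then cross-branch fusion
        x = self.fuse(x + b1)      # add branch 1, fuse once more
        return x                   # F_hybrid: (B, delta, h, w)
```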
Step S6: a decoder is constructed for predicting the query mask. For simplicity, the decoder adopts the feature enrichment module proposed by PFENet. M_pri, P_aug and F_hybrid generated in steps S3 to S5 serve as guide information for the mid-level query feature F_q and are spliced with F_q along the channel dimension before being input. During decoding, the decoder produces intermediate prediction results at 4 scales and the finally predicted query mask M̂_q.
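How the decoder input is assembled can be sketched as follows (names are illustrative):

```python
# Assembling the decoder input of step S6: expand the enhanced prototype to
# the query feature's spatial size and concatenate all guides channel-wise.
import torch

def build_decoder_input(F_q, p_aug, M_pri, F_hybrid):
    """F_q: (B, c, h, w); p_aug: (B, c); M_pri: (B, 1, h, w);
    F_hybrid: (B, delta, h, w)."""
    B, c, h, w = F_q.shape
    P_aug = p_aug.view(B, c, 1, 1).expand(B, c, h, w)   # broadcast prototype
    return torch.cat([F_q, P_aug, M_pri, F_hybrid], dim=1)
```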
Step S7: the adaptive prototype aggregation network constructed in steps S2 to S6 is trained with the training data constructed in step S1. The total loss L of the model is calculated from the joint triplet loss L_tri, the intermediate classification loss L_mid and the final classification loss L_final; back-propagation is performed and a stochastic gradient descent algorithm is adopted to optimize the model parameters. During training, the parameters of the entire backbone network are frozen and not updated.
The joint triplet loss L_tri is used to optimize the adaptive prototype aggregation module and is jointly composed of the triplet loss L_tri^s of the support branch and the triplet loss L_tri^q of the query branch.
Step S7-1: the triplet loss L_tri^s of the support branch takes the enhanced support prototype p̂_s as the anchor. With the help of the support mask, the negative sample p_s^− is the average of the background pixels of the support feature F_s; for the positive sample p_s^+, the invention selects the hardest positive sample, i.e., the foreground pixel at the largest l2 distance from the anchor p̂_s.
The triplet loss L_tri^s of the support branch is calculated as:

L_tri^s = max( ‖p̂_s − p_s^+‖_2 − ‖p̂_s − p_s^−‖_2 + m, 0 ),

where m is the margin.
step S7-2: triplet loss of query branchesLikewise in the enhanced query prototype +.>Is an anchor point. But unlike supporting branching, the present invention uses a priori mask M of the query image pri Aggregating query features F by setting different thresholds q Obtain negative sample->And difficult positive sample->The specific implementation is expressed as follows:
where the interval [0, 0.4] controls the screening range of the negative sample and (0.4, 0.55] controls the screening range of the hard positive sample. Accordingly, the triplet loss L_tri^q of the query branch is calculated as:

L_tri^q = max( ‖p̂_q − p_q^+‖_2 − ‖p̂_q − p_q^−‖_2 + m, 0 ).
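A minimal sketch of the two triplet losses; the margin value and the averaging of the threshold bands are assumptions consistent with the description above.

```python
# Support/query triplet losses of step S7 (margin m = 0.5 assumed).
import torch
import torch.nn.functional as F

def triplet(anchor, pos, neg, margin=0.5):
    """Standard triplet loss on l2 distances; all inputs are (B, c)."""
    return F.relu((anchor - pos).norm(dim=-1)
                  - (anchor - neg).norm(dim=-1) + margin).mean()

def query_triplet_samples(F_q, M_pri, eps=1e-7):
    """F_q: (B, c, h, w); M_pri: (B, 1, h, w). Returns (negative, hard positive)."""
    neg_band = ((M_pri >= 0.0) & (M_pri <= 0.4)).float()
    pos_band = ((M_pri > 0.4) & (M_pri <= 0.55)).float()
    neg = (F_q * neg_band).sum((2, 3)) / (neg_band.sum((2, 3)) + eps)
    pos = (F_q * pos_band).sum((2, 3)) / (pos_band.sum((2, 3)) + eps)
    return neg, pos
```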
Finally, the joint triplet loss L_tri is obtained by adding L_tri^s and L_tri^q.
Step S7-3: L_mid and L_final use the cross-entropy function to calculate the losses between the query mask ground truth and, respectively, the intermediate prediction results and the final prediction result generated by the decoder. They are calculated as follows:

L_mid = (1/4) Σ_{i=1}^{4} CE( M̂_q^{(i)}, M_q ),  L_final = CE( M̂_q, M_q ),

where CE(·,·) is the cross-entropy loss and M_q is the ground-truth query mask.
Further, the total loss L is calculated as:

L = L_tri + L_mid + L_final.
step S8: and S7, loading a network weight file trained in the step S7, and estimating the performance of the model by using the test data set constructed in the step S1.
To verify the validity of the method, this embodiment carries out a 4-fold cross-validation comparison on the PASCAL-5^i dataset against several classical and state-of-the-art small sample image segmentation methods.
1. Test details
A. Testing hardware and software environments
All experiments in this example were performed on a computer running Ubuntu 20.04 LTS, using 4 NVIDIA Tesla V100-32GB graphics cards. The software environment includes Python 3.8, PyTorch 1.12, etc.
B. Introduction to data set
The dataset is PASCAL-5^i, created from the PASCAL VOC 2012 dataset and the additional annotation masks of the SBD dataset. PASCAL-5^i contains 20 object classes; the specific class information is shown in Table 1. In the experiments, the 20 object classes are divided equally into 4 folds of 5 classes each for the 4-fold cross experiments: 3 folds are selected during training and the remaining fold is used to test the generalization performance of the model on unseen classes.
TABLE 1
Fold    Object classes
Fold-0  Aeroplane, bicycle, bird, boat, bottle
Fold-1  Bus, car, cat, chair, cow
Fold-2  Dining table, dog, horse, motorbike, person
Fold-3  Potted plant, sheep, sofa, train, TV/monitor
C. Model training details
The backbone network for visual feature extraction is a ResNet50 pre-trained on ImageNet, with its parameters frozen. The dataset is augmented with random cropping, scaling, rotation, blurring and flipping, and images are uniformly cropped to 473×473. Class text embeddings are obtained from a pre-trained Word2Vec text encoder, with dimension d = 300. In the experiments, the adaptive prototype aggregation network serves as the meta learner and PSPNet as the base learner. The model is trained for 200 epochs with batch size 8; the SGD optimizer is used with an initial learning rate of 0.005, the learning rate is decayed with the 'poly' strategy, and the exponential factor power is set to 0.9.
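The 'poly' decay can be sketched as follows (the per-iteration update shown is the usual convention and is an assumption here):

```python
# 'poly' learning-rate decay: lr = base_lr * (1 - iter/max_iter) ** power.
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# usage with a PyTorch optimizer:
# for g in optimizer.param_groups:
#     g["lr"] = poly_lr(0.005, cur_iter, max_iter)
```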
D. Model test details
The class mean intersection-over-union (mIoU) is adopted as the metric of model performance. For stability of the results, 5 rounds of testing are performed with 5 different random seeds, and the average is taken as the reported value. In each test, 1000 episodes are randomly sampled from the dataset. Besides the mIoU of each fold, the average mIoU over all classes of the dataset is reported as an overall evaluation.
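The mIoU metric itself follows the standard definition, sketched below:

```python
# Class-wise intersection/union accumulation and mIoU (standard definition;
# the exact per-episode bookkeeping is not specified by the patent).
import torch

def update_iou(inter, union, pred, gt, num_classes):
    """pred, gt: (H, W) integer label maps; inter, union: (num_classes,)."""
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter[c] += (p & g).sum()
        union[c] += (p | g).sum()

def miou(inter, union, eps=1e-10):
    return (inter.float() / (union.float() + eps)).mean().item()
```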
2. Experimental results
TABLE 2: Experimental results on the PASCAL-5^i dataset
As shown in Table 2, 4-fold cross-validation is performed in the 1-shot scenario on the adaptive prototype aggregation network model trained on the PASCAL-5^i dataset, where bold indicates the best result for a metric and underline the second-best. Compared with previous small sample image segmentation models, the adaptive prototype aggregation network achieves the best or second-best mIoU on 3 of the folds, while the result on Fold-3 differs from the best by only 1.23. In terms of the average mIoU over the 20 classes, the enhanced prototype allows the adaptive prototype aggregation network to flexibly select, from the support, the query and the semantics, the information closest to the current position of the query feature for matching, obtaining the highest result of 69.12. Furthermore, relying on only a single enhanced prototype, the model's accuracy clearly surpasses methods based on dense correlation matching, exceeding HSNet by more than 5.1 points. In summary, the effectiveness of the invention in small sample image segmentation tasks is demonstrated.
The invention fuses information of the visual and semantic modalities to generate a prototype representation that guides the segmentation of the query image. Unlike traditional methods, the invention not only integrates the support prototype and the query prototype but also introduces the semantic vector of the target class, which helps solve the intra-class difference problem common in small sample tasks; by introducing semantic information, knowledge shared across instances of a category supplements the limited information between the support set and the query set.
In addition, the invention adopts a parameter-free prior mask generation method, and the generated prior mask localizes the spatial position of the query object more accurately. This not only helps aggregate the query prototype but also provides strong prior information for the subsequent matching process; finally, the visual correlation modeling module captures rich spatial details, improving the model's ability to segment image details.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (7)

1. A small sample image segmentation method based on an adaptive prototype aggregation network is characterized in that: the small sample image segmentation method specifically comprises the following steps:
step 1, acquiring a data set of small sample image segmentation, and dividing the data set into a training set and a testing set;
step 2, constructing an adaptive prototype aggregation network, which specifically comprises the following steps:
step 2.1, using a pre-trained backbone network as the feature extractor to extract the intermediate support feature set {F_s^i} of the support image and the intermediate query feature set {F_q^i} of the query image respectively;
step 2.2, constructing a prior mask generation module: the high-level feature F_s^h of the support image and the high-level feature F_q^h of the query image are selected from the support feature set and the query feature set of step 2.1 and input to the prior mask generation module, which ultimately outputs the prior mask M_pri ∈ R^{h×w×1} of the query image;
step 2.3, constructing an adaptive prototype aggregation module: the adaptive prototype aggregation module consists of three parts, namely prototype extraction, visual-text aggregation and prototype fusion; it takes the mid-level support feature F_s, the mid-level query feature F_q and the word embedding vector t ∈ R^{1×1×d} of the class as input and outputs the enhanced prototype representation p_aug ∈ R^{1×1×c}; the enhanced prototype is expanded to the same spatial size as the prior mask M_pri of step 2.2, denoted P_aug ∈ R^{h×w×c};
step 2.4, constructing a visual correlation generation module: cosine similarity is calculated pairwise between the intermediate support feature set {F_s^i} and the intermediate query feature set {F_q^i} extracted in step 2.1 to obtain correlation maps, and the obtained correlation maps are fused with a multi-scale structure to obtain the visual correlation feature F_hybrid ∈ R^{h×w×δ};
step 2.5, constructing a decoder for predicting the query mask: the prior mask M_pri, the enhanced prototype representation P_aug and the visual correlation feature F_hybrid serve as guide information for the mid-level query feature F_q and are spliced with F_q along the channel dimension before being input; after decoding, the decoder generates intermediate prediction results at 4 scales and the finally predicted query mask M̂_q;
step 3, training the adaptive prototype aggregation network constructed in step 2 with the training set of step 1: the total loss L of the adaptive prototype aggregation network is calculated from the joint triplet loss L_tri, the intermediate classification loss L_mid and the final classification loss L_final; back-propagation is performed and a stochastic gradient descent algorithm is adopted to optimize the parameters of the adaptive prototype aggregation network, the parameters of the entire backbone network being frozen and not updated during training;
step 4, loading the network weight file trained in step 3, and evaluating the performance of the model with the test set constructed in step 1.
2. The small sample image segmentation method based on the adaptive prototype aggregation network according to claim 1, wherein: the construction of the prior mask generation module in the step 2.2 comprises the following specific steps:
step 2.2.1, the high-level feature of the support image is first masked by the support mask M_s to shield the interference of irrelevant regions; 2D average pooling windows of different sizes are adopted to pool the region pixels of the support image and the query image respectively, and when performing the 2D average pooling all features are suitably padded to ensure the outputs have the same size; the whole procedure is described as follows:
R_s = AvgPool_{d_h×d_w}( F_s^h ⊙ I(M_s) ),  R_q = AvgPool_{d_h×d_w}( F_q^h ),

where h, w, c_h denote the height, width and channel dimension of the high-level features respectively, I(·) is a bilinear interpolation operation whose purpose is to transform M_s to the same spatial size as F_s^h, ⊙ is the Hadamard product, and AvgPool_{d_h×d_w} denotes average pooling with a d_h×d_w window;
step 2.2.2, using the attention-based cosine similarity to calculate the similarity of the region features of the support image and the query image, where the attention score, specifically the l2 norm of a region feature point, is used as a weighting score on the cosine similarity; the attention-based cosine similarity is expressed as follows:

S_r(j, k) = ‖r_s^k‖_2 · ( r_q^j · r_s^k ) / ( ‖r_q^j‖_2 · ‖r_s^k‖_2 ),
where r_q^j and r_s^k denote pixels of the region features R_q and R_s in space; position-by-position calculation yields the region-based similarity map S_r ∈ R^{hw×hw}, in which the first dimension represents the query dimension and the second dimension the support dimension; the mean of the second dimension of S_r is then taken and min-max normalization applied, the resulting tensor of size hw×1 being reshaped to the spatial size h×w; calculating the three region feature pairs gives three similarity maps {S_r^i};
Step 2.2.3 obtaining the prior mask M of the query image by averaging all the similarity maps pri
3. The small sample image segmentation method based on the adaptive prototype aggregation network according to claim 1, wherein: the prototype extraction in step 2.3 mainly extracts the support and query prototypes, using the support mask M_s to extract the support prototype p_s ∈ R^{1×1×c} from the mid-level support feature F_s through a masked average pooling operation, and using the prior mask M_pri to filter the foreground pixels of the mid-level query feature F_q and aggregate them into the query prototype p_q ∈ R^{1×1×c}; the extraction of the prototype vector p_s and the query prototype p_q is expressed as follows:

p_s = Σ_{x,y} F_s(x,y) · M_s(x,y) / Σ_{x,y} M_s(x,y),
p_q = Σ_{x,y} F_q(x,y) · 1[M_pri(x,y) > 0.7] / Σ_{x,y} 1[M_pri(x,y) > 0.7],
where 0.7 is the threshold on M_pri that controls the screening range of the query feature foreground.
4. The small sample image segmentation method based on the adaptive prototype aggregation network according to claim 1, wherein: the visual-text aggregation in step 2.3 comprises two branches, a support branch and a query branch, which realize the adaptive fusion of the support and query prototypes with the semantic word vector and obtain the enhanced support prototype p̂_s and the enhanced query prototype p̂_q.
5. The small sample image segmentation method based on the adaptive prototype aggregation network according to claim 4, wherein: the prototype fusion in step 2.3 performs a weighted fusion of the enhanced support prototype p̂_s and the enhanced query prototype p̂_q obtained in step 2.3 to obtain the enhanced prototype representation p_aug ∈ R^{1×1×c}; the specific fusion is:
p_aug = 0.5 p̂_q + 0.5 p̂_s.
6. The small sample image segmentation method based on the adaptive prototype aggregation network according to claim 1, wherein: the multi-scale structure in step 2.4 comprises three branches, each with a 1×1 convolution for fusing the correlation maps generated at the corresponding stage; the outputs of the third and second branches are added element-wise and passed through a cross-branch fusion module consisting of a 3×3 convolution with stride 1, a group normalization layer and a ReLU activation layer; the output of the cross-branch fusion module is added element-wise to the output of the first branch and output after passing through the cross-branch fusion module once more, the output visual correlation feature naturally fusing the context information of the three stages.
7. The small sample image segmentation method based on the adaptive prototype aggregation network according to claim 1, wherein: the joint triplet loss L_tri in step 3 is jointly composed of the triplet loss L_tri^s of the support branch and the triplet loss L_tri^q of the query branch; the calculation of the total loss L of the adaptive prototype aggregation network comprises the following steps:
step 3.1, the triplet loss L_tri^s of the support branch takes the enhanced support prototype p̂_s as the anchor; L_tri^s is calculated as:

L_tri^s = max( ‖p̂_s − p_s^+‖_2 − ‖p̂_s − p_s^−‖_2 + m, 0 ),
where p_s^− is the negative sample, the average over the background of the mid-level support feature F_s, p_s^+ is the positive sample, the foreground pixel at the largest l2 distance from the anchor p̂_s, and m is the margin;
step 3.2, the triplet loss L_tri^q of the query branch takes the enhanced query prototype p̂_q as the anchor; the prior mask M_pri of the query image is used with different thresholds to aggregate the mid-level query feature F_q into the negative sample p_q^− and the hard positive sample p_q^+, specifically:

p_q^− = avg{ F_q(x,y) : 0 ≤ M_pri(x,y) ≤ 0.4 },
p_q^+ = avg{ F_q(x,y) : 0.4 < M_pri(x,y) ≤ 0.55 },
where the interval [0, 0.4] controls the screening range of the negative sample and (0.4, 0.55] controls the screening range of the hard positive sample; the triplet loss L_tri^q of the query branch is calculated as:

L_tri^q = max( ‖p̂_q − p_q^+‖_2 − ‖p̂_q − p_q^−‖_2 + m, 0 );
step 3.3, the joint triplet loss L_tri is obtained by summing the triplet loss L_tri^s of the support branch and the triplet loss L_tri^q of the query branch;
Step 3.4, calculating the loss between the final prediction result and the query mask truth value by using the cross entropy function to calculate the intermediate prediction result generated by the decoderAnd->The calculation formula is as follows:
step 3.5, the total loss L is calculated as:

L = L_tri + L_mid + L_final.
CN202311715217.6A 2023-12-14 2023-12-14 Small sample image segmentation method based on self-adaptive prototype aggregation network Active CN117409206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311715217.6A CN117409206B (en) 2023-12-14 2023-12-14 Small sample image segmentation method based on self-adaptive prototype aggregation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311715217.6A CN117409206B (en) 2023-12-14 2023-12-14 Small sample image segmentation method based on self-adaptive prototype aggregation network

Publications (2)

Publication Number Publication Date
CN117409206A (en) 2024-01-16
CN117409206B (en) 2024-02-20

Family

ID=89487416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311715217.6A Active CN117409206B (en) 2023-12-14 2023-12-14 Small sample image segmentation method based on self-adaptive prototype aggregation network

Country Status (1)

Country Link
CN (1) CN117409206B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805368A (en) * 2023-07-13 2023-09-26 兰州大学 Feature separation and recombination-based small sample image semantic segmentation method
CN116824330A (en) * 2023-05-31 2023-09-29 哈尔滨工业大学 Small sample cross-domain target detection method based on deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824330A (en) * 2023-05-31 2023-09-29 哈尔滨工业大学 Small sample cross-domain target detection method based on deep learning
CN116805368A (en) * 2023-07-13 2023-09-26 兰州大学 Feature separation and recombination-based small sample image semantic segmentation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lightweight small-sample semantic segmentation network with pyramid prototype alignment; Jia Xibin et al.; Journal of Beijing University of Technology; 2021-05-10; pp. 455-462 *

Also Published As

Publication number Publication date
CN117409206A (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
CN110866140A (en) Image feature extraction model training method, image searching method and computer equipment
CN109598268A (en) A kind of RGB-D well-marked target detection method based on single flow depth degree network
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
Tian et al. Multi-scale hierarchical residual network for dense captioning
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
KR20210151773A (en) Target re-recognition method and apparatus, terminal and storage medium
CN114359603A (en) Self-adaptive unsupervised matching method in multi-mode remote sensing image field
CN114596566A (en) Text recognition method and related device
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance
CN115565019A (en) Single-channel high-resolution SAR image ground object classification method based on deep self-supervision generation countermeasure
CN115830535A (en) Method, system, equipment and medium for detecting accumulated water in peripheral area of transformer substation
Kavitha et al. Convolutional Neural Networks Based Video Reconstruction and Computation in Digital Twins.
Tian et al. Semantic segmentation of remote sensing image based on GAN and FCN network model
CN112906800B (en) Image group self-adaptive collaborative saliency detection method
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN117409206B (en) Small sample image segmentation method based on self-adaptive prototype aggregation network
CN116778164A (en) Semantic segmentation method for improving deep V < 3+ > network based on multi-scale structure
CN116665101A (en) Method for extracting key frames of monitoring video based on contourlet transformation
Li et al. A new algorithm of vehicle license plate location based on convolutional neural network
CN116109649A (en) 3D point cloud instance segmentation method based on semantic error correction
Hao et al. Research on image semantic segmentation based on FCN-VGG and pyramid pooling module
CN117036368A (en) Image data processing method, device, computer equipment and storage medium
CN113379001A (en) Processing method and device for image recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant