CN116109819A - Cascade instance segmentation method based on enhanced semantic segmentation head - Google Patents


Info

Publication number
CN116109819A
CN116109819A (application CN202210461048.7A)
Authority
CN
China
Prior art keywords
features, instance, segmentation, feature, semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210461048.7A
Other languages
Chinese (zh)
Inventor
苏荔
黄薛蓉
李国荣
卿来云
黄庆明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority claimed from application CN202210461048.7A
Publication of CN116109819A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks


Abstract

The invention discloses a cascade instance segmentation method based on an enhanced semantic segmentation head, which comprises the following steps: extracting multi-scale features of the image, and fusing the multi-scale features to obtain single-scale features; obtaining semantic segmentation features according to the single-scale features; and carrying out instance segmentation according to the semantic segmentation features and the multi-scale features to obtain the individual instances in the image, wherein the semantic segmentation features are obtained by fusing the outputs of a Transformer model and a convolutional network model. The cascade instance segmentation method based on the enhanced semantic segmentation head improves the discriminability of the semantic segmentation features and the accuracy of instance segmentation.

Description

Cascade instance segmentation method based on enhanced semantic segmentation head
Technical Field
The invention relates to a cascade instance segmentation method based on an enhanced semantic segmentation head, belonging to the technical field of computer vision.
Background
Instance segmentation refers to the pixel-by-pixel segmentation and classification of individual instances in an image, and is widely used in fields such as autonomous driving, medical image segmentation, and remote sensing image analysis.
Many instance segmentation methods exist, such as the HTC and DSC cascade instance segmentation methods. These typically rely on a single fully convolutional network, such as an FCN, to extract semantic segmentation features. However, this approach lacks global information, so existing methods suffer from incomplete instance segmentation, such as discontinuous segmentation inside an instance or missing segmentation at instance edges.
Therefore, it is necessary to develop an instance segmentation method that solves the above problems.
Disclosure of Invention
In order to overcome these problems, the inventors have conducted intensive research and designed a cascade instance segmentation method based on an enhanced semantic segmentation head, which is characterized by comprising the following steps:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
In a preferred embodiment, in S1, multi-scale features in the image are extracted by a feature extractor overlaying a feature pyramid.
In a preferred embodiment, in S1, the fusion of the multi-scale features is achieved as follows:
inputting the multi-scale features extracted by the feature extractor into a feature pyramid, setting 1x1 convolution after each scale feature of the feature pyramid, performing up-sampling operation on the high-level features, performing down-sampling operation on the low-level features, enabling all the features output by the feature pyramid to be fixed into a uniform scale, and then fusing the features of the uniform scale to obtain single-scale features.
In a preferred embodiment, in S2, the single-scale features are input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs semantic segmentation features.
The enhanced semantic segmentation head comprises a segmentation model, a Transformer model, a convolutional network model and a convolutional layer,
wherein the segmentation model is used for segmenting the input single-scale feature into a plurality of blocks and inputting each segmentation block into the Transformer model,
the Transformer model generates a global context feature x_g from the input single-scale feature segmentation blocks,
the convolutional network model generates a spatial context feature x_s from the input single-scale feature, and
the global context feature x_g and the spatial context feature x_s are fused and then passed through a convolutional layer to generate the semantic segmentation features.
In a preferred embodiment, in S2, the convolutional network model is an FCN.
In a preferred embodiment, in S3, a single instance in the image is represented by a bounding box and an instance mask. The instance segmentation is implemented by a cascade predictor, which is a multi-stage paradigm structure that uses the output of the previous stage to train the bounding box b_t and the instance mask m_t of the current stage. This can be expressed as:

x_t^b = P(F, b_{t-1}) + P(x_en, b_{t-1})
b_t = B_t(x_t^b)
x_t^m = P(F, b_t) + P(x_en, b_t)
m_t = M_t(x_t^m)

where F represents the multi-scale features, x_en represents the semantic segmentation features, t denotes the stage, x_t^b denotes the bounding-box feature of stage t, x_t^m denotes the instance-mask feature of stage t, P(·) denotes the pooling function, B_t denotes the bounding-box predictor of stage t, M_t denotes the instance-mask predictor of stage t, b_t denotes the bounding box of stage t, and m_t denotes the instance mask of stage t.
In a preferred embodiment, a classification-supervised training process is added when training the enhanced semantic segmentation head and the cascade predictor, where classification-supervised training refers to multi-label training that takes the categories of all instances in the image as the supervision target.
In a preferred embodiment, in the classification-supervised training, the loss function is set to:

L = Σ_{t=1}^{T} λ_t (L_bbox^t + L_mask^t) + α L_seg + β L_cls

where L_seg is the semantic segmentation loss and L_cls is the multi-label classification loss; t denotes a stage of the cascade predictor and T is the total number of stages; L_bbox^t is the cross-entropy loss of the bounding box at stage t of the cascade predictor; L_mask^t is the cross-entropy loss of the instance mask at stage t of the cascade predictor; α and β are weight coefficients; and λ_t is the training weight of stage t.
In addition, the invention also provides electronic equipment, which comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
Furthermore, the invention provides a computer readable storage medium storing computer instructions for causing the computer to execute the method.
The invention has the beneficial effects that:
(1) The cascade instance segmentation method based on the enhanced semantic segmentation head provided by the invention exploits the global modeling capability of the Transformer to improve the discriminability of the semantic segmentation features; it can be integrated with most existing cascade instance segmentation methods and improves their performance;
(2) The cascade instance segmentation method based on the enhanced semantic segmentation head is easy to configure in other cascade instance networks, adding only a small number of parameters and a small amount of computation.
Drawings
FIG. 1 is a flow chart of a cascading example segmentation method based on an enhanced semantic segmentation header according to one preferred embodiment of the present invention;
FIG. 2 is a flow diagram of merging multi-scale features in a cascade example segmentation method based on enhanced semantic segmentation heads according to a preferred embodiment of the present invention.
Detailed Description
The invention is further described in detail below by means of the figures and examples. The features and advantages of the present invention will become more apparent from the description.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The cascade instance segmentation method based on the enhanced semantic segmentation head provided by the invention is characterized by comprising the following steps as shown in fig. 1:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
In S1, the multi-scale features in the image are extracted by a feature extractor with a superimposed feature pyramid.
In the present invention, the specific structure of the feature extractor is not limited, and any feature extractor for instance segmentation may be used; for example, the common ResNet-50 or ResNet-101, or the more advanced Swin Transformer, may be used.
The feature pyramid (Feature Pyramid Network) is a basic component commonly used for detecting objects at different scales; its structure is not described in detail in the invention.
Preferably, based on experience, the output of the feature extractor with the superimposed feature pyramid is set to 5 layers of multi-scale features, whose strides relative to the original image are 2, 4, 8, 16 and 32, respectively.
Further, the fusion of the multi-scale features is achieved as follows, as shown in FIG. 2:
the multi-scale features are input into the feature pyramid, a 1x1 convolution is set after each scale feature of the feature pyramid, an up-sampling operation is performed on high-level features and a down-sampling operation on low-level features so that all features output by the feature pyramid are fixed to a uniform scale, and the features of the uniform scale are then fused to obtain the single-scale feature, which can be expressed as:

F = Σ_i Emb_i(Samp_i(P_i))

where F represents the single-scale feature obtained after fusion, P_i represents the multi-scale feature input to the feature pyramid, the subscript i denotes a layer of the feature pyramid, Emb_i(·) represents the embedding function of the multi-scale feature P_i, and Samp_i(·) represents the sampling function of the multi-scale feature P_i.
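For illustration only, the fusion step above can be sketched in NumPy as follows. All function and variable names are our own illustrative choices; nearest-neighbour resampling stands in for the unspecified sampling function Samp_i(·), and a per-pixel channel mapping plays the role of the 1x1-convolution embedding Emb_i(·):

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in); a 1x1 convolution is a
    # per-pixel linear map over channels
    return np.einsum('oc,chw->ohw', w, x)

def resample(x, target_hw):
    # nearest-neighbour up/down-sampling to the unified scale
    C, H, W = x.shape
    th, tw = target_hw
    rows = np.arange(th) * H // th
    cols = np.arange(tw) * W // tw
    return x[:, rows][:, :, cols]

def fuse_pyramid(levels, weights, target_hw):
    # F = sum_i Emb_i(Samp_i(P_i)): resample each pyramid level to the
    # unified scale, embed it with its own 1x1 convolution, then sum
    out = np.zeros((weights[0].shape[0], *target_hw))
    for P_i, w_i in zip(levels, weights):
        out += conv1x1(resample(P_i, target_hw), w_i)
    return out
```

In practice the embedding weights would be learned; here they are random placeholders used only to check shapes.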
According to the present invention, in S2, the single-scale feature is input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs semantic segmentation features.
The enhanced semantic segmentation head comprises a segmentation model, a Transformer model, a convolutional network model and a convolutional layer,
wherein the segmentation model is used for segmenting the input single-scale feature into a plurality of blocks to obtain single-scale feature segmentation blocks, and each single-scale feature segmentation block is input into the Transformer model,
the Transformer model generates a global context feature x_g from the input single-scale feature segmentation blocks,
the convolutional network model generates a spatial context feature x_s from the input single-scale feature, and
the global context feature x_g and the spatial context feature x_s are fused and then passed through a convolutional layer to generate the semantic segmentation features.
Preferably, the segmentation model segments the single-scale feature F into S blocks, and the resulting feature blocks are fed into the Transformer model.
The Transformer model is a model that uses the attention mechanism to increase model training speed; for its specific structure, see the paper: Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need [J]. arXiv, 2017. The self-attention mechanism in the Transformer is calculated as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K and V are the query, key and value of the input feature, respectively, and d_k represents the dimension of the query and key. The output of the Transformer model is a global context feature x_g ∈ R^{S×S×C}. The invention generates the global context feature through the Transformer model, so that the neural network pays more attention to the internal regions and main parts of each instance, which helps to recognize the instance as a whole.
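As an illustrative sketch of the scaled dot-product self-attention computation above (single-head, no learned biases, NumPy only; this is not the patented implementation, and the projection matrices are placeholders):

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (S, C) -- S feature blocks, each treated as a C-dimensional token
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (S, S) attention weights
    return A @ V                         # aggregated global context
```

Each output token is a weighted sum of all value vectors, which is what gives the head its global receptive field over the S blocks.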
The convolutional network model can be any known convolutional model, such as R-CNN or ResNet. Preferably, the convolutional network model is a fully convolutional network (FCN), more preferably a fully convolutional network formed by four consecutive convolutions. Through the fully convolutional network, the semantic gaps between features of different scales in the feature pyramid can be better eliminated and the spatial context information encoded, so that the network focuses more on the detailed parts of the image, thereby generating the spatial context feature.
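As a purely illustrative sketch of a spatial-context branch of four consecutive convolutions, one might write the following in NumPy; the 3x3 kernel size, zero padding, and ReLU activation are assumptions not fixed by the text:

```python
import numpy as np

def conv3x3(x, w):
    # x: (C_in, H, W), w: (C_out, C_in, 3, 3); zero padding keeps H, W.
    # Implemented as a sum of 9 shifted 1x1 maps (cross-correlation).
    C_in, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i:i + H, j:j + W])
    return out

def fcn_branch(x, weights):
    # four consecutive 3x3 convolutions with ReLU, producing a
    # spatial-context feature map the same size as the input
    for w in weights:
        x = np.maximum(conv3x3(x, w), 0.0)
    return x
```

The branch preserves the spatial resolution of its input, which is why it can encode fine detail complementary to the Transformer's global tokens.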
The global context feature x_g and the spatial context feature x_s are fused and then passed through a 1x1 convolutional layer to generate the semantic segmentation features, which can be expressed as:

x_en = Emb(Up(x_g) + x_s)

where Up(·) and Emb(·) represent the up-sampling function and the embedding function, respectively.
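The fusion x_en = Emb(Up(x_g) + x_s) can be sketched as follows. Nearest-neighbour up-sampling, a 1x1 convolution written as a channel-wise linear map, and the assumption that the Transformer tokens have already been reshaped into a coarse spatial grid are all illustrative choices:

```python
import numpy as np

def enhance(x_g, x_s, w_emb):
    # x_g: (C, h, w) coarse global-context map (reshaped Transformer output)
    # x_s: (C, H, W) spatial-context map from the FCN branch; H, W are
    #      integer multiples of h, w
    # w_emb: (C_out, C) weights of the 1x1 embedding convolution
    C, H, W = x_s.shape
    up = np.repeat(np.repeat(x_g, H // x_g.shape[1], axis=1),
                   W // x_g.shape[2], axis=2)   # Up(x_g)
    fused = up + x_s                            # element-wise fusion
    return np.einsum('oc,chw->ohw', w_emb, fused)  # Emb(...)
```

The additive fusion lets every spatial position see both its local detail and the instance-level global context.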
In the invention, the semantic segmentation features contain not only spatial context but also global context, so segmentation at the current position is guided by contextual information. This makes full use of the Transformer's global modeling capability and improves the discriminability of the semantic segmentation features: the enhanced semantic segmentation features, which fuse both kinds of features, attend both to the details of the image and to the instance as a whole.
In S3, the individual instances in the image are represented by bounding boxes and instance masks, which may be obtained using any known instance segmentation predictor, e.g., the predictors designed in HTC or DSC.
In a preferred embodiment, the instance segmentation is implemented by a cascade predictor, which is a multi-stage paradigm structure that uses the output of the previous stage to train the bounding box b_t and the instance mask m_t of the current stage. This can be expressed as:

x_t^b = P(F, b_{t-1}) + P(x_en, b_{t-1})
b_t = B_t(x_t^b)
x_t^m = P(F, b_t) + P(x_en, b_t)
m_t = M_t(x_t^m)

where F represents the multi-scale features, x_en represents the semantic segmentation features, t denotes the stage, x_t^b denotes the bounding-box feature of stage t, x_t^m denotes the instance-mask feature of stage t, P(·) denotes the pooling function, B_t denotes the bounding-box predictor of stage t, M_t denotes the instance-mask predictor of stage t, b_t denotes the bounding box of stage t, and m_t denotes the instance mask of stage t.
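The stage-by-stage refinement described above can be sketched as a plain loop. The pooling function and the per-stage predictors are passed in as placeholders, since their internal structure is not fixed here; in the sketch below the mask branch is assumed to pool on the current stage's refined box:

```python
def cascade(F, x_en, b0, box_stages, mask_stages, pool):
    # F: multi-scale features; x_en: semantic segmentation features
    # b0: initial box proposal; pool: placeholder for P(feature, box)
    # box_stages / mask_stages: per-stage predictors B_t and M_t
    b, outputs = b0, []
    for B_t, M_t in zip(box_stages, mask_stages):
        x_b = pool(F, b) + pool(x_en, b)   # x_t^b from previous box
        b = B_t(x_b)                       # b_t = B_t(x_t^b)
        x_m = pool(F, b) + pool(x_en, b)   # x_t^m from refined box
        m = M_t(x_m)                       # m_t = M_t(x_t^m)
        outputs.append((b, m))
    return outputs
```

With toy scalar stand-ins (e.g. pool = lambda feat, box: feat * box), each stage visibly consumes the previous stage's box, which is the essence of the cascade.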
In a preferred embodiment, when training the enhanced semantic segmentation head and the cascade predictor, a classification-supervised training process is added, where classification-supervised training refers to multi-label training that takes the categories of all instances in the image as the supervision target, so as to better train the Transformer model and enable it to learn more semantic information.
More preferably, in the classification-supervised training, the loss function is set to:

L = Σ_{t=1}^{T} λ_t (L_bbox^t + L_mask^t) + α L_seg + β L_cls

where L_seg is the semantic segmentation loss and L_cls is the multi-label classification loss; t denotes a stage of the cascade predictor and T is the total number of stages; L_bbox^t is the cross-entropy loss of the bounding box at stage t of the cascade predictor; L_mask^t is the cross-entropy loss of the instance mask at stage t of the cascade predictor; α and β are weight coefficients; and λ_t is the training weight of stage t.
Further, α characterizes the semantic segmentation loss weight, and β characterizes the multi-label classification loss weight. In a preferred embodiment, α = 0.2 and β = 3.
In a preferred embodiment, 3 stages are set in total in the classification-supervised training, and the training weights of the stages are set to λ = [1, 0.5, 0.25].
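Assuming the weighted losses combine additively as described (per-stage box and mask losses weighted by λ_t, plus α times the semantic segmentation loss and β times the multi-label classification loss), the total loss can be sketched as:

```python
def total_loss(stage_box_losses, stage_mask_losses, seg_loss, cls_loss,
               lambdas=(1.0, 0.5, 0.25), alpha=0.2, beta=3.0):
    # L = sum_t lambda_t * (L_bbox^t + L_mask^t) + alpha*L_seg + beta*L_cls
    cascade_term = sum(l * (lb + lm) for l, lb, lm in
                       zip(lambdas, stage_box_losses, stage_mask_losses))
    return cascade_term + alpha * seg_loss + beta * cls_loss
```

The default weights are the preferred values stated above (α = 0.2, β = 3, λ = [1, 0.5, 0.25]); the function signature itself is an illustrative choice.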
The various embodiments of the methods described above in this invention may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the methods and apparatus described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The methods and apparatus described herein may be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired result of the technical solution of the present disclosure is achieved, and the present disclosure is not limited herein.
Examples
Example 1
An instance segmentation experiment is carried out on the MS COCO 2017 and MS COCO-Stuff datasets, which are general datasets for the instance segmentation task. MS COCO refers to the dataset of "Microsoft COCO: Common Objects in Context" (European Conference on Computer Vision, ECCV). The dataset comprises 118K training images and 5K validation images, with 80 instance categories in total, and COCO-Stuff provides the semantic segmentation labels corresponding to MS COCO.
Example segmentation is performed by:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
In S1, the feature extractors are ResNet-50 and ResNet-101, respectively; the output of the feature extractor with the superimposed feature pyramid is set to 5 layers of multi-scale features, and the strides of these 5 layers of multi-scale features relative to the original image are 2, 4, 8, 16 and 32, respectively.
The fused multi-scale feature may be represented as:

F = Σ_i Emb_i(Samp_i(P_i))
In S2, the single-scale feature is input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs semantic segmentation features.
The enhanced semantic segmentation head comprises a segmentation model, a Transformer model, a convolutional network model and a convolutional layer, wherein the convolutional network model is a fully convolutional network formed by four consecutive convolutions.
In S3, the cascade predictor is a multi-stage paradigm structure that uses the output of the previous stage to train the bounding box b_t and the instance mask m_t of the current stage, expressed as:

x_t^b = P(F, b_{t-1}) + P(x_en, b_{t-1})
b_t = B_t(x_t^b)
x_t^m = P(F, b_t) + P(x_en, b_t)
m_t = M_t(x_t^m)
when training the enhanced semantic segmentation head and the cascade predictor, adding a classification supervision training process, wherein a loss function is set as follows:
Figure BDA0003622228490000125
where α=0.2, β=3, λ= [1,0.5,0.25].
Comparative example
Comparative example 1
Instance segmentation was performed using the same dataset as Example 1, except that the HTC method was used, where HTC is described in the literature: Chen, Kai, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy and Dahua Lin. "Hybrid Task Cascade for Instance Segmentation." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019): 4969-4978; the feature extractor likewise employs ResNet-50 and ResNet-101.
Comparative example 2
Instance segmentation was performed using the same dataset as Example 1, except that the DSC method was used, which is described in the literature: Ding, Hao, Siyuan Qiao, Alan Loddon Yuille and Wei Shen. "Deeply Shape-guided Cascade for Instance Segmentation." 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 8274-8284; the feature extractor likewise employs ResNet-50 and ResNet-101.
Experimental example
The average precision (AP) is used as the performance evaluation index; it calculates the average precision over all categories and all IoU thresholds. For bounding boxes and instance masks, AP can be divided into the bounding-box AP (AP^b) and the instance-mask AP (AP^m); for the instance-mask AP, the APs at different IoU thresholds are AP_50 and AP_75, and the APs for instances of different sizes are AP_S, AP_M and AP_L.
The results of Example 1 and Comparative Examples 1 and 2 are shown in Table 1.
TABLE 1
(Table 1 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 1, Example 1 outperforms the other instance segmentation methods.
In the description of the present invention, it should be noted that the positional or positional relationship indicated by the terms such as "upper", "lower", "inner", "outer", "front", "rear", etc. are based on the positional or positional relationship in the operation state of the present invention, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," "third," "fourth," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected in common; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The invention has been described above in connection with preferred embodiments, which are, however, exemplary only and for illustrative purposes. On this basis, the invention can be subjected to various substitutions and improvements, and all fall within the protection scope of the invention.

Claims (10)

1. The cascade instance segmentation method based on the enhanced semantic segmentation head is characterized by comprising the following steps of:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
2. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S1, the multi-scale features of the image are extracted by a feature extractor stacked with a feature pyramid.
3. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S1, the fusing of the multi-scale features is achieved by:
inputting the multi-scale features extracted by the feature extractor into a feature pyramid, providing a 1×1 convolution after each scale of the feature pyramid, performing an up-sampling operation on the high-level features and a down-sampling operation on the low-level features so that all features output by the feature pyramid are fixed to a uniform scale, and then fusing the features of the uniform scale to obtain the single-scale feature.
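The fusion described in claim 3 can be illustrated with a minimal NumPy sketch. The function names, the nearest-neighbour resize, and the matrices standing in for the per-level 1×1 convolutions are illustrative assumptions, not the patent's actual implementation; a 1×1 convolution is modelled as a per-pixel linear map over channels, and up-/down-sampling to the uniform scale is modelled as nearest-neighbour resizing:

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map; stands in for
    the up-sampling (high-level) and down-sampling (low-level) operations."""
    c, h, w = feat.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return feat[:, ys][:, :, xs]

def fuse_pyramid(features, conv_weights, out_hw):
    """Fuse multi-scale pyramid features into one single-scale feature.

    features:     list of (C_i, H_i, W_i) arrays from the feature pyramid
    conv_weights: list of (C_out, C_i) matrices playing the role of the
                  per-level 1x1 convolutions (hypothetical placeholders)
    out_hw:       (H, W) of the uniform scale
    """
    out_h, out_w = out_hw
    fused = None
    for feat, w in zip(features, conv_weights):
        # a 1x1 convolution is a per-pixel linear map over the channel axis
        projected = np.einsum('oc,chw->ohw', w, feat)
        # every level is resized to the same uniform scale
        resized = resize_nearest(projected, out_h, out_w)
        # element-wise summation as a simple fusion rule
        fused = resized if fused is None else fused + resized
    return fused
```

Summation is used here as the simplest fusion rule; the patent leaves the exact fusion operator to the implementation.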
4. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S2, the single-scale feature is input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs the semantic segmentation features;
the enhanced semantic segmentation head comprises a splitting model, a Transformer model, a convolutional network model, and a convolution layer,
wherein the splitting model is used for splitting the input single-scale feature into a plurality of blocks and inputting each block into the Transformer model,
the Transformer model generates global context features x_g from the input single-scale feature blocks,
the convolutional network model generates spatial context features x_s from the input single-scale feature,
and after the global context features x_g and the spatial context features x_s are fused, the semantic segmentation features are generated through the convolution layer.
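The two-branch head of claim 4 can be sketched in NumPy under stated assumptions: a toy single-head self-attention over flattened patch tokens stands in for the Transformer branch (global context x_g), a 3×3 mean filter stands in for the FCN branch (spatial context x_s), fusion is element-wise addition, and all weights are random placeholders rather than learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enhanced_seg_head(x, patch=4, seed=0):
    """Sketch of the enhanced semantic segmentation head (claim 4).

    x: (C, H, W) single-scale feature, with H and W divisible by `patch`.
    All weight matrices are random placeholders, not learned parameters.
    """
    rng = np.random.default_rng(seed)
    c, h, w = x.shape
    ph, pw = h // patch, w // patch
    # --- split into patch x patch blocks, flatten each into a token ---
    tokens = (x.reshape(c, ph, patch, pw, patch)
                .transpose(1, 3, 0, 2, 4)
                .reshape(ph * pw, c * patch * patch))
    d = tokens.shape[1]
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # --- toy single-head self-attention: the Transformer branch -------
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    out = attn @ v                          # global context per token
    x_g = (out.reshape(ph, pw, c, patch, patch)
              .transpose(2, 0, 3, 1, 4)
              .reshape(c, h, w))
    # --- 3x3 mean filter: a stand-in for the FCN spatial branch -------
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode='edge')
    x_s = sum(pad[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    # --- fuse and apply the final 1x1 convolution ---------------------
    w_out = rng.standard_normal((c, c)) / np.sqrt(c)
    return np.einsum('oc,chw->ohw', w_out, x_g + x_s)
```

The block only demonstrates the data flow (split → attention → fuse → 1×1 conv); positional encodings, multiple heads, and the multi-layer FCN of a real implementation are omitted.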
5. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 4, characterized in that,
in S2, the convolutional network model is an FCN.
6. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S3, the single instance in the image is represented by a bounding box and an instance mask, and the instance segmentation is implemented by a cascade predictor; the cascade predictor has a multi-stage structure in which the output of the previous stage is used to train the bounding box b_t and the instance mask m_t of the current stage, which can be expressed as:

x_t^box = P(F, x_en, b_{t-1})
b_t = B_t(x_t^box)
x_t^mask = P(F, x_en, b_t)
m_t = M_t(x_t^mask)

wherein F represents the multi-scale features, x_en represents the semantic segmentation features, t represents the different stages, x_t^box represents the bounding-box feature of stage t, x_t^mask represents the instance-mask feature of stage t, P(·) represents the pooling function, B_t represents the bounding-box predictor of stage t, M_t represents the instance-mask predictor of stage t, b_t represents the bounding box of stage t, and m_t represents the instance mask of stage t.
7. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 6, characterized in that,
a classification supervision training process is added when training the enhanced semantic segmentation head and the cascade predictor, wherein classification supervision training refers to multi-label training that takes the categories of all the instances in the image as the supervision objects.
8. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 7, characterized in that,
in the classification supervision training, the loss function is set as:

L = α·L_seg + β·L_cls + Σ_{t=1}^{T} λ_t·(L_box^t + L_mask^t)

wherein L_seg is the semantic segmentation loss, L_cls is the multi-label classification loss, t represents the different stages of the cascade predictor, T is the total number of stages, L_box^t is the cross-entropy loss of the bounding box of stage t of the cascade predictor, L_mask^t is the cross-entropy loss of the instance mask of stage t of the cascade predictor, α and β are weight coefficients, and λ_t are the training weights of the different stages.
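The claim-8 objective is a weighted sum that can be computed directly once the per-term losses are available. The sketch below assumes binary cross-entropy for the segmentation and multi-label classification terms and takes the per-stage box and mask losses as precomputed scalars; the function and argument names are illustrative:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy averaged over all entries."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def total_loss(seg_pred, seg_gt, cls_pred, cls_gt,
               box_losses, mask_losses, alpha, beta, lambdas):
    """Claim-8 objective:
        L = alpha*L_seg + beta*L_cls + sum_t lambda_t*(L_box^t + L_mask^t)

    seg_pred/seg_gt: semantic segmentation probabilities and labels
    cls_pred/cls_gt: multi-label image-level class probabilities and labels
    box_losses, mask_losses: per-stage losses from the cascade predictor
    """
    l_seg = bce(seg_pred, seg_gt)
    l_cls = bce(cls_pred, cls_gt)
    l_stages = sum(lam * (lb + lm)
                   for lam, lb, lm in zip(lambdas, box_losses, mask_losses))
    return alpha * l_seg + beta * l_cls + l_stages
```

The patent does not fix the values of α, β, or λ_t; they are free weight hyperparameters balancing the auxiliary supervision against the per-stage cascade losses.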
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202210461048.7A 2022-04-28 2022-04-28 Cascade instance segmentation method based on enhanced semantic segmentation head Pending CN116109819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210461048.7A CN116109819A (en) 2022-04-28 2022-04-28 Cascade instance segmentation method based on enhanced semantic segmentation head

Publications (1)

Publication Number Publication Date
CN116109819A true CN116109819A (en) 2023-05-12

Family

ID=86256743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210461048.7A Pending CN116109819A (en) 2022-04-28 2022-04-28 Cascade instance segmentation method based on enhanced semantic segmentation head

Country Status (1)

Country Link
CN (1) CN116109819A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593716A (en) * 2023-12-07 2024-02-23 山东大学 Lane line identification method and system based on unmanned aerial vehicle inspection image

Similar Documents

Publication Publication Date Title
CN109117876B (en) Dense small target detection model construction method, dense small target detection model and dense small target detection method
US20190156144A1 (en) Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
CN112633276B (en) Training method, recognition method, device, equipment and medium
WO2022227770A1 (en) Method for training target object detection model, target object detection method, and device
EP3204871A1 (en) Generic object detection in images
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
WO2019167784A1 (en) Position specifying device, position specifying method, and computer program
CN112488999A (en) Method, system, storage medium and terminal for detecting small target in image
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN113724286A (en) Method and device for detecting saliency target and computer-readable storage medium
CN114898111B (en) Pre-training model generation method and device, and target detection method and device
Dong et al. Learning regional purity for instance segmentation on 3d point clouds
CN114267375B (en) Phoneme detection method and device, training method and device, equipment and medium
CN115861400A (en) Target object detection method, training method and device and electronic equipment
CN116109819A (en) Cascade instance segmentation method based on enhanced semantic segmentation head
Jayanthiladevi et al. Text, images, and video analytics for fog computing
CN116246287B (en) Target object recognition method, training device and storage medium
JPH11250106A (en) Method for automatically retrieving registered trademark through the use of video information of content substrate
CN104504162A (en) Video retrieval method based on robot vision platform
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
CN114863450B (en) Image processing method, device, electronic equipment and storage medium
CN116109874A (en) Detection method, detection device, electronic equipment and storage medium
CN114282583A (en) Image classification model training and classification method and device, road side equipment and cloud control platform
CN113344121A (en) Method for training signboard classification model and signboard classification
Kong et al. A doubt–confirmation-based visual detection method for foreign object debris aided by assembly models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination