CN116109819A - Cascade instance segmentation method based on enhanced semantic segmentation head - Google Patents


Info

Publication number
CN116109819A
CN116109819A (application CN202210461048.7A)
Authority
CN
China
Prior art keywords
features, instance, segmentation, feature, semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210461048.7A
Other languages
Chinese (zh)
Inventor
苏荔
黄薛蓉
李国荣
卿来云
黄庆明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority claimed from application CN202210461048.7A
Publication of CN116109819A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks


Abstract

The invention discloses a cascade instance segmentation method based on an enhanced semantic segmentation head, which comprises the following steps: extracting multi-scale features of the image, and fusing the multi-scale features to obtain single-scale features; obtaining semantic segmentation features according to the single-scale features; and carrying out instance segmentation according to the semantic segmentation features and the multi-scale features to obtain the individual instances in the image, wherein the semantic segmentation features are obtained by fusing the outputs of a Transformer model and a convolutional network model. The cascade instance segmentation method based on the enhanced semantic segmentation head improves the discriminability of the semantic segmentation features and the accuracy of instance segmentation.

Description

Cascade instance segmentation method based on enhanced semantic segmentation head
Technical Field
The invention relates to a cascade instance segmentation method based on an enhanced semantic segmentation head, belonging to the technical field of computer vision.
Background
Instance segmentation refers to the pixel-by-pixel segmentation and classification of individual instances in an image, and is widely used in fields such as autonomous driving, medical image segmentation, and remote sensing image analysis.
Many instance segmentation methods exist, such as the HTC and DSC cascade instance segmentation methods. These typically rely on a single fully convolutional network, such as an FCN, to extract semantic segmentation features. However, this approach lacks global information, so existing methods suffer from incomplete instance segmentation, such as discontinuous segmentation inside an instance or missing segmentation at instance edges.
Therefore, it is necessary to develop an instance segmentation method that solves the above problems.
Disclosure of Invention
In order to overcome these problems, the inventors have conducted intensive research and designed a cascade instance segmentation method based on an enhanced semantic segmentation head, which is characterized by comprising the following steps:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
In a preferred embodiment, in S1, multi-scale features in the image are extracted by a feature extractor overlaying a feature pyramid.
In a preferred embodiment, in S1, the fusion of the multi-scale features is achieved as follows:
inputting the multi-scale features extracted by the feature extractor into a feature pyramid, setting 1x1 convolution after each scale feature of the feature pyramid, performing up-sampling operation on the high-level features, performing down-sampling operation on the low-level features, enabling all the features output by the feature pyramid to be fixed into a uniform scale, and then fusing the features of the uniform scale to obtain single-scale features.
In a preferred embodiment, in S2, the single-scale features are input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs semantic segmentation features.
The enhanced semantic segmentation head comprises a segmentation model, a Transformer model, a convolutional network model and a convolutional layer,
wherein the segmentation model is used for segmenting the input single-scale feature into a plurality of blocks and inputting each segmentation block into the Transformer model,
the Transformer model generates a global context feature x_g from the input single-scale feature segmentation blocks,
the convolutional network model generates a spatial context feature x_s from the input single-scale feature, and
the global context feature x_g and the spatial context feature x_s are fused and then passed through a convolutional layer to generate the semantic segmentation features.
In a preferred embodiment, in S2, the convolutional network model is an FCN.
In a preferred embodiment, in S3, a single instance in the image is represented by a bounding box and an instance mask. The instance segmentation is implemented by a cascade predictor, which is a multi-stage paradigm structure that uses the output of the previous stage to train the bounding box b_t and the instance mask m_t of the current stage. This can be expressed as:

x_t^b = P(F, b_{t-1}) + P(x_en, b_{t-1})
b_t = B_t(x_t^b)
x_t^m = P(F, b_t) + P(x_en, b_t)
m_t = M_t(x_t^m)

where F represents the multi-scale features, x_en represents the semantic segmentation features, t denotes the stage, x_t^b denotes the bounding-box feature of stage t, x_t^m denotes the instance-mask feature of stage t, P(·) denotes the pooling function, B_t denotes the bounding-box predictor of stage t, M_t denotes the instance-mask predictor of stage t, b_t denotes the bounding box of stage t, and m_t denotes the instance mask of stage t.
In a preferred embodiment, a classification-supervised training process is added when training the enhanced semantic segmentation head and the cascade predictor, where classification-supervised training refers to multi-label training that takes the categories of all instances in the image as the supervision target.
In a preferred embodiment, in the classification-supervised training, the loss function is set to:

L = Σ_{t=1}^{T} λ_t (L_bbox^t + L_mask^t) + α L_seg + β L_cls

where L_seg is the semantic segmentation loss and L_cls is the multi-label classification loss; t denotes a stage of the cascade predictor and T is the total number of stages; L_bbox^t is the cross-entropy loss of the bounding box at stage t of the cascade predictor; L_mask^t is the cross-entropy loss of the instance mask at stage t of the cascade predictor; α and β are weight coefficients; and λ_t is the training weight of stage t.
In addition, the invention also provides electronic equipment, which comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
Furthermore, the invention provides a computer readable storage medium storing computer instructions for causing the computer to execute the method.
The invention has the beneficial effects that:
(1) The cascade instance segmentation method based on the enhanced semantic segmentation head provided by the invention exploits the global modeling capability of the Transformer to improve the discriminability of the semantic segmentation features; it can be integrated with most existing cascade instance segmentation methods and improves their performance;
(2) The cascade instance segmentation method based on the enhanced semantic segmentation head is easy to configure in other cascade instance networks, adding only a small number of parameters and a small amount of computation.
Drawings
FIG. 1 is a flow chart of a cascading example segmentation method based on an enhanced semantic segmentation header according to one preferred embodiment of the present invention;
FIG. 2 is a flow diagram of merging multi-scale features in a cascade example segmentation method based on enhanced semantic segmentation heads according to a preferred embodiment of the present invention.
Detailed Description
The invention is further described in detail below by means of the figures and examples. The features and advantages of the present invention will become more apparent from the description.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The cascade instance segmentation method based on the enhanced semantic segmentation head provided by the invention is characterized by comprising the following steps as shown in fig. 1:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
In S1, the multi-scale features in the image are extracted by a feature extractor with a superimposed feature pyramid.
In the present invention, the specific structure of the feature extractor is not limited, and any feature extractor for instance segmentation may be used; for example, the common ResNet-50 or ResNet-101, or the more advanced Swin Transformer, may be used.
The feature pyramid (Feature Pyramid Network) is a basic component commonly used for detecting objects at different scales; its structure is not described in detail in the invention.
Preferably, based on experience, the output of the feature extractor with the superimposed feature pyramid is set to 5 layers of multi-scale features, whose strides relative to the original image are 2, 4, 8, 16 and 32, respectively.
Further, the fusion of the multi-scale features is achieved as follows, as shown in FIG. 2:
the multi-scale features are input into the feature pyramid, a 1x1 convolution is set after each scale feature of the feature pyramid, an up-sampling operation is performed on high-level features and a down-sampling operation on low-level features so that all features output by the feature pyramid are fixed to a uniform scale, and the features of the uniform scale are then fused to obtain the single-scale feature, which can be expressed as:

F = Σ_i Emb_i(Samp_i(P_i))

where F represents the single-scale feature obtained after fusion, P_i represents the multi-scale feature input to the feature pyramid, the subscript i denotes a layer of the feature pyramid, Emb_i(·) represents the embedding function of the multi-scale feature P_i, and Samp_i(·) represents the sampling function of the multi-scale feature P_i.
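For illustration only, the fusion step above can be sketched in NumPy as follows. All function and variable names are our own illustrative choices; nearest-neighbour resampling stands in for the unspecified sampling function Samp_i(·), and a per-pixel channel mapping plays the role of the 1x1-convolution embedding Emb_i(·):

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in); a 1x1 convolution is a
    # per-pixel linear map over channels
    return np.einsum('oc,chw->ohw', w, x)

def resample(x, target_hw):
    # nearest-neighbour up/down-sampling to the unified scale
    C, H, W = x.shape
    th, tw = target_hw
    rows = np.arange(th) * H // th
    cols = np.arange(tw) * W // tw
    return x[:, rows][:, :, cols]

def fuse_pyramid(levels, weights, target_hw):
    # F = sum_i Emb_i(Samp_i(P_i)): resample each pyramid level to the
    # unified scale, embed it with its own 1x1 convolution, then sum
    out = np.zeros((weights[0].shape[0], *target_hw))
    for P_i, w_i in zip(levels, weights):
        out += conv1x1(resample(P_i, target_hw), w_i)
    return out
```

In practice the embedding weights would be learned; here they are random placeholders used only to check shapes.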
According to the present invention, in S2, the single-scale feature is input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs semantic segmentation features.
The enhanced semantic segmentation head comprises a segmentation model, a Transformer model, a convolutional network model and a convolutional layer,
wherein the segmentation model is used for segmenting the input single-scale feature into a plurality of blocks to obtain single-scale feature segmentation blocks, and each single-scale feature segmentation block is input into the Transformer model,
the Transformer model generates a global context feature x_g from the input single-scale feature segmentation blocks,
the convolutional network model generates a spatial context feature x_s from the input single-scale feature, and
the global context feature x_g and the spatial context feature x_s are fused and then passed through a convolutional layer to generate the semantic segmentation features.
Preferably, the segmentation model segments the single-scale feature F into S blocks, and the resulting feature blocks are fed into the Transformer model.
The Transformer model is a model that uses the attention mechanism to increase model training speed; for its specific structure, see the paper: Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need [J]. arXiv, 2017. The self-attention mechanism in the Transformer is calculated as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K and V are the query, key and value of the input feature, respectively, and d_k represents the dimension of the query and key. The output of the Transformer model is a global context feature x_g ∈ R^{S×S×C}. The invention generates the global context feature through the Transformer model, so that the neural network pays more attention to the internal regions and main parts of each instance, which helps to recognize the instance as a whole.
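As an illustrative sketch of the scaled dot-product self-attention computation above (single-head, no learned biases, NumPy only; this is not the patented implementation, and the projection matrices are placeholders):

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (S, C) -- S feature blocks, each treated as a C-dimensional token
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (S, S) attention weights
    return A @ V                         # aggregated global context
```

Each output token is a weighted sum of all value vectors, which is what gives the head its global receptive field over the S blocks.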
The convolutional network model can be any known convolutional model, such as R-CNN or ResNet. Preferably, the convolutional network model is a fully convolutional network (FCN), more preferably a fully convolutional network formed by four consecutive convolutions. Through the fully convolutional network, the semantic gaps between features of different scales in the feature pyramid can be better eliminated and the spatial context information encoded, so that the network focuses more on the detailed parts of the image, thereby generating the spatial context feature.
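As a purely illustrative sketch of a spatial-context branch of four consecutive convolutions, one might write the following in NumPy; the 3x3 kernel size, zero padding, and ReLU activation are assumptions not fixed by the text:

```python
import numpy as np

def conv3x3(x, w):
    # x: (C_in, H, W), w: (C_out, C_in, 3, 3); zero padding keeps H, W.
    # Implemented as a sum of 9 shifted 1x1 maps (cross-correlation).
    C_in, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i:i + H, j:j + W])
    return out

def fcn_branch(x, weights):
    # four consecutive 3x3 convolutions with ReLU, producing a
    # spatial-context feature map the same size as the input
    for w in weights:
        x = np.maximum(conv3x3(x, w), 0.0)
    return x
```

The branch preserves the spatial resolution of its input, which is why it can encode fine detail complementary to the Transformer's global tokens.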
The global context feature x_g and the spatial context feature x_s are fused and then passed through a 1x1 convolutional layer to generate the semantic segmentation features, which can be expressed as:

x_en = Emb(Up(x_g) + x_s)

where Up(·) and Emb(·) represent the up-sampling function and the embedding function, respectively.
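The fusion x_en = Emb(Up(x_g) + x_s) can be sketched as follows. Nearest-neighbour up-sampling, a 1x1 convolution written as a channel-wise linear map, and the assumption that the Transformer tokens have already been reshaped into a coarse spatial grid are all illustrative choices:

```python
import numpy as np

def enhance(x_g, x_s, w_emb):
    # x_g: (C, h, w) coarse global-context map (reshaped Transformer output)
    # x_s: (C, H, W) spatial-context map from the FCN branch; H, W are
    #      integer multiples of h, w
    # w_emb: (C_out, C) weights of the 1x1 embedding convolution
    C, H, W = x_s.shape
    up = np.repeat(np.repeat(x_g, H // x_g.shape[1], axis=1),
                   W // x_g.shape[2], axis=2)   # Up(x_g)
    fused = up + x_s                            # element-wise fusion
    return np.einsum('oc,chw->ohw', w_emb, fused)  # Emb(...)
```

The additive fusion lets every spatial position see both its local detail and the instance-level global context.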
In the invention, the semantic segmentation features contain not only spatial context but also global context, so segmentation at the current position is guided by contextual information. This makes full use of the Transformer's global modeling capability and improves the discriminability of the semantic segmentation features: the enhanced semantic segmentation features, which fuse both kinds of features, attend both to the details of the image and to the instance as a whole.
In S3, the individual instances in the image are represented by bounding boxes and instance masks, which may be obtained using any known instance segmentation predictor, e.g., the predictors designed in HTC or DSC.
In a preferred embodiment, the instance segmentation is implemented by a cascade predictor, which is a multi-stage paradigm structure that uses the output of the previous stage to train the bounding box b_t and the instance mask m_t of the current stage. This can be expressed as:

x_t^b = P(F, b_{t-1}) + P(x_en, b_{t-1})
b_t = B_t(x_t^b)
x_t^m = P(F, b_t) + P(x_en, b_t)
m_t = M_t(x_t^m)

where F represents the multi-scale features, x_en represents the semantic segmentation features, t denotes the stage, x_t^b denotes the bounding-box feature of stage t, x_t^m denotes the instance-mask feature of stage t, P(·) denotes the pooling function, B_t denotes the bounding-box predictor of stage t, M_t denotes the instance-mask predictor of stage t, b_t denotes the bounding box of stage t, and m_t denotes the instance mask of stage t.
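The stage-by-stage refinement described above can be sketched as a plain loop. The pooling function and the per-stage predictors are passed in as placeholders, since their internal structure is not fixed here; in the sketch below the mask branch is assumed to pool on the current stage's refined box:

```python
def cascade(F, x_en, b0, box_stages, mask_stages, pool):
    # F: multi-scale features; x_en: semantic segmentation features
    # b0: initial box proposal; pool: placeholder for P(feature, box)
    # box_stages / mask_stages: per-stage predictors B_t and M_t
    b, outputs = b0, []
    for B_t, M_t in zip(box_stages, mask_stages):
        x_b = pool(F, b) + pool(x_en, b)   # x_t^b from previous box
        b = B_t(x_b)                       # b_t = B_t(x_t^b)
        x_m = pool(F, b) + pool(x_en, b)   # x_t^m from refined box
        m = M_t(x_m)                       # m_t = M_t(x_t^m)
        outputs.append((b, m))
    return outputs
```

With toy scalar stand-ins (e.g. pool = lambda feat, box: feat * box), each stage visibly consumes the previous stage's box, which is the essence of the cascade.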
In a preferred embodiment, when training the enhanced semantic segmentation head and the cascade predictor, a classification-supervised training process is added, where classification-supervised training refers to multi-label training that takes the categories of all instances in the image as the supervision target, so as to better train the Transformer model and enable it to learn more semantic information.
More preferably, in the classification-supervised training, the loss function is set to:

L = Σ_{t=1}^{T} λ_t (L_bbox^t + L_mask^t) + α L_seg + β L_cls

where L_seg is the semantic segmentation loss and L_cls is the multi-label classification loss; t denotes a stage of the cascade predictor and T is the total number of stages; L_bbox^t is the cross-entropy loss of the bounding box at stage t of the cascade predictor; L_mask^t is the cross-entropy loss of the instance mask at stage t of the cascade predictor; α and β are weight coefficients; and λ_t is the training weight of stage t.
Further, α characterizes the semantic segmentation loss weight, and β characterizes the multi-label classification loss weight. In a preferred embodiment, α = 0.2 and β = 3.
In a preferred embodiment, 3 stages are set in total in the classification-supervised training, and the training weights of the stages are set to λ = [1, 0.5, 0.25].
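Assuming the weighted losses combine additively as described (per-stage box and mask losses weighted by λ_t, plus α times the semantic segmentation loss and β times the multi-label classification loss), the total loss can be sketched as:

```python
def total_loss(stage_box_losses, stage_mask_losses, seg_loss, cls_loss,
               lambdas=(1.0, 0.5, 0.25), alpha=0.2, beta=3.0):
    # L = sum_t lambda_t * (L_bbox^t + L_mask^t) + alpha*L_seg + beta*L_cls
    cascade_term = sum(l * (lb + lm) for l, lb, lm in
                       zip(lambdas, stage_box_losses, stage_mask_losses))
    return cascade_term + alpha * seg_loss + beta * cls_loss
```

The default weights are the preferred values stated above (α = 0.2, β = 3, λ = [1, 0.5, 0.25]); the function signature itself is an illustrative choice.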
The various embodiments of the methods described above in this invention may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the methods and apparatus described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The methods and apparatus described herein may be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired result of the technical solution of the present disclosure is achieved, and the present disclosure is not limited herein.
Examples
Example 1
An instance segmentation experiment is carried out on the MS COCO 2017 and MS COCO-Stuff datasets, which are general datasets for the instance segmentation task. MS COCO refers to the dataset of "Microsoft COCO: Common Objects in Context" (European Conference on Computer Vision, ECCV). The dataset comprises 118K training images and 5K validation images, with 80 instance categories in total, and COCO-Stuff provides the semantic segmentation labels corresponding to MS COCO.
Example segmentation is performed by:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
In S1, the feature extractors are ResNet-50 and ResNet-101, respectively; the output of the feature extractor with the superimposed feature pyramid is set to 5 layers of multi-scale features, and the strides of these 5 layers of multi-scale features relative to the original image are 2, 4, 8, 16 and 32, respectively.
The fused multi-scale feature may be represented as:

F = Σ_i Emb_i(Samp_i(P_i))
In S2, the single-scale feature is input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs semantic segmentation features.
The enhanced semantic segmentation head comprises a segmentation model, a Transformer model, a convolutional network model and a convolutional layer, wherein the convolutional network model is a fully convolutional network formed by four consecutive convolutions.
In S3, the cascade predictor is a multi-stage paradigm structure that uses the output of the previous stage to train the bounding box b_t and the instance mask m_t of the current stage, expressed as:

x_t^b = P(F, b_{t-1}) + P(x_en, b_{t-1})
b_t = B_t(x_t^b)
x_t^m = P(F, b_t) + P(x_en, b_t)
m_t = M_t(x_t^m)
when training the enhanced semantic segmentation head and the cascade predictor, adding a classification supervision training process, wherein a loss function is set as follows:
Figure BDA0003622228490000125
where α=0.2, β=3, λ= [1,0.5,0.25].
Comparative example
Comparative example 1
Instance segmentation was performed using the same dataset as Example 1, except that the HTC method was used, where HTC is described in the literature: Chen, Kai, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy and Dahua Lin. "Hybrid Task Cascade for Instance Segmentation." 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019): 4969-4978; the feature extractor likewise employs ResNet-50 and ResNet-101.
Comparative example 2
Instance segmentation was performed using the same dataset as Example 1, except that the DSC method was used, which is described in the literature: Ding, Hao, Siyuan Qiao, Alan Loddon Yuille and Wei Shen. "Deeply Shape-guided Cascade for Instance Segmentation." 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 8274-8284; the feature extractor likewise employs ResNet-50 and ResNet-101.
Experimental example
The average precision (AP) is used as the performance evaluation index; it calculates the average precision over all categories and all IoU thresholds. For bounding boxes and instance masks, AP can be divided into the bounding-box AP (AP^b) and the instance-mask AP (AP^m); for the instance-mask AP, the APs at different IoU thresholds are AP_50 and AP_75, and the APs for instances of different sizes are AP_S, AP_M and AP_L.
The results of Example 1 and Comparative Examples 1 and 2 are shown in Table 1.
TABLE 1
(Table 1 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 1, Example 1 outperforms the other instance segmentation methods.
In the description of the present invention, it should be noted that the positional or positional relationship indicated by the terms such as "upper", "lower", "inner", "outer", "front", "rear", etc. are based on the positional or positional relationship in the operation state of the present invention, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," "third," "fourth," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected in common; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The invention has been described above in connection with preferred embodiments, which are, however, exemplary only and for illustrative purposes. On this basis, the invention can be subjected to various substitutions and improvements, and all fall within the protection scope of the invention.

Claims (10)

1. The cascade instance segmentation method based on the enhanced semantic segmentation head is characterized by comprising the following steps of:
s1, extracting multi-scale features of an image, and fusing the multi-scale features to obtain single-scale features;
s2, obtaining semantic segmentation features according to the single-scale features;
s3, performing instance segmentation according to the semantic segmentation features and the multi-scale features to obtain a single instance in the image.
2. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S1, the multi-scale features of the image are extracted by a feature extractor stacked with a feature pyramid.
3. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S1, the fusing of the multi-scale features is achieved by:
inputting the multi-scale features extracted by the feature extractor into a feature pyramid, providing a 1×1 convolution after each scale of the feature pyramid, performing an up-sampling operation on the high-level features and a down-sampling operation on the low-level features so that all features output by the feature pyramid are fixed to a uniform scale, and then fusing the features of the uniform scale to obtain the single-scale feature.
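The fusion described in claim 3 can be illustrated with a minimal NumPy sketch. The function names, the nearest-neighbour resize, and the matrices standing in for the per-level 1×1 convolutions are illustrative assumptions, not the patent's actual implementation; a 1×1 convolution is modelled as a per-pixel linear map over channels, and up-/down-sampling to the uniform scale is modelled as nearest-neighbour resizing:

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map; stands in for
    the up-sampling (high-level) and down-sampling (low-level) operations."""
    c, h, w = feat.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return feat[:, ys][:, :, xs]

def fuse_pyramid(features, conv_weights, out_hw):
    """Fuse multi-scale pyramid features into one single-scale feature.

    features:     list of (C_i, H_i, W_i) arrays from the feature pyramid
    conv_weights: list of (C_out, C_i) matrices playing the role of the
                  per-level 1x1 convolutions (hypothetical placeholders)
    out_hw:       (H, W) of the uniform scale
    """
    out_h, out_w = out_hw
    fused = None
    for feat, w in zip(features, conv_weights):
        # a 1x1 convolution is a per-pixel linear map over the channel axis
        projected = np.einsum('oc,chw->ohw', w, feat)
        # every level is resized to the same uniform scale
        resized = resize_nearest(projected, out_h, out_w)
        # element-wise summation as a simple fusion rule
        fused = resized if fused is None else fused + resized
    return fused
```

Summation is used here as the simplest fusion rule; the patent leaves the exact fusion operator to the implementation.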
4. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S2, the single-scale feature is input into an enhanced semantic segmentation head, and the enhanced semantic segmentation head outputs the semantic segmentation features;
the enhanced semantic segmentation head comprises a splitting model, a Transformer model, a convolutional network model, and a convolution layer,
wherein the splitting model is used for splitting the input single-scale feature into a plurality of blocks and inputting each block into the Transformer model,
the Transformer model generates global context features x_g from the input single-scale feature blocks,
the convolutional network model generates spatial context features x_s from the input single-scale feature,
and after the global context features x_g and the spatial context features x_s are fused, the semantic segmentation features are generated through the convolution layer.
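The two-branch head of claim 4 can be sketched in NumPy under stated assumptions: a toy single-head self-attention over flattened patch tokens stands in for the Transformer branch (global context x_g), a 3×3 mean filter stands in for the FCN branch (spatial context x_s), fusion is element-wise addition, and all weights are random placeholders rather than learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def enhanced_seg_head(x, patch=4, seed=0):
    """Sketch of the enhanced semantic segmentation head (claim 4).

    x: (C, H, W) single-scale feature, with H and W divisible by `patch`.
    All weight matrices are random placeholders, not learned parameters.
    """
    rng = np.random.default_rng(seed)
    c, h, w = x.shape
    ph, pw = h // patch, w // patch
    # --- split into patch x patch blocks, flatten each into a token ---
    tokens = (x.reshape(c, ph, patch, pw, patch)
                .transpose(1, 3, 0, 2, 4)
                .reshape(ph * pw, c * patch * patch))
    d = tokens.shape[1]
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # --- toy single-head self-attention: the Transformer branch -------
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    out = attn @ v                          # global context per token
    x_g = (out.reshape(ph, pw, c, patch, patch)
              .transpose(2, 0, 3, 1, 4)
              .reshape(c, h, w))
    # --- 3x3 mean filter: a stand-in for the FCN spatial branch -------
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode='edge')
    x_s = sum(pad[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    # --- fuse and apply the final 1x1 convolution ---------------------
    w_out = rng.standard_normal((c, c)) / np.sqrt(c)
    return np.einsum('oc,chw->ohw', w_out, x_g + x_s)
```

The block only demonstrates the data flow (split → attention → fuse → 1×1 conv); positional encodings, multiple heads, and the multi-layer FCN of a real implementation are omitted.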
5. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 4, characterized in that,
in S2, the convolutional network model is an FCN.
6. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 1, characterized in that,
in S3, the single instance in the image is represented by a bounding box and an instance mask, and the instance segmentation is implemented by a cascade predictor; the cascade predictor has a multi-stage structure in which the output of the previous stage is used to train the bounding box b_t and the instance mask m_t of the current stage, which can be expressed as:

x_t^box = P(F, x_en, b_{t-1})
b_t = B_t(x_t^box)
x_t^mask = P(F, x_en, b_t)
m_t = M_t(x_t^mask)

wherein F represents the multi-scale features, x_en represents the semantic segmentation features, t represents the different stages, x_t^box represents the bounding-box feature of stage t, x_t^mask represents the instance-mask feature of stage t, P(·) represents the pooling function, B_t represents the bounding-box predictor of stage t, M_t represents the instance-mask predictor of stage t, b_t represents the bounding box of stage t, and m_t represents the instance mask of stage t.
7. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 6, characterized in that,
a classification supervision training process is added when training the enhanced semantic segmentation head and the cascade predictor, wherein classification supervision training refers to multi-label training that takes the categories of all the instances in the image as the supervision objects.
8. The cascade instance segmentation method based on the enhanced semantic segmentation head according to claim 7, characterized in that,
in the classification supervision training, the loss function is set as:

L = α·L_seg + β·L_cls + Σ_{t=1}^{T} λ_t·(L_box^t + L_mask^t)

wherein L_seg is the semantic segmentation loss, L_cls is the multi-label classification loss, t represents the different stages of the cascade predictor, T is the total number of stages, L_box^t is the cross-entropy loss of the bounding box of stage t of the cascade predictor, L_mask^t is the cross-entropy loss of the instance mask of stage t of the cascade predictor, α and β are weight coefficients, and λ_t are the training weights of the different stages.
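The claim-8 objective is a weighted sum that can be computed directly once the per-term losses are available. The sketch below assumes binary cross-entropy for the segmentation and multi-label classification terms and takes the per-stage box and mask losses as precomputed scalars; the function and argument names are illustrative:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy averaged over all entries."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def total_loss(seg_pred, seg_gt, cls_pred, cls_gt,
               box_losses, mask_losses, alpha, beta, lambdas):
    """Claim-8 objective:
        L = alpha*L_seg + beta*L_cls + sum_t lambda_t*(L_box^t + L_mask^t)

    seg_pred/seg_gt: semantic segmentation probabilities and labels
    cls_pred/cls_gt: multi-label image-level class probabilities and labels
    box_losses, mask_losses: per-stage losses from the cascade predictor
    """
    l_seg = bce(seg_pred, seg_gt)
    l_cls = bce(cls_pred, cls_gt)
    l_stages = sum(lam * (lb + lm)
                   for lam, lb, lm in zip(lambdas, box_losses, mask_losses))
    return alpha * l_seg + beta * l_cls + l_stages
```

The patent does not fix the values of α, β, or λ_t; they are free weight hyperparameters balancing the auxiliary supervision against the per-stage cascade losses.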
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202210461048.7A 2022-04-28 2022-04-28 Cascade instance segmentation method based on enhanced semantic segmentation head Pending CN116109819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210461048.7A CN116109819A (en) 2022-04-28 2022-04-28 Cascade instance segmentation method based on enhanced semantic segmentation head

Publications (1)

Publication Number Publication Date
CN116109819A true CN116109819A (en) 2023-05-12

Family

ID=86256743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210461048.7A Pending CN116109819A (en) 2022-04-28 2022-04-28 Cascade instance segmentation method based on enhanced semantic segmentation head

Country Status (1)

Country Link
CN (1) CN116109819A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593716A (en) * 2023-12-07 2024-02-23 山东大学 Lane line identification method and system based on unmanned aerial vehicle inspection image

Similar Documents

Publication Publication Date Title
CN109117876B (en) Dense small target detection model construction method, dense small target detection model and dense small target detection method
US20190156144A1 (en) Method and apparatus for detecting object, method and apparatus for training neural network, and electronic device
CN112633276B (en) Training method, recognition method, device, equipment and medium
WO2022227770A1 (en) Method for training target object detection model, target object detection method, and device
EP3204871A1 (en) Generic object detection in images
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
WO2019167784A1 (en) Position specifying device, position specifying method, and computer program
CN112488999A (en) Method, system, storage medium and terminal for detecting small target in image
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
CN113724286A (en) Method and device for detecting saliency target and computer-readable storage medium
CN114898111B (en) Pre-training model generation method and device, and target detection method and device
Dong et al. Learning regional purity for instance segmentation on 3d point clouds
CN114267375B (en) Phoneme detection method and device, training method and device, equipment and medium
CN115861400A (en) Target object detection method, training method and device and electronic equipment
CN116109819A (en) Cascade instance segmentation method based on enhanced semantic segmentation head
Jayanthiladevi et al. Text, images, and video analytics for fog computing
CN116246287B (en) Target object recognition method, training device and storage medium
JPH11250106A (en) Method for automatically retrieving registered trademark through the use of video information of content substrate
CN104504162A (en) Video retrieval method based on robot vision platform
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
CN114863450B (en) Image processing method, device, electronic equipment and storage medium
CN116109874A (en) Detection method, detection device, electronic equipment and storage medium
CN114282583A (en) Image classification model training and classification method and device, road side equipment and cloud control platform
CN113344121A (en) Method for training signboard classification model and signboard classification
Kong et al. A doubt–confirmation-based visual detection method for foreign object debris aided by assembly models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination